We are interested in summarizing and checking the correlation properties of the following methods: range-based, CVB, noise-based and combination-based.
Let's start with the range-based method. When a=b=0, the correlation is 0 between each row.
Rtask <- 10000
Rmach <- 100
mat <- generate_matrix_siegel_range(200, 200, Rtask, Rmach, a = 0, b = 0)
mean_cor_row(mat)
## [1] 0.009438
Let X be the first generated vector that follows U[1,Rtask] and Xi and Xj be two vectors that are multiplied by X and that follow U[1,Rmach]. The correlation between X * Xi and X * Xj is:
rho(X Xi, X Xj) = cov(X Xi, X Xj) / (sqrt(var(X Xi)) * sqrt(var(X Xj)))
rho(X Xi, X Xj) = (E[X Xi X Xj] - E[X Xi] E[X Xj]) /
(var(X) * var(Xi) + mean(X)^2 * var(Xi) + var(X) * mean(Xi)^2)
covar(XZ, XZ') = (E[X^2] E[Xi]^2 - E[X]^2 E[Xi]^2)
covar(XZ, XZ') = var(X) * E[Xi]^2
The corresponding means are : ½ * (1+Rtask) and ½ * (1+Rmach). The corresponding variances are : 1/12 * (Rtask-1)2 and 1/12 * (Rmach-1)2. Therefore, the correlation coefficient is:
1/12 * (Rtask-1)^2 * 1/4 * (1+Rmach)^2 /
(1/12 * (Rtask-1)^2 * 1/12 * (Rmach-1)^2 +
1/12 * (Rtask-1)^2 * 1/4 * (1+Rmach)^2 +
1/12 * (Rmach-1)^2 * 1/4 * (1+Rtask)^2)
(1+Rmach)^2 / (1/3 * (Rmach-1)^2 + (1+Rmach)^2 + (Rmach-1)^2 * (1+Rtask)^2 / (Rtask-1)^2)
Rtask <- 10000
Rmach <- 100
mat <- generate_matrix_siegel_range(200, 200, Rtask, Rmach, a = 0, b = 0)
mean_cor_col(mat)
## [1] 0.4349
(1 + Rmach)^2/(1/3 * (Rmach - 1)^2 + (1 + Rmach)^2 + (Rmach - 1)^2 * (1 + Rtask)^2/(Rtask -
1)^2)
## [1] 0.4383
When Rtask and Rmach are large, this tends to 3/7, which is around 0.43.
Now when all lines are sorted, then each pair of rows is correlated because values on each row increases. The column correlations are also 1 because each column is the product of the same variable (one of the quantile of U[1,Rmach]) by the same vector that follows U[1,Rtask].
Rtask <- 10000
Rmach <- 100
mat <- generate_matrix_siegel_range(200, 200, Rtask, Rmach, a = 1, b = 1)
mean_cor_row(mat)
## [1] 0.9962
mean_cor_col(mat)
## [1] 0.9695
First, let's try with a=b=½.
Rtask <- 10000
Rmach <- 100
mat <- generate_matrix_siegel_range(200, 200, Rtask, Rmach, a = 1/2, b = 1/2)
mean_cor_row(mat)
## [1] 0.1369
mean_cor_col(mat)
## [1] 0.4613
We conjecture that the row correlation is a2 * b and that the column correlation will be difficult to determine.
Rtask <- 10000
Rmach <- 100
prec <- NULL
for (a in seq(0, 1, 0.1)) for (b in seq(0, 1, 0.1)) {
mat <- generate_matrix_siegel_range(50, 50, Rtask, Rmach, a, b)
prec <- c(prec, mean_cor_row(mat) - a^2 * b)
}
summary(prec)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.0212 0.0122 0.0261 0.0223 0.0354 0.0514
Let's see if we can work on the column correlation. Let's start with the case where a=1. It seems more easy because entire columns are affected. Then, there are bm columns that are correlated and (1-b)m with the previously computed correlation (around 0.43). The question is how are the sorted columns correlated with the others.
Let X be the first generated vector that follows U[1,Rtask] and Xi be a vector that is multiplied by X and that follows U[1,Rmach]. The correlation between X * Xi and X is:
rho(X Xi, X) = cov(X Xi, X) / (sqrt(var(X Xi)) * sqrt(var(X)))
rho(X Xi, X Xj) = (E[X Xi X] - E[X Xi] E[X]) /
(sqrt(var(X) * var(Xi) + mean(X)^2 * var(Xi)
+ var(X) * mean(Xi)^2) * sqrt(var(X)))
covar(XZ, XZ') = (E[X^2] E[Xi] - E[X]^2 E[Xi])
covar(XZ, XZ') = var(X) * E[Xi]
rho(X Xi, X Xj) = sqrt(var(X)) * E[Xi] /
sqrt(var(X) * var(Xi) + mean(X)^2 * var(Xi) + var(X) * mean(Xi)^2)
The corresponding means are : ½ * (1+Rtask) and ½ * (1+Rmach). The corresponding variances are : 1/12 * (Rtask-1)2 and 1/12 * (Rmach-1)2. Therefore, the correlation coefficient is:
1/sqrt(12) * (Rtask-1) * 1/2 * (1+Rmach) /
sqrt(1/12 * (Rtask-1)^2 * 1/12 * (Rmach-1)^2 +
1/12 * (Rtask-1)^2 * 1/4 * (1+Rmach)^2 +
1/12 * (Rmach-1)^2 * 1/4 * (1+Rtask)^2)
(1+Rmach) / sqrt(1/3 * (Rmach-1)^2 + (1+Rmach)^2 + (Rmach-1)^2 * (1+Rtask)^2 / (Rtask-1)^2)
When Rtask and Rmach are large, this tends to sqrt(3/7), which is around 0.65.
We conjecture that the correlation is thus:
b^2 + 2*b*(1-b) * (1+Rmach) / sqrt(1/3 * (Rmach-1)^2 + (1+Rmach)^2 +
(Rmach-1)^2 * (1+Rtask)^2 / (Rtask-1)^2) +
(1-b)^2 * (1+Rmach)^2 / (1/3 * (Rmach-1)^2 + (1+Rmach)^2 +
(Rmach-1)^2 * (1+Rtask)^2 / (Rtask-1)^2)
When Rtask and Rmach are large, this tends to \( b^2 * 3/7 + 2*b*(1-b) * sqrt(3/7) + (1-b)^2 \).
Rtask <- 10000
Rmach <- 100
prec <- NULL
for (b in seq(0, 1, 0.1)) {
mat <- generate_matrix_siegel_range(100, 100, Rtask, Rmach, a = 1, b)
cor_sorted <- (1 + Rmach)^2/(1/3 * (Rmach - 1)^2 + (1 + Rmach)^2 + (Rmach -
1)^2 * (1 + Rtask)^2/(Rtask - 1)^2)
cor_mix <- (1 + Rmach)/sqrt(1/3 * (Rmach - 1)^2 + (1 + Rmach)^2 + (Rmach -
1)^2 * (1 + Rtask)^2/(Rtask - 1)^2)
prec <- c(prec, mean_cor_col(mat) - (b^2 + 2 * b * (1 - b) * cor_mix + (1 -
b)^2 * cor_sorted))
}
summary(prec)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.08460 -0.06010 -0.03820 -0.03950 -0.01920 -0.00924
For arbitrary a, it smells like the failed analysis for the last heterogeneity measures. Let's skip this case.
It should be a simple adaptation. Let's directly consider the last case and let's see if the row correlation is again a2 * b.
Vtask <- 0.2
Vmach <- 0.3
prec <- NULL
for (a in seq(0, 1, 0.1)) for (b in seq(0, 1, 0.1)) {
mat <- generate_matrix_siegel_CV(50, 50, 10, 1, Vtask, Vmach, a, b)
prec <- c(prec, mean_cor_row(mat) - a^2 * b)
}
summary(prec)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.03930 0.00708 0.02520 0.01920 0.03560 0.04720
Now, let's derive the column correlation:
rho(X Xi, X Xj) = var(X) * E[Xi]^2 /
(var(X) * var(Xi) + mean(X)^2 * var(Xi) + var(X) * mean(Xi)^2)
The corresponding means are : mutask and 1. The corresponding variances are : mutask2 * Vtask2 and Vmach2.
Therefore, the correlation coefficient is:
mutask^2 * Vtask^2 /
(mutask^2 * Vtask^2 * Vmach^2 + mutask^2 * Vmach^2 + mutask^2 * Vtask^2)
1 / (Vmach^2 + Vmach^2 / Vtask^2 + 1)
And for the case when a=1:
rho(X Xi, X Xj) = sqrt(var(X)) * E[Xi] /
sqrt(var(X) * var(Xi) + mean(X)^2 * var(Xi) + var(X) * mean(Xi)^2)
And the correlation is:
sqrt(mutask^2 * Vtask^2) /
sqrt(mutask^2 * Vtask^2 * Vmach^2 + mutask^2 * Vmach^2 + mutask^2 * Vtask^2)
1 / sqrt(Vmach^2 + Vmach^2 / Vtask^2 + 1)
Let's test this:
Vtask <- 0.2
Vmach <- 0.3
prec <- NULL
for (b in seq(0, 1, 0.1)) {
mat <- generate_matrix_siegel_CV(100, 100, 10, 1, Vtask, Vmach, a = 1, b)
cor_sorted <- 1/(Vmach^2 + Vmach^2/Vtask^2 + 1)
cor_mix <- 1/sqrt(Vmach^2 + Vmach^2/Vtask^2 + 1)
prec <- c(prec, mean_cor_col(mat) - (b^2 + 2 * b * (1 - b) * cor_mix + (1 -
b)^2 * cor_sorted))
}
summary(prec)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.0727 -0.0318 -0.0161 -0.0178 0.0015 0.0300
We already analyzed that the row and column correlations are \( 1 / (Vnoise^2 + Vnoise^2 / Vmach^2 + 1) \) and \( 1 / (Vnoise^2 + Vnoise^2 / Vtask^2 + 1) \).
Vtask <- 0.3
Vmach <- 0.2
Vnoise <- 0.1
task <- rgamma_cost(300, 1, Vtask)
proc <- rgamma_cost(300, 1, Vmach)
mat <- generate_heterogeneous_matrix_noise(task, proc, Vnoise)
mean_cor_row(mat)
## [1] 0.7837
1/(Vnoise^2 + Vnoise^2/Vmach^2 + 1)
## [1] 0.7937
mean_cor_col(mat)
## [1] 0.8906
1/(Vnoise^2 + Vnoise^2/Vtask^2 + 1)
## [1] 0.892
What is missing is a way to select the parameters given a row correlation rhoR and a column correlation rhoC.
rhoR = 1 / (Vnoise^2 + Vnoise^2 / Vmach^2 + 1)
rhoC = 1 / (Vnoise^2 + Vnoise^2 / Vtask^2 + 1)
Thus:
Vtask = Vnoise / sqrt(1 / rhoC - Vnoise^2 - 1)
Vmach = Vnoise / sqrt(1 / rhoR - Vnoise^2 - 1)
Necessarily, \( 1 / rhoR - Vnoise^2 - 1 > 0 \) and \( Vnoise < sqrt(1 / rhoR - 1) \).
We have several choices:
Let's do some test with \( Vnoise = min(sqrt(1 / rhoR - 1), sqrt(1 / rhoC - 1)) * k \). In this case, the maximum heterogeneity (when rhoR is close to 1) is \( k / sqrt(1 - k^2) \). Let Vmax be the maximum value, then:
Vmax = k / sqrt(1 - k^2)
Vmax^2 * (1 - k^2) - k^2 = 0
Vmax^2 = (Vmax^2 + 1) * k^2
k = Vmax / sqrt(Vmax^2 + 1)
for (rhoR in c(0.01, 0.99)) for (rhoC in c(0.01, 0.99)) {
Vmax <- 3
k <- Vmax/sqrt(Vmax^2 + 1)
Vnoise <- min(sqrt(1/rhoR - 1), sqrt(1/rhoC - 1)) * k
print(c(Vnoise, Vnoise/sqrt(1/rhoR - Vnoise^2 - 1), Vnoise/sqrt(1/rhoC -
Vnoise^2 - 1)))
}
## [1] 9.439 3.000 3.000
## [1] 0.095346 0.009583 3.000000
## [1] 0.095346 3.000000 0.009583
## [1] 0.09535 3.00000 3.00000
The only issue is that Vnoise should also be limited. Let's add another limit:
rhoR <- 0.01
rhoC <- 0.01
Vnoise <- min(Vmax, min(sqrt(1/rhoR - 1), sqrt(1/rhoC - 1)) * k)
c(Vnoise, Vnoise/sqrt(1/rhoR - Vnoise^2 - 1), Vnoise/sqrt(1/rhoC - Vnoise^2 -
1))
## [1] 3.0000 0.3162 0.3162
This looks good. Most heterogeneity values are between 0.01 and Vmax which is the area where things happen (the place to be).
for (rhoR in c(0.01, 0.99)) for (rhoC in c(0.01, 0.99)) {
mat <- generate_heterogeneous_matrix_noise_corr(200, 200, rgamma_cost, rhoR,
rhoC, 3)
print(c(mean_cor_row(mat), mean_cor_col(mat)))
}
## [1] 0.01867 0.01862
## [1] 0.01876 0.99213
## [1] 0.99172 0.01796
## [1] 0.9913 0.9920
The correlations are obviously the ones given as parameters. However, it would be interesting to analyse the heterogeneity of the computed matrices. The CV of each column and row is initially \( sigma/mean \). We can find the following heterogeneity by testing different values.
sigma <- 3
mu <- 10
rhoR <- 0.2
rhoC <- 0.3
mat <- generate_matrix_corr(1000, 1000, rgamma_cost, mu, sigma, rhoR, rhoC)
c(mean_CV_row(mat), sigma/mu * sqrt(1 - rhoC)/sqrt(1 - rhoR * rhoC))
## [1] 0.2644 0.2589
c(mean_CV_col(mat), sigma/mu * sqrt(1 - rhoR)/sqrt(1 - rhoR * rhoC))
## [1] 0.2810 0.2768
c(CV_mean_row(mat), sigma/mu * sqrt(rhoC) * sqrt(1 - rhoR)/sqrt(1 - rhoR * rhoC))
## [1] 0.1529 0.1516
c(CV_mean_col(mat), sigma/mu * sqrt(1 - rhoC) * sqrt(rhoR)/sqrt(1 - rhoR * rhoC))
## [1] 0.1159 0.1158
We have completed the correlation and the heterogeneity properties of two methods from the literature and two new methods. The last thing to do before launching the XP is to implement the TMA measure.