We showed that the shuffling method that preserves the minimum cost at each step is a good compromise: it does not introduce too much variation but still manages to decrease the correlation significantly. It does, however, change the distribution. Let's see if it is also possible to preserve the maximum cost.
The structure of uniform matrices actually makes it impossible to preserve both the minimum and the maximum costs.
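Fixing the noise level is not trivial for the first heterogeneity measure. To guide this choice, let's analyse the correlation between columns as a function of the noise.
The values in a column are the product of y (a speed), X (the weights) and Z (the noise). Since y is constant within a column and correlation is invariant under positive scaling, y can be ignored. Let's determine the correlation of XZ and XZ', where Z and Z' are independent noise terms with the same distribution. This is the ratio of covar(XZ,XZ') to the variance of XZ (which is var(X)var(Z)+var(X)+var(Z) because the means of X and Z are both 1).
covar(XZ, XZ') = E[XZXZ'] - E[XZ] E[XZ']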
covar(XZ, XZ') = E[X^2] E[Z]^2 - E[X]^2 E[Z]^2
covar(XZ, XZ') = E[X^2] - E[X]^2 = var(X)
Thus the correlation between any pair of columns is var(X)/(var(X)var(Z)+var(X)+var(Z)) = 1/(var(Z)+var(Z)/var(X)+1).
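The derivation above can be checked numerically with base R only. The sketch below assumes that X, Z and Z' follow gamma distributions with mean 1 and the given coefficients of variation; `rgamma_mean_cv` is a hypothetical helper introduced for this illustration, not one of the functions used in the rest of this document.

```r
# Hypothetical helper (an assumption, not part of the original code):
# gamma samples with a given mean and coefficient of variation (cv).
# For a gamma distribution, cv = 1 / sqrt(shape).
rgamma_mean_cv <- function(n, mean, cv) {
  shape <- 1 / cv^2
  rgamma(n, shape = shape, rate = shape / mean)
}

set.seed(1)
n <- 200000
CVX <- 0.1
CVZ <- 0.3
X  <- rgamma_mean_cv(n, 1, CVX)  # weights shared by both columns
Z  <- rgamma_mean_cv(n, 1, CVZ)  # noise of one column
Zp <- rgamma_mean_cv(n, 1, CVZ)  # independent noise of another column (Z')

# Theoretical values; var(X) = CVX^2 and var(Z) = CVZ^2 because both means are 1
var_theory <- CVX^2 * CVZ^2 + CVX^2 + CVZ^2  # var(XZ)
cov_theory <- CVX^2                          # covar(XZ, XZ') = var(X)
cor_theory <- 1 / (CVZ^2 + CVZ^2 / CVX^2 + 1)

print(c(var_empirical = var(X * Z), var_theory = var_theory))
print(c(cov_empirical = cov(X * Z, X * Zp), cov_theory = cov_theory))
print(c(cor_empirical = cor(X * Z, X * Zp), cor_theory = cor_theory))
```

With this many samples, the empirical and theoretical values should agree up to sampling error.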
CVX <- 0.1
CVY <- 0.2
CVZ <- 0.3
task <- rgamma_cost(1000, 1, CVX)
proc <- rgamma_cost(1000, 1, CVY)
mat <- generate_heterogeneous_matrix_noise(task, proc, CVZ)
mean(sapply(2:ncol(mat), function(j) {
    cor.test(mat[, 1], mat[, j])$estimate
}))
## [1] 0.1199
1/(CVZ^2 + CVZ^2/CVX^2 + 1)
## [1] 0.09911
This is close. Let's test with other settings.
CVX <- 0.3
CVY <- 0.2
CVZ <- 0.1
task <- rgamma_cost(1000, 1, CVX)
proc <- rgamma_cost(1000, 1, CVY)
mat <- generate_heterogeneous_matrix_noise(task, proc, CVZ)
mean(sapply(2:ncol(mat), function(j) {
    cor.test(mat[, 1], mat[, j])$estimate
}))
## [1] 0.8895
1/(CVZ^2 + CVZ^2/CVX^2 + 1)
## [1] 0.892
This is also quite close. As a final rule for fixing the noise, we may require that CVZ > CVX. In this case, CVZ^2/CVX^2 > 1, so the correlation is lower than 1/(CVZ^2+2), which is itself lower than 0.5.
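The rule can be illustrated with a small sketch of the closed-form expression. Both function names below are hypothetical. The second one is only an algebraic rearrangement of the correlation formula, giving the noise level that corresponds to a target correlation.

```r
# Theoretical correlation between two columns (formula derived above).
# Hypothetical helper name, introduced for illustration only.
column_correlation <- function(cvx, cvz) {
  1 / (cvz^2 + cvz^2 / cvx^2 + 1)
}

# Rearrangement of the same formula: noise CV that yields correlation rho.
# From 1/rho - 1 = cvz^2 * (1 + 1/cvx^2).
noise_cv_for_correlation <- function(cvx, rho) {
  sqrt((1 / rho - 1) / (1 + 1 / cvx^2))
}

cvx <- 0.3
cvz <- c(0.31, 0.5, 1, 2)         # all larger than cvx
rho <- column_correlation(cvx, cvz)
bound <- 1 / (cvz^2 + 2)          # upper bound valid when cvz > cvx

print(cbind(cvz, rho, bound))

# The correlation stays below the bound, and the bound stays below 0.5
stopifnot(all(rho < bound), all(bound < 0.5))

# Round trip: recover the noise CV from the correlation
cvz_back <- noise_cv_for_correlation(cvx, rho)
print(cvz_back)
```

The inverse function may help when a specific correlation is wanted rather than just an upper bound, for instance `noise_cv_for_correlation(0.3, 0.25)`.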