Improving shuffling method and noise-based method analysis (January 10, 2015)

Improving shuffling method

We showed that the shuffling method that preserves the minimum cost at each step is a good compromise: it does not introduce too much variation but still manages to decrease the correlation significantly. The distribution is changed, however. Let's see if it is also possible to preserve the maximum cost.

The structure of uniform matrices actually makes it impossible to preserve both the minimum and the maximum costs.

Noise-based method analysis

Fixing the noise level is not trivial for the first heterogeneity measure. Let's analyse its correlation.

The values in a column are the product of y (a speed), X (the weights) and Z (the noise). Since y is constant within a column, it does not affect the correlation. Let's determine the correlation of XZ and XZ', where Z and Z' are the independent noises of two columns. This is the ratio of covar(XZ,XZ') to the variance of XZ (which is var(X)var(Z)+var(X)+var(Z) because the means of X and Z are both 1).
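The variance of XZ quoted here can be checked in two lines. The following is only a sketch of that step, assuming X and Z are independent with E[X] = E[Z] = 1, as stated in the text:

```latex
\begin{align*}
\operatorname{var}(XZ) &= E[X^2]\,E[Z^2] - E[X]^2\,E[Z]^2 \\
                       &= (\operatorname{var}(X) + 1)(\operatorname{var}(Z) + 1) - 1 \\
                       &= \operatorname{var}(X)\operatorname{var}(Z) + \operatorname{var}(X) + \operatorname{var}(Z)
\end{align*}
```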

covar(XZ, XZ') = E[XZXZ'] - E[XZ] E[XZ']
covar(XZ, XZ') = E[X^2] E[Z]^2 - E[X]^2 E[Z]^2
covar(XZ, XZ') = E[X^2] - E[X]^2 = var(X)

Thus the correlation between any pair of columns is 1/(var(Z)+var(Z)/var(X)+1). Since the means of X and Z are both 1, var(X) = CVX^2 and var(Z) = CVZ^2.
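The closed form can also be checked by simulation without the matrix generator. The sketch below is not the author's code: `rgamma_mean_cv` and `predicted_cor` are hypothetical helpers introduced for illustration, and the noise is assumed to be gamma-distributed with mean 1 (any distribution with the same mean and variance should give the same correlation).

```r
# Hypothetical helper (not the author's rgamma_cost): gamma draws with a
# given mean and coefficient of variation (CV = 1 / sqrt(shape)).
rgamma_mean_cv <- function(n, mean, cv) {
  shape <- 1 / cv^2
  rgamma(n, shape = shape, scale = mean / shape)
}

# Closed-form correlation between two columns, from the derivation above.
predicted_cor <- function(cvx, cvz) {
  1 / (cvz^2 + cvz^2 / cvx^2 + 1)
}

set.seed(1)
n <- 1e5
cvx <- 0.3
cvz <- 0.1
x  <- rgamma_mean_cv(n, 1, cvx)  # weights X, shared by both columns
z1 <- rgamma_mean_cv(n, 1, cvz)  # noise Z of the first column
z2 <- rgamma_mean_cv(n, 1, cvz)  # independent noise Z' of the second column

# Empirical correlation of XZ and XZ' next to the predicted value.
empirical <- cor(x * z1, x * z2)
c(empirical = empirical, predicted = predicted_cor(cvx, cvz))
```

With a large sample the two values should agree to within sampling error. The speed y is left out on purpose, since a constant factor per column does not change the correlation.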

CVX <- 0.1
CVY <- 0.2
CVZ <- 0.3
task <- rgamma_cost(1000, 1, CVX)
proc <- rgamma_cost(1000, 1, CVY)
mat <- generate_heterogeneous_matrix_noise(task, proc, CVZ)
mean(sapply(2:ncol(mat), function(j) {
    cor.test(mat[, 1], mat[, j])$estimate
}))

## [1] 0.1199

1/(CVZ^2 + CVZ^2/CVX^2 + 1)

## [1] 0.09911

This is close. Let's test with other settings.

CVX <- 0.3
CVY <- 0.2
CVZ <- 0.1
task <- rgamma_cost(1000, 1, CVX)
proc <- rgamma_cost(1000, 1, CVY)
mat <- generate_heterogeneous_matrix_noise(task, proc, CVZ)
mean(sapply(2:ncol(mat), function(j) {
    cor.test(mat[, 1], mat[, j])$estimate
}))

## [1] 0.8895

1/(CVZ^2 + CVZ^2/CVX^2 + 1)

## [1] 0.892

This is also quite close. As a final rule for fixing the noise, we may require CVZ > CVX. In this case var(Z)/var(X) > 1, so the correlation is lower than 1/(CVZ^2+2), which is itself lower than 0.5.
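The rule can be checked numerically on a grid of settings. This is only an illustrative sketch: `predicted_cor` is a hypothetical helper that restates the closed form derived earlier, and the grid bounds are arbitrary.

```r
# Closed-form correlation between two columns (hypothetical helper name).
predicted_cor <- function(cvx, cvz) {
  1 / (cvz^2 + cvz^2 / cvx^2 + 1)
}

# Arbitrary grid of coefficients of variation for the weights and the noise.
grid <- expand.grid(cvx = seq(0.1, 1, by = 0.1),
                    cvz = seq(0.1, 1, by = 0.1))
grid$cor   <- predicted_cor(grid$cvx, grid$cvz)
grid$bound <- 1 / (grid$cvz^2 + 2)

# Keep only the settings that follow the rule CVZ > CVX.
ok <- subset(grid, cvz > cvx)

# Both inequalities of the rule should hold on every retained setting.
all(ok$cor < ok$bound & ok$bound < 0.5)
```

The bound only limits the correlation from above. Reaching a specific target correlation still requires solving the closed form for CVZ.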