Correlation analysis of existing methods (March 11 and 12, 2015)

We are interested in summarizing and checking the correlation properties of the following methods: range-based, CVB, noise-based and combination-based.

Range-based

Inconsistent

Let's start with the range-based method. When a=b=0, the correlation is 0 between each row.

Rtask <- 10000
Rmach <- 100
mat <- generate_matrix_siegel_range(200, 200, Rtask, Rmach, a = 0, b = 0)
mean_cor_row(mat)

## [1] 0.009438

Let X be the first generated vector that follows U[1,Rtask] and Xi and Xj be two vectors that are multiplied by X and that follow U[1,Rmach]. The correlation between X * Xi and X * Xj is:

rho(X Xi, X Xj) = cov(X Xi, X Xj) / (sqrt(var(X Xi)) * sqrt(var(X Xj)))
rho(X Xi, X Xj) = (E[X Xi X Xj] - E[X Xi] E[X Xj]) /
              (var(X) * var(Xi) + mean(X)^2 * var(Xi) + var(X) * mean(Xi)^2)
covar(XZ, XZ') = (E[X^2] E[Xi]^2 - E[X]^2 E[Xi]^2)
covar(XZ, XZ') = var(X) * E[Xi]^2

The corresponding means are : ½ * (1+Rtask) and ½ * (1+Rmach). The corresponding variances are : 1/12 * (Rtask-1)² and 1/12 * (Rmach-1)^2. Therefore, the correlation coefficient is:

1/12 * (Rtask-1)^2 * 1/4 * (1+Rmach)^2 /
      (1/12 * (Rtask-1)^2 * 1/12 * (Rmach-1)^2 +
        1/12 * (Rtask-1)^2 * 1/4 * (1+Rmach)^2 +
        1/12 * (Rmach-1)^2 * 1/4 * (1+Rtask)^2)
(1+Rmach)^2 / (1/3 * (Rmach-1)^2 + (1+Rmach)^2 + (Rmach-1)^2 * (1+Rtask)^2 / (Rtask-1)^2)

Rtask <- 10000
Rmach <- 100
mat <- generate_matrix_siegel_range(200, 200, Rtask, Rmach, a = 0, b = 0)
mean_cor_col(mat)

## [1] 0.4349

(1 + Rmach)^2/(1/3 * (Rmach - 1)^2 + (1 + Rmach)^2 + (Rmach - 1)^2 * (1 + Rtask)^2/(Rtask - 
    1)^2)

## [1] 0.4383

When Rtask and Rmach are large, this tends to 3/7, which is around 0.43.

Consistent

Now when all lines are sorted, then each pair of rows is correlated because values on each row increases. The column correlations are also 1 because each column is the product of the same variable (one of the quantile of U[1,Rmach]) by the same vector that follows U[1,Rtask].

Rtask <- 10000
Rmach <- 100
mat <- generate_matrix_siegel_range(200, 200, Rtask, Rmach, a = 1, b = 1)
mean_cor_row(mat)

## [1] 0.9962

mean_cor_col(mat)

## [1] 0.9695

Arbitrary consistency

First, let's try with a=b=½.

Rtask <- 10000
Rmach <- 100
mat <- generate_matrix_siegel_range(200, 200, Rtask, Rmach, a = 1/2, b = 1/2)
mean_cor_row(mat)

## [1] 0.1369

mean_cor_col(mat)

## [1] 0.4613

We conjecture that the row correlation is a² * b and that the column correlation will be difficult to determine.

Rtask <- 10000
Rmach <- 100
prec <- NULL
for (a in seq(0, 1, 0.1)) for (b in seq(0, 1, 0.1)) {
    mat <- generate_matrix_siegel_range(50, 50, Rtask, Rmach, a, b)
    prec <- c(prec, mean_cor_row(mat) - a^2 * b)
}
summary(prec)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.0212  0.0122  0.0261  0.0223  0.0354  0.0514

Let's see if we can work on the column correlation. Let's start with the case where a=1. It seems more easy because entire columns are affected. Then, there are bm columns that are correlated and (1-b)m with the previously computed correlation (around 0.43). The question is how are the sorted columns correlated with the others.

Let X be the first generated vector that follows U[1,Rtask] and Xi be a vector that is multiplied by X and that follows U[1,Rmach]. The correlation between X * Xi and X is:

rho(X Xi, X) = cov(X Xi, X) / (sqrt(var(X Xi)) * sqrt(var(X)))
rho(X Xi, X Xj) = (E[X Xi X] - E[X Xi] E[X]) /
              (sqrt(var(X) * var(Xi) + mean(X)^2 * var(Xi)
                  + var(X) * mean(Xi)^2) * sqrt(var(X)))
covar(XZ, XZ') = (E[X^2] E[Xi] - E[X]^2 E[Xi])
covar(XZ, XZ') = var(X) * E[Xi]
rho(X Xi, X Xj) = sqrt(var(X)) * E[Xi] /
              sqrt(var(X) * var(Xi) + mean(X)^2 * var(Xi) + var(X) * mean(Xi)^2)

The corresponding means are : ½ * (1+Rtask) and ½ * (1+Rmach). The corresponding variances are : 1/12 * (Rtask-1)² and 1/12 * (Rmach-1)^2. Therefore, the correlation coefficient is:

1/sqrt(12) * (Rtask-1) * 1/2 * (1+Rmach) /
      sqrt(1/12 * (Rtask-1)^2 * 1/12 * (Rmach-1)^2 +
        1/12 * (Rtask-1)^2 * 1/4 * (1+Rmach)^2 +
        1/12 * (Rmach-1)^2 * 1/4 * (1+Rtask)^2)
(1+Rmach) / sqrt(1/3 * (Rmach-1)^2 + (1+Rmach)^2 + (Rmach-1)^2 * (1+Rtask)^2 / (Rtask-1)^2)

When Rtask and Rmach are large, this tends to sqrt(3/7), which is around 0.65.

We conjecture that the correlation is thus:

b^2 + 2*b*(1-b) * (1+Rmach) / sqrt(1/3 * (Rmach-1)^2 + (1+Rmach)^2 +
          (Rmach-1)^2 * (1+Rtask)^2 / (Rtask-1)^2) +
    (1-b)^2 * (1+Rmach)^2 / (1/3 * (Rmach-1)^2 + (1+Rmach)^2 +
          (Rmach-1)^2 * (1+Rtask)^2 / (Rtask-1)^2)

When Rtask and Rmach are large, this tends to \( b^2 * 3/7 + 2*b*(1-b) * sqrt(3/7) + (1-b)^2 \).

Rtask <- 10000
Rmach <- 100
prec <- NULL
for (b in seq(0, 1, 0.1)) {
    mat <- generate_matrix_siegel_range(100, 100, Rtask, Rmach, a = 1, b)
    cor_sorted <- (1 + Rmach)^2/(1/3 * (Rmach - 1)^2 + (1 + Rmach)^2 + (Rmach - 
        1)^2 * (1 + Rtask)^2/(Rtask - 1)^2)
    cor_mix <- (1 + Rmach)/sqrt(1/3 * (Rmach - 1)^2 + (1 + Rmach)^2 + (Rmach - 
        1)^2 * (1 + Rtask)^2/(Rtask - 1)^2)
    prec <- c(prec, mean_cor_col(mat) - (b^2 + 2 * b * (1 - b) * cor_mix + (1 - 
        b)^2 * cor_sorted))
}
summary(prec)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.08460 -0.06010 -0.03820 -0.03950 -0.01920 -0.00924

For arbitrary a, it smells like the failed analysis for the last heterogeneity measures. Let's skip this case.

CVB

It should be a simple adaptation. Let's directly consider the last case and let's see if the row correlation is again a² * b.

Vtask <- 0.2
Vmach <- 0.3
prec <- NULL
for (a in seq(0, 1, 0.1)) for (b in seq(0, 1, 0.1)) {
    mat <- generate_matrix_siegel_CV(50, 50, 10, 1, Vtask, Vmach, a, b)
    prec <- c(prec, mean_cor_row(mat) - a^2 * b)
}
summary(prec)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.03930  0.00708  0.02520  0.01920  0.03560  0.04720

Now, let's derive the column correlation:

rho(X Xi, X Xj) = var(X) * E[Xi]^2 /
              (var(X) * var(Xi) + mean(X)^2 * var(Xi) + var(X) * mean(Xi)^2)

The corresponding means are : mutask and 1. The corresponding variances are : mutask² * Vtask² and Vmach^2.

Therefore, the correlation coefficient is:

mutask^2 * Vtask^2 /
              (mutask^2 * Vtask^2 * Vmach^2 + mutask^2 * Vmach^2 + mutask^2 * Vtask^2)
1 / (Vmach^2 + Vmach^2 / Vtask^2 + 1)

And for the case when a=1:

rho(X Xi, X Xj) = sqrt(var(X)) * E[Xi] /
              sqrt(var(X) * var(Xi) + mean(X)^2 * var(Xi) + var(X) * mean(Xi)^2)

And the correlation is:

sqrt(mutask^2 * Vtask^2) /
              sqrt(mutask^2 * Vtask^2 * Vmach^2 + mutask^2 * Vmach^2 + mutask^2 * Vtask^2)
1 / sqrt(Vmach^2 + Vmach^2 / Vtask^2 + 1)

Let's test this:

Vtask <- 0.2
Vmach <- 0.3
prec <- NULL
for (b in seq(0, 1, 0.1)) {
    mat <- generate_matrix_siegel_CV(100, 100, 10, 1, Vtask, Vmach, a = 1, b)
    cor_sorted <- 1/(Vmach^2 + Vmach^2/Vtask^2 + 1)
    cor_mix <- 1/sqrt(Vmach^2 + Vmach^2/Vtask^2 + 1)
    prec <- c(prec, mean_cor_col(mat) - (b^2 + 2 * b * (1 - b) * cor_mix + (1 - 
        b)^2 * cor_sorted))
}
summary(prec)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.0727 -0.0318 -0.0161 -0.0178  0.0015  0.0300

Noise-based

We already analyzed that the row and column correlations are \( 1 / (Vnoise^2 + Vnoise^2 / Vmach^2 + 1) \) and \( 1 / (Vnoise^2 + Vnoise^2 / Vtask^2 + 1) \).

Vtask <- 0.3
Vmach <- 0.2
Vnoise <- 0.1
task <- rgamma_cost(300, 1, Vtask)
proc <- rgamma_cost(300, 1, Vmach)
mat <- generate_heterogeneous_matrix_noise(task, proc, Vnoise)
mean_cor_row(mat)

## [1] 0.7837

1/(Vnoise^2 + Vnoise^2/Vmach^2 + 1)

## [1] 0.7937

mean_cor_col(mat)

## [1] 0.8906

1/(Vnoise^2 + Vnoise^2/Vtask^2 + 1)

## [1] 0.892

What is missing is a way to select the parameters given a row correlation rhoR and a column correlation rhoC.

rhoR = 1 / (Vnoise^2 + Vnoise^2 / Vmach^2 + 1)
rhoC = 1 / (Vnoise^2 + Vnoise^2 / Vtask^2 + 1)

Thus:

Vtask = Vnoise / sqrt(1 / rhoC - Vnoise^2 - 1)
Vmach = Vnoise / sqrt(1 / rhoR - Vnoise^2 - 1)

Necessarily, \( 1 / rhoR - Vnoise^2 - 1 > 0 \) and \( Vnoise < sqrt(1 / rhoR - 1) \).

We have several choices:

Either fix the minimum heterogeneity (between Vtask and Vmach) to some value because most interesting behaviors occur when the heterogeneity is higher than some threshold. This would however means that the maximum heterogeneity can be arbitrarily high.
Or limit the maximum heterogeneity (between Vtask and Vmach). This is more important because heterogeneity higher than 1 causes generation problems.

Let's do some test with \( Vnoise = min(sqrt(1 / rhoR - 1), sqrt(1 / rhoC - 1)) * k \). In this case, the maximum heterogeneity (when rhoR is close to 1) is \( k / sqrt(1 - k^2) \). Let Vmax be the maximum value, then:

Vmax = k / sqrt(1 - k^2)
Vmax^2 * (1 - k^2) - k^2 = 0
Vmax^2 = (Vmax^2 + 1) * k^2
k = Vmax / sqrt(Vmax^2 + 1)

for (rhoR in c(0.01, 0.99)) for (rhoC in c(0.01, 0.99)) {
    Vmax <- 3
    k <- Vmax/sqrt(Vmax^2 + 1)
    Vnoise <- min(sqrt(1/rhoR - 1), sqrt(1/rhoC - 1)) * k
    print(c(Vnoise, Vnoise/sqrt(1/rhoR - Vnoise^2 - 1), Vnoise/sqrt(1/rhoC - 
        Vnoise^2 - 1)))
}

## [1] 9.439 3.000 3.000
## [1] 0.095346 0.009583 3.000000
## [1] 0.095346 3.000000 0.009583
## [1] 0.09535 3.00000 3.00000

The only issue is that Vnoise should also be limited. Let's add another limit:

rhoR <- 0.01
rhoC <- 0.01
Vnoise <- min(Vmax, min(sqrt(1/rhoR - 1), sqrt(1/rhoC - 1)) * k)
c(Vnoise, Vnoise/sqrt(1/rhoR - Vnoise^2 - 1), Vnoise/sqrt(1/rhoC - Vnoise^2 - 
    1))

## [1] 3.0000 0.3162 0.3162

This looks good. Most heterogeneity values are between 0.01 and Vmax which is the area where things happen (the place to be).

for (rhoR in c(0.01, 0.99)) for (rhoC in c(0.01, 0.99)) {
    mat <- generate_heterogeneous_matrix_noise_corr(200, 200, rgamma_cost, rhoR, 
        rhoC, 3)
    print(c(mean_cor_row(mat), mean_cor_col(mat)))
}

## [1] 0.01867 0.01862
## [1] 0.01876 0.99213
## [1] 0.99172 0.01796
## [1] 0.9913 0.9920

Combination-based

The correlations are obviously the ones given as parameters. However, it would be interesting to analyse the heterogeneity of the computed matrices. The CV of each column and row is initially \( sigma/mean \). We can find the following heterogeneity by testing different values.

sigma <- 3
mu <- 10
rhoR <- 0.2
rhoC <- 0.3
mat <- generate_matrix_corr(1000, 1000, rgamma_cost, mu, sigma, rhoR, rhoC)
c(mean_CV_row(mat), sigma/mu * sqrt(1 - rhoC)/sqrt(1 - rhoR * rhoC))

## [1] 0.2644 0.2589

c(mean_CV_col(mat), sigma/mu * sqrt(1 - rhoR)/sqrt(1 - rhoR * rhoC))

## [1] 0.2810 0.2768

c(CV_mean_row(mat), sigma/mu * sqrt(rhoC) * sqrt(1 - rhoR)/sqrt(1 - rhoR * rhoC))

## [1] 0.1529 0.1516

c(CV_mean_col(mat), sigma/mu * sqrt(1 - rhoC) * sqrt(rhoR)/sqrt(1 - rhoR * rhoC))

## [1] 0.1159 0.1158

Conclusion

We have completed the correlation and the heterogeneity properties of two methods from the literature and two new methods. The last thing to do before launching the XP is to implement the TMA measure.