More robust generation methods with given correlation (March 19, 2015)

We need to adjust the combination-based method to avoid negative costs when used with a large CV. Also, we will make sure that extreme values (0,0) and (1,1) are correctly handled by the noise-based and the combination-based method. We will finally see what can be done for the following cases: (0,.), (1,.), (.,0) and (.,1).

Positive costs for the combination-based method

Let's generate the extreme values with the current method. We will artificially put one of the value in each generated vector to the minimum one (zero).

n <- 2
m <- 2
rdist <- rgamma_cost
mu <- 1
sigma <- 0.1
rhoR <- 0.3
rhoC <- 0.6

combine <- function(X, Y, rho, mean) {
    X[1] <- min(X[1], 0)
    Y[1] <- min(Y[1], 0)
    sqrt(rho) * X + sqrt(1 - rho) * Y + (1 - sqrt(rho) - sqrt(1 - rho)) * mean
}

C <- rdist(n, mu, sigma)
Z <- sapply(1:m, function(x) return(combine(C, rdist(length(C), mu, sigma), 
    rhoC, mu)))
R <- rdist(m, mu, sqrt(1 - rhoC) * sigma)
for (i in 1:nrow(Z)) Z[i, ] <- combine(R, Z[i, ], rhoR, mu)
(Z - mu)/sqrt(1 - rhoR * rhoC) + mu
##         [,1]   [,2]
## [1,] -0.9049 -0.279
## [2,] -0.5288  1.020

Analytically, the minimum value is:

(sqrt(1 - rhoR * rhoC) - sqrt(1-rhoR) * (sqrt(rhoC)+sqrt(1-rhoC)) - sqrt(rhoR))
      * mu / sqrt(1 - rhoR * rhoC)
(sqrt(1 - rhoR * rhoC) - sqrt(1 - rhoR) * (sqrt(rhoC) + sqrt(1 - rhoC)) - sqrt(rhoR)) * 
    mu/sqrt(1 - rhoR * rhoC)
## [1] -0.9049

Therefore, for some values of correlations, it is not possible to guarantee positive values. To remove them, we can insert an additive value in the result. However, the mean is increased and the CV is thus decreased.

Another approach is to change the parameters to fix only the CV of the resulting matrix.

large <- generate_matrix_corr_positive(1000, 1000, rgamma_cost, 1, 1, rhoR, 
    rhoC)
min(large)
## [1] 3.166e-05
mean(large)
## [1] 0.9977
sqrt(var(as.vector(large)))/mean(large)
## [1] 1.033
medium <- generate_matrix_corr_positive(100, 100, rgamma_cost, 1, 1, rhoR, rhoC)
plot.ecdf(medium)

plot of chunk unnamed-chunk-3

The final method takes correlations, CV and mean as parameters. We must, however, update the heterogeneity analysis:

rhoR <- 0.2
rhoC <- 0.3
mu <- 1
CV <- 0.1
large <- generate_matrix_corr_positive(1000, 1000, rgamma_cost, mu, CV, rhoR, 
    rhoC)
c(mean_CV_row(large), CV * (sqrt(rhoC) + sqrt(1 - rhoC)) * (sqrt(rhoR) + sqrt(1 - 
    rhoR)) * sqrt(1 - rhoC)/(sqrt(rhoR) + sqrt(1 - rhoR) * (sqrt(rhoC) + sqrt(1 - 
    rhoC))))
## [1] 0.09191 0.09220
c(mean_CV_col(large), CV * (sqrt(rhoC) + sqrt(1 - rhoC)) * (sqrt(rhoR) + sqrt(1 - 
    rhoR)) * sqrt(1 - rhoR)/(sqrt(rhoR) + sqrt(1 - rhoR) * (sqrt(rhoC) + sqrt(1 - 
    rhoC))))
## [1] 0.09985 0.09857
c(CV_mean_row(large), CV * (sqrt(rhoC) + sqrt(1 - rhoC)) * (sqrt(rhoR) + sqrt(1 - 
    rhoR)) * sqrt(rhoC) * sqrt(1 - rhoR)/(sqrt(rhoR) + sqrt(1 - rhoR) * (sqrt(rhoC) + 
    sqrt(1 - rhoC))))
## [1] 0.05667 0.05399
c(CV_mean_col(large), CV * (sqrt(rhoC) + sqrt(1 - rhoC)) * (sqrt(rhoR) + sqrt(1 - 
    rhoR)) * sqrt(1 - rhoC) * sqrt(rhoR)/(sqrt(rhoR) + sqrt(1 - rhoR) * (sqrt(rhoC) + 
    sqrt(1 - rhoC))))
## [1] 0.04085 0.04123

This is a bit more complicated, but this is closer to what we want. Let's just check that the correlations are still OK.

medium <- generate_matrix_corr_positive(300, 300, rgamma_cost, 1, 0.1, rhoR, 
    rhoC)
mean_cor_row(medium)
## [1] 0.2012
mean_cor_col(medium)
## [1] 0.2983

Dealing with extreme cases

Let's complete our two methods (generate_heterogeneous_matrix_noise_corr and generate_matrix_corr_positive) for the extreme cases (0,0) and (1,1).

n <- 100
m <- 100
mu <- 1
CV <- 1
Vmax <- 1
rdist <- rgamma_cost
rhoR <- 0
rhoC <- 0
mat <- generate_matrix_corr_positive(n, m, rdist, mu, CV, rhoR, rhoC)
mean_cor_row(mat)
## [1] 0.01987
mean_cor_col(mat)
## [1] 0.02018
mat <- generate_heterogeneous_matrix_noise_corr(n, m, rdist, rhoR, rhoC, Vmax)
mean_cor_row(mat)
## [1] 0.01796
mean_cor_col(mat)
## [1] 0.01994

It seems to work fine. Now, for (1,1):

rhoR <- 1
rhoC <- 1
mat <- generate_matrix_corr_positive(n, m, rdist, mu, CV, rhoR, rhoC)
mean_cor_row(mat)
## [1] 1
mean_cor_col(mat)
## [1] 1
mat <- generate_heterogeneous_matrix_noise_corr(n, m, rdist, rhoR, rhoC, Vmax)
mean_cor_row(mat)
## [1] 1
mean_cor_col(mat)
## [1] 1

What remains are the following cases: (0,.), (1,.), (.,0) and (.,1). Let's start with (0,.):

rhoR <- 0
rhoC <- 0.5
mat <- generate_matrix_corr_positive(n, m, rdist, mu, CV, rhoR, rhoC)
mean_cor_row(mat)
## [1] 0.01861
mean_cor_col(mat)
## [1] 0.5516
mat <- generate_heterogeneous_matrix_noise_corr(n, m, rdist, rhoR, rhoC, Vmax)
mean_cor_row(mat)
## [1] 0.02197
mean_cor_col(mat)
## [1] 0.5117

It works without problem. For (.,0):

rhoR <- 0.5
rhoC <- 0
mat <- generate_matrix_corr_positive(n, m, rdist, mu, CV, rhoR, rhoC)
mean_cor_row(mat)
## [1] 0.5745
mean_cor_col(mat)
## [1] 0.02142
mat <- generate_heterogeneous_matrix_noise_corr(n, m, rdist, rhoR, rhoC, Vmax)
mean_cor_row(mat)
## [1] 0.5023
mean_cor_col(mat)
## [1] 0.01829

Like a charm. Now, what about (1,.) and more specifically (1,0). It is easy to have simple matrices with perfectly correlated rows and columns that have only a high correlation. However, it is difficult to see a solution for controling this last correaltion. We propose to fail gracefully in this case.

rhoR <- 1
rhoC <- 0
mat <- generate_matrix_corr_positive(n, m, rdist, mu, CV, rhoR, rhoC)
## Error: No method implemented for this case
mat <- generate_heterogeneous_matrix_noise_corr(n, m, rdist, rhoR, rhoC, Vmax)
## Error: No method implemented for this case

Conclusion

Both methods are now ready to be used in experiments and to be described in the research report.