More robust generation methods with given correlation (March 19, 2015)

We need to adjust the combination-based method to avoid negative costs when used with a large CV. Also, we will make sure that extreme values (0,0) and (1,1) are correctly handled by the noise-based and the combination-based method. We will finally see what can be done for the following cases: (0,.), (1,.), (.,0) and (.,1).

Positive costs for the combination-based method

Let's generate the extreme values with the current method. We will artificially put one of the value in each generated vector to the minimum one (zero).

n <- 2
m <- 2
rdist <- rgamma_cost
mu <- 1
sigma <- 0.1
rhoR <- 0.3
rhoC <- 0.6

combine <- function(X, Y, rho, mean) {
    X[1] <- min(X[1], 0)
    Y[1] <- min(Y[1], 0)
    sqrt(rho) * X + sqrt(1 - rho) * Y + (1 - sqrt(rho) - sqrt(1 - rho)) * mean
}

C <- rdist(n, mu, sigma)
Z <- sapply(1:m, function(x) return(combine(C, rdist(length(C), mu, sigma), 
    rhoC, mu)))
R <- rdist(m, mu, sqrt(1 - rhoC) * sigma)
for (i in 1:nrow(Z)) Z[i, ] <- combine(R, Z[i, ], rhoR, mu)
(Z - mu)/sqrt(1 - rhoR * rhoC) + mu

##         [,1]   [,2]
## [1,] -0.9049 -0.279
## [2,] -0.5288  1.020

Analytically, the minimum value is:

(sqrt(1 - rhoR * rhoC) - sqrt(1-rhoR) * (sqrt(rhoC)+sqrt(1-rhoC)) - sqrt(rhoR))
      * mu / sqrt(1 - rhoR * rhoC)

(sqrt(1 - rhoR * rhoC) - sqrt(1 - rhoR) * (sqrt(rhoC) + sqrt(1 - rhoC)) - sqrt(rhoR)) * 
    mu/sqrt(1 - rhoR * rhoC)

## [1] -0.9049

Therefore, for some values of correlations, it is not possible to guarantee positive values. To remove them, we can insert an additive value in the result. However, the mean is increased and the CV is thus decreased.

Another approach is to change the parameters to fix only the CV of the resulting matrix.

large <- generate_matrix_corr_positive(1000, 1000, rgamma_cost, 1, 1, rhoR, 
    rhoC)
min(large)

## [1] 3.166e-05

mean(large)

## [1] 0.9977

sqrt(var(as.vector(large)))/mean(large)

## [1] 1.033

medium <- generate_matrix_corr_positive(100, 100, rgamma_cost, 1, 1, rhoR, rhoC)
plot.ecdf(medium)

plot of chunk unnamed-chunk-3

The final method takes correlations, CV and mean as parameters. We must, however, update the heterogeneity analysis:

rhoR <- 0.2
rhoC <- 0.3
mu <- 1
CV <- 0.1
large <- generate_matrix_corr_positive(1000, 1000, rgamma_cost, mu, CV, rhoR, 
    rhoC)
c(mean_CV_row(large), CV * (sqrt(rhoC) + sqrt(1 - rhoC)) * (sqrt(rhoR) + sqrt(1 - 
    rhoR)) * sqrt(1 - rhoC)/(sqrt(rhoR) + sqrt(1 - rhoR) * (sqrt(rhoC) + sqrt(1 - 
    rhoC))))

## [1] 0.09191 0.09220

c(mean_CV_col(large), CV * (sqrt(rhoC) + sqrt(1 - rhoC)) * (sqrt(rhoR) + sqrt(1 - 
    rhoR)) * sqrt(1 - rhoR)/(sqrt(rhoR) + sqrt(1 - rhoR) * (sqrt(rhoC) + sqrt(1 - 
    rhoC))))

## [1] 0.09985 0.09857

c(CV_mean_row(large), CV * (sqrt(rhoC) + sqrt(1 - rhoC)) * (sqrt(rhoR) + sqrt(1 - 
    rhoR)) * sqrt(rhoC) * sqrt(1 - rhoR)/(sqrt(rhoR) + sqrt(1 - rhoR) * (sqrt(rhoC) + 
    sqrt(1 - rhoC))))

## [1] 0.05667 0.05399

c(CV_mean_col(large), CV * (sqrt(rhoC) + sqrt(1 - rhoC)) * (sqrt(rhoR) + sqrt(1 - 
    rhoR)) * sqrt(1 - rhoC) * sqrt(rhoR)/(sqrt(rhoR) + sqrt(1 - rhoR) * (sqrt(rhoC) + 
    sqrt(1 - rhoC))))

## [1] 0.04085 0.04123

This is a bit more complicated, but this is closer to what we want. Let's just check that the correlations are still OK.

medium <- generate_matrix_corr_positive(300, 300, rgamma_cost, 1, 0.1, rhoR, 
    rhoC)
mean_cor_row(medium)

## [1] 0.2012

mean_cor_col(medium)

## [1] 0.2983

Dealing with extreme cases

Let's complete our two methods (generate_heterogeneous_matrix_noise_corr and generate_matrix_corr_positive) for the extreme cases (0,0) and (1,1).

n <- 100
m <- 100
mu <- 1
CV <- 1
Vmax <- 1
rdist <- rgamma_cost
rhoR <- 0
rhoC <- 0
mat <- generate_matrix_corr_positive(n, m, rdist, mu, CV, rhoR, rhoC)
mean_cor_row(mat)

## [1] 0.01987

mean_cor_col(mat)

## [1] 0.02018

mat <- generate_heterogeneous_matrix_noise_corr(n, m, rdist, rhoR, rhoC, Vmax)
mean_cor_row(mat)

## [1] 0.01796

mean_cor_col(mat)

## [1] 0.01994

It seems to work fine. Now, for (1,1):

rhoR <- 1
rhoC <- 1
mat <- generate_matrix_corr_positive(n, m, rdist, mu, CV, rhoR, rhoC)
mean_cor_row(mat)

## [1] 1

mean_cor_col(mat)

## [1] 1

mat <- generate_heterogeneous_matrix_noise_corr(n, m, rdist, rhoR, rhoC, Vmax)
mean_cor_row(mat)

## [1] 1

mean_cor_col(mat)

## [1] 1

What remains are the following cases: (0,.), (1,.), (.,0) and (.,1). Let's start with (0,.):

rhoR <- 0
rhoC <- 0.5
mat <- generate_matrix_corr_positive(n, m, rdist, mu, CV, rhoR, rhoC)
mean_cor_row(mat)

## [1] 0.01861

mean_cor_col(mat)

## [1] 0.5516

mat <- generate_heterogeneous_matrix_noise_corr(n, m, rdist, rhoR, rhoC, Vmax)
mean_cor_row(mat)

## [1] 0.02197

mean_cor_col(mat)

## [1] 0.5117

It works without problem. For (.,0):

rhoR <- 0.5
rhoC <- 0
mat <- generate_matrix_corr_positive(n, m, rdist, mu, CV, rhoR, rhoC)
mean_cor_row(mat)

## [1] 0.5745

mean_cor_col(mat)

## [1] 0.02142

mat <- generate_heterogeneous_matrix_noise_corr(n, m, rdist, rhoR, rhoC, Vmax)
mean_cor_row(mat)

## [1] 0.5023

mean_cor_col(mat)

## [1] 0.01829

Like a charm. Now, what about (1,.) and more specifically (1,0). It is easy to have simple matrices with perfectly correlated rows and columns that have only a high correlation. However, it is difficult to see a solution for controling this last correaltion. We propose to fail gracefully in this case.

rhoR <- 1
rhoC <- 0
mat <- generate_matrix_corr_positive(n, m, rdist, mu, CV, rhoR, rhoC)

## Error: No method implemented for this case

mat <- generate_heterogeneous_matrix_noise_corr(n, m, rdist, rhoR, rhoC, Vmax)

## Error: No method implemented for this case

Conclusion

Both methods are now ready to be used in experiments and to be described in the research report.