Aggregation method (April 11, 2014)

Objectives

Let's improve upon Siegel method with an aggregation method that set the correlation between the rows and the columns. The rationale for having the same standard deviation for the rows and the columns is that it is the same for the uniform case.

Same correlations

generate_matrix_aggregation <- function(R, C, rdistr, rdistc, rho) {
    if (length(R) == 1) 
        R <- rdistr(R)
    if (length(C) == 1) 
        C <- rdistc(C)

    Z <- sapply(1:length(R), function(j) return(C))
    for (i in 1:nrow(Z)) Z[i, ] <- Z[i, ] * R

    Zc <- sapply(1:length(R), function(j) return(rdistc(length(C))))
    for (i in 1:nrow(Z)) Zc[i, ] <- Zc[i, ] * rdistr(length(R))

    alpha <- sqrt(rho)/muR
    beta <- sqrt(1 - alpha^2)
    gamma <- mean(R) * mean(C) * (1 - alpha - beta)

    return(alpha * Z + beta * Zc + gamma)
}

set.seed(0)

rdistr <- function(n) {
    return(runif(n, 1, 10))
}
rdistc <- function(n) {
    return(runif(n, 1, 10))
}

muR <- mean(rdistr(1000))
muC <- mean(rdistc(1000))
sigmaR <- sqrt(var(rdistr(1000)))
sigmaC <- sqrt(var(rdistc(1000)))

rho <- 0.9

Z <- generate_matrix_aggregation(100, 100, rdistr, rdistc, rho)

c(muR * muC, mean(Z))
## [1] 29.76 30.29
c(sqrt(sigmaC^2 * sigmaR^2 + sigmaC^2 * muR^2 + muC^2 * sigmaR^2), sqrt(var(as.vector(Z))), 
    sqrt(var(Z[1, ])), sqrt(var(Z[, 1])))
## [1] 21.74 21.37 19.72 19.89
summary(sapply(1:100, function(x) return(cor.test(Z[1, ], Z[x, ])$estimate)))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.2380 -0.0538  0.0037  0.0183  0.0809  1.0000
summary(sapply(1:100, function(x) return(cor.test(Z[, 1], Z[, x])$estimate)))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.2180 -0.0736  0.0083  0.0215  0.0802  1.0000
c(rho, cor.test(Z[1, ], Z[2, ])$estimate, cor.test(Z[, 1], Z[, 2])$estimate)
##                 cor       cor 
##  0.900000 -0.057971  0.006723

That does not work because the correlation between two columns or rows is harder to analyse than expected. It appears the formulas for the correlation between two sums are incorrect. This derivation would require more work.

The correlation is

alpha^2*r*r'*sigmaC / 
    sqrt((r^2*sigmaC^2+sigmaR^2*sigmaC^2+sigmaR^2*muC^2+sigmaC^2*muR^2) * 
      (r'^2*sigmaC^2+sigmaR^2*sigmaC^2+sigmaR^2*muC^2+sigmaC^2*muR^2))

where r and r' are random variable distributed according to the rdistr. The expected value for this quantity seems complicated to derive because the random variable is also present is the denominator.