Empirical Study of Siegel's Method (March 30, 2014)

Motivation

The previous study (March 14, 2013) was a failure. Let's evaluate Siegel's method before considering a new simpler aggregation method.

Analysis

The final matrix has several characteristics:

The default method (following methods are mere adaptations that uses gamma distribution) has a mean that is muR * muC and it standard deviation is sqrt(sigmaC2 * sigmaR2 + sigmaC2 * muR2 + muC2 * sigmaR2).

The expected standard deviation of each row is muC * sigmaR and the expected standard deviation of each column is the same as the standard deviation of all values.

Finally, the correlation between row is 0 and the correlation between column is muR2 * sigmaC2 / (sigmaC2 * sigmaR2 + sigmaC2 * muR2 + muC2 * sigmaR2)

Implementation

generate_matrix_siegel <- function(R, C, rdistr, rdistc) {
    if (length(R) == 1) 
        R <- rdistr(R)
    if (length(C) == 1) 
        C <- rdistc(C)

    Z <- sapply(1:length(C), function(j) return(R))
    for (i in 1:nrow(Z)) Z[i, ] <- Z[i, ] * rdistc(length(C))

    return(Z)
}

set.seed(0)

rdistr <- function(n) {
    return(runif(n, 1, 50))
}
rdistc <- function(n) {
    return(runif(n, 1, 1000))
}

muR <- mean(rdistr(1000))
muC <- mean(rdistc(1000))
sigmaR <- sqrt(var(rdistr(1000)))
sigmaC <- sqrt(var(rdistc(1000)))

Z <- generate_matrix_siegel(1000, 1000, rdistr, rdistc)

c(muR * muC, mean(Z))
## [1] 12511 12972
c(sqrt(sigmaC^2 * sigmaR^2 + sigmaC^2 * muR^2 + muC^2 * sigmaR^2), sqrt(var(as.vector(Z))))
## [1] 11223 11151
c(muC * sigmaR, sqrt(var(Z[1, ])))
## [1] 6967 9817
c(sqrt(sigmaC^2 * sigmaR^2 + sigmaC^2 * muR^2 + muC^2 * sigmaR^2), sqrt(var(Z[, 
    1])))
## [1] 11223 11350
c(0, cor.test(Z[1, ], Z[2, ])$estimate)
##               cor 
##  0.00000 -0.02325
c(muR^2 * sigmaC^2/(sigmaC^2 * sigmaR^2 + sigmaC^2 * muR^2 + muC^2 * sigmaR^2), 
    cor.test(Z[, 1], Z[, 2])$estimate)
##           cor 
## 0.4691 0.3704

Given this, changing the input distributions will change only four parameters: muR, muC, sigmaR and sigmaC. This is insufficient to control all 6 characteristics (additionally, rhoR will always be zero).

This motivates a new symmetric approach that control for at least the two correlations and the global mean and standard deviation.