Finishing Siegel's methods analysis (November 14, 2014, continued on November 21)

In our previous analysis from October 1, the missing part was to analyse the heterogeneity of Siegel's method when a fraction of the matrix is consistent. In the following, we assume that a (resp. b) is the fraction of rows (resp. columns) that is consistent.

Based on the previous results, we could have something like the following

task heterogeneity:
- CV_mean_row : CV(Y)
- mean_CV_col : (1 - a) * (1 - b) * (sqrt(CV(X)² * CV(Y)² + CV(X)² + CV(Y)²⁾⁾
- a * b * CV(Y)
machine heterogeneity:
- CV_mean_col : 0 + a * b * CV(X)
- mean_CV_row : CV(X)

It should be noted also that the properties for Siegel's method are inverted. Rtask controls actually the machine heterogeneity and Rmach controls the task heterogeneity.

Experimentations

Let's start by some empirical tests on the partially-consistent method:

generate_matrix_siegel_CV <- function(n, m, mean_task, mean_mach, CV_task, CV_mach, 
    a = 0, b = 0) {
    rdistr <- function(n) {
        return(rgamma_cost(n, mean_task, mean_task * CV_task))
    }
    rdistc <- function(n) {
        return(rgamma_cost(n, mean_mach, mean_mach * CV_mach))
    }
    make_consistent(generate_matrix_siegel(n, m, rdistr, rdistc), a, b)
}

set.seed(0)
Z <- generate_matrix_siegel_CV(1000, 1000, 10, 10, 0.1, 0.1, 0.5, 0.25)
mean_CV_col(Z)

## [1] 0.1393

CV_mean_col(Z)

## [1] 0.02509

set.seed(0)
Z <- generate_matrix_siegel_CV(1000, 1000, 100, 100, 0.1, 0.1, 0.5, 0.25)
mean_CV_col(Z)

## [1] 0.1393

CV_mean_col(Z)

## [1] 0.02509

This is encouraging: the result does not depend on the mean on the distributions and there may exist a closed formula that depends only on CV.

Let's start by checking a simple formula for the CV of the means of the columns: a * b * CV(X)

a <- 1
b <- 0.5
CV_X <- 0.1
CV_Y <- 0.1
a * b * CV_X

## [1] 0.05

CV_mean_col(generate_matrix_siegel_CV(1000, 1000, 10, 10, CV_X, CV_Y, a, b))

## [1] 0.07042

a <- 0.5
b <- 1
a * b * CV_X

## [1] 0.05

CV_mean_col(generate_matrix_siegel_CV(1000, 1000, 10, 10, CV_X, CV_Y, a, b))

## [1] 0.05019

This seems OK for a but not for b. The mean of each column is a * X * mean(Y) + (1-a) * mean(X) * mean(Y) for the b sorted columns and mean(X) * mean(Y) for the others. The mean of those means is the same as the last and the standard deviation is sqrt(b * a² * mean(Y)² * var(X)) that is sqrt(b) * a * mean(Y) * std(X). Then, the CV would be : a * sqrt(b) * CV(X)

a <- 0.5
b <- 0.25
a * sqrt(b) * CV_X

## [1] 0.025

CV_mean_col(generate_matrix_siegel_CV(1000, 1000, 10, 10, CV_X, CV_Y, a, b))

## [1] 0.02487

Now, let's consider the mean of the CV of the columns. For a given sorted column, there are a part that has mean x*mean(Y) and a part that has mean mean(X) * mean(Y). Therefore, the mean of a column is: a * x * mean(Y) + (1-a) * mean(X) * mean(Y)

Let's use maxima for the next step (deriving the variance from the mean of the square of the random variable):

assume(CVX >= 0, CVY >= 0, MUX >= 0, MUY >= 0);
assume(a >= 0, a <= 1, b >= 0, b <= 1);
declare(a, constant);
declare(b, constant);
declare(x, constant);
VARX : MUX^2 * CVX^2;
VARY : MUY^2 * CVY^2;
MUX2 : VARX + MUX^2;
MUY2 : VARY + MUY^2;
MUCOL : a * x * MUY + (1-a) * MUX * MUY;
MUCOL2 : a * x^2 * MUY2 + (1-a) * MUX2 * MUY2;
VARCOL : MUCOL2 - MUCOL^2;
CVCOL : sqrt(VARCOL) / MUCOL;
expand(CVCOL);
factor(CVCOL); radcan(CVCOL); ratsimp(CVCOL);

The result is a bit complex. Let's see if this is independant from the mean.

ev(CVCOL, MUX=10, CVX=0.1, MUY=10, CVY=0.1, a=0.5, x=9);
ev(CVCOL, MUX=100, CVX=0.1, MUY=10, CVY=0.1, a=0.5, x=90);

At this point, I suspect that the mean of the CV of the columns may depends on the distribution. Let's test this:

generate_matrix_siegel_CV_unif <- function(n, m, mean_task, mean_mach, CV_task, 
    CV_mach, a = 0, b = 0) {
    runif_cost <- function(n, mean, sigma) {
        if (mean == 0 || sigma == 0) 
            return(rep(mean, n))
        min <- mean - sigma * sqrt(12)/2
        max <- mean + sigma * sqrt(12)/2
        return(runif(n, min, max))
    }
    rdistr <- function(n) {
        return(runif_cost(n, mean_task, mean_task * CV_task))
    }
    rdistc <- function(n) {
        return(runif_cost(n, mean_mach, mean_mach * CV_mach))
    }
    make_consistent(generate_matrix_siegel(n, m, rdistr, rdistc), a, b)
}

Z <- generate_matrix_siegel_CV(5000, 5000, 10, 10, 0.1, 0.1, 0.5, 0.25)
mean_CV_col(Z)

## [1] 0.1393

CV_mean_col(Z)

## [1] 0.02503


Z <- generate_matrix_siegel_CV_unif(5000, 5000, 10, 10, 0.1, 0.1, 0.5, 0.25)
mean_CV_col(Z)

## [1] 0.1402

CV_mean_col(Z)

## [1] 0.02501

Incorrect suspicion. There must be a formal result. Maybe the result is simple (I am starting to be hopeless):

a <- 0.5
b <- 1
sqrt(((1 - a) * CV_X^2 + 1) * CV_Y^2 + (1 - a) * CV_X^2)

## [1] 0.1227

mean_CV_col(generate_matrix_siegel_CV(5000, 5000, 10, 10, CV_X, CV_Y, a, b))

## [1] 0.1333

This is not the same thing (obviously). I have no idea how to proceed from this point.

Conclusion

In case of partial consistency, we will just empirically measure the heterogeneity properties for the mean of the CV of the column. For all the other settings, we have closed formulas.