Edit (2015-01-04): fix Siegel implementation
We will formally derive the heterogeneity properties of the noise-based method. The CVs of the row means and of the column means are simple to derive. However, the mean of the CVs involves a product of random variables. Let Z be the noise random variable:
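The helpers used throughout this log come from our generation code. For a self-contained read, here is a minimal sketch of what they are assumed to compute (names and signatures inferred from their usage below; the actual package implementations may differ): gamma-distributed costs with a given mean and CV, a noise-based matrix with entries task_i * proc_j * Z_ij, and the four row/column CV statistics.

```r
# Hypothetical sketches of the helpers used in this log (assumed from usage;
# the real implementations may differ).
CV <- function(x) sd(x) / mean(x)

# Gamma-distributed costs with a given mean and coefficient of variation.
rgamma_cost <- function(n, mean, CV) {
  if (CV == 0) return(rep(mean, n))
  shape <- 1 / CV^2
  rgamma(n, shape = shape, rate = shape / mean)
}

# Noise-based matrix: entry (i, j) is task_i * proc_j * Z_ij with unit-mean noise.
generate_heterogeneous_matrix_noise <- function(task, proc, CVZ) {
  noise <- matrix(rgamma_cost(length(task) * length(proc), 1, CVZ),
                  nrow = length(task))
  outer(task, proc) * noise
}

CV_mean_row <- function(mat) CV(rowMeans(mat))        # CV of the row means
CV_mean_col <- function(mat) CV(colMeans(mat))        # CV of the column means
mean_CV_row <- function(mat) mean(apply(mat, 1, CV))  # mean of the row CVs
mean_CV_col <- function(mat) mean(apply(mat, 2, CV))  # mean of the column CVs
```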
CVX <- 0.1
CVY <- 0.2
CVZ <- 0.3
task <- rgamma_cost(1000, 1, CVX)
proc <- rgamma_cost(1000, 1, CVY)
mat <- generate_heterogeneous_matrix_noise(task, proc, CVZ)
c(CV_mean_row(mat), CVX)
## [1] 0.1013 0.1000
c(CV_mean_col(mat), CVY)
## [1] 0.21 0.20
c(mean_CV_col(mat), sqrt(CVX^2 * CVZ^2 + CVX^2 + CVZ^2))
## [1] 0.3183 0.3176
c(mean_CV_row(mat), sqrt(CVY^2 * CVZ^2 + CVY^2 + CVZ^2))
## [1] 0.3718 0.3655
Looks good. Now, we can extend the results with the second heterogeneity measure. Assume we want Htask and Hmach as target measures. We must determine the three parameters CVX, CVY and CVZ. Let us use the following rule: set CVZ to min(Htask, Hmach), set the CV associated with the smaller heterogeneity to 0, and solve the mean-CV formula above for the remaining CV.
This maximizes the noise and alters the matrix the most relative to the uniform one.
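A hypothetical helper makes the rule explicit: the noise CV takes the value of the smaller heterogeneity, the CV on that side is zeroed, and the remaining CV is obtained by solving sqrt(CV^2 * CVZ^2 + CV^2 + CVZ^2) = H (this function is an illustration, not part of the package):

```r
# Hypothetical helper implementing the rule above: CVZ absorbs the smaller
# heterogeneity, the CV on the smaller side is 0, and the other CV solves
# sqrt(CV^2 * CVZ^2 + CV^2 + CVZ^2) = H for the larger heterogeneity H.
noise_parameters <- function(Htask, Hmach) {
  if (Htask >= Hmach)
    list(CVX = sqrt((Htask^2 - Hmach^2) / (1 + Hmach^2)), CVY = 0, CVZ = Hmach)
  else
    list(CVX = 0, CVY = sqrt((Hmach^2 - Htask^2) / (1 + Htask^2)), CVZ = Htask)
}
noise_parameters(0.4, 0.2)
```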
Let's try the first case:
Htask = 0.4
Hmach = 0.2
CVX <- sqrt((Htask^2 - Hmach^2)/(1 + Hmach^2))
CVY <- 0
CVZ <- Hmach
task <- rgamma_cost(1000, 1, CVX)
proc <- rgamma_cost(1000, 1, CVY)
mat <- generate_heterogeneous_matrix_noise(task, proc, CVZ)
c(mean_CV_col(mat), sqrt(CVX^2 * CVZ^2 + CVX^2 + CVZ^2))
## [1] 0.3957 0.4000
c(mean_CV_row(mat), sqrt(CVY^2 * CVZ^2 + CVY^2 + CVZ^2))
## [1] 0.1997 0.2000
cor.test(mat[, 1], mat[, 2])$estimate
## cor
## 0.7171
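This correlation is consistent with a quick derivation (assuming entries of the form X_i * Y_j * Z_ij with unit-mean noise): two columns share the task factor X, so their normalized covariance is CV(X)^2, while each column's squared CV is Htask^2, giving a predicted correlation of CVX^2 / Htask^2:

```r
# Predicted inter-column correlation under the assumed noise model:
# columns j and k share X_i, so Cov/E^2 = CVX^2 and Var/E^2 = Htask^2.
Htask <- 0.4; Hmach <- 0.2
CVX2 <- (Htask^2 - Hmach^2) / (1 + Hmach^2)
CVX2 / Htask^2  # about 0.721, close to the observed 0.7171
```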
Let's go back to the mean of the CVs of the columns for Siegel's method, to see whether we can get the result when either a or b is 0 or 1 but not both (so far, we only have the cases a=b=0 and a=b=1).
Let's start by varying b when a=1. A possible solution could be something like: (1 - b) * sqrt(CV(X)^2 * CV(Y)^2 + CV(X)^2 + CV(Y)^2) + b * CV(X)
CV_X <- 0.2
CV_Y <- 0.3
test_b <- sapply(seq(0, 1, 0.01), function(b) c((1 - b) * sqrt(CV_X^2 * CV_Y^2 +
CV_X^2 + CV_Y^2) + b * CV_X, mean_CV_col(generate_matrix_siegel_CV(2000,
2000, 10, 10, CV_X, CV_Y, 1, b))))
library(ggplot2)
p <- ggplot(data.frame(t(test_b)), aes(X1, X2))
p <- p + geom_point()
p <- p + geom_abline()
p
Looks good. It also confirms the intuition. Let's try with a very large matrix and b=0.5:
a <- 1
b <- 0.5
summary(replicate(10, mean_CV_col(generate_matrix_siegel_CV(2000, 2000, 10,
10, CV_X, CV_Y, a, b))))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.279 0.283 0.285 0.284 0.285 0.286
(1 - b) * sqrt(CV_X^2 * CV_Y^2 + CV_X^2 + CV_Y^2) + b * CV_X
## [1] 0.2828
Quite good. Let's do the same for a when b=1:
CV_X <- 0.2
CV_Y <- 0.3
test_a <- sapply(seq(0, 1, 0.01), function(a) c((1 - a) * sqrt(CV_X^2 * CV_Y^2 +
CV_X^2 + CV_Y^2) + a * CV_X, mean_CV_col(generate_matrix_siegel_CV(2000,
2000, 10, 10, CV_X, CV_Y, a, 1))))
library(ggplot2)
p <- ggplot(data.frame(t(test_a)), aes(X1, X2))
p <- p + geom_point()
p <- p + geom_abline()
p
This is not a line, and there is no intuition supporting any particular form; it could be anything. For instance, the following fits better:
a <- seq(0, 1, 0.01)
test_a2 <- data.frame(x = test_a[1, ], pred = (1 - a^2) * sqrt(CV_X^2 * CV_Y^2 +
CV_X^2 + CV_Y^2) + a^2 * CV_X)
p + geom_line(data = test_a2, aes(x, pred))
Let's try with a very large matrix with a=0.5:
a <- 0.5
b <- 1
summary(replicate(100, mean_CV_col(generate_matrix_siegel_CV(2000, 2000, 10,
10, CV_X, CV_Y, a, b))))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.328 0.332 0.333 0.333 0.334 0.339
(1 - a^2) * sqrt(CV_X^2 * CV_Y^2 + CV_X^2 + CV_Y^2) + a^2 * CV_X
## [1] 0.3241
Not quite: the prediction even falls below the minimum observed value. It may be a close guess, but an incorrect one nonetheless.
Let's try a last estimation:
(1 - b) * sqrt(CV_X^2 * CV_Y^2 + CV_X^2 + CV_Y^2) +
    b * ((1 - a^2) * sqrt(CV_X^2 * CV_Y^2 + CV_X^2 + CV_Y^2) + a^2 * CV_X)
which can also be written:
(1 - b * a^2) * sqrt(CV_X^2 * CV_Y^2 + CV_X^2 + CV_Y^2) + b * a^2 * CV_X
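The equivalence of the two forms is simple algebra (the sqrt term factors out as (1 - b) + b(1 - a^2) = 1 - b*a^2), which a quick numeric check confirms:

```r
# Both forms of the guess should coincide for any a and b.
S <- sqrt(0.2^2 * 0.3^2 + 0.2^2 + 0.3^2)  # the common square-root term
a <- 0.7; b <- 0.4
form1 <- (1 - b) * S + b * ((1 - a^2) * S + a^2 * 0.2)
form2 <- (1 - b * a^2) * S + b * a^2 * 0.2
all.equal(form1, form2)
```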
a <- 0.5
b <- 0.5
summary(replicate(100, mean_CV_col(generate_matrix_siegel_CV(2000, 2000, 10,
10, CV_X, CV_Y, a, b))))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.344 0.348 0.349 0.349 0.351 0.355
(1 - b) * sqrt(CV_X^2 * CV_Y^2 + CV_X^2 + CV_Y^2) + b * ((1 - a^2) * sqrt(CV_X^2 *
CV_Y^2 + CV_X^2 + CV_Y^2) + a^2 * CV_X)
## [1] 0.3448
This is actually quite close. It may be correct after all. More experiments are required.
CV_X <- 0.2
CV_Y <- 0.3
test_ab <- NULL
for (a in seq(0, 1, 0.2)) for (b in seq(0, 1, 0.2)) {
R <- summary(replicate(10, mean_CV_col(generate_matrix_siegel_CV(2000, 2000,
10, 10, CV_X, CV_Y, a, b))))
pred <- (1 - b) * sqrt(CV_X^2 * CV_Y^2 + CV_X^2 + CV_Y^2) + b * ((1 - a^2) *
sqrt(CV_X^2 * CV_Y^2 + CV_X^2 + CV_Y^2) + a^2 * CV_X)
test_ab <- rbind(test_ab, data.frame(a, b, med = R["Median"], pred))
}
hist((test_ab$pred - test_ab$med)/test_ab$pred)
sum((test_ab$pred - test_ab$med)/test_ab$pred)
## [1] -0.39
Since the prediction is good (the summed relative error over the 36 settings is -0.39, about -1% per point), it may actually be a correct guess.
Let's try with other settings:
validate_guess <- function(CV_X, CV_Y) {
test_ab <- NULL
for (a in seq(0, 1, 0.2)) for (b in seq(0, 1, 0.2)) {
R <- summary(replicate(10, mean_CV_col(generate_matrix_siegel_CV(2000,
2000, 10, 10, CV_X, CV_Y, a, b))))
pred <- (1 - b) * sqrt(CV_X^2 * CV_Y^2 + CV_X^2 + CV_Y^2) + b * ((1 -
a^2) * sqrt(CV_X^2 * CV_Y^2 + CV_X^2 + CV_Y^2) + a^2 * CV_X)
test_ab <- rbind(test_ab, data.frame(a, b, med = R["Median"], pred))
}
test_ab
}
test_ab_lohi <- validate_guess(0.1, 0.5)
test_ab_hilo <- validate_guess(0.5, 0.1)
hist((test_ab_hilo$pred - test_ab_hilo$med)/test_ab_hilo$pred)
sum((test_ab_hilo$pred - test_ab_hilo$med)/test_ab_hilo$pred)
## [1] 0.009968
hist((test_ab_lohi$pred - test_ab_lohi$med)/test_ab_lohi$pred)
sum((test_ab_lohi$pred - test_ab_lohi$med)/test_ab_lohi$pred)
## [1] -1.808
test_ab_lohi[which.min((test_ab_lohi$pred - test_ab_lohi$med)/test_ab_lohi$pred),
]
## a b med pred
## Median29 0.8 1 0.354 0.2484
OK, this last case is really bad, especially when a is close to 1, b=1, and CV_X << CV_Y (task heterogeneity much lower than machine heterogeneity).
CV_X <- 0.1
CV_Y <- 0.5
a <- 0.8
b <- 1
summary(replicate(100, mean_CV_col(generate_matrix_siegel_CV(2000, 2000, 10,
10, CV_X, CV_Y, a, b))))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.350 0.353 0.353 0.353 0.355 0.357
(1 - b) * sqrt(CV_X^2 * CV_Y^2 + CV_X^2 + CV_Y^2) + b * ((1 - a^2) * sqrt(CV_X^2 *
CV_Y^2 + CV_X^2 + CV_Y^2) + a^2 * CV_X)
## [1] 0.2484
The prediction underestimates the actual value. This also suggests that the estimation precision depends on the heterogeneity values; that is, the heterogeneity values impact the aggregation between the case a=b=0 and the case a=b=1.
We have analysed the noise-based method and obtained a better result for Siegel's method. One case is still missing, though, for which we only have an estimation that empirical MC simulations have shown to be quite imprecise with some settings.