Edit (2015-01-04): fix Siegel implementation
We will formally derive the heterogeneity properties of the noise-based method. The CVs of the row means and of the column means are simple to derive. However, the mean of the CVs involves a product of random variables. Let Z be the noise random variable:
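The helpers used throughout this log come from our generation code. For a self-contained read, here is a minimal sketch of what they are assumed to compute (names and signatures inferred from their usage below; the actual package implementations may differ): gamma-distributed costs with a given mean and CV, a noise-based matrix with entries task_i * proc_j * Z_ij, and the four row/column CV statistics.

```r
# Hypothetical sketches of the helpers used in this log (assumed from usage;
# the real implementations may differ).
CV <- function(x) sd(x) / mean(x)

# Gamma-distributed costs with a given mean and coefficient of variation.
rgamma_cost <- function(n, mean, CV) {
  if (CV == 0) return(rep(mean, n))
  shape <- 1 / CV^2
  rgamma(n, shape = shape, rate = shape / mean)
}

# Noise-based matrix: entry (i, j) is task_i * proc_j * Z_ij with unit-mean noise.
generate_heterogeneous_matrix_noise <- function(task, proc, CVZ) {
  noise <- matrix(rgamma_cost(length(task) * length(proc), 1, CVZ),
                  nrow = length(task))
  outer(task, proc) * noise
}

CV_mean_row <- function(mat) CV(rowMeans(mat))        # CV of the row means
CV_mean_col <- function(mat) CV(colMeans(mat))        # CV of the column means
mean_CV_row <- function(mat) mean(apply(mat, 1, CV))  # mean of the row CVs
mean_CV_col <- function(mat) mean(apply(mat, 2, CV))  # mean of the column CVs
```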
CVX <- 0.1
CVY <- 0.2
CVZ <- 0.3
task <- rgamma_cost(1000, 1, CVX)
proc <- rgamma_cost(1000, 1, CVY)
mat <- generate_heterogeneous_matrix_noise(task, proc, CVZ)
c(CV_mean_row(mat), CVX)
## [1] 0.1013 0.1000
c(CV_mean_col(mat), CVY)
## [1] 0.21 0.20
c(mean_CV_col(mat), sqrt(CVX^2 * CVZ^2 + CVX^2 + CVZ^2))
## [1] 0.3183 0.3176
c(mean_CV_row(mat), sqrt(CVY^2 * CVZ^2 + CVY^2 + CVZ^2))
## [1] 0.3718 0.3655
Looks good. Now, we can extend the results with the second heterogeneity measure. Assume we want Htask and Hmach as target measures. We must determine the three parameters CVX, CVY and CVZ. Let us use the following rule: set CVZ to min(Htask, Hmach), set the CV associated with the smaller heterogeneity to 0, and solve the mean-CV formula above for the remaining CV.
This maximizes the noise and alters the matrix the most relative to the uniform one.
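A hypothetical helper makes the rule explicit: the noise CV takes the value of the smaller heterogeneity, the CV on that side is zeroed, and the remaining CV is obtained by solving sqrt(CV^2 * CVZ^2 + CV^2 + CVZ^2) = H (this function is an illustration, not part of the package):

```r
# Hypothetical helper implementing the rule above: CVZ absorbs the smaller
# heterogeneity, the CV on the smaller side is 0, and the other CV solves
# sqrt(CV^2 * CVZ^2 + CV^2 + CVZ^2) = H for the larger heterogeneity H.
noise_parameters <- function(Htask, Hmach) {
  if (Htask >= Hmach)
    list(CVX = sqrt((Htask^2 - Hmach^2) / (1 + Hmach^2)), CVY = 0, CVZ = Hmach)
  else
    list(CVX = 0, CVY = sqrt((Hmach^2 - Htask^2) / (1 + Htask^2)), CVZ = Htask)
}
noise_parameters(0.4, 0.2)
```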
Let's try the first case:
Htask = 0.4
Hmach = 0.2
CVX <- sqrt((Htask^2 - Hmach^2)/(1 + Hmach^2))
CVY <- 0
CVZ <- Hmach
task <- rgamma_cost(1000, 1, CVX)
proc <- rgamma_cost(1000, 1, CVY)
mat <- generate_heterogeneous_matrix_noise(task, proc, CVZ)
c(mean_CV_col(mat), sqrt(CVX^2 * CVZ^2 + CVX^2 + CVZ^2))
## [1] 0.3957 0.4000
c(mean_CV_row(mat), sqrt(CVY^2 * CVZ^2 + CVY^2 + CVZ^2))
## [1] 0.1997 0.2000
cor.test(mat[, 1], mat[, 2])$estimate
## cor
## 0.7171
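This correlation is consistent with a quick derivation (assuming entries of the form X_i * Y_j * Z_ij with unit-mean noise): two columns share the task factor X, so their normalized covariance is CV(X)^2, while each column's squared CV is Htask^2, giving a predicted correlation of CVX^2 / Htask^2:

```r
# Predicted inter-column correlation under the assumed noise model:
# columns j and k share X_i, so Cov/E^2 = CVX^2 and Var/E^2 = Htask^2.
Htask <- 0.4; Hmach <- 0.2
CVX2 <- (Htask^2 - Hmach^2) / (1 + Hmach^2)
CVX2 / Htask^2  # about 0.721, close to the observed 0.7171
```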
Let's go back to the mean of the CVs of the columns for Siegel's method, to see whether we can get the result when either a or b is 0 or 1 but not both (so far, we only have the cases a=b=0 and a=b=1).
Let's start by varying b when a=1. A possible solution could be something like: (1 - b) * sqrt(CV(X)^2 * CV(Y)^2 + CV(X)^2 + CV(Y)^2) + b * CV(X)
CV_X <- 0.2
CV_Y <- 0.3
test_b <- sapply(seq(0, 1, 0.01), function(b) c((1 - b) * sqrt(CV_X^2 * CV_Y^2 +
CV_X^2 + CV_Y^2) + b * CV_X, mean_CV_col(generate_matrix_siegel_CV(2000,
2000, 10, 10, CV_X, CV_Y, 1, b))))
library(ggplot2)
p <- ggplot(data.frame(t(test_b)), aes(X1, X2))
p <- p + geom_point()
p <- p + geom_abline()
p
Looks good. It also confirms the intuition. Let's try with a very large matrix and b=0.5:
a <- 1
b <- 0.5
summary(replicate(10, mean_CV_col(generate_matrix_siegel_CV(2000, 2000, 10,
10, CV_X, CV_Y, a, b))))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.279 0.283 0.285 0.284 0.285 0.286
(1 - b) * sqrt(CV_X^2 * CV_Y^2 + CV_X^2 + CV_Y^2) + b * CV_X
## [1] 0.2828
Quite good. Let's do the same for a when b=1:
CV_X <- 0.2
CV_Y <- 0.3
test_a <- sapply(seq(0, 1, 0.01), function(a) c((1 - a) * sqrt(CV_X^2 * CV_Y^2 +
CV_X^2 + CV_Y^2) + a * CV_X, mean_CV_col(generate_matrix_siegel_CV(2000,
2000, 10, 10, CV_X, CV_Y, a, 1))))
library(ggplot2)
p <- ggplot(data.frame(t(test_a)), aes(X1, X2))
p <- p + geom_point()
p <- p + geom_abline()
p
This is not a line, and there is no intuition supporting any particular form; it could be anything. For instance, the following fits better:
a <- seq(0, 1, 0.01)
test_a2 <- data.frame(x = test_a[1, ], pred = (1 - a^2) * sqrt(CV_X^2 * CV_Y^2 +
CV_X^2 + CV_Y^2) + a^2 * CV_X)
p + geom_line(data = test_a2, aes(x, pred))
Let's try with a very large matrix with a=0.5:
a <- 0.5
b <- 1
summary(replicate(100, mean_CV_col(generate_matrix_siegel_CV(2000, 2000, 10,
10, CV_X, CV_Y, a, b))))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.328 0.332 0.333 0.333 0.334 0.339
(1 - a^2) * sqrt(CV_X^2 * CV_Y^2 + CV_X^2 + CV_Y^2) + a^2 * CV_X
## [1] 0.3241
Not quite: the prediction even falls below the minimum observed value. It may be a close guess, but an incorrect one nonetheless.
Let's try a last estimation:
(1 - b) * sqrt(CV_X^2 * CV_Y^2 + CV_X^2 + CV_Y^2) +
    b * ((1 - a^2) * sqrt(CV_X^2 * CV_Y^2 + CV_X^2 + CV_Y^2) + a^2 * CV_X)
which can also be written:
(1 - b * a^2) * sqrt(CV_X^2 * CV_Y^2 + CV_X^2 + CV_Y^2) + b * a^2 * CV_X
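The equivalence of the two forms is simple algebra (the sqrt term factors out as (1 - b) + b(1 - a^2) = 1 - b*a^2), which a quick numeric check confirms:

```r
# Both forms of the guess should coincide for any a and b.
S <- sqrt(0.2^2 * 0.3^2 + 0.2^2 + 0.3^2)  # the common square-root term
a <- 0.7; b <- 0.4
form1 <- (1 - b) * S + b * ((1 - a^2) * S + a^2 * 0.2)
form2 <- (1 - b * a^2) * S + b * a^2 * 0.2
all.equal(form1, form2)
```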
a <- 0.5
b <- 0.5
summary(replicate(100, mean_CV_col(generate_matrix_siegel_CV(2000, 2000, 10,
10, CV_X, CV_Y, a, b))))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.344 0.348 0.349 0.349 0.351 0.355
(1 - b) * sqrt(CV_X^2 * CV_Y^2 + CV_X^2 + CV_Y^2) + b * ((1 - a^2) * sqrt(CV_X^2 *
CV_Y^2 + CV_X^2 + CV_Y^2) + a^2 * CV_X)
## [1] 0.3448
This is actually quite close. It may be correct after all. More experiments are required.
CV_X <- 0.2
CV_Y <- 0.3
test_ab <- NULL
for (a in seq(0, 1, 0.2)) for (b in seq(0, 1, 0.2)) {
R <- summary(replicate(10, mean_CV_col(generate_matrix_siegel_CV(2000, 2000,
10, 10, CV_X, CV_Y, a, b))))
pred <- (1 - b) * sqrt(CV_X^2 * CV_Y^2 + CV_X^2 + CV_Y^2) + b * ((1 - a^2) *
sqrt(CV_X^2 * CV_Y^2 + CV_X^2 + CV_Y^2) + a^2 * CV_X)
test_ab <- rbind(test_ab, data.frame(a, b, med = R["Median"], pred))
}
hist((test_ab$pred - test_ab$med)/test_ab$pred)
sum((test_ab$pred - test_ab$med)/test_ab$pred)
## [1] -0.39
Since the prediction is good (the summed relative error over the 36 settings is -0.39, about -1% per point), it may actually be a correct guess.
Let's try with other settings:
validate_guess <- function(CV_X, CV_Y) {
test_ab <- NULL
for (a in seq(0, 1, 0.2)) for (b in seq(0, 1, 0.2)) {
R <- summary(replicate(10, mean_CV_col(generate_matrix_siegel_CV(2000,
2000, 10, 10, CV_X, CV_Y, a, b))))
pred <- (1 - b) * sqrt(CV_X^2 * CV_Y^2 + CV_X^2 + CV_Y^2) + b * ((1 -
a^2) * sqrt(CV_X^2 * CV_Y^2 + CV_X^2 + CV_Y^2) + a^2 * CV_X)
test_ab <- rbind(test_ab, data.frame(a, b, med = R["Median"], pred))
}
test_ab
}
test_ab_lohi <- validate_guess(0.1, 0.5)
test_ab_hilo <- validate_guess(0.5, 0.1)
hist((test_ab_hilo$pred - test_ab_hilo$med)/test_ab_hilo$pred)
sum((test_ab_hilo$pred - test_ab_hilo$med)/test_ab_hilo$pred)
## [1] 0.009968
hist((test_ab_lohi$pred - test_ab_lohi$med)/test_ab_lohi$pred)
sum((test_ab_lohi$pred - test_ab_lohi$med)/test_ab_lohi$pred)
## [1] -1.808
test_ab_lohi[which.min((test_ab_lohi$pred - test_ab_lohi$med)/test_ab_lohi$pred),
]
## a b med pred
## Median29 0.8 1 0.354 0.2484
OK, this last case is really bad, especially when a is close to 1, b=1, and CV_X << CV_Y (task heterogeneity much lower than machine heterogeneity).
CV_X <- 0.1
CV_Y <- 0.5
a <- 0.8
b <- 1
summary(replicate(100, mean_CV_col(generate_matrix_siegel_CV(2000, 2000, 10,
10, CV_X, CV_Y, a, b))))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.350 0.353 0.353 0.353 0.355 0.357
(1 - b) * sqrt(CV_X^2 * CV_Y^2 + CV_X^2 + CV_Y^2) + b * ((1 - a^2) * sqrt(CV_X^2 *
CV_Y^2 + CV_X^2 + CV_Y^2) + a^2 * CV_X)
## [1] 0.2484
The prediction underestimates the actual value. This also suggests that the estimation precision depends on the heterogeneity values; that is, the heterogeneity values impact the aggregation between the case a=b=0 and the case a=b=1.
We have analysed the noise-based method and obtained a better result for Siegel's method. One case is still missing, though, for which we only have an estimation that empirical MC simulations have shown to be quite imprecise with some settings.