In our previous analysis from October 1, the missing part was to analyse the heterogeneity of Siegel's method when a fraction of the matrix is consistent. In the following, we assume that a (resp. b) is the fraction of rows (resp. columns) that is consistent.
Based on the previous results, we could have something like the following
It should be noted also that the properties for Siegel's method are inverted. Rtask controls actually the machine heterogeneity and Rmach controls the task heterogeneity.
Let's start by some empirical tests on the partially-consistent method:
generate_matrix_siegel_CV <- function(n, m, mean_task, mean_mach, CV_task, CV_mach,
a = 0, b = 0) {
rdistr <- function(n) {
return(rgamma_cost(n, mean_task, mean_task * CV_task))
}
rdistc <- function(n) {
return(rgamma_cost(n, mean_mach, mean_mach * CV_mach))
}
make_consistent(generate_matrix_siegel(n, m, rdistr, rdistc), a, b)
}
set.seed(0)
Z <- generate_matrix_siegel_CV(1000, 1000, 10, 10, 0.1, 0.1, 0.5, 0.25)
mean_CV_col(Z)
## [1] 0.1393
CV_mean_col(Z)
## [1] 0.02509
set.seed(0)
Z <- generate_matrix_siegel_CV(1000, 1000, 100, 100, 0.1, 0.1, 0.5, 0.25)
mean_CV_col(Z)
## [1] 0.1393
CV_mean_col(Z)
## [1] 0.02509
This is encouraging: the result does not depend on the mean on the distributions and there may exist a closed formula that depends only on CV.
Let's start by checking a simple formula for the CV of the means of the columns: a * b * CV(X)
a <- 1
b <- 0.5
CV_X <- 0.1
CV_Y <- 0.1
a * b * CV_X
## [1] 0.05
CV_mean_col(generate_matrix_siegel_CV(1000, 1000, 10, 10, CV_X, CV_Y, a, b))
## [1] 0.07042
a <- 0.5
b <- 1
a * b * CV_X
## [1] 0.05
CV_mean_col(generate_matrix_siegel_CV(1000, 1000, 10, 10, CV_X, CV_Y, a, b))
## [1] 0.05019
This seems OK for a
but not for b
. The mean of each column is
a * X * mean(Y) + (1-a) * mean(X) * mean(Y) for the b
sorted columns and
mean(X) * mean(Y) for the others. The mean of those means is the same as the
last and the standard deviation is sqrt(b * a2 * mean(Y)2 * var(X)) that is
sqrt(b) * a * mean(Y) * std(X). Then, the CV would be :
a * sqrt(b) * CV(X)
a <- 0.5
b <- 0.25
a * sqrt(b) * CV_X
## [1] 0.025
CV_mean_col(generate_matrix_siegel_CV(1000, 1000, 10, 10, CV_X, CV_Y, a, b))
## [1] 0.02487
Now, let's consider the mean of the CV of the columns. For a given sorted column, there are a part that has mean x*mean(Y) and a part that has mean mean(X) * mean(Y). Therefore, the mean of a column is: a * x * mean(Y) + (1-a) * mean(X) * mean(Y)
Let's use maxima for the next step (deriving the variance from the mean of the square of the random variable):
assume(CVX >= 0, CVY >= 0, MUX >= 0, MUY >= 0);
assume(a >= 0, a <= 1, b >= 0, b <= 1);
declare(a, constant);
declare(b, constant);
declare(x, constant);
VARX : MUX^2 * CVX^2;
VARY : MUY^2 * CVY^2;
MUX2 : VARX + MUX^2;
MUY2 : VARY + MUY^2;
MUCOL : a * x * MUY + (1-a) * MUX * MUY;
MUCOL2 : a * x^2 * MUY2 + (1-a) * MUX2 * MUY2;
VARCOL : MUCOL2 - MUCOL^2;
CVCOL : sqrt(VARCOL) / MUCOL;
expand(CVCOL);
factor(CVCOL); radcan(CVCOL); ratsimp(CVCOL);
The result is a bit complex. Let's see if this is independant from the mean.
ev(CVCOL, MUX=10, CVX=0.1, MUY=10, CVY=0.1, a=0.5, x=9);
ev(CVCOL, MUX=100, CVX=0.1, MUY=10, CVY=0.1, a=0.5, x=90);
At this point, I suspect that the mean of the CV of the columns may depends on the distribution. Let's test this:
generate_matrix_siegel_CV_unif <- function(n, m, mean_task, mean_mach, CV_task,
CV_mach, a = 0, b = 0) {
runif_cost <- function(n, mean, sigma) {
if (mean == 0 || sigma == 0)
return(rep(mean, n))
min <- mean - sigma * sqrt(12)/2
max <- mean + sigma * sqrt(12)/2
return(runif(n, min, max))
}
rdistr <- function(n) {
return(runif_cost(n, mean_task, mean_task * CV_task))
}
rdistc <- function(n) {
return(runif_cost(n, mean_mach, mean_mach * CV_mach))
}
make_consistent(generate_matrix_siegel(n, m, rdistr, rdistc), a, b)
}
Z <- generate_matrix_siegel_CV(5000, 5000, 10, 10, 0.1, 0.1, 0.5, 0.25)
mean_CV_col(Z)
## [1] 0.1393
CV_mean_col(Z)
## [1] 0.02503
Z <- generate_matrix_siegel_CV_unif(5000, 5000, 10, 10, 0.1, 0.1, 0.5, 0.25)
mean_CV_col(Z)
## [1] 0.1402
CV_mean_col(Z)
## [1] 0.02501
Incorrect suspicion. There must be a formal result. Maybe the result is simple (I am starting to be hopeless):
a <- 0.5
b <- 1
sqrt(((1 - a) * CV_X^2 + 1) * CV_Y^2 + (1 - a) * CV_X^2)
## [1] 0.1227
mean_CV_col(generate_matrix_siegel_CV(5000, 5000, 10, 10, CV_X, CV_Y, a, b))
## [1] 0.1333
This is not the same thing (obviously). I have no idea how to proceed from this point.
In case of partial consistency, we will just empirically measure the heterogeneity properties for the mean of the CV of the column. For all the other settings, we have closed formulas.