Analysing analytically Siegel's methods (October 1, 2014)

Edit (2015-1-4): fix siegel implementation

After the last two studies, it would be interesting to determine the heterogeneity properties of the matrices generated with the range-based and the CV-based methods. Let's take as heterogeneity measures the CV (coefficient of variation) of the means of the rows or of the columns.

However, it seems strange that the CV of the means of the rows is equivalent to the mean of the CVs of the columns. Let's perform a test:

generate_matrix_siegel_range <- function(n, m, Rtask, Rmach, a = 0, b = 0) {
    rdistr <- function(n) {
        return(runif(n, 1, Rtask))
    }
    rdistc <- function(n) {
        return(runif(n, 1, Rmach))
    }
    make_consistent(generate_matrix_siegel(n, m, rdistr, rdistc), a, b)
}

Z <- generate_matrix_siegel_range(100, 100, 10, 10)

CV_mean_row(Z)

## [1] 0.4228

mean_CV_col(Z)

## [1] 0.668

CV_mean_col(Z)

## [1] 0.05573

mean_CV_row(Z)

## [1] 0.4782

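In the previous study, this equivalence was only shown for uniform matrices. It does not hold in general: here CV_mean_row (0.42) differs from mean_CV_col (0.67), and CV_mean_col (0.06) differs from mean_CV_row (0.48). Therefore, we must also analyse the mean of the CVs. The equivalence may not hold either for the generation methods proposed previously.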
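The four measures are not defined in this post. A minimal sketch of what they presumably compute is given here, assuming that rows are tasks, columns are machines, and that the CV is the sample standard deviation divided by the mean. The function bodies are assumptions and not the original implementation.

```r
# Hypothetical re-implementations of the four measures (assumed, not the
# original code): CV = sample standard deviation / mean.
cv <- function(v) sd(v) / mean(v)

CV_mean_row <- function(mat) cv(rowMeans(mat))        # CV of the row means
CV_mean_col <- function(mat) cv(colMeans(mat))        # CV of the column means
mean_CV_row <- function(mat) mean(apply(mat, 1, cv))  # mean of the per-row CVs
mean_CV_col <- function(mat) mean(apply(mat, 2, cv))  # mean of the per-column CVs

# Tiny rank-one example: the second row is twice the first one.
M <- matrix(c(1, 2, 3,
              2, 4, 6), nrow = 2, byrow = TRUE)
CV_mean_row(M)
mean_CV_col(M)
```

On such a rank-one matrix, CV_mean_row and mean_CV_col coincide, which is the kind of equivalence that breaks down for the matrices studied here.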

Inconsistent range-based method

Let X be the random variable U[1,Rtask] and Y be U[1,Rmach]. We assume that the dimension of the matrix is infinite.

Let's start with the CV of the means of the rows. The mean of each row is X * mean(Y). The mean of those means is thus mean(X) * mean(Y) and their standard deviation is sd(X) * mean(Y). The CV is thus: CV(X) = sqrt(12)/6 * (Rtask-1)/(Rtask+1).
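The closed form comes from the mean (R+1)/2 and the standard deviation (R-1)/sqrt(12) of U[1,R]. As a quick sanity check, here is a small sketch that compares it with a large sample; the helper name cv_unif and the seed are illustrative choices, not part of the original code.

```r
# CV of U[1, R]: mean = (R + 1) / 2 and sd = (R - 1) / sqrt(12),
# hence CV = sqrt(12) / 6 * (R - 1) / (R + 1).
cv_unif <- function(R) {
    sqrt(12) / 6 * (R - 1) / (R + 1)
}

set.seed(1)  # arbitrary seed, only for reproducibility
x <- runif(1e6, 1, 10)
empirical <- sd(x) / mean(x)
c(empirical = empirical, closed_form = cv_unif(10))
```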

Let's try with several values of Rtask:

CV_mean_row(generate_matrix_siegel_range(1000, 1000, 10, 100))

## [1] 0.4654

sqrt(12)/6 * (10 - 1)/(10 + 1)

## [1] 0.4724

CV_mean_row(generate_matrix_siegel_range(1000, 1000, 5, 100))

## [1] 0.3775

sqrt(12)/6 * (5 - 1)/(5 + 1)

## [1] 0.3849

CV_mean_row(generate_matrix_siegel_range(1000, 1000, 100, 100))

## [1] 0.5759

sqrt(12)/6 * (100 - 1)/(100 + 1)

## [1] 0.5659

Seems legit. Let's continue with the mean of the CV of the rows. The CV of each row is CV(Y) = sqrt(12)/6 * (Rmach-1)/(Rmach+1).

mean_CV_row(generate_matrix_siegel_range(1000, 1000, 100, 10))

## [1] 0.4727

sqrt(12)/6 * (10 - 1)/(10 + 1)

## [1] 0.4724

mean_CV_row(generate_matrix_siegel_range(1000, 1000, 100, 5))

## [1] 0.3844

sqrt(12)/6 * (5 - 1)/(5 + 1)

## [1] 0.3849

mean_CV_row(generate_matrix_siegel_range(1000, 1000, 100, 100))

## [1] 0.5665

sqrt(12)/6 * (100 - 1)/(100 + 1)

## [1] 0.5659

Let's see the CV of the means of the columns. Each value in a column is the product of X and Y, and this is the same for every column. All column means thus converge to mean(X) * mean(Y). Hence, the CV is zero.

CV_mean_col(generate_matrix_siegel_range(1000, 1000, 100, 100))

## [1] 0.02067
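The measured value is small but not exactly zero, which is plausibly a finite-size effect. The sketch below is a rough check under an assumption about the generator: entry (i, j) is x_i * y_ij, with one draw of X per row and an independent draw of Y per entry. Under that assumption, the column means share the same x_i, so their CV should be close to CV(Y) * sqrt(1 + CV(X)²) / sqrt(n) for n rows. The function siegel_sketch is a hypothetical stand-in, not the original generate_matrix_siegel.

```r
# Hypothetical stand-in for the inconsistent range-based generator
# (an assumption): entry (i, j) = x_i * y_ij.
siegel_sketch <- function(n, m, Rtask, Rmach) {
    x <- runif(n, 1, Rtask)                    # one task weight per row
    y <- matrix(runif(n * m, 1, Rmach), n, m)  # one independent draw per entry
    y * x                                      # x is recycled down each column: row i is scaled by x[i]
}

cv <- function(v) sd(v) / mean(v)
cv_unif <- function(R) sqrt(12) / 6 * (R - 1) / (R + 1)

set.seed(2014)  # arbitrary seed
n <- 500
m <- 500
Z <- siegel_sketch(n, m, 100, 100)

observed <- cv(colMeans(Z))
# Finite-size prediction derived under the assumption above
predicted <- cv_unif(100) * sqrt(1 + cv_unif(100)^2) / sqrt(n)
c(observed = observed, predicted = predicted)
```

With n = 1000 and Rtask = Rmach = 100, the same expression gives about 0.0206, in line with the 0.02067 measured above, and it vanishes as n grows.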

Let's finish with the mean of the CV of the columns. Considering the CV of a single column is enough, for the same reason as in the previous case (all columns are similar). Since X and Y are independent, this is: CV(X * Y) = sqrt(var(X) * var(Y) + var(X) * mean(Y)² + var(Y) * mean(X)²) / (mean(X) * mean(Y)) = sqrt(CV(X)² * CV(Y)² + CV(X)² + CV(Y)²).

This gives: sqrt((sqrt(12)/6 * (Rtask-1)/(Rtask+1) * sqrt(12)/6 * (Rmach-1)/(Rmach+1))² + (sqrt(12)/6 * (Rtask-1)/(Rtask+1))² + (sqrt(12)/6 * (Rmach-1)/(Rmach+1))²).
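Since the same long expression is evaluated several times, it may be easier to read when wrapped in two small helpers. This is only a convenience sketch; the names cv_unif and cv_product are illustrative and do not appear in the original code.

```r
# CV of U[1, R]
cv_unif <- function(R) sqrt(12) / 6 * (R - 1) / (R + 1)

# CV of the product of two independent variables, given their CVs
cv_product <- function(cvx, cvy) {
    sqrt(cvx^2 * cvy^2 + cvx^2 + cvy^2)
}

cv_product(cv_unif(5), cv_unif(5))      # Rtask = 5,   Rmach = 5
cv_product(cv_unif(100), cv_unif(5))    # Rtask = 100, Rmach = 5
cv_product(cv_unif(100), cv_unif(100))  # Rtask = 100, Rmach = 100
```

The expression is symmetric in X and Y, so swapping Rtask and Rmach gives the same predicted value.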

mean_CV_col(generate_matrix_siegel_range(1000, 1000, Rtask <- 5, Rmach <- 5))

## [1] 0.5685

sqrt((sqrt(12)/6 * (Rtask - 1)/(Rtask + 1) * sqrt(12)/6 * (Rmach - 1)/(Rmach + 
    1))^2 + (sqrt(12)/6 * (Rtask - 1)/(Rtask + 1))^2 + (sqrt(12)/6 * (Rmach - 
    1)/(Rmach + 1))^2)

## [1] 0.5641

mean_CV_col(generate_matrix_siegel_range(1000, 1000, Rtask <- 100, Rmach <- 5))

## [1] 0.7297

sqrt((sqrt(12)/6 * (Rtask - 1)/(Rtask + 1) * sqrt(12)/6 * (Rmach - 1)/(Rmach + 
    1))^2 + (sqrt(12)/6 * (Rtask - 1)/(Rtask + 1))^2 + (sqrt(12)/6 * (Rmach - 
    1)/(Rmach + 1))^2)

## [1] 0.7182

mean_CV_col(generate_matrix_siegel_range(1000, 1000, Rtask <- 5, Rmach <- 100))

## [1] 0.7135

sqrt((sqrt(12)/6 * (Rtask - 1)/(Rtask + 1) * sqrt(12)/6 * (Rmach - 1)/(Rmach + 
    1))^2 + (sqrt(12)/6 * (Rtask - 1)/(Rtask + 1))^2 + (sqrt(12)/6 * (Rmach - 
    1)/(Rmach + 1))^2)

## [1] 0.7182

mean_CV_col(generate_matrix_siegel_range(1000, 1000, Rtask <- 100, Rmach <- 100))

## [1] 0.8584

sqrt((sqrt(12)/6 * (Rtask - 1)/(Rtask + 1) * sqrt(12)/6 * (Rmach - 1)/(Rmach + 
    1))^2 + (sqrt(12)/6 * (Rtask - 1)/(Rtask + 1))^2 + (sqrt(12)/6 * (Rmach - 
    1)/(Rmach + 1))^2)

## [1] 0.862

Nicely done. The same analysis may be used with the CV-based method:

task heterogeneity:
- CV_mean_row : CV(X)
- mean_CV_col : sqrt(CV(X)² * CV(Y)² + CV(X)² + CV(Y)²)
machine heterogeneity:
- CV_mean_col : 0
- mean_CV_row : CV(Y)

It should be noted that the CV of a uniform distribution U[1,R] stays below sqrt(12)/6 (~0.58) and gets close to this bound when Rtask and Rmach are large.

Let's see how this works for the consistent case.

Consistent range-based method

The difference is that each row is sorted. This does not change the measures that consider the rows (CV_mean_row and mean_CV_row). Let's start with the CV of the means of the columns. Intuitively, it should be CV(Y). The reason is that the values in the first column are the row minima: each one is the smallest sampled value of Y multiplied by a value of X, so the mean of the first column is min(Y) * mean(X). By the same principle, the mean of the k-th column out of n is mean(X) times the k/n-quantile of Y. The distribution of those means is therefore the initial distribution of Y multiplied by mean(X), which does not affect the CV.
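The quantile argument can be illustrated with a short sketch. It assumes that entry (i, j) is x_i * y_ij with an independent draw of Y per entry, and that full consistency amounts to sorting every row in increasing order; both are assumptions about generate_matrix_siegel and make_consistent, which are not shown here. The prediction uses the expected k-th order statistic of m uniform draws, which sits at quantile k/(m+1).

```r
cv <- function(v) sd(v) / mean(v)
cv_unif <- function(R) sqrt(12) / 6 * (R - 1) / (R + 1)

set.seed(42)  # arbitrary seed
n <- 500
m <- 500
Rtask <- 100
Rmach <- 10

x <- runif(n, 1, Rtask)
mat <- matrix(runif(n * m, 1, Rmach), n, m) * x  # row i is scaled by x[i]

# Assumed fully consistent variant: sort every row in increasing order.
# apply() returns the sorted rows as columns, hence the transpose.
sorted <- t(apply(mat, 1, sort))
col_means <- colMeans(sorted)

# Prediction: mean(X) times the expected k-th smallest of m draws of Y
pred <- mean(x) * qunif(seq_len(m) / (m + 1), 1, Rmach)

max_rel_error <- max(abs(col_means / pred - 1))
c(max_rel_error = max_rel_error,
  cv_col_means = cv(col_means),
  cv_y = cv_unif(Rmach))
```

If the assumptions hold, the column means follow the quantiles of Y closely and their CV is near CV(Y).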

CV_mean_col(generate_matrix_siegel_range(1000, 1000, 100, 10, 1, 1))

## [1] 0.4726

sqrt(12)/6 * (10 - 1)/(10 + 1)

## [1] 0.4724

CV_mean_col(generate_matrix_siegel_range(1000, 1000, 100, 5, 1, 1))

## [1] 0.3849

sqrt(12)/6 * (5 - 1)/(5 + 1)

## [1] 0.3849

CV_mean_col(generate_matrix_siegel_range(1000, 1000, 100, 100, 1, 1))

## [1] 0.5666

sqrt(12)/6 * (100 - 1)/(100 + 1)

## [1] 0.5659

Good intuition, but proving it will be more difficult. Now, the coefficient of variation of each column should be CV(X), because each column is X multiplied by a nearly constant quantile of Y.

mean_CV_col(generate_matrix_siegel_range(1000, 1000, 10, 100, 1, 1))

## [1] 0.4788

sqrt(12)/6 * (10 - 1)/(10 + 1)

## [1] 0.4724

mean_CV_col(generate_matrix_siegel_range(1000, 1000, 5, 100, 1, 1))

## [1] 0.3961

sqrt(12)/6 * (5 - 1)/(5 + 1)

## [1] 0.3849

mean_CV_col(generate_matrix_siegel_range(1000, 1000, 100, 100, 1, 1))

## [1] 0.5578

sqrt(12)/6 * (100 - 1)/(100 + 1)

## [1] 0.5659

Nice, the conclusion is thus:

task heterogeneity:
- CV_mean_row : CV(X)
- mean_CV_col : CV(X)
machine heterogeneity:
- CV_mean_col : CV(Y)
- mean_CV_row : CV(Y)

The question is how to extend those results when only a fraction a of the rows are sorted, and only over a fraction b of the columns. This seems to be tedious work without significant challenge.

Shuffling method's heterogeneity properties

Now that it has been shown that there are two ways to measure heterogeneity, a new question is whether our proposed method controls both measures in the same way and whether it is possible to analyse it formally.

task <- rgamma_cost(300, 1, 0.1)
proc <- rgamma_cost(300, 1, 0.2)
mat <- generate_heterogeneous_matrix_shuffling_smallest_cost(task, proc)
CV_mean_row(mat)

## [1] 0.09917

mean_CV_col(mat)

## [1] 0.1874

CV_mean_col(mat)

## [1] 0.2073

mean_CV_row(mat)

## [1] 0.2694

Unfortunately, the two measures do not agree. I have no hope of analytically deriving the expected mean of the CVs for either the rows or the columns. The noise method could be superior in this respect. The only concrete observation is that the means of the CVs are larger than their counterparts (up to twice as large), but they are not precisely controlled.

Conclusion

It was shown that there are two ways to measure heterogeneity, even though they are equivalent in the uniform case.

We have analysed Siegel's methods. The consistent CV-based method actually allows the heterogeneity to be controlled with respect to both measures. However, this method is quite close to the uniform method, which is more intuitive. The semi-consistent method has not been considered but should not present any challenge.

With our method, on the other hand, we control the heterogeneity with respect to one measure. This method removes most of the correlation, which the consistent CV-based method does not. Also, the shuffling method is slightly easier to use than the noise-based one, because the latter has one additional parameter.