Analysing Siegel's methods analytically (October 1, 2014)

Edit (2015-1-4): fix siegel implementation

After the last two studies, it would be interesting to determine the heterogeneity properties of the matrices generated with the range-based and the CV-based methods. Let's take as heterogeneity measures the CV (coefficient of variation) of the means of the rows or of the columns.

However, it seems strange that the CV of the means of the rows is equivalent to the mean of the CVs of the columns. Let's perform a test:

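# Range-based method: row factors follow U[1, Rtask], column factors U[1, Rmach].
# a and b are forwarded to make_consistent (the default 0, 0 is the inconsistent case).
generate_matrix_siegel_range <- function(n, m, Rtask, Rmach, a = 0, b = 0) {
    rdistr <- function(n) {
        return(runif(n, 1, Rtask))
    }
    rdistc <- function(n) {
        return(runif(n, 1, Rmach))
    }
    make_consistent(generate_matrix_siegel(n, m, rdistr, rdistc), a, b)
}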

Z <- generate_matrix_siegel_range(100, 100, 10, 10)

CV_mean_row(Z)
## [1] 0.4228
mean_CV_col(Z)
## [1] 0.668
CV_mean_col(Z)
## [1] 0.05573
mean_CV_row(Z)
## [1] 0.4782

In the previous study, this equivalence was only shown for uniform matrices. It does not hold in general. Therefore, we must also analyse the mean of the CVs. The equivalence may also fail for the generation methods proposed previously.
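For reference, here is a minimal sketch of the four measures compared above. Their actual definitions come from the previous studies and are not shown here, so the bodies below are assumptions based on the function names, and the helper cv is hypothetical.

```r
# Assumed definitions; the real ones come from the previous studies.
cv <- function(x) sd(x) / mean(x)                 # coefficient of variation (hypothetical helper)
CV_mean_row <- function(Z) cv(rowMeans(Z))        # CV of the row means
CV_mean_col <- function(Z) cv(colMeans(Z))        # CV of the column means
mean_CV_row <- function(Z) mean(apply(Z, 1, cv))  # mean of the per-row CVs
mean_CV_col <- function(Z) mean(apply(Z, 2, cv))  # mean of the per-column CVs

# Tiny example: the rows are (1, 3) and (2, 4).
Z <- matrix(c(1, 2, 3, 4), nrow = 2)
CV_mean_row(Z)
mean_CV_col(Z)
```

Even on this 2 x 2 example the two values differ, which is the point made by the test above.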

Inconsistent range-based method

Let X be the random variable U[1,Rtask] and Y be U[1,Rmach]. We assume that the dimension of the matrix is infinite.
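The derivations below rely on each entry being the product of a row factor X and a factor Y drawn independently for each entry (otherwise the column means would not all be identical). The following is only a sketch of that model, assuming that generate_matrix_siegel behaves this way; the name generate_matrix_siegel_sketch is hypothetical and this is not the implementation used for the results in this post.

```r
# Sketch of the assumed model: Z[i, j] = X[i] * Y[i, j].
generate_matrix_siegel_sketch <- function(n, m, rdistr, rdistc) {
    x <- rdistr(n)                              # one factor per row
    y <- matrix(rdistc(n * m), nrow = n, ncol = m)  # one independent factor per entry
    x * y                                       # x is recycled down each column, so row i is scaled by x[i]
}

set.seed(42)
Z <- generate_matrix_siegel_sketch(4, 3,
    function(n) runif(n, 1, 10),   # X ~ U[1, Rtask] with Rtask = 10
    function(n) runif(n, 1, 5))    # Y ~ U[1, Rmach] with Rmach = 5
dim(Z)
```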

Let's start with the CV of the means of the rows. The mean of each row is X * mean(Y). The mean of those means is thus mean(X) * mean(Y) and their standard deviation is sd(X) * mean(Y). The CV is thus CV(X) = sd(X) / mean(X) = sqrt(12)/6 * (Rtask-1)/(Rtask+1), since sd(X) = (Rtask-1)/sqrt(12) and mean(X) = (Rtask+1)/2.
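As a quick check of the closed form for the CV of U[1, R], independent of any matrix generator, one can compare it with a large sample. The helper name cv_unif is hypothetical.

```r
# CV of a uniform distribution on [1, R]: sd = (R - 1) / sqrt(12), mean = (R + 1) / 2.
cv_unif <- function(R) sqrt(12) / 6 * (R - 1) / (R + 1)

set.seed(1)
x <- runif(1e6, 1, 10)   # large sample from U[1, 10]
sd(x) / mean(x)          # empirical CV
cv_unif(10)              # closed form
```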

Let's try with several values of Rtask:

CV_mean_row(generate_matrix_siegel_range(1000, 1000, 10, 100))
## [1] 0.4654
sqrt(12)/6 * (10 - 1)/(10 + 1)
## [1] 0.4724
CV_mean_row(generate_matrix_siegel_range(1000, 1000, 5, 100))
## [1] 0.3775
sqrt(12)/6 * (5 - 1)/(5 + 1)
## [1] 0.3849
CV_mean_row(generate_matrix_siegel_range(1000, 1000, 100, 100))
## [1] 0.5759
sqrt(12)/6 * (100 - 1)/(100 + 1)
## [1] 0.5659

Seems legit. Let's continue with the mean of the CVs of the rows. Within a row, X is constant, so the CV of each row is CV(Y) = sqrt(12)/6 * (Rmach-1)/(Rmach+1).

mean_CV_row(generate_matrix_siegel_range(1000, 1000, 100, 10))
## [1] 0.4727
sqrt(12)/6 * (10 - 1)/(10 + 1)
## [1] 0.4724
mean_CV_row(generate_matrix_siegel_range(1000, 1000, 100, 5))
## [1] 0.3844
sqrt(12)/6 * (5 - 1)/(5 + 1)
## [1] 0.3849
mean_CV_row(generate_matrix_siegel_range(1000, 1000, 100, 100))
## [1] 0.5665
sqrt(12)/6 * (100 - 1)/(100 + 1)
## [1] 0.5659

Let's see the CV of the means of the columns. Each value in a column is the product of X and Y, and this holds for every column, so all columns have the same mean, mean(X) * mean(Y). Hence, the CV is zero for an infinite matrix.

CV_mean_col(generate_matrix_siegel_range(1000, 1000, 100, 100))
## [1] 0.02067
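The small non-zero value is a finite-size effect. A rough estimate, assuming n rows, entry factors Y that are independent, and row factors X that are shared by all columns: the spread of the column means then comes only from Y, which gives a CV of about sqrt(1 + CV(X)^2) * CV(Y) / sqrt(n). This estimate is not part of the original analysis, and cv_unif and cv_mean_col_finite are hypothetical helper names.

```r
cv_unif <- function(R) sqrt(12) / 6 * (R - 1) / (R + 1)

# Approximate CV of the column means for a matrix with n rows (assumed model:
# Z[i, j] = X[i] * Y[i, j], with X ~ U[1, Rtask] and Y ~ U[1, Rmach]).
cv_mean_col_finite <- function(n, Rtask, Rmach) {
    sqrt(1 + cv_unif(Rtask)^2) * cv_unif(Rmach) / sqrt(n)
}

cv_mean_col_finite(1000, 100, 100)
```

For n = 1000 and Rtask = Rmach = 100 this gives about 0.021, in line with the simulated value above, and it vanishes as n grows.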

Let's finish with the mean of the CVs of the columns. Considering the CV of one column is enough, for the same reason as in the previous case (all columns are similar). As X and Y are independent, this is: CV(X * Y) = sqrt(var(X) * var(Y) + var(X) * mean(Y)^2 + var(Y) * mean(X)^2) / (mean(X) * mean(Y)) = sqrt(CV(X)^2 * CV(Y)^2 + CV(X)^2 + CV(Y)^2).

This gives: sqrt((sqrt(12)/6 * (Rtask-1)/(Rtask+1) * sqrt(12)/6 * (Rmach-1)/(Rmach+1))^2 + (sqrt(12)/6 * (Rtask-1)/(Rtask+1))^2 + (sqrt(12)/6 * (Rmach-1)/(Rmach+1))^2)
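The product formula can be wrapped in two small helpers and checked on independent samples, without going through the matrix generator. The names cv_unif and cv_product are hypothetical; the check only assumes that X and Y are independent uniform variables.

```r
cv_unif <- function(R) sqrt(12) / 6 * (R - 1) / (R + 1)

# CV of the product of two independent variables with CVs cx and cy.
cv_product <- function(cx, cy) sqrt(cx^2 * cy^2 + cx^2 + cy^2)

cv_product(cv_unif(5), cv_unif(5))   # closed form for Rtask = Rmach = 5

set.seed(1)
x <- runif(1e6, 1, 5)
y <- runif(1e6, 1, 5)
sd(x * y) / mean(x * y)              # empirical CV of the product
```

Using such helpers also avoids repeating the long expression for each pair of parameters.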

mean_CV_col(generate_matrix_siegel_range(1000, 1000, Rtask <- 5, Rmach <- 5))
## [1] 0.5685
sqrt((sqrt(12)/6 * (Rtask - 1)/(Rtask + 1) * sqrt(12)/6 * (Rmach - 1)/(Rmach + 
    1))^2 + (sqrt(12)/6 * (Rtask - 1)/(Rtask + 1))^2 + (sqrt(12)/6 * (Rmach - 
    1)/(Rmach + 1))^2)
## [1] 0.5641
mean_CV_col(generate_matrix_siegel_range(1000, 1000, Rtask <- 100, Rmach <- 5))
## [1] 0.7297
sqrt((sqrt(12)/6 * (Rtask - 1)/(Rtask + 1) * sqrt(12)/6 * (Rmach - 1)/(Rmach + 
    1))^2 + (sqrt(12)/6 * (Rtask - 1)/(Rtask + 1))^2 + (sqrt(12)/6 * (Rmach - 
    1)/(Rmach + 1))^2)
## [1] 0.7182
mean_CV_col(generate_matrix_siegel_range(1000, 1000, Rtask <- 5, Rmach <- 100))
## [1] 0.7135
sqrt((sqrt(12)/6 * (Rtask - 1)/(Rtask + 1) * sqrt(12)/6 * (Rmach - 1)/(Rmach + 
    1))^2 + (sqrt(12)/6 * (Rtask - 1)/(Rtask + 1))^2 + (sqrt(12)/6 * (Rmach - 
    1)/(Rmach + 1))^2)
## [1] 0.7182
mean_CV_col(generate_matrix_siegel_range(1000, 1000, Rtask <- 100, Rmach <- 100))
## [1] 0.8584
sqrt((sqrt(12)/6 * (Rtask - 1)/(Rtask + 1) * sqrt(12)/6 * (Rmach - 1)/(Rmach + 
    1))^2 + (sqrt(12)/6 * (Rtask - 1)/(Rtask + 1))^2 + (sqrt(12)/6 * (Rmach - 
    1)/(Rmach + 1))^2)
## [1] 0.862

Nicely done. The same analysis may be used with the CV-based method:

It should be noted that the CV of a uniform distribution is close to sqrt(12)/6 (~0.58) when Rtask and Rmach are large.

Let's see how this works for the consistent case.

Consistent range-based method

The difference is that each row is sorted. This does not change the measures that consider the rows (CV_mean_row and mean_CV_row). Let's start with the CV of the means of the columns. Intuitively, it should be CV(Y). The reason is that the values in the first column are the minimum values of Y in each row. As they are all multiplied by a value of X, the mean of the first column is min(Y) * mean(X). By the same principle, the mean of the k-th of the m columns is the k/m-quantile of Y multiplied by mean(X). The distribution of those means is therefore the initial distribution of Y multiplied by mean(X), which does not affect the CV.
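The quantile argument can be checked without simulation, at least in expectation. For m independent draws from U[1, Rmach], the expected k-th smallest value is 1 + (Rmach - 1) * k / (m + 1). The sketch below computes the CV of these expected values over the m columns (mean(X) is a common factor and cancels out). It is a sanity check under the assumption that each column mean is close to its expectation, not a proof, and the function names are hypothetical.

```r
cv_unif <- function(R) sqrt(12) / 6 * (R - 1) / (R + 1)

# CV, over the m columns, of the expected k-th order statistic of U[1, Rmach].
cv_expected_col_means <- function(m, Rmach) {
    k <- seq_len(m)
    expected_order_stat <- 1 + (Rmach - 1) * k / (m + 1)
    sd(expected_order_stat) / mean(expected_order_stat)
}

cv_expected_col_means(1000, 10)
cv_unif(10)
```

The two values agree up to a factor sqrt(m / (m + 1)), which tends to 1 as m grows.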

CV_mean_col(generate_matrix_siegel_range(1000, 1000, 100, 10, 1, 1))
## [1] 0.4726
sqrt(12)/6 * (10 - 1)/(10 + 1)
## [1] 0.4724
CV_mean_col(generate_matrix_siegel_range(1000, 1000, 100, 5, 1, 1))
## [1] 0.3849
sqrt(12)/6 * (5 - 1)/(5 + 1)
## [1] 0.3849
CV_mean_col(generate_matrix_siegel_range(1000, 1000, 100, 100, 1, 1))
## [1] 0.5666
sqrt(12)/6 * (100 - 1)/(100 + 1)
## [1] 0.5659

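Good intuition, but proving it will be more difficult. Now, within a column, the Y factor is almost constant (it is close to the same quantile in every row), so the coefficient of variation of each column is CV(X).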

mean_CV_col(generate_matrix_siegel_range(1000, 1000, 10, 100, 1, 1))
## [1] 0.4788
sqrt(12)/6 * (10 - 1)/(10 + 1)
## [1] 0.4724
mean_CV_col(generate_matrix_siegel_range(1000, 1000, 5, 100, 1, 1))
## [1] 0.3961
sqrt(12)/6 * (5 - 1)/(5 + 1)
## [1] 0.3849
mean_CV_col(generate_matrix_siegel_range(1000, 1000, 100, 100, 1, 1))
## [1] 0.5578
sqrt(12)/6 * (100 - 1)/(100 + 1)
## [1] 0.5659

Nice, the conclusion is thus:

The question is how to extend those results when only a fraction a of the rows is sorted, and only over a fraction b of the columns. This seems to be tedious work without significant challenge.
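To make the question concrete, partial consistency could look like the sketch below. How make_consistent actually selects the rows and columns is not shown in this post, so taking the first floor(a * n) rows and the first floor(b * m) columns is an assumption, and make_consistent_sketch is a hypothetical name.

```r
# Sort a fraction a of the rows, restricted to a fraction b of the columns.
# Assumption: the leading rows and columns are used; the real make_consistent may differ.
make_consistent_sketch <- function(Z, a, b) {
    rows <- seq_len(floor(a * nrow(Z)))
    cols <- seq_len(floor(b * ncol(Z)))
    for (i in rows) {
        Z[i, cols] <- sort(Z[i, cols])   # other entries of row i are left untouched
    }
    Z
}

Z <- matrix(c(3, 1, 2,
              9, 8, 7), nrow = 2, byrow = TRUE)
make_consistent_sketch(Z, 1, 1)     # both rows fully sorted
make_consistent_sketch(Z, 0.5, 1)   # only the first row sorted
```

With a = 0 or b = 0 the matrix is returned unchanged, which corresponds to the inconsistent case.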

Shuffling method's heterogeneity properties

Now that it has been shown that there are two ways to measure heterogeneity, a new question is whether our proposed method controls both measures in the same way and whether it is possible to analyse it formally.

task <- rgamma_cost(300, 1, 0.1)
proc <- rgamma_cost(300, 1, 0.2)
mat <- generate_heterogeneous_matrix_shuffling_smallest_cost(task, proc)
CV_mean_row(mat)
## [1] 0.09917
mean_CV_col(mat)
## [1] 0.1874
CV_mean_col(mat)
## [1] 0.2073
mean_CV_row(mat)
## [1] 0.2694

Unfortunately, the two kinds of measures do not give the same values. I have no hope of analytically deriving the expected mean of the CVs for either the rows or the columns. The noise method could be superior in this respect. The only concrete observation is that the means of the CVs are larger than their counterparts (up to twice as large), but are not precisely controlled.
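In the example above, CV_mean_row and CV_mean_col are close to 0.1 and 0.2, the last arguments of the two rgamma_cost calls. This suggests that rgamma_cost(n, mean, cv) draws gamma-distributed costs with a given mean and CV. That reading is an assumption, and the sketch below (hypothetical name rgamma_cost_sketch) only shows how such a generator could be written with rgamma.

```r
# Gamma costs with a prescribed mean and CV (assumed semantics of rgamma_cost).
# For a gamma distribution, mean = shape * scale and CV = 1 / sqrt(shape).
rgamma_cost_sketch <- function(n, mean, cv) {
    rgamma(n, shape = 1 / cv^2, scale = mean * cv^2)
}

set.seed(1)
x <- rgamma_cost_sketch(1e5, 1, 0.2)
mean(x)           # close to 1
sd(x) / mean(x)   # close to 0.2
```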

Conclusion

It was shown that there are two ways to measure heterogeneity, even though they are equivalent in the uniform case.

We have analysed Siegel's methods. The consistent CV-based method actually allows the heterogeneity to be controlled with respect to both heterogeneity measures. However, this method is quite close to the uniform method, which is more intuitive. The semi-consistent method has not been considered but should not present any challenge.

With our method, on the other hand, we control the heterogeneity with respect to one measure only. This method removes most of the correlation, which the consistent CV-based method does not. Also, the shuffling method is slightly easier to use than the noise-based one, because the latter has one additional parameter.