Study Heterogeneity Quadrant (September 11, 2014)

We are interested in studying empirically relevant measures of task and machine heterogeneity to assess the heterogeneity quadrant concept and whether it is achieved.

First, we want to generate uniform cost matrices for which we have a clear and intuitive control over those heterogeneities.

Then, we want a set of measures to see which ones are the most relevant.

Then, we will test whether previous works that uses the original range-based method from Ali provides distinct values for those measures and whether the heterogeneity quadrant is achieved.

Uniform matrix

generate_uniform_matrix <- function(n, m, mean, cv_task, cv_mach) {
    rgamma_cost <- function(n, mean, sigma) {
        if (mean == 0 || sigma == 0) 
            return(rep(mean, n))
        shape <- mean^2/sigma^2
        rate <- mean/sigma^2
        return(rgamma(n, shape, rate))
    }

    task <- rgamma_cost(n, mean, cv_task * mean)
    machine <- rgamma_cost(m, mean, cv_mach * mean)
    return(create_matrix(task, machine))
}

(lolo <- generate_uniform_matrix(9, 7, 1, 0.1, 0.1))
##         [,1]   [,2]   [,3]   [,4]   [,5]   [,6]   [,7]
##  [1,] 1.0315 1.0507 1.0860 1.1143 1.0215 1.2134 1.1870
##  [2,] 0.8827 0.8992 0.9294 0.9536 0.8742 1.0384 1.0158
##  [3,] 0.8375 0.8531 0.8818 0.9047 0.8294 0.9852 0.9637
##  [4,] 0.9506 0.9684 1.0009 1.0270 0.9414 1.1183 1.0939
##  [5,] 0.7769 0.7914 0.8180 0.8393 0.7694 0.9139 0.8940
##  [6,] 0.9574 0.9753 1.0081 1.0343 0.9482 1.1263 1.1018
##  [7,] 0.9811 0.9994 1.0330 1.0599 0.9716 1.1541 1.1290
##  [8,] 0.9657 0.9837 1.0168 1.0433 0.9564 1.1361 1.1113
##  [9,] 0.8846 0.9011 0.9314 0.9556 0.8760 1.0406 1.0180
(lohi <- generate_uniform_matrix(9, 7, 1, 0.1, 1))
##         [,1]   [,2]   [,3]   [,4]  [,5]   [,6]   [,7]
##  [1,] 0.4681 0.1592 0.9164 0.2639 1.721 1.2905 0.4251
##  [2,] 0.4620 0.1571 0.9045 0.2605 1.698 1.2736 0.4195
##  [3,] 0.4310 0.1466 0.8438 0.2430 1.584 1.1882 0.3914
##  [4,] 0.3467 0.1179 0.6788 0.1955 1.274 0.9559 0.3149
##  [5,] 0.4497 0.1529 0.8803 0.2535 1.653 1.2396 0.4083
##  [6,] 0.4756 0.1617 0.9312 0.2682 1.748 1.3112 0.4319
##  [7,] 0.3987 0.1356 0.7805 0.2248 1.465 1.0991 0.3620
##  [8,] 0.4075 0.1386 0.7978 0.2298 1.498 1.1235 0.3701
##  [9,] 0.4178 0.1421 0.8178 0.2355 1.536 1.1516 0.3794
(hihi <- generate_uniform_matrix(9, 7, 1, 1, 1))
##          [,1]    [,2]    [,3]    [,4]    [,5]   [,6]    [,7]
##  [1,] 0.06333 0.09308 0.10922 0.09739 0.02601 0.5745  0.7433
##  [2,] 0.26613 0.39116 0.45897 0.40927 0.10929 2.4142  3.1233
##  [3,] 0.03606 0.05300 0.06219 0.05545 0.01481 0.3271  0.4232
##  [4,] 1.03162 1.51630 1.77919 1.58651 0.42364 9.3586 12.1075
##  [5,] 0.22503 0.33075 0.38810 0.34607 0.09241 2.0414  2.6410
##  [6,] 0.04409 0.06480 0.07604 0.06780 0.01811 0.4000  0.5174
##  [7,] 0.02624 0.03857 0.04526 0.04035 0.01078 0.2380  0.3080
##  [8,] 0.55541 0.81635 0.95789 0.85415 0.22808 5.0385  6.5185
##  [9,] 0.78949 1.16041 1.36160 1.21414 0.32421 7.1621  9.2658

OK, now we have a way to generate matrices that correspond exactly to the intuition behind the heterogeneity quadrant.

Heterogeneity measures

For each row, we compute the mean, the standard deviation or the CV. We then compute the CV of those means or the mean of those CVs or the standard deviation of those means or the mean of those standard deviations:

CV_mean_row <- function(cost) {
    means <- sapply(1:nrow(cost), function(x) mean(cost[x, ]))
    return(sqrt(var(means))/mean(means))
}

CV_mean_col <- function(cost) {
    means <- sapply(1:ncol(cost), function(x) mean(cost[, x]))
    return(sqrt(var(means))/mean(means))
}

mean_CV_row <- function(cost) {
    CV <- sapply(1:nrow(cost), function(x) sqrt(var(cost[x, ]))/mean(cost[x, 
        ]))
    return(mean(CV))
}

mean_CV_col <- function(cost) {
    CV <- sapply(1:ncol(cost), function(x) sqrt(var(cost[, x]))/mean(cost[, 
        x]))
    return(mean(CV))
}

sd_mean_row <- function(cost) {
    means <- sapply(1:nrow(cost), function(x) mean(cost[x, ]))
    return(sqrt(var(means)))
}

sd_mean_col <- function(cost) {
    means <- sapply(1:ncol(cost), function(x) mean(cost[, x]))
    return(sqrt(var(means)))
}

mean_sd_row <- function(cost) {
    sd <- sapply(1:nrow(cost), function(x) sqrt(var(cost[x, ])))
    return(mean(sd))
}

mean_sd_col <- function(cost) {
    sd <- sapply(1:ncol(cost), function(x) sqrt(var(cost[, x])))
    return(mean(sd))
}

R <- generate_uniform_matrix(1000, 100, 1, 0.1, 1)
CV_mean_row(R)
## [1] 0.1005
CV_mean_col(R)
## [1] 0.9093
mean_CV_row(R)
## [1] 0.9093
mean_CV_col(R)
## [1] 0.1005
sd_mean_row(R)
## [1] 0.09945
sd_mean_col(R)
## [1] 0.8994
mean_sd_row(R)
## [1] 0.8994
mean_sd_col(R)
## [1] 0.09945

It appears that the CV of the means is the same as the mean of the CVs. Also, the standard deviation of means is the same as the mean of standard deviations.

Let's keep the CV of the means as our final candidate measures (the standard deviation of the means seems equivalent but absolute instead of relative).

CV_mean_row(lolo)
## [1] 0.08649
CV_mean_col(lolo)
## [1] 0.0685
CV_mean_row(lohi)
## [1] 0.09584
CV_mean_col(lohi)
## [1] 0.7769
CV_mean_row(hihi)
## [1] 1.099
CV_mean_col(hihi)
## [1] 1.185

This is consistent with our base matrices.

MPH <- function(ETC) {
    ECS <- 1/ETC
    MP <- sapply(1:ncol(ECS), function(j) sum(ECS[, j]))
    MP <- sort(MP)
    result <- 0
    for (j in 1:(ncol(ECS) - 1)) result <- result + MP[j]/MP[j + 1]
    result <- result/(ncol(ECS) - 1)
    return(result)
}

TDH <- function(ETC) {
    ECS <- 1/ETC
    TD <- sapply(1:nrow(ECS), function(i) sum(ECS[i, ]))
    TD <- sort(TD)
    result <- 0
    for (i in 1:(nrow(ECS) - 1)) result <- result + TD[i]/TD[i + 1]
    result <- result/(nrow(ECS) - 1)
    return(result)
}

TDH(lolo)
## [1] 0.9656
MPH(lolo)
## [1] 0.9719
TDH(lohi)
## [1] 0.962
MPH(lohi)
## [1] 0.6838
TDH(hihi)
## [1] 0.6646
MPH(hihi)
## [1] 0.6503

Those measures are able to characterize the heterogeneity on small matrices.

TDH(R)
## [1] 0.9993
MPH(R)
## [1] 0.9445

However, for a large matrix generated with the same parameters those measures are useless. Indeed, asymptotically, the ratios are all close to one.

Original range-based method

Let's see if the way the original range-based method was used allowed to reach several distinct settings in terms of heterogeneities.

heterogeneity_quadrant_measures <- function(Rtask, Rmach) {
    rdistr <- function(n) {
        return(runif(n, 1, Rtask))
    }
    rdistc <- function(n) {
        return(runif(n, 1, Rmach))
    }

    Z <- generate_matrix_siegel(1000, 100, rdistr, rdistc)
    return(list(CVmtask = CV_mean_row(Z), CVmmach = CV_mean_col(Z), MPH = MPH(Z), 
        TDH = TDH(Z)))
}

parameters <- rbind(c(10, 10), c(10, 100), c(100, 10), c(100, 1000), c(3000, 
    10), c(3000, 100), c(3000, 1000), c(10000, 10), c(10000, 100))
H <- sapply(1:nrow(parameters), function(x) heterogeneity_quadrant_measures(parameters[x, 
    1], parameters[x, 2]))
summary(sapply(data.frame(t(H)), unlist))
##     CVmtask         CVmmach            MPH             TDH       
##  Min.   :0.478   Min.   :0.0165   Min.   :0.980   Min.   :0.993  
##  1st Qu.:0.552   1st Qu.:0.0171   1st Qu.:0.987   1st Qu.:0.993  
##  Median :0.568   Median :0.0189   Median :0.994   Median :0.995  
##  Mean   :0.554   Mean   :0.0194   Mean   :0.992   Mean   :0.995  
##  3rd Qu.:0.581   3rd Qu.:0.0216   3rd Qu.:0.996   3rd Qu.:0.996  
##  Max.   :0.585   Max.   :0.0227   Max.   :0.998   Max.   :0.998

We see that the task and machine heterogeneities have always almost the same values. Let's consider the same test but with consistent cost matrices, which could actually impact those measures.

make_consistent <- function(mat, a = 1, b = 1) {
    row_max <- round(a * nrow(mat))
    col_max <- round(b * ncol(mat))
    if (row_max == 0 || col_max == 0) 
        return(mat)
    for (i in 1:row_max) mat[i, 1:col_max] <- sort(mat[i, 1:col_max])
    mat
}

generate_matrix_siegel_consistent <- function(R, C, rdistr, rdistc) {
    Z <- generate_matrix_siegel(R, C, rdistr, rdistc)
    make_consistent(Z)
}

heterogeneity_quadrant_measures <- function(Rtask, Rmach) {
    rdistr <- function(n) {
        return(runif(n, 1, Rtask))
    }
    rdistc <- function(n) {
        return(runif(n, 1, Rmach))
    }

    Z <- generate_matrix_siegel_consistent(1000, 100, rdistr, rdistc)
    return(list(CVmtask = CV_mean_row(Z), CVmmach = CV_mean_col(Z), MPH = MPH(Z), 
        TDH = TDH(Z)))
}

parameters <- rbind(c(10, 10), c(10, 100), c(100, 10), c(100, 1000), c(3000, 
    10), c(3000, 100), c(3000, 1000), c(10000, 10), c(10000, 100))
H <- sapply(1:nrow(parameters), function(x) heterogeneity_quadrant_measures(parameters[x, 
    1], parameters[x, 2]))
summary(sapply(data.frame(t(H)), unlist))
##     CVmtask         CVmmach           MPH             TDH       
##  Min.   :0.461   Min.   :0.470   Min.   :0.952   Min.   :0.993  
##  1st Qu.:0.550   1st Qu.:0.471   1st Qu.:0.961   1st Qu.:0.993  
##  Median :0.573   Median :0.564   Median :0.962   Median :0.994  
##  Mean   :0.554   Mean   :0.525   Mean   :0.967   Mean   :0.995  
##  3rd Qu.:0.589   3rd Qu.:0.565   3rd Qu.:0.978   3rd Qu.:0.996  
##  Max.   :0.601   Max.   :0.577   Max.   :0.978   Max.   :0.997

This changes significantly the machine heterogeneity. This shows that the way the heterogeneity quadrant is used is deeply flawed. What is expected is that the range paramaters have an impact on the heterogeneity whereas it is the consistency that impact it. Overall, we may also conclude that the tested settings in terms of heterogeneity in the literature lacks diversity.

Conclusion

This study was a preliminary test to check the hypothesis that the quadrant heterogeneity was a fraud. So far, it is the case. More complete test must be done. Additionally, the CVB method may show different outcomes. At the very least, this shows that previous studies do not control what they claim to control.