Let's test two methods for generating matrices for which we control the heterogeneity properties. Both methods start from a uniform matrix. The problem with uniform matrices is that they have specific properties in terms of correlation between rows and columns (1). The idea behind both methods is to add some randomness to a uniform matrix: the first one multiplies each cost by a random variable, while the second shuffles the values.
generate_heterogeneous_matrix_noise <- function(task, proc, CV) {
    # unit-mean multiplicative noise with the requested CV
    mat <- matrix(rgamma_cost(length(task) * length(proc), 1, CV),
                  nrow = length(task), ncol = length(proc))
    return(create_matrix(task, proc) * mat)
}
task <- rgamma_cost(1000, 1, 0.1)
proc <- rgamma_cost(1000, 1, 0.2)
mat <- generate_heterogeneous_matrix_noise(task, proc, 0.3)
sqrt(var(as.vector(mat)))
## [1] 0.3866
cor.test(mat[, 1], mat[, 2])$estimate
## cor
## 0.1363
CV_mean_row(mat)
## [1] 0.1013
CV_mean_col(mat)
## [1] 0.21
As this method relies on products of random variables, it can be analysed quite simply. First, the heterogeneities are respected. With high noise, however, the heterogeneities may be larger than expected due to a few extreme values in the samples. In any case, they should converge towards the correct values as the size of the matrix grows to infinity.
Keeping the noise low, on the other hand, would maintain the correlation properties of the matrix. The expected correlation should be easy to derive from the CV of the noise, task costs and machine speeds.
As each value is the product of three distributions, it is also easy to derive the distribution of the costs and their variance (see http://stats.stackexchange.com/questions/52646/variance-of-product-of-multiple-random-variables/52699#52699).
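As a back-of-the-envelope check (assuming, as in the generation above, that the task costs, machine speeds and noise are independent with unit means), the CV of the product of three such variables and the expected correlation between two columns (which share the same task costs, the machine speeds cancelling out) can be computed and compared with the measurements above:
CV_task <- 0.1
CV_proc <- 0.2
CV_noise <- 0.3
# CV (and standard deviation, since the mean is 1) of the product of three
# independent unit-mean random variables
sqrt((1 + CV_task^2) * (1 + CV_proc^2) * (1 + CV_noise^2) - 1)
# expected correlation between two columns, which share the same task costs
CV_task^2/((1 + CV_task^2) * (1 + CV_noise^2) - 1)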
This method can generate matrices for which the heterogeneities are controlled. In addition, the correlation is controlled through the amount of noise. However, for low correlations, the required noise is too large to preserve the heterogeneity properties.
The second method is more combinatorial. It can be shown that if we add the same quantity to two costs (i1,j1) and (i2,j2) and remove it from the costs (i1,j2) and (i2,j1), then the heterogeneity properties remain the same (a quick numerical check of this invariant is given after the three definitions below). We propose several alternatives relying on this principle, which randomly exchange costs on the same row or on the same column:
generate_heterogeneous_matrix_shuffling_smallest_change <- function(task, proc) {
    mat <- create_matrix(task, proc)
    for (i in 1:length(task)) for (j in 1:length(proc)) {
        # pick another row and another column at random
        otheri <- sample((1:(length(task) - 1) + i - 1)%%length(task) + 1, 1)
        otherj <- sample((1:(length(proc) - 1) + j - 1)%%length(proc) + 1, 1)
        # pairwise differences between the four costs of the 2x2 submatrix
        diff12 <- mat[i, j] - mat[i, otherj]
        diff13 <- mat[i, j] - mat[otheri, j]
        diff24 <- mat[i, otherj] - mat[otheri, otherj]
        diff34 <- mat[otheri, j] - mat[otheri, otherj]
        # swap the pair of costs that are closest to each other (smallest change)
        mindiff <- min(abs(c(diff12, diff13, diff24, diff34)))
        if (abs(diff12) == mindiff) {
            diff <- -diff12
        } else if (abs(diff13) == mindiff) {
            diff <- -diff13
        } else if (abs(diff24) == mindiff) {
            diff <- diff24
        } else {
            diff <- diff34
        }
        # clamp the exchanged quantity so that no cost becomes negative
        diff <- max(diff, -mat[i, j], -mat[otheri, otherj])
        diff <- min(diff, mat[i, otherj], mat[otheri, j])
        mat[i, j] <- mat[i, j] + diff
        mat[i, otherj] <- mat[i, otherj] - diff
        mat[otheri, j] <- mat[otheri, j] - diff
        mat[otheri, otherj] <- mat[otheri, otherj] + diff
    }
    return(mat)
}
generate_heterogeneous_matrix_shuffling_smallest_cost <- function(task, proc) {
    mat <- create_matrix(task, proc)
    for (i in 1:length(task)) for (j in 1:length(proc)) {
        # pick another row and another column at random
        otheri <- sample((1:(length(task) - 1) + i - 1)%%length(task) + 1, 1)
        otherj <- sample((1:(length(proc) - 1) + j - 1)%%length(proc) + 1, 1)
        diff12 <- mat[i, j] - mat[i, otherj]
        diff13 <- mat[i, j] - mat[otheri, j]
        diff24 <- mat[i, otherj] - mat[otheri, otherj]
        diff34 <- mat[otheri, j] - mat[otheri, otherj]
        # the smallest of the four costs takes the value of the smaller of its
        # two neighbours, which ensures that no cost becomes negative
        mincost <- min(mat[i, j], mat[i, otherj], mat[otheri, j], mat[otheri, otherj])
        if (mat[i, j] == mincost) {
            diff <- min(-diff12, -diff13)
        } else if (mat[i, otherj] == mincost) {
            diff <- max(-diff12, diff24)
        } else if (mat[otheri, j] == mincost) {
            diff <- max(-diff13, diff34)
        } else {
            diff <- min(diff24, diff34)
        }
        mat[i, j] <- mat[i, j] + diff
        mat[i, otherj] <- mat[i, otherj] - diff
        mat[otheri, j] <- mat[otheri, j] - diff
        mat[otheri, otherj] <- mat[otheri, otherj] + diff
    }
    return(mat)
}
generate_heterogeneous_matrix_shuffling_smallest_variance <- function(task, proc) {
    mat <- create_matrix(task, proc)
    for (i in 1:length(task)) for (j in 1:length(proc)) {
        # pick another row and another column at random
        otheri <- sample((1:(length(task) - 1) + i - 1)%%length(task) + 1, 1)
        otherj <- sample((1:(length(proc) - 1) + j - 1)%%length(proc) + 1, 1)
        # equalize the two diagonal sums of the 2x2 submatrix
        diff <- (mat[i, otherj] + mat[otheri, j] - mat[i, j] - mat[otheri, otherj])/4
        mat[i, j] <- mat[i, j] + diff
        mat[i, otherj] <- mat[i, otherj] - diff
        mat[otheri, j] <- mat[otheri, j] - diff
        mat[otheri, otherj] <- mat[otheri, otherj] + diff
    }
    return(mat)
}
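As a quick illustration of the invariant mentioned above (this is only a sanity check on a small random matrix, not part of the generation methods), exchanging a quantity between four costs leaves every row and column mean, and therefore CV_mean_row and CV_mean_col, unchanged:
m <- create_matrix(rgamma_cost(5, 1, 0.1), rgamma_cost(4, 1, 0.2))
m2 <- m
delta <- 0.1  # arbitrary exchanged quantity
m2[1, 1] <- m2[1, 1] + delta
m2[2, 2] <- m2[2, 2] + delta
m2[1, 2] <- m2[1, 2] - delta
m2[2, 1] <- m2[2, 1] - delta
# row and column means (and thus the heterogeneity measures) are unchanged
all.equal(rowMeans(m), rowMeans(m2))
all.equal(colMeans(m), colMeans(m2))
c(CV_mean_row(m), CV_mean_row(m2), CV_mean_col(m), CV_mean_col(m2))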
Let's generate a uniform matrix and get its properties.
task <- rgamma_cost(100, 1, 0.1)
proc <- rgamma_cost(100, 1, 0.2)
mat_orig <- create_matrix(task, proc)
sqrt(var(as.vector(mat_orig)))
## [1] 0.2328
cor.test(mat_orig[, 1], mat_orig[, 2])$estimate
## cor
## 1
CV_mean_row(mat_orig)
## [1] 0.0972
CV_mean_col(mat_orig)
## [1] 0.2061
Let's study the first shuffling variant.
mat <- generate_heterogeneous_matrix_shuffling_smallest_change(task, proc)
sqrt(var(as.vector(mat)))
## [1] 0.5243
cor.test(mat[, 1], mat[, 2])$estimate
## cor
## -0.1
CV_mean_row(mat)
## [1] 0.0972
CV_mean_col(mat)
## [1] 0.2061
plot.ecdf(mat_orig)
plot.ecdf(mat, add = TRUE)
length(which(mat == 0))/length(which(mat >= 0))
## [1] 0.0436
As the shuffling is done without changing the mean of any row or column, the heterogeneity properties of the resulting matrix are exactly the same as in the base matrix. Also, as the costs are all shuffled, the expected correlations are all zero. The variance, however, is significantly higher. Moreover, several costs are zero, which may not be desirable and constitutes a departure from the original distribution.
mat <- generate_heterogeneous_matrix_shuffling_smallest_cost(task, proc)
sqrt(var(as.vector(mat)))
## [1] 0.3098
cor.test(mat[, 1], mat[, 2])$estimate
## cor
## 0.1238
CV_mean_row(mat)
## [1] 0.0972
CV_mean_col(mat)
## [1] 0.2061
plot.ecdf(mat_orig)
plot.ecdf(mat, add = TRUE)
This variant yields a variance closer to that of the original matrix. However, the correlation is no longer as close to 0. There may be a tradeoff between the cost variance and the correlation.
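To get a feeling for this possible tradeoff, one could repeat both shuffling variants a few times and record the resulting standard deviation and column correlation. This is only an exploratory sketch (the helper summarize_matrix is introduced here for convenience, and the small number of repetitions keeps it cheap):
summarize_matrix <- function(mat) {
    # standard deviation of the costs and correlation between two columns
    c(sd = sd(as.vector(mat)), cor = cor(mat[, 1], mat[, 2]))
}
t(replicate(5, summarize_matrix(generate_heterogeneous_matrix_shuffling_smallest_change(task, proc))))
t(replicate(5, summarize_matrix(generate_heterogeneous_matrix_shuffling_smallest_cost(task, proc))))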
mat <- generate_heterogeneous_matrix_shuffling_smallest_variance(task, proc)
sqrt(var(as.vector(mat)))
## [1] 0.2322
cor.test(mat[, 1], mat[, 2])$estimate
## cor
## 0.9882
CV_mean_row(mat)
## [1] 0.0972
CV_mean_col(mat)
## [1] 0.2061
plot.ecdf(mat_orig)
plot.ecdf(mat, add = TRUE)
length(which(mat < 0))/length(which(mat >= 0))
## [1] 0
This last variant merely illustrates that the cost variance may be slightly decreased while still respecting the heterogeneity properties (the correlation also decreases a little): each step replaces the two diagonal sums of the 2x2 submatrix by their common average, which can never increase the variance of the costs.
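This can be checked on any four costs (an arbitrary example, unrelated to the matrices above): applying the adjustment keeps the mean and never increases the variance.
x <- c(5, 1, 2, 4)  # arbitrary costs (i,j), (i,otherj), (otheri,j), (otheri,otherj)
d <- (x[2] + x[3] - x[1] - x[4])/4
# the adjusted costs have the same mean and a variance that is never larger
var(c(x[1] + d, x[2] - d, x[3] - d, x[4] + d)) <= var(x)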
A last method is studied: each cost is considered separately and transformed into another one. This impacts the matrix as follows: the other costs of the same row and column absorb a part of the change, and the rest of the matrix is then adjusted to compensate. This may be seen as a generalization of the previous principle, which considers four costs at each step.
generate_heterogeneous_matrix_shuffling_distribution <- function(task, proc) {
    mat <- create_matrix(task, proc)
    # target values: a random permutation of the original costs
    costs <- sample(as.vector(mat))
    for (i in 1:length(task)) for (j in 1:length(proc)) {
        new_cost <- head(costs, n = 1)
        costs <- tail(costs, n = -1)
        diff <- new_cost - mat[i, j]
        # spread the change so that all row and column means stay unchanged
        new_mat <- mat + diff/(length(task) - 1)/(length(proc) - 1)
        new_mat[i, ] <- mat[i, ] - diff/(length(proc) - 1)
        new_mat[, j] <- mat[, j] - diff/(length(task) - 1)
        new_mat[i, j] <- mat[i, j] + diff
        mat <- new_mat
    }
    return(mat)
}
mat <- generate_heterogeneous_matrix_shuffling_distribution(task, proc)
sqrt(var(as.vector(mat)))
## [1] 0.3483
cor.test(mat[, 1], mat[, 2])$estimate
## cor
## 0.2915
CV_mean_row(mat)
## [1] 0.0972
CV_mean_col(mat)
## [1] 0.2061
plot.ecdf(mat_orig)
plot.ecdf(mat, add = TRUE)
length(which(mat < 0))/length(which(mat >= 0))
## [1] 0.001803
Again, there is a compromise between the variance and the correlation. The objective here was to obtain a distribution closer to the original one, and in this respect this method is better than the previous exchange-based ones (particularly the second one).
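One way to quantify this closeness beyond the ECDF plots is the Kolmogorov-Smirnov statistic between the pooled costs of the generated and original matrices. Since the costs within a matrix are not independent, this should be read as a mere distance between the two empirical distributions rather than as a formal test (R may also warn about ties):
# maximum distance between the two empirical distributions
ks.test(as.vector(mat), as.vector(mat_orig))$statistic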
We now have two classes of methods, in addition to the uniform one, for generating matrices with specific heterogeneity properties. None is perfect: in particular, it is hard to control both the variance of the matrix and its correlation. The advantage of the second class of methods (based on shuffling) is that the correlation may be ignored when describing them.