Let's test two methods for generating matrices for which we control the heterogeneity properties. Both methods start from a uniform matrix. The problem with uniform matrices is that they have specific properties in terms of correlation between rows and columns (1). The idea behind both methods is to add some randomness to a uniform matrix: the first one multiplies each cost by a random variable, while the second shuffles the values.
generate_heterogeneous_matrix_noise <- function(task, proc, CV) {
    # unit-mean multiplicative noise with the requested CV
    mat <- matrix(rgamma_cost(length(task) * length(proc), 1, CV),
                  nrow = length(task), ncol = length(proc))
    return(create_matrix(task, proc) * mat)
}
task <- rgamma_cost(1000, 1, 0.1)
proc <- rgamma_cost(1000, 1, 0.2)
mat <- generate_heterogeneous_matrix_noise(task, proc, 0.3)
sqrt(var(as.vector(mat)))
## [1] 0.3866
cor.test(mat[, 1], mat[, 2])$estimate
## cor
## 0.1363
CV_mean_row(mat)
## [1] 0.1013
CV_mean_col(mat)
## [1] 0.21
As this method relies on products of random variables, it can be analysed quite simply. First, the heterogeneities are respected. With high noise, however, the heterogeneities may be larger than expected due to a few extreme values in the samples. In any case, they should converge towards the correct values as the size of the matrix grows to infinity.
Keeping the noise low, on the other hand, would maintain the correlation properties of the matrix. The expected correlation should be easy to derive from the CV of the noise, task costs and machine speeds.
As each value is the product of three distributions, it is also easy to derive the distribution of the costs and their variance (see http://stats.stackexchange.com/questions/52646/variance-of-product-of-multiple-random-variables/52699#52699).
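As a back-of-the-envelope check (assuming, as in the generation above, that the task costs, machine speeds and noise are independent with unit means), the CV of the product of three such variables and the expected correlation between two columns (which share the same task costs, the machine speeds cancelling out) can be computed and compared with the measurements above:
CV_task <- 0.1
CV_proc <- 0.2
CV_noise <- 0.3
# CV (and standard deviation, since the mean is 1) of the product of three
# independent unit-mean random variables
sqrt((1 + CV_task^2) * (1 + CV_proc^2) * (1 + CV_noise^2) - 1)
# expected correlation between two columns, which share the same task costs
CV_task^2/((1 + CV_task^2) * (1 + CV_noise^2) - 1)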
This method can generate matrices for which the heterogeneities are controlled. In addition, the correlation is controlled through the amount of noise. However, for low correlations, the required noise is too large to preserve the heterogeneity properties.
The second method is more combinatorial. It can be shown that if we add the same quantity to two costs (i1,j1) and (i2,j2) and remove it from the costs (i1,j2) and (i2,j1), then the heterogeneity properties remain the same (a quick numerical check of this invariant is given after the three definitions below). We propose several alternatives relying on this principle, which randomly exchange costs on the same row or on the same column:
generate_heterogeneous_matrix_shuffling_smallest_change <- function(task, proc) {
    mat <- create_matrix(task, proc)
    for (i in 1:length(task)) for (j in 1:length(proc)) {
        # pick another row and another column at random
        otheri <- sample((1:(length(task) - 1) + i - 1)%%length(task) + 1, 1)
        otherj <- sample((1:(length(proc) - 1) + j - 1)%%length(proc) + 1, 1)
        # pairwise differences between the four costs of the 2x2 submatrix
        diff12 <- mat[i, j] - mat[i, otherj]
        diff13 <- mat[i, j] - mat[otheri, j]
        diff24 <- mat[i, otherj] - mat[otheri, otherj]
        diff34 <- mat[otheri, j] - mat[otheri, otherj]
        # swap the pair of costs that are closest to each other (smallest change)
        mindiff <- min(abs(c(diff12, diff13, diff24, diff34)))
        if (abs(diff12) == mindiff) {
            diff <- -diff12
        } else if (abs(diff13) == mindiff) {
            diff <- -diff13
        } else if (abs(diff24) == mindiff) {
            diff <- diff24
        } else {
            diff <- diff34
        }
        # clamp the exchanged quantity so that no cost becomes negative
        diff <- max(diff, -mat[i, j], -mat[otheri, otherj])
        diff <- min(diff, mat[i, otherj], mat[otheri, j])
        mat[i, j] <- mat[i, j] + diff
        mat[i, otherj] <- mat[i, otherj] - diff
        mat[otheri, j] <- mat[otheri, j] - diff
        mat[otheri, otherj] <- mat[otheri, otherj] + diff
    }
    return(mat)
}
generate_heterogeneous_matrix_shuffling_smallest_cost <- function(task, proc) {
    mat <- create_matrix(task, proc)
    for (i in 1:length(task)) for (j in 1:length(proc)) {
        # pick another row and another column at random
        otheri <- sample((1:(length(task) - 1) + i - 1)%%length(task) + 1, 1)
        otherj <- sample((1:(length(proc) - 1) + j - 1)%%length(proc) + 1, 1)
        diff12 <- mat[i, j] - mat[i, otherj]
        diff13 <- mat[i, j] - mat[otheri, j]
        diff24 <- mat[i, otherj] - mat[otheri, otherj]
        diff34 <- mat[otheri, j] - mat[otheri, otherj]
        # the smallest of the four costs takes the value of the smaller of its
        # two neighbours, which ensures that no cost becomes negative
        mincost <- min(mat[i, j], mat[i, otherj], mat[otheri, j], mat[otheri, otherj])
        if (mat[i, j] == mincost) {
            diff <- min(-diff12, -diff13)
        } else if (mat[i, otherj] == mincost) {
            diff <- max(-diff12, diff24)
        } else if (mat[otheri, j] == mincost) {
            diff <- max(-diff13, diff34)
        } else {
            diff <- min(diff24, diff34)
        }
        mat[i, j] <- mat[i, j] + diff
        mat[i, otherj] <- mat[i, otherj] - diff
        mat[otheri, j] <- mat[otheri, j] - diff
        mat[otheri, otherj] <- mat[otheri, otherj] + diff
    }
    return(mat)
}
generate_heterogeneous_matrix_shuffling_smallest_variance <- function(task, proc) {
    mat <- create_matrix(task, proc)
    for (i in 1:length(task)) for (j in 1:length(proc)) {
        # pick another row and another column at random
        otheri <- sample((1:(length(task) - 1) + i - 1)%%length(task) + 1, 1)
        otherj <- sample((1:(length(proc) - 1) + j - 1)%%length(proc) + 1, 1)
        # equalize the two diagonal sums of the 2x2 submatrix
        diff <- (mat[i, otherj] + mat[otheri, j] - mat[i, j] - mat[otheri, otherj])/4
        mat[i, j] <- mat[i, j] + diff
        mat[i, otherj] <- mat[i, otherj] - diff
        mat[otheri, j] <- mat[otheri, j] - diff
        mat[otheri, otherj] <- mat[otheri, otherj] + diff
    }
    return(mat)
}
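As a quick illustration of the invariant mentioned above (this is only a sanity check on a small random matrix, not part of the generation methods), exchanging a quantity between four costs leaves every row and column mean, and therefore CV_mean_row and CV_mean_col, unchanged:
m <- create_matrix(rgamma_cost(5, 1, 0.1), rgamma_cost(4, 1, 0.2))
m2 <- m
delta <- 0.1  # arbitrary exchanged quantity
m2[1, 1] <- m2[1, 1] + delta
m2[2, 2] <- m2[2, 2] + delta
m2[1, 2] <- m2[1, 2] - delta
m2[2, 1] <- m2[2, 1] - delta
# row and column means (and thus the heterogeneity measures) are unchanged
all.equal(rowMeans(m), rowMeans(m2))
all.equal(colMeans(m), colMeans(m2))
c(CV_mean_row(m), CV_mean_row(m2), CV_mean_col(m), CV_mean_col(m2))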
Let's generate a uniform matrix and get its properties.
task <- rgamma_cost(100, 1, 0.1)
proc <- rgamma_cost(100, 1, 0.2)
mat_orig <- create_matrix(task, proc)
sqrt(var(as.vector(mat_orig)))
## [1] 0.2328
cor.test(mat_orig[, 1], mat_orig[, 2])$estimate
## cor
## 1
CV_mean_row(mat_orig)
## [1] 0.0972
CV_mean_col(mat_orig)
## [1] 0.2061
Let's study the first shuffling variant.
mat <- generate_heterogeneous_matrix_shuffling_smallest_change(task, proc)
sqrt(var(as.vector(mat)))
## [1] 0.5243
cor.test(mat[, 1], mat[, 2])$estimate
## cor
## -0.1
CV_mean_row(mat)
## [1] 0.0972
CV_mean_col(mat)
## [1] 0.2061
plot.ecdf(mat_orig)
plot.ecdf(mat, add = TRUE)
length(which(mat == 0))/length(which(mat >= 0))
## [1] 0.0436
As the shuffling is done without changing the mean of any row or column, the heterogeneity properties of the resulting matrix are exactly the same as in the base matrix. Also, as the costs are all shuffled, the expected correlations are all zero. The variance, however, is significantly higher. Moreover, several costs are zero, which may not be desirable and constitutes a departure from the original distribution.
mat <- generate_heterogeneous_matrix_shuffling_smallest_cost(task, proc)
sqrt(var(as.vector(mat)))
## [1] 0.3098
cor.test(mat[, 1], mat[, 2])$estimate
## cor
## 0.1238
CV_mean_row(mat)
## [1] 0.0972
CV_mean_col(mat)
## [1] 0.2061
plot.ecdf(mat_orig)
plot.ecdf(mat, add = TRUE)
This variant yields a variance closer to that of the original matrix. However, the correlation is no longer as close to 0. There may be a tradeoff between the cost variance and the correlation.
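To get a feeling for this possible tradeoff, one could repeat both shuffling variants a few times and record the resulting standard deviation and column correlation. This is only an exploratory sketch (the helper summarize_matrix is introduced here for convenience, and the small number of repetitions keeps it cheap):
summarize_matrix <- function(mat) {
    # standard deviation of the costs and correlation between two columns
    c(sd = sd(as.vector(mat)), cor = cor(mat[, 1], mat[, 2]))
}
t(replicate(5, summarize_matrix(generate_heterogeneous_matrix_shuffling_smallest_change(task, proc))))
t(replicate(5, summarize_matrix(generate_heterogeneous_matrix_shuffling_smallest_cost(task, proc))))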
mat <- generate_heterogeneous_matrix_shuffling_smallest_variance(task, proc)
sqrt(var(as.vector(mat)))
## [1] 0.2322
cor.test(mat[, 1], mat[, 2])$estimate
## cor
## 0.9882
CV_mean_row(mat)
## [1] 0.0972
CV_mean_col(mat)
## [1] 0.2061
plot.ecdf(mat_orig)
plot.ecdf(mat, add = TRUE)
length(which(mat < 0))/length(which(mat >= 0))
## [1] 0
This last variant merely illustrates that the cost variance may be slightly decreased while still respecting the heterogeneity properties (the correlation also decreases a little): each step replaces the two diagonal sums of the 2x2 submatrix by their common average, which can never increase the variance of the costs.
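This can be checked on any four costs (an arbitrary example, unrelated to the matrices above): applying the adjustment keeps the mean and never increases the variance.
x <- c(5, 1, 2, 4)  # arbitrary costs (i,j), (i,otherj), (otheri,j), (otheri,otherj)
d <- (x[2] + x[3] - x[1] - x[4])/4
# the adjusted costs have the same mean and a variance that is never larger
var(c(x[1] + d, x[2] - d, x[3] - d, x[4] + d)) <= var(x)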
A last method is studied: each cost is considered separately and transformed into another one. This impacts the matrix as follows: the other costs of the same row and column absorb a part of the change, and the rest of the matrix is then adjusted to compensate. This may be seen as a generalization of the previous principle, which considers four costs at each step.
generate_heterogeneous_matrix_shuffling_distribution <- function(task, proc) {
    mat <- create_matrix(task, proc)
    # target values: a random permutation of the original costs
    costs <- sample(as.vector(mat))
    for (i in 1:length(task)) for (j in 1:length(proc)) {
        new_cost <- head(costs, n = 1)
        costs <- tail(costs, n = -1)
        diff <- new_cost - mat[i, j]
        # spread the change so that all row and column means stay unchanged
        new_mat <- mat + diff/(length(task) - 1)/(length(proc) - 1)
        new_mat[i, ] <- mat[i, ] - diff/(length(proc) - 1)
        new_mat[, j] <- mat[, j] - diff/(length(task) - 1)
        new_mat[i, j] <- mat[i, j] + diff
        mat <- new_mat
    }
    return(mat)
}
mat <- generate_heterogeneous_matrix_shuffling_distribution(task, proc)
sqrt(var(as.vector(mat)))
## [1] 0.3483
cor.test(mat[, 1], mat[, 2])$estimate
## cor
## 0.2915
CV_mean_row(mat)
## [1] 0.0972
CV_mean_col(mat)
## [1] 0.2061
plot.ecdf(mat_orig)
plot.ecdf(mat, add = TRUE)
length(which(mat < 0))/length(which(mat >= 0))
## [1] 0.001803
Again, there is a compromise between the variance and the correlation. The objective here was to obtain a distribution closer to the original one, and in this respect this method is better than the previous exchange-based ones (particularly the second one).
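One way to quantify this closeness beyond the ECDF plots is the Kolmogorov-Smirnov statistic between the pooled costs of the generated and original matrices. Since the costs within a matrix are not independent, this should be read as a mere distance between the two empirical distributions rather than as a formal test (R may also warn about ties):
# maximum distance between the two empirical distributions
ks.test(as.vector(mat), as.vector(mat_orig))$statistic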
We now have two classes of methods, in addition to the uniform one, for generating matrices with specific heterogeneity properties. None is perfect: in particular, it is hard to control both the variance of the matrix and its correlation. The advantage of the second class of methods (based on shuffling) is that the correlation may be ignored when describing them.