This study tests iterative methods from the literature that generate contingency tables with fixed row and column sums (a problem very close to ours). As such, it is a follow-up to the previous study from December 21, 2015.
A second objective is to assess Julia's reputed speed.
Several articles in the literature address the problem of generating contingency tables. The methods are close to the shuffling-based method proposed in our 2015 Euro-Par paper. However, when generating a cost matrix with a given heterogeneity, we fix the mean of each row and of each column rather than their sums. The problems are nonetheless equivalent: fixing the mean of a row of m entries is the same as fixing its sum to m times that mean. In both cases, each cost must be non-negative.
The literature on contingency tables seems to consist of two kinds of approaches: Markov chain Monte Carlo (MCMC) and Sequential Importance Sampling (SIS).
We start with the first approach and the article “Sampling and Counting Contingency Tables Using Markov Chains” by Jonathan Connell (2011). This short paper (possibly a technical report) focuses on two-rowed tables and an MCMC performing unitary modifications.
“Rapidly mixing Markov chains for sampling contingency tables with a constant number of rows” by Cryan et al. (2006), published in the SIAM Journal on Computing, presents an MCMC method. It proves that the “2x2 heat bath” chain is rapidly mixing, with a mixing time polynomial in the number of columns and in the logarithm of the table sum. This method performs non-unitary modifications: it selects a random quantity to remove from or add to some weights.
Finally, “Random Sampling of Contingency Tables via Probabilistic Divide-and-Conquer” by DeSalvo (2016), an arXiv paper, proposes specific methods but does not seem promising for a concrete application.
For SIS, we have “Sequential Monte Carlo Methods for Statistical Analysis of Tables” by Chen, Diaconis et al. (2005), published in the Journal of the American Statistical Association. They propose a sequential importance sampling scheme that generates a contingency table directly (a non-iterative method). It is expected to be fast, but the resulting distribution is not necessarily uniform.
There is also “Semigroups and sequential importance sampling for multiway tables” by Yoshida et al. (2011), an arXiv paper, which extends the SIS procedure to multiway tables.
In the following, we focus on the MCMC approach with the 2x2 heat bath to replace our shuffling method, which resembles it but is somewhat ad hoc.
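To make the move concrete, here is a minimal sketch of a single 2x2 heat-bath step in Julia (the helper name heat_bath_move! is ours; the test code below inlines this logic):
# One 2x2 heat-bath move: select two distinct rows and columns, then
# resample the 2x2 submatrix uniformly among all non-negative
# submatrices having the same row and column sums.
function heat_bath_move!(M, n, m)
    i1 = rand(1:n)
    j1 = rand(1:m)
    i2 = (i1 + rand(1:n-1) - 1) % n + 1  # a row distinct from i1
    j2 = (j1 + rand(1:m-1) - 1) % m + 1  # a column distinct from j1
    sumR1 = M[i1,j1] + M[i1,j2]  # sums of the 2x2 submatrix
    sumR2 = M[i2,j1] + M[i2,j2]
    sumC1 = M[i1,j1] + M[i2,j1]
    # M[i1,j1] must lie in [binf, bsup] so that all four entries remain
    # non-negative once the sums are restored
    binf = max(sumC1 - sumR2, 0)
    bsup = min(sumR1, sumC1)
    M[i1,j1] = rand() * (bsup - binf) + binf
    M[i1,j2] = sumR1 - M[i1,j1]
    M[i2,j1] = sumC1 - M[i1,j1]
    M[i2,j2] = sumR2 - M[i2,j1]
end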
We will assume that the correlation decreases over the iterations until it stabilizes around some value, which lets us detect when convergence has occurred. This remains an assumption, since convergence may actually occur later, but it provides some intuition.
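A minimal sketch of how this stabilization could be detected, with a window size and tolerance that are arbitrary choices of ours:
# Heuristic: the chain is considered converged when the mean
# correlation over the last window differs from the mean over the
# preceding window by less than tol.
function stabilized(cors, window, tol)
    length(cors) < 2 * window && return false
    recent = mean(cors[end-window+1:end])
    previous = mean(cors[end-2*window+1:end-window])
    abs(recent - previous) < tol
end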
Another adjustment is to consider floats rather than integers (the literature deals with integer tables). We assume that this works well in practice.
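Concretely, for an integer table the heat bath draws the new entry uniformly among the admissible integers, whereas our variant draws a continuous value; a sketch of the two resampling rules (the helper names draw_int and draw_float are hypothetical, with binf and bsup the bounds from the move sketched above):
# Integer table: uniform over the integers {binf, ..., bsup}.
draw_int(binf, bsup) = rand(binf:bsup)
# Our float variant: uniform over the interval [binf, bsup].
draw_float(binf, bsup) = rand() * (bsup - binf) + binf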
Let’s start the Julia engine (a specificity of knitr+Julia):
library(knitr)
library(runr)
j = proc_julia()
j$start()
knit_engines$set(julia = function(options) {
    knitr:::wrap(j$exec(options$code), options)
})
The functions for measuring the average row and column correlations:
using StatsBase

# Average Pearson correlation over all ordered pairs of distinct rows.
function cor_row(M)
    n = size(M)[1]
    cov_row = 0
    for i1 = 1:n
        for i2 = 1:n
            if i1 != i2
                cov_row += cov(vec(M[i1,:]), vec(M[i2,:])) /
                           sqrt(var(vec(M[i1,:])) * var(vec(M[i2,:])))
            end
        end
    end
    cov_row /= n * n - n  # n * (n - 1) ordered pairs
end
## cor_row
# Average Pearson correlation over all ordered pairs of distinct columns.
function cor_col(M)
    m = size(M)[2]
    cov_col = 0
    for j1 = 1:m
        for j2 = 1:m
            if j1 != j2
                cov_col += cov(vec(M[:,j1]), vec(M[:,j2])) /
                           sqrt(var(vec(M[:,j1])) * var(vec(M[:,j2])))
            end
        end
    end
    cov_col /= m * m - m  # m * (m - 1) ordered pairs
end
## cor_col
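Each term in these loops is simply the Pearson correlation between two rows (or columns). Assuming Julia's cor on a matrix returns the pairwise correlations of its columns, an equivalent vectorized formulation would be:
# Same quantities as cor_row/cor_col: sum the full correlation matrix,
# remove the diagonal of ones, and divide by the number of pairs.
cor_row_vec(M) = (sum(cor(M')) - size(M, 1)) / (size(M, 1)^2 - size(M, 1))
cor_col_vec(M) = (sum(cor(M)) - size(M, 2)) / (size(M, 2)^2 - size(M, 2))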
The code for testing with a small matrix for comparison with the previous study:
result = 0;
## 0
t1 = time();
## 1.46305181679602e9
for i = 1:1
    srand(0);
    n = 10;
    m = 10;
    R = rand(n);
    C = rand(m);
    M = R * transpose(C);
    MC = 10000;
    result = zeros(Float64, MC, 2);
    for i = 1:MC
        result[i,1] = cor_row(M)
        result[i,2] = cor_col(M)
        # select two distinct rows and two distinct columns
        i1 = rand(1:n)
        j1 = rand(1:m)
        i2 = (i1 + rand(1:n-1) - 1) % n + 1
        j2 = (j1 + rand(1:m-1) - 1) % m + 1
        # row and column sums of the 2x2 submatrix
        sumR1 = M[i1,j1] + M[i1,j2]
        sumR2 = M[i2,j1] + M[i2,j2]
        sumC1 = M[i1,j1] + M[i2,j1]
        sumC2 = M[i1,j2] + M[i2,j2]  # unused
        # bounds keeping all four entries non-negative
        binf = max(sumC1 - sumR2, 0)
        bsup = min(sumR1, sumC1)
        M[i1,j1] = rand() * (bsup - binf) + binf
        M[i1,j2] = sumR1 - M[i1,j1]
        M[i2,j1] = sumC1 - M[i1,j1]
        M[i2,j2] = sumR2 - M[i2,j1]
    end
end
makespan = time() - t1;
## 4.133729934692383
The seemingly useless wrapper loop (for i = 1:1) in the previous block was necessary to avoid spurious output (due to a bug with knitr).
It takes around 4 seconds to perform ten thousand iterations, but the convergence is much faster than in the previous study: hundreds of iterations are sufficient instead of 50,000.
Note that the uniform distribution is used here, whereas we used the gamma distribution in the previous articles. We assume that the choice of the distribution has a small impact on the convergence (less than one order of magnitude).
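If the distribution turned out to matter, a non-uniform draw could be substituted in the update; for instance, a hypothetical variant with the Distributions package (not used in this study):
using Distributions
# Draw the new entry from a Beta(2, 2) distribution rescaled to
# [binf, bsup] instead of a uniform one (mass concentrated at the center).
draw_beta(binf, bsup) = binf + (bsup - binf) * rand(Beta(2, 2))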
using ASCIIPlots
lineplot(result[1:1000,1])
## -------------------------------------------------------------
## |\ | 1.00
## | |
## | |
## |\ |
## |\ |
## | |
## |\ |
## |\ |
## |\ |
## |\ |
## |\ |
## |\ |
## |-\ |
## | \\ |
## | /\ |
## | -\ |
## | //\ |
## | /\ /\ \\ \ \ \ |
## | /-\\\\\\//\\ \\-\\\\-\\\\\\ \\\\\ \ /\//\/\\\\ |
## | / //-/ /////-/ \// //-///- -\/-/-\/-//-/ / //\\| -0.01
## -------------------------------------------------------------
## 1.00 1000.00
lineplot(result[1:1000,2])
## -------------------------------------------------------------
## |\ | 1.00
## |\ |
## | |
## |\ |
## |\ |
## |\ |
## |-\\ |
## | \-\ |
## | |
## | \/\ |
## | //\ |
## | \\\-\ |
## | --/ \ |
## | \\\ |
## | -/\ |
## | /\/\ --//-//\\ \ |
## | \//-/\ \ \\ \ \\-\\ |
## | ///\ /\-/ -\-/\\ \ \-///// |
## | ////// //-\/\/-\ \---// /\ |
## | //-//\/-/ / --\\| 0.20
## -------------------------------------------------------------
## 1.00 1000.00
Now let’s compare with the performance of R:
timestamp()
## ##------ Thu May 12 13:06:47 2016 ------##
n <- 10
m <- 10
M <- runif(n) %*% t(runif(m))
MC <- 10000
result <- data.frame(row = rep(0, MC), col = rep(0, MC))
for (i in 1:MC) {
    # mean_cor_row and mean_cor_col are defined in the previous study
    result[i,1] <- mean_cor_row(M)
    result[i,2] <- mean_cor_col(M)
    # select two distinct rows and two distinct columns
    i1 <- sample(1:n, 1)
    j1 <- sample(1:m, 1)
    i2 <- (i1 + sample(1:(n - 1), 1) - 1) %% n + 1
    j2 <- (j1 + sample(1:(m - 1), 1) - 1) %% m + 1
    # row and column sums of the 2x2 submatrix
    sumR1 <- M[i1,j1] + M[i1,j2]
    sumR2 <- M[i2,j1] + M[i2,j2]
    sumC1 <- M[i1,j1] + M[i2,j1]
    sumC2 <- M[i1,j2] + M[i2,j2] # unused
    # bounds keeping all four entries non-negative
    binf <- max(sumC1 - sumR2, 0)
    bsup <- min(sumR1, sumC1)
    M[i1,j1] <- runif(1) * (bsup - binf) + binf
    M[i1,j2] <- sumR1 - M[i1,j1]
    M[i2,j1] <- sumC1 - M[i1,j1]
    M[i2,j2] <- sumR2 - M[i2,j1]
}
timestamp()
## ##------ Thu May 12 13:07:30 2016 ------##
It takes around ten times longer to generate the data than with Julia. Let’s plot the results:
plot(result[1:1000,1], ylim = c(0, 1), type = "l")
lines(result[1:1000,2], col = 2)
The convergence is similar: hundreds of iterations suffice.
The previous experiments allow comparing the 2x2 heat bath method with the unitary method. The unitary method took around 50,000 iterations for this size (10 by 10 matrices), whereas the 2x2 heat bath method requires less than 500 iterations (a 100x improvement).
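For reference, a unitary move modifies the 2x2 submatrix by a fixed small quantity instead of resampling an entry over its whole admissible range, which explains the slower mixing. A minimal sketch of what such a move could look like, with an arbitrary step delta (the helper name is ours):
# One unitary move: shift a fixed amount delta within the 2x2
# submatrix, preserving row and column sums; the move is skipped when
# it would make an entry negative.
function unitary_move!(M, n, m, delta)
    i1, j1 = rand(1:n), rand(1:m)
    i2 = (i1 + rand(1:n-1) - 1) % n + 1
    j2 = (j1 + rand(1:m-1) - 1) % m + 1
    if M[i1,j2] >= delta && M[i2,j1] >= delta
        M[i1,j1] += delta
        M[i1,j2] -= delta
        M[i2,j1] -= delta
        M[i2,j2] += delta
    end
end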
In terms of implementation, Julia is ten times faster. Therefore, we will measure the convergence with the Julia implementation. Let’s consider 100 rows and 30 columns as in our previous articles. We go up to 10,000 iterations as a start.
result = 0;
## 0
t1 = time();
## 1.463051822879935e9
for i = 1:1
    srand(0);
    n = 100;
    m = 30;
    R = rand(n);
    C = rand(m);
    M = R * transpose(C);
    MC = 10000;
    freq = 100;  # measure the correlations every freq iterations
    result = zeros(Float64, round(Int, MC/freq), 2);
    for i = 1:MC
        if i % freq == 1
            result[round(Int, i/freq)+1,1] = cor_row(M)
            result[round(Int, i/freq)+1,2] = cor_col(M)
        end
        # select two distinct rows and two distinct columns
        i1 = rand(1:n)
        j1 = rand(1:m)
        i2 = (i1 + rand(1:n-1) - 1) % n + 1
        j2 = (j1 + rand(1:m-1) - 1) % m + 1
        # row and column sums of the 2x2 submatrix
        sumR1 = M[i1,j1] + M[i1,j2]
        sumR2 = M[i2,j1] + M[i2,j2]
        sumC1 = M[i1,j1] + M[i2,j1]
        sumC2 = M[i1,j2] + M[i2,j2]  # unused
        # bounds keeping all four entries non-negative
        binf = max(sumC1 - sumR2, 0)
        bsup = min(sumR1, sumC1)
        M[i1,j1] = rand() * (bsup - binf) + binf
        M[i1,j2] = sumR1 - M[i1,j1]
        M[i2,j1] = sumC1 - M[i1,j1]
        M[i2,j2] = sumR2 - M[i2,j1]
    end
end
makespan = time() - t1;
## 2.8538599014282227
Let’s plot the results for all the measured iterations.
using ASCIIPlots
lineplot(result[:,1])
## -------------------------------------------------------------
## |\ | 1.00
## | |
## |\ |
## | |
## | \ |
## | \ |
## | \ |
## | \ |
## | \\ |
## | \ |
## | \ |
## | \\ |
## | \\ |
## | \\ |
## | \\ |
## | \-\ |
## | \\\\ |
## | --\--\\ |
## | -\----\--- |
## | ---------------------------| 0.19
## -------------------------------------------------------------
## 1.00 100.00
lineplot(result[:,2])
## -------------------------------------------------------------
## |\ | 1.00
## | |
## | |
## |\ |
## | |
## | \ |
## | \ |
## | \ |
## | \ |
## | \\ |
## | \ |
## | \ |
## | \\ |
## | \\ |
## | \\\ |
## | \\\\ |
## | \\---- |
## | \--\\ |
## | ------\--\- |
## | \------------------------| 0.17
## -------------------------------------------------------------
## 1.00 100.00
We see the same behavior for both row and column correlations: after around 5,000 iterations, they remain stable. It is therefore likely that 10,000 iterations will suffice for generating instances of this size.
# rgamma_cost and heat_bath_based_method come from the previous studies
n <- 100
m <- 30
cvtask <- 0.3
cvmach <- 0.3
task <- rgamma_cost(n, 1, cvtask)
machine <- rgamma_cost(m, 1, cvmach)
timestamp()
## ##------ Thu May 12 13:17:05 2016 ------##
mat <- heat_bath_based_method(task, machine, 1e5)
timestamp()
## ##------ Thu May 12 13:17:10 2016 ------##
It takes 5 seconds to perform 100,000 iterations with R.
Based on the results of this study, we can launch the adapted 2x2 heat bath method and even go up to 100,000 iterations, which takes a few seconds in R (excluding the time to measure the correlations). This may make up for the simplifications we made.
Several assumptions were made in this study:
- the stabilization of the row and column correlations indicates convergence, even though convergence may actually occur later;
- using floats rather than integers (as in the contingency table literature) works well in practice;
- the choice of the distribution for the resampled values (uniform here, gamma in the previous articles) has a small impact on the convergence.
Let’s stop Julia (otherwise the process stays alive and poses problems when knitting the document again):
j$stop()
sessionInfo()
## R version 3.3.0 (2016-05-03)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.4 LTS
##
## locale:
## [1] LC_CTYPE=fr_FR.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=fr_FR.UTF-8 LC_COLLATE=fr_FR.UTF-8
## [5] LC_MONETARY=fr_FR.UTF-8 LC_MESSAGES=fr_FR.UTF-8
## [7] LC_PAPER=fr_FR.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggplot2_0.9.3.1 dplyr_0.4.3 tidyr_0.2.0 purrr_0.2.0
## [5] stringr_1.0.0 runr_0.0.7 knitr_1.10.5
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.1 magrittr_1.5 MASS_7.3-44
## [4] munsell_0.4 colorspace_1.2-2 R6_2.1.2
## [7] plyr_1.8.1 tools_3.3.0 dichromat_2.0-0
## [10] parallel_3.3.0 grid_3.3.0 gtable_0.1.2
## [13] pacman_0.4.1 DBI_0.3.1 htmltools_0.2.6
## [16] yaml_2.1.13 digest_0.6.9 assertthat_0.1
## [19] RColorBrewer_1.0-5 reshape2_1.2.2 formatR_0.10
## [22] evaluate_0.7 rmarkdown_0.7 labeling_0.1
## [25] stringi_0.5-5 scales_0.2.3 proto_0.3-10