Fixing the heat bath method

Edited on June 3, 2016.

There is a bug in the recent heat bath generation method: it generates matrices with negative values when the variance is large. Let’s check this.

n <- 100
m <- 30
cvtask <- 0.3
cvmach <- 3.5
task <- rgamma_cost(n, 1, cvtask)
machine <- rgamma_cost(m, 1, cvmach)
set.seed(1)
mat <- heat_bath_based_method(task, machine, 1e5)
mat[mat < 0]

##  [1] -4.295298e-19 -8.259392e-18 -1.558080e-16 -1.136244e-16 -9.878297e-17
##  [6] -2.632578e-17 -8.076150e-17 -4.361323e-16 -1.388321e-16 -1.110223e-16
## [11] -2.407688e-16 -2.974502e-16 -7.302190e-17 -2.498150e-16 -1.299289e-16
## [16] -3.089370e-19 -4.947090e-17 -1.889222e-17 -6.123793e-18 -2.283670e-16
## [21] -3.188731e-17 -1.040834e-16 -2.474488e-16 -8.311896e-17 -1.072347e-16
## [26] -2.939339e-16 -2.519225e-16 -8.606737e-16 -6.938894e-18 -2.886809e-17
## [31] -4.840489e-17 -4.991328e-17 -6.892667e-16 -1.696776e-16 -3.556183e-17
## [36] -1.694066e-16 -2.980986e-16 -2.005870e-16 -1.127785e-17 -2.351089e-18
## [41] -9.787650e-18 -3.130822e-16 -5.700892e-17 -7.350975e-17 -1.594012e-18
## [46] -1.716095e-16 -6.084593e-17 -2.751475e-17 -3.843644e-18 -4.304443e-17
## [51] -1.528022e-15 -2.636780e-16 -2.931683e-16 -7.118031e-17 -6.381414e-17
## [56] -6.406604e-18 -9.267004e-18 -1.127784e-16 -4.405657e-19 -2.892385e-16
## [61] -2.732189e-17 -7.784572e-17 -5.570103e-18 -2.945626e-17 -1.613877e-16
## [66] -3.443931e-16 -3.547252e-17 -3.751340e-17 -1.982247e-18 -2.416921e-16
## [71] -4.727121e-17 -4.871746e-19 -1.942890e-16 -6.504171e-18 -5.848568e-17
## [76] -5.368081e-17 -1.137681e-16 -7.349261e-17 -6.868421e-17 -5.716625e-17
## [81] -1.691492e-16 -5.638744e-17 -3.780142e-16 -5.551115e-17 -7.985263e-17
## [86] -6.017322e-18 -1.280410e-17 -1.302349e-15 -9.535790e-18 -1.144472e-16
## [91] -2.546027e-17 -1.952716e-15 -5.476080e-17 -1.797850e-15 -4.030141e-17
## [96] -5.637684e-18 -7.338360e-18 -7.946242e-17

The negative values are all close to 0. This may thus be due to rounding. Let’s adjust the method by forbidding negative value:

set.seed(1)
mat <- heat_bath_based_method_positive(task, machine, 1e5)
mat[mat < 0]

## numeric(0)

OK, this is fixed.

Revisiting convergence

We validated the convergence with 100 thousands iterations with 100x30 matrices using the correlation as an indicator.

Now, let’s study the distribution of an arbitrary value to see if this is consistent:

cvtask <- 0.3
cvmach <- 0.3
task <- rgamma_cost(n, 1, cvtask)
machine <- rgamma_cost(m, 1, cvmach)
iterations <- 10^(1:4)
map(iterations, ~ replicate(1000, heat_bath_based_method_positive(task, machine, .)[1,1])) %>%
  as.data.frame(col.names = iterations) %>%
  gather(iteration, value) %>%
  ggplot(aes(x = value)) +
  geom_histogram() +
  facet_wrap(~ iteration)

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

After 10 thousands iterations, there is no spike on the initial value. It suggests that 10 thousands iterations are sufficient, which is consistent with our previous observations. 100 thousands should thus be more than sufficient.

However, the distribution is not uniform (right-skewed). Let’s see with a smaller matrix (that should converge faster) what is the supposed final distribution:

n <- 5
m <- 5
task <- rgamma_cost(n, 1, cvtask)
machine <- rgamma_cost(m, 1, cvmach)
iterations <- 10^(0:3)
map(iterations, ~ replicate(1000, heat_bath_based_method_positive(task, machine, .)[1,1])) %>%
  as.data.frame(col.names = iterations) %>%
  gather(iteration, value) %>%
  ggplot(aes(x = value)) +
  geom_histogram() +
  facet_wrap(~ iteration)

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

OK, the final distribution seems to be indeed right-skewed and stay the same after 100 iterations.

Let’s do this with a second value just to be sure:

task <- rgamma_cost(n, 1, cvtask)
machine <- rgamma_cost(m, 1, cvmach)
iterations <- 10^(0:3)
map(iterations, ~ replicate(1000, heat_bath_based_method_positive(task, machine, .)[1,1])) %>%
  as.data.frame(col.names = iterations) %>%
  gather(iteration, value) %>%
  ggplot(aes(x = value)) +
  geom_histogram() +
  facet_wrap(~ iteration)

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

This confirms our analysis. The method has been fixed and we are still convinced that 10 thousands iterations would suffice and that 100 thousands are definitely sufficient.

## R version 3.3.0 (2016-05-03)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.4 LTS
## 
## locale:
##  [1] LC_CTYPE=fr_FR.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=fr_FR.UTF-8        LC_COLLATE=fr_FR.UTF-8    
##  [5] LC_MONETARY=fr_FR.UTF-8    LC_MESSAGES=fr_FR.UTF-8   
##  [7] LC_PAPER=fr_FR.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggplot2_0.9.3.1 dplyr_0.4.3     tidyr_0.2.0     purrr_0.2.0    
## [5] stringr_1.0.0  
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.1        knitr_1.10.5       magrittr_1.5      
##  [4] MASS_7.3-44        munsell_0.4        colorspace_1.2-2  
##  [7] R6_2.1.2           plyr_1.8.1         tools_3.3.0       
## [10] dichromat_2.0-0    parallel_3.3.0     grid_3.3.0        
## [13] gtable_0.1.2       pacman_0.4.1       DBI_0.3.1         
## [16] htmltools_0.2.6    yaml_2.1.13        assertthat_0.1    
## [19] digest_0.6.9       RColorBrewer_1.0-5 reshape2_1.2.2    
## [22] formatR_0.10       codetools_0.2-14   evaluate_0.7      
## [25] rmarkdown_0.7      labeling_0.1       stringi_0.5-5     
## [28] scales_0.2.3       proto_0.3-10

Fixing the heat bath method

L.-C. Canon

May 31, 2016

Revisiting convergence