Large generation study (July 28, 2015)

We have ended up with many generation methods that have different properties. Let’s summarize them and remove the redundant ones.

We will start by testing an idea for a new method: controlling the number of singular values (SV) that are non-zero. Then, we will summarize the existing methods. We want to test whether each method can generate matrices with varying characteristics. Thus, we need to select a few measures and see what is specific to each method. Finally, we will try to cluster the methods.

Varying number of zero SV

We want to see if HLPT is affected by the number of SV that are zero.

n <- 200
m <- 50
ones <- 10
Z <- generate_matrix_sv(c(rep(1, ones), rep(0, min(n, m) - ones)), n, m)
LPT_unrelated(Z, func = min)

## [1] 0

Unfortunately, this does not work: it produces too many zeros in the cost matrix, so the makespan is always zero.
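To illustrate why zeros in the cost matrix drive the makespan to zero, here is a minimal sketch of a greedy LPT-style heuristic for unrelated machines. The function `lpt_unrelated_sketch` is a hypothetical stand-in (the actual `LPT_unrelated` implementation is not shown here), and it assumes that rows are tasks and columns are processors. As soon as every task has one processor on which it costs nothing, the greedy rule always finds a zero-cost placement.

```r
# Hypothetical stand-in for LPT_unrelated(); assumes rows = tasks, columns = processors.
lpt_unrelated_sketch <- function(Z, func = min) {
  # Sort tasks by decreasing summary cost (here the minimum cost of each row)
  task_order <- order(apply(Z, 1, func), decreasing = TRUE)
  loads <- numeric(ncol(Z))
  for (i in task_order) {
    # Place the task on the processor that finishes it the earliest
    completion <- loads + Z[i, ]
    best <- which.min(completion)
    loads[best] <- completion[best]
  }
  max(loads)  # makespan
}

set.seed(1)
# Toy cost matrix with strictly positive costs
Z_dense <- matrix(runif(200 * 50, min = 1, max = 2), nrow = 200)
# Same matrix, but every task gets one processor with a zero cost
Z_sparse <- Z_dense
zero_col <- sample(ncol(Z_sparse), nrow(Z_sparse), replace = TRUE)
Z_sparse[cbind(seq_len(nrow(Z_sparse)), zero_col)] <- 0

makespan_dense <- lpt_unrelated_sketch(Z_dense)
makespan_sparse <- lpt_unrelated_sketch(Z_sparse)
```

In this toy setting, `makespan_sparse` is zero because all loads stay at zero throughout the greedy assignment, whereas `makespan_dense` is strictly positive.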

Summary of generation methods

We finally have the following methods, with their arguments:

generate_random_matrix(n, m, mean, cv)
generate_heterogeneous_matrix_shuffling_smallest_cost(task, proc)
generate_heterogeneous_matrix_noise(task, proc, CV)
generate_heterogeneous_matrix_noise_corr(n, m, rdist, rhoR, rhoC, Vmax)
generate_matrix_corr_positive(n, m, rdist, mu, CV, rhoR, rhoC)
generate_matrix_cvsv(n, m, rdist, cvsv, mu)
generate_matrix_TMA(n, m, TMA, mu)

The first one is completely random. The next two target heterogeneity, the following two target correlation, and the last two are exploratory.
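As a point of reference for the completely random method, here is a minimal sketch of how a cost matrix with a given mean and CV could be drawn from a gamma distribution. Both `rgamma_cost_sketch` and `generate_random_matrix_sketch` are hypothetical stand-ins: the signature `(n, mean, cv)` is inferred from the calls in this report, and the real `rgamma_cost` and `generate_random_matrix` may be implemented differently.

```r
# Hypothetical stand-in for rgamma_cost(); assumed signature (n, mean, cv).
rgamma_cost_sketch <- function(n, mean, cv) {
  shape <- 1 / cv^2     # the CV of a gamma distribution is 1 / sqrt(shape)
  scale <- mean * cv^2  # the mean of a gamma distribution is shape * scale
  rgamma(n, shape = shape, scale = scale)
}

# Hypothetical stand-in for generate_random_matrix(): i.i.d. costs,
# n rows by m columns.
generate_random_matrix_sketch <- function(n, m, mean, cv) {
  matrix(rgamma_cost_sketch(n * m, mean, cv), nrow = n, ncol = m)
}

set.seed(42)
Z <- generate_random_matrix_sketch(200, 50, mean = 1, cv = 0.5)
# Empirical mean and CV of all costs, to compare with the requested values
c(mean = mean(Z), cv = sd(Z) / mean(Z))
```

With independent entries, the rows and columns carry no structure, which is consistent with a method that is neither heterogeneous nor correlated by design.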

Measures for each method

For each method, we will draw random parameters and see how the properties of the generated matrix change. As a robust measure of the spread, we will keep the first and third quartiles. The properties of interest are the heterogeneity, the CV, the correlation and the CVSV.
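The measure functions used in this study are defined elsewhere. As a rough guide to what their names suggest, here is a sketch with hypothetical `_sketch` versions for the row-wise measures (the column-wise ones would apply the same functions to `t(Z)`). These definitions are assumptions based on the names only: the CV of the row means, the mean of the row CVs, the mean pairwise correlation between rows, and the CV of the singular values.

```r
# Coefficient of variation: standard deviation divided by the mean
cv <- function(x) sd(x) / mean(x)

# Assumed meaning of CV_mean_row: CV of the row means
cv_mean_row_sketch <- function(Z) cv(rowMeans(Z))

# Assumed meaning of mean_CV_row: mean of the CVs computed within each row
mean_cv_row_sketch <- function(Z) mean(apply(Z, 1, cv))

# Assumed meaning of mean_cor_row: mean correlation over all pairs of rows
mean_cor_row_sketch <- function(Z) {
  C <- cor(t(Z))  # cor() correlates columns, so transpose to compare rows
  mean(C[upper.tri(C)])
}

# Assumed meaning of CV_SV: CV of the singular values
cv_sv_sketch <- function(Z) cv(svd(Z)$d)

# Rank-one example: all rows are proportional to each other
Z_rank1 <- outer(1:4, 1:3)
measures_rank1 <- c(CV_mean_row = cv_mean_row_sketch(Z_rank1),
                    mean_CV_row = mean_cv_row_sketch(Z_rank1),
                    mean_cor_row = mean_cor_row_sketch(Z_rank1),
                    CV_SV = cv_sv_sketch(Z_rank1))
```

On the rank-one example, the rows are perfectly correlated and only one singular value is non-zero, which gives the largest CV that three non-negative singular values can have.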

m <- 200
n <- 50
rdist <- rgamma_cost
properties_generation_large <- NULL
for (i in 1:100) {
  Z_list <- list(
    generate_random_matrix(n, m, runif(1), runif(1)),
    generate_heterogeneous_matrix_shuffling_smallest_cost(rdist(n, 1, runif(1)),
                                                          rdist(m, 1, runif(1))),
    generate_heterogeneous_matrix_noise(rdist(n, 1, runif(1)),
                                        rdist(m, 1, runif(1)),
                                        runif(1)),
    generate_heterogeneous_matrix_noise_corr(n, m, rdist, runif(1), runif(1), runif(1)),
    generate_matrix_corr_positive(n, m, rdist, runif(1), runif(1), runif(1), runif(1)),
    generate_matrix_cvsv(n, m, rdist, runif(10), runif(1)),
    generate_matrix_TMA(n, m, runif(1), runif(1)))
  for (j in seq_along(Z_list)) {
    Z <- Z_list[[j]]
    properties_generation_large <- rbind(properties_generation_large,
                                         data.frame(CV_mean_row = CV_mean_row(Z),
                                                    CV_mean_col = CV_mean_col(Z),
                                                    mean_CV_row = mean_CV_row(Z),
                                                    mean_CV_col = mean_CV_col(Z),
                                                    CV = CV_meas(as.vector(Z)),
                                                    mean_cor_row = mean_cor_row(Z),
                                                    mean_cor_col = mean_cor_col(Z),
                                                    CV_SV = CV_SV(Z),
                                                    method = j))
  }
}
properties_generation_large <- tbl_df(properties_generation_large) %>%
  mutate(method = factor(method, labels = c("random", "shuffling", "noise",
                                            "noise_corr", "combi", "cvsv", "TMA")))

Let’s plot the histogram for all methods.

properties_generation_large %>%
  mutate(mean_CV_row = log(mean_CV_row), mean_CV_col = log(mean_CV_col),
         CV = log(CV)) %>%
  gather(measure, value, -method) %>%
  ggplot(aes(x = value, color = method, fill = method)) +
  geom_bar(position = "fill") +
  facet_wrap(~ measure, scales = "free")

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

It’s difficult to see the specificity of each method. Let’s see which methods are similar by clustering them:

properties_generation_large_wide <- properties_generation_large %>%
  gather(measure, value, -method) %>%
  group_by(method, measure) %>%
  summarise(median = median(value)) %>%
  spread(measure, median)
rownames(properties_generation_large_wide) <- properties_generation_large_wide$method
plot(hclust(dist(properties_generation_large_wide)))

## Warning in dist(properties_generation_large_wide): NAs introduced by
## coercion

Based on the median values, the correlation methods and the heterogeneity methods form two groups (this confirms that their design objectives are reached). The TMA and the cvsv methods are quite distinct.
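The coercion warning raised by `dist()` most likely comes from the non-numeric `method` column, which is still part of the table when the distances are computed. A minimal sketch of one way to avoid it is to move the labels to the row names of a numeric matrix first. The values and the labels `m1` to `m3` below are toy data for illustration only, not results from this study.

```r
# Toy table of per-method medians (illustrative values only)
medians <- data.frame(method = c("m1", "m2", "m3"),
                      CV = c(0.5, 1.0, 0.2),
                      CV_SV = c(0.1, 0.3, 2.0))

# Keep only the numeric measures and carry the labels as row names
is_measure <- vapply(medians, is.numeric, logical(1))
measures <- as.matrix(medians[, is_measure])
rownames(measures) <- medians$method

d <- dist(measures)  # no coercion, hence no NA warning
tree <- hclust(d)
plot(tree)
```

The dendrogram leaves are then labeled with the method names instead of row indices.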

Let’s plot the first and third quartiles for each method and each measure:

properties_generation_large %>%
  mutate(CV_SV = log(CV_SV)) %>%
  gather(measure, value, -method) %>%
  group_by(method, measure) %>%
  summarise(Q1 = quantile(value, 0.25), Q3 = quantile(value, 0.75)) %>%
  ggplot(aes(x = method, ymin = Q1, ymax = Q3)) +
  geom_linerange() +
  facet_wrap(~ measure) +
  coord_flip()

Let’s discuss each method:

The random generation method produces matrices with very low heterogeneity (first definition) and very low correlation, while maintaining some heterogeneity (second definition). This method is quite distinct from the others.
The noise-based and the shuffling methods are extremely similar. Let’s keep the noise-based one, as its spread for the CV_SV is larger.
noise_corr and combi are also quite similar. Keeping either one is fine.
The TMA method gives low values for every measure except the CV_SV. It is not clear whether using the random method instead would be sufficient.
The cvsv method is similar to the random-based method but has even lower values of CV_SV.

The four most interesting and most distinct methods are the noise-based, the combination-based, the TMA-based and the CVSV-based ones.

Conclusion

This study confirms that the two methods proposed for the heterogeneity (shuffling and noise) are quite similar. It also shows that the two methods currently proposed for the correlation (combination and noise) are close. This is good because it allows us to show the effect of the correlation with two distinct methods that may be expected to perform similarly, which will strengthen the conclusions.

Also, the two newly proposed methods (TMA and CVSV) are quite interesting, and each can generate instances that are not covered by the existing methods (including the random approach).