Let’s study the behavior of the proposed algorithm with random instances.

Single Example

Let’s try the scheduling phase with LPT.

costs <- runif(7)
schedule_LPT(costs, 3)$mks / schedule_optimal(costs, 3)$mks
## [1] 1
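The implementation of `schedule_LPT` is not shown here; to fix the idea, a minimal LPT sketch for identical machines (an illustration, not necessarily the code used above) could look like:

```r
# Illustrative LPT (Longest Processing Time) rule for m identical
# machines: sort tasks by decreasing cost, then greedily assign each
# task to the currently least-loaded machine.
lpt_makespan <- function(costs, m) {
  loads <- numeric(m)
  for (cost in sort(costs, decreasing = TRUE)) {
    i <- which.min(loads)         # least-loaded machine so far
    loads[i] <- loads[i] + cost
  }
  max(loads)                      # resulting makespan
}

lpt_makespan(c(3, 3, 2, 2, 2), 2)  # 7, while the optimum is 6
```

LPT is guaranteed to be within a factor \(4/3 - 1/(3m)\) of the optimal makespan, which is consistent with the ratio of 1 observed on the random instance above.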

Let’s try the allocation phase.

costs <- matrix(c(1, 2,
                  2, 3,
                  2, 3,
                  2, 3,
                  1, 2), ncol = 2, byrow = TRUE)
allocation_optimal(costs, 2, 2)
## $alloc
##   X1 X2 X3 X4 X5
## 3  1  2  1  1  1
## 
## $mks
## [1] 3
allocation_greedy_balance(costs, 2, 2)
## $alloc
##      [,1] [,2]
## [1,]    1    0
## [2,]    1    0
## [3,]    1    0
## [4,]    0    1
## [5,]    1    0
## 
## $mks
## [1] 3
## 
## $LB
## [1] 3

This seems to be working.
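The definition of `allocation_greedy_balance` is given elsewhere; one plausible greedy balancing rule (purely hypothetical, to illustrate the idea) assigns each task to the machine type that minimizes the resulting average load per machine:

```r
# Hypothetical greedy balancing sketch for two machine types with m and
# k machines: each task goes to the type with the smaller average load
# per machine after the assignment. Not the implementation used above.
greedy_alloc_sketch <- function(costs, m, k) {
  loads <- c(0, 0)                           # total work on each type
  machines <- c(m, k)
  alloc <- integer(nrow(costs))
  for (t in seq_len(nrow(costs))) {
    cand <- (loads + costs[t, ]) / machines  # avg load if assigned there
    alloc[t] <- which.min(cand)
    loads[alloc[t]] <- loads[alloc[t]] + costs[t, alloc[t]]
  }
  alloc
}

costs <- matrix(c(1, 2,
                  2, 3,
                  2, 3,
                  2, 3,
                  1, 2), ncol = 2, byrow = TRUE)
greedy_alloc_sketch(costs, 2, 2)   # one type per task
```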

Random Generation

Let’s generate some random instances to see how we perform relative to the makespan of an optimal schedule.

n <- 6
m <- 3
k <- 3
for (i in 1:10) {
  costs <- matrix(runif(2 * n), nrow = n, byrow = TRUE)
  print(allocation_greedy_balance(costs, m, k)$mks / allocation_optimal(costs, m, k)$mks)
}
## [1] 1
## [1] 1
## [1] 1
## [1] 1
## [1] 1.201615
## [1] 1
## [1] 1
## [1] 1
## [1] 1
## [1] 1

Given the time required to find an optimal schedule, it is not possible to run a large set of experiments. Moreover, with small instances, our algorithm is often equivalent to the optimal one, as we can see above. It seems necessary to use a quick estimation of the makespan for larger instances.

Lower Bound

We can do so by considering the best available lower bound.
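The exact lower bound (`LB`) computed by `allocation_greedy_balance` is not detailed here; as a point of comparison, a simple (possibly weaker) bound combines the longest unavoidable task with an area argument:

```r
# Simple lower-bound sketch (an assumption, possibly weaker than the LB
# reported above): every task costs at least its cheaper option, and
# the total minimal work must be spread over the m + k machines.
simple_lb <- function(costs, m, k) {
  per_task_min <- pmin(costs[, 1], costs[, 2])
  max(max(per_task_min),               # longest unavoidable task
      sum(per_task_min) / (m + k))     # area bound
}

costs <- matrix(c(1, 2, 2, 3, 2, 3, 2, 3, 1, 2), ncol = 2, byrow = TRUE)
simple_lb(costs, 2, 2)   # 2, below the LB of 3 reported earlier
```

Any valid lower bound gives ratios of at least 1; a tighter bound merely makes the ratios smaller.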

n <- 100
m <- 10
k <- 10
ratio <- NULL
for (i in 1:100) {
  costs <- matrix(runif(2 * n), nrow = n, byrow = TRUE)
  result <- allocation_greedy_balance(costs, m, k)
  ratio <- c(ratio, result$mks / result$LB)
}
print(c(all(ratio >= 1), mean(ratio)))
## [1] 1.000000 1.026243

This seems to work. Let’s plot, as a heat map, the average ratio between the makespan obtained with our algorithm and the lower bound. The number of machines of each type is identical in the following.

results <- NULL
for (n in ceiling((200^(1/14))^(1:14))) {
  for (m in c(1:5, seq(7, 15, 2), seq(20, 40, 5))) {
    ratios <- replicate(100, {
      costs <- matrix(runif(2 * n), nrow = n, byrow = TRUE)
      result <- allocation_greedy_balance(costs, m, m)
      result$mks / result$LB
    })
    results <- rbind(results, data.frame(n = n, m = m, ratio = mean(ratios)))
  }
}
ggplot(data = results, aes(x = n, y = m)) +
  geom_tile(aes(fill = ratio)) +
  geom_contour(aes(z = ratio, color = ..level..), breaks = seq(1, 1.5, 0.05)) +
  scale_x_log10(breaks = ceiling((200^(1/14))^(1:14))) +
  scale_y_log10() +
  annotation_logticks(sides = "l") +
  scale_fill_gradientn(colours = c("white", rainbow(100), "black")) +
  scale_color_continuous(low = "black", high = "black", guide = "none")


On average, the makespan obtained on the generated instances can be up to 20% above the lower bound, indicating either that our algorithm performs badly or that the lower bound is too coarse. In both cases, we may assume this corresponds to more difficult instances (when the number of tasks and machines is low, and when the number of machines is around 15% of the number of tasks).

Preliminary Exploration

Here is the list of the remaining parameters to fix (for a total of ten):

We already varied \(n\) and \(m=k\). Let’s study each parameter one after the other. We start by fixing \(n=30\) and \(m=5\), a seemingly difficult setting.

n <- 30
m <- 5
results <- map_df(1:1000, function(x) {
  k <- sample(1:15, 1)
  costs <- matrix(runif(2 * n), nrow = n, byrow = TRUE)
  result <- allocation_greedy_balance(costs, m, k)
  data.frame(k = k, ratio = result$mks / result$LB)
})
ggplot(data = results, aes(x = k, y = ratio)) +
  geom_boxplot(aes(group = k)) +
  geom_jitter(size = 0.5) +
  stat_smooth()

The hardest setting seems to be when \(k=m\). Let’s study the effect of the expected cost of each column. We fix the expected cost to 1 for the first type of machines and vary it for the second.

n <- 30
m <- 5
k <- 5
results <- map_df(1:1000, function(x) {
  cost_ratio <- exp(runif(1, log(0.01), log(100)))
  costs <- matrix(c(runif(n, min = 0, max = 2),
                    runif(n, min = 0, max = 2 * cost_ratio)),
                  nrow = n, byrow = FALSE)
  result <- allocation_greedy_balance(costs, m, k)
  data.frame(cost_ratio = cost_ratio, ratio = result$mks / result$LB)
})
ggplot(data = results, aes(x = cost_ratio, y = ratio - 1)) +
  geom_point(size = 0.5) +
  stat_smooth() +
  scale_x_log10() +
  scale_y_log10() +
  annotation_logticks()
## Warning: Removed 3 rows containing non-finite values (stat_smooth).

We see a symmetry relative to the case where the expected costs are identical. The performance seems worst when there is a 10x ratio between the expected costs of the two columns. However, the difference is not significant, and the number of processors of each type may have an additional impact. We keep the same expected cost for both columns. Let’s study the effect of the coefficients of variation (CV).
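In the code below, the costs are drawn from \(U(\min, 2 - \min)\), which keeps the mean at 1 while the CV decreases linearly with the lower endpoint:

```r
# For X ~ U(min, 2 - min): E[X] = 1 and sd(X) = (2 - 2 * min) / sqrt(12)
# = (1 - min) / sqrt(3), hence CV = sd / mean = (1 - min) / sqrt(3).
cv_uniform <- function(min) (1 - min) / sqrt(3)

cv_uniform(c(0, 0.5, 1))   # from 1/sqrt(3) ~ 0.577 down to 0
```

Varying `min1` and `min2` over \([0, 1]\) thus sweeps the CV of each column from about 0.58 down to 0 (constant costs).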

n <- 30
m <- 5
k <- 5
results <- NULL
for (min1 in seq(0, 1, length.out = 15)) {
  for (min2 in seq(0, 1, length.out = 15)) {
    ratios <- replicate(100, {
      costs <- matrix(c(runif(n, min = min1, max = 2 - min1),
                        runif(n, min = min2, max = 2 - min2)),
                      nrow = n, byrow = FALSE)
      result <- allocation_greedy_balance(costs, m, k)
      result$mks / result$LB
    })
    results <- rbind(results, data.frame(min1 = min1, min2 = min2,
                                         ratio = mean(ratios)))
  }
}
ggplot(data = results, aes(x = min1, y = min2)) +
  geom_tile(aes(fill = ratio)) +
  geom_contour(aes(z = ratio, color = ..level..), breaks = seq(1, 1.5, 0.05)) +
  scale_fill_gradientn(colours = c("white", rainbow(100), "black")) +
  scale_color_continuous(low = "black", high = "black", guide = "none")

This is difficult to interpret. The worst situation is when the costs on one type of processors are identical while the costs on the other type vary significantly. Let’s investigate this case in particular.

ratios <- replicate(100, {
  costs <- matrix(c(rep(1, n),                   # identical costs on type 1
                    runif(n, min = 0, max = 2)), # variable costs on type 2
                  nrow = n, byrow = FALSE)
  result <- allocation_greedy_balance(costs, m, k)
  result$mks / result$LB
})
plot(ratios)

We indeed observe large ratios (the makespan is up to 40% greater than the lower bound). Let’s study the effect of the correlation and the variance.
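The construction below controls both knobs with a single mixture, `col2 = min1 * (alpha * u + (1 - alpha) * col1 - 1) + 1` with `u ~ U(0, 2)`: the mean stays at 1, `min1` scales the spread, and `alpha` moves the correlation with `col1` from 1 (at `alpha = 0`) towards 0 (at `alpha = 1`). A quick sanity check:

```r
# Sanity check of the mixture used below: with alpha = 0 the second
# column is a deterministic affine function of the first (correlation
# exactly 1), and min1 = 0.5 halves the spread around the mean of 1.
set.seed(42)
col1  <- runif(1e5, min = 0, max = 2)
u     <- runif(1e5, min = 0, max = 2)
min1  <- 0.5
alpha <- 0
col2  <- min1 * (alpha * u + (1 - alpha) * col1 - 1) + 1
c(mean(col2), sd(col2) / sd(col1), cor(col1, col2))  # ~1, 0.5, 1
```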

n <- 30
m <- 5
k <- 5
results <- NULL
for (min1 in seq(0, 1, length.out = 15)) {
  for (alpha in seq(0, 1, length.out = 15)) {
    ratios <- replicate(100, {
      col1 <- runif(n, min = 0, max = 2)
      col2 <- min1 * (alpha * runif(n, min = 0, max = 2) + (1 - alpha) * col1 - 1) + 1
      costs <- matrix(c(col1, col2), nrow = n, byrow = FALSE)
      result <- allocation_greedy_balance(costs, m, k)
      result$mks / result$LB
    })
    results <- rbind(results, data.frame(min1 = min1, alpha = alpha,
                                         ratio = mean(ratios)))
  }
}
ggplot(data = results, aes(x = min1, y = alpha)) +
  geom_tile(aes(fill = ratio)) +
  geom_contour(aes(z = ratio, color = ..level..), breaks = seq(1, 1.5, 0.05)) +
  scale_fill_gradientn(colours = c("white", rainbow(100), "black")) +
  scale_color_continuous(low = "black", high = "black", guide = "none")

The correlation does not seem to have a strong impact. This is consistent with the previous experiment, which showed that instances where the costs on one machine type are all close to one are more difficult.

Conclusion

The proposed lower bound can be quite far from the makespan achieved by the proposed algorithm. Here is a list of the effects that have been studied:

It remains to study the effect of the distribution and to vary several parameters at the same time to confirm these preliminary observations. Also, the impact of the CV is not well understood and could be investigated with fewer machines, for instance.

The next step will also be to implement methods from the literature (Scheduling independent tasks on multi‐cores with GPU accelerators) and heuristics for \(R||C_{\max}\) (HEFT, …).

## R version 3.3.2 (2016-10-31)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.1 LTS
## 
## locale:
##  [1] LC_CTYPE=fr_FR.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=fr_FR.UTF-8        LC_COLLATE=fr_FR.UTF-8    
##  [5] LC_MONETARY=fr_FR.UTF-8    LC_MESSAGES=fr_FR.UTF-8   
##  [7] LC_PAPER=fr_FR.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggplot2_2.1.0 dplyr_0.4.3   tidyr_0.2.0   purrr_0.2.0   stringr_1.0.0
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.7      knitr_1.14       magrittr_1.5     munsell_0.4     
##  [5] lattice_0.20-34  colorspace_1.2-2 R6_2.1.2         plyr_1.8.1      
##  [9] tools_3.3.2      parallel_3.3.2   grid_3.3.2       nlme_3.1-128    
## [13] gtable_0.1.2     mgcv_1.8-16      pacman_0.4.1     DBI_0.3.1       
## [17] htmltools_0.3.5  yaml_2.1.13      assertthat_0.1   digest_0.6.9    
## [21] tibble_1.2       Matrix_1.2-7.1   formatR_1.4      evaluate_0.10   
## [25] rmarkdown_1.1    labeling_0.1     stringi_0.5-5    scales_0.4.0