Let’s consider a set of heterogeneous tasks with given expected costs and variances. We assume a dynamic scheduling strategy that starts each task as soon as a processor becomes available, processing tasks in decreasing order of expected cost or of variance.

n <- 300
p <- 10
# expected costs (column 1) and standard deviations (column 2), both sorted in decreasing order
costs <- cbind(sort(runif(n), decreasing = TRUE),
               sort(runif(n), decreasing = TRUE))
compl <- eval_makespan_MC_all(costs, p, MC = 1e4)
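The function eval_makespan_MC_all is defined earlier; as a reminder of what it computes, a minimal sketch of such a list-scheduling Monte Carlo evaluator could look as follows (the Gaussian cost model, the reading of the second column as a standard deviation, and the n x MC matrix of per-task completion times are assumptions based on how compl is used below):

# Hypothetical sketch only; the actual eval_makespan_MC_all is defined earlier in this document
eval_makespan_MC_all_sketch <- function(costs, p, MC = 1e4) {
  n <- nrow(costs)
  replicate(MC, {
    # one realization of the task durations (assumed Gaussian, column 2 used as standard deviation)
    durations <- rnorm(n, mean = costs[, 1], sd = costs[, 2])
    avail <- rep(0, p)       # time at which each processor becomes available
    compl <- numeric(n)
    for (i in seq_len(n)) {  # tasks start as soon as a processor is free, in the given order
      k <- which.min(avail)
      avail[k] <- avail[k] + durations[i]
      compl[i] <- avail[k]
    }
    compl                    # completion time of each task for this replicate
  })
}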

Let’s start with a coarse estimation assuming all tasks finish at the same time: each processor then processes 1/p of the total work, so the makespan is approximated by a normal distribution whose mean is the sum of the expected costs divided by p and whose standard deviation is the square root of the sum of the variances divided by p:

model_ecdf <- rnorm(1e4, mean = sum(costs[,1]) / p,
                    sd = sqrt(sum(costs[,2]^2)) / p)
plot.ecdf(apply(compl, 2, max))
plot.ecdf(model_ecdf, add = TRUE, col = 2)

ks.test(apply(compl, 2, max), model_ecdf)$statistic
##      D 
## 0.0561
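The KS statistic measures how far the model is from the empirical distribution, but not on which side it lies. To check whether the model is an upper bound in the stochastic sense, we can for instance compare quantiles (an upper bound in distribution would dominate every empirical quantile):

# does every model quantile dominate the corresponding empirical quantile?
q <- seq(0.01, 0.99, by = 0.01)
all(quantile(model_ecdf, q) >= quantile(apply(compl, 2, max), q))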

This is close but not an upper bound. Let’s consider a less favorable scenario (costs come in the opposite order):

compl_inv <- eval_makespan_MC_all(costs[n:1,], p, MC = 1e4)
plot.ecdf(apply(compl_inv, 2, max))
plot.ecdf(model_ecdf, add = TRUE, col = 2)

ks.test(apply(compl_inv, 2, max), model_ecdf)$statistic
##      D 
## 0.5363

This is not good.

Graham’s bound

Let’s improve this with Graham’s bound: when the i-th task starts, the least loaded processor has processed at most 1/p of the work of the i - 1 previously started tasks, so the completion time of task i is at most the sum of the first i - 1 costs divided by p plus the cost of task i itself. Applied to the last task:

model_ecdf <- rnorm(1e4, mean = sum(costs[1:(n - 1),1]) / p + costs[n,1],
                    sd = sqrt(sum(costs[1:(n - 1),2]^2) / p^2 + costs[n,2]^2))
plot.ecdf(compl[n,])
plot.ecdf(model_ecdf, add = TRUE, col = 2)

ks.test(compl[n,], model_ecdf)$statistic
##      D 
## 0.0398

Very precise.

model_ecdf <- rnorm(1e4, mean = sum(costs[1:(n - p),1]) / p + costs[n - p + 1,1],
                    sd = sqrt(sum(costs[1:(n - p),2]^2) / p^2 + costs[n - p + 1,2]^2))
plot.ecdf(compl[n - p + 1,])
plot.ecdf(model_ecdf, add = TRUE, col = 2)

ks.test(compl[n - p + 1,], model_ecdf)$statistic
##      D 
## 0.0467

Quite accurate too. Let’s now take the maximum over the last p tasks to start:

# maximum of the (independently drawn) bounds for the last p tasks to start
model_ecdf <- rep(0, 1e4)
for (i in seq_len(p))
  model_ecdf <- pmax(rnorm(1e4,
                           mean = sum(costs[1:(n - i),1]) / p + costs[n - i + 1,1],
                           sd = sqrt(sum(costs[1:(n - i),2]^2) / p^2 + costs[n - i + 1,2]^2)),
                     model_ecdf)
plot.ecdf(apply(compl, 2, max))
plot.ecdf(model_ecdf, add = TRUE, col = 2)

ks.test(apply(compl, 2, max), model_ecdf)$statistic
##      D 
## 0.6627
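We can measure how dependent the completion times of the last p tasks are by looking at their pairwise correlations across the Monte Carlo replicates (we expect them to be high here, since the balanced schedule makes every completion time track the total work):

# pairwise correlations of the completion times of the last p tasks across replicates
C <- cor(t(compl[(n - p + 1):n, ]))
summary(C[upper.tri(C)])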

We lose accuracy because this is a maximum between dependent variables, which the model treats as independent. Note that the completion time of the last task to start is a good approximation in this case:

model_ecdf <- rnorm(1e4, mean = sum(costs[1:(n - 1),1]) / p + costs[n,1],
                    sd = sqrt(sum(costs[1:(n - 1),2]^2) / p^2 + costs[n,2]^2))
ks.test(apply(compl, 2, max), model_ecdf)$statistic
##      D 
## 0.0608

Let’s now consider the less favorable scenario (inverse order):

model_ecdf <- rnorm(1e4, mean = sum(costs[2:n,1]) / p + costs[1,1],
                    sd = sqrt(sum(costs[2:n,2]^2) / p^2 + costs[1,2]^2))
plot.ecdf(compl_inv[n,])
plot.ecdf(model_ecdf, add = TRUE, col = 2)

ks.test(compl_inv[n,], model_ecdf)$statistic
##     D 
## 0.297

This bound is quite loose. Let’s take the maximum over the last p tasks as before:

# maximum of the (independently drawn) bounds for the last p tasks to start
model_ecdf <- rep(0, 1e4)
for (i in seq_len(p))
  model_ecdf <- pmax(rnorm(1e4,
                           mean = sum(costs[(1 + i):n,1]) / p + costs[i,1],
                           sd = sqrt(sum(costs[(1 + i):n,2]^2) / p^2 + costs[i,2]^2)),
                     model_ecdf)
plot.ecdf(apply(compl_inv, 2, max))
plot.ecdf(model_ecdf, add = TRUE, col = 2)

ks.test(apply(compl_inv, 2, max), model_ecdf)$statistic
##      D 
## 0.3701
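For comparison with the previous case, we can look again at the pairwise correlations of the completion times of the last p tasks (they should now be lower, since these tasks carry most of the variance themselves):

C_inv <- cor(t(compl_inv[(n - p + 1):n, ]))
summary(C_inv[upper.tri(C_inv)])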

Since the schedule is less compact, the maximum is taken between variables that are more independent. Let’s see whether the completion time of the last task to start is accurate:

model_ecdf <- rnorm(1e4, mean = sum(costs[2:n,1]) / p + costs[1,1],
                    sd = sqrt(sum(costs[2:n,2]^2) / p^2 + costs[1,2]^2))
plot.ecdf(apply(compl_inv, 2, max))
plot.ecdf(model_ecdf, add = TRUE, col = 2)

ks.test(apply(compl_inv, 2, max), model_ecdf)$statistic
##      D 
## 0.2718

Not very good, and it is no longer an upper bound.

Validation

Let’s settle for a lower bound (average work), an upper bound (max of all task completion times) and in-between estimations (last task completion time, max of the last p task completion times). Let’s consider the following orders:

- random
- LPT (Longest Processing Time first)
- LVF (Largest Variance First)
- LPT and LVF
- inverse LPT and LVF

Let’s start by generating the completion times:

n <- 100
p <- 5
costs_rand <- cbind(runif(n), runif(n))
compl_rand <- eval_makespan_MC_all(costs_rand, p, MC = 1e4)
costs_LPT <- cbind(sort(runif(n), decreasing = TRUE), runif(n))
compl_LPT <- eval_makespan_MC_all(costs_LPT, p, MC = 1e4)
costs_LVF <- cbind(runif(n), sort(runif(n), decreasing = TRUE))
compl_LVF <- eval_makespan_MC_all(costs_LVF, p, MC = 1e4)
costs_LPTLVF <- cbind(sort(runif(n), decreasing = TRUE),
                      sort(runif(n), decreasing = TRUE))
compl_LPTLVF <- eval_makespan_MC_all(costs_LPTLVF, p, MC = 1e4)
costs_LPTLVF_inv <- cbind(sort(runif(n), decreasing = FALSE),
                          sort(runif(n), decreasing = FALSE))
compl_LPTLVF_inv <- eval_makespan_MC_all(costs_LPTLVF_inv, p, MC = 1e4)

Let’s compute the estimation and plot them for the first instance:

estimations_ecdf <- function(costs, p) {
  # Graham bound for the completion time of the i-th task to start:
  # average work of the preceding tasks plus the cost of the task itself
  upper_bound_task_ecdf <- function(costs, p, i) {
    rnorm(1e4, mean = sum(costs[seq_len(i - 1),1]) / p + costs[i,1],
          sd = sqrt(sum(costs[seq_len(i - 1),2]^2) / p^2 + costs[i,2]^2))
  }
  n <- nrow(costs)
  compl_times <- do.call(rbind, map(1:n, ~ upper_bound_task_ecdf(costs, p, .)))
  list(
    # lower bound: total work evenly divided among the p processors
    lower_bound = rnorm(1e4, mean = sum(costs[,1]) / p,
                        sd = sqrt(sum(costs[,2]^2)) / p),
    # upper bound: max of all task completion times (independence assumed)
    upper_bound = apply(compl_times, 2, max),
    # in-between: max over the last p tasks only
    last_p_tasks = apply(compl_times[(n - p + 1):n,], 2, max),
    # in-between: completion time of the last task to start
    last_tasks = compl_times[n,]
  )
}
plot_ecdfs <- function(costs, compl, p) {
  mks <- apply(compl, 2, max)
  plot.ecdf(mks, xlim = quantile(mks, c(0.01, 0.99)))
  estimations <- estimations_ecdf(costs, p)
  for (i in 1:length(estimations))
    plot.ecdf(estimations[[i]], add = TRUE, col = i + 1)
}
par(mfrow = c(2, 3))
plot_ecdfs(costs_rand, compl_rand, p)
plot_ecdfs(costs_LPT, compl_LPT, p)
plot_ecdfs(costs_LVF, compl_LVF, p)
plot_ecdfs(costs_LPTLVF, compl_LPTLVF, p)
plot_ecdfs(costs_LPTLVF_inv, compl_LPTLVF_inv, p)
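To complement the visual comparison, the KS distance of each estimation can also be computed, for instance on the combined LPT and LVF instance:

# KS distance of each estimation to the Monte Carlo makespan for one instance
mks <- apply(compl_LPTLVF, 2, max)
sapply(estimations_ecdf(costs_LPTLVF, p),
       function(est) ks.test(mks, est)$statistic)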

OK, the lower and upper bounds appear to work correctly. Moreover, the most precise estimation seems to be the completion time of the last task to start.

## R version 3.4.0 (2017-04-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.2 LTS
## 
## Matrix products: default
## BLAS: /usr/lib/openblas-base/libblas.so.3
## LAPACK: /usr/lib/libopenblasp-r0.2.18.so
## 
## locale:
##  [1] LC_CTYPE=fr_FR.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=fr_FR.UTF-8        LC_COLLATE=fr_FR.UTF-8    
##  [5] LC_MONETARY=fr_FR.UTF-8    LC_MESSAGES=fr_FR.UTF-8   
##  [7] LC_PAPER=fr_FR.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] stringr_1.0.0 tidyr_0.2.0   ggplot2_2.2.1 dplyr_0.4.3   purrr_0.2.0  
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.10     knitr_1.15.1     magrittr_1.5     munsell_0.4     
##  [5] colorspace_1.2-2 R6_2.1.0         rlang_0.1        plyr_1.8.1      
##  [9] tools_3.4.0      parallel_3.4.0   grid_3.4.0       gtable_0.1.2    
## [13] pacman_0.4.6     DBI_0.3.1        htmltools_0.3.6  yaml_2.1.13     
## [17] lazyeval_0.2.0   rprojroot_1.2    digest_0.6.3     assertthat_0.1  
## [21] tibble_1.3.1     evaluate_0.10    rmarkdown_1.5    stringi_0.5-5   
## [25] compiler_3.4.0   scales_0.4.1     backports_1.0.5