In this study, we compare several methods from the literature to the proposed ones on a limited set of instances. Let’s first analyze the shape of realistic instances.
In “Scheduling of Linear Algebra Kernels on Multiple Heterogeneous Resources” (Beaumont et al. 2016), data are provided for 8 types of linear algebra tasks (from the Cholesky and QR factorizations), which we extract below. Moreover, in the experimental section, there are 24 CPU cores and 4 GPU units.
# CPU cost and GPU acceleration ratio for the 8 task types (Beaumont et al. 2016)
costs_CPU <- c(27.78, 34.42, 31.52, 36.46, 22.08, 33.78, 17.63, 32.93)
ratio_GPU <- c(1.72, 8.72, 26.96, 28.80, 1.91, 15.90, 1.87, 14.64)
costs_type <- matrix(c(costs_CPU, costs_CPU / ratio_GPU),
                     ncol = 2, byrow = FALSE)
In “Are Static Schedules so Bad? A Case Study on Cholesky Factorization” (Agullo et al. 2016), the Cholesky factorization of a 5x5 tile matrix uses 5 POTRF tasks, 10 TRSM tasks, 10 SYRK tasks and 10 GEMM tasks. However, the experiments are done with 12x12 tile matrices, which contain 12 POTRF tasks, 50 to 100 TRSM and SYRK tasks, and around 200 GEMM tasks.
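These counts follow the standard formulas for a tiled Cholesky factorization; a quick sketch (the function name is ours) recovers both the 5x5 and 12x12 numbers:

```r
# Task counts for an N x N tiled Cholesky factorization
# (standard formulas, consistent with the numbers above).
cholesky_counts <- function(N) {
  c(POTRF = N,
    TRSM  = N * (N - 1) / 2,
    SYRK  = N * (N - 1) / 2,
    GEMM  = N * (N - 1) * (N - 2) / 6)
}
cholesky_counts(5)   # 5 POTRF, 10 TRSM, 10 SYRK, 10 GEMM
cholesky_counts(12)  # 12 POTRF, 66 TRSM, 66 SYRK, 220 GEMM
```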
In “Bridging the Gap between Performance and Bounds of Cholesky Factorization on Heterogeneous Platforms” (Agullo et al. 2015), there were 12 CPU cores and 3 GPU units. The execution ratios were similar.
Overall, we observe 300 to 400 tasks for 15 to 30 machines (with 5 CPU cores for each GPU). The execution time on a CPU is mostly homogeneous (between 25 and 35) and the GPU speedup is between 2 and 30.
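These ranges can be checked directly against the costs extracted from Beaumont et al. above:

```r
# Costs extracted above from Beaumont et al. (2016)
costs_CPU <- c(27.78, 34.42, 31.52, 36.46, 22.08, 33.78, 17.63, 32.93)
ratio_GPU <- c(1.72, 8.72, 26.96, 28.80, 1.91, 15.90, 1.87, 14.64)
range(costs_CPU)  # 17.63 36.46 -- mostly between 25 and 35, with two outliers
range(ratio_GPU)  # 1.72 28.80
```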
A realistic instance could be the following:
n <- 300    # number of tasks
m <- 20     # number of CPU cores
k <- m / 5  # number of GPUs (one GPU for 5 CPU cores)
costs_CPU <- runif(n, min = 25, max = 35)
ratio_GPU <- runif(n, min = 2, max = 30)
costs <- matrix(c(costs_CPU, costs_CPU / ratio_GPU), ncol = 2, byrow = FALSE)
Let’s start with this instance (we will vary it later on).
RR <- c(all_proposed_methods(costs, m, k),
        all_literature_methods(costs, m, k),
        all_RCmax_methods(PmPk2R(costs, m, k), genetic = FALSE))
sort(map_dbl(RR, ~.$mks) / min(map_dbl(RR, ~.$mks)))
## LPT_balance_largest LPT_balance_relsuff LPT_balance_abssuff
## 1.000000 1.000000 1.000000
## DADA_custom DADA CLB2C
## 1.016400 1.016400 1.018912
## balance_sufferage MinMin EFT
## 1.061681 1.101725 1.101725
## greedy_balance DualHP HeteroPrio
## 1.102862 1.102862 1.123810
## Al5 greedy_Suff balance_EFT
## 1.156739 1.188485 1.195541
## HEFT EFT_SPT HEFT
## 1.198154 1.237431 1.259803
## classic_sufferage HEFT_min MCT
## 1.268144 1.278374 1.300952
## EFT_RATIO EFT_LPT Al4
## 1.303998 1.337192 1.426099
## RAND MaxMin MET
## 1.507865 1.522923 2.445892
Our proposed algorithm (greedy balance) performs well but is beaten by CLB2C, which has a similar complexity. Let’s repeat this test a few times to see if there is a tendency.
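As an aside, the `PmPk2R` helper used above converts the \((Pm,Pk)\) instance into an equivalent \(R||C_{\max}\) instance. A minimal sketch of what this conversion presumably computes (a hypothetical reimplementation, not the package’s actual code):

```r
# Hypothetical sketch of the PmPk2R conversion: repeat the CPU cost column
# m times and the GPU cost column k times to build the n x (m + k) cost
# matrix of the equivalent unrelated-machines (R||Cmax) instance.
PmPk2R_sketch <- function(costs, m, k) {
  cbind(matrix(rep(costs[, 1], m), ncol = m),
        matrix(rep(costs[, 2], k), ncol = k))
}
```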
for (i in 1:4) {
  costs_CPU <- runif(n, min = 25, max = 35)
  ratio_GPU <- runif(n, min = 2, max = 30)
  costs <- matrix(c(costs_CPU, costs_CPU / ratio_GPU), ncol = 2, byrow = FALSE)
  RR <- c(all_proposed_methods(costs, m, k),
          all_literature_methods(costs, m, k),
          all_RCmax_methods(PmPk2R(costs, m, k), genetic = FALSE))
  print(sort(map_dbl(RR, ~.$mks) / min(map_dbl(RR, ~.$mks))))
}
## CLB2C LPT_balance_relsuff LPT_balance_largest
## 1.000000 1.012211 1.012352
## LPT_balance_abssuff greedy_balance DualHP
## 1.012352 1.030137 1.030137
## DADA_custom DADA MinMin
## 1.030827 1.030827 1.040576
## EFT balance_sufferage HeteroPrio
## 1.040576 1.054369 1.092631
## greedy_Suff balance_EFT Al5
## 1.142153 1.156195 1.168489
## HEFT MCT HEFT_min
## 1.211996 1.222186 1.265513
## EFT_RATIO classic_sufferage HEFT
## 1.278915 1.289872 1.290996
## EFT_LPT Al4 EFT_SPT
## 1.294736 1.329526 1.362950
## MaxMin RAND MET
## 1.552343 1.596176 2.281607
## DADA_custom CLB2C DADA
## 1.000000 1.000000 1.000000
## LPT_balance_largest LPT_balance_relsuff LPT_balance_abssuff
## 1.001910 1.001910 1.001910
## balance_sufferage greedy_balance DualHP
## 1.030351 1.035759 1.035759
## MinMin EFT HeteroPrio
## 1.049780 1.049780 1.082929
## Al5 balance_EFT MCT
## 1.127980 1.154270 1.192347
## HEFT greedy_Suff EFT_RATIO
## 1.195277 1.200335 1.241416
## HEFT_min classic_sufferage Al4
## 1.250127 1.268132 1.280783
## HEFT EFT_LPT EFT_SPT
## 1.287480 1.323888 1.328885
## MaxMin RAND MET
## 1.580707 1.610895 2.665139
## LPT_balance_relsuff LPT_balance_largest LPT_balance_abssuff
## 1.000000 1.001278 1.001278
## DADA_custom DADA CLB2C
## 1.005781 1.005781 1.012939
## balance_EFT MinMin EFT
## 1.078193 1.081907 1.081907
## greedy_balance DualHP balance_sufferage
## 1.084694 1.084694 1.098518
## HeteroPrio Al5 greedy_Suff
## 1.118397 1.129244 1.155957
## HEFT MCT EFT_SPT
## 1.217090 1.231622 1.249898
## HEFT_min EFT_RATIO Al4
## 1.268463 1.292322 1.300554
## EFT_LPT classic_sufferage HEFT
## 1.313527 1.327194 1.337512
## MaxMin RAND MET
## 1.497012 1.524540 2.294017
## LPT_balance_relsuff DADA_custom DADA
## 1.000000 1.000970 1.000970
## LPT_balance_largest LPT_balance_abssuff CLB2C
## 1.002029 1.002029 1.012285
## MinMin EFT greedy_balance
## 1.086352 1.086352 1.088976
## DualHP balance_sufferage balance_EFT
## 1.088976 1.089116 1.103053
## HeteroPrio Al5 HEFT
## 1.112535 1.161898 1.193266
## greedy_Suff EFT_SPT EFT_RATIO
## 1.210882 1.234799 1.245026
## MCT EFT_LPT HEFT_min
## 1.257792 1.258217 1.259687
## HEFT classic_sufferage Al4
## 1.260946 1.268762 1.373985
## MaxMin RAND MET
## 1.453621 1.469787 2.065068
CLB2C is often in the first six positions, our algorithm is often around position 9, and HeteroPrio is often around position 12. Let’s focus on a subset of good heuristics and competing algorithms: LPT_balance, DADA_custom, CLB2C, balance_sufferage, MinMin, greedy_balance, DualHP and HeteroPrio.
ratios <- replicate(n = 10, {
  costs_CPU <- runif(n, min = 25, max = 35)
  ratio_GPU <- runif(n, min = 2, max = 30)
  costs <- matrix(c(costs_CPU, costs_CPU / ratio_GPU), ncol = 2, byrow = FALSE)
  RR <- list(schedule_LPT_balance(costs, m, k, strategy = "relsuff"),
             schedule_DADA_custom(costs, m, k),
             schedule_CLB2C(costs, m, k),
             RCmax_schedule_balance_sufferage(PmPk2R(costs, m, k)),
             RCmax_schedule_MinMin(PmPk2R(costs, m, k)),
             schedule_greedy_balance(costs, m, k),
             schedule_DualHP(costs, m, k),
             schedule_HeteroPrio(costs, m, k))
  names(RR) <- c("LPT_bal", "DADA", "CLB2C", "balance_suff", "MinMin",
                 "greedy_bal", "DualHP", "HeteroPrio")
  map_dbl(RR, ~.$mks) / min(map_dbl(RR, ~.$mks))
})
Let’s plot the result.
data.frame(t(ratios)) %>%
  gather(key = "algo", value = "ratio") %>%
  ggplot(aes(x = algo, y = ratio)) +
  geom_boxplot()
There seem to be three sets of heuristics: LPT_bal, DADA and CLB2C, which are very close and all very good; MinMin, greedy_bal and DualHP, which are comparable; and HeteroPrio, which is slightly worse. balance_suff falls between the last two sets.
Let’s try smaller instances (to gain computation speed and run more simulations).
n <- 100
m <- 10
k <- m / 5
ratios <- replicate(n = 10, {
  costs_CPU <- runif(n, min = 25, max = 35)
  ratio_GPU <- runif(n, min = 2, max = 30)
  costs <- matrix(c(costs_CPU, costs_CPU / ratio_GPU), ncol = 2, byrow = FALSE)
  RR <- list(schedule_LPT_balance(costs, m, k, strategy = "relsuff"),
             schedule_DADA_custom(costs, m, k),
             schedule_CLB2C(costs, m, k),
             RCmax_schedule_balance_sufferage(PmPk2R(costs, m, k)),
             RCmax_schedule_MinMin(PmPk2R(costs, m, k)),
             schedule_greedy_balance(costs, m, k),
             schedule_DualHP(costs, m, k),
             schedule_HeteroPrio(costs, m, k))
  names(RR) <- c("LPT_bal", "DADA", "CLB2C", "balance_suff", "MinMin",
                 "greedy_bal", "DualHP", "HeteroPrio")
  map_dbl(RR, ~.$mks) / min(map_dbl(RR, ~.$mks))
})
data.frame(t(ratios)) %>%
  gather(key = "algo", value = "ratio") %>%
  ggplot(aes(x = algo, y = ratio)) +
  geom_boxplot()
The same tendencies can be observed, except that the last four heuristics are worse. Let’s try a more symmetric scenario, with as many GPUs as CPUs and shuffled costs.
m <- 6
k <- 6
ratios <- replicate(n = 10, {
  costs_CPU <- runif(n, min = 25, max = 35)
  ratio_GPU <- runif(n, min = 2, max = 30)
  costs <- matrix(c(costs_CPU, costs_CPU / ratio_GPU), ncol = 2, byrow = FALSE)
  costs <- t(apply(costs, 1, sample))  # randomly swap the two costs of each task
  RR <- list(schedule_LPT_balance(costs, m, k, strategy = "relsuff"),
             schedule_DADA_custom(costs, m, k),
             schedule_CLB2C(costs, m, k),
             RCmax_schedule_balance_sufferage(PmPk2R(costs, m, k)),
             RCmax_schedule_MinMin(PmPk2R(costs, m, k)),
             schedule_greedy_balance(costs, m, k),
             schedule_DualHP(costs, m, k),
             schedule_HeteroPrio(costs, m, k))
  names(RR) <- c("LPT_bal", "DADA", "CLB2C", "balance_suff", "MinMin",
                 "greedy_bal", "DualHP", "HeteroPrio")
  map_dbl(RR, ~.$mks) / min(map_dbl(RR, ~.$mks))
})
data.frame(t(ratios)) %>%
  gather(key = "algo", value = "ratio") %>%
  ggplot(aes(x = algo, y = ratio)) +
  geom_boxplot()
There is a clear difference: greedy_bal is much better, while CLB2C performs far worse. Let’s revert to the asymmetric setting and see whether the variance of the CPU costs has any noticeable impact.
m <- 10
k <- m / 5
ratios <- replicate(n = 10, {
  costs_CPU <- runif(n, min = 29, max = 31)
  ratio_GPU <- runif(n, min = 2, max = 30)
  costs <- matrix(c(costs_CPU, costs_CPU / ratio_GPU), ncol = 2, byrow = FALSE)
  RR <- list(schedule_LPT_balance(costs, m, k, strategy = "relsuff"),
             schedule_DADA_custom(costs, m, k),
             schedule_CLB2C(costs, m, k),
             RCmax_schedule_balance_sufferage(PmPk2R(costs, m, k)),
             RCmax_schedule_MinMin(PmPk2R(costs, m, k)),
             schedule_greedy_balance(costs, m, k),
             schedule_DualHP(costs, m, k),
             schedule_HeteroPrio(costs, m, k))
  names(RR) <- c("LPT_bal", "DADA", "CLB2C", "balance_suff", "MinMin",
                 "greedy_bal", "DualHP", "HeteroPrio")
  map_dbl(RR, ~.$mks) / min(map_dbl(RR, ~.$mks))
})
data.frame(t(ratios)) %>%
  gather(key = "algo", value = "ratio") %>%
  ggplot(aes(x = algo, y = ratio)) +
  geom_boxplot()
Balance_suff performs better, but the impact is low. And with a larger variance:
ratios <- replicate(n = 10, {
  costs_CPU <- runif(n, min = 10, max = 50)
  ratio_GPU <- runif(n, min = 2, max = 30)
  costs <- matrix(c(costs_CPU, costs_CPU / ratio_GPU), ncol = 2, byrow = FALSE)
  RR <- list(schedule_LPT_balance(costs, m, k, strategy = "relsuff"),
             schedule_DADA_custom(costs, m, k),
             schedule_CLB2C(costs, m, k),
             RCmax_schedule_balance_sufferage(PmPk2R(costs, m, k)),
             RCmax_schedule_MinMin(PmPk2R(costs, m, k)),
             schedule_greedy_balance(costs, m, k),
             schedule_DualHP(costs, m, k),
             schedule_HeteroPrio(costs, m, k))
  names(RR) <- c("LPT_bal", "DADA", "CLB2C", "balance_suff", "MinMin",
                 "greedy_bal", "DualHP", "HeteroPrio")
  map_dbl(RR, ~.$mks) / min(map_dbl(RR, ~.$mks))
})
data.frame(t(ratios)) %>%
  gather(key = "algo", value = "ratio") %>%
  ggplot(aes(x = algo, y = ratio)) +
  geom_boxplot()
greedy_bal and DualHP seem to be favored, while DADA and CLB2C are less efficient. Now let’s see what happens when the cost reference is on the GPU rather than on the CPU:
ratios <- replicate(n = 10, {
  costs_GPU <- runif(n, min = 2.5, max = 3.5)
  ratio_GPU <- runif(n, min = 2, max = 30)
  costs <- matrix(c(costs_GPU * ratio_GPU, costs_GPU), ncol = 2, byrow = FALSE)
  RR <- list(schedule_LPT_balance(costs, m, k, strategy = "relsuff"),
             schedule_DADA_custom(costs, m, k),
             schedule_CLB2C(costs, m, k),
             RCmax_schedule_balance_sufferage(PmPk2R(costs, m, k)),
             RCmax_schedule_MinMin(PmPk2R(costs, m, k)),
             schedule_greedy_balance(costs, m, k),
             schedule_DualHP(costs, m, k),
             schedule_HeteroPrio(costs, m, k))
  names(RR) <- c("LPT_bal", "DADA", "CLB2C", "balance_suff", "MinMin",
                 "greedy_bal", "DualHP", "HeteroPrio")
  map_dbl(RR, ~.$mks) / min(map_dbl(RR, ~.$mks))
})
data.frame(t(ratios)) %>%
  gather(key = "algo", value = "ratio") %>%
  ggplot(aes(x = algo, y = ratio)) +
  geom_boxplot()
Now, let’s reduce the variance:
ratios <- replicate(n = 10, {
  costs_GPU <- runif(n, min = 2.9, max = 3.1)
  ratio_GPU <- runif(n, min = 2, max = 30)
  costs <- matrix(c(costs_GPU * ratio_GPU, costs_GPU), ncol = 2, byrow = FALSE)
  RR <- list(schedule_LPT_balance(costs, m, k, strategy = "relsuff"),
             schedule_DADA_custom(costs, m, k),
             schedule_CLB2C(costs, m, k),
             RCmax_schedule_balance_sufferage(PmPk2R(costs, m, k)),
             RCmax_schedule_MinMin(PmPk2R(costs, m, k)),
             schedule_greedy_balance(costs, m, k),
             schedule_DualHP(costs, m, k),
             schedule_HeteroPrio(costs, m, k))
  names(RR) <- c("LPT_bal", "DADA", "CLB2C", "balance_suff", "MinMin",
                 "greedy_bal", "DualHP", "HeteroPrio")
  map_dbl(RR, ~.$mks) / min(map_dbl(RR, ~.$mks))
})
data.frame(t(ratios)) %>%
  gather(key = "algo", value = "ratio") %>%
  ggplot(aes(x = algo, y = ratio)) +
  geom_boxplot()
And increase it:
ratios <- replicate(n = 10, {
  costs_GPU <- runif(n, min = 1, max = 5)
  ratio_GPU <- runif(n, min = 2, max = 30)
  costs <- matrix(c(costs_GPU * ratio_GPU, costs_GPU), ncol = 2, byrow = FALSE)
  RR <- list(schedule_LPT_balance(costs, m, k, strategy = "relsuff"),
             schedule_DADA_custom(costs, m, k),
             schedule_CLB2C(costs, m, k),
             RCmax_schedule_balance_sufferage(PmPk2R(costs, m, k)),
             RCmax_schedule_MinMin(PmPk2R(costs, m, k)),
             schedule_greedy_balance(costs, m, k),
             schedule_DualHP(costs, m, k),
             schedule_HeteroPrio(costs, m, k))
  names(RR) <- c("LPT_bal", "DADA", "CLB2C", "balance_suff", "MinMin",
                 "greedy_bal", "DualHP", "HeteroPrio")
  map_dbl(RR, ~.$mks) / min(map_dbl(RR, ~.$mks))
})
data.frame(t(ratios)) %>%
  gather(key = "algo", value = "ratio") %>%
  ggplot(aes(x = algo, y = ratio)) +
  geom_boxplot()
The variance of the GPU costs does not seem to have any noticeable impact. When the GPU (the small costs) is taken as the reference, greedy_bal is better than CLB2C, and HeteroPrio’s distance to the best solution is quite large.
We characterized the shape of realistic instances in linear algebra for \((Pm,Pk)||C_{\max}\) when precedence constraints are ignored. Here are the key points for random instances, and how they differ from realistic ones:

LPT_bal always has very good performance and is a good reference heuristic. greedy_bal has reasonable performance (always very good or good). CLB2C often has good performance, except on the most realistic instances, where it is very good and beats greedy_bal. HeteroPrio is mostly at 10% to 20% from the best solution.
## R version 3.3.2 (2016-10-31)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.1 LTS
##
## locale:
## [1] LC_CTYPE=fr_FR.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=fr_FR.UTF-8 LC_COLLATE=fr_FR.UTF-8
## [5] LC_MONETARY=fr_FR.UTF-8 LC_MESSAGES=fr_FR.UTF-8
## [7] LC_PAPER=fr_FR.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggplot2_2.1.0 dplyr_0.4.3 tidyr_0.2.0 purrr_0.2.0 stringr_1.0.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.7 knitr_1.14 magrittr_1.5 munsell_0.4
## [5] colorspace_1.2-2 R6_2.1.2 plyr_1.8.1 tools_3.3.2
## [9] parallel_3.3.2 grid_3.3.2 gtable_0.1.2 pacman_0.4.1
## [13] DBI_0.3.1 htmltools_0.3.5 yaml_2.1.13 assertthat_0.1
## [17] digest_0.6.9 tibble_1.2 reshape2_1.2.2 formatR_1.4
## [21] evaluate_0.10 rmarkdown_1.1 labeling_0.1 stringi_0.5-5
## [25] scales_0.4.0