In this study, we compare several methods from the literature to the proposed ones on a limited set of instances. Let’s first analyze the shape of realistic instances.
In “Scheduling of Linear Algebra Kernels on Multiple Heterogeneous Resources” (Beaumont et al. 2016), data are provided for 8 types of linear algebra tasks (from the Cholesky and QR factorizations), which we extract below. Moreover, in the experimental section, there are 24 CPU cores and 4 GPU units.
# CPU cost and GPU acceleration ratio for the 8 task types (Beaumont et al. 2016)
costs_CPU <- c(27.78, 34.42, 31.52, 36.46, 22.08, 33.78, 17.63, 32.93)
ratio_GPU <- c(1.72, 8.72, 26.96, 28.80, 1.91, 15.90, 1.87, 14.64)
costs_type <- matrix(c(costs_CPU, costs_CPU / ratio_GPU),
                     ncol = 2, byrow = FALSE)
In “Are Static Schedules so Bad? A Case Study on Cholesky Factorization” (Agullo et al. 2016), the Cholesky factorization of a 5x5 tile matrix uses 5 POTRF tasks, 10 TRSM tasks, 10 SYRK tasks and 10 GEMM tasks. However, the experiments are done with 12x12 tile matrices, which contain 12 POTRF tasks, 50 to 100 TRSM and SYRK tasks, and around 200 GEMM tasks.
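These counts follow the standard formulas for a tiled Cholesky factorization; a quick sketch (the function name is ours) recovers both the 5x5 and 12x12 numbers:

```r
# Task counts for an N x N tiled Cholesky factorization
# (standard formulas, consistent with the numbers above).
cholesky_counts <- function(N) {
  c(POTRF = N,
    TRSM  = N * (N - 1) / 2,
    SYRK  = N * (N - 1) / 2,
    GEMM  = N * (N - 1) * (N - 2) / 6)
}
cholesky_counts(5)   # 5 POTRF, 10 TRSM, 10 SYRK, 10 GEMM
cholesky_counts(12)  # 12 POTRF, 66 TRSM, 66 SYRK, 220 GEMM
```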
In “Bridging the Gap between Performance and Bounds of Cholesky Factorization on Heterogeneous Platforms” (Agullo et al. 2015), there were 12 CPU cores and 3 GPU units. The execution ratios were similar.
Overall, we observe 300 to 400 tasks for 15 to 30 machines (with 5 CPU cores for each GPU). The execution time on a CPU is mostly homogeneous (between 25 and 35) and the GPU speedup is between 2 and 30.
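These ranges can be checked directly against the costs extracted from Beaumont et al. above:

```r
# Costs extracted above from Beaumont et al. (2016)
costs_CPU <- c(27.78, 34.42, 31.52, 36.46, 22.08, 33.78, 17.63, 32.93)
ratio_GPU <- c(1.72, 8.72, 26.96, 28.80, 1.91, 15.90, 1.87, 14.64)
range(costs_CPU)  # 17.63 36.46 -- mostly between 25 and 35, with two outliers
range(ratio_GPU)  # 1.72 28.80
```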
A realistic instance could be the following:
n <- 300    # number of tasks
m <- 20     # number of CPU cores
k <- m / 5  # number of GPUs (one GPU for 5 CPU cores)
costs_CPU <- runif(n, min = 25, max = 35)
ratio_GPU <- runif(n, min = 2, max = 30)
costs <- matrix(c(costs_CPU, costs_CPU / ratio_GPU), ncol = 2, byrow = FALSE)
Let’s start with this instance (we will vary it later on).
RR <- c(all_proposed_methods(costs, m, k),
        all_literature_methods(costs, m, k),
        all_RCmax_methods(PmPk2R(costs, m, k), genetic = FALSE))
sort(map_dbl(RR, ~.$mks) / min(map_dbl(RR, ~.$mks)))
## LPT_balance_largest LPT_balance_relsuff LPT_balance_abssuff
## 1.000000 1.000000 1.000000
## DADA_custom DADA CLB2C
## 1.016400 1.016400 1.018912
## balance_sufferage MinMin EFT
## 1.061681 1.101725 1.101725
## greedy_balance DualHP HeteroPrio
## 1.102862 1.102862 1.123810
## Al5 greedy_Suff balance_EFT
## 1.156739 1.188485 1.195541
## HEFT EFT_SPT HEFT
## 1.198154 1.237431 1.259803
## classic_sufferage HEFT_min MCT
## 1.268144 1.278374 1.300952
## EFT_RATIO EFT_LPT Al4
## 1.303998 1.337192 1.426099
## RAND MaxMin MET
## 1.507865 1.522923 2.445892
Our proposed algorithm (greedy balance) performs well but is beaten by CLB2C, which has a similar complexity. Let’s repeat this test a few times to see if there is a tendency.
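As an aside, the `PmPk2R` helper used above converts the \((Pm,Pk)\) instance into an equivalent \(R||C_{\max}\) instance. A minimal sketch of what this conversion presumably computes (a hypothetical reimplementation, not the package’s actual code):

```r
# Hypothetical sketch of the PmPk2R conversion: repeat the CPU cost column
# m times and the GPU cost column k times to build the n x (m + k) cost
# matrix of the equivalent unrelated-machines (R||Cmax) instance.
PmPk2R_sketch <- function(costs, m, k) {
  cbind(matrix(rep(costs[, 1], m), ncol = m),
        matrix(rep(costs[, 2], k), ncol = k))
}
```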
for (i in 1:4) {
  costs_CPU <- runif(n, min = 25, max = 35)
  ratio_GPU <- runif(n, min = 2, max = 30)
  costs <- matrix(c(costs_CPU, costs_CPU / ratio_GPU), ncol = 2, byrow = FALSE)
  RR <- c(all_proposed_methods(costs, m, k),
          all_literature_methods(costs, m, k),
          all_RCmax_methods(PmPk2R(costs, m, k), genetic = FALSE))
  print(sort(map_dbl(RR, ~.$mks) / min(map_dbl(RR, ~.$mks))))
}
## CLB2C LPT_balance_relsuff LPT_balance_largest
## 1.000000 1.012211 1.012352
## LPT_balance_abssuff greedy_balance DualHP
## 1.012352 1.030137 1.030137
## DADA_custom DADA MinMin
## 1.030827 1.030827 1.040576
## EFT balance_sufferage HeteroPrio
## 1.040576 1.054369 1.092631
## greedy_Suff balance_EFT Al5
## 1.142153 1.156195 1.168489
## HEFT MCT HEFT_min
## 1.211996 1.222186 1.265513
## EFT_RATIO classic_sufferage HEFT
## 1.278915 1.289872 1.290996
## EFT_LPT Al4 EFT_SPT
## 1.294736 1.329526 1.362950
## MaxMin RAND MET
## 1.552343 1.596176 2.281607
## DADA_custom CLB2C DADA
## 1.000000 1.000000 1.000000
## LPT_balance_largest LPT_balance_relsuff LPT_balance_abssuff
## 1.001910 1.001910 1.001910
## balance_sufferage greedy_balance DualHP
## 1.030351 1.035759 1.035759
## MinMin EFT HeteroPrio
## 1.049780 1.049780 1.082929
## Al5 balance_EFT MCT
## 1.127980 1.154270 1.192347
## HEFT greedy_Suff EFT_RATIO
## 1.195277 1.200335 1.241416
## HEFT_min classic_sufferage Al4
## 1.250127 1.268132 1.280783
## HEFT EFT_LPT EFT_SPT
## 1.287480 1.323888 1.328885
## MaxMin RAND MET
## 1.580707 1.610895 2.665139
## LPT_balance_relsuff LPT_balance_largest LPT_balance_abssuff
## 1.000000 1.001278 1.001278
## DADA_custom DADA CLB2C
## 1.005781 1.005781 1.012939
## balance_EFT MinMin EFT
## 1.078193 1.081907 1.081907
## greedy_balance DualHP balance_sufferage
## 1.084694 1.084694 1.098518
## HeteroPrio Al5 greedy_Suff
## 1.118397 1.129244 1.155957
## HEFT MCT EFT_SPT
## 1.217090 1.231622 1.249898
## HEFT_min EFT_RATIO Al4
## 1.268463 1.292322 1.300554
## EFT_LPT classic_sufferage HEFT
## 1.313527 1.327194 1.337512
## MaxMin RAND MET
## 1.497012 1.524540 2.294017
## LPT_balance_relsuff DADA_custom DADA
## 1.000000 1.000970 1.000970
## LPT_balance_largest LPT_balance_abssuff CLB2C
## 1.002029 1.002029 1.012285
## MinMin EFT greedy_balance
## 1.086352 1.086352 1.088976
## DualHP balance_sufferage balance_EFT
## 1.088976 1.089116 1.103053
## HeteroPrio Al5 HEFT
## 1.112535 1.161898 1.193266
## greedy_Suff EFT_SPT EFT_RATIO
## 1.210882 1.234799 1.245026
## MCT EFT_LPT HEFT_min
## 1.257792 1.258217 1.259687
## HEFT classic_sufferage Al4
## 1.260946 1.268762 1.373985
## MaxMin RAND MET
## 1.453621 1.469787 2.065068
CLB2C is often in the first six positions, our algorithm is often around position 9, and HeteroPrio is often around position 12. Let’s focus on a subset of good heuristics and competing algorithms: LPT_balance, DADA_custom, CLB2C, balance_sufferage, MinMin, greedy_balance, DualHP and HeteroPrio.
ratios <- replicate(n = 10, {
  costs_CPU <- runif(n, min = 25, max = 35)
  ratio_GPU <- runif(n, min = 2, max = 30)
  costs <- matrix(c(costs_CPU, costs_CPU / ratio_GPU), ncol = 2, byrow = FALSE)
  RR <- list(schedule_LPT_balance(costs, m, k, strategy = "relsuff"),
             schedule_DADA_custom(costs, m, k),
             schedule_CLB2C(costs, m, k),
             RCmax_schedule_balance_sufferage(PmPk2R(costs, m, k)),
             RCmax_schedule_MinMin(PmPk2R(costs, m, k)),
             schedule_greedy_balance(costs, m, k),
             schedule_DualHP(costs, m, k),
             schedule_HeteroPrio(costs, m, k))
  names(RR) <- c("LPT_bal", "DADA", "CLB2C", "balance_suff", "MinMin",
                 "greedy_bal", "DualHP", "HeteroPrio")
  map_dbl(RR, ~.$mks) / min(map_dbl(RR, ~.$mks))
})
Let’s plot the result.
data.frame(t(ratios)) %>%
  gather(key = "algo", value = "ratio") %>%
  ggplot(aes(x = algo, y = ratio)) +
  geom_boxplot()
There seem to be three sets of heuristics: LPT_bal, DADA and CLB2C, which are very close and all very good; MinMin, greedy_bal and DualHP, which are comparable; and HeteroPrio, which is slightly worse. balance_suff falls between the last two sets.
Let’s try smaller instances (to gain computation speed and run more simulations).
n <- 100
m <- 10
k <- m / 5
ratios <- replicate(n = 10, {
  costs_CPU <- runif(n, min = 25, max = 35)
  ratio_GPU <- runif(n, min = 2, max = 30)
  costs <- matrix(c(costs_CPU, costs_CPU / ratio_GPU), ncol = 2, byrow = FALSE)
  RR <- list(schedule_LPT_balance(costs, m, k, strategy = "relsuff"),
             schedule_DADA_custom(costs, m, k),
             schedule_CLB2C(costs, m, k),
             RCmax_schedule_balance_sufferage(PmPk2R(costs, m, k)),
             RCmax_schedule_MinMin(PmPk2R(costs, m, k)),
             schedule_greedy_balance(costs, m, k),
             schedule_DualHP(costs, m, k),
             schedule_HeteroPrio(costs, m, k))
  names(RR) <- c("LPT_bal", "DADA", "CLB2C", "balance_suff", "MinMin",
                 "greedy_bal", "DualHP", "HeteroPrio")
  map_dbl(RR, ~.$mks) / min(map_dbl(RR, ~.$mks))
})
data.frame(t(ratios)) %>%
  gather(key = "algo", value = "ratio") %>%
  ggplot(aes(x = algo, y = ratio)) +
  geom_boxplot()
The same tendencies can be observed, except that the last four heuristics are worse. Let’s try a more symmetric scenario, with as many GPUs as CPUs and shuffled costs.
m <- 6
k <- 6
ratios <- replicate(n = 10, {
  costs_CPU <- runif(n, min = 25, max = 35)
  ratio_GPU <- runif(n, min = 2, max = 30)
  costs <- matrix(c(costs_CPU, costs_CPU / ratio_GPU), ncol = 2, byrow = FALSE)
  costs <- t(apply(costs, 1, sample))  # randomly swap the two costs of each task
  RR <- list(schedule_LPT_balance(costs, m, k, strategy = "relsuff"),
             schedule_DADA_custom(costs, m, k),
             schedule_CLB2C(costs, m, k),
             RCmax_schedule_balance_sufferage(PmPk2R(costs, m, k)),
             RCmax_schedule_MinMin(PmPk2R(costs, m, k)),
             schedule_greedy_balance(costs, m, k),
             schedule_DualHP(costs, m, k),
             schedule_HeteroPrio(costs, m, k))
  names(RR) <- c("LPT_bal", "DADA", "CLB2C", "balance_suff", "MinMin",
                 "greedy_bal", "DualHP", "HeteroPrio")
  map_dbl(RR, ~.$mks) / min(map_dbl(RR, ~.$mks))
})
data.frame(t(ratios)) %>%
  gather(key = "algo", value = "ratio") %>%
  ggplot(aes(x = algo, y = ratio)) +
  geom_boxplot()
There is a clear difference: greedy_bal is much better, while CLB2C performs far worse. Let’s revert to the asymmetric setting and see whether the variance of the CPU costs has any noticeable impact.
m <- 10
k <- m / 5
ratios <- replicate(n = 10, {
  costs_CPU <- runif(n, min = 29, max = 31)
  ratio_GPU <- runif(n, min = 2, max = 30)
  costs <- matrix(c(costs_CPU, costs_CPU / ratio_GPU), ncol = 2, byrow = FALSE)
  RR <- list(schedule_LPT_balance(costs, m, k, strategy = "relsuff"),
             schedule_DADA_custom(costs, m, k),
             schedule_CLB2C(costs, m, k),
             RCmax_schedule_balance_sufferage(PmPk2R(costs, m, k)),
             RCmax_schedule_MinMin(PmPk2R(costs, m, k)),
             schedule_greedy_balance(costs, m, k),
             schedule_DualHP(costs, m, k),
             schedule_HeteroPrio(costs, m, k))
  names(RR) <- c("LPT_bal", "DADA", "CLB2C", "balance_suff", "MinMin",
                 "greedy_bal", "DualHP", "HeteroPrio")
  map_dbl(RR, ~.$mks) / min(map_dbl(RR, ~.$mks))
})
data.frame(t(ratios)) %>%
  gather(key = "algo", value = "ratio") %>%
  ggplot(aes(x = algo, y = ratio)) +
  geom_boxplot()
Balance_suff performs better, but the impact is low. And with a larger variance:
ratios <- replicate(n = 10, {
  costs_CPU <- runif(n, min = 10, max = 50)
  ratio_GPU <- runif(n, min = 2, max = 30)
  costs <- matrix(c(costs_CPU, costs_CPU / ratio_GPU), ncol = 2, byrow = FALSE)
  RR <- list(schedule_LPT_balance(costs, m, k, strategy = "relsuff"),
             schedule_DADA_custom(costs, m, k),
             schedule_CLB2C(costs, m, k),
             RCmax_schedule_balance_sufferage(PmPk2R(costs, m, k)),
             RCmax_schedule_MinMin(PmPk2R(costs, m, k)),
             schedule_greedy_balance(costs, m, k),
             schedule_DualHP(costs, m, k),
             schedule_HeteroPrio(costs, m, k))
  names(RR) <- c("LPT_bal", "DADA", "CLB2C", "balance_suff", "MinMin",
                 "greedy_bal", "DualHP", "HeteroPrio")
  map_dbl(RR, ~.$mks) / min(map_dbl(RR, ~.$mks))
})
data.frame(t(ratios)) %>%
  gather(key = "algo", value = "ratio") %>%
  ggplot(aes(x = algo, y = ratio)) +
  geom_boxplot()
greedy_bal and DualHP seem to be favored, while DADA and CLB2C are less efficient. Now let’s see what happens when the cost reference is on the GPU rather than on the CPU:
ratios <- replicate(n = 10, {
  costs_GPU <- runif(n, min = 2.5, max = 3.5)
  ratio_GPU <- runif(n, min = 2, max = 30)
  costs <- matrix(c(costs_GPU * ratio_GPU, costs_GPU), ncol = 2, byrow = FALSE)
  RR <- list(schedule_LPT_balance(costs, m, k, strategy = "relsuff"),
             schedule_DADA_custom(costs, m, k),
             schedule_CLB2C(costs, m, k),
             RCmax_schedule_balance_sufferage(PmPk2R(costs, m, k)),
             RCmax_schedule_MinMin(PmPk2R(costs, m, k)),
             schedule_greedy_balance(costs, m, k),
             schedule_DualHP(costs, m, k),
             schedule_HeteroPrio(costs, m, k))
  names(RR) <- c("LPT_bal", "DADA", "CLB2C", "balance_suff", "MinMin",
                 "greedy_bal", "DualHP", "HeteroPrio")
  map_dbl(RR, ~.$mks) / min(map_dbl(RR, ~.$mks))
})
data.frame(t(ratios)) %>%
  gather(key = "algo", value = "ratio") %>%
  ggplot(aes(x = algo, y = ratio)) +
  geom_boxplot()
Now, let’s reduce the variance:
ratios <- replicate(n = 10, {
  costs_GPU <- runif(n, min = 2.9, max = 3.1)
  ratio_GPU <- runif(n, min = 2, max = 30)
  costs <- matrix(c(costs_GPU * ratio_GPU, costs_GPU), ncol = 2, byrow = FALSE)
  RR <- list(schedule_LPT_balance(costs, m, k, strategy = "relsuff"),
             schedule_DADA_custom(costs, m, k),
             schedule_CLB2C(costs, m, k),
             RCmax_schedule_balance_sufferage(PmPk2R(costs, m, k)),
             RCmax_schedule_MinMin(PmPk2R(costs, m, k)),
             schedule_greedy_balance(costs, m, k),
             schedule_DualHP(costs, m, k),
             schedule_HeteroPrio(costs, m, k))
  names(RR) <- c("LPT_bal", "DADA", "CLB2C", "balance_suff", "MinMin",
                 "greedy_bal", "DualHP", "HeteroPrio")
  map_dbl(RR, ~.$mks) / min(map_dbl(RR, ~.$mks))
})
data.frame(t(ratios)) %>%
  gather(key = "algo", value = "ratio") %>%
  ggplot(aes(x = algo, y = ratio)) +
  geom_boxplot()
And increase it:
ratios <- replicate(n = 10, {
  costs_GPU <- runif(n, min = 1, max = 5)
  ratio_GPU <- runif(n, min = 2, max = 30)
  costs <- matrix(c(costs_GPU * ratio_GPU, costs_GPU), ncol = 2, byrow = FALSE)
  RR <- list(schedule_LPT_balance(costs, m, k, strategy = "relsuff"),
             schedule_DADA_custom(costs, m, k),
             schedule_CLB2C(costs, m, k),
             RCmax_schedule_balance_sufferage(PmPk2R(costs, m, k)),
             RCmax_schedule_MinMin(PmPk2R(costs, m, k)),
             schedule_greedy_balance(costs, m, k),
             schedule_DualHP(costs, m, k),
             schedule_HeteroPrio(costs, m, k))
  names(RR) <- c("LPT_bal", "DADA", "CLB2C", "balance_suff", "MinMin",
                 "greedy_bal", "DualHP", "HeteroPrio")
  map_dbl(RR, ~.$mks) / min(map_dbl(RR, ~.$mks))
})
data.frame(t(ratios)) %>%
  gather(key = "algo", value = "ratio") %>%
  ggplot(aes(x = algo, y = ratio)) +
  geom_boxplot()
The variance of the GPU costs does not seem to have any noticeable impact. When the GPU (the small costs) is taken as the reference, greedy_bal is better than CLB2C, and HeteroPrio’s distance to the best solution is quite large.
We characterized the shape of realistic instances in linear algebra for \((Pm,Pk)||C_{\max}\) when precedence constraints are ignored. Here are the key points for random instances, and how they differ from realistic ones:

LPT_bal always has very good performance and is a good reference heuristic. greedy_bal has reasonable performance (always very good or good). CLB2C often has good performance, except on the most realistic instances, where it is very good and beats greedy_bal. HeteroPrio is mostly at 10% to 20% from the best solution.
## R version 3.3.2 (2016-10-31)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.1 LTS
##
## locale:
## [1] LC_CTYPE=fr_FR.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=fr_FR.UTF-8 LC_COLLATE=fr_FR.UTF-8
## [5] LC_MONETARY=fr_FR.UTF-8 LC_MESSAGES=fr_FR.UTF-8
## [7] LC_PAPER=fr_FR.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggplot2_2.1.0 dplyr_0.4.3 tidyr_0.2.0 purrr_0.2.0 stringr_1.0.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.7 knitr_1.14 magrittr_1.5 munsell_0.4
## [5] colorspace_1.2-2 R6_2.1.2 plyr_1.8.1 tools_3.3.2
## [9] parallel_3.3.2 grid_3.3.2 gtable_0.1.2 pacman_0.4.1
## [13] DBI_0.3.1 htmltools_0.3.5 yaml_2.1.13 assertthat_0.1
## [17] digest_0.6.9 tibble_1.2 reshape2_1.2.2 formatR_1.4
## [21] evaluate_0.10 rmarkdown_1.1 labeling_0.1 stringi_0.5-5
## [25] scales_0.4.0