We consider the problem of scheduling array reductions on n machines with m segments. The first segments of all machines need to be reduced together to produce the final first reduced segment on machine 1 (the root machine for the reduction). Same for the second segments, the third, … (each segment can be reduced concurrently). The communication use the Hockney model: one communication cost includes the latency alpha and the time to transfer the segment depending on the bandwidth (whose inverse is beta). We also consider the time to reduce two segments gamma. The communication and computation may overlap.
Let's start by representing a schedule as a matrix with machines on rows and segment on columns. For each segment on each processor, there is one destination:
evaluate <- function(solution, alpha, beta, gamma) {
n <- nrow(solution)
m <- ncol(solution)
# Arrays to keep the next time a processor is ready for reduction, for
# sending and for receiving
reducing <- rep(0, n)
sending <- rep(0, n)
receiving <- rep(0, n)
ready <- matrix(0, nrow = n, ncol = m)
# Start with segment 1 (first column) and find the processors that do not
# receive any segment (and are thus ready to send their own)
for (i in 1:m) {
for (k in 1:(n - 1)) {
# Select all the ready processors
leaf <- !sapply(1:n, function(x) return(x %in% solution[, i]))
if (!any(leaf))
return(Inf)
# Keep the one which can start the first
from <- which(leaf)[which.min(pmax(ready[leaf, i], sending[leaf],
receiving[solution[leaf, i]] - alpha))]
to <- solution[from, i]
starting <- max(ready[from, i], sending[from], receiving[to] - alpha)
# Performs the transfer and reduction
sending[from] <- starting + beta
receiving[to] <- starting + alpha + beta
reducing[to] <- max(reducing[to], receiving[to]) + gamma
ready[to, i] <- reducing[to]
solution[from, i] <- from
}
}
return(max(ready[1, ]))
}
This evaluation method assumes that a machine is busy only when actually sending or receiving the data, not during the latency period. The latency just corresponds to a shift of the communication (data start to be received after they have started to be sent).
Another evaluation method could consider backfilling, that is considering the leaves of all segments at each step instead of only the leaves of the current step.
Lastly, it is not guaranteed that it is optimal to consider the leaf which transfer will start the soonest.
Let's try some test:
n <- 3
m <- 2
solution1 <- matrix(c(1, 1, 1, 3, 1, 1), nrow = n, ncol = m, byrow = TRUE)
solution2 <- matrix(c(1, 1, 1, 1, 1, 1), nrow = n, ncol = m, byrow = TRUE)
c(evaluate(solution1, 0.1, 1, 0.3), 2 * 0.1 + 3 * 1 + 2 * 0.3)
## [1] 3.8 3.8
c(evaluate(solution1, 0.1, 1, 1.3), 2 * 0.1 + 3 * 1 + 2 * 1.3)
## [1] 5.8 5.8
c(evaluate(solution2, 1.1, 1, 0.3), 1.1 + 4 * max(1, 0.3) + min(1, 0.3))
## [1] 5.4 5.4
c(evaluate(solution2, 1.1, 1, 1.3), 1.1 + 4 * max(1, 1.3) + min(1, 1.3))
## [1] 7.3 7.3
Before testing all possible solutions, we can limit the results to the ones that are really distinct. For that, we test if two solutions are identical given a permutation of the machines:
find_permutation <- function(sol1, sol2, permutation = NULL) {
n <- nrow(sol1)
if (length(permutation) == n)
return(test_permutation(sol1, sol2, permutation))
permutation <- c(permutation, NA)
for (i in 1:n) {
if (i %in% permutation)
next
permutation[length(permutation)] <- i
if (find_permutation(sol1, sol2, permutation))
return(TRUE)
}
return(FALSE)
}
test_permutation <- function(sol1, sol2, permutation) {
n <- nrow(sol1)
m <- ncol(sol1)
for (i in 2:n) for (j in 1:m) if (permutation[sol1[i, j]] != sol2[permutation[i],
j])
return(FALSE)
return(TRUE)
}
Now, let's try each possible combination:
find_min <- function(n, m, alpha, beta, gamma, solution = NULL, i = 2, j = 1) {
if (is.null(solution))
solution <- matrix(1, nrow = n, ncol = m)
if (i == n + 1)
return(list(mks = evaluate(solution, alpha, beta, gamma), nb = 1, valid = 1,
min = 1, sols = list(solution)))
result <- list(mks = Inf, nb = 0, valid = 0, min = 0, sols = list())
for (k in 1:n) if (i != k) {
solution[i, j] = k
value <- find_min(n, m, alpha, beta, gamma, solution, i + (j == m),
j%%m + 1)
result$nb <- result$nb + value$nb
if (value$mks == Inf)
next
result$valid <- result$valid + value$valid
if (abs(value$mks - result$mks) < 1e-15) {
result$min <- result$min + value$min
for (sol1 in value$sols) {
present <- FALSE
for (sol2 in result$sols) if (find_permutation(sol1, sol2))
present <- TRUE
if (!present)
result$sols <- c(result$sols, list(sol1))
}
} else if (value$mks < result$mks) {
result$mks <- value$mks
result$min <- value$min
result$sols <- value$sols
}
}
return(result)
}
This method returns the minimum makespan, the number of tested solutions, the number of valid solutions, the number of optimal solutions and a list of distinct optimal solutions.
Its performance could be improved by partially evaluating the partial solution and continuing to the next if its value is infinite or simply greater than the currently found minimum makespan.
Let's check we got the same solution as before:
find_min(n, m, 0.1, 1, 0.3)
## $mks
## [1] 3.8
##
## $nb
## [1] 16
##
## $valid
## [1] 9
##
## $min
## [1] 3
##
## $sols
## $sols[[1]]
## [,1] [,2]
## [1,] 1 1
## [2,] 1 1
## [3,] 2 2
##
## $sols[[2]]
## [,1] [,2]
## [1,] 1 1
## [2,] 1 3
## [3,] 1 1
find_min(n, m, 0.1, 1, 1.3)
## $mks
## [1] 5.8
##
## $nb
## [1] 16
##
## $valid
## [1] 9
##
## $min
## [1] 1
##
## $sols
## $sols[[1]]
## [,1] [,2]
## [1,] 1 1
## [2,] 1 3
## [3,] 1 1
find_min(n, m, 1.1, 1, 0.3)
## $mks
## [1] 5.4
##
## $nb
## [1] 16
##
## $valid
## [1] 9
##
## $min
## [1] 1
##
## $sols
## $sols[[1]]
## [,1] [,2]
## [1,] 1 1
## [2,] 1 1
## [3,] 1 1
find_min(n, m, 1.1, 1, 1.3)
## $mks
## [1] 7.3
##
## $nb
## [1] 16
##
## $valid
## [1] 9
##
## $min
## [1] 1
##
## $sols
## $sols[[1]]
## [,1] [,2]
## [1,] 1 1
## [2,] 1 1
## [3,] 1 1
Let's have a quick estimation of the time taken for each test case given n and m in minutes.
n <- 8
m <- 8
duration <- system.time(find_min(4, 3, 1, 1, 1))
timepersolution <- duration[[1]]/(4 - 1)^(3 * (4 - 1))/60
sapply(1:m, function(j) return(sapply(1:n, function(i) return(suppressWarnings(as.integer(timepersolution *
(i - 1)^(j * (i - 1))))))))
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,] 0 0 0 0 0 0 0 0
## [2,] 0 0 0 0 0 0 0 0
## [3,] 0 0 0 0 0 0 0 0
## [4,] 0 0 0 5 158 4284 115676 3123278
## [5,] 0 0 185 47496 12159072 NA NA NA
## [6,] 0 107 337482 1054631456 NA NA NA NA
## [7,] 0 24072 1123112189 NA NA NA NA NA
## [8,] 9 7500206 NA NA NA NA NA NA
We can go up to 3 machines with up to 8 segments. For 4 machines, 4 segments at most. For 5 machines, 2 segments. For up to 8 machines, only one segment.
With more time (on a cluster for instance), we can test 4 machines and up to 6 segments, 5 machines and 3 segments, 6 machines and 2 segments.
Finally, the interesting cases are as follows:
Maybe having both unitary latency and bandwidth would be interesting also.
Let's visualize each schedule. Three lines per machine, the first for the reduction activity, the second for the reception, the last for the emission. The first three lines are for the root machine.
plot_schedule <- function(solution, alpha, beta, gamma) {
n <- nrow(solution)
m <- ncol(solution)
makespan <- evaluate(solution, alpha, beta, gamma)
plot(c(0, makespan), c(0, -n), type = "n")
reducing <- rep(0, n)
sending <- rep(0, n)
receiving <- rep(0, n)
ready <- matrix(0, nrow = n, ncol = m)
# Start with segment 1 (first column) and find the processors that do not
# receive any segment (and are thus ready to send their own)
for (i in 1:m) {
for (k in 1:(n - 1)) {
# Select all the ready processors
leaf <- !sapply(1:n, function(x) return(x %in% solution[, i]))
if (!any(leaf))
return(Inf)
# Keep the one which can start the first
from <- which(leaf)[which.min(pmax(ready[leaf, i], sending[leaf],
receiving[solution[leaf, i]] - alpha))]
to <- solution[from, i]
starting <- max(ready[from, i], sending[from], receiving[to] - alpha)
# Performs the transfer and reduction
sending[from] <- starting + beta
receiving[to] <- starting + alpha + beta
reducing[to] <- max(reducing[to], receiving[to]) + gamma
ready[to, i] <- reducing[to]
solution[from, i] <- from
# Plot transfer
rect(sending[from] - beta, -(from - 1 + 2/3), sending[from], -(from -
1 + 3/3), col = colors()[k + i * n])
text(sending[from] - beta/2, -(from - 1 + 5/6), i)
rect(receiving[to] - beta, -(to - 1 + 1/3), receiving[to], -(to -
1 + 2/3), col = colors()[k + i * n])
text(receiving[to] - beta/2, -(to - 1 + 3/6), i)
rect(reducing[to] - gamma, -(to - 1 + 0/3), reducing[to], -(to -
1 + 1/3), col = colors()[k + i * n])
text(reducing[to] - gamma/2, -(to - 1 + 1/6), i)
}
}
}
plot_schedule(solution1, 0.1, 1, 0.3)
Now, let's study more substantial schedules. Let's start with 4 machines and 4 segments. The code is similar to the evaluation code.
solutions <- find_min(4, 4, 1, 1, 1)
for (i in 1:length(solutions$sols)) plot_schedule(solutions$sols[[i]], 1, 1,
1)
This last schedule clearly shows that a lower makespan could be reached by starting to send a part of segment 4 from machine 3 to machine 2 before segment 1 is sent on machine 3. This requires to revise the code. There are two solutions: either by making the evaluation code choosing the next transfer among all segments (however, a greedy may not be optimal), or by changing the representation of the schedule to contain the information about the order of the transfers (each machines would have a succession of pairs segment/destination). This last solution may be better as it costs the same development time and will ensure that the optimal solution is found. However, the complexity will considerably increase and optimization may be required in order to have interesting results. Here are the new estimations:
sapply(1:m, function(j) return(sapply(1:n, function(i) return(suppressWarnings(as.integer(timepersolution *
factorial(j)^(i - 1) * (i - 1)^(j * (i - 1))))))))
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,] 0 0 0 0 0 0 0 0
## [2,] 0 0 0 0 0 0 0 0
## [3,] 0 0 0 1 163 23481 4602371 1178207035
## [4,] 0 0 47 81243 274197312 NA NA NA
## [5,] 0 11 240450 NA NA NA NA NA
## [6,] 0 3455 NA NA NA NA NA NA
## [7,] 0 1540620 NA NA NA NA NA NA
## [8,] 9 960026373 NA NA NA NA NA NA
The limit would then be 3 machines with 4 segments, 4 machines with 3 segments and the same for the others.