Initial attempt to determine optimal solutions for array reductions

We consider the problem of scheduling array reductions on n machines with m segments. The first segments of all machines need to be reduced together to produce the final first reduced segment on machine 1 (the root machine for the reduction). Same for the second segments, the third, … (each segment can be reduced concurrently). The communication use the Hockney model: one communication cost includes the latency alpha and the time to transfer the segment depending on the bandwidth (whose inverse is beta). We also consider the time to reduce two segments gamma. The communication and computation may overlap.

Schedule evaluation

Let's start by representing a schedule as a matrix with machines on rows and segment on columns. For each segment on each processor, there is one destination:

the destination cannot be the same processor: s[i,] != i
segments on processor 1 stays on it (this is the root machine): s[1,] == 1

evaluate <- function(solution, alpha, beta, gamma) {
    n <- nrow(solution)
    m <- ncol(solution)

    # Arrays to keep the next time a processor is ready for reduction, for
    # sending and for receiving
    reducing <- rep(0, n)
    sending <- rep(0, n)
    receiving <- rep(0, n)
    ready <- matrix(0, nrow = n, ncol = m)

    # Start with segment 1 (first column) and find the processors that do not
    # receive any segment (and are thus ready to send their own)
    for (i in 1:m) {
        for (k in 1:(n - 1)) {
            # Select all the ready processors
            leaf <- !sapply(1:n, function(x) return(x %in% solution[, i]))
            if (!any(leaf)) 
                return(Inf)
            # Keep the one which can start the first
            from <- which(leaf)[which.min(pmax(ready[leaf, i], sending[leaf], 
                receiving[solution[leaf, i]] - alpha))]
            to <- solution[from, i]
            starting <- max(ready[from, i], sending[from], receiving[to] - alpha)
            # Performs the transfer and reduction
            sending[from] <- starting + beta
            receiving[to] <- starting + alpha + beta
            reducing[to] <- max(reducing[to], receiving[to]) + gamma
            ready[to, i] <- reducing[to]
            solution[from, i] <- from
        }
    }

    return(max(ready[1, ]))
}

This evaluation method assumes that a machine is busy only when actually sending or receiving the data, not during the latency period. The latency just corresponds to a shift of the communication (data start to be received after they have started to be sent).

Another evaluation method could consider backfilling, that is considering the leaves of all segments at each step instead of only the leaves of the current step.

Lastly, it is not guaranteed that it is optimal to consider the leaf which transfer will start the soonest.

Validation

Let's try some test:

n <- 3
m <- 2

solution1 <- matrix(c(1, 1, 1, 3, 1, 1), nrow = n, ncol = m, byrow = TRUE)
solution2 <- matrix(c(1, 1, 1, 1, 1, 1), nrow = n, ncol = m, byrow = TRUE)

c(evaluate(solution1, 0.1, 1, 0.3), 2 * 0.1 + 3 * 1 + 2 * 0.3)

## [1] 3.8 3.8

c(evaluate(solution1, 0.1, 1, 1.3), 2 * 0.1 + 3 * 1 + 2 * 1.3)

## [1] 5.8 5.8

c(evaluate(solution2, 1.1, 1, 0.3), 1.1 + 4 * max(1, 0.3) + min(1, 0.3))

## [1] 5.4 5.4

c(evaluate(solution2, 1.1, 1, 1.3), 1.1 + 4 * max(1, 1.3) + min(1, 1.3))

## [1] 7.3 7.3

Bruteforce search

Before testing all possible solutions, we can limit the results to the ones that are really distinct. For that, we test if two solutions are identical given a permutation of the machines:

find_permutation <- function(sol1, sol2, permutation = NULL) {
    n <- nrow(sol1)
    if (length(permutation) == n) 
        return(test_permutation(sol1, sol2, permutation))

    permutation <- c(permutation, NA)
    for (i in 1:n) {
        if (i %in% permutation) 
            next
        permutation[length(permutation)] <- i
        if (find_permutation(sol1, sol2, permutation)) 
            return(TRUE)
    }
    return(FALSE)
}

test_permutation <- function(sol1, sol2, permutation) {
    n <- nrow(sol1)
    m <- ncol(sol1)
    for (i in 2:n) for (j in 1:m) if (permutation[sol1[i, j]] != sol2[permutation[i], 
        j]) 
        return(FALSE)
    return(TRUE)
}

Now, let's try each possible combination:

find_min <- function(n, m, alpha, beta, gamma, solution = NULL, i = 2, j = 1) {
    if (is.null(solution)) 
        solution <- matrix(1, nrow = n, ncol = m)

    if (i == n + 1) 
        return(list(mks = evaluate(solution, alpha, beta, gamma), nb = 1, valid = 1, 
            min = 1, sols = list(solution)))

    result <- list(mks = Inf, nb = 0, valid = 0, min = 0, sols = list())
    for (k in 1:n) if (i != k) {
        solution[i, j] = k
        value <- find_min(n, m, alpha, beta, gamma, solution, i + (j == m), 
            j%%m + 1)
        result$nb <- result$nb + value$nb
        if (value$mks == Inf) 
            next
        result$valid <- result$valid + value$valid
        if (abs(value$mks - result$mks) < 1e-15) {
            result$min <- result$min + value$min
            for (sol1 in value$sols) {
                present <- FALSE
                for (sol2 in result$sols) if (find_permutation(sol1, sol2)) 
                  present <- TRUE
                if (!present) 
                  result$sols <- c(result$sols, list(sol1))
            }
        } else if (value$mks < result$mks) {
            result$mks <- value$mks
            result$min <- value$min
            result$sols <- value$sols
        }
    }
    return(result)
}

This method returns the minimum makespan, the number of tested solutions, the number of valid solutions, the number of optimal solutions and a list of distinct optimal solutions.

Its performance could be improved by partially evaluating the partial solution and continuing to the next if its value is infinite or simply greater than the currently found minimum makespan.

Validation

Let's check we got the same solution as before:

find_min(n, m, 0.1, 1, 0.3)

## $mks
## [1] 3.8
## 
## $nb
## [1] 16
## 
## $valid
## [1] 9
## 
## $min
## [1] 3
## 
## $sols
## $sols[[1]]
##      [,1] [,2]
## [1,]    1    1
## [2,]    1    1
## [3,]    2    2
## 
## $sols[[2]]
##      [,1] [,2]
## [1,]    1    1
## [2,]    1    3
## [3,]    1    1

find_min(n, m, 0.1, 1, 1.3)

## $mks
## [1] 5.8
## 
## $nb
## [1] 16
## 
## $valid
## [1] 9
## 
## $min
## [1] 1
## 
## $sols
## $sols[[1]]
##      [,1] [,2]
## [1,]    1    1
## [2,]    1    3
## [3,]    1    1

find_min(n, m, 1.1, 1, 0.3)

## $mks
## [1] 5.4
## 
## $nb
## [1] 16
## 
## $valid
## [1] 9
## 
## $min
## [1] 1
## 
## $sols
## $sols[[1]]
##      [,1] [,2]
## [1,]    1    1
## [2,]    1    1
## [3,]    1    1

find_min(n, m, 1.1, 1, 1.3)

## $mks
## [1] 7.3
## 
## $nb
## [1] 16
## 
## $valid
## [1] 9
## 
## $min
## [1] 1
## 
## $sols
## $sols[[1]]
##      [,1] [,2]
## [1,]    1    1
## [2,]    1    1
## [3,]    1    1

Performance analysis

Let's have a quick estimation of the time taken for each test case given n and m in minutes.

n <- 8
m <- 8
duration <- system.time(find_min(4, 3, 1, 1, 1))
timepersolution <- duration[[1]]/(4 - 1)^(3 * (4 - 1))/60
sapply(1:m, function(j) return(sapply(1:n, function(i) return(suppressWarnings(as.integer(timepersolution * 
    (i - 1)^(j * (i - 1))))))))

##      [,1]    [,2]       [,3]       [,4]     [,5] [,6]   [,7]    [,8]
## [1,]    0       0          0          0        0    0      0       0
## [2,]    0       0          0          0        0    0      0       0
## [3,]    0       0          0          0        0    0      0       0
## [4,]    0       0          0          5      158 4284 115676 3123278
## [5,]    0       0        185      47496 12159072   NA     NA      NA
## [6,]    0     107     337482 1054631456       NA   NA     NA      NA
## [7,]    0   24072 1123112189         NA       NA   NA     NA      NA
## [8,]    9 7500206         NA         NA       NA   NA     NA      NA

We can go up to 3 machines with up to 8 segments. For 4 machines, 4 segments at most. For 5 machines, 2 segments. For up to 8 machines, only one segment.

With more time (on a cluster for instance), we can test 4 machines and up to 6 segments, 5 machines and 3 segments, 6 machines and 2 segments.

Finally, the interesting cases are as follows:

no latency, unitary bandwidth and no computation
no latency, unitary bandwidth and computation (with overlapping)
unitary latency, infinite bandwidth and no computation
unitary latency, infinite bandwidth and unitary computation

Maybe having both unitary latency and bandwidth would be interesting also.

Visualization

Let's visualize each schedule. Three lines per machine, the first for the reduction activity, the second for the reception, the last for the emission. The first three lines are for the root machine.

plot_schedule <- function(solution, alpha, beta, gamma) {
    n <- nrow(solution)
    m <- ncol(solution)
    makespan <- evaluate(solution, alpha, beta, gamma)
    plot(c(0, makespan), c(0, -n), type = "n")

    reducing <- rep(0, n)
    sending <- rep(0, n)
    receiving <- rep(0, n)
    ready <- matrix(0, nrow = n, ncol = m)

    # Start with segment 1 (first column) and find the processors that do not
    # receive any segment (and are thus ready to send their own)
    for (i in 1:m) {
        for (k in 1:(n - 1)) {
            # Select all the ready processors
            leaf <- !sapply(1:n, function(x) return(x %in% solution[, i]))
            if (!any(leaf)) 
                return(Inf)
            # Keep the one which can start the first
            from <- which(leaf)[which.min(pmax(ready[leaf, i], sending[leaf], 
                receiving[solution[leaf, i]] - alpha))]
            to <- solution[from, i]
            starting <- max(ready[from, i], sending[from], receiving[to] - alpha)
            # Performs the transfer and reduction
            sending[from] <- starting + beta
            receiving[to] <- starting + alpha + beta
            reducing[to] <- max(reducing[to], receiving[to]) + gamma
            ready[to, i] <- reducing[to]
            solution[from, i] <- from
            # Plot transfer
            rect(sending[from] - beta, -(from - 1 + 2/3), sending[from], -(from - 
                1 + 3/3), col = colors()[k + i * n])
            text(sending[from] - beta/2, -(from - 1 + 5/6), i)
            rect(receiving[to] - beta, -(to - 1 + 1/3), receiving[to], -(to - 
                1 + 2/3), col = colors()[k + i * n])
            text(receiving[to] - beta/2, -(to - 1 + 3/6), i)
            rect(reducing[to] - gamma, -(to - 1 + 0/3), reducing[to], -(to - 
                1 + 1/3), col = colors()[k + i * n])
            text(reducing[to] - gamma/2, -(to - 1 + 1/6), i)
        }
    }
}

Validation

plot_schedule(solution1, 0.1, 1, 0.3)

plot of chunk unnamed-chunk-8

Simulation

Now, let's study more substantial schedules. Let's start with 4 machines and 4 segments. The code is similar to the evaluation code.

solutions <- find_min(4, 4, 1, 1, 1)
for (i in 1:length(solutions$sols)) plot_schedule(solutions$sols[[i]], 1, 1, 
    1)

plot of chunk unnamed-chunk-9

This last schedule clearly shows that a lower makespan could be reached by starting to send a part of segment 4 from machine 3 to machine 2 before segment 1 is sent on machine 3. This requires to revise the code. There are two solutions: either by making the evaluation code choosing the next transfer among all segments (however, a greedy may not be optimal), or by changing the representation of the schedule to contain the information about the order of the transfers (each machines would have a succession of pairs segment/destination). This last solution may be better as it costs the same development time and will ensure that the optimal solution is found. However, the complexity will considerably increase and optimization may be required in order to have interesting results. Here are the new estimations:

sapply(1:m, function(j) return(sapply(1:n, function(i) return(suppressWarnings(as.integer(timepersolution * 
    factorial(j)^(i - 1) * (i - 1)^(j * (i - 1))))))))

##      [,1]      [,2]   [,3]  [,4]      [,5]  [,6]    [,7]       [,8]
## [1,]    0         0      0     0         0     0       0          0
## [2,]    0         0      0     0         0     0       0          0
## [3,]    0         0      0     1       163 23481 4602371 1178207035
## [4,]    0         0     47 81243 274197312    NA      NA         NA
## [5,]    0        11 240450    NA        NA    NA      NA         NA
## [6,]    0      3455     NA    NA        NA    NA      NA         NA
## [7,]    0   1540620     NA    NA        NA    NA      NA         NA
## [8,]    9 960026373     NA    NA        NA    NA      NA         NA

The limit would then be 3 machines with 4 segments, 4 machines with 3 segments and the same for the others.