The previous studies suggested that the algorithm chosen for some message sizes (between 10 kB and 300 kB) is sub-optimal. Let’s investigate by testing the other algorithms and studying their parameters. We will first test each algorithm in turn and set its parameters to optimize the median case. Then, we will compare all algorithms together. Finally, it may be interesting to compare the default performance with the one we get by setting a few parameters.
The parameters of interest are:

- coll_tuned_use_dynamic_rules to activate dynamic decisions (set to 1 in this study)
- coll_tuned_reduce_algorithm to select the algorithm: 1: “linear”, 2: “chain”, 3: “pipeline”, 4: “binary”, 5: “binomial”, 6: “in-order_binary”
- coll_tuned_reduce_algorithm_segmentsize for the segment size of all algorithms except linear (it goes from 1 kiB to 64 kiB, with independent experiments suggesting that 2 kiB to 8 kiB may be the most relevant range)
- coll_tuned_reduce_algorithm_max_requests for the maximum number of outstanding send requests of all algorithms except linear (no limit by default)
- coll_tuned_reduce_algorithm_chain_fanout for the fanout of the chain algorithm (4 by default)
- coll_tuned_dynamic_rules_filename for specifying new dynamic rules (described in “Flexible collective communication tuning architecture applied to open mpi”, supposedly published at EuroMPI 2006 even though it does not appear in the proceedings)

We use a similar approach as in the last study, except for a few changes:
- additional information is recorded in the result files through the --params option

Let’s set up the environment (we keep 30 repetitions for now because it is a coarse analysis):
# Check nobody is on the cluster
for i in $(seq 0 35); do echo -n $i; ssh jupiter$i w; done | grep -v USER
# Set Open MPI version
VERSION=openmpi-1.10.2
cd ~ && rm bin && ln -s ompi/${VERSION}/bin bin
echo Forcing NFS synchro
for i in $(seq 0 35); do ssh jupiter$i ls -l; done > /dev/null
# Directory containing final results
RESULT_DIR=$PWD/algorithms
mkdir -p ${RESULT_DIR}
# Nodes to use for XP
> ${RESULT_DIR}/hostfile
for i in $(seq 3 18) $(seq 20 35)
do
echo jupiter$i >> ${RESULT_DIR}/hostfile
done
# Clean environment
mpirun --pernode --hostfile ${RESULT_DIR}/hostfile orte-clean
# General variables
TIMEOUT=100
REPETITION=30
REPEAT=30
SIZES=100,300,1000,3000,10000,30000,100000,300000,1000000,3000000,10000000
And now, the adapted variation script (additional information will be available in the result files and the results are structured by algorithm):
function variate_parameter {
SUBDIR=$1
PARAMETER=$2
VALUES=$3
SETTINGS=${@:4}
mkdir -p ${RESULT_DIR}/${SUBDIR}/${PARAMETER}
ompi_info -c > ${RESULT_DIR}/${SUBDIR}/${PARAMETER}/info
mpirun --version > ${RESULT_DIR}/${SUBDIR}/${PARAMETER}/version
for i in $(seq 1 ${REPEAT})
do
echo Iteration ${i} of ${REPEAT} with ${REPETITION} measures per size
for VALUE in $(shuf -e -- ${VALUES})
do
echo -n " Launch benchmark"
echo " for parameter ${PARAMETER} with value ${VALUE}"
timeout ${TIMEOUT} mpirun --mca ${PARAMETER} ${VALUE} ${SETTINGS} \
-n 512 --npernode 16 --hostfile ${RESULT_DIR}/hostfile \
mpibenchmark-0.9.4 --calls-list=MPI_Reduce \
--params=parameter:${PARAMETER},value:${VALUE},iteration:${i} \
--msizes-list=${SIZES} -r ${REPETITION} --shuffle-jobs 2>&1 > \
${RESULT_DIR}/${SUBDIR}/${PARAMETER}/result_${VALUE}_${i}
done
done
}
Retrieving the data therefore only consists of the following (no processing):
function get_results {
RESULT_DIR=results
mkdir -p ${RESULT_DIR}
echo Retrieving data
rsync --recursive jupiter:algorithms/* ${RESULT_DIR}/
}
We will now consider each algorithm in turn and set its parameters to coarsely optimize the median case (only improvements greater than 10-20% are considered).
The linear algorithm does not need to be tuned, so let’s start with chain. We keep the default value for btl_openib_eager_limit because this is a coarse tuning, even though changing it may slightly improve the performance.
We start by studying the effect of the segment size. The tested range is quite large in order to get an overview of the magnitude of the effect. Due to a mistake in the arguments, the message size order is not randomized in this test. We keep the results nonetheless as we expect this to have no effect.
variate_parameter chain coll_tuned_reduce_algorithm_segmentsize \
"1024 4096 16384 65536" \
"--mca hwloc_base_binding_policy core" \
"--mca mpi_leave_pinned 0" \
"--mca coll_tuned_use_dynamic_rules 1" \
"--mca coll_tuned_reduce_algorithm 2"
get_results # on local machine
Reading the data in R will be more complex since no processing is done beforehand:
library(stringr)
library(purrr) # for map(), discard() and map_df()
library(dplyr) # for the pipe and the grouped summaries
library(ggplot2)
read.table.parameter <- function(subdir, parameter) {
dirname <- paste("results", subdir, parameter, sep = "/")
files <- list.files(dirname, pattern = "^result_")
read.table.file <- function(filename) {
# Parse the "#@key=value" metadata lines written by the benchmark
con <- file(filename, open = "r")
info <- map(readLines(con), ~ str_match(., "#@(.*)=(.*)")[2:3]) %>%
discard(~ any(is.na(.)))
close(con)
# Read the measurements (metadata lines starting with "#" are skipped)
# and attach each metadata entry as an extra column
data <- read.table(filename, header = TRUE)
for (i in info)
data[i[1]] <- type.convert(i[2])
data
}
map_df(paste(dirname, files, sep = "/"), read.table.file)
}
SEGMENT <- "coll_tuned_reduce_algorithm_segmentsize"
chain_segment <- read.table.parameter("chain", SEGMENT)
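Before plotting, a quick sanity check of what read.table.parameter returns can be useful. This is only a sketch, not part of the original analysis; it relies on the msize, value, iteration and runtime_sec columns that are used below.
# Expected columns and number of measurements per tested segment size
str(chain_segment[, c("msize", "value", "iteration", "runtime_sec")])
table(chain_segment$value)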
Let’s directly plot the medians. For comparison, we first load some reference performance of the default algorithm from the previous study:
default_raw <- read.table("results/default/summary.txt", sep = ",")
names(default_raw) <- c("iteration", "size", "time")
ggplot.median <- function(data) {
# Append reference measurements from the previous study (default algorithm),
# labeled with value = Inf so they appear as a separate boxplot
default <- data[1:nrow(default_raw),]
default$value <- Inf
default$iteration <- default_raw$iteration
default$msize <- default_raw$size
default$runtime_sec <- default_raw$time
data <- rbind(data, default)
group_by(data, iteration, value, msize) %>%
summarise(median = median(runtime_sec)) %>%
ggplot(aes(x = factor(value), y = median)) +
facet_wrap(~ msize, ncol = 4, scales = "free_y") +
geom_boxplot() +
scale_y_log10() +
annotation_logticks(sides = "l")
}
ggplot.median(chain_segment)
We see that the best segment size depends on the message size. For small messages (up to 30 kB), 1024 is the best, but for larger messages, it is preferable to use a larger segment size (4 kiB or even 16 kiB). Very large segment sizes are not beneficial (but could be for message sizes larger than 10 MB). Given those results, we might try the following values next time: 512, 2048, 8192, 16384.
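To make this trade-off explicit, here is a minimal sketch (not part of the original analysis) that extracts, for each message size, the segment size with the lowest median of per-iteration medians:
# For each message size, keep the segment size whose median (over iterations)
# of the per-run medians is the lowest
chain_segment %>%
group_by(msize, value, iteration) %>%
summarise(median = median(runtime_sec)) %>%
summarise(medmed = median(median)) %>%
filter(medmed == min(medmed)) %>%
select(msize, best_segment = value, medmed)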
As a trade-off, let’s use segments of size 4 kiB in the following when varying the maximum number of outstanding requests.
variate_parameter chain coll_tuned_reduce_algorithm_max_requests \
"0 1 3 10 30" \
"--mca hwloc_base_binding_policy core" \
"--mca mpi_leave_pinned 0" \
"--mca coll_tuned_use_dynamic_rules 1" \
"--mca coll_tuned_reduce_algorithm 2" \
"--mca coll_tuned_reduce_algorithm_segmentsize 4096"
get_results # on local machine
Let’s plot the medians:
OUT <- "coll_tuned_reduce_algorithm_max_requests"
chain_out <- read.table.parameter("chain", OUT)
ggplot.median(chain_out)
It seems to have very little effect, and only for the default setting. Given this result, let’s set it to 10 in all the following tests as we do not expect it to be that important.
Now for the last parameter, coll_tuned_reduce_algorithm_chain_fanout, which is used only by this algorithm.
variate_parameter chain coll_tuned_reduce_algorithm_chain_fanout \
"2 4 6" \
"--mca hwloc_base_binding_policy core" \
"--mca mpi_leave_pinned 0" \
"--mca coll_tuned_use_dynamic_rules 1" \
"--mca coll_tuned_reduce_algorithm 2" \
"--mca coll_tuned_reduce_algorithm_segmentsize 4096" \
"--mca coll_tuned_reduce_algorithm_max_requests 10"
get_results # on local machine
Let’s plot the medians:
FANOUT <- "coll_tuned_reduce_algorithm_chain_fanout"
chain_fanout <- read.table.parameter("chain", FANOUT)
ggplot.median(chain_fanout)
So this is a compromise again: small fanout for large messages (like a pipeline) and large fanout for small messages (like a flat-tree).
This also shows that the performance of chain is far worse than that of the default algorithm, which is consistent with the fact that this algorithm is never chosen by default. The performance is quite stable for small messages, but this is of little use since it is far from optimal. For messages of size 300 kB, however, the performance is a bit better than with the default algorithm (by less than a factor of two). This confirms that the default performance is sub-optimal for this size.
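As a rough check of this last claim, we can compare the medians at 300 kB directly. This is a sketch only (not part of the original analysis), reusing the default_raw reference loaded above and the last chain runs; the helper names are arbitrary.
# Median runtime of the default algorithm for 300 kB messages
default_300k <- median(default_raw$time[default_raw$size == 3e5])
# Best median runtime of the chain algorithm at the same size
chain_300k <- chain_fanout %>%
filter(msize == 3e5) %>%
group_by(value) %>%
summarise(med = median(runtime_sec)) %>%
summarise(best = min(med))
default_300k / chain_300k$best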
Let’s consider a more promising algorithm, the pipeline (it is used by default for large messages):
variate_parameter pipeline coll_tuned_reduce_algorithm_segmentsize \
"512 2048 8192 16384" \
"--mca hwloc_base_binding_policy core" \
"--mca mpi_leave_pinned 0" \
"--mca coll_tuned_use_dynamic_rules 1" \
"--mca coll_tuned_reduce_algorithm 3" \
"--mca coll_tuned_reduce_algorithm_max_requests 10"
get_results # on local machine
Let’s plot the medians:
pipeline_segment <- read.table.parameter("pipeline", SEGMENT)
ggplot.median(pipeline_segment)
There seems to be no better choice than a low segment size, except for large messages. For small messages, the runtime is 10 times higher than with the default algorithm. We will keep decreasing the segment size values for the next algorithms. We also see that, with a segment size of 512, the performance for 300 kB is two to three times better than with the default algorithm.
Let’s try the binary algorithm, which is used by default when the number of processes is large. In that case, it is used exclusively with a segment size of 32 kiB. We will see whether this large size is actually advantageous.
variate_parameter binary coll_tuned_reduce_algorithm_segmentsize \
"512 2048 8192 32768" \
"--mca hwloc_base_binding_policy core" \
"--mca mpi_leave_pinned 0" \
"--mca coll_tuned_use_dynamic_rules 1" \
"--mca coll_tuned_reduce_algorithm 4" \
"--mca coll_tuned_reduce_algorithm_max_requests 10"
get_results # on local machine
Let’s plot the medians:
binary_segment <- read.table.parameter("binary", SEGMENT)
ggplot.median(binary_segment)
This time, larger segment sizes are indeed better. Note that this algorithm is rarely the best (only for 100 kB and 300 kB, where the performance of the default algorithm is sub-optimal).
The binomial algorithm is used by default without fragmentation for small messages, or with fragments of size 1 kiB for large messages.
variate_parameter binomial coll_tuned_reduce_algorithm_segmentsize \
"512 2048 8192 32768" \
"--mca hwloc_base_binding_policy core" \
"--mca mpi_leave_pinned 0" \
"--mca coll_tuned_use_dynamic_rules 1" \
"--mca coll_tuned_reduce_algorithm 5" \
"--mca coll_tuned_reduce_algorithm_max_requests 10"
get_results # on local machine
Let’s plot the medians:
binomial_segment <- read.table.parameter("binomial", SEGMENT)
ggplot.median(binomial_segment)
Large segment sizes seem to always be preferable, except maybe for message sizes from 10 kB to 100 kB. The binomial algorithm provides the best performance for sizes <= 1 kB. It also seems to be used by default for sizes <= 300 kB, which is not optimal because chain, binary and pipeline outperform binomial for sizes >= 100 kB.
The last algorithm is the in-order binary tree. It is not expected to perform very well because it is used by default only when the operation is not commutative. However, its ordering may happen to be well adapted to the hierarchical structure (what Zhu et al., 2009, call a “happy accident”).
variate_parameter inorderbinary coll_tuned_reduce_algorithm_segmentsize \
"512 2048 8192 32768" \
"--mca hwloc_base_binding_policy core" \
"--mca mpi_leave_pinned 0" \
"--mca coll_tuned_use_dynamic_rules 1" \
"--mca coll_tuned_reduce_algorithm 6" \
"--mca coll_tuned_reduce_algorithm_max_requests 10"
get_results # on local machine
Let’s plot the medians:
inorderbinary_segment <- read.table.parameter("inorderbinary", SEGMENT)
ggplot.median(inorderbinary_segment)
The in-order binary tree actually has good performance for small messages (similar to binomial, and even better for sizes from 1 kB to 300 kB). The segment size must be low for small messages (<= 30 kB) and large for messages >= 100 kB. For intermediate sizes, there is still a compromise.
Let’s summarize the best performing algorithms and their related parameters for each size. For each previous boxplot, we keep only the median, and for each algorithm, we keep only the minimum of these medians. This hides the fact that the variance is higher with in-order binary than with binomial, but this is a coarse analysis.
chain_segment$job_shuffling_enabled = 0 # fix for mistake mentioned above
algo_summary <- rbind(
cbind(chain_segment[,names(chain_out)], name = "chain_segment"),
cbind(chain_out, name = "chain_out"),
cbind(chain_fanout, name = "chain_fanout"),
cbind(pipeline_segment, name = "pipeline_segment"),
cbind(binary_segment, name = "binary_segment"),
cbind(binomial_segment, name = "binomial_segment"),
cbind(inorderbinary_segment, name = "inorderbinary_segment")
)
algo_summary %>%
group_by(msize, name, value, iteration) %>%
summarise(median = median(runtime_sec)) %>%
summarise(medmed = median(median)) %>%
summarise(minmedmed = min(medmed)) %>%
ggplot(aes(x = msize, y = minmedmed, color = name, shape = name)) +
geom_line() +
geom_point() +
scale_x_log10() +
scale_y_log10() +
scale_shape_manual(values = 1:7) +
annotation_logticks(sides = "l")
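To read the winning algorithm for each message size directly off this plot, the same aggregation can be pushed one step further. This is a sketch, not part of the original analysis:
# For each message size, keep the algorithm whose median of medians is minimal
algo_summary %>%
group_by(msize, name, value, iteration) %>%
summarise(median = median(runtime_sec)) %>%
summarise(medmed = median(median)) %>%
ungroup() %>%
group_by(msize) %>%
filter(medmed == min(medmed)) %>%
select(msize, name, value, medmed)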
In summary, the best coarse strategy seems to be: in-order binary with 512 B segments for messages up to 30 kB, in-order binary with 2 kiB segments from 30 kB to 300 kB, and pipeline with 512 B segments for larger messages.
Let’s compare the performance of Open MPI with the default settings and with directly and coarsely tuned ones.
Let’s first create the file for dynamic rules based on the previous analysis:
echo "1 # num of collectives (only reduce)
11 # ID for Reduce collective (see coll_base_functions.h)
1 # number of com sizes
1 # comm size 1
3 # number of msg sizes
0 6 0 512 # from message size 0: in-order binary, no topo, 512 B segments
30001 6 0 2048 # from message size 30001: in-order binary, no topo, 2 kiB segments
300001 3 0 512 # from message size 300001: pipeline, no topo, 512 B segments" \
> ${RESULT_DIR}/dynamic_rules
The launch script needs to be adapted for this comparison. We also increase the number of repetitions since we plan to assess the variation.
REPETITION=100
function launch_tuned {
SUBDIR=tuned
LEVEL=$1
SETTINGS=${@:2}
mkdir -p ${RESULT_DIR}/${SUBDIR}/${LEVEL}
ompi_info -c > ${RESULT_DIR}/${SUBDIR}/${LEVEL}/info
mpirun --version > ${RESULT_DIR}/${SUBDIR}/${LEVEL}/version
echo Launch benchmark with ${LEVEL} tuning
for i in $(seq 1 ${REPEAT})
do
echo Iteration ${i} of ${REPEAT} with ${REPETITION} measures per size
timeout ${TIMEOUT} mpirun ${SETTINGS} \
-n 512 --npernode 16 --hostfile ${RESULT_DIR}/hostfile \
mpibenchmark-0.9.4 --calls-list=MPI_Reduce \
--params=level:${LEVEL},iteration:${i} \
--msizes-list=${SIZES} -r ${REPETITION} --shuffle-jobs 2>&1 > \
${RESULT_DIR}/${SUBDIR}/${LEVEL}/result_${i}
done
}
We will compare the default performance with the one we can get by setting a few parameters, and with a more advanced but still coarse tuning.
launch_tuned no
launch_tuned direct \
"--mca hwloc_base_binding_policy core" \
"--mca mpi_leave_pinned 0"
launch_tuned coarse \
"--mca hwloc_base_binding_policy core" \
"--mca mpi_leave_pinned 0" \
"--mca btl_openib_eager_limit 4096" \
"--mca coll_tuned_use_dynamic_rules 1" \
"--mca coll_tuned_dynamic_rules_filename ${RESULT_DIR}/dynamic_rules" \
"--mca coll_tuned_reduce_algorithm_max_requests 10"
get_results # on local machine
perf_tuning <- rbind(read.table.parameter("tuned", "no"),
read.table.parameter("tuned", "direct"),
read.table.parameter("tuned", "coarse"))
perf_tuning %>%
group_by(iteration, level, msize) %>%
summarise(median = median(runtime_sec)) %>%
ggplot(aes(x = level, y = median)) +
facet_wrap(~ msize, ncol = 4, scales = "free_y") +
geom_boxplot() +
scale_y_log10() +
annotation_logticks(sides = "l")
The optimization fails for 10 MB messages; we suspect that increasing the segment size (to 2 kiB for instance) would solve it.
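To see the magnitude of this regression, here is a minimal sketch (not part of the original analysis) comparing the median runtimes for 10 MB messages across the three tuning levels:
# Median runtime per tuning level for 10 MB messages; the coarse level is the
# one suspected to regress because of the 512 B pipeline segments
perf_tuning %>%
filter(msize == 1e7) %>%
group_by(level) %>%
summarise(median = median(runtime_sec))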
This first look into tuning shows that the performance of Open MPI can be significantly improved by setting some basic flags. It can be further improved with some quick tuning. The achieved improvement ranges from 10% up to a factor of 3. Although some interactions between the parameters could reveal new avenues for improvement, we suspect that a fine tuning of Open MPI would not significantly increase the performance (not by more than 20%).
Let’s take a look at the variability for three sizes: 100 B, 10 kB and 1 MB.
perf_tuning %>%
filter(msize %in% c(1e2, 1e4, 1e6)) %>%
ggplot(aes(x = level, y = runtime_sec, color = factor(iteration))) +
facet_wrap(~ msize, scales = "free_y") +
geom_boxplot(outlier.size = 1) +
scale_y_log10() +
annotation_logticks(sides = "l") +
theme(legend.position = "bottom") +
guides(color = guide_legend(nrow = 2, byrow = TRUE))
We see that the variation is higher for small messages, but this is expected because small perturbations may significantly impact the latency of the collective. Except with no tuning, many outliers occur only once. Let’s check the variation when the first measure is removed (Open MPI does some setup the first time a given function is called).
# Drop the first measure of each block of REPETITION = 100 measurements
perf_tuning[1:nrow(perf_tuning) %% 100 != 1,] %>%
filter(msize %in% c(1e2, 1e4, 1e6)) %>%
filter(level != "no") %>%
ggplot(aes(x = level, y = runtime_sec, color = factor(iteration))) +
facet_wrap(~ msize, scales = "free_y") +
geom_boxplot(outlier.size = 1) +
scale_y_log10() +
annotation_logticks(sides = "l") +
theme(legend.position = "bottom") +
guides(color = guide_legend(nrow = 2, byrow = TRUE))
The variations with 100 B messages are similar between both levels of tuning, and they are smaller than with the no-tuning approach. The coarse tuning performance is thus superior.
For 10 kB, the coarse performance is better, but it has a higher variability. This may be because the coarse performance is closer to the optimal performance, so this variability may be inevitable. Moreover, most of the outliers are still better than the median performance obtained with the direct approach. The coarse tuning is thus also superior here.
There is still a lot of variation in the coarsely tuned runs for 1 MB. Why there are more extreme values than with the direct tuning remains a mystery. It is thus less clear whether the coarse approach is superior here.
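To put a number on these extreme values, one could count the fraction of measurements far above the median. This is a sketch with an arbitrary 2x threshold, not part of the original analysis:
# Fraction of 1 MB measurements exceeding twice the per-level median runtime
perf_tuning %>%
filter(msize == 1e6) %>%
group_by(level) %>%
summarise(extreme_ratio = mean(runtime_sec > 2 * median(runtime_sec)))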
Finally, let’s plot the best overall achieved performance with Open MPI.
perf_tuning %>%
group_by(msize, level, iteration) %>%
summarise(median = median(runtime_sec)) %>%
summarise(medmed = median(median)) %>%
ggplot(aes(x = msize, y = medmed, color = level, shape = level)) +
geom_line() +
geom_point() +
scale_x_log10() +
scale_y_log10() +
annotation_logticks(sides = "l")
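To relate this plot to the improvement factors discussed below, here is a minimal sketch (not part of the original analysis) computing, for each message size, the ratio between the untuned and the coarsely tuned medians; it assumes the tidyr package for spread, which is attached in this session:
library(tidyr)
# Median runtime per size and tuning level, then the no/coarse ratio
perf_tuning %>%
group_by(msize, level) %>%
summarise(med = median(runtime_sec)) %>%
ungroup() %>%
spread(level, med) %>%
mutate(improvement = no / coarse)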
This set of studies about tuning Open MPI reveals the extent to which we may expect performance improvements through tuning (up to 3x). We can therefore conclude that any paper claiming that a new mechanism outperforms Open MPI with its default settings by less than a factor of 2 or 3 could actually be outperformed by the existing mechanisms implemented in Open MPI when correctly tuned.
The methodology used in this study is quite satisfying, except that no statistical test was done to validate any improvement. Dealing with extreme values remains a difficult issue.
## R version 3.2.4 (2016-03-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.4 LTS
##
## locale:
## [1] LC_CTYPE=fr_FR.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=fr_FR.UTF-8 LC_COLLATE=fr_FR.UTF-8
## [5] LC_MONETARY=fr_FR.UTF-8 LC_MESSAGES=fr_FR.UTF-8
## [7] LC_PAPER=fr_FR.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggplot2_1.0.1 dplyr_0.4.3 tidyr_0.2.0 purrr_0.2.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.1 knitr_1.10.5 magrittr_1.5 MASS_7.3-44
## [5] munsell_0.4.2 colorspace_1.2-6 R6_2.1.0 stringr_1.0.0
## [9] plyr_1.8.3 tools_3.2.4 parallel_3.2.4 grid_3.2.4
## [13] gtable_0.1.2 DBI_0.3.1 htmltools_0.2.6 yaml_2.1.13
## [17] assertthat_0.1 digest_0.6.8 reshape2_1.4.1 formatR_1.2
## [21] evaluate_0.7 rmarkdown_0.7 stringi_0.5-5 scales_0.2.5
## [25] proto_0.3-10