Several article compare new solutions to the default performance provided by Open MPI. The version is often different. Therefore, we look at the default performance for all version of Open MPI.

Installing

We first need all the archives:

for version in $(seq 0 8)
do
  wget https://www.open-mpi.org/software/ompi/v1.${version}/downloads/openmpi-1.${version}.tar.bz2
done

for version in 1.0.$(seq -s " 1.0." 1 2) \
               1.1.$(seq -s " 1.1." 1 5) \
               1.2.$(seq -s " 1.2." 1 9) \
               1.3.$(seq -s " 1.3." 1 4) \
               1.4.$(seq -s " 1.4." 1 5) \
               1.5.$(seq -s " 1.5." 1 5) \
               1.6.$(seq -s " 1.6." 1 5) \
               1.7.$(seq -s " 1.7." 1 5) \
               1.8.$(seq -s " 1.8." 1 8) \
               1.10.$(seq -s " 1.10." 0 2)
do
  wget https://www.open-mpi.org/software/ompi/v${version%.*}/downloads/openmpi-${version}.tar.bz2
done

There is a total number of versions of 60. We install them with default and debug (just in case) configurations:

mkdir -p ompi
cd ompi
for archive in $(ls *.tar.bz2)
do
    tar -xjf ${archive}
    version=${archive%.*.*}
    mv ${version} ${version}-build
    mkdir ${version} ${version}-debug
    cd ${version}-build
    ./configure --prefix=$HOME/ompi/${version} && make all && make install
    ./configure --prefix=$HOME/ompi/${version}-debug --enable-debug && make all && make install
    cd .. && rm -r ${version}-build
    cd ~ && rm bin && ln -s ompi/${version}/bin bin
    cd ~/mpibenchmark-0.6.0-src/ && make && mv mpibenchmark ../bin/
    cd ~ && rm bin && ln -s ompi/${version}-debug/bin bin
    cd ~/mpibenchmark-0.6.0-src/ && make && mv mpibenchmark ../bin/
    cd ~/ompi
done

This takes 7.9 GB (du -s -h ompi/). All versions 1.0 (1.0, 1.0.1, 1.0.2) and all versions 1.1 (1.1, 1.1.1, 1.1.2, 1.1.3, 1.1.4, 1.1.5) did not compile out-of-the-box, which represents 9 versions.

Benchmarking

The benchmark consists in measuring the MPI_Reduce call for different sizes (from 1 B to 10 MB) on the Jupiter cluster with 32x16 processes. We consider 100 repetitions with a timeout of 100 seconds (1 second allow the transfer of 1 GB on an 10 Gb/s network, which should be more than enough for reducing 10 MB).

# Check nobody is on the cluster
for i in $(seq 0 35); do echo -n $i; ssh jupiter$i w; done | grep -v USER

# Directory containing final results
VERSIONS_DIR=$PWD/versions
mkdir -p ${VERSIONS_DIR}
cd ${VERSIONS_DIR}

# Nodes to use for XP
> hostfile
for i in $(seq 3 18) $(seq 20 35)
do
    echo jupiter$i >> hostfile
done
    
# Launch XP
REPETITION=100
TIMEOUT=100
ARCHIVES=$(cd ~/ompi && ls *.tar.bz2)
for archive in ${ARCHIVES}
do  
    # Setup the environment
    version=${archive%.*.*}
    cd ~ && rm bin && ln -s ompi/${version}/bin bin
    if [ ! -f bin/mpirun ] ; then continue ; fi
    echo Using version ${version}
    # Required for versions 1.4 to 1.6.5
    export LD_LIBRARY_PATH=$PWD/ompi/${version}/lib

    echo Forcing NFS synchro
    for i in $(seq 0 35); do ssh jupiter$i ls -l; done > /dev/null

    # Ready to start the benchmarks
    mkdir -p ${VERSIONS_DIR}/${version}
    mpirun --version > ${VERSIONS_DIR}/${version}/version
    for size in 1 3 10 30 100 300 1000 3000 10000 30000 100000 300000 1000000 3000000 10000000
    do
        echo Launch benchmark for size ${size}
        timeout ${TIMEOUT} mpirun -x LD_LIBRARY_PATH \
                -n 512 --npernode 16 --hostfile ${VERSIONS_DIR}/hostfile \
                mpibenchmark --calls-list=MPI_Reduce -r ${REPETITION} \
                        --msizes-list=${size} 2>&1 > ${VERSIONS_DIR}/${version}/result_${size}
    done
done

Out of the 51 versions, 17 versions produced errors. All versions 1.2 (1.2, 1.2.1, 1.2.2, 1.2.3, 1.2.4, 1.2.5, 1.2.6, 1.2.7, 1.2.8, 1.2.9) and most versions 1.3 (1.3, 1.3.1, 1.3.2, 1.3.3) could not deploy the processes. Versions 1.3.4 and 1.4 took too much time. Version 1.4.2 ended up with a segmentation fault.

Finally, 34 versions produced exploitable results: 16 versions required setting LD_LIBRARY_PATH (all versions 1.4 and 1.6 with 1.4 and 1.4.2 having problems as mentioned above) and 28 versions worked out-of-the-box by only setting PATH` (all versions 1.7, 1.8 and 1.10).

Processing

RESULT_DIR=results
mkdir -p ${RESULT_DIR}
scp -r jupiter:versions/* ${RESULT_DIR}/
for version in $(ls ${RESULT_DIR} | grep openmpi-)
do
    ver=$(echo ${version} | grep -o [0-9\.]*)
    > ${RESULT_DIR}/${version}/summary.txt
    for file in $(ls ${RESULT_DIR}/${version}/result_*)
    do
        size=$(echo ${file} | grep -o "_[^_]*$" | grep -o [0-9]*)
        awk -v ver="${ver}" -v size="${size}" \
                '$1 ~ /MPI_Reduce/ { print ver "," size "," $4; }' ${file} \
                >> ${RESULT_DIR}/${version}/summary.txt
    done
done

cat $(ls ${RESULT_DIR}/*/summary.txt) | sort -V > ${RESULT_DIR}/summary.txt

Analysis

Let’s parse the data:

versions <- read.table("results/summary.txt", sep = ",")
names(versions) <- c("version", "size", "time")
levs <- unique(versions$version)
versions$version <- factor(versions$version, levels = levs)

Let’s plot the latency for each versions and size:

ggplot(versions, aes(x = factor(size), y = time, color = version)) +
    geom_boxplot(outlier.size = 0.5) +
    scale_y_log10() +
    annotation_logticks(sides = "l")

Observations:

there seems to be an overall decrease for small messages but it hard to see
several measurements are around 30 ms for small messages

Let’s zoom on the 30 ms area

ggplot(versions, aes(x = factor(size), y = time, color = version)) +
    geom_boxplot() +
    scale_y_log10() +
    annotation_logticks(sides = "l") +
    coord_cartesian(ylim = c(29e-3, 31e-3))

There is indeed a concentration of points here.

Let’s organize data by size to have distinct scales (and hopefully, to see better).

ggplot(versions, aes(x = version, y = time)) +
    geom_boxplot() +
    scale_y_log10() +
    facet_wrap(~size, scales = "free", ncol = 3) +
    annotation_logticks(sides = "l") +
    stat_smooth(aes(x = as.numeric(version)), method = "lm") +
    theme(axis.text.x = element_text(angle = 90))

Observations:

performance improvement for small messages (confirm previous observation)
versions 1.7 to 1.7.3 have a regression with small messages
versions 1.7 to 1.8.4 have a low variation for small messages
some of the versions up to 1.6.3 have huge variations for messages of 3 MB
lot of outliers that complicate the study of the average/median case
plots until 1 kB are similar

We keep only a subset of plots (300 to 3 MB by step of x10) and without the outliers (using the same computation as done by the boxplot function).

library(dplyr)

versions_filtered <- versions %>%
    group_by(size, version) %>%
    mutate(q1 = quantile(time, 0.25)) %>%
    mutate(q2 = quantile(time, 0.75)) %>%
    filter(time <= q2 + 1.5 * (q2 - q1) &
           time >= q1 - 1.5 * (q2 - q1)) %>%
    filter(size %% 3 == 0 & size >= 30)

ggplot(versions_filtered, aes(x = version, y = time)) +
    geom_boxplot() +
    scale_y_log10() +
    facet_wrap(~size, scales = "free", ncol = 2) +
    annotation_logticks(sides = "l") +
    stat_smooth(aes(x = as.numeric(version)), method = "lm") +
    theme(axis.text.x = element_text(angle = 90))

Observations:

version 1.8.4 seems to be the most efficient for medium and large messages
version 1.8.6 has particularly bad performance for 3 MB messages
version 1.10.0 has particularly bad performance for 300 kB messages

Second run

To be sure we are not affected by some artifact, here is the last plot of a second run:

versions_second <- read.table("results_second/summary.txt", sep = ",")
names(versions_second) <- c("version", "size", "time")
levs <- unique(versions_second$version)
versions_second$version <- factor(versions_second$version, levels = levs)

versions_second %>%
    group_by(size, version) %>%
    mutate(q1 = quantile(time, 0.25)) %>%
    mutate(q2 = quantile(time, 0.75)) %>%
    filter(time <= q2 + 1.5 * (q2 - q1) &
           time >= q1 - 1.5 * (q2 - q1)) %>%
    filter(size %% 3 == 0 & size >= 30) %>%
    ggplot(aes(x = version, y = time)) +
    geom_boxplot() +
    scale_y_log10() +
    facet_wrap(~size, scales = "free", ncol = 2) +
    annotation_logticks(sides = "l") +
    stat_smooth(aes(x = as.numeric(version)), method = "lm") +
    theme(axis.text.x = element_text(angle = 90))

We redid measurements for 1.5.1 as it was two order of magnitude slower (maybe an interference from an external connection, except nobody connected…). There is still some significant large value that could results from interference (1.5 for 300 B, 1.5.5 for 3 kB, …) but it could also be due to unstable performance.

Note that the measures are missing for 30 B with 1.6.3, 1.6.4 and 1.6.5 for unknown reasons (the timeout was reached). Also, the measures are significantly higher with versions 1.5 and 1.6 than before.

We see similar effects that confirm previous observations:

versions 1.7 to 1.7.3 have a regression with small messages
some of the versions up to 1.6.1 have huge variations for messages of 3 MB (versions up to 1.6.3 were concerned before)
version 1.8.4 seems to be the most efficient for medium and large messages
version 1.8.6 has particularly bad performance for 3 MB messages
version 1.10.0 has particularly bad performance for 300 kB messages

Conclusion

We see that recent versions provide

First, the stability of the default performance between releases is imperfect. The default performance of some versions seems to be pathologically bad and they should probably be avoided (versions 1.7 to 1.7.3, versions 1.8.6, versions 1.10.0). On the other hand, recent versions provide the lowest variations and best default performance (with some exceptions) in particular for small messages. If we keep 1.10.2, we should probably be fine (versions 1.8.4 is also worth considering).

Lastly, sending 10 MB was never possible given the 100 second timeout. This could be related to some specific parameter that need to be set.

Confirming this study would require to perform several runs (5 for instance) while making sure nobody is using the cluster to limit interference. To save time, only versions known to work directly could be used for message sizes distinct from 10 MB. Also, it would be good to randomize the execution and to use the MPI Benchmark feature allowing passing several sizes as arguements.

## R version 3.2.3 (2015-12-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.4 LTS
## 
## locale:
##  [1] LC_CTYPE=fr_FR.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=fr_FR.UTF-8        LC_COLLATE=fr_FR.UTF-8    
##  [5] LC_MONETARY=fr_FR.UTF-8    LC_MESSAGES=fr_FR.UTF-8   
##  [7] LC_PAPER=fr_FR.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggplot2_1.0.1 dplyr_0.4.3   tidyr_0.2.0   purrr_0.2.0  
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.1      knitr_1.10.5     magrittr_1.5     MASS_7.3-44     
##  [5] munsell_0.4.2    colorspace_1.2-6 R6_2.1.0         stringr_1.0.0   
##  [9] plyr_1.8.3       tools_3.2.3      parallel_3.2.3   grid_3.2.3      
## [13] gtable_0.1.2     DBI_0.3.1        htmltools_0.2.6  yaml_2.1.13     
## [17] assertthat_0.1   digest_0.6.8     reshape2_1.4.1   formatR_1.2     
## [21] evaluate_0.7     rmarkdown_0.7    stringi_0.5-5    scales_0.2.5    
## [25] proto_0.3-10

Open MPI versions study

L.-C. Canon

March 15, 2016

Installing

Benchmarking

Processing

Analysis

Second run

Conclusion