Several articles compare new solutions to the default performance provided by Open MPI, but the version used often differs from one article to the next. We therefore look at the default performance of all versions of Open MPI.
We first need all the archives:
for version in $(seq 0 8)
do
    wget https://www.open-mpi.org/software/ompi/v1.${version}/downloads/openmpi-1.${version}.tar.bz2
done
# seq -s inserts the given prefix between the numbers: for instance,
# 1.2.$(seq -s " 1.2." 1 9) expands to "1.2.1 1.2.2 ... 1.2.9"
for version in 1.0.$(seq -s " 1.0." 1 2) \
               1.1.$(seq -s " 1.1." 1 5) \
               1.2.$(seq -s " 1.2." 1 9) \
               1.3.$(seq -s " 1.3." 1 4) \
               1.4.$(seq -s " 1.4." 1 5) \
               1.5.$(seq -s " 1.5." 1 5) \
               1.6.$(seq -s " 1.6." 1 5) \
               1.7.$(seq -s " 1.7." 1 5) \
               1.8.$(seq -s " 1.8." 1 8) \
               1.10.$(seq -s " 1.10." 0 2)
do
    wget https://www.open-mpi.org/software/ompi/v${version%.*}/downloads/openmpi-${version}.tar.bz2
done
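Assuming all the downloads succeeded, a quick count of the archives should match the expected total (9 base releases plus 51 point releases):
ls openmpi-*.tar.bz2 | wc -l    # should print 60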
This gives a total of 60 versions. We install each of them with both the default configuration and a debug configuration (just in case):
mkdir -p ompi
mv openmpi-*.tar.bz2 ompi/    # the archives were downloaded in the home directory
cd ompi
for archive in $(ls *.tar.bz2)
do
    tar -xjf ${archive}
    version=${archive%.*.*}
    mv ${version} ${version}-build
    mkdir ${version} ${version}-debug
    cd ${version}-build
    ./configure --prefix=$HOME/ompi/${version} && make all && make install
    make clean    # avoid mixing default and debug objects
    ./configure --prefix=$HOME/ompi/${version}-debug --enable-debug && make all && make install
    cd .. && rm -r ${version}-build
    # Build the benchmark against each installation, via the ~/bin symlink
    cd ~ && rm -f bin && ln -s ompi/${version}/bin bin
    cd ~/mpibenchmark-0.6.0-src/ && make clean && make && mv mpibenchmark ../bin/
    cd ~ && rm -f bin && ln -s ompi/${version}-debug/bin bin
    cd ~/mpibenchmark-0.6.0-src/ && make clean && make && mv mpibenchmark ../bin/
    cd ~/ompi
done
This takes 7.9 GB in total (`du -s -h ompi/`). None of the 1.0 versions (1.0, 1.0.1, 1.0.2) nor the 1.1 versions (1.1, 1.1.1, 1.1.2, 1.1.3, 1.1.4, 1.1.5) compiled out-of-the-box, which represents 9 versions.
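To double-check, we can count the versions for which an `mpirun` binary was actually installed (this should print 51 if only the 1.0 and 1.1 series failed to build):
ls ~/ompi/openmpi-*/bin/mpirun | grep -v debug | wc -l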
The benchmark consists of measuring the `MPI_Reduce` call for different sizes (from 1 B to 10 MB) on the Jupiter cluster with 32×16 processes (32 nodes, 16 processes per node). We consider 100 repetitions with a timeout of 100 seconds (one second allows the transfer of about 1 GB on a 10 Gb/s network, which should be more than enough for reducing 10 MB).
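A back-of-the-envelope check of this timeout, assuming the full 10 Gb/s is available:
# 100 repetitions x 10 MB = 1000 MB to transfer per link;
# 10 Gb/s = 1250 MB/s, so about 1000 / 1250 = 0.8 s, far below the 100 s timeout.
echo "scale=1; (100 * 10) / (10 * 1000 / 8)" | bc    # prints .8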
# Check nobody is on the cluster
for i in $(seq 0 35); do echo -n $i; ssh jupiter$i w; done | grep -v USER
# Directory containing final results
VERSIONS_DIR=$PWD/versions
mkdir -p ${VERSIONS_DIR}
cd ${VERSIONS_DIR}
# Nodes to use for the experiment
> hostfile
for i in $(seq 3 18) $(seq 20 35)
do
    echo jupiter$i >> hostfile
done
# Launch the experiment
REPETITION=100
TIMEOUT=100
ARCHIVES=$(cd ~/ompi && ls *.tar.bz2)
for archive in ${ARCHIVES}
do
    # Setup the environment
    version=${archive%.*.*}
    cd ~ && rm -f bin && ln -s ompi/${version}/bin bin
    if [ ! -f bin/mpirun ] ; then continue ; fi
    echo Using version ${version}
    # Required for versions 1.4 to 1.6.5
    export LD_LIBRARY_PATH=$PWD/ompi/${version}/lib
    echo Forcing NFS synchronization
    for i in $(seq 0 35); do ssh jupiter$i ls -l; done > /dev/null
    # Ready to start the benchmarks
    mkdir -p ${VERSIONS_DIR}/${version}
    mpirun --version > ${VERSIONS_DIR}/${version}/version
    for size in 1 3 10 30 100 300 1000 3000 10000 30000 100000 300000 1000000 3000000 10000000
    do
        echo Launch benchmark for size ${size}
        timeout ${TIMEOUT} mpirun -x LD_LIBRARY_PATH \
            -n 512 --npernode 16 --hostfile ${VERSIONS_DIR}/hostfile \
            mpibenchmark --calls-list=MPI_Reduce -r ${REPETITION} \
            --msizes-list=${size} > ${VERSIONS_DIR}/${version}/result_${size} 2>&1
    done
done
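A hedged way to triage the failures afterwards (the grep patterns are guesses; the exact error messages depend on the Open MPI version):
# Result files that mention an error, and files that stayed empty (timeout)
grep -l -i "error\|segmentation" ${VERSIONS_DIR}/*/result_*
find ${VERSIONS_DIR} -name "result_*" -empty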
Out of the 51 versions, 17 versions produced errors. All versions 1.2 (1.2, 1.2.1, 1.2.2, 1.2.3, 1.2.4, 1.2.5, 1.2.6, 1.2.7, 1.2.8, 1.2.9) and most versions 1.3 (1.3, 1.3.1, 1.3.2, 1.3.3) could not deploy the processes. Versions 1.3.4 and 1.4 took too much time. Version 1.4.2 ended up with a segmentation fault.
Finally, 34 versions produced exploitable results: 16 versions required setting `LD_LIBRARY_PATH` (all versions 1.4 to 1.6.5, with 1.4 and 1.4.2 having problems as mentioned above) and 18 versions worked out-of-the-box by only setting `PATH` (all versions 1.7, 1.8 and 1.10).
RESULT_DIR=results
mkdir -p ${RESULT_DIR}
scp -r jupiter:versions/* ${RESULT_DIR}/
for version in $(ls ${RESULT_DIR} | grep openmpi-)
do
    # Extract the version number, e.g. openmpi-1.4.5 -> 1.4.5
    ver=$(echo ${version} | grep -o "[0-9.]*")
    > ${RESULT_DIR}/${version}/summary.txt
    for file in $(ls ${RESULT_DIR}/${version}/result_*)
    do
        # Extract the message size from the file name
        size=$(echo ${file} | grep -o "_[^_]*$" | grep -o "[0-9]*")
        awk -v ver="${ver}" -v size="${size}" \
            '$1 ~ /MPI_Reduce/ { print ver "," size "," $4; }' ${file} \
            >> ${RESULT_DIR}/${version}/summary.txt
    done
done
cat ${RESULT_DIR}/*/summary.txt | sort -V > ${RESULT_DIR}/summary.txt
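Each line of `summary.txt` then has the following form (sizes in bytes; times, as the plots below assume, in seconds):
<version>,<message size in bytes>,<time in seconds>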
Let’s parse the data:
versions <- read.table("results/summary.txt", sep = ",")
names(versions) <- c("version", "size", "time")
levs <- unique(versions$version)
versions$version <- factor(versions$version, levels = levs)
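A quick sanity check of the parsed data (output not shown):
str(versions)                # 3 columns: version (factor), size, time
table(versions$version)      # number of measurements per version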
Let’s plot the latency for each version and size:
library(ggplot2)
ggplot(versions, aes(x = factor(size), y = time, color = version)) +
  geom_boxplot(outlier.size = 0.5) +
  scale_y_log10() +
  annotation_logticks(sides = "l")
Observations: many measurements, whatever the version and size, seem to concentrate around 30 ms.
Let’s zoom on the 30 ms area:
ggplot(versions, aes(x = factor(size), y = time, color = version)) +
  geom_boxplot() +
  scale_y_log10() +
  annotation_logticks(sides = "l") +
  coord_cartesian(ylim = c(29e-3, 31e-3))
There is indeed a concentration of points here.
Let’s organize the data by size to have distinct scales (and hopefully a better view).
ggplot(versions, aes(x = version, y = time)) +
  geom_boxplot() +
  scale_y_log10() +
  facet_wrap(~size, scales = "free", ncol = 3) +
  annotation_logticks(sides = "l") +
  stat_smooth(aes(x = as.numeric(version)), method = "lm") +
  theme(axis.text.x = element_text(angle = 90))
Observations:
We keep only a subset of the sizes (30 B to 3 MB, by steps of ×10) and remove the outliers (using the same computation as the boxplot function).
library(dplyr)
versions_filtered <- versions %>%
  group_by(size, version) %>%
  mutate(q1 = quantile(time, 0.25)) %>%          # first quartile
  mutate(q3 = quantile(time, 0.75)) %>%          # third quartile
  filter(time <= q3 + 1.5 * (q3 - q1) &          # Tukey's 1.5 IQR rule,
         time >= q1 - 1.5 * (q3 - q1)) %>%       # as used by geom_boxplot
  filter(size %% 3 == 0 & size >= 30)            # keep sizes 30 B to 3 MB
ggplot(versions_filtered, aes(x = version, y = time)) +
  geom_boxplot() +
  scale_y_log10() +
  facet_wrap(~size, scales = "free", ncol = 2) +
  annotation_logticks(sides = "l") +
  stat_smooth(aes(x = as.numeric(version)), method = "lm") +
  theme(axis.text.x = element_text(angle = 90))
Observations:
To be sure we are not affected by some artifact, here is the last plot of a second run:
versions_second <- read.table("results_second/summary.txt", sep = ",")
names(versions_second) <- c("version", "size", "time")
levs <- unique(versions_second$version)
versions_second$version <- factor(versions_second$version, levels = levs)
versions_second %>%
  group_by(size, version) %>%
  mutate(q1 = quantile(time, 0.25)) %>%
  mutate(q3 = quantile(time, 0.75)) %>%
  filter(time <= q3 + 1.5 * (q3 - q1) &
         time >= q1 - 1.5 * (q3 - q1)) %>%
  filter(size %% 3 == 0 & size >= 30) %>%
  ggplot(aes(x = version, y = time)) +
  geom_boxplot() +
  scale_y_log10() +
  facet_wrap(~size, scales = "free", ncol = 2) +
  annotation_logticks(sides = "l") +
  stat_smooth(aes(x = as.numeric(version)), method = "lm") +
  theme(axis.text.x = element_text(angle = 90))
We redid the measurements for 1.5.1 as it was two orders of magnitude slower (maybe an interference from an external connection, even though nobody was connected…). There are still some significantly large values that could result from interference (1.5 for 300 B, 1.5.5 for 3 kB, …), but they could also be due to unstable performance.
Note that the measurements are missing for 30 B with 1.6.3, 1.6.4 and 1.6.5 for unknown reasons (the timeout was reached). Also, the measurements are significantly higher with versions 1.5 and 1.6 than before.
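A quick way to spot such missing combinations (a sketch using base R):
missing <- xtabs(~ version + size, versions_second) == 0
which(missing, arr.ind = TRUE)    # (version, size) pairs with no measurement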
We see similar effects in this second run, which confirms the previous observations. Two main conclusions emerge.
First, the stability of the default performance between releases is imperfect. The default performance of some versions is pathologically bad and they should probably be avoided (versions 1.7 to 1.7.3, version 1.8.6, version 1.10.0). On the other hand, recent versions provide the lowest variations and the best default performance (with some exceptions), in particular for small messages. If we keep 1.10.2, we should probably be fine (version 1.8.4 is also worth considering).
Lastly, reducing 10 MB was never possible within the 100-second timeout. This could be related to some specific parameter that needs to be set.
Confirming this study would require performing several runs (5 for instance) while making sure nobody is using the cluster, to limit interference. To save time, only the versions known to work directly could be used, for all message sizes except 10 MB. Also, it would be good to randomize the execution order and to use the MPI benchmark feature that allows passing several sizes as arguments.
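A minimal sketch of what such an improved campaign could look like, assuming `--msizes-list` accepts a comma-separated list of sizes (to be checked against the mpibenchmark documentation):
SIZES=1,3,10,30,100,300,1000,3000,10000,30000,100000,300000,1000000,3000000
for run in $(seq 1 5)
do
    # Randomize the version order to decorrelate versions from time effects
    for dir in $(ls -d ~/ompi/openmpi-* | grep -v debug | shuf)
    do
        version=$(basename ${dir})
        cd ~ && rm -f bin && ln -s ompi/${version}/bin bin
        export LD_LIBRARY_PATH=$HOME/ompi/${version}/lib
        mkdir -p ${VERSIONS_DIR}/${version}
        timeout ${TIMEOUT} mpirun -x LD_LIBRARY_PATH \
            -n 512 --npernode 16 --hostfile ${VERSIONS_DIR}/hostfile \
            mpibenchmark --calls-list=MPI_Reduce -r ${REPETITION} \
            --msizes-list=${SIZES} > ${VERSIONS_DIR}/${version}/run_${run} 2>&1
    done
done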
## R version 3.2.3 (2015-12-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.4 LTS
##
## locale:
## [1] LC_CTYPE=fr_FR.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=fr_FR.UTF-8 LC_COLLATE=fr_FR.UTF-8
## [5] LC_MONETARY=fr_FR.UTF-8 LC_MESSAGES=fr_FR.UTF-8
## [7] LC_PAPER=fr_FR.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggplot2_1.0.1 dplyr_0.4.3 tidyr_0.2.0 purrr_0.2.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.1 knitr_1.10.5 magrittr_1.5 MASS_7.3-44
## [5] munsell_0.4.2 colorspace_1.2-6 R6_2.1.0 stringr_1.0.0
## [9] plyr_1.8.3 tools_3.2.3 parallel_3.2.3 grid_3.2.3
## [13] gtable_0.1.2 DBI_0.3.1 htmltools_0.2.6 yaml_2.1.13
## [17] assertthat_0.1 digest_0.6.8 reshape2_1.4.1 formatR_1.2
## [21] evaluate_0.7 rmarkdown_0.7 stringi_0.5-5 scales_0.2.5
## [25] proto_0.3-10