The objective is to study how the performance of MVAPICH2 varies from version to version (similar to the previous study with Open MPI). We found 12 tarballs among the RPM source packages available on www.rpmfind.net, www.rpmseek.com and www.filewatcher.com. They correspond to major versions released from 2006 to 2016.
Let’s first install all the versions:
function install_mvapich2 {
ARCHIVE=$1
CONFIG_OPTIONS=$2
mkdir -p ~/mvapich2
cd ~/mvapich2
tar -xzf ${ARCHIVE}
VERSION=${ARCHIVE%.tar.gz}
VERSION=${VERSION%.tgz}
mv ${VERSION} ${VERSION}-build
cd ${VERSION}-build
./configure --prefix=$HOME/mvapich2/${VERSION} ${CONFIG_OPTIONS} 2>&1
make 2>&1 # parallel build does not work for old versions
make install
cd ~ && rm -f bin && ln -s mvapich2/${VERSION}/bin
cd ~/mpibenchmark-0.9.4-src/
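# Note: the mpibenchmark sources were patched by hand beforehand, commenting
# out every call to MPI_Reduce_scatter_block and MPI_Reduce_local, which are
# not supported by versions 0.9.8, 1.0.3, 1.2p1 and 1.4 (see discussion below)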
make clean
rm -rf CMakeCache.txt CMakeFiles/
cmake .
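# Enable the RDTSCP-based high-resolution timers and the double barrier in
# the generated cache, then re-run cmake to propagate the options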
sed "s/ENABLE_RDTSCP:BOOL=OFF/ENABLE_RDTSCP:BOOL=ON/" -i CMakeCache.txt
sed "s/ENABLE_DOUBLE_BARRIER:BOOL=OFF/ENABLE_DOUBLE_BARRIER:BOOL=ON/" -i CMakeCache.txt
cmake .
make 2>&1
mv mpibenchmark ../bin/mpibenchmark-0.9.4
}
install_mvapich2 mvapich2-0.9.8.tar.gz >> ~/mvapich2_out 2>&1
install_mvapich2 mvapich2-1.0.3.tar.gz >> ~/mvapich2_out 2>&1
install_mvapich2 mvapich2-1.2p1.tgz >> ~/mvapich2_out 2>&1
install_mvapich2 mvapich2-1.4.tgz >> ~/mvapich2_out 2>&1
install_mvapich2 mvapich2-1.6.tgz --without-hwloc >> ~/mvapich2_out 2>&1
install_mvapich2 mvapich2-1.8a2.tgz --disable-fc >> ~/mvapich2_out 2>&1
install_mvapich2 mvapich2-1.9a.tgz >> ~/mvapich2_out 2>&1
# CMA is disabled because it is available starting from kernel 3.2 only
# (currently 2.6.32 on Jupiter, more than 6 years old)
install_mvapich2 mvapich2-2.0.tar.gz --without-cma >> ~/mvapich2_out 2>&1
install_mvapich2 mvapich2-2.0b.tgz --without-cma >> ~/mvapich2_out 2>&1
install_mvapich2 mvapich2-2.1.tar.gz --without-cma >> ~/mvapich2_out 2>&1
install_mvapich2 mvapich2-2.2a.tar.gz --without-cma >> ~/mvapich2_out 2>&1
install_mvapich2 mvapich2-2.2b.tar.gz --without-cma >> ~/mvapich2_out 2>&1
We commented out each call to MPI_Reduce_scatter_block and MPI_Reduce_local in MPI Benchmark because these operations are not supported by versions 0.9.8, 1.0.3, 1.2p1 and 1.4. Moreover, the installation script above enables high-resolution timers and the provided MPI barrier implementation, to avoid interference with the MPI implementation under test. For some MVAPICH2 versions, the build system has issues, so some components (hwloc and Fortran) were deactivated by hand. For the most recent versions, CMA was disabled; this is an important component, especially on hierarchical machines, but hopefully the following observations stand nonetheless.
Edit (from June 27th, 2016): the script enables the double barrier instead of the provided barrier implementation. Hopefully, this does not invalidate this study.
Memory problems were detected or suspected on the following nodes: 4, 8, 21 and 33. Moreover, node 19 is still heterogeneous. Excluding all five would leave only 31 nodes, but we need 32; we therefore keep the heterogeneous node in the following experiments.
# Directory containing final results
RESULT_DIR=${PWD}/results/mvapich2-version
mkdir -p ${RESULT_DIR}
mv ~/mvapich2_out ${RESULT_DIR}
# Nodes to use for the experiments
> ${RESULT_DIR}/hostfile
for i in $(seq 0 3) $(seq 5 7) $(seq 9 20) $(seq 22 32) $(seq 34 35)
do
# Version 1.2p1 does not support the shortcut syntax "host:16"
for j in $(seq 1 16)
do
echo jupiter$i >> ${RESULT_DIR}/hostfile
done
done
# Check nobody is on the cluster
for i in $(seq 0 35); do echo -n $i; ssh jupiter$i w; done | grep -v USER
Now everything is ready for the measurements. We are interested in the following question: how do the version and the message size affect the performance of MVAPICH2 with default settings, in terms of both variability (noise, stability) and central tendency?
We will measure the time to perform a reduction on 32 16-core nodes with varying message sizes and library versions. We selected 6 sizes ranging from 30 B (performance starts to change around 100 B for small messages) to 3 MB (performance is proportional to the message size for messages larger than 1 MB). We repeat the run 30 times and capture 100 measurements each time. Sizes are shuffled.
# Launch the experiments
TIMEOUT=100
REPETITION=100
REPEAT=30
SIZES=30,300,3000,30000,300000,3000000
ARCHIVES=$(cd ~/mvapich2 && ls *.t*gz)
for ARCHIVE in ${ARCHIVES}
do
# Set MVAPICH2 version
VERSION=${ARCHIVE%.tar.gz}
VERSION=${VERSION%.tgz}
# Old versions require the MPD daemon (I did not spend time making it work)
if [ "${VERSION}" = "mvapich2-0.9.8" ] || [ "${VERSION}" = "mvapich2-1.0.3" ]
then
continue
fi
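# (Running them would require starting an MPD ring first, e.g. something
# like "mpdboot -n 32 -f ${RESULT_DIR}/hostfile"; untested here)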
cd ~ && rm -f bin && ln -s mvapich2/${VERSION}/bin
echo Forcing NFS synchronization for version ${VERSION}
for i in $(seq 0 35); do ssh jupiter$i ls -l; done > /dev/null
# Ready to start the benchmarks
mkdir -p ${RESULT_DIR}/${VERSION}
mpiname -a -c > ${RESULT_DIR}/${VERSION}/name 2>&1
mpich2version > ${RESULT_DIR}/${VERSION}/version 2>&1
for i in $(seq 1 ${REPEAT})
do
echo Iteration ${i} of ${REPEAT} with ${REPETITION} measurements per size
# Version 1.4 requires the absolute path for the command
timeout ${TIMEOUT} mpirun_rsh -hostfile ${RESULT_DIR}/hostfile -n 512 \
${PWD}/bin/mpibenchmark-0.9.4 --calls-list=MPI_Reduce \
--params=version:${VERSION},iteration:${i} \
--msizes-list=${SIZES} -r ${REPETITION} --shuffle-jobs \
> ${RESULT_DIR}/${VERSION}/result_${i} 2>&1
done
done
We can get the results with:
rsync --recursive jupiter_venus:results/mvapich2-version/* results/
Let’s read the data:
read.table.versions <- function(versions) {
read.table.version <- function(version) {
dirname <- paste("results/mvapich2-", version, sep = "")
files <- list.files(dirname, pattern = "^result_")
read.table.file <- function(filename) {
con <- file(filename, open = "r")
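# Benchmark metadata lines have the form "#@key=value"; extract them and
# attach each key as an extra column of the measurement data frame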
info <- readLines(con) %>%
map(~str_match(., "#@(.*)=(.*)")[2:3]) %>%
discard(~any(is.na(.)))
close(con)
data <- read.table(filename, header = TRUE)
for (i in info)
data[i[1]] <- type.convert(i[2])
data
}
map_df(paste(dirname, files, sep = "/"), read.table.file)
}
map_df(versions, read.table.version)
}
versions <- c("1.2p1", "1.4", "1.6", "1.8a2", "1.9a",
"2.0", "2.0b", "2.1", "2.2a", "2.2b")
perf.versions <- read.table.versions(versions)
## Warning in rbind_all(x, .id): Unequal factor levels: coercing to character
#perf.versions$version <- factor(perf.versions$version, levels = versions)
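Before plotting, a quick sanity check; a minimal sketch, assuming the columns read above: each version and message size should have close to 30 runs of 100 measurements each, and runs killed by the timeout show up as smaller counts.
perf.versions %>%
  group_by(version, msize) %>%
  summarise(runs = n_distinct(iteration), measures = n()) %>%
  filter(measures < 30 * 100)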
Let’s plot the data. First, we want to observe the variability and the overall effect of the message size. This may also provide a preliminary comparison between versions.
perf.versions %>%
filter(msize %in% c(300, 30e3, 3e6)) %>%
ggplot(aes(x = factor(iteration), y = runtime_sec)) +
geom_boxplot(outlier.size = 1) +
facet_grid(msize ~ version, scales = "free_y") +
scale_y_log10() +
annotation_logticks(sides = "l")
We can already make a few observations from these plots. Let’s now focus on the median of each run:
perf.versions %>%
group_by(iteration, version, msize) %>%
summarise(median = median(runtime_sec)) %>%
ggplot(aes(x = factor(msize), y = median, color = version)) +
geom_boxplot(outlier.size = 1) +
scale_y_log10() +
annotation_logticks(sides = "l") +
theme(legend.position = "bottom") +
guides(color = guide_legend(nrow = 2, byrow = TRUE))
This confirms that the latest versions (2.0 and above) show a regression for large message sizes (>= 30 kB). However, they are better for 3 kB. For small messages, there was steady improvement until version 2.0b, followed by a slight degradation.
With the default settings, the difference in performance between versions can reach a factor of 3 (or even 4) for 300 kB, and a factor of 2 is not infrequent.
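This factor can be checked directly from the data; a minimal sketch, pooling all measurements of each version and size and comparing the extreme medians:
perf.versions %>%
  group_by(version, msize) %>%
  summarise(median = median(runtime_sec)) %>%
  group_by(msize) %>%
  summarise(ratio = max(median) / min(median))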
In conclusion, version 1.6 should probably be avoided (large variability and pathological performance for 3 kB). Also, the overall best version is probably 1.9a, which is only marginally suboptimal for 3 kB and has low variability.
Similarly to Open MPI (see the study from March 15th), the choice of the MVAPICH2 version may have a significant impact on performance when the default settings are used (a factor of two being usual). Therefore, any moderate speedup of a new method over an arbitrary version of MVAPICH2 may be due to an issue in the selected version rather than to the superiority of the proposed approach.
## R version 3.3.0 (2016-05-03)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.4 LTS
##
## locale:
## [1] LC_CTYPE=fr_FR.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=fr_FR.UTF-8 LC_COLLATE=fr_FR.UTF-8
## [5] LC_MONETARY=fr_FR.UTF-8 LC_MESSAGES=fr_FR.UTF-8
## [7] LC_PAPER=fr_FR.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggplot2_0.9.3.1 dplyr_0.4.3 tidyr_0.2.0 purrr_0.2.0
## [5] stringr_1.0.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.1 knitr_1.10.5 magrittr_1.5
## [4] MASS_7.3-44 munsell_0.4 colorspace_1.2-2
## [7] R6_2.1.2 plyr_1.8.1 tools_3.3.0
## [10] dichromat_2.0-0 parallel_3.3.0 grid_3.3.0
## [13] gtable_0.1.2 pacman_0.4.1 DBI_0.3.1
## [16] htmltools_0.2.6 yaml_2.1.13 assertthat_0.1
## [19] digest_0.6.9 RColorBrewer_1.0-5 reshape2_1.2.2
## [22] formatR_0.10 evaluate_0.7 rmarkdown_0.7
## [25] labeling_0.1 stringi_0.5-5 scales_0.2.3
## [28] proto_0.3-10