Our previous studies revealed that the default performance of Open MPI could be improved by 10-20% and up to a factor of 2-3 with some coarse tuning. Let’s analyze whether a similar observation holds for MVAPICH2.

Version study

As with Open MPI, we would like to see how stable the performance is from version to version. Let’s fetch all the versions:

VERSIONS=$(grep "^MVAPICH2[- ]" CHANGELOG | grep -i -v RC | grep -o "[0-2]\.[0-9][^ ]*")
VERSIONS="2.2b 2.2a 2.1"
for version in ${VERSIONS}
do
    wget http://mvapich.cse.ohio-state.edu/download/mvapich/mv2/mvapich2-${version}.tar.gz
done

It appears that only the last three versions are available. According to DK Panda, the team does not have the resources to support previous versions. Moreover, it seems they do not have the time to keep the previous tarballs online, which still puzzles me. This is an issue because it complicates reproducibility attempts and is at odds with the open science movement.

This comparison would have been interesting to assess the reproducibility we could expect from the literature, given that different studies use different versions. We will therefore focus on the latest version only.

VERSION=mvapich2-2.2b
tar -xzvf ${VERSION}.tar.gz
mv ${VERSION} ${VERSION}-build
cd ${VERSION}-build
# CMA is disabled because it requires kernel 3.2 or later (Jupiter currently runs 2.6.32, which is more than 6 years old)
./configure --prefix=$HOME/mvapich2/${VERSION} --without-cma
make -j 4
make install
# Second build with debugging and verbose error messages enabled
./configure --prefix=$HOME/mvapich2/${VERSION}-debug --without-cma --enable-g=all --enable-error-messages=all
make -j 4
make install
# Point ~/bin to the release build and rebuild the benchmark against it
cd ~ && rm bin && ln -s mvapich2/${VERSION}/bin
cd ~/mpibenchmark-0.9.4-src/ && make && mv mpibenchmark ../bin/mpibenchmark-0.9.4
# Same for the debug build
cd ~ && rm bin && ln -s mvapich2/${VERSION}-debug/bin
cd ~/mpibenchmark-0.9.4-src/ && make && mv mpibenchmark ../bin/mpibenchmark-0.9.4

Tuning parameters

Let’s consider each tuning parameter in turn, among a restricted set, to see if we can significantly improve the default performance of MVAPICH2 with a limited tuning effort. We therefore ignore any interaction between the effects of the parameters.

There are many parameters: 170 are described in the user guide of version 2.2b and 400 are actually available in the source code (grep getenv -R | grep -o "MV2[^\"]*" | sort | uniq | wc -l).
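
For reference, here is a sketch of how these counts can be obtained, assuming the current directory is the top of the extracted source tree; userguide.txt stands for a hypothetical plain-text export of the user guide and is not a file shipped with MVAPICH2:

# Environment variables read in the source code (same pipeline as above)
grep getenv -R | grep -o "MV2[^\"]*" | sort | uniq | wc -l
# Parameters described in the documentation (userguide.txt is an assumed text export)
grep -o "MV2_[A-Z0-9_]*" userguide.txt | sort | uniq | wc -l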

We consider only the most promising ones, which are tested one by one below.

Note that MV2_TWO_LEVEL_COMM_THRESHOLD is undocumented, along with 13 other options whose names contain the string REDUCE.
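
The undocumented options mentioning REDUCE can be listed with a similar sketch, by diffing the names found in the source code against those in the hypothetical userguide.txt export introduced above:

# Options read in the source code but absent from the user guide
comm -23 \
    <(grep getenv -R | grep -o "MV2_[A-Z0-9_]*" | grep REDUCE | sort -u) \
    <(grep -o "MV2_[A-Z0-9_]*" userguide.txt | grep REDUCE | sort -u)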

Experiment preparation

Let’s set up this environment:

# Directory containing final results
RESULT_DIR=${PWD}/results/mvapich2
mkdir -p ${RESULT_DIR}

# Nodes to use for the experiment
> ${RESULT_DIR}/hostfile
for i in $(seq 3 18) $(seq 20 35)
do
    echo jupiter$i:16 >> ${RESULT_DIR}/hostfile
done

# Check nobody is on the cluster
for i in $(seq 0 35); do echo -n $i; ssh jupiter$i w; done | grep -v USER

# Set MVAPICH2 version
VERSION=mvapich2-2.2b
cd ~ && rm bin && ln -s mvapich2/${VERSION}/bin
echo Forcing NFS synchro
for i in $(seq 0 35); do ssh jupiter$i ls -l; done > /dev/null

And the script to launch the benchmark:

# Launch the benchmark, varying one parameter over a list of values
function variate_parameter {
    # General variables
    TIMEOUT=100
    REPETITION=30
    REPEAT=30
    SIZES=100,300,1000,3000,10000,30000,100000,300000,1000000,3000000,10000000
    RESULT_DIR=${PWD}/results/mvapich2

    PARAMETER=$1      # environment variable to vary
    VALUES=$2         # space-separated list of values to test
    SETTINGS=${@:3}   # additional fixed KEY=VALUE settings
    mkdir -p ${RESULT_DIR}/${PARAMETER}
    mpiname -a -c > ${RESULT_DIR}/${PARAMETER}/info
    for i in $(seq 1 ${REPEAT})
    do
        echo Iteration ${i} on ${REPEAT} with ${REPETITION} measures per size
        for VALUE in $(shuf -e -- ${VALUES})
        do
            echo -n " Launch benchmark"
            echo " for parameter ${PARAMETER} with value ${VALUE}"
            timeout ${TIMEOUT} mpirun_rsh -hostfile ${RESULT_DIR}/hostfile \
                 -n 512 ${PARAMETER}=${VALUE} ${SETTINGS} \
                mpibenchmark-0.9.4 --calls-list=MPI_Reduce \
                --params=parameter:${PARAMETER},value:${VALUE},iteration:${i} \
                --msizes-list=${SIZES} -r ${REPETITION} --shuffle-jobs 2>&1 > \
                    ${RESULT_DIR}/${PARAMETER}/result_${VALUE}_${i}
        done
    done
}

Retrieving the results will be as simple as rsync --recursive jupiter:results/mvapich2/* results/, and we will use the same function as before to parse them:

library(stringr)
library(purrr)
library(dplyr)
library(tidyr)
library(ggplot2)
read.table.parameter <- function(parameter) {
  dirname <- paste("results", parameter, sep = "/")
  files <- list.files(dirname, pattern = "result_*")
  read.table.file <- function(filename) {
    con <- file(filename, open = "r")
    info <- map(readLines(con), ~ str_match(., "#@(.*)=(.*)")[2:3]) %>%
      discard(~ any(is.na(.)))
    close(con)
    data <- read.table(filename, header = TRUE)
    for (i in info)
      data[i[1]] <- type.convert(i[2])
    data
  }
  map_df(paste(dirname, files, sep = "/"), read.table.file)
}

Plotting will be done with:

ggplot.median <- function(data) {
  group_by(data, msize, value, iteration) %>%
    summarise(median = median(runtime_sec)) %>%
    ggplot(aes(x = factor(value), y = median)) +
    facet_wrap(~ msize, ncol = 4, scales = "free_y") +
    geom_boxplot() +
    scale_y_log10() +
    annotation_logticks(sides = "l")
}

General tuning parameters

Let’s start with MV2_TWO_LEVEL_COMM_THRESHOLD. It is an undocumented parameter, but it may have a significant impact on the benchmark since the algorithm may change during a run (after 16 calls by default). This behavior was pointed out and verified by colleagues. We set the parameter to 0 to force the shared-memory mechanism as soon as possible; since 0 may be interpreted as NULL, we also test with 1.

variate_parameter MV2_TWO_LEVEL_COMM_THRESHOLD "0 1 16 1000000"
rsync --recursive jupiter:results/mvapich2/* results/ # on local machine

Let’s plot the medians:

comm_threshold <- read.table.parameter("MV2_TWO_LEVEL_COMM_THRESHOLD")
ggplot.median(comm_threshold)

The shared-memory mechanism significantly improves the performance for small messages (by a factor of two). For size 10 kB, the performance is slightly better with the non-shared memory mechanism. It may be related to a switch to another algorithm that occurs for this size.

The runs with 0, 1 and 16 are extremely close. This is strange because the median should have been impacted if the first 16 calls used another algorithm. Let’s plot all the iterations for sizes 100 B, 10 kB and 1 MB:

comm_threshold %>%
  filter(msize %in% c(1e2, 1e4, 1e6)) %>%
  group_by(iteration, value, msize) %>%
  ggplot(aes(x = factor(value), y = runtime_sec, color = factor(iteration))) +
  facet_wrap(~ msize, ncol = 4, scales = "free_y") +
  geom_boxplot(outlier.size = 1) +
  scale_y_log10() +
  annotation_logticks(sides = "l") +
  theme(legend.position = "bottom") +
  guides(color = guide_legend(nrow = 2, byrow = TRUE))

It seems as if 0, 1 and 16 have the same effect. Let’s analyze the effect of this parameter in more detail. We consider 100 repetitions and one run for each value of this parameter from 0 to 500:

RESULT_DIR=${PWD}/results/mvapich2
for i in $(seq 0 500)
do
    mpirun_rsh -n 512 -hostfile ${RESULT_DIR}/hostfile \
        MV2_TWO_LEVEL_COMM_THRESHOLD=$i mpibenchmark-0.9.4 \
            --calls-list=MPI_Reduce -r 100 --msizes-list=100 --summary \
    | grep " MPI_Reduce " | grep 100
done > ${RESULT_DIR}/MV2_TWO_LEVEL_COMM_THRESHOLD/detail
rsync --recursive jupiter:results/mvapich2/* results/ # on local machine

Let’s plot how the means and medians behave:

read.table("results/MV2_TWO_LEVEL_COMM_THRESHOLD/detail", sep = "") %>%
  mutate(mean = V5, median = V6, id = 1:n()) %>%
  gather(measure, time, mean, median) %>%
  ggplot(aes(x = id, y = time)) +
  facet_wrap(~ measure, scales = "free_y") +
  geom_point()

A possible explanation is that the reduction code that increments this counter is called twice for each global call to MPI_Reduce. In that case, in the previous test, only the first 8 measured calls would have used the first algorithm, and the median may have hidden this. What actually happens is that MPI Benchmark uses a barrier from the MPI implementation, and this barrier also counts towards the threshold. This also explains the two modes for the means: when the threshold is odd, the switch to the other algorithm happens during the barrier, which is not measured.

This is an example of an interaction between the benchmark code and the MPI implementation. We expect such interactions to occur as long as the benchmark code relies on the MPI implementation it measures. From now on, in addition to using the non-default high-resolution timer, we also configure MPI Benchmark to use its own barrier implementation, which relies only on MPI send and receive calls.

Let’s now try the registered-memory option MV2_USE_LAZY_MEM_UNREGISTER. We leave the previous parameter unset to see if there is a difference.

variate_parameter MV2_USE_LAZY_MEM_UNREGISTER "0 1"
rsync --recursive jupiter:results/mvapich2/* results/ # on local machine

Let’s plot the medians:

mem_unregister <- read.table.parameter("MV2_USE_LAZY_MEM_UNREGISTER")
ggplot.median(mem_unregister)

The difference is only significant for sizes between 30 kB and 1 MB; it is no longer significant for sizes >= 3 MB. We will keep the default.

Let’s now try the eager limit. Note that it is 12288 by default for Open MPI, where a smaller value improved performance. Here, let’s test larger values, increasing MV2_VBUF_TOTAL_SIZE accordingly:

variate_parameter MV2_IBA_EAGER_THRESHOLD "65536" \
    "MV2_VBUF_TOTAL_SIZE=65536 MV2_TWO_LEVEL_COMM_THRESHOLD=0"
variate_parameter MV2_IBA_EAGER_THRESHOLD "131072" \
    "MV2_VBUF_TOTAL_SIZE=131072 MV2_TWO_LEVEL_COMM_THRESHOLD=0"
variate_parameter MV2_IBA_EAGER_THRESHOLD "262144" \
    "MV2_VBUF_TOTAL_SIZE=262144 MV2_TWO_LEVEL_COMM_THRESHOLD=0"
rsync --recursive jupiter:results/mvapich2/* results/ # on local machine

Let’s plot the medians:

eager_threshold <- read.table.parameter("MV2_IBA_EAGER_THRESHOLD")
ggplot.median(eager_threshold)

The effect is not clear. We will keep the default settings since they provide similar performance.

There is also a parameter for the maximum message size handled by the shared-memory collectives, MV2_SHMEM_COLL_MAX_MSG_SIZE. The default value is 131072.

variate_parameter MV2_SHMEM_COLL_MAX_MSG_SIZE "65536 131072 262144" \
    "MV2_TWO_LEVEL_COMM_THRESHOLD=0"
rsync --recursive jupiter:results/mvapich2/* results/ # on local machine

Let’s plot the medians:

max_size <- read.table.parameter("MV2_SHMEM_COLL_MAX_MSG_SIZE")
ggplot.median(max_size)

Again, the effect is not noticeable. We will keep the default.

Let’s see if the effect of the binding policy is the same as with Open MPI:

variate_parameter MV2_CPU_BINDING_POLICY "scatter" \
    "MV2_TWO_LEVEL_COMM_THRESHOLD=0"
variate_parameter MV2_ENABLE_AFFINITY "0" \
    "MV2_TWO_LEVEL_COMM_THRESHOLD=0"
rsync --recursive jupiter:results/mvapich2/* results/ # on local machine

Let’s aggregate the results and plot the medians:

binding <- rbind(read.table.parameter("MV2_CPU_BINDING_POLICY"),
                 read.table.parameter("MV2_ENABLE_AFFINITY") %>%
                   mutate(value = as.character(value)),
                 max_size %>% filter(value == 131072) %>%
                   mutate(value = as.character(value)))
ggplot.median(binding)

Without binding, the number of outliers is much higher. Other than that, there is still no significant effect, even though the default binding (“bunch”, from the previous experiment) may be better.

Summary of MVAPICH2 tuning

Finding relevant parameters is more difficult than with Open MPI because of the lack of documentation, but also because it is not clear which part of the code is called (whereas Open MPI’s MCA architecture makes it easy to deactivate entire components). The only parameters with a noticeable effect are the ones kept for the final comparison below.

Let’s compare, for each size, the minimum medians we obtained against the default settings. We need a new execution with the default parameters, since the first one (where MV2_TWO_LEVEL_COMM_THRESHOLD was set to 16) was impacted by the barrier of MPI Benchmark.

variate_parameter DEFAULT "\"\""
rsync --recursive jupiter:results/mvapich2/* results/ # on local machine

Let’s keep the median of all the medians over the iterations and plot, for each size, the minimum over the tested parameter values:

read.table.parameter("DEFAULT") %>%
  cbind(tuned = "no") %>%
  rbind(rbind(comm_threshold, knomial) %>%
          mutate(value = as.character(value)) %>%
          cbind(tuned = "coarse")) %>%
  group_by(tuned, msize, value, iteration) %>%
  summarise(median = median(runtime_sec)) %>%
  summarise(medmed = median(median)) %>%
  summarise(minmed = min(medmed)) %>%
  # spread(tuned, minmed) %>%
  # ggplot(aes(x = msize, y = no / coarse)) +
  ggplot(aes(x = msize, y = minmed, color = tuned, shape = tuned)) +
  geom_line() +
  geom_point() +
  scale_x_log10() +
  scale_y_log10() +
  annotation_logticks(sides = "l")

There are some improvements: a 2x speedup for messages <= 1 kB (most probably due to the MV2_TWO_LEVEL_COMM_THRESHOLD parameter) and around 20% for the other sizes.

Comparison with Open MPI

In contrast to Open MPI, selecting a specific algorithm and providing specific dynamic rules is significantly more difficult with MVAPICH2. On the other hand, it is also harder to improve the performance by changing the default settings than with Open MPI.

Even when Open MPI is coarsely tuned, MVAPICH2 provides better performance for small messages. We suspect this is due to the two-level approach, which we did not test with Open MPI. Let’s test the hierarchical components (hierarch and ml) by setting their MCA priorities directly.

# Directory containing final results
RESULT_DIR=${PWD}/results/ompi
mkdir -p ${RESULT_DIR}

# Nodes to use for the experiment
> ${RESULT_DIR}/hostfile
for i in $(seq 3 18) $(seq 20 35)
do
    echo jupiter$i >> ${RESULT_DIR}/hostfile
done

# Clean environment
mpirun --pernode --hostfile ${RESULT_DIR}/hostfile orte-clean

# Set Open MPI version
VERSION=openmpi-1.10.2
cd ~ && rm bin && ln -s ompi/${VERSION}/bin bin
echo Forcing NFS synchro
for i in $(seq 0 35); do ssh jupiter$i ls -l; done > /dev/null

# Launch the benchmark with a given Open MPI hierarchical component enabled
function launch_ompi_hierarchy {
    TIMEOUT=100
    REPETITION=30
    REPEAT=30
    SIZES=100,300,1000,3000,10000,30000,100000,300000,1000000,3000000,10000000
    
    HIERARCHY=$1
    MCA_ARG=${@:2}
    SETTINGS="--mca hwloc_base_binding_policy core \
        --mca mpi_leave_pinned 0 \
        --mca btl_openib_eager_limit 4096"
    mkdir -p ${RESULT_DIR}/${HIERARCHY}
    ompi_info -c > ${RESULT_DIR}/${HIERARCHY}/info
    mpirun --version > ${RESULT_DIR}/${HIERARCHY}/version
    echo Launch benchmark with ${HIERARCHY} tuning
    for i in $(seq 1 ${REPEAT})
    do
        echo Iteration ${i} on ${REPEAT} with ${REPETITION} measures per size
        timeout ${TIMEOUT} mpirun ${SETTINGS} ${MCA_ARG} \
            -n 512 --npernode 16 --hostfile ${RESULT_DIR}/hostfile \
            mpibenchmark-0.9.4 --calls-list=MPI_Reduce \
            --params=parameter:hierarchy,value:${HIERARCHY},iteration:${i} \
            --msizes-list=${SIZES} -r ${REPETITION} --shuffle-jobs 2>&1 > \
                ${RESULT_DIR}/${HIERARCHY}/result_${i}
    done
}

launch_ompi_hierarchy hierarch --mca coll_hierarch_priority 90
launch_ompi_hierarchy ml --mca coll_ml_priority 90
rsync --recursive jupiter:results/ompi/ results/ompi # on local machine

Let’s read the values and plot the medians:

read.table.parameter("ompi/ml") %>%
  rbind(read.table.parameter("ompi/hierarch")) %>%
  select(msize, value, iteration, runtime_sec) %>%
  rbind(read.table.parameter("../../ompi/algorithms/results/tuned/coarse") %>%
          mutate(value = "ompi-coarse") %>%
          select(msize, value, iteration, runtime_sec)) %>%
  group_by(msize, value, iteration) %>%
  summarise(median = median(runtime_sec)) %>%
  summarise(medmed = median(median)) %>%
  rbind(rbind(comm_threshold, knomial) %>%
          group_by(msize, value, iteration) %>%
          summarise(median = median(runtime_sec)) %>%
          summarise(medmed = median(median)) %>%
          summarise(minmed = min(medmed)) %>%
          mutate(medmed = minmed, value = "best-MVAPICH2") %>%
          select(msize, value, medmed)) %>%
  group_by(msize) %>%
  mutate(normtime = medmed / min(medmed)) %>%
  gather(perf, time, medmed, normtime) %>%
  ggplot(aes(x = msize, y = time, color = value, shape = value)) +
  facet_wrap(~ perf, scales = "free_y") +
  geom_line() +
  geom_point() +
  scale_x_log10() +
  scale_y_log10(breaks = c(0.001, 0.1, 1:10)) +
  annotation_logticks(sides = "l") +
  theme(legend.position = "bottom")

We see that for messages <= 1 kB, the hierarch component of Open MPI provides better performance than the coarsely tuned parameters. It even outperforms MVAPICH2 for size 100 B. Open MPI without a hierarchical component remains, however, the best for sizes >= 3 kB. Overall, MVAPICH2 is significantly better only for 1 kB. Note that MVAPICH2 is significantly outperformed for large messages (by a factor of two for sizes >= 100 kB). Its performance is comparable to that of the ml component, which offers a similar hierarchical mechanism.

Conclusion

Let’s conclude with some pros/cons for Open MPI and MVAPICH2:

Open MPI

Pros:

  • MCA architecture provides great flexibility (many available components and alternatives)
  • large FAQ

Cons:

  • outdated FAQ
    • no vader tuning (only old sm)
    • vader is said to require CMA but it works even without it
  • no hierarchical collective by default
  • mpi_leave_pinned enabled by default (better to deactivate it)
  • socket binding by default (core is better)
  • bad algorithm selection by default

MVAPICH2

Pros:

  • provides tuning success stories (best practices)
  • mpiname -a provides the configure command

Cons:

  • old versions are unavailable
  • private development (empty commit messages on the public SVN)
  • 230 undocumented options (for instance MV2_TWO_LEVEL_COMM_THRESHOLD, which greatly impacts performance)
  • installation does not detect CMA unavailability
  • options are not sorted alphabetically in the documentation
## R version 3.2.4 Revised (2016-03-16 r70336)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.4 LTS
## 
## locale:
##  [1] LC_CTYPE=fr_FR.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=fr_FR.UTF-8        LC_COLLATE=fr_FR.UTF-8    
##  [5] LC_MONETARY=fr_FR.UTF-8    LC_MESSAGES=fr_FR.UTF-8   
##  [7] LC_PAPER=fr_FR.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] stringr_1.0.0 ggplot2_1.0.1 dplyr_0.4.3   tidyr_0.2.0   purrr_0.2.0  
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.1      knitr_1.12.3     magrittr_1.5     MASS_7.3-44     
##  [5] munsell_0.4.2    colorspace_1.2-6 R6_2.1.0         plyr_1.8.3      
##  [9] tools_3.2.4      parallel_3.2.4   grid_3.2.4       gtable_0.1.2    
## [13] DBI_0.3.1        htmltools_0.2.6  yaml_2.1.13      assertthat_0.1  
## [17] digest_0.6.8     reshape2_1.4.1   formatR_1.2      evaluate_0.8.3  
## [21] rmarkdown_0.7    stringi_0.5-5    scales_0.2.5     proto_0.3-10