The objective is to study how the performance of MVAPICH2 varies from version to version (similarly to the previous study with Open MPI). We found 12 tarballs in the source RPM packages available on www.rpmfind.net, www.rpmseek.com and www.filewatcher.com. They correspond to the major versions released from 2006 to 2016.

Installing

Let’s first install all the versions:

function install_mvapich2 {
  ARCHIVE=$1
  CONFIG_OPTIONS=$2
  mkdir -p mvapich2
  cd ~/mvapich2
  tar -xzf ${ARCHIVE}
  VERSION=${ARCHIVE%.tar.gz}
  VERSION=${VERSION%.tgz}
  mv ${VERSION} ${VERSION}-build
  cd ${VERSION}-build
  ./configure --prefix=$HOME/mvapich2/${VERSION} ${CONFIG_OPTIONS} 2>&1
  make 2>&1 # parallel build does not work for old versions
  make install
  cd ~ && rm bin ; ln -s mvapich2/${VERSION}/bin
  cd ~/mpibenchmark-0.9.4-src/
  make clean
  rm CMakeCache.txt CMakeFiles/ -r
  cmake .
  sed "s/ENABLE_RDTSCP:BOOL=OFF/ENABLE_RDTSCP:BOOL=ON/" -i CMakeCache.txt
  sed "s/ENABLE_DOUBLE_BARRIER:BOOL=OFF/ENABLE_DOUBLE_BARRIER:BOOL=ON/" -i CMakeCache.txt
  cmake .
  make 2>&1
  mv mpibenchmark ../bin/mpibenchmark-0.9.4
}

install_mvapich2 mvapich2-0.9.8.tar.gz >> ~/mvapich2_out
install_mvapich2 mvapich2-1.0.3.tar.gz >> ~/mvapich2_out
install_mvapich2 mvapich2-1.2p1.tgz >> ~/mvapich2_out
install_mvapich2 mvapich2-1.4.tgz >> ~/mvapich2_out
install_mvapich2 mvapich2-1.6.tgz --without-hwloc 2>&1 >> ~/mvapich2_out
install_mvapich2 mvapich2-1.8a2.tgz --disable-fc 2>&1 >> ~/mvapich2_out
install_mvapich2 mvapich2-1.9a.tgz 2>&1 >> ~/mvapich2_out
# CMA is disabled because it is available starting from kernel 3.2 only
# (currently 2.6.32 on Jupiter, more than 6 years old)
install_mvapich2 mvapich2-2.0.tar.gz --without-cma 2>&1 >> ~/mvapich2_out
install_mvapich2 mvapich2-2.0b.tgz --without-cma 2>&1 >> ~/mvapich2_out
install_mvapich2 mvapich2-2.1.tar.gz --without-cma 2>&1 >> ~/mvapich2_out
install_mvapich2 mvapich2-2.2a.tar.gz --without-cma 2>&1 >> ~/mvapich2_out
install_mvapich2 mvapich2-2.2b.tar.gz --without-cma 2>&1 >> ~/mvapich2_out

We commented out each call to MPI_Reduce_scatter_block and MPI_Reduce_local in MPI Benchmark because they are not supported by versions 0.9.8, 1.0.3, 1.2p1 and 1.4. Moreover, the installation script above enables the high-resolution timers and the provided MPI barrier implementation to avoid interference with the MPI implementation. For some MVAPICH2 versions, the build system has issues, so some components (hwloc and Fortran) were deactivated by hand. For the latest versions, CMA was disabled. This is, however, an important component, especially on hierarchical machines. Hopefully, the following observations stand nonetheless.
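Commenting out the unsupported calls can be automated; here is a minimal sketch, assuming the benchmark sources are plain C files and that wrapping the offending lines in block comments suffices (the helper name and file layout are our own, not the exact commands we used):

```shell
# Hypothetical helper: comment out every line calling a collective that old
# MVAPICH2 versions do not implement. Each matching line is wrapped in /* */.
disable_unsupported_calls () {
    SRC_DIR=$1
    grep -rl -e 'MPI_Reduce_scatter_block' -e 'MPI_Reduce_local' "$SRC_DIR" |
    while read -r f; do
        sed -i -e 's|^\(.*MPI_Reduce_scatter_block.*\)$|/* \1 */|' \
               -e 's|^\(.*MPI_Reduce_local.*\)$|/* \1 */|' "$f"
    done
}
# Usage: disable_unsupported_calls ~/mpibenchmark-0.9.4-src
```

Note that the patterns match the full collective names, so plain MPI_Reduce calls (the ones we actually benchmark) are left untouched.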

Edit (from June 27th, 2016): the script enables the double barrier instead of the provided barrier implementation. Hopefully, this does not invalidate this study.

Experiment preparation

Memory problems were detected or suspected on the following nodes: 4, 8, 21 and 33. Moreover, node 19 is still heterogeneous. Eliminating all 5 nodes would leave fewer than the 32 nodes we need. We therefore keep the heterogeneous node 19 in the following experiments.

# Directory containing final results
RESULT_DIR=${PWD}/results/mvapich2-version
mkdir -p ${RESULT_DIR}
mv ~/mvapich2_out ${RESULT_DIR}

# Nodes to use for XP
> ${RESULT_DIR}/hostfile
for i in $(seq 0 3) $(seq 5 7) $(seq 9 20) $(seq 22 32) $(seq 34 35)
do
    # Version 1.2p1 does not support the shortcut version "host:16"
    for j in $(seq 1 16)
    do
        echo jupiter$i >> ${RESULT_DIR}/hostfile
    done
done
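For later versions that do accept the shortcut, the same hostfile could be generated in compact form; a sketch (we keep the expanded form above for compatibility with 1.2p1):

```shell
# Compact "host:slots" form, accepted by more recent MVAPICH2 versions
# (one line per node instead of 16 repeated lines).
> hostfile_compact
for i in $(seq 0 3) $(seq 5 7) $(seq 9 20) $(seq 22 32) $(seq 34 35)
do
    echo "jupiter$i:16" >> hostfile_compact
done
```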

# Check nobody is on the cluster
for i in $(seq 0 35); do echo -n $i; ssh jupiter$i w; done | grep -v USER

Launching script

Now everything is ready for the measurements. We are interested in the following question: how is the performance of MVAPICH2 with default settings impacted by both the version and the message size, in terms of variability (noise, stability) and central tendency?

We will measure the time to perform a reduction on 32 16-core nodes with varying message sizes and MVAPICH2 versions. We selected 6 sizes from 30 B (performance starts to change around 100 B for small messages) to 3 MB (performance is proportional to the message size for messages larger than 1 MB). We repeat each run 30 times and capture 100 measurements each time. Sizes are shuffled.

# Launch XP
TIMEOUT=100
REPETITION=100
REPEAT=30
SIZES=30,300,3000,30000,300000,3000000

ARCHIVES=$(cd ~/mvapich2 && ls *.t*gz)
for ARCHIVE in ${ARCHIVES}
do
    # Set MVAPICH2 version
    VERSION=${ARCHIVE%.tar.gz}
    VERSION=${VERSION%.tgz}
    # Old versions require the MPD daemon (I did not spend time to make it work)
    if [ "${VERSION}" = "mvapich2-0.9.8" ] || [ "${VERSION}" = "mvapich2-1.0.3" ]
    then
        continue
    fi
    cd ~ && rm bin && ln -s mvapich2/${VERSION}/bin
    echo Forcing NFS synchro for version ${VERSION}
    for i in $(seq 0 35); do ssh jupiter$i ls -l; done > /dev/null

    # Ready to start the benchmarks
    mkdir -p ${RESULT_DIR}/${VERSION}
    mpiname -a -c 2>&1 > ${RESULT_DIR}/${VERSION}/name
    mpich2version 2>&1 > ${RESULT_DIR}/${VERSION}/version
    for i in $(seq 1 ${REPEAT})
    do
        echo Iteration ${i} on ${REPEAT} with ${REPETITION} measures per size
        # Version 1.4 requires the absolute path for the command
        timeout ${TIMEOUT} mpirun_rsh -hostfile ${RESULT_DIR}/hostfile -n 512 \
            ${PWD}/bin/mpibenchmark-0.9.4 --calls-list=MPI_Reduce \
            --params=version:${VERSION},iteration:${i} \
            --msizes-list=${SIZES} -r ${REPETITION} --shuffle-jobs 2>&1 > \
                ${RESULT_DIR}/${VERSION}/result_${i}
    done
done

We can get the results with:

rsync --recursive jupiter_venus:results/mvapich2-version/* results/

Data processing and plotting scripts

Let’s read the data:

read.table.versions <- function(versions) {
  read.table.version <- function(version) {
    dirname <- paste("results/mvapich2-", version, sep = "")
    files <- list.files(dirname, pattern = "^result_")
    read.table.file <- function(filename) {
      con <- file(filename, open = "r")
      info <- readLines(con) %>%
        map(~str_match(., "#@(.*)=(.*)")[2:3]) %>%
        discard(~any(is.na(.)))
      close(con)
      data <- read.table(filename, header = TRUE)
      for (i in info)
        data[i[1]] <- type.convert(i[2])
      data
    }
    map_df(paste(dirname, files, sep = "/"), read.table.file)
  }
  map_df(versions, read.table.version)
}
versions <- c("1.2p1", "1.4", "1.6", "1.8a2", "1.9a",
              "2.0", "2.0b", "2.1", "2.2a", "2.2b")
perf.versions <- read.table.versions(versions)
## Warning in rbind_all(x, .id): Unequal factor levels: coercing to character
#perf.versions$version <- factor(perf.versions$version, levels = versions)

Let’s plot the data. First, we want to observe the variability and the global effect of the size. This may also provide a preliminary comparison between versions.

perf.versions %>%
  filter(msize %in% c(300, 30e3, 3e6)) %>%
  ggplot(aes(x = factor(iteration), y = runtime_sec)) +
  geom_boxplot(outlier.size = 1) +
  facet_grid(msize ~ version, scales = "free_y") +
  scale_y_log10() +
  annotation_logticks(sides = "l")

We can make the following observations:

Let’s focus on the median of each run:

perf.versions %>%
  group_by(iteration, version, msize) %>%
  summarise(median = median(runtime_sec)) %>%
  ggplot(aes(x = factor(msize), y = median, color = version)) +
  geom_boxplot(outlier.size = 1) +
  scale_y_log10() +
  annotation_logticks(sides = "l") +
  theme(legend.position = "bottom") +
  guides(color = guide_legend(nrow = 2, byrow = TRUE))

This confirms that the latest versions (2.0 and above) show a regression for large message sizes (>= 30 kB). However, they are better for 3 kB. For small messages, there was a steady improvement until version 2.0b, followed by a slight degradation.

With the default settings, the difference in performance between versions can reach a factor of 3 (or even 4) for 300 kB messages, and a factor of 2 is not infrequent.

In conclusion, version 1.6 should probably be avoided (large variability and pathological performance at 3 kB). The overall best version is probably 1.9a, which is only marginally suboptimal at 3 kB and has low variability.

Conclusion

Similarly to Open MPI (see the study from March 15th), the choice of the MVAPICH2 version may have a significant impact on performance when the default settings are used (a factor of two being common). Therefore, any moderate speedup of a new method over an arbitrary version of MVAPICH2 may be due to an issue in the selected version rather than to the superiority of the proposed approach.

## R version 3.3.0 (2016-05-03)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.4 LTS
## 
## locale:
##  [1] LC_CTYPE=fr_FR.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=fr_FR.UTF-8        LC_COLLATE=fr_FR.UTF-8    
##  [5] LC_MONETARY=fr_FR.UTF-8    LC_MESSAGES=fr_FR.UTF-8   
##  [7] LC_PAPER=fr_FR.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggplot2_0.9.3.1 dplyr_0.4.3     tidyr_0.2.0     purrr_0.2.0    
## [5] stringr_1.0.0  
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.1        knitr_1.10.5       magrittr_1.5      
##  [4] MASS_7.3-44        munsell_0.4        colorspace_1.2-2  
##  [7] R6_2.1.2           plyr_1.8.1         tools_3.3.0       
## [10] dichromat_2.0-0    parallel_3.3.0     grid_3.3.0        
## [13] gtable_0.1.2       pacman_0.4.1       DBI_0.3.1         
## [16] htmltools_0.2.6    yaml_2.1.13        assertthat_0.1    
## [19] digest_0.6.9       RColorBrewer_1.0-5 reshape2_1.2.2    
## [22] formatR_0.10       evaluate_0.7       rmarkdown_0.7     
## [25] labeling_0.1       stringi_0.5-5      scales_0.2.3      
## [28] proto_0.3-10