The objective is to reproduce the trends observed in Figure 5 of “HierKNEM: An Adaptive Framework for Kernel-Assisted and Topology-Aware Collective Communications on Many-core Clusters” (2012).
Let’s retrieve the archive (locally):
hg clone https://bitbucket.org/tengma/hierknem
tar -czf hierknem.tar.gz hierknem
ssh jupiter_venus mkdir -p hierknem
scp hierknem.tar.gz jupiter_venus:hierknem/
ssh jupiter_venus "cd hierknem && tar -xzf hierknem.tar.gz"
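Since the Mercurial history is inspected later, it may be useful to record which changeset was archived (a small traceability step, run locally):
# record the changeset that was archived
hg log -l 1 -R hierknem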
Let’s build the source (on Jupiter). We first need autoconf:
wget http://ftp.gnu.org/gnu/autoconf/autoconf-2.65.tar.gz
scp autoconf-2.65.tar.gz jupiter_venus:hierknem/
# on Jupiter
cd ~/hierknem
tar -xzf autoconf-2.65.tar.gz
cd autoconf-2.65
./configure --prefix=$HOME/hierknem/ && make all && make install
Then, we need libtool:
wget http://fr.mirror.babylon.network/gnu/libtool/libtool-2.2.6b.tar.gz
scp libtool-2.2.6b.tar.gz jupiter_venus:hierknem/
# on Jupiter
cd ~/hierknem
tar -xzf libtool-2.2.6b.tar.gz
cd libtool-2.2.6b
./configure --prefix=$HOME/hierknem/ && make all && make install
We also need patch:
wget ftp://ftp.gnu.org/gnu/patch/patch-2.7.5.tar.gz
scp patch-2.7.5.tar.gz jupiter_venus:hierknem/
# on Jupiter
cd ~/hierknem
tar -xzf patch-2.7.5.tar.gz
cd patch-2.7.5
./configure --prefix=$HOME/hierknem/ && make all && make install
We also need Bison and then Flex:
wget http://ftp.gnu.org/gnu/bison/bison-3.0.tar.gz
scp bison-3.0.tar.gz jupiter_venus:hierknem/
# on Jupiter
cd ~/hierknem
tar -xzf bison-3.0.tar.gz
cd bison-3.0
./configure --prefix=$HOME/hierknem/ && make all && make install
wget http://downloads.sourceforge.net/project/flex/flex-2.6.0.tar.gz
scp flex-2.6.0.tar.gz jupiter_venus:hierknem/
# on Jupiter
cd ~/hierknem
tar -xzf flex-2.6.0.tar.gz
cd flex-2.6.0
./configure --prefix=$HOME/hierknem/ && make all && make install
We need some files from the official Open MPI source:
cd ~/hierknem/hierknem
cp ../openmpi-1.5.4-src/ompi/mca/io/romio/romio/test/Makefile.in \
./ompi/mca/io/romio/romio/test/
cp -r ../openmpi-1.5.4-src/ompi/contrib/vt/vt/tools/opari/lib/ \
./ompi/contrib/vt/vt/tools/opari/
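The ../openmpi-1.5.4-src directory is assumed to already contain the extracted official Open MPI 1.5.4 sources. If it is missing, something along these lines should recreate it (the download URL is our assumption):
# on Jupiter, in ~/hierknem: fetch and extract the official 1.5.4 sources
wget https://download.open-mpi.org/release/open-mpi/v1.5/openmpi-1.5.4.tar.gz
tar -xzf openmpi-1.5.4.tar.gz
mv openmpi-1.5.4 openmpi-1.5.4-src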
Now for HierKNEM:
PATH=$HOME/hierknem/bin:$PATH
./autogen.sh
./configure --prefix=$HOME/hierknem/ && make all && make install
cd ~ && rm bin && ln -s hierknem/bin bin
# Build the benchmark
cd ~/mpibenchmark-0.9.4-src/
make clean
rm -r CMakeCache.txt CMakeFiles/
cmake .
sed "s/ENABLE_RDTSCP:BOOL=OFF/ENABLE_RDTSCP:BOOL=ON/" -i CMakeCache.txt
sed "s/ENABLE_BENCHMARK_BARRIER:BOOL=OFF/ENABLE_BENCHMARK_BARRIER:BOOL=ON/" -i CMakeCache.txt
cmake .
make 2>&1
mv mpibenchmark ../bin/mpibenchmark-0.9.4
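As an aside, the two sed edits of CMakeCache.txt could presumably be replaced by passing the same options directly to cmake (same option names as in the cache; untested alternative):
cmake -DENABLE_RDTSCP:BOOL=ON -DENABLE_BENCHMARK_BARRIER:BOOL=ON .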
KNEM is not installed. However, the article states that what is important is that all approaches rely on the same shared-memory mechanism (so that “any performance difference roots solely in the proposed collective operation innovations”).
Let’s check how to use the three Open MPI mechanisms: HierKNEM, hier-OMPI and the tuned component. Looking at the log of the Mercurial repository, it seems that the HierKNEM code was introduced in the hierarch MCA module. Let’s try it with a simple benchmark and compare its performance with the default hierarch module. Strangely, the source is based on version 1.5.4 whereas the article states that it is based on version 1.5.3. We did not find any significant difference between the two versions, so we will use the more recent one (1.5.4) for convenience.
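Before launching anything, a quick sanity check with ompi_info shows which coll components each build provides and what parameters the hierarch component exposes (the exact output differs between Open MPI versions):
# list the compiled coll components (hierarch should appear in the HierKNEM build)
ompi_info | grep "MCA coll"
# show the parameters of the hierarch component, including its priority
# (on 1.10, adding --level 9 may be needed to see all parameters)
ompi_info --param coll hierarch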
# Directory containing final results
RESULT_DIR=${PWD}/results/ma2012a
mkdir -p ${RESULT_DIR}
# Nodes to use for XP
> ${RESULT_DIR}/hostfile
for i in $(seq 0 3) $(seq 5 7) $(seq 9 20) $(seq 22 32) $(seq 34 35)
do
echo jupiter$i >> ${RESULT_DIR}/hostfile
done
function launch_ompi_hierarchy {
VERSION=$1
# Set Open MPI version
cd ~ && rm bin && ln -s ${VERSION}/bin bin
echo Forcing NFS synchro
for i in $(seq 0 35); do ssh jupiter$i ls -l; done > /dev/null
# Required for Open MPI 1.5.4
export LD_LIBRARY_PATH=$PWD/bin/../lib
# Clean environment
mpirun --pernode --hostfile ${RESULT_DIR}/hostfile orte-clean
# Launch
mpirun --mca coll_hierarch_priority 90 -n 512 --npernode 16 \
--hostfile ${RESULT_DIR}/hostfile -x LD_LIBRARY_PATH \
mpibenchmark-0.9.4 --calls-list=MPI_Reduce \
--msizes-list=3000000,30000,300 -r 100 --summary
}
launch_ompi_hierarchy hierknem
launch_ompi_hierarchy ompi/openmpi-1.5.4
Sizes are given in decreasing order to let the system warm up on the large messages, for which the warm-up should have no impact. We obtain:
test msize total_nrep valid_nrep mean_sec median_sec min_sec max_sec
MPI_Reduce 3000000 100 100 0.1445856543 0.1171314470 0.1133404924 1.8699647207
MPI_Reduce 30000 100 100 0.0022100767 0.0022005541 0.0021528071 0.0029199384
MPI_Reduce 300 100 100 0.0001036845 0.0001023058 0.0000845483 0.0002016388
And:
test msize total_nrep valid_nrep mean_sec median_sec min_sec max_sec
MPI_Reduce 3000000 100 100 0.1791463049 0.1781879868 0.1355293011 0.1995848902
MPI_Reduce 30000 100 100 0.0007292582 0.0007208847 0.0006974721 0.0011461144
MPI_Reduce 300 100 100 0.0000855087 0.0000851271 0.0000612736 0.0001231739
The two versions show distinct performance. As a reminder, the previous XP showed that the best times were around 40 µs for 300 B, 0.4 ms for 30 kB and 20 ms for 3 MB. Even the best of these measures are still far from those times. We will use some basic settings to improve the performance.
The article mentions that “Both HierKNEM’s Broadcast and Reduce algorithms use the pipeline size in Table I in the following tests.” However, there is no need to specify the pipeline size because it is hard-coded.
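Where exactly it is hard-coded can presumably be located by grepping the hierarch component sources (the path follows the repository layout; the search terms are only a guess):
# look for the hard-coded pipeline/segment size in the HierKNEM sources
grep -rni "pipeline\|segsize" ~/hierknem/hierknem/ompi/mca/coll/hierarch/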
Let’s enforce per-core process binding, since Section IV.A suggests this is the case. The sizes go from 2 kB to 16 MB (we suspect this actually means 2 kiB to 16 MiB). The number of iterations defined by IMB is size-dependent; we will use the same settings as when repeating venkata2013a (REPETITION=100 and REPEAT=30). The latest IMB version (4.1) uses a different root rank at each iteration. Since there are 512 processes, we will increase the rank of the root process by 17 at each step (30*17=510).
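As a quick check, this root rotation uses 30 distinct roots, all below 512:
# print the successive roots: 0, 17, ..., 493
ROOT=0
for i in $(seq 1 30); do echo -n "${ROOT} "; ROOT=$((ROOT+17)); done; echo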
RESULT_DIR=${PWD}/results/ma2012a
TIMEOUT=500
REPETITION=30
REPEAT=30
SIZES=2048,4096,8192,16384,32768,65536,131072,262144,524288,1048576,2097152,4194304,8388608,16777216
SETTINGS="--mca hwloc_base_binding_policy core \
--mca mpi_leave_pinned 0"
function launch_hierknem_ompi {
VERSION=$1
PRIORITY=$2
# Set Open MPI version
PREFIX="" # no subdirectory prefix for the hierknem build
if [ "${VERSION}" = "openmpi-1.5.4" ] || [ "${VERSION}" = "openmpi-1.10.2" ]
then
PREFIX=ompi/
fi
cd ~ && rm bin && ln -s ${PREFIX}${VERSION}/bin bin
echo Forcing NFS synchro for ${VERSION}
for i in $(seq 0 35); do ssh jupiter$i ls -l; done > /dev/null
# Required for Open MPI 1.5.4
export LD_LIBRARY_PATH=$PWD/${PREFIX}${VERSION}/lib
# Clean environment
mpirun --pernode --hostfile ${RESULT_DIR}/hostfile orte-clean
# Ready to start the benchmarks
mkdir -p ${RESULT_DIR}/${VERSION}_${PRIORITY}
ompi_info -c > ${RESULT_DIR}/${VERSION}_${PRIORITY}/info
mpirun --version &> ${RESULT_DIR}/${VERSION}_${PRIORITY}/version
ROOT=0
for i in $(seq 1 ${REPEAT})
do
echo Iteration ${i} on ${REPEAT} with ${REPETITION} measures per size
timeout ${TIMEOUT} mpirun ${SETTINGS} -x LD_LIBRARY_PATH \
--mca coll_hierarch_priority ${PRIORITY} -n 512 --npernode 16 \
--hostfile ${RESULT_DIR}/hostfile \
mpibenchmark-0.9.4 --calls-list=MPI_Reduce --msizes-list=${SIZES} \
--params=version:${VERSION}_${PRIORITY},iteration:${i} \
-r ${REPETITION} --shuffle-jobs --root-proc=${ROOT} &> \
${RESULT_DIR}/${VERSION}_${PRIORITY}/result_${i}
ROOT=$((ROOT+17))
done
}
launch_hierknem_ompi hierknem 90 # hierknem (proposed mechanism)
launch_hierknem_ompi hierknem 0 # tuned at the time
launch_hierknem_ompi openmpi-1.5.4 90 # hierarch at the time
launch_hierknem_ompi openmpi-1.5.4 0 # tuned at the time
launch_hierknem_ompi openmpi-1.10.2 90 # hierarch
launch_hierknem_ompi openmpi-1.10.2 0 # tuned
We completed the measures with a more recent version (1.10.2) that is known to have few performance issues.
MVAPICH2 version 1.7 is no longer available. We will thus use the closest version that provides the same features (hence a more recent one) and has no known performance problem: 1.8a2.
# Nodes to use for XP
> ${RESULT_DIR}/hostfile_mvapich2
for i in $(seq 0 3) $(seq 5 7) $(seq 9 20) $(seq 22 32) $(seq 34 35)
do
echo jupiter$i:16 >> ${RESULT_DIR}/hostfile_mvapich2
done
cd ~ && rm bin && ln -s mvapich2/mvapich2-1.8a2/bin bin
echo Forcing NFS synchro for mvapich2-1.8a2
for i in $(seq 0 35); do ssh jupiter$i ls -l; done > /dev/null
mkdir -p ${RESULT_DIR}/mvapich2-1.8a2
mpiname -a -c &> ${RESULT_DIR}/mvapich2-1.8a2/name
mpich2version &> ${RESULT_DIR}/mvapich2-1.8a2/version
ROOT=0
for i in $(seq 1 ${REPEAT})
do
echo Iteration ${i} on ${REPEAT} with ${REPETITION} measures per size
timeout ${TIMEOUT} mpirun_rsh -hostfile ${RESULT_DIR}/hostfile_mvapich2 -n 512 \
mpibenchmark-0.9.4 --calls-list=MPI_Reduce --msizes-list=${SIZES} \
--params=version:mvapich2-1.8a2,iteration:${i} \
-r ${REPETITION} --shuffle-jobs --root-proc=${ROOT} &> \
${RESULT_DIR}/mvapich2-1.8a2/result_${i}
ROOT=$((ROOT+17))
done
We can get the results with:
rsync --recursive jupiter_venus:results/ma2012a/* results/
Let’s read the data:
read.table.hierknem <- function(versions) {
read.table.hier <- function(version) {
dirname <- paste("results/", version, sep = "")
files <- list.files(dirname, pattern = "^result_")
read.table.file <- function(filename) {
# The benchmark output contains metadata lines of the form "#@key=value"
# (e.g. the --params values); extract them to add as extra columns.
con <- file(filename, open = "r")
info <- readLines(con) %>%
map(~str_match(., "#@(.*)=(.*)")[2:3]) %>%
discard(~any(is.na(.)))
close(con)
data <- read.table(filename, header = TRUE)
for (i in info)
data[i[1]] <- type.convert(i[2])
data
}
map_df(paste(dirname, files, sep = "/"), read.table.file)
}
map_df(versions, read.table.hier)
}
hier <- c("hierknem_90", "openmpi-1.5.4_90", "openmpi-1.10.2_90",
"hierknem_0", "openmpi-1.5.4_0", "openmpi-1.10.2_0", "mvapich2-1.8a2")
perf.hier <- read.table.hierknem(hier)
## Warning in rbind_all(x, .id): Unequal factor levels: coercing to character
perf.hier$version <- factor(perf.hier$version, levels = hier)
Let’s plot an equivalent to Figure 5. We suspect that the measure (the aggregate reduce bandwidth) is the total amount of transmitted data divided by the time: msize*(512-1)/runtime_sec.
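As a quick order-of-magnitude check of this metric, using the median time of the first hierknem run at 3 MB from the summary table above (about 13 GB/s):
echo "3000000 * (512 - 1) / 0.1171314470" | bc -l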
perf.hier %>%
mutate(measure = msize*(512-1)/runtime_sec) %>%
group_by(version, msize, iteration) %>%
summarise(med_meas = median(measure)) %>%
summarise(med_med = median(med_meas)) %>%
ggplot(aes(x = factor(msize), y = med_med, fill = version)) +
geom_bar(stat = "identity", position = "dodge")
We observe that the order of magnitude is similar to Figure 5 (around 100 GB/s). However, there are several issues.
First, hierknem_90 is supposed to be HierKNEM, whereas hierknem_0 is supposed to be Tuned-OMPI. However, they have the same performance. Using --mca coll_base_verbose 99 reveals that the tuned component is used in both cases, and config.log shows that the hierarch component could not be built because configure did not find KNEM (these checks are sketched after the installation attempt below). We tried to install KNEM with the following:
wget http://gforge.inria.fr/frs/download.php/28824/knem-0.9.7.tar.gz
scp knem-0.9.7.tar.gz jupiter_venus:hierknem/
# on Jupiter
cd ~/hierknem
tar -xzf knem-0.9.7.tar.gz
cd knem-0.9.7
./configure --prefix=$HOME/hierknem/ && make all && make install
But the configuration fails because the kernel headers are not present. Making this work is left to future work.
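For reference, the checks mentioned above boil down to something like the following (the grep patterns are indicative and the run is reduced to a minimal local case):
# confirm which coll component is actually selected at runtime
mpirun --mca coll_base_verbose 99 -n 2 mpibenchmark-0.9.4 \
--calls-list=MPI_Reduce --msizes-list=300 -r 1 2>&1 | grep -i coll
# check in the HierKNEM build directory whether configure found KNEM
grep -i knem config.log | head
grep -i hierarch config.log | head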
Also, Open MPI 1.5.4 has unstable and poor performance compared to version 1.10.2. The slowdown with hierarch at 512 kiB (openmpi-1.5.4_90) disappears in the newer version, and the tuned component performs better as well. This may have impacted the XP in the HierKNEM article.
As a non-issue, we remark that the performance with Open MPI 1.10.2 is consistent with our previous studies even though the root changes. It thus seems that the tuned mechanism is not as strongly sensitive to the root as mentioned in the literature (“A novel MPI reduction algorithm resilient to imbalances in process arrival times”).
Overall, MVAPICH2 is the best for medium sizes (16 kiB to 512 kiB), where it outperforms Open MPI 1.10.2. This was expected because the default algorithm choice in Open MPI is sub-optimal. MVAPICH2, however, performs worse for large sizes, which is consistent with the previous measures.
More generally, the software used was different (neither the same versions nor the same methods) and the platform also differed. The discrepancies between our results and Figure 5 may thus be related to these differences.
Even though we failed to run the intended algorithm, the reproduction does not look promising for now. First, the Open MPI baseline is an old version with problematic performance and a poor default algorithm choice. A more stable release (such as version 1.10.2) with a tuned algorithm selection would certainly serve as a more solid baseline. Second, running the code proved difficult because some files had to be edited manually and there are many dependencies (for autogen.sh and KNEM).
The next step is to try to install the required dependencies to complete this study.
## R version 3.3.0 (2016-05-03)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.4 LTS
##
## locale:
## [1] LC_CTYPE=fr_FR.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=fr_FR.UTF-8 LC_COLLATE=fr_FR.UTF-8
## [5] LC_MONETARY=fr_FR.UTF-8 LC_MESSAGES=fr_FR.UTF-8
## [7] LC_PAPER=fr_FR.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggplot2_0.9.3.1 dplyr_0.4.3 tidyr_0.2.0 purrr_0.2.0
## [5] stringr_1.0.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.1 knitr_1.10.5 magrittr_1.5
## [4] MASS_7.3-44 munsell_0.4 colorspace_1.2-2
## [7] R6_2.1.2 plyr_1.8.1 tools_3.3.0
## [10] dichromat_2.0-0 parallel_3.3.0 grid_3.3.0
## [13] gtable_0.1.2 pacman_0.4.1 DBI_0.3.1
## [16] htmltools_0.2.6 lazyeval_0.1.10 yaml_2.1.13
## [19] assertthat_0.1 digest_0.6.9 RColorBrewer_1.0-5
## [22] reshape2_1.2.2 formatR_0.10 codetools_0.2-14
## [25] evaluate_0.7 rmarkdown_0.7 labeling_0.1
## [28] stringi_0.5-5 scales_0.2.3 proto_0.3-10