The objective is to reproduce the experiments of the paper “Optimizing Blocking and Nonblocking Reduction Operations for Multicore Systems: Hierarchical Design and Implementation” by Venkata et al., published at Cluster 2013. In particular, we focus on the proposed hierarchical implementations of MPI_Reduce and MPI_Allreduce.

The paper introduces a hierarchical mechanism that was implemented in Open MPI. It consists of two frameworks (bcol and sbgp) and the ml component of the coll framework. This is the most modern hierarchical mechanism present in Open MPI.
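
Whether these pieces are available in a given installation can be checked with ompi_info; a minimal sketch (the exact component lists depend on the build):

# List the components of the bcol and sbgp frameworks
ompi_info | grep -E "bcol|sbgp"
# Check that the ml component of the coll framework was built
ompi_info | grep "coll: ml"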

Note that there is a bug in the documentation in file ompi/mca/bcol/base/bcol_base_frame.c at line 190 (basesmuma appears twice).
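
The duplicate can be spotted directly in the source tree (a sketch, to be run from the root of an Open MPI checkout):

# Show the lines mentioning basesmuma, with line numbers
grep -n "basesmuma" ompi/mca/bcol/base/bcol_base_frame.c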

Analysis of experiments to reproduce

The article contains 13 figures that present empirical data (Figures 6 to 18).

We will consider Figures 6, 7, 10 and 11, which are the closest to what we are looking for. Figures 17 and 18 are left for future work.

Settings

Even though the paper claims that the default Open MPI settings are used, we suspect this may only concern the choice of the default algorithm. We will therefore try both the default settings and slightly tuned ones (binding each process to a core and setting the mpi_leave_pinned parameter).
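
Concretely, the two settings correspond to mpirun invocations along these lines (a sketch; ./benchmark is a placeholder, the actual invocations appear in the script below):

# Default settings: no extra MCA parameters
mpirun -n 512 --npernode 16 --hostfile ${RESULT_DIR}/hostfile ./benchmark
# "basic" settings: bind each process to a core and set mpi_leave_pinned
mpirun --mca hwloc_base_binding_policy core --mca mpi_leave_pinned 0 \
    -n 512 --npernode 16 --hostfile ${RESULT_DIR}/hostfile ./benchmark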

The paper does not state which version of Open MPI was used in the experiments. According to the git history, ml was already present in version 1.7 (commit from Aug 16 19:11:35 2012). However, the optimization this paper presents for MPI_Reduce must have been introduced in version 1.7.5 (commit from Jan 22 15:39:19 2014). Additionally, a lot of cleanup was done in version 1.8.2, and the component should not be part of the upcoming version 2 (https://www.open-mpi.org/community/lists/devel/2015/06/17564.php). We will therefore try versions 1.7, 1.7.5, 1.8.2 and 1.10.2.
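
These dates can be checked against the Open MPI git history; a sketch, assuming the component lives under ompi/mca/coll/ml as in the 1.8 series:

# Commits touching the ml component, oldest first
git log --reverse --date=local --format="%ad %h %s" -- ompi/mca/coll/ml | head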

In the figures we want to reproduce, the number of processes varies from 16 to 512. We suspect that the initial 16 processes all run on a single node and that the number of nodes is then doubled, up to 32. The message size is fixed for each figure (8 B, 128 B or 1 MB). We suspect that the size used for Figure 17 is 1 MB, given the times reported and their similarity to the scale of Figure 7.
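
Assuming 16 processes per node, the suspected process counts map to node counts as follows:

# Number of nodes needed for each process count with 16 processes per node
for p in 16 32 64 128 256 512
do
    echo "$p processes -> $(( (p + 15) / 16 )) nodes"
done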

Platform

We exclusively use jupiter, a 32-node cluster with 16 cores per node and an InfiniBand interconnect, which is similar to Smoky, the platform used in the paper. We could extend this study to a Grid5000 cluster to check the stability of the results.

Experiment preparation

Let’s set up the environment:

# Directory containing final results
RESULT_DIR=${PWD}/results/venkata2013a
mkdir -p ${RESULT_DIR}

# Nodes to use for the experiments
> ${RESULT_DIR}/hostfile
# jupiter19 is heterogeneous and thus excluded
# jupiter4 and jupiter33 may have memory problems and are thus excluded
#for i in $(seq 3 18) $(seq 20 35)
for i in $(seq 1 3) $(seq 5 18) $(seq 20 32) $(seq 34 35)
do
    echo jupiter$i >> ${RESULT_DIR}/hostfile
done

# Check nobody is on the cluster
for i in $(seq 0 35); do echo -n $i; ssh jupiter$i w; done | grep -v USER

# Clean environment
mpirun --pernode --hostfile ${RESULT_DIR}/hostfile orte-clean

Let’s see the script that launches one benchmark. For each run, we must specify the operation and the message size. We perform 30 runs of 100 measures each, instead of the 10000 measures used in the article, and we select the median of each run instead of the total completion time. Version 1.7 requires special handling for the library path and the ml parameters. The MPI benchmark code was used with a high-resolution timer and the provided barrier implementation to avoid interference with Open MPI.
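
Stripped of the bookkeeping, a single measurement boils down to an invocation like the following one (a sketch assembled from the parameters of the script; a coll_ml_priority of 90 presumably gives ml precedence over the default algorithms, while 0 keeps the defaults):

timeout 100 mpirun --mca bcol basesmuma,ptpcoll --mca coll_ml_priority 90 \
    -n 512 --npernode 16 --hostfile ${RESULT_DIR}/hostfile \
    mpibenchmark-0.9.4 --calls-list=MPI_Reduce --msizes-list=128 \
    -r 100 --shuffle-jobs

The full script wraps this invocation in loops over versions, settings, process counts, algorithms and repetitions: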

function variate_parameter {
  OPERATION=$1
  SIZE=$2
  VERSIONS=$3
  
  SETTINGS="none basic"
  
  TIMEOUT=100
  REPETITION=100
  REPEAT=30
  NB_PROCESS="16 32 64 128 256 512"
  RESULT_DIR=${PWD}/results/venkata2013a/
  FINAL_DIR=${RESULT_DIR}/${OPERATION}/${SIZE}
  mkdir -p ${FINAL_DIR}
  
  for VERSION_NB in ${VERSIONS}
  do
    # Set Open MPI version
    VERSION=openmpi-${VERSION_NB}
    cd ~ && rm bin && ln -s ompi/${VERSION}/bin bin
    if [ "${VERSION_NB}" = "1.7" ]
    then
      # Required for versions 1.4 to 1.7.3
      export LD_LIBRARY_PATH=$PWD/ompi/${VERSION}/lib
      MCA="-x LD_LIBRARY_PATH --mca coll_ml_enable_fragmentation 1"
    else
      MCA=""
    fi
    # Forcing NFS synchronization
    echo Forcing NFS synchro for Open MPI ${VERSION_NB}
    for i in $(seq 0 35); do ssh jupiter$i ls -l; done > /dev/null
    
    ompi_info -c > ${FINAL_DIR}/info_${VERSION_NB}
    mpirun --version > ${FINAL_DIR}/version_${VERSION_NB}
    # Perform measures
    for SETTING in ${SETTINGS}
    do
      if [ "${SETTING}" = "basic" ]
      then
        MCA="${MCA} --mca hwloc_base_binding_policy core"
        MCA="${MCA} --mca mpi_leave_pinned 0"
      fi
      for PROCESS in ${NB_PROCESS}
      do
        # coll_ml_priority 0 keeps the default algorithms; 90 gives ml precedence
        for ALGO in 0 90
        do
          echo -n "Version ${VERSION}, setting ${SETTING}, algorithm ${ALGO}"
          echo " and ${PROCESS} processes"
          for REP in $(seq 1 ${REPEAT})
          do
            echo "  Iteration ${REP} on ${REPEAT}"
            PARAMS="version:${VERSION_NB},setting:${SETTING}"
            PARAMS="${PARAMS},algorithm:${ALGO},iteration:${REP}"
            timeout ${TIMEOUT} mpirun ${MCA} \
              --mca bcol basesmuma,ptpcoll --mca coll_ml_priority ${ALGO} \
              -n ${PROCESS} --npernode 16 --hostfile ${RESULT_DIR}/hostfile \
              mpibenchmark-0.9.4 --calls-list=${OPERATION} --params=${PARAMS} \
              --msizes-list=${SIZE} -r ${REPETITION} --shuffle-jobs 2>&1 > \
                  ${FINAL_DIR}/result_${VERSION_NB}_${SETTING}_${PROCESS}_${ALGO}_${REP}
          done
        done
      done
    done
  done
}

It can be called as follows (here for Figure 10):

variate_parameter MPI_Reduce 128 "1.7 1.7.5 1.8.2 1.10.2"
# On local machine
rsync --recursive jupiter_venus:results/venkata2013a/* results/
sed -i 1,15d results/*/*/*_1.7.5_* # remove deprecated warnings
sed -i "s/=1.7/=1.7.0/" results/*/*/*_1.7_* # simplify parsing

Let’s write the reading script:

read.table.parameter <- function(operation, size) {
  dirname <- paste("results", operation, format(size, scientific = FALSE), sep = "/")
  files <- list.files(dirname, pattern = "^result_")
  read.table.file <- function(filename) {
    # Extract the "#@key=value" metadata lines from the file header
    con <- file(filename, open = "r")
    info <- readLines(con) %>%
      map(~str_match(., "#@(.*)=(.*)")[2:3]) %>%
      discard(~any(is.na(.)))
    close(con)
    # Read the measurement table and attach the metadata as extra columns
    data <- read.table(filename, header = TRUE)
    for (i in info)
      data[i[1]] <- type.convert(i[2])
    data
  }
  map_df(paste(dirname, files, sep = "/"), read.table.file)
}

Reproducing Figure 10

Let’s analyze the data corresponding to Figure 10 (MPI_Reduce, 128 B and a varying number of processes). We tried several versions of Open MPI with different settings.

Let’s get the data from Figure 10. The values were measured with a screen ruler (5% to 10% precision):

figure10 <- data.frame(iteration = 1, nprocs = 16*2 ^ (0:5),
                       fig10_0 = c(5.9e-6, 7e-6, 8.1e-6, 8.5e-6, 32e-6, 96e-6),
                       fig10_90 = c(1.7e-6, 2.4e-6, 5e-6, 4.8e-6, 10.4e-6, 8.7e-6)) %>%
  gather(algorithm, medtime, fig10_0, fig10_90) %>%
  merge(data.frame(version = c("1.7.0", "1.7.5", "1.8.2", "1.10.2"))) %>%
  merge(data.frame(setting = c("none", "basic")))

Let’s plot the median times:

reduce128 <- read.table.parameter("MPI_Reduce", 128)
## Warning in rbind_all(x, .id): Unequal factor levels: coercing to character

## Warning in rbind_all(x, .id): Unequal factor levels: coercing to character

## Warning in rbind_all(x, .id): Unequal factor levels: coercing to character
reduce128$setting <- factor(reduce128$setting, levels = c("none", "basic"))
reduce128$version <- factor(reduce128$version,
                            levels = c("1.7.0", "1.7.5", "1.8.2", "1.10.2"))
reduce128 %>%
  group_by(version, setting, algorithm, iteration, nprocs) %>%
  summarise(medtime = median(runtime_sec)) %>%
  select(nprocs, algorithm, medtime, version, setting) %>%
  rbind(figure10) %>%
  ggplot(aes(x = factor(nprocs), y = medtime, color = factor(algorithm))) +
  facet_grid(version ~ setting) +
  geom_boxplot() +
  stat_summary(fun.y = median, geom = "line", aes(group = algorithm), size = 1.5) +
  coord_cartesian(ylim = extendrange(c(0, 150e-6)))

Algorithm 0 is the default one; algorithm 90 is ml (the values correspond to the coll_ml_priority set above).

The difference between the default algorithm and ml is around 20% (between 0 and 50%). This is nowhere near Figure 10, where the difference reaches up to a factor of 10.

Moreover, the time to reduce 128 B over 512 processes is between 30 and 100 µs for ml and between 50 and 100 µs for the default algorithm (excluding the default settings with version 1.7.0, which are pathological). In Figure 10, it is around 10 µs for ml and around 100 µs for the default. The performance of the default algorithm is thus consistent between our experiments and Figure 10, but the performance of ml is not. Moreover, we suspect that reducing 128 B over 512 processes in 10 µs may be quite challenging: with a latency of 1.3 µs on the interconnect and 32 nodes, it should take at least \(\log_2(32)\times 1.3 = 6.5\) µs.
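
This lower bound (tree depth times link latency) can be recomputed quickly:

# Lower bound: log2(32) tree steps at 1.3 us each
awk 'BEGIN { print log(32)/log(2) * 1.3 " us" }'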

Preliminary measurements had shown a more significant difference between the default Open MPI 1.10.2 and the ml component. However, they were done with 8 B messages, which may explain the discrepancy.

Reproducing Figure 6

Let’s analyze a new experiment:

variate_parameter MPI_Allreduce 8 "1.7 1.7.5 1.8.2 1.10.2"
# On local machine
rsync --recursive jupiter_venus:results/venkata2013a/* results/
sed -i 1,15d results/*/*/*_1.7.5_* # remove deprecated warnings
sed -i "s/=1.7/=1.7.0/" results/*/*/*_1.7_* # simplify parsing

Due to interference, slightly less than the last half of the tests had to be relaunched. In particular, jupiter4 and jupiter33 were excluded due to possible memory problems.

Let’s get the results from Figure 6:

# fig6_0: default algorithm, fig6_90: ml (the two series were swapped in a
# previous revision; the assignment below matches the analysis in the text)
figure6 <- data.frame(iteration = 1, nprocs = 16*2 ^ (0:5),
                       fig6_0 = c(18.5e-6, 304e-6, 807e-6, 1328e-6, 1794e-6, 2375e-6),
                       fig6_90 = c(13.8e-6, 18.5e-6, 23.1e-6, 27.7e-6, 60e-6, 106e-6)) %>%
  gather(algorithm, medtime, fig6_0, fig6_90) %>%
  merge(data.frame(version = c("1.7.0", "1.7.5", "1.8.2", "1.10.2"))) %>%
  merge(data.frame(setting = c("none", "basic")))

Let’s plot the results:

allreduce8 <- read.table.parameter("MPI_Allreduce", 8)
## Warning in rbind_all(x, .id): Unequal factor levels: coercing to character

## Warning in rbind_all(x, .id): Unequal factor levels: coercing to character

## Warning in rbind_all(x, .id): Unequal factor levels: coercing to character
allreduce8$setting <- factor(allreduce8$setting, levels = c("none", "basic"))
allreduce8$version <- factor(allreduce8$version,
                             levels = c("1.7.0", "1.7.5", "1.8.2", "1.10.2"))
allreduce8 %>%
  group_by(version, setting, algorithm, iteration, nprocs) %>%
  summarise(medtime = median(runtime_sec)) %>%
  select(nprocs, algorithm, medtime, version, setting) %>%
  rbind(figure6) %>%
  ggplot(aes(x = factor(nprocs), y = medtime, color = factor(algorithm))) +
  facet_grid(version ~ setting) +
  geom_boxplot() +
  stat_summary(fun.y = median, geom = "line", aes(group = algorithm), size = 1.5) +
  coord_cartesian(ylim = extendrange(c(0, 150e-6)))

Again, the difference is nowhere near the one in Figure 6. With 512 processes, they obtained around 100 µs with ml, which was 23x faster than without it (around 2.4 ms, which is pathological). In our case, ml takes between 30 µs and 125 µs, while the default run takes between 90 µs and 125 µs. We are therefore unable to reproduce the pathological performance of the default algorithm in Figure 6.

Figures 6 and 10 suggest that Allreduce takes 10 times as long as Reduce with 512 processes. Our experiments rather suggest a factor of 2. Again, we fail to reproduce the trend of the considered article.

The results are consistent with preliminary measurements made with version 1.10.2. However, with version 1.7.5, the difference between ml and the default performance was somewhat more significant.

Again, we observe a performance boost with version 1.10.2, but this time it is more significant with ml than with the default algorithm.

So far, we have not seen a significant difference between the two sets of global options (none and basic). For small messages, this is not surprising; we expect a stronger difference with large messages. Also, version 1.7.0 shows pathological performance and does not provide more insight than version 1.7.5, so we remove it from the remaining experiments.

Reproducing Figure 11

Even before starting, the figure seems to show problematic measurements: it is reasonable to suspect that the huge jump in the curve of the reference execution is related to some bug, configuration error or measurement artifact. Let’s check this:

variate_parameter MPI_Reduce 1000000 "1.7.5 1.8.2 1.10.2"
# On local machine
rsync --recursive jupiter_venus:results/venkata2013a/* results/
sed -i 1,15d results/*/*/*_1.7.5_* # remove deprecated warnings
sed -i "s/=1.7/=1.7.0/" results/*/*/*_1.7_* # simplify parsing

Let’s get the results from Figure 11:

figure11 <- data.frame(iteration = 1, nprocs = 16*2 ^ (0:5),
                       fig11_0 = c(6.7e-3, 7.5e-3, 177e-3, 176e-3, 186e-3, 172e-3),
                       fig11_90 = c(6.3e-3, 7.5e-3, 9.2e-3, 10.5e-3, 10.9e-3, 12.2e-3)) %>%
  gather(algorithm, medtime, fig11_0, fig11_90) %>%
  merge(data.frame(version = c("1.7.5", "1.8.2", "1.10.2"))) %>%
  merge(data.frame(setting = c("none", "basic")))

Let’s plot the results:

reduce1M <- read.table.parameter("MPI_Reduce", 1000000)
## Warning in rbind_all(x, .id): Unequal factor levels: coercing to character

## Warning in rbind_all(x, .id): Unequal factor levels: coercing to character

## Warning in rbind_all(x, .id): Unequal factor levels: coercing to character
reduce1M$setting <- factor(reduce1M$setting, levels = c("none", "basic"))
reduce1M$version <- factor(reduce1M$version,
                           levels = c("1.7.5", "1.8.2", "1.10.2"))
reduce1M %>%
  group_by(version, setting, algorithm, iteration, nprocs) %>%
  summarise(medtime = median(runtime_sec)) %>%
  select(nprocs, algorithm, medtime, version, setting) %>%
  rbind(figure11) %>%
  ggplot(aes(x = factor(nprocs), y = medtime, color = factor(algorithm))) +
  facet_grid(version ~ setting) +
  geom_boxplot() +
  stat_summary(fun.y = median, geom = "line", aes(group = algorithm), size = 1.5) +
  coord_cartesian(ylim = extendrange(c(0, 0.03)))

This is consistent with our preliminary tests (we observe the same artifact/jump when the number of nodes reaches 32). The default performance has remained stable across the three versions. However, the ml performance has significantly decreased after 1.7.5, which is precisely the version where it performed better for Allreduce in the previous test. In Figure 11, ml takes around 10 ms with 512 processes, which is similar to what we obtain with the default algorithm, whereas their default is 13x slower than their ml. In our experiments, ml is more than twice as slow as the default.

Reproducing Figure 7

Let’s consider the last figure:

variate_parameter MPI_Allreduce 1000000 "1.7.5 1.8.2 1.10.2"
# On local machine
rsync --recursive jupiter_venus:results/venkata2013a/* results/
sed -i 1,15d results/*/*/*_1.7.5_* # remove deprecated warnings
sed -i "s/=1.7/=1.7.0/" results/*/*/*_1.7_* # simplify parsing

Let’s get the results from Figure 7:

figure7 <- data.frame(iteration = 1, nprocs = 16*2 ^ (0:5),
                       fig7_0 = c(12e-3, 14.8e-3, 15.2e-3, 15.2e-3, 14.3e-3, 247e-3),
                       fig7_90 = c(14.8e-3, 10.2e-3, 12e-3, 15.7e-3, 17.1e-3, 43e-3)) %>%
  gather(algorithm, medtime, fig7_0, fig7_90) %>%
  merge(data.frame(version = c("1.7.5", "1.8.2", "1.10.2"))) %>%
  merge(data.frame(setting = c("none", "basic")))

Let’s plot the results:

allreduce1M <- read.table.parameter("MPI_Allreduce", 1000000)
## Warning in rbind_all(x, .id): Unequal factor levels: coercing to character

## Warning in rbind_all(x, .id): Unequal factor levels: coercing to character

## Warning in rbind_all(x, .id): Unequal factor levels: coercing to character
allreduce1M$setting <- factor(allreduce1M$setting, levels = c("none", "basic"))
allreduce1M$version <- factor(allreduce1M$version,
                              levels = c("1.7.5", "1.8.2", "1.10.2"))
allreduce1M %>%
  group_by(version, setting, algorithm, iteration, nprocs) %>%
  summarise(medtime = median(runtime_sec)) %>%
  select(nprocs, algorithm, medtime, version, setting) %>%
  rbind(figure7) %>%
  ggplot(aes(x = factor(nprocs), y = medtime, color = factor(algorithm))) +
  facet_grid(version ~ setting) +
  geom_boxplot() +
  stat_summary(fun.y = median, geom = "line", aes(group = algorithm), size = 1.5) +
  coord_cartesian(ylim = extendrange(c(0, 0.03)))

Again, we do not observe the same behavior as in Figure 7. The default Open MPI is clearly superior, especially with the basic settings. With version 1.7.5, ml performs like the default, which may be because it was not yet used for Allreduce (as suggested by the previous analysis of Figure 6).

In terms of absolute values, the measurements in Figure 7 for up to 16 nodes are not far from ours. However, the ratio between the default algorithm and ml is not consistent with our experiments.

The data of Figure 7 looks truly pathological, whereas our experiments yield much smoother results.

Conclusion

Due to the lack of details in the experiment description, we tested several settings: four versions of Open MPI (1.7, 1.7.5, 1.8.2 and 1.10.2), two sets of global options (none and basic, the latter adding core binding and the mpi_leave_pinned parameter), and the default algorithm against ml (coll_ml_priority 0 and 90).

Overall, we compared our measurements with Figure 6 (MPI_Allreduce, 8 B), Figure 7 (MPI_Allreduce, 1 MB), Figure 10 (MPI_Reduce, 128 B) and Figure 11 (MPI_Reduce, 1 MB).

What is missing: Figures 17 and 18, which are left for future work, the remaining figures of the article, and a validation on another platform such as a Grid5000 cluster.

We made several observations: the gap between the default algorithm and ml is much smaller in our measurements than in the article; the pathological default performance visible in Figures 6, 7 and 11 could not be reproduced; the ml performance for large messages degraded after version 1.7.5; and the global options had little impact for the small messages we tested.

We can first conclude that, given a reasonable effort, the results presented by Venkata et al. were only marginally reproduced, and there are serious doubts about the possibility of reproducing them at all.

The second conclusion is that even though ml provides some advantages for small messages, its performance degrades for large messages. This may be expected, since a pipelined approach (even a non-hierarchical one) can be expected to perform well for such messages.

## R version 3.2.4 Revised (2016-03-16 r70336)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.4 LTS
## 
## locale:
##  [1] LC_CTYPE=fr_FR.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=fr_FR.UTF-8        LC_COLLATE=fr_FR.UTF-8    
##  [5] LC_MONETARY=fr_FR.UTF-8    LC_MESSAGES=fr_FR.UTF-8   
##  [7] LC_PAPER=fr_FR.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggplot2_1.0.1 dplyr_0.4.3   tidyr_0.2.0   purrr_0.2.0   stringr_1.0.0
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.1      knitr_1.12.3     magrittr_1.5     MASS_7.3-44     
##  [5] munsell_0.4.2    colorspace_1.2-6 R6_2.1.0         plyr_1.8.3      
##  [9] tools_3.2.4      parallel_3.2.4   grid_3.2.4       gtable_0.1.2    
## [13] pacman_0.4.1     DBI_0.3.1        htmltools_0.2.6  lazyeval_0.1.10 
## [17] yaml_2.1.13      assertthat_0.1   digest_0.6.8     reshape2_1.4.1  
## [21] formatR_1.2      evaluate_0.8.3   rmarkdown_0.7    stringi_0.5-5   
## [25] scales_0.2.5     proto_0.3-10