The number of parameters in Open MPI is huge: there are 721 of them in version 1.10.2, even though 74 of them only control verbosity levels and some are not even used.
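
For reference, these counts were obtained with:

# Total number of MCA parameters
ompi_info --all | grep "^[ ]*MCA [^:]*:" -c
# Verbosity-related parameters
ompi_info --all | grep _verbose -c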

Let’s study the impact of a few selected parameters independently, since exploring the full Cartesian product is intractable (especially given that some parameters are numerical). Again, we keep using the Jupiter Infiniband cluster with 32 nodes and 16 processes per node.

Parameter selection

We select parameters with three different approaches:

Classic parameters

Preliminary experiments and the FAQ on tuning suggested that the two most important parameters are:

  • hwloc_base_binding_policy
  • mpi_leave_pinned

For the first one, older versions used different parameters: mpi_paffinity_alone and orte_process_binding (set to core). Binding seemed to stabilize the performance for small messages, while the second parameter seemed to improve the performance for large messages.
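
For illustration, this is how these options would be passed directly to mpirun (a sketch; ./app is a placeholder application and the legacy option names are the ones mentioned above):

# Open MPI 1.10.x: bind each process to a core
mpirun --mca hwloc_base_binding_policy core ./app
# Legacy equivalents in older releases
mpirun --mca mpi_paffinity_alone 1 ./app
mpirun --mca orte_process_binding core ./app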

Infiniband tuning parameters

There are 81 parameters for openib (ompi_info --all | grep -o "parameter \"btl_openib" -c). We will focus on the following (their default values can be inspected as sketched after the list):

  • btl_openib_eager_limit and btl_openib_rndv_eager_limit
  • btl_openib_receive_queues
  • btl_openib_use_eager_rdma
  • btl_openib_rdma_pipeline_send_length
  • btl_openib_flags
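
Their default values can be inspected beforehand with ompi_info (the grep pattern below is only an illustration):

ompi_info --all | grep -E "btl_openib_(eager_limit|rndv_eager_limit|receive_queues|use_eager_rdma|rdma_pipeline_send_length|flags)"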

Shared-memory tuning parameters

There are 16 parameters for sm and 15 for vader (a concurrent shared-memory component). We will focus on the following (the counts can be checked as sketched after the list):

  • mpool_sm_min_size
  • btl_sm_eager_limit and btl_sm_rndv_eager_limit
  • btl_sm_max_send_size
  • btl_sm_free_list_num
  • btl_sm_num_fifos
  • btl_sm_fifo_size
  • and the same for vader
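
These counts can be reproduced with the same pattern as for openib (a sketch; the trailing underscore avoids matching other component names):

ompi_info --all | grep -o "parameter \"btl_sm_" -c
ompi_info --all | grep -o "parameter \"btl_vader_" -c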

Network model

Other network stacks could be used, such as the “cm” and “yalla” components, which rely on the MXM transport. MXM is however not built on this cluster, so we discard these parameters.
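
One way to verify that no MXM-based component was built (an illustrative check; an empty output means none is available):

ompi_info | grep -iE "mxm|yalla"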

Study preparation

We will consider each parameter in turn, in the order above. Whenever a setting is significantly better, we will keep it for the following runs.

Let’s start with some environment setting:

# Check nobody is on the cluster
for i in $(seq 0 35); do echo -n $i; ssh jupiter$i w; done | grep -v USER

# Set Open MPI version
VERSION=openmpi-1.10.2
cd ~ && rm bin && ln -s ompi/${VERSION}/bin bin
echo Forcing NFS synchro
for i in $(seq 0 35); do ssh jupiter$i ls -l; done > /dev/null

# Directory containing final results
RESULT_DIR=$PWD/parameters
mkdir -p ${RESULT_DIR}

# Nodes to use for the experiments
> ${RESULT_DIR}/hostfile
for i in $(seq 3 18) $(seq 20 35)
do
    echo jupiter$i >> ${RESULT_DIR}/hostfile
done

# Clean environment
mpirun --pernode --hostfile ${RESULT_DIR}/hostfile orte-clean

# General variables
TIMEOUT=100
REPETITION=30
REPEAT=30
SIZES=1,3,10,30,100,300,1000,3000,10000,30000,100000,300000,1000000,3000000

Let’s write a function to try Open MPI with different values for a given parameter (we use MPI Benchmark 0.6.0):

function variate_parameter {
    PARAMETER=$1
    VALUES=$2
    SETTINGS=${@:3}
    mkdir -p ${RESULT_DIR}/${PARAMETER}
    ompi_info -c > ${RESULT_DIR}/${PARAMETER}/info
    mpirun --version > ${RESULT_DIR}/${PARAMETER}/version
    for i in $(seq 1 ${REPEAT})
    do
        echo Iteration ${i} on ${REPEAT} with ${REPETITION} measures per size
        for VALUE in $(shuf -e -- ${VALUES})
        do
            echo Launch benchmark for parameter ${PARAMETER} with value ${VALUE}
            timeout ${TIMEOUT} mpirun --mca ${PARAMETER} ${VALUE} ${SETTINGS} \
                    -n 512 --npernode 16 --hostfile ${RESULT_DIR}/hostfile \
                    mpibenchmark --calls-list=MPI_Reduce -r ${REPETITION} \
                        --msizes-list=${SIZES} 2>&1 > \
                            ${RESULT_DIR}/${PARAMETER}/result_${VALUE}_${i}
        done
    done
}

Let’s write the script to retrieve the data and process it:

function get_results {
    RESULT_DIR=results
    mkdir -p ${RESULT_DIR}
    echo Retrieving data
    rsync --recursive jupiter:parameters/* ${RESULT_DIR}/
    for PARAMETER in $(cd ${RESULT_DIR} && ls -d */ | sed "s/\///")
    do
        echo Summarizing ${PARAMETER}
        > ${RESULT_DIR}/${PARAMETER}/summary.txt
        for file in $(ls ${RESULT_DIR}/${PARAMETER}/result_*)
        do
            ITER=$(echo ${file} | sed "s/.*_[^_]*_\([^_]*\)$/\1/")
            VALUE=$(echo ${file} | sed "s/.*_\([^_]*\)_[^_]*$/\1/")
            awk -v param="${PARAMETER}" -v val="${VALUE}" -v iter="${ITER}" \
                '$1 ~ /MPI_Reduce/ \
                    { print param "," iter ",\"" val "\"," $3 "," $4; }' \
                    ${file} >> ${RESULT_DIR}/${PARAMETER}/summary.txt
        done
    done
}

Parameter tuning

We are ready to investigate each parameter, one after the other.

Classic parameters

Let’s start with hwloc_base_binding_policy. We try three common policies and add the default value (which should correspond to socket in our settings):

variate_parameter hwloc_base_binding_policy "core socket none \"\""
get_results # on local machine

Let’s analyze the result:

read.table.parameter <- function(parameter) {
  filename <- paste("results", parameter, "summary.txt", sep = "/")
  data <- read.table(filename, sep = ",")
  names(data) <- c("parameter", "iteration", "value", "size", "time")
  data
}

binding <- read.table.parameter("hwloc_base_binding_policy")
levs <- c("none", "", "socket", "core")
binding$value <- factor(binding$value, levels = levs)

Let’s see how the performance behaves from iteration to iteration:

ggplot(binding, aes(x = factor(size), y = time, color = factor(iteration))) +
  facet_wrap(~value) +
  geom_boxplot(outlier.size = 0.5) +
  scale_y_log10() +
  annotation_logticks(sides = "l")

We see that the core binding policy is more stable, in particular for small messages. Conversely, the none policy is the least stable. This was expected because processes may jump from core to core.

Observations on problematic performance:

  • some of the measures for sizes >= 300 kB are more than 10x higher than the median
  • for the core policy and a size of 3 MB, there are many more outliers (this may be related to a bug in MPI Benchmark 0.6.0)
  • the performance is suspiciously similar between 300 kB and 1 MB

Let’s focus on the median values for each iteration:

ggplot.median <- function(data) {
  group_by(data, parameter, iteration, value, size) %>%
    summarise(median = median(time)) %>%
    ggplot(aes(x = factor(size), y = median, color = factor(value))) +
      geom_boxplot() +
      scale_y_log10() +
      annotation_logticks(sides = "l")
}

ggplot.median(binding)

The performance of the core policy is better for large messages and more stable for small messages. The default setting is indeed socket.

Now let’s consider mpi_leave_pinned. There are three possible values: 1 activates it, 0 deactivates it, and -1 lets the system choose.

variate_parameter mpi_leave_pinned "-1 0 1" \
        "--mca hwloc_base_binding_policy core"
get_results # on local machine

Let’s plot the results (only the median over each set of 30 repetitions):

pinned <- read.table.parameter("mpi_leave_pinned")
ggplot.median(pinned)

  • the default setting is to activate the option
  • there is a huge performance improvement for sizes >= 1 MB when it is deactivated, which accentuates the performance inflection at 300 kB

We will thus deactivate mpi_leave_pinned in future tests.

Infiniband tuning parameters

Let’s start with the easiest parameter. By default, btl_openib_use_eager_rdma is set to -1, which means using the device default for eager RDMA messages.

variate_parameter btl_openib_use_eager_rdma "-1 0 1" \
        "--mca hwloc_base_binding_policy core" \
        "--mca mpi_leave_pinned 0"
get_results # on local machine

Let’s plot the medians:

rdma <- read.table.parameter("btl_openib_use_eager_rdma")
ggplot.median(rdma)

The default value provides the best performance; RDMA should not be deactivated for eager messages.

Let’s continue with btl_openib_eager_limit, the maximum size of short (eager) messages, which is set to 12288 by default. The parameters btl_openib_rndv_eager_limit and btl_openib_memalign_threshold default to this value.

variate_parameter btl_openib_eager_limit \
        "1024 2048 4096 8192 12288 16384 73728" \
        "--mca hwloc_base_binding_policy core" \
        "--mca mpi_leave_pinned 0"
get_results # on local machine

It seems there is an upper bound for this parameter because 73728 resulted in warnings. It must actually be lower than the maximum size of a “phase 2” fragment (the default value of btl_openib_max_send_size is 65536). Let’s plot the medians:

ibeager <- read.table.parameter("btl_openib_eager_limit")
ggplot.median(ibeager)

The performance degrades significantly with 1024, 16384 and 73728. It is quite stable for the other values, with better performance for messages <= 1 kB when the value is low. We will thus set this value to 8192. Reducing it to 2048 would be possible, but the gain would be small and 8192 is more conservative. As we saw, at least three other parameters depend on this one, so a conservative choice is relevant as the interactions may be complex.

Let’s continue with btl_openib_rdma_pipeline_send_length, the length of the phase-2 fragments for large messages. The default value is 1048576.

variate_parameter btl_openib_rdma_pipeline_send_length \
        "262144 524288 1048576 2097152 4194304" \
        "--mca hwloc_base_binding_policy core" \
        "--mca mpi_leave_pinned 0" \
        "--mca btl_openib_eager_limit 8192"
get_results # on local machine

Let’s plot the medians:

ibpipeline <- read.table.parameter("btl_openib_rdma_pipeline_send_length")
ggplot.median(ibpipeline)

This parameter has no significant effect. Moreover, the improvement brought by the eager limit is not noticeable when compared to the previous plot.

The remaining two parameters are more complicated to set. btl_openib_receive_queues specifies the receive queues. We will remove each queue in turn to observe its effect. The default setting is:

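# Assumed reading of each queue specification (based on the openib
# documentation): queue type (P = per-peer, S = shared receive queue),
# buffer size in bytes, number of buffers, and flow-control settings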
Q1=P,128,256,192,128
Q2=S,2048,1024,1008,64
Q3=S,12288,1024,1008,64
Q4=S,65536,1024,1008,64
QUEUES0=${Q1}:${Q2}:${Q3}:${Q4}
QUEUES1=${Q2}:${Q3}:${Q4}
QUEUES2=${Q1}:${Q3}:${Q4}
QUEUES3=${Q1}:${Q2}:${Q4}
QUEUES4=${Q1}:${Q2}:${Q3}
variate_parameter btl_openib_receive_queues \
        "${QUEUES0} ${QUEUES1} ${QUEUES2} ${QUEUES3} ${QUEUES4}" \
        "--mca hwloc_base_binding_policy core" \
        "--mca mpi_leave_pinned 0" \
        "--mca btl_openib_eager_limit 8192"
get_results # on local machine

Open MPI complained when the largest queue (65536) was absent. Let’s plot the medians:

ibqueues <- read.table.parameter("btl_openib_receive_queues")
levels(ibqueues$value) <- levels(ibqueues$value) %>%
  strsplit(":") %>%
  map(~ strsplit(., ",") %>%
        map(~ c(.[1], .[2]) %>%
              paste(collapse = ":")) %>%
        flatten_chr() %>%
        paste(collapse = "\n ")) %>%
  flatten_chr()
ggplot.median(ibqueues)

Clearly, the largest queue is necessary for performance. For message sizes of 300 B and 1 kB, the default performs a bit better. We will thus keep the default value for this parameter.

The last parameter, btl_openib_flags, specifies which protocols can be used. The default value is 310, which allows PUT and GET. Let’s try each semantic: send/receive (1), PUT (2), GET (4), all (7), and the default:
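
These values can be read as bitmasks (an assumption based on the usual BTL flag encoding, where send/receive = 1, PUT = 2, GET = 4, and the remaining default bits sum to 304):

# 305 = 304 + 1          send/receive only
# 306 = 304 + 2          PUT only
# 308 = 304 + 4          GET only
# 310 = 304 + 2 + 4      PUT and GET (default)
# 311 = 304 + 1 + 2 + 4  all semantics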

variate_parameter btl_openib_flags "305 306 308 310 311" \
        "--mca hwloc_base_binding_policy core" \
        "--mca mpi_leave_pinned 0" \
        "--mca btl_openib_eager_limit 8192"
get_results # on local machine

Let’s plot the medians:

ibflags <- read.table.parameter("btl_openib_flags")
ggplot.median(ibflags)

This parameter has no effect.

There are many more Infiniband parameters, but at this point, we are confident that they will not play a significant role, if any (less than 10% difference). Of course, the pathological performance when reducing 10 MB remains a mystery.

Shared-memory tuning parameters

Let’s start by changing the minimum size of the mpool shared-memory file, mpool_sm_min_size. The default is 134217728. If the vader component is used by default (as advertised in the FAQ), then this parameter should have no impact.
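
As a sanity check, the shared-memory BTL components that were actually built can be listed (an illustrative command; both sm and vader should appear):

ompi_info | grep "MCA btl"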

variate_parameter mpool_sm_min_size "67108864 134217728 268435456" \
        "--mca hwloc_base_binding_policy core" \
        "--mca mpi_leave_pinned 0" \
        "--mca btl_openib_eager_limit 8192"
get_results # on local machine

Let’s plot the medians:

mpool <- read.table.parameter("mpool_sm_min_size")
ggplot.median(mpool)

There is no variation. The default shared-memory component should thus be vader.

We will focus on btl_vader_eager_limit, which is similar to the previously analyzed eager limit. It defaults to 4096. Variations in the results will confirm which component is used.

variate_parameter btl_vader_eager_limit "1024 2048 4096 8192 16384" \
        "--mca hwloc_base_binding_policy core" \
        "--mca mpi_leave_pinned 0" \
        "--mca btl_openib_eager_limit 8192"
get_results # on local machine

Let’s plot the medians:

vadereager <- read.table.parameter("btl_vader_eager_limit")
ggplot.median(vadereager)

This confirms that the vader component is indeed used by default. Except for messages >= 1 kB with the value 1024, there is no effect. The default value is thus fine.

Let’s focus on btl_vader_free_list_num, the initial number of shared memory fragments. It defaults to 8.

variate_parameter btl_vader_free_list_num "2 4 8 16 32" \
        "--mca hwloc_base_binding_policy core" \
        "--mca mpi_leave_pinned 0" \
        "--mca btl_openib_eager_limit 8192"
get_results # on local machine

Let’s plot the medians:

vaderfree <- read.table.parameter("btl_vader_free_list_num")
ggplot.median(vaderfree)

It might be better to use a larger number of fragments for message sizes <= 1 kB, but the difference is not significant.

Let’s focus on btl_vader_max_send_size, the size of large messages. It defaults to 32768. It is expected to provide better performance when increased.

variate_parameter btl_vader_max_send_size "8192 16384 32768 65536 131072" \
        "--mca hwloc_base_binding_policy core" \
        "--mca mpi_leave_pinned 0" \
        "--mca btl_openib_eager_limit 8192"
get_results # on local machine

Let’s plot the medians:

vadermax <- read.table.parameter("btl_vader_max_send_size")
ggplot.median(vadermax)

Again, nothing significant.

As a final test, let’s compare vader with the older sm component (the value “^vader” excludes vader, thus selecting sm, and conversely for “^sm”):

variate_parameter btl "^vader ^sm" \
        "--mca hwloc_base_binding_policy core" \
        "--mca mpi_leave_pinned 0" \
        "--mca btl_openib_eager_limit 8192"
get_results # on local machine

Let’s plot the medians:

vadersm <- read.table.parameter("btl")
ggplot.median(vadersm)

We can clearly see that vader provides significantly better performance than sm, except for message sizes between 10 kB and 300 kB. This may be because the reduction algorithm is sub-optimal with these settings, as suggested by the inflection point at 300 kB: the intra-node communications could then be hidden and completely overlapped by the inter-node ones.

We decided to discard btl_sm_fifo_size and btl_sm_num_fifos because these parameters have no equivalent in vader.

Note also that the default value of btl_vader_fbox_threshold can cause a change in performance after a given number of communications: after 16 exchanges with the same peer, an optimized mechanism (the fast box) is set up.
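
If we wanted to check its impact, the same helper as before could be reused (a sketch; the tested values are chosen arbitrarily, 16 being the default):

variate_parameter btl_vader_fbox_threshold "16 64 256" \
        "--mca hwloc_base_binding_policy core" \
        "--mca mpi_leave_pinned 0" \
        "--mca btl_openib_eager_limit 8192"
get_results # on local machine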

Conclusion

We saw that the two most important parameters (by far) are hwloc_base_binding_policy (the core policy brings stability for small messages) and mpi_leave_pinned (disabling it improves performance for large messages). Some other parameters play a role, such as btl_openib_eager_limit and the chosen shared-memory component (vader), but the default values are overall well chosen and we do not expect to gain more than 10% by tuning them.
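
As a side note, the retained settings could be made permanent by writing them to the per-user MCA parameter file (a sketch; this path is the default location read by Open MPI):

# ~/.openmpi/mca-params.conf
hwloc_base_binding_policy = core
mpi_leave_pinned = 0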

By contrast, the algorithm selection must play a significant role: this study suggests that Open MPI switches to a different algorithm for messages larger than 300 kB, and this threshold should probably be lower for better performance. This is the focus of the next study.

For future experiments, all sizes lower than 100 B can be discarded because their performance is always similar. It could be better to run 100 repetitions instead of the current 30, but this is debatable as the current set of studies is not definitive. The new MPI Benchmark version (0.9.3) fixes an issue with large messages (> 2 MB) that caused the system to swap and has features that could simplify the processing by importing the data directly into R.

## R version 3.2.4 (2016-03-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.4 LTS
## 
## locale:
##  [1] LC_CTYPE=fr_FR.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=fr_FR.UTF-8        LC_COLLATE=fr_FR.UTF-8    
##  [5] LC_MONETARY=fr_FR.UTF-8    LC_MESSAGES=fr_FR.UTF-8   
##  [7] LC_PAPER=fr_FR.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggplot2_1.0.1 dplyr_0.4.3   tidyr_0.2.0   purrr_0.2.0  
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.1      knitr_1.10.5     magrittr_1.5     MASS_7.3-44     
##  [5] munsell_0.4.2    colorspace_1.2-6 R6_2.1.0         stringr_1.0.0   
##  [9] plyr_1.8.3       tools_3.2.4      parallel_3.2.4   grid_3.2.4      
## [13] gtable_0.1.2     DBI_0.3.1        htmltools_0.2.6  yaml_2.1.13     
## [17] assertthat_0.1   digest_0.6.8     reshape2_1.4.1   formatR_1.2     
## [21] evaluate_0.7     rmarkdown_0.7    stringi_0.5-5    scales_0.2.5    
## [25] proto_0.3-10