The number of parameters in Open MPI is huge. There are 721 of them in version 1.10.2 (ompi_info --all | grep "^[ ]*MCA [^:]*:" -c), even though 74 of them are related to the verbosity level (ompi_info --all | grep _verbose -c) and some are not even used.
Let’s study the impact of some selected parameters independently, because doing a Cartesian product is intractable (especially given that some parameters are numerical). Again, we use the Jupiter Infiniband cluster with 32 nodes and 16 processes per node.
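A rough, hypothetical back-of-the-envelope calculation illustrates why a Cartesian product is out of reach: even restricted to the six openib parameters studied below, with about five candidate values each and 30 repetitions per configuration (both figures are assumptions for the sake of the estimate), the number of benchmark runs explodes.

```shell
# Hypothetical arithmetic (bash): 6 parameters x ~5 values each,
# 30 repetitions per configuration of the full Cartesian product.
CONFIGS=$((5**6))        # 5 values for each of the 6 parameters
RUNS=$((CONFIGS * 30))   # 30 repetitions per configuration
echo "${CONFIGS} configurations, ${RUNS} benchmark runs"
```

Studying one parameter at a time reduces this to a few dozen runs per parameter.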
We select parameters with three different approaches: the ompi_info descriptions, preliminary experiments, and the FAQ on tuning. The latter two suggested that the two most important parameters are hwloc_base_binding_policy and mpi_leave_pinned.
Different variations of the first exist in older versions: mpi_paffinity_alone and orte_process_binding (set to core). Core binding seemed to stabilize the performance for small messages. The second parameter seemed to improve the performance for large messages.
There are 81 parameters for openib (ompi_info --all | grep -o "parameter \"btl_openib" -c). We will focus on the following:
btl_openib_eager_limit and btl_openib_rndv_eager_limit
btl_openib_receive_queues
btl_openib_use_eager_rdma
btl_openib_rdma_pipeline_send_length
btl_openib_flags
Other network models could be used, like “cm” and “yalla”, which rely on the MXM transport. MXM is however not built, so we discard these parameters.
We will consider each parameter in turn, in this order. Any time a setting is significantly better, we will keep it in the following runs.
Let’s start with some environment setting:
# Check nobody is on the cluster
for i in $(seq 0 35); do echo -n $i; ssh jupiter$i w; done | grep -v USER
# Set Open MPI version
VERSION=openmpi-1.10.2
cd ~ && rm bin && ln -s ompi/${VERSION}/bin bin
echo Forcing NFS synchro
for i in $(seq 0 35); do ssh jupiter$i ls -l; done > /dev/null
# Directory containing final results
RESULT_DIR=$PWD/parameters
mkdir -p ${RESULT_DIR}
# Nodes to use for XP
> ${RESULT_DIR}/hostfile
for i in $(seq 3 18) $(seq 20 35)
do
echo jupiter$i >> ${RESULT_DIR}/hostfile
done
# Clean environment
mpirun --pernode --hostfile ${RESULT_DIR}/hostfile orte-clean
# General variables
TIMEOUT=100
REPETITION=30
REPEAT=30
SIZES=1,3,10,30,100,300,1000,3000,10000,30000,100000,300000,1000000,3000000
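As a side note (a sketch, not part of the original workflow), two of the settings above can be sanity-checked programmatically: the hostfile should contain exactly 32 nodes (matching the 512 processes launched later with 16 per node), and the quasi-logarithmic size list (1, 3, 10, 30, …) can be generated instead of written by hand:

```shell
# 32 nodes: jupiter3..18 and jupiter20..35 (16 + 16), as in the loop above.
NODES=$(for i in $(seq 3 18) $(seq 20 35); do echo jupiter$i; done | wc -l)
echo "${NODES} nodes x 16 processes per node = $((NODES * 16)) MPI processes"

# Quasi-logarithmic sizes: 10^e and 3*10^e for e in 0..6 (bash arithmetic).
GENERATED_SIZES=$(for e in $(seq 0 6); do
  echo $((10**e))
  echo $((3 * 10**e))
done | paste -s -d, -)
echo ${GENERATED_SIZES}
```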
Let’s write a function to try Open MPI with different values for a given parameter (we use MPI Benchmark 0.6.0):
function variate_parameter {
PARAMETER=$1
VALUES=$2
SETTINGS=${@:3}
mkdir -p ${RESULT_DIR}/${PARAMETER}
ompi_info -c > ${RESULT_DIR}/${PARAMETER}/info
mpirun --version > ${RESULT_DIR}/${PARAMETER}/version
for i in $(seq 1 ${REPEAT})
do
echo Iteration ${i} on ${REPEAT} with ${REPETITION} measures per size
for VALUE in $(shuf -e -- ${VALUES})
do
echo Launch benchmark for parameter ${PARAMETER} with value ${VALUE}
timeout ${TIMEOUT} mpirun --mca ${PARAMETER} ${VALUE} ${SETTINGS} \
-n 512 --npernode 16 --hostfile ${RESULT_DIR}/hostfile \
mpibenchmark --calls-list=MPI_Reduce -r ${REPETITION} \
--msizes-list=${SIZES} > \
${RESULT_DIR}/${PARAMETER}/result_${VALUE}_${i} 2>&1
done
done
}
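Note the shuf -e call in the inner loop: it randomizes the order in which the values are tested within each iteration, so a slow drift of the platform does not systematically favor one value. A small sketch of its behavior (the output is sorted afterwards to make it deterministic, since shuf permutes its arguments randomly):

```shell
# shuf -e permutes its arguments; sorting shows the set of values is preserved.
VALUES="core socket none"
SHUFFLED=$(shuf -e -- ${VALUES} | sort | paste -s -d' ' -)
echo ${SHUFFLED}
```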
Let’s write the script to retrieve the data and process it:
function get_results {
RESULT_DIR=results
mkdir -p ${RESULT_DIR}
echo Retrieving data
rsync --recursive jupiter:parameters/* ${RESULT_DIR}/
for PARAMETER in $(cd ${RESULT_DIR} && ls -d */ | sed "s/\///")
do
echo Summarizing ${PARAMETER}
> ${RESULT_DIR}/${PARAMETER}/summary.txt
for file in $(ls ${RESULT_DIR}/${PARAMETER}/result_*)
do
ITER=$(echo ${file} | sed "s/.*_[^_]*_\([^_]*\)$/\1/")
VALUE=$(echo ${file} | sed "s/.*_\([^_]*\)_[^_]*$/\1/")
awk -v param="${PARAMETER}" -v val="${VALUE}" -v iter="${ITER}" \
'$1 ~ /MPI_Reduce/ \
{ print param "," iter ",\"" val "\"," $3 "," $4; }' \
${file} >> ${RESULT_DIR}/${PARAMETER}/summary.txt
done
done
}
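The two sed expressions above deserve a quick illustration: they split a file name of the form result_&lt;value&gt;_&lt;iteration&gt; from the right, which is why the value itself must not contain underscores. A sketch on a sample name (the value "core" and iteration 17 are hypothetical):

```shell
# Decompose a sample result file name into its value and iteration parts.
file="result_core_17"
ITER=$(echo ${file} | sed "s/.*_[^_]*_\([^_]*\)$/\1/")    # last field
VALUE=$(echo ${file} | sed "s/.*_\([^_]*\)_[^_]*$/\1/")   # second-to-last field
echo "value=${VALUE} iteration=${ITER}"
```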
We are now ready to investigate each parameter one after the other.
Let’s start with hwloc_base_binding_policy. There are three common levels and we add the default value (which should correspond to socket in our settings):
variate_parameter hwloc_base_binding_policy "core socket none \"\""
get_results # on local machine
Let’s analyze the result:
read.table.parameter <- function(parameter) {
filename <- paste("results", parameter, "summary.txt", sep = "/")
data <- read.table(filename, sep = ",")
names(data) <- c("parameter", "iteration", "value", "size", "time")
data
}
binding <- read.table.parameter("hwloc_base_binding_policy")
levs <- c("none", "", "socket", "core")
binding$value <- factor(binding$value, levels = levs)
Let’s see how the performance behaves from iteration to iteration:
ggplot(binding, aes(x = factor(size), y = time, color = factor(iteration))) +
facet_wrap(~value) +
geom_boxplot(outlier.size = 0.5) +
scale_y_log10() +
annotation_logticks(sides = "l")
We see that the core binding policy is more stable, in particular for small messages. The none policy is the least stable. This was expected because processes may jump from core to core.
Let’s focus on the median values for each iteration:
ggplot.median <- function(data) {
group_by(data, parameter, iteration, value, size) %>%
summarise(median = median(time)) %>%
ggplot(aes(x = factor(size), y = median, color = factor(value))) +
geom_boxplot() +
scale_y_log10() +
annotation_logticks(sides = "l")
}
ggplot.median(binding)
The performance of the core policy is better for large messages and more stable for small messages. The default setting is indeed socket.
Now let’s consider mpi_leave_pinned. There are three values: -1 lets the system choose whether to activate it (1) or not (0).
variate_parameter mpi_leave_pinned "-1 0 1" \
"--mca hwloc_base_binding_policy core"
get_results # on local machine
Let’s plot the results (only the median over each set of 30 repetitions):
pinned <- read.table.parameter("mpi_leave_pinned")
ggplot.median(pinned)
We will thus deactivate mpi_leave_pinned in future tests.
Let’s start with the easiest openib parameter. By default, btl_openib_use_eager_rdma uses the device default for eager messages.
variate_parameter btl_openib_use_eager_rdma "-1 0 1" \
"--mca hwloc_base_binding_policy core" \
"--mca mpi_leave_pinned 0"
get_results # on local machine
Let’s plot the medians:
rdma <- read.table.parameter("btl_openib_use_eager_rdma")
ggplot.median(rdma)
The default value provides the best performance: RDMA should not be deactivated for eager messages.
Let’s continue with the maximum size of short messages, btl_openib_eager_limit, which is set to 12288 by default. The parameters btl_openib_rndv_eager_limit and btl_openib_memalign_threshold default to this parameter’s value.
variate_parameter btl_openib_eager_limit \
"1024 2048 4096 8192 12288 16384 73728" \
"--mca hwloc_base_binding_policy core" \
"--mca mpi_leave_pinned 0"
get_results # on local machine
It seems there is an upper bound for this parameter, because 73728 resulted in warnings. It must actually be lower than the maximum size of a “phase 2” fragment (the btl_openib_max_send_size default value is 65536). Let’s plot the medians:
ibeager <- read.table.parameter("btl_openib_eager_limit")
ggplot.median(ibeager)
The performance degrades significantly with 1024, 16384 and 73728. It is quite stable for the other values, with better performance for messages &lt;= 1 kB when the value is low. We will thus set this value to 8192 by default. Reducing it to 2048 would be possible, but the gain would be small and 8192 is more conservative. As we saw, at least three other parameters depend on this one, so a conservative choice may be relevant as the interactions may be complex.
Let’s continue with the length of the phase-2 fragments for large messages, btl_openib_rdma_pipeline_send_length. The default value is 1048576.
variate_parameter btl_openib_rdma_pipeline_send_length \
"262144 524288 1048576 2097152 4194304" \
"--mca hwloc_base_binding_policy core" \
"--mca mpi_leave_pinned 0" \
"--mca btl_openib_eager_limit 8192"
get_results # on local machine
Let’s plot the medians:
ibpipeline <- read.table.parameter("btl_openib_rdma_pipeline_send_length")
ggplot.median(ibpipeline)
This parameter has no significant effect, and the improvement from the eager limit is not noticeable compared to the previous plot.
The remaining two parameters are more complicated to set. btl_openib_receive_queues specifies the receive queues. We will alternately deactivate each one to see the effect. The default setting is:
Q1=P,128,256,192,128
Q2=S,2048,1024,1008,64
Q3=S,12288,1024,1008,64
Q4=S,65536,1024,1008,64
QUEUES0=${Q1}:${Q2}:${Q3}:${Q4}
QUEUES1=${Q2}:${Q3}:${Q4}
QUEUES2=${Q1}:${Q3}:${Q4}
QUEUES3=${Q1}:${Q2}:${Q4}
QUEUES4=${Q1}:${Q2}:${Q3}
variate_parameter btl_openib_receive_queues \
"${QUEUES0} ${QUEUES1} ${QUEUES2} ${QUEUES3} ${QUEUES4}" \
"--mca hwloc_base_binding_policy core" \
"--mca mpi_leave_pinned 0" \
"--mca btl_openib_eager_limit 8192"
get_results # on local machine
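For reference, the second field of each queue specification appears to be its buffer size (an assumption based on the values above, not confirmed by the text); extracting it from the default setting shows how the queues tile the message sizes, the largest (65536) matching the btl_openib_max_send_size default mentioned earlier:

```shell
# Default queue specification from above; pull out the buffer size
# (second comma-separated field, assumed) of each colon-separated queue.
QUEUES="P,128,256,192,128:S,2048,1024,1008,64:S,12288,1024,1008,64:S,65536,1024,1008,64"
QUEUE_SIZES=$(echo ${QUEUES} | tr ':' '\n' | cut -d, -f2 | paste -s -d' ' -)
echo ${QUEUE_SIZES}
```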
Open MPI complained when the largest queue (65536) was absent. Let’s plot the medians:
ibqueues <- read.table.parameter("btl_openib_receive_queues")
levels(ibqueues$value) <- levels(ibqueues$value) %>%
strsplit(":") %>%
map(~ strsplit(., ",") %>%
map(~ c(.[1], .[2]) %>%
paste(collapse = ":")) %>%
flatten_chr() %>%
paste(collapse = "\n ")) %>%
flatten_chr()
ggplot.median(ibqueues)
Clearly, the largest queue is necessary for performance. For message sizes 300 B and 1 kB, the default performance is a bit better. We will thus keep the default value for this parameter.
The last parameter, btl_openib_flags, specifies the protocols to be used. The default is 310, which allows PUT and GET. Let’s try each semantic: send/receive (1), PUT (2), GET (4), all (7), and the default:
variate_parameter btl_openib_flags "305 306 308 310 311" \
"--mca hwloc_base_binding_policy core" \
"--mca mpi_leave_pinned 0" \
"--mca btl_openib_eager_limit 8192"
get_results # on local machine
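The tested values differ only in their low three bits, which seem to carry the send/receive, PUT, and GET capabilities (assuming bit values SEND=1, PUT=2, GET=4, consistent with the semantics listed above; the high bits, common to all tested values, are left untouched). A sketch of the decoding:

```shell
# Decode the low three bits of each tested flag value (bash arithmetic).
# Assumed bit meanings: 1 = send/receive, 2 = PUT, 4 = GET.
for flags in 305 306 308 310 311; do
  printf "%s: send/recv=%d put=%d get=%d\n" ${flags} \
    $((flags & 1)) $(( (flags & 2) != 0 )) $(( (flags & 4) != 0 ))
done
```

Under this reading, the default 310 indeed enables PUT and GET but not send/receive.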
Let’s plot the medians:
ibflags <- read.table.parameter("btl_openib_flags")
ggplot.median(ibflags)
This parameter has no effect.
There are many more Infiniband parameters, but at this point we are confident that they would not play a significant role, if any (less than 10% difference). Of course, the pathological performance when reducing 10 MB messages remains a mystery.
We saw that the two most important parameters (by far) are hwloc_base_binding_policy (the core policy, for stability with small messages) and mpi_leave_pinned (disabled, for performance with large messages). Some other parameters play a role, like btl_openib_eager_limit and the chosen shared-memory component (vader), but the default values are globally well chosen and we do not expect to gain more than 10% by tuning them.
By contrast, the algorithm selection must play a significant role, because this study suggests that Open MPI uses a different algorithm for messages larger than 300 kB and that this threshold should probably be lower for better performance. This is the focus of the next study.
For future experiments, all sizes lower than 100 B can be discarded because their performance is always similar. It could be better to run 100 repetitions instead of the current 30, but this is debatable as the current set of studies is not definitive. The new MPI Benchmark version (0.9.3) fixes some issues with large messages (&gt; 2 MB) that caused the system to swap, and has features that could simplify the processing by importing the data directly into R.
## R version 3.2.4 (2016-03-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.4 LTS
##
## locale:
## [1] LC_CTYPE=fr_FR.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=fr_FR.UTF-8 LC_COLLATE=fr_FR.UTF-8
## [5] LC_MONETARY=fr_FR.UTF-8 LC_MESSAGES=fr_FR.UTF-8
## [7] LC_PAPER=fr_FR.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggplot2_1.0.1 dplyr_0.4.3 tidyr_0.2.0 purrr_0.2.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.1 knitr_1.10.5 magrittr_1.5 MASS_7.3-44
## [5] munsell_0.4.2 colorspace_1.2-6 R6_2.1.0 stringr_1.0.0
## [9] plyr_1.8.3 tools_3.2.4 parallel_3.2.4 grid_3.2.4
## [13] gtable_0.1.2 DBI_0.3.1 htmltools_0.2.6 yaml_2.1.13
## [17] assertthat_0.1 digest_0.6.8 reshape2_1.4.1 formatR_1.2
## [21] evaluate_0.7 rmarkdown_0.7 stringi_0.5-5 scales_0.2.5
## [25] proto_0.3-10