In the previous study of the performance of HierKNEM, KNEM was not installed and thus the mechanism was not used. The results suggest that the Open MPI version may strongly affect the conclusions. We will try to run the code on the same hardware: parapluie (Rennes site) and/or StRemi (Reims site). While parapluie has InfiniBand, only 19 of the original 40 nodes remain (the others have been dead for at least two years). StRemi is more recent (2011 instead of 2010) and still has 39 nodes.
Let’s get the same code (locally):
hg clone https://bitbucket.org/tengma/hierknem
tar -czf hierknem.tar.gz hierknem
ssh rennes mkdir -p hierknem
scp hierknem.tar.gz rennes:hierknem/
ssh rennes "cd hierknem && tar -xzf hierknem.tar.gz"
We also need the original version:
scp ~/Research/mpireduce/prog/ompi-tarball/openmpi-1.5.4.tar.bz2 rennes:hierknem/
ssh rennes "cd hierknem && tar -xjf openmpi-1.5.4.tar.bz2"
Let’s complete the files with the official Open MPI source (remotely):
cd ~/hierknem/hierknem
cp ../openmpi-1.5.4/ompi/mca/io/romio/romio/test/Makefile.in \
./ompi/mca/io/romio/romio/test/
cp -r ../openmpi-1.5.4/ompi/contrib/vt/vt/tools/opari/lib/ \
./ompi/contrib/vt/vt/tools/opari/
Now, let’s deploy the big image with everything needed for development. Deployment seems necessary because KNEM is a kernel module that needs root access to be loaded. The documentation is here.
oarsub -I -l nodes=1,walltime=2:00 -t deploy -p "cluster='parapluie'"
kadeploy3 -f $OAR_NODE_FILE -e jessie-x64-big -k
ssh $(head -n 1 ${OAR_NODE_FILE})
Let’s install KNEM. The installation of version 0.9.7 (the one from the article) failed, but the latest one worked fine.
cd ~/hierknem/
wget http://gforge.inria.fr/frs/download.php/34521/knem-1.1.2.tar.gz
tar -xzf knem-1.1.2.tar.gz
cd knem-1.1.2
./configure --prefix=$HOME/hierknem/ && make all && make install
The hierknem code generates warnings that interrupt the compilation because of the -Werror GCC flags, so we deactivate them:
cd ~/hierknem/hierknem/
./autogen.sh
sed -i "s/-Werror-implicit-function-declaration //" configure
sed -i "s/-Werror //" configure
./configure --prefix=$HOME/hierknem/ --with-knem=$HOME/hierknem --disable-vt
make
make install
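To check that KNEM support actually made it into the build, one possibility is to query the sm BTL parameters with ompi_info (this assumes the hierknem tree keeps the stock KNEM-related parameters of the Open MPI 1.5 sm BTL, such as btl_sm_use_knem):
# Any KNEM-related parameter listed here indicates that --with-knem was honored.
~/hierknem/bin/ompi_info --param btl sm | grep -i knem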
Let’s install the benchmark to test whether the hierknem collective module is activated. Before that, we need GSL:
cd ~/hierknem/
wget http://mirror.ibcp.fr/pub/gnu/gsl/gsl-2.1.tar.gz
tar -xzf gsl-2.1.tar.gz
cd gsl-2.1
./configure --prefix=$HOME/hierknem/ && make && make install
PATH=$HOME/hierknem/bin:$PATH
cd ~/hierknem/mpibenchmark-0.9.4-src/
make clean
rm -r CMakeCache.txt CMakeFiles/
cmake .
sed "s/ENABLE_RDTSCP:BOOL=OFF/ENABLE_RDTSCP:BOOL=ON/" -i CMakeCache.txt
sed "s/ENABLE_BENCHMARK_BARRIER:BOOL=OFF/ENABLE_BENCHMARK_BARRIER:BOOL=ON/" -i CMakeCache.txt
cmake .
make 2>&1
mv mpibenchmark ../bin/mpibenchmark-0.9.4
It is actually unnecessary to rely on deployment, because the sudo-g5k tool can be used with a standard submission to load kernel modules. To load KNEM and make the device accessible:
oarsub -I -l nodes=1,walltime=2:00 -p "cluster='parapluie'"
sudo-g5k insmod hierknem/lib/modules/3.16.0-4-amd64/knem.ko
sudo-g5k chown $USER /dev/knem
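To confirm that the module is loaded and that the character device is accessible, two plain Linux checks suffice:
lsmod | grep knem   # the knem module should appear in the list
ls -l /dev/knem     # the device should now be owned by $USER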
Let’s test hierknem activation on a single node, with the KNEM module loaded and unloaded:
PATH=$HOME/hierknem/bin:$PATH
# Directory containing final results
RESULT_DIR=${HOME}/results/ma2012a
mkdir -p ${RESULT_DIR}
# Nodes to use for XP
uniq ${OAR_NODE_FILE} > ${RESULT_DIR}/hostfile
export LD_LIBRARY_PATH=$HOME/hierknem/lib
mpirun --pernode --hostfile ${RESULT_DIR}/hostfile orte-clean
mpirun --mca coll_hierarch_priority 90 -n 16 --npernode 16 \
--hostfile ${RESULT_DIR}/hostfile -x LD_LIBRARY_PATH \
mpibenchmark-0.9.4 --calls-list=MPI_Reduce \
--msizes-list=3000000,30000,300 -r 100 --summary
This results in an error:
mpibenchmark-0.9.4: connect/btl_openib_connect_udcm.c:699: udcm_module_finalize: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) == ((opal_object_t *) (&m->cm_recv_msg_queue))->obj_magic_id' failed.
[parapluie-38:02844] *** Process received signal ***
[parapluie-38:02844] Signal: Aborted (6)
[parapluie-38:02844] Signal code: (-6)
[parapluie-38:02844] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xf8d0) [0x7f17f84c18d0]
Let’s do the exact same configuration for Open MPI 1.5.4 to see whether the error comes from the hierknem code:
cd ~/hierknem/
mkdir ompi-1.5.4
cd ~/hierknem/openmpi-1.5.4
#sed -i "s/-Werror-implicit-function-declaration //" configure
#sed -i "s/-Werror //" configure
./configure --prefix=$HOME/hierknem/ompi-1.5.4 --with-knem=$HOME/hierknem --disable-vt
make
make install
We still need the MPI benchmark and GSL:
cd ~/hierknem/gsl-2.1
./configure --prefix=$HOME/hierknem/ompi-1.5.4 && make && make install
PATH=$HOME/hierknem/ompi-1.5.4/bin:$PATH
cd ~/hierknem/mpibenchmark-0.9.4-src/
make clean
rm -r CMakeCache.txt CMakeFiles/
cmake .
sed "s/ENABLE_RDTSCP:BOOL=OFF/ENABLE_RDTSCP:BOOL=ON/" -i CMakeCache.txt
sed "s/ENABLE_BENCHMARK_BARRIER:BOOL=OFF/ENABLE_BENCHMARK_BARRIER:BOOL=ON/" -i CMakeCache.txt
cmake .
make 2>&1
mv mpibenchmark ../ompi-1.5.4/bin/mpibenchmark-0.9.4
And now, the test:
PATH=$HOME/hierknem/ompi-1.5.4/bin:$PATH
export LD_LIBRARY_PATH=$HOME/hierknem/ompi-1.5.4/lib
RESULT_DIR=${HOME}/results/ma2012a
uniq ${OAR_NODE_FILE} > ${RESULT_DIR}/hostfile
mpirun --pernode --hostfile ${RESULT_DIR}/hostfile orte-clean
mpirun --mca coll_hierarch_priority 90 -n 16 --npernode 16 \
--hostfile ${RESULT_DIR}/hostfile -x LD_LIBRARY_PATH \
mpibenchmark-0.9.4 --calls-list=MPI_Reduce \
--msizes-list=3000000,30000,300 -r 100 --summary
It results in a warning but produces a result:
WARNING: There was an error initializing an OpenFabrics device.
Local host: parapluie-38.rennes.grid5000.fr
Local device: mlx4_0
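A plausible cause of this OpenFabrics warning for non-root users is a locked-memory (memlock) limit that is too low; this is only a hypothesis here, but it is cheap to check on each node:
for host in $(uniq ${OAR_NODE_FILE})
do
# "unlimited" (or a large value) is expected for the openib BTL to register memory
ssh ${host} 'ulimit -l'
done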
This warning and the previous error message disappear when running the code as root (which required a deployment). For hierknem, we obtain:
test msize total_nrep valid_nrep mean_sec median_sec min_sec max_sec
MPI_Reduce 3000000 100 100 0.0343280423 0.0342662370 0.0338874835 0.0361984935
MPI_Reduce 30000 100 100 0.0012567593 0.0012497407 0.0012352648 0.0014875478
MPI_Reduce 300 100 100 0.0000212134 0.0000206711 0.0000194196 0.0000356252
And for Open MPI 1.5.4:
MPI_Reduce 3000000 100 100 0.0135458890 0.0135464109 0.0132605078 0.0138213896
MPI_Reduce 30000 100 100 0.0002002044 0.0001935004 0.0001901948 0.0004190317
MPI_Reduce 300 100 100 0.0000107330 0.0000105117 0.0000087139 0.0000185657
The performance is much lower with hierknem. However, the hierarch module was not selected even with a priority of 90: in both cases, the tuned collective module was used.
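To double-check which collective components are available and which one is actually selected, ompi_info and the coll framework verbosity can help (coll_base_verbose is a standard Open MPI MCA parameter; the exact output format depends on the version):
# List the collective components present in this build (hierarch should appear).
ompi_info | grep coll
# Show the parameters of the hierarch component, if it was built.
ompi_info --param coll hierarch
# Re-running the benchmark with --mca coll_base_verbose 10 traces the selection at MPI_Init.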
Making the hierarch module work is no longer an objective, for two reasons: these results imply that the code base has problems, and even if it did not, the baseline would not provide good evidence of good performance. We will focus on these two points. Let’s see how the performance behaves with the maximum number of available nodes (up to 19). Since there is currently some activity on the cluster, we will settle for 5 nodes.
oarsub -I -l nodes=5,walltime=2:00 -t deploy -p "cluster='parapluie'"
kadeploy3 -f ${OAR_NODE_FILE} -e jessie-x64-big -k
Let’s configure the machines with KNEM:
for host in $(uniq ${OAR_NODE_FILE})
do
ssh root@${host} insmod ~/hierknem/lib/modules/3.16.0-4-amd64/knem.ko
scp .ssh/id_rsa root@${host}:/root/.ssh
done
Let’s store the machine list:
RESULT_DIR=${HOME}/results/ma2012a
uniq ${OAR_NODE_FILE} > ${RESULT_DIR}/hostfile
Let’s launch a simple script:
ssh root@$(head -n 1 ${OAR_NODE_FILE})
HOME=/home/lccanon
RESULT_DIR=${HOME}/results/ma2012a
PATH=$HOME/hierknem/bin:$PATH
export LD_LIBRARY_PATH=$HOME/hierknem/lib
mpirun --pernode --hostfile ${RESULT_DIR}/hostfile orte-clean
mpirun -n 5 --npernode 1 \
--hostfile ${RESULT_DIR}/hostfile -x LD_LIBRARY_PATH \
mpibenchmark-0.9.4 --calls-list=MPI_Reduce \
--msizes-list=3000000,30000,300 -r 100 --summary
This does not work (Open MPI was not designed to be run as root). OK, let’s try to run in user mode but without InfiniBand (which was causing issues). We first need the right to access KNEM:
for host in $(uniq ${OAR_NODE_FILE})
do
ssh root@${host} chown $USER /dev/knem
done
Let’s launch the code:
ssh $(head -n 1 ${OAR_NODE_FILE})
RESULT_DIR=${HOME}/results/ma2012a
PATH=$HOME/hierknem/bin:$PATH
export LD_LIBRARY_PATH=$HOME/hierknem/lib
mpirun --pernode --hostfile ${RESULT_DIR}/hostfile orte-clean
mpirun --mca btl self,sm,tcp -n 5 --npernode 1 \
--hostfile ${RESULT_DIR}/hostfile -x LD_LIBRARY_PATH \
mpibenchmark-0.9.4 --calls-list=MPI_Reduce \
--msizes-list=3000000,30000,300 -r 100 --summary
Actually, this does not run: even a basic mpirun hostname fails. The system-installed mpirun works fine, and the tutorial differs from our setup only in the way Open MPI is compiled (an incorrect libdir flag and a seemingly useless memory-management flag). We can actually solve this issue by using absolute paths for mpirun, since different Open MPI versions were being picked up on the remote nodes. We will therefore try to run the hierknem code as root (for the InfiniBand support) and then version 1.5.4 in the same conditions with 15 nodes.
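A quick sanity check shows which mpirun each node actually resolves (mpirun --version is supported by Open MPI and prints the version on its first line); since a non-interactive SSH session does not necessarily see the same PATH, this also illustrates why absolute paths are safer:
for host in $(uniq ${OAR_NODE_FILE})
do
ssh ${host} 'which mpirun; mpirun --version 2>&1 | head -n 1'
done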
Let’s start by reserving the nodes (15 are available). We need root access to avoid the InfiniBand issues.
oarsub -I -l nodes=15,walltime=2:00 -t deploy -p "cluster='parapluie'"
kadeploy3 -f ${OAR_NODE_FILE} -e jessie-x64-big -k
Let’s configure the machines with KNEM:
for host in $(uniq ${OAR_NODE_FILE})
do
ssh root@${host} insmod ~/hierknem/lib/modules/3.16.0-4-amd64/knem.ko
scp .ssh/id_rsa root@${host}:/root/.ssh
done
Let’s store the machine list:
RESULT_DIR=${HOME}/results/ma2012a
uniq ${OAR_NODE_FILE} > ${RESULT_DIR}/hostfile
Let’s launch a simple script:
ssh root@$(head -n 1 ${OAR_NODE_FILE})
USER_HOME=/home/lccanon
RESULT_DIR=${USER_HOME}/results/ma2012a
PATH=${USER_HOME}/hierknem/bin:$PATH
export LD_LIBRARY_PATH=${USER_HOME}/hierknem/lib
${USER_HOME}/hierknem/bin/mpirun --pernode --hostfile ${RESULT_DIR}/hostfile orte-clean
${USER_HOME}/hierknem/bin/mpirun -n 360 --npernode 24 \
--hostfile ${RESULT_DIR}/hostfile -x LD_LIBRARY_PATH \
mpibenchmark-0.9.4 --calls-list=MPI_Reduce \
--msizes-list=3000000,30000,300 -r 100 --summary
We get a result (finally):
test msize total_nrep valid_nrep mean_sec median_sec min_sec max_sec
MPI_Reduce 3000000 100 100 0.0686075647 0.0656665517 0.0641192496 0.2161087565
MPI_Reduce 30000 100 100 0.0023956711 0.0023696037 0.0022798439 0.0028028270
MPI_Reduce 300 100 100 0.0003717525 0.0002941050 0.0001592109 0.0009123335
Let’s get the results with Open MPI 1.5.4 (with and without hierarch):
PATH=${USER_HOME}/hierknem/ompi-1.5.4/bin:$PATH
export LD_LIBRARY_PATH=${USER_HOME}/hierknem/ompi-1.5.4/lib
${USER_HOME}/hierknem/ompi-1.5.4/bin/mpirun --pernode --hostfile ${RESULT_DIR}/hostfile orte-clean
${USER_HOME}/hierknem/ompi-1.5.4/bin/mpirun -n 360 --npernode 24 \
--hostfile ${RESULT_DIR}/hostfile -x LD_LIBRARY_PATH \
mpibenchmark-0.9.4 --calls-list=MPI_Reduce \
--msizes-list=3000000,30000,300 -r 100 --summary
${USER_HOME}/hierknem/ompi-1.5.4/bin/mpirun --mca coll_hierarch_priority 90 \
-n 360 --npernode 24 \
--hostfile ${RESULT_DIR}/hostfile -x LD_LIBRARY_PATH \
mpibenchmark-0.9.4 --calls-list=MPI_Reduce \
--msizes-list=3000000,30000,300 -r 100 --summary
We get:
test msize total_nrep valid_nrep mean_sec median_sec min_sec max_sec
MPI_Reduce 3000000 100 100 0.0467649261 0.0198663083 0.0194497852 0.6242527330
MPI_Reduce 30000 100 100 0.0009524739 0.0009099533 0.0008457539 0.0011328874
MPI_Reduce 300 100 100 0.0003981949 0.0003463561 0.0001744470 0.0008854852
and:
test msize total_nrep valid_nrep mean_sec median_sec min_sec max_sec
MPI_Reduce 3000000 100 100 0.0816672251 0.0780993096 0.0774702270 0.2245223861
MPI_Reduce 30000 100 100 0.0008852206 0.0008504661 0.0007597622 0.0013312226
MPI_Reduce 300 100 100 0.0003525603 0.0003197963 0.0001519404 0.0008233278
And finally with version 1.10.2 (this time, running as root is an issue, and there is no InfiniBand problem when running as a regular user):
ssh $(head -n 1 ${OAR_NODE_FILE})
RESULT_DIR=${HOME}/results/ma2012a
PATH=${HOME}/openmpi-1.10.2/bin:$PATH
export LD_LIBRARY_PATH=${HOME}/openmpi-1.10.2/lib
${HOME}/openmpi-1.10.2/bin/mpirun --pernode \
--hostfile ${RESULT_DIR}/hostfile orte-clean
${HOME}/openmpi-1.10.2/bin/mpirun -n 360 --npernode 24 \
--hostfile ${RESULT_DIR}/hostfile \
mpibenchmark-0.9.4 --calls-list=MPI_Reduce \
--msizes-list=3000000,30000,300 -r 100 --summary
${HOME}/openmpi-1.10.2/bin/mpirun --allow-run-as-root \
-n 16 --npernode 16 --mca coll_hierarch_priority 90 \
--hostfile ${RESULT_DIR}/hostfile \
mpibenchmark-0.9.4 --calls-list=MPI_Reduce \
--msizes-list=3000000,30000,300 -r 100 --summary
The output is:
test msize total_nrep valid_nrep mean_sec median_sec min_sec max_sec
MPI_Reduce 3000000 100 100 0.0700630455 0.0683079087 0.0676755826 0.1815986922
MPI_Reduce 30000 100 100 0.0017387800 0.0017138167 0.0015250622 0.0020754500
MPI_Reduce 300 100 100 0.0004783298 0.0004696061 0.0003139174 0.0007003843
and:
test msize total_nrep valid_nrep mean_sec median_sec min_sec max_sec
MPI_Reduce 3000000 100 100 0.0137720546 0.0136190100 0.0128717800 0.0152411713
MPI_Reduce 30000 100 100 0.0002522744 0.0002353930 0.0002118504 0.0011307217
MPI_Reduce 300 100 100 0.0000079071 0.0000074117 0.0000067974 0.0000323183
This effort shows the general difficulty of using outdated hardware (only a fraction of nodes remains) and unsupported software (evolving parameter usage).
In addition, the code associated with HierKNEM presents some issues (missing files, error with OpenFabrics, poor performance even with the default components).
Even if the code produced reasonable results, there would still be a problem with the baseline: version 1.5.4 is not an appropriate choice for measuring performance.
As a reminder for future tests: carefully select directories for MPI versions and put helpful libraries (like GSL) in a common directory.
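For instance, a layout along the following lines would have avoided most of the PATH and LD_LIBRARY_PATH confusion (the directory names are only a suggestion):
# One prefix per MPI version, plus a shared prefix for helper libraries (GSL, benchmarks).
MPI_PREFIX=$HOME/opt/openmpi-1.5.4
COMMON_PREFIX=$HOME/opt/common
export PATH=${MPI_PREFIX}/bin:${COMMON_PREFIX}/bin:$PATH
export LD_LIBRARY_PATH=${MPI_PREFIX}/lib:${COMMON_PREFIX}/lib:$LD_LIBRARY_PATH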
## R version 3.3.1 (2016-06-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.4 LTS
##
## locale:
## [1] LC_CTYPE=fr_FR.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=fr_FR.UTF-8 LC_COLLATE=fr_FR.UTF-8
## [5] LC_MONETARY=fr_FR.UTF-8 LC_MESSAGES=fr_FR.UTF-8
## [7] LC_PAPER=fr_FR.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggplot2_0.9.3.1 dplyr_0.4.3 tidyr_0.2.0 purrr_0.2.0
## [5] stringr_1.0.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.1 knitr_1.10.5 magrittr_1.5
## [4] MASS_7.3-44 munsell_0.4 colorspace_1.2-2
## [7] R6_2.1.2 plyr_1.8.1 tools_3.3.1
## [10] dichromat_2.0-0 parallel_3.3.1 grid_3.3.1
## [13] gtable_0.1.2 pacman_0.4.1 DBI_0.3.1
## [16] htmltools_0.2.6 yaml_2.1.13 assertthat_0.1
## [19] digest_0.6.9 RColorBrewer_1.0-5 reshape2_1.2.2
## [22] formatR_0.10 evaluate_0.7 rmarkdown_0.7
## [25] labeling_0.1 stringi_0.5-5 scales_0.2.3
## [28] proto_0.3-10