In the previous study of the performance of HierKNEM, KNEM was not installed and thus the mechanism was not used. The results suggest that the Open MPI version may strongly affect the conclusions. We will try to run the code on the same hardware: parapluie (Rennes site) and/or StRemi (Reims site). While parapluie has InfiniBand, only 19 of the original 40 nodes remain (the others have been dead for at least two years). StRemi is more recent (2011 instead of 2010) and still has 39 nodes.

Environment settings

Let’s get the same code (locally):

hg clone https://bitbucket.org/tengma/hierknem
tar -czf hierknem.tar.gz hierknem
ssh rennes mkdir -p hierknem
scp hierknem.tar.gz rennes:hierknem/
ssh rennes "cd hierknem && tar -xzf hierknem.tar.gz"

We also need the original version:

scp ~/Research/mpireduce/prog/ompi-tarball/openmpi-1.5.4.tar.bz2 rennes:hierknem/
ssh rennes "cd hierknem && tar -xjf openmpi-1.5.4.tar.bz2"

Let’s complete the files with the official Open MPI source (remotely):

cd ~/hierknem/hierknem
cp ../openmpi-1.5.4/ompi/mca/io/romio/romio/test/Makefile.in \
  ./ompi/mca/io/romio/romio/test/
cp -r ../openmpi-1.5.4/ompi/contrib/vt/vt/tools/opari/lib/ \
  ./ompi/contrib/vt/vt/tools/opari/

Now, let’s deploy the big image with everything needed for development. Deployment seems necessary because KNEM is a kernel module that needs root access to be loaded. The procedure is described in the Grid’5000 deployment documentation.

oarsub -I -l nodes=1,walltime=2:00 -t deploy -p "cluster='parapluie'"
kadeploy3 -f $OAR_NODE_FILE -e jessie-x64-big -k
ssh $(head -n 1 ${OAR_NODE_FILE})

Let’s install KNEM. The installation of version 0.9.7 (the one from the article) failed, but the latest version worked fine.

cd ~/hierknem/
wget http://gforge.inria.fr/frs/download.php/34521/knem-1.1.2.tar.gz
tar -xzf knem-1.1.2.tar.gz
cd knem-1.1.2
./configure --prefix=$HOME/hierknem/ && make all && make install
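To verify that the module was built for the running kernel, here is a minimal sketch (it assumes the build is done on the deployed node, so that uname -r matches the path used later by insmod):

# The kernel module should land under the prefix, per kernel version.
ls -l $HOME/hierknem/lib/modules/$(uname -r)/knem.ko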

The hierknem code generates warnings that interrupt the compilation because of the -Werror GCC flags (we deactivate them):

cd ~/hierknem/hierknem/
./autogen.sh
sed -i "s/-Werror-implicit-function-declaration //" configure
sed -i "s/-Werror //" configure
./configure --prefix=$HOME/hierknem/ --with-knem=$HOME/hierknem --disable-vt
make
make install

Let’s install the benchmark to test whether the hierknem collective module is activated. Before that, we need GSL:

cd ~/hierknem/
wget http://mirror.ibcp.fr/pub/gnu/gsl/gsl-2.1.tar.gz
tar -xzf gsl-2.1.tar.gz
cd gsl-2.1
./configure --prefix=$HOME/hierknem/ && make && make install
PATH=$HOME/hierknem/bin:$PATH
cd ~/hierknem/mpibenchmark-0.9.4-src/
make clean
rm -r CMakeCache.txt CMakeFiles/
cmake .
sed "s/ENABLE_RDTSCP:BOOL=OFF/ENABLE_RDTSCP:BOOL=ON/" -i CMakeCache.txt
sed "s/ENABLE_BENCHMARK_BARRIER:BOOL=OFF/ENABLE_BENCHMARK_BARRIER:BOOL=ON/" -i CMakeCache.txt
cmake .
make 2>&1
mv mpibenchmark ../bin/mpibenchmark-0.9.4

Environment check

It is actually unnecessary to rely on deployment, because the sudo-g5k tool can be used with a standard submission to load kernel modules. To activate it:

oarsub -I -l nodes=1,walltime=2:00 -p "cluster='parapluie'"
sudo-g5k insmod hierknem/lib/modules/3.16.0-4-amd64/knem.ko
sudo-g5k chown $USER /dev/knem
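As a quick sanity check (assuming the hierknem tree keeps the standard ompi_info tool), we can list the collective components available in this build:

# List the MCA collective components built into the hierknem installation.
$HOME/hierknem/bin/ompi_info | grep -i "MCA coll"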

Let’s test hierknem activation on a single node with the KNEM module loaded and unloaded:

PATH=$HOME/hierknem/bin:$PATH

# Directory containing final results
RESULT_DIR=${HOME}/results/ma2012a
mkdir -p ${RESULT_DIR}

# Nodes to use for the experiment
uniq ${OAR_NODE_FILE} > ${RESULT_DIR}/hostfile

export LD_LIBRARY_PATH=$HOME/hierknem/lib
mpirun --pernode --hostfile ${RESULT_DIR}/hostfile orte-clean
mpirun --mca coll_hierarch_priority 90 -n 16 --npernode 16 \
      --hostfile ${RESULT_DIR}/hostfile -x LD_LIBRARY_PATH \
    mpibenchmark-0.9.4 --calls-list=MPI_Reduce \
      --msizes-list=3000000,30000,300 -r 100 --summary

This results in an error:

mpibenchmark-0.9.4: connect/btl_openib_connect_udcm.c:699: udcm_module_finalize: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) == ((opal_object_t *) (&m->cm_recv_msg_queue))->obj_magic_id' failed.
[parapluie-38:02844] *** Process received signal ***
[parapluie-38:02844] Signal: Aborted (6)
[parapluie-38:02844] Signal code:  (-6)
[parapluie-38:02844] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xf8d0) [0x7f17f84c18d0]

Let’s do the exact same configuration for Open MPI 1.5.4 to see if this is due to the code:

cd ~/hierknem/
mkdir ompi-1.5.4
cd ~/hierknem/openmpi-1.5.4
#sed -i "s/-Werror-implicit-function-declaration //" configure
#sed -i "s/-Werror //" configure
./configure --prefix=$HOME/hierknem/ompi-1.5.4 --with-knem=$HOME/hierknem --disable-vt
make
make install

We still need the MPI benchmark and GSL:

cd ~/hierknem/gsl-2.1
./configure --prefix=$HOME/hierknem/ompi-1.5.4 && make && make install
PATH=$HOME/hierknem/ompi-1.5.4/bin:$PATH
cd ~/hierknem/mpibenchmark-0.9.4-src/
make clean
rm -r CMakeCache.txt CMakeFiles/
cmake .
sed "s/ENABLE_RDTSCP:BOOL=OFF/ENABLE_RDTSCP:BOOL=ON/" -i CMakeCache.txt
sed "s/ENABLE_BENCHMARK_BARRIER:BOOL=OFF/ENABLE_BENCHMARK_BARRIER:BOOL=ON/" -i CMakeCache.txt
cmake .
make 2>&1
mv mpibenchmark ../ompi-1.5.4/bin/mpibenchmark-0.9.4

And now, the test:

PATH=$HOME/hierknem/ompi-1.5.4/bin:$PATH
export LD_LIBRARY_PATH=$HOME/hierknem/ompi-1.5.4/lib
RESULT_DIR=${HOME}/results/ma2012a
uniq ${OAR_NODE_FILE} > ${RESULT_DIR}/hostfile
mpirun --pernode --hostfile ${RESULT_DIR}/hostfile orte-clean
mpirun --mca coll_hierarch_priority 90 -n 16 --npernode 16 \
      --hostfile ${RESULT_DIR}/hostfile -x LD_LIBRARY_PATH \
    mpibenchmark-0.9.4 --calls-list=MPI_Reduce \
      --msizes-list=3000000,30000,300 -r 100 --summary

It results in a warning but produces a result:

WARNING: There was an error initializing an OpenFabrics device.

  Local host:   parapluie-38.rennes.grid5000.fr
  Local device: mlx4_0
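A standard Open MPI workaround for such OpenFabrics initialization problems (at the cost of not using InfiniBand) would be to exclude the openib BTL; a minimal sketch:

# Disable the openib BTL so Open MPI falls back to shared memory and TCP.
mpirun --mca btl ^openib -n 16 --npernode 16 \
      --hostfile ${RESULT_DIR}/hostfile -x LD_LIBRARY_PATH \
    mpibenchmark-0.9.4 --calls-list=MPI_Reduce \
      --msizes-list=3000000,30000,300 -r 100 --summary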

This warning and the previous error message disappear when running the code as root (which requires a deployment). For hierknem, we obtain:

      test      msize total_nrep valid_nrep       mean_sec     median_sec        min_sec        max_sec
MPI_Reduce    3000000        100        100   0.0343280423   0.0342662370   0.0338874835   0.0361984935 
MPI_Reduce      30000        100        100   0.0012567593   0.0012497407   0.0012352648   0.0014875478 
MPI_Reduce        300        100        100   0.0000212134   0.0000206711   0.0000194196   0.0000356252 

And for Open MPI 1.5.4:

MPI_Reduce    3000000        100        100   0.0135458890   0.0135464109   0.0132605078   0.0138213896 
MPI_Reduce      30000        100        100   0.0002002044   0.0001935004   0.0001901948   0.0004190317 
MPI_Reduce        300        100        100   0.0000107330   0.0000105117   0.0000087139   0.0000185657 

The performance is much lower with hierknem. Moreover, the hierarch module was not selected, even with a priority of 90: in both cases, the tuned collective module was used.
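To see which collective component is actually selected, Open MPI verbosity parameters can help; a minimal sketch (assuming the coll framework honors the standard coll_base_verbose parameter), using a tiny run so the log stays readable:

# Print which collective components are queried/selected per communicator.
mpirun --mca coll_base_verbose 10 --mca coll_hierarch_priority 90 \
      -n 16 --npernode 16 \
      --hostfile ${RESULT_DIR}/hostfile -x LD_LIBRARY_PATH \
    mpibenchmark-0.9.4 --calls-list=MPI_Reduce \
      --msizes-list=300 -r 1 --summary 2>&1 | grep -i coll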

Measurements (or debugging the multi-node issue)

Making the hierarch module work is no longer an objective, for two reasons: the code base has problems, and even if it did not, the baseline would not provide good evidence of good performance.

We will focus on these two points. Let’s see how the performance scales with as many nodes as are available (up to the maximum, which is 19). Since there is currently some activity on the cluster, we will settle for 5 nodes.

oarsub -I -l nodes=5,walltime=2:00 -t deploy -p "cluster='parapluie'"
kadeploy3 -f ${OAR_NODE_FILE} -e jessie-x64-big -k

Let’s configure the machines with KNEM:

for host in $(uniq ${OAR_NODE_FILE})
do
  ssh root@${host} insmod ~/hierknem/lib/modules/3.16.0-4-amd64/knem.ko
  scp .ssh/id_rsa root@${host}:/root/.ssh
done
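To double-check that the module is loaded on every node, a minimal sketch:

# Verify that knem appears in the loaded kernel modules on each node.
for host in $(uniq ${OAR_NODE_FILE})
do
  ssh root@${host} "lsmod | grep knem"
done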

Let’s store the machine list:

RESULT_DIR=${HOME}/results/ma2012a
uniq ${OAR_NODE_FILE} > ${RESULT_DIR}/hostfile

Let’s launch a simple script:

ssh root@$(head -n 1 ${OAR_NODE_FILE})
HOME=/home/lccanon
RESULT_DIR=${HOME}/results/ma2012a
PATH=$HOME/hierknem/bin:$PATH
export LD_LIBRARY_PATH=$HOME/hierknem/lib
mpirun --pernode --hostfile ${RESULT_DIR}/hostfile orte-clean
mpirun -n 5 --npernode 1 \
      --hostfile ${RESULT_DIR}/hostfile -x LD_LIBRARY_PATH \
    mpibenchmark-0.9.4 --calls-list=MPI_Reduce \
      --msizes-list=3000000,30000,300 -r 100 --summary

This does not work (Open MPI was not designed to be run as root). OK, let’s try to run in user mode but without InfiniBand (which was causing issues). We first need the rights to access KNEM:

for host in $(uniq ${OAR_NODE_FILE})
do
  ssh root@${host} chown $USER /dev/knem
done

Let’s launch the code:

ssh $(head -n 1 ${OAR_NODE_FILE})
RESULT_DIR=${HOME}/results/ma2012a
PATH=$HOME/hierknem/bin:$PATH
export LD_LIBRARY_PATH=$HOME/hierknem/lib
mpirun --pernode --hostfile ${RESULT_DIR}/hostfile orte-clean
mpirun --mca btl self,sm,tcp -n 5 --npernode 1 \
      --hostfile ${RESULT_DIR}/hostfile -x LD_LIBRARY_PATH \
    mpibenchmark-0.9.4 --calls-list=MPI_Reduce \
      --msizes-list=3000000,30000,300 -r 100 --summary

Actually, this does not run: even a basic mpirun hostname fails. The system-installed mpirun works fine, and our Open MPI build differs from the tutorial’s only by an incorrect libdir flag and a seemingly useless memory-management flag.
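For reference, the failing check versus the working one (a minimal sketch; the path of the system-installed mpirun is an assumption about the deployed image):

# mpirun resolved through PATH (the locally built one): fails even for hostname.
mpirun --pernode --hostfile ${RESULT_DIR}/hostfile hostname
# System-installed mpirun (assumed path): works fine.
/usr/bin/mpirun --pernode --hostfile ${RESULT_DIR}/hostfile hostname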

OK, we can actually solve this issue by using absolute paths for mpirun (different Open MPI versions were being used on the remote nodes). We will therefore try to run the hierknem code as root (for the InfiniBand support), and then version 1.5.4 under the same conditions with 5 nodes.
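A minimal sketch of the mechanism (standard Open MPI behavior: invoking mpirun by its absolute path, or passing --prefix, tells the remote daemons which installation to use):

# Either call mpirun through its absolute path...
${HOME}/hierknem/bin/mpirun --pernode --hostfile ${RESULT_DIR}/hostfile hostname
# ...or pass the installation prefix explicitly.
mpirun --prefix ${HOME}/hierknem --pernode --hostfile ${RESULT_DIR}/hostfile hostname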

Actual measurements

Let’s start by reserving nodes (15 are available). We need root access to avoid InfiniBand issues.

oarsub -I -l nodes=15,walltime=2:00 -t deploy -p "cluster='parapluie'"
kadeploy3 -f ${OAR_NODE_FILE} -e jessie-x64-big -k

Let’s configure the machines with KNEM:

for host in $(uniq ${OAR_NODE_FILE})
do
  ssh root@${host} insmod ~/hierknem/lib/modules/3.16.0-4-amd64/knem.ko
  scp .ssh/id_rsa root@${host}:/root/.ssh
done

Let’s store the machine list:

RESULT_DIR=${HOME}/results/ma2012a
uniq ${OAR_NODE_FILE} > ${RESULT_DIR}/hostfile

Let’s launch the benchmark with the hierknem build:

ssh root@$(head -n 1 ${OAR_NODE_FILE})
USER_HOME=/home/lccanon
RESULT_DIR=${USER_HOME}/results/ma2012a
PATH=${USER_HOME}/hierknem/bin:$PATH
export LD_LIBRARY_PATH=${USER_HOME}/hierknem/lib
${USER_HOME}/hierknem/bin/mpirun --pernode --hostfile ${RESULT_DIR}/hostfile orte-clean
${USER_HOME}/hierknem/bin/mpirun -n 360 --npernode 24 \
      --hostfile ${RESULT_DIR}/hostfile -x LD_LIBRARY_PATH \
    mpibenchmark-0.9.4 --calls-list=MPI_Reduce \
      --msizes-list=3000000,30000,300 -r 100 --summary

We get a result (finally):

      test      msize total_nrep valid_nrep       mean_sec     median_sec        min_sec        max_sec
MPI_Reduce    3000000        100        100   0.0686075647   0.0656665517   0.0641192496   0.2161087565
MPI_Reduce      30000        100        100   0.0023956711   0.0023696037   0.0022798439   0.0028028270
MPI_Reduce        300        100        100   0.0003717525   0.0002941050   0.0001592109   0.0009123335

Let’s get the results with Open MPI 1.5.4 (with and without hierarch):

PATH=${USER_HOME}/hierknem/ompi-1.5.4/bin:$PATH
export LD_LIBRARY_PATH=${USER_HOME}/hierknem/ompi-1.5.4/lib
${USER_HOME}/hierknem/ompi-1.5.4/bin/mpirun --pernode --hostfile ${RESULT_DIR}/hostfile orte-clean
${USER_HOME}/hierknem/ompi-1.5.4/bin/mpirun -n 360 --npernode 24 \
      --hostfile ${RESULT_DIR}/hostfile -x LD_LIBRARY_PATH \
    mpibenchmark-0.9.4 --calls-list=MPI_Reduce \
      --msizes-list=3000000,30000,300 -r 100 --summary
${USER_HOME}/hierknem/ompi-1.5.4/bin/mpirun --mca coll_hierarch_priority 90 \
      -n 360 --npernode 24 \
      --hostfile ${RESULT_DIR}/hostfile -x LD_LIBRARY_PATH \
    mpibenchmark-0.9.4 --calls-list=MPI_Reduce \
      --msizes-list=3000000,30000,300 -r 100 --summary

We get:

      test      msize total_nrep valid_nrep       mean_sec     median_sec        min_sec        max_sec
MPI_Reduce    3000000        100        100   0.0467649261   0.0198663083   0.0194497852   0.6242527330
MPI_Reduce      30000        100        100   0.0009524739   0.0009099533   0.0008457539   0.0011328874
MPI_Reduce        300        100        100   0.0003981949   0.0003463561   0.0001744470   0.0008854852

and:

      test      msize total_nrep valid_nrep       mean_sec     median_sec        min_sec        max_sec
MPI_Reduce    3000000        100        100   0.0816672251   0.0780993096   0.0774702270   0.2245223861
MPI_Reduce      30000        100        100   0.0008852206   0.0008504661   0.0007597622   0.0013312226
MPI_Reduce        300        100        100   0.0003525603   0.0003197963   0.0001519404   0.0008233278

And finally with version 1.10.2 (this time, running as root is an issue, and there is no InfiniBand issue when running as a regular user):

ssh $(head -n 1 ${OAR_NODE_FILE})
RESULT_DIR=${HOME}/results/ma2012a
PATH=${HOME}/openmpi-1.10.2/bin:$PATH
export LD_LIBRARY_PATH=${HOME}/openmpi-1.10.2/lib
${HOME}/openmpi-1.10.2/bin/mpirun --pernode \
    --hostfile ${RESULT_DIR}/hostfile orte-clean
${HOME}/openmpi-1.10.2/bin/mpirun -n 360 --npernode 24 \
      --hostfile ${RESULT_DIR}/hostfile \
    mpibenchmark-0.9.4 --calls-list=MPI_Reduce \
      --msizes-list=3000000,30000,300 -r 100 --summary
${HOME}/openmpi-1.10.2/bin/mpirun --allow-run-as-root \
      -n 16 --npernode 16 --mca coll_hierarch_priority 90 \
      --hostfile ${RESULT_DIR}/hostfile \
    mpibenchmark-0.9.4 --calls-list=MPI_Reduce \
      --msizes-list=3000000,30000,300 -r 100 --summary

The output is:

      test      msize total_nrep valid_nrep       mean_sec     median_sec        min_sec        max_sec 
MPI_Reduce    3000000        100        100   0.0700630455   0.0683079087   0.0676755826   0.1815986922
MPI_Reduce      30000        100        100   0.0017387800   0.0017138167   0.0015250622   0.0020754500
MPI_Reduce        300        100        100   0.0004783298   0.0004696061   0.0003139174   0.0007003843

and:

      test      msize total_nrep valid_nrep       mean_sec     median_sec        min_sec        max_sec
MPI_Reduce    3000000        100        100   0.0137720546   0.0136190100   0.0128717800   0.0152411713
MPI_Reduce      30000        100        100   0.0002522744   0.0002353930   0.0002118504   0.0011307217
MPI_Reduce        300        100        100   0.0000079071   0.0000074117   0.0000067974   0.0000323183

Conclusion

This effort shows the general difficulty of using outdated hardware (only a fraction of the nodes remains) and unsupported software (evolving parameter usage).

In addition, the code associated with HierKNEM presents some issues (missing files, errors with OpenFabrics, poor performance even with the default components).

Even if the code produced reasonable results, there would still be a problem with the baseline: version 1.5.4 is not an appropriate choice for measuring performance.

As a reminder for future tests: carefully select directories for MPI versions and put helpful libraries (like GSL) in a common directory.
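For instance, a possible layout (a sketch; the directory names are purely illustrative):

# One prefix per MPI build, shared helper libraries in a common prefix.
mkdir -p ~/mpi/common ~/mpi/ompi-1.5.4 ~/mpi/hierknem
# GSL:      ./configure --prefix=$HOME/mpi/common
# Open MPI: ./configure --prefix=$HOME/mpi/ompi-1.5.4 ...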

## R version 3.3.1 (2016-06-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.4 LTS
## 
## locale:
##  [1] LC_CTYPE=fr_FR.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=fr_FR.UTF-8        LC_COLLATE=fr_FR.UTF-8    
##  [5] LC_MONETARY=fr_FR.UTF-8    LC_MESSAGES=fr_FR.UTF-8   
##  [7] LC_PAPER=fr_FR.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggplot2_0.9.3.1 dplyr_0.4.3     tidyr_0.2.0     purrr_0.2.0    
## [5] stringr_1.0.0  
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.1        knitr_1.10.5       magrittr_1.5      
##  [4] MASS_7.3-44        munsell_0.4        colorspace_1.2-2  
##  [7] R6_2.1.2           plyr_1.8.1         tools_3.3.1       
## [10] dichromat_2.0-0    parallel_3.3.1     grid_3.3.1        
## [13] gtable_0.1.2       pacman_0.4.1       DBI_0.3.1         
## [16] htmltools_0.2.6    yaml_2.1.13        assertthat_0.1    
## [19] digest_0.6.9       RColorBrewer_1.0-5 reshape2_1.2.2    
## [22] formatR_0.10       evaluate_0.7       rmarkdown_0.7     
## [25] labeling_0.1       stringi_0.5-5      scales_0.2.3      
## [28] proto_0.3-10