HPC-X 2.0 Boosts Performance of Grid Benchmark


This post describes how the performance of the GRID benchmark, built with HPC-X 2.0, was enhanced by enabling UCX.


Overview

What is GRID 

Grid is a C++ library for Lattice Quantum Chromodynamics (Lattice QCD) calculations, developed by Peter Boyle (U. of Edinburgh) et al. It is available on GitHub and is designed to exploit parallelism at all levels:

  • SIMD (vector instructions)
  • OpenMP (shared memory parallelism)
  • MPI (distributed memory parallelism through message passing)

Building the library is straightforward. A set of tests and benchmarks is compiled along with it, one of which exercises the library thoroughly and prints out performance information.

HPC-X

HPC-X is the Mellanox application accelerator software package that includes the communications libraries you need to get the most out of your Mellanox InfiniBand interconnect. The package may be downloaded from the Mellanox website. Building and running applications with Open MPI and the other support libraries provided in HPC-X is easy, as the following write-up shows.

  • UCX introduced two new accelerated transports, dc_x and rc_x, that use enhanced verbs.
  • The best transports for GRID turned out to be rc_x and rc.
  • In the mpirun example below, we enabled hardware tag matching (UCX_RC_VERBS_TM_ENABLE=y), which performs the tag-matching lookup in hardware instead of software.
  • Selecting a pml (point-to-point management layer): the UCX pml and the MXM pml are mutually exclusive, so MXM was disabled.

Setup

  • Thor: 128-node cluster (Intel Xeon E5-2697A v4, Broadwell) with EDR InfiniBand interconnect
  • Helios: cluster of Intel Xeon Gold 6148 (Skylake) nodes with EDR InfiniBand interconnect

Configuration

Obtaining Grid

1. Download Grid from GitHub.  You can either download the zip file or clone the repository directly onto your system.  For the DiRAC benchmark, a slightly different tarball was provided:

wget https://github.com/paboyle/Grid/archive/dirac-ITT-fix1.tar.gz
tar zxvf dirac-ITT-fix1.tar.gz
cd Grid-dirac-ITT-fix1
./bootstrap.sh


2. Read the README.md file to find out about building Grid for your specific system.  Here we show the procedure we followed to build Grid with the Intel C++ compiler and either Open MPI from HPC-X 2.0 or Intel MPI.

The last line in the code block above (./bootstrap.sh) does the initial set-up: it obtains the Eigen package and generates the configure script.

Note: Grid requires two packages to be already installed: GMP (https://gmplib.org/) and MPFR (http://www.mpfr.org/). 
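If GMP and MPFR are installed under non-default prefixes, their locations can be passed to Grid's configure script (used in the build steps below). A minimal sketch, with hypothetical installation prefixes:

# hypothetical prefixes; point these at your GMP and MPFR installations
../configure --with-gmp=/opt/gmp --with-mpfr=/opt/mpfr [other configure options]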

Building GRID

Building GRID with HPC-X 2.0 (accelerated Open MPI)

1. Set up your environment.  On our systems, the sysadmins have helpfully provided a set of software modules that set up various environment variables using the module command:

module purge
module load intel/compiler/2017.5.239 hpcx-2.0/icc-2017
export OMPI_CXX=icpc


2. Create the directory where the library and programs will be built and configure the build for your system:

mkdir build_hpcx
cd build_hpcx
../configure --enable-precision=single --enable-simd=AVX2 --enable-comms=mpi3 --enable-mkl CXX=mpicxx --prefix=/<path-to-installation-directory>/Grid_hpcx20


Note: The specific benchmark for which this effort was carried out required running in single precision; however, double precision tests were also made.

3. Build the library and the benchmarks, run the tests and install the library and the benchmarks:

make 2>&1 | tee make_i17hpcx20.log
make check 2>&1 | tee check_i17hpcx20.log
make install 2>&1 | tee install_i17hpcx20.log

Note: If your build node has processors that differ from those on the compute nodes (e.g., they do not support the SIMD model chosen in the configuration step above), the check and install steps will fail. In that case you may copy the Benchmark_ITT executable from the benchmarks directory (under build_hpcx) to any suitable directory.
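A minimal sketch of that copy step, run from the Grid source directory and assuming the executable was built under build_hpcx/benchmarks, using the installation directory from step 2 as the destination:

# copy the benchmark binary to a directory visible to the compute nodes
cp build_hpcx/benchmarks/Benchmark_ITT /<path-to-installation-directory>/Grid_hpcx20/bin/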

Note: AVX512 and KNL are among the other possible options for --enable-simd.


If the three make steps succeed, you will have bin, lib and include subdirectories under the installation directory you provided in the configure step (--prefix option in step 2 above).


Building GRID with Intel MPI

1. Set up your environment by loading the Intel compiler and Intel MPI modules:

module purge
module load intel/compiler/2017.5.239 intel/impi/5.1.3.258


2. Create the directory where the library and programs will be built and configure the build for your system:

mkdir build_impi
cd build_impi
../configure --enable-precision=double --enable-simd=AVX2 --enable-comms=mpit-auto --enable-mkl CXX=icpc MPICXX=mpiicpc --prefix=/<path-to-installation-directory>/Grid_impit


3. Build the library and the benchmarks, run the tests and install the library and the benchmarks:

make 2>&1 | tee make_i17impit.log
make check 2>&1 | tee check_i17impit.log
make install 2>&1 | tee install_i17impit.log


Note: It may be necessary to add -lrt to the end of the line starting with "LIBS =" in each of the sixteen Makefile files ({,*/,*/*/}Makefile).
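One way to apply that change to all sixteen Makefiles at once, from the build directory (a sketch; verify the result before rebuilding):

# append -lrt to every line that starts with "LIBS =" in Makefiles up to two levels deep
sed -i 's/^LIBS = .*/& -lrt/' {,*/,*/*/}Makefile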


Note: If your build node has processors that differ from those on the compute nodes (e.g., they do not support the SIMD model chosen in the configuration step above), the check and install steps will fail.  In that case you may copy the Benchmark_ITT executable from the benchmarks directory (under build_impi) to any suitable directory.

Note: AVX512 and KNL are among the other possible options for --enable-simd.

Executing the Benchmark

The Benchmark_ITT executable requires several arguments, most notably:

  • --mpi M.N.P.Q
  • --shm size
  • --threads nthreads
  • --shm-hugetlb (for Xeon Phi runs)

where M.N.P.Q is the decomposition of a 4-dimensional grid among MPI ranks (so M*N*P*Q must equal the total number of MPI ranks spawned by the mpirun command); size is the size, in MB, of the shared-memory segments used by the program (1024 worked for runs on up to 4 nodes; 2048 for 8 up to 128 nodes); and nthreads is the number of OpenMP threads per MPI rank, so that M*N*P*Q*nthreads equals the total number of cores used during the run.
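As an illustration (hypothetical numbers, assuming dual-socket nodes with 32 cores each): a 4-node run with 2 ranks per node and 16 OpenMP threads per rank uses 8 ranks in total, so a decomposition such as 1.2.2.2 satisfies M*N*P*Q = 8, and 8 ranks x 16 threads = 128 cores = 4 nodes x 32 cores:

# 8 MPI ranks (2 per node, one per socket) x 16 threads = 128 cores on 4 nodes
mpirun -np 8 --map-by ppr:2:node --bind-to socket \
  ./Benchmark_ITT --mpi 1.2.2.2 --shm 1024 --threads 16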


We ran the programs under the Slurm scheduler.  For the HPC-X 2.0 executable, the mpirun command was as follows:

/usr/bin/time -p mpirun -np $ranks --map-by ppr:$rpn:node --bind-to $object -report-bindings --display-map \
  -mca coll_hcoll_enable 0 -mca mtl ^mxm -x UCX_RC_VERBS_TM_ENABLE=y -mca pml ucx -x UCX_TLS=$transport,self,shm \
  /home/gerardo/Grid_hpcx20/bin/Benchmark_ITT --mpi M.N.P.Q --shm 2048 --threads $nthreads


Where:

$ranks is the total number of MPI ranks,

$rpn is the number of MPI ranks per node (at most equal to the number of cores in a node divided by the number of OpenMP threads),

$object is none if $rpn is equal to 1 and socket otherwise,

$transport is rc or rc_x,

M.N.P.Q is the rank decomposition (M*N*P*Q must equal $ranks), and

$nthreads is the number of OpenMP threads per rank.
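For instance, a hypothetical 8-node run with 2 ranks per node, 16 threads per rank and the rc_x transport would expand to:

/usr/bin/time -p mpirun -np 16 --map-by ppr:2:node --bind-to socket -report-bindings --display-map \
  -mca coll_hcoll_enable 0 -mca mtl ^mxm -x UCX_RC_VERBS_TM_ENABLE=y -mca pml ucx -x UCX_TLS=rc_x,self,shm \
  /home/gerardo/Grid_hpcx20/bin/Benchmark_ITT --mpi 1.2.2.4 --shm 2048 --threads 16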



MPI Application Profiling

In this case, we used IPM 2.06 to profile the application. As the profile shows, the application's communication consists mostly of MPI send/receive (MPI_Sendrecv) operations.
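IPM is typically attached by preloading its profiling library at run time; a minimal sketch, assuming a hypothetical IPM installation path:

# hypothetical IPM location; adjust to your site's installation
mpirun -np $ranks -x LD_PRELOAD=/opt/ipm/lib/libipm.so -x IPM_REPORT=full \
  /home/gerardo/Grid_hpcx20/bin/Benchmark_ITT --mpi M.N.P.Q --shm 2048 --threads $nthreads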







Results 

Output Example 

At the end of the run, Benchmark_ITT writes a memory benchmark results table, a communications benchmark results table (for runs on two or more nodes), and an overall floating-point report.

Here is an example:

=================================================================================================================
 Memory benchmark 
=================================================================================================================
=================================================================================================================
= Benchmarking a*x + y bandwidth
=================================================================================================================
  L     bytes               GB/s        Gflop/s     seconds     GB/s / node
----------------------------------------------------------------------------
  8     50331648.000        4371.720    364.310     1.492       273.232
 12     254803968.000       3818.499    318.208     1.708       238.656
 16     805306368.000       5851.782    487.649     1.115       365.736
 20     1966080000.000      5225.404    435.450     1.248       326.588
 24     4076863488.000      2918.898    243.242     2.235       182.431
 28     7552892928.000      3062.299    255.192     2.129       191.394
 32     12884901888.000     2952.386    246.032     2.208       184.524
 36     20639121408.000     2824.411    235.368     2.309       176.526
 40     31457280000.000     2825.827    235.486     2.304       176.614
 44     46056603648.000     2798.276    233.190     2.321       174.892
 48     65229815808.000     2801.425    233.452     2.328       175.089
==================================================================================================================
 Communications benchmark 
==================================================================================================================
==================================================================================================================
= Benchmarking threaded STENCIL halo exchange in 4 dimensions
==================================================================================================================
 L       Ls     bytes      MB/s uni (err/min/max)               MB/s bidi (err/min/max)
4       8       49152       1472.6  178.3 88.9 2531.4            2945.2  356.6 177.9 5062.9
8       8       393216      6147.1  605.2 381.8 8886.2          12294.2  1210.3 763.6 17772.5
12      8       1327104     8722.7  270.0 1960.3 9909.9         17445.3  539.9 3920.5 19819.8
16      8       3145728     9456.4  77.4 3753.1 10058.3         18912.8  154.8 7506.2 20116.6
20      8       6144000     9475.3  61.0 4976.9 10245.7         18950.5  121.9 9953.8 20491.4
24      8       10616832    9499.0  47.4 4991.3 10102.0         18998.1  94.8 9982.5 20204.1
28      8       16859136    9340.3  45.1 4918.4 9866.6          18680.5  90.3 9836.9 19733.2
32      8       25165824    9320.1  47.7 4917.7 9949.8          18640.2  95.4 9835.4 19899.5
==================================================================================================================
 Per Node Summary table Ls=16
==================================================================================================================
 L               Wilson          DWF4            DWF5 
8                3895.6          33637.9         35835.8
12               13308.6         74021.3         91408.2
16               35034.3         85631.9         114475.0
24               76632.2         98933.4         133285.8
==================================================================================================================
==================================================================================================================
 Comparison point     result: 79826.6 Mflop/s per node
 Comparison point is 0.5*(85631.9+74021.3) 
 Comparison point robustness: 0.819
==================================================================================================================


  • GRID was built and run using the Intel 2017 compilers with both HPC-X 2.0 and Intel MPI. The Intel MPI version was run on both InfiniBand (EDR) and Omni-Path (OPA).
  • The benchmark program prints a "Comparison point result" at the end of the run, given in Mflop/s per node.
  • The bars in the diagrams below were obtained by multiplying the performance per node by the number of nodes and converting to Gflop/s.
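For example, the 79826.6 Mflop/s per node figure in the sample output above would, on a hypothetical 16-node run, be plotted as 79826.6 x 16 / 1000 ≈ 1277 Gflop/s.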


Single vs. Double Precision

Single-precision numbers use 32 bits, while double-precision numbers use 64 bits, so double-precision calculations generate more data transfer for the same test size.

The differences in the results presented below emphasize the need for a fast, accelerated interconnect to achieve the best performance.


Performance 

Double Precision on Thor 


Single Precision on Helios 


Double Precision Performance on Large Scale 

  • Intel Xeon E5-2697A v4 (Broadwell)
  • Interconnect: EDR InfiniBand 


Single Precision Performance on Large Scale 

  • Intel Xeon Gold 6148 (Skylake)
  • Interconnect: EDR InfiniBand