...

Therefore, before testing SHARP with an HPC application, profile the application and check that a high percentage of its time is spent in those MPI calls; otherwise, the benefit of SHARP may be negligible.

Prerequisites 

1. Make sure the latest MLNX_OFED is installed on the cluster.

2. Install the latest HPC-X version as a module (not required, but easier to manage).


Note: HPC-X packages like openmpi, sharp, ucx, hcoll are also part of MLNX_OFED. However, it is recommended to download and build the latest GA versions of HPC-X as a module.


Set up the cluster and OpenSM

...

2. Log in to the head node and load the relevant modules.

Code Block
$ module load gcc/11
$ module load hpcx/2.19


3. Make sure that both directories (HPCX_SHARP_DIR, OMPI_HOME) are available 

Code Block
$ echo $HPCX_SHARP_DIR
/global/software/rocky-9.x86_64/modules/gcc/11/hpcx/2.19/sharp
$ echo $OMPI_HOME
/global/software/rocky-9.x86_64/modules/gcc/11/hpcx/2.19/ompi
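
The check in this step can also be scripted, which may help when it has to be repeated on several nodes. The following is only a minimal sketch: the function name check_dirs is hypothetical and not part of HPC-X. It assumes that "available" means the variable is set and points to an existing directory.

```shell
# Hypothetical helper (not part of HPC-X): succeed only if every named
# environment variable is set and points to an existing directory.
check_dirs() {
    rc=0
    for name in "$@"; do
        # Indirect lookup of the variable whose name is stored in $name.
        eval "dir=\${$name:-}"
        if [ -z "$dir" ]; then
            echo "$name is not set" >&2
            rc=1
        elif [ ! -d "$dir" ]; then
            echo "$name=$dir is not a directory" >&2
            rc=1
        fi
    done
    return $rc
}

# Example usage after loading the modules:
#   check_dirs HPCX_SHARP_DIR OMPI_HOME && echo "environment OK"
```

The function reports every problem it finds before returning, so a single run shows all missing variables at once.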

4. Set the parameters sharp_enabled 2 and routing_engine ftree,updn (or any other routing algorithm) in the opensm configuration file /etc/opensm/opensm.conf (or any other location)
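
After editing the file, it can be useful to confirm that the two parameters actually carry the intended values. The sketch below assumes the usual "name value" layout of opensm.conf with "#" starting a comment line; the helper name opensm_param is hypothetical, and the path is the default one mentioned above.

```shell
# Hypothetical helper: print the value of one parameter from an
# OpenSM-style "name value" configuration file.
# Commented-out lines do not match because their first field starts with "#".
opensm_param() {
    # $1 = configuration file, $2 = parameter name
    awk -v key="$2" '
        $1 == key {
            $1 = ""                # drop the parameter name
            sub(/^[ \t]+/, "")     # trim the separator left behind
            print
            exit                   # first match only
        }' "$1"
}

# Example usage (assumes the default configuration path):
#   conf=/etc/opensm/opensm.conf
#   [ "$(opensm_param "$conf" sharp_enabled)" = "2" ] || echo "sharp_enabled is not 2" >&2
#   opensm_param "$conf" routing_engine
```

If the parameter is absent, the function prints nothing, so an empty result also signals that the setting still needs to be added.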

...

Code Block
$ sudo $HPCX_SHARP_DIR/sbin/sharp_daemons_setup.sh -s -d sharp_am
Copying /global/software/rocky-9.x86_64/modules/gcc/11/hpcx/2.19/sharp/systemd/system/sharp_am.service to /etc/systemd/system/sharp_am.service
Service sharp_am is installed

...

2. Start SHARP AM service 

Code Block
$ sudo systemctl start sharp_am
$ systemctl status sharp_am.service
● sharp_am.service - SHARP Aggregation Manager (sharp_am). Version: 3.7.0
     Loaded: loaded (/etc/systemd/system/sharp_am.service; enabled; preset: disabled)
    Drop-In: /etc/systemd/system/sharp_am.service.d
             └─Service.conf
     Active: active (running)

3. Set up sharpd on all cluster nodes (using pdsh or any other method)

Code Block
$ sudo $HPCX_SHARP_DIR/sbin/sharp_daemons_setup.sh -s -d sharpd
Copying /global/software/centos-7/modules/intel/2018.1.163/hpcx/2.1.0/sharp/systemd/system/sharpd.service to /etc/systemd/system/sharpd.service
Service sharpd is installed

4. Start the SHARP daemon on all cluster nodes

Code Block
$ sudo service sharpd start
Redirecting to /bin/systemctl start sharpd.service


SHARP Parameters 

1. hcoll must be enabled to use SHARP. Add -mca coll_hcoll_enable 1 to your mpirun command line.

...

Code Block
$ mpirun -np 32 -npernode 1 -mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 -mca coll_fca_enable 0 -mca coll_hcoll_enable 0 osu_barrier -i 100000

# OSU MPI Barrier Latency Test v5.4.1
# Avg Latency(us)
             4.17

$ mpirun -np 32 -npernode 1 -mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 -mca coll_fca_enable 0 -mca coll_hcoll_enable 1 -x HCOLL_MAIN_IB=mlx5_0:1 osu_barrier -i 100000

# OSU MPI Barrier Latency Test v5.4.1
# Avg Latency(us)
             3.29

$ mpirun -np 32 -npernode 1 -mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 -mca coll_fca_enable 0 -mca coll_hcoll_enable 1 -x HCOLL_MAIN_IB=mlx5_0:1 -x HCOLL_ENABLE_SHARP=1 osu_barrier -i 100000

# OSU MPI Barrier Latency Test v5.4.1
# Avg Latency(us)
             1.64

Note: In the tests above, the first command runs osu_barrier without HCOLL, the second runs it with HCOLL but without SHARP, and the third runs it with both HCOLL and SHARP.
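
To put the three results in perspective, the latency ratios can be computed directly from the reported averages. This is a small sketch using the numbers above; the helper name speedup is hypothetical, and results on other clusters will differ.

```shell
# Hypothetical helper: ratio of a baseline latency to an improved latency,
# printed with two decimals. LC_ALL=C keeps "." as the decimal separator.
speedup() {
    LC_ALL=C awk -v base="$1" -v new="$2" 'BEGIN { printf "%.2f\n", base / new }'
}

speedup 4.17 3.29   # no HCOLL vs. HCOLL without SHARP
speedup 4.17 1.64   # no HCOLL vs. HCOLL with SHARP
speedup 3.29 1.64   # HCOLL without SHARP vs. HCOLL with SHARP
```

For this run, enabling HCOLL alone gives roughly a 1.27x lower barrier latency, and adding SHARP on top of HCOLL roughly halves it again (about 2.01x), for about 2.54x overall.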

...