This is a brief guide to help you get started with SHARP technology.
What is SHARP?
Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) is an in-network technology that improves the performance of MPI operations by offloading collective operations from the CPU to the switch network.
This innovative approach decreases the amount of data traversing the network as aggregation nodes are reached. Implementing collective communication algorithms in the network also has additional benefits, such as freeing up valuable CPU resources for computation rather than using them to process communication.
Network Plan
Dedicate a server to OpenSM and the SHARP Aggregation Manager (SHARP AM).
Note that sharpd (SHARP daemon) and SHARP AM cannot run on the same compute server.
MPI API calls supported by SHARP
Note that SHARP is enabled only for the MPI_Allreduce and MPI_Barrier collectives.
Therefore, before testing it with an HPC application, profile the application and look for a high percentage of those MPI calls; otherwise, the benefit of SHARP may be negligible.
Prerequisites
1. Make sure the latest MLNX_OFED is installed on the cluster.
2. Install the latest HPC-X version as a module (optional, but easier to manage).
Note: HPC-X packages like openmpi, sharp, ucx, hcoll are also part of MLNX_OFED. However, it is recommended to download and build the latest GA versions of HPC-X as a module.
Set up the cluster and OpenSM
...
2. Log in to the head node and load the relevant modules.
Code Block |
---|
$ module load gcc/11 |
$ module load hpcx/2.19 |
3. Make sure that both directories (HPCX_SHARP_DIR, OMPI_HOME) are available
Code Block |
---|
$ echo $HPCX_SHARP_DIR |
/global/software/rocky-9.x86_64/modules/gcc/11/hpcx/2.19/sharp |
$ echo $OMPI_HOME |
/global/software/rocky-9.x86_64/modules/gcc/11/hpcx/2.19/ompi |
4. Set the parameters sharp_enabled 2 and routing_engine ftree,updn (or any other routing algorithm) in the opensm configuration file /etc/opensm/opensm.conf (or any other location)
...
Code Block |
---|
root_guid_file /etc/opensm/root_guid.cfg |
6. Add the switch GUIDs for all the root switches of your InfiniBand network. Put the root GUIDs in /etc/opensm/root_guid.cfg, e.g.,
...
Note: With OpenSM v4.9 or later, fat-tree topologies do not require any special configuration in the Aggregation Manager. For other topologies or older OpenSM versions, refer to the deployment guide.
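The OpenSM settings from the steps above can be checked with a small script before restarting the subnet manager. The following is only a sketch: the function name check_opensm_conf is hypothetical, and it assumes each option sits on its own line as "name value" with no trailing comment. It accepts any routing engine and any root GUID file path, and only insists on sharp_enabled 2.

```shell
#!/bin/sh
# Hypothetical helper (not part of OpenSM or SHARP): report whether an
# OpenSM configuration file carries the SHARP-related options from this guide.
# Usage: check_opensm_conf [path]   (defaults to /etc/opensm/opensm.conf)
check_opensm_conf() (
    conf=${1:-/etc/opensm/opensm.conf}
    status=0
    # Each pattern is "option name" followed by the value this guide expects.
    for pattern in \
        'sharp_enabled[[:space:]]+2' \
        'routing_engine[[:space:]]+[^[:space:]]+' \
        'root_guid_file[[:space:]]+[^[:space:]]+'
    do
        key=${pattern%%\[*}   # option name = text before the first '['
        if grep -Eq "^[[:space:]]*${pattern}[[:space:]]*\$" "$conf"; then
            echo "ok: $key"
        else
            echo "missing: $key"
            status=1
        fi
    done
    # The function body is a subshell, so exit only ends the function.
    exit "$status"
)
```

Calling check_opensm_conf with no argument checks /etc/opensm/opensm.conf; a non-zero return status means at least one option is missing or set to a different value.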
Enable SHARP
...
Daemons
1. Set up the SHARP Aggregation Manager (sharp_am) on the OpenSM node
Code Block |
---|
$ sudo $HPCX_SHARP_DIR/sbin/sharp_daemons_setup.sh -s -d sharp_am |
Copying /global/software/rocky-9.x86_64/modules/gcc/11/hpcx/2.19/sharp/systemd/system/sharp_am.service to /etc/systemd/system/sharp_am.service |
Service sharp_am is installed |
...
2. Start the SHARP AM service and check its status
Code Block |
---|
$ sudo systemctl start sharp_am |
$ systemctl status sharp_am.service |
● sharp_am.service - SHARP Aggregation Manager (sharp_am). Version: 3.7.0 |
Loaded: loaded (/etc/systemd/system/sharp_am.service; enabled; preset: disabled) |
Drop-In: /etc/systemd/system/sharp_am.service.d |
└─Service.conf |
Active: active (running) |
3. Set up sharpd on all cluster nodes (using pdsh or any other method)
Code Block |
---|
$ sudo $HPCX_SHARP_DIR/sbin/sharp_daemons_setup.sh -s -d sharpd |
Copying /global/software/rocky-9.x86_64/modules/gcc/11/hpcx/2.19/sharp/systemd/system/sharpd.service to /etc/systemd/system/sharpd.service |
Service sharpd is installed |
4. Start the SHARP daemon on all cluster nodes
Code Block |
---|
$ sudo service sharpd start |
Redirecting to /bin/systemctl start sharpd.service |
SHARP Parameters
1. hcoll must be enabled to use SHARP. Add -mca coll_hcoll_enable 1 to your mpirun command line.
2. SHARP admin mode: to enable SHARP, add -x HCOLL_ENABLE_SHARP=1 to your mpirun command line. This enables SHARP, assuming hcoll is enabled as well. If errors occur, SHARP is not used, but the job still runs without it.
To force SHARP, use -x HCOLL_ENABLE_SHARP=2. In this case, the job fails if any SHARP error occurs, so you can be sure that SHARP is active in your job.
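The two parameters above combine into three practical modes, which can be captured in a small helper. This is a sketch only: the function name sharp_mpirun_args and the mode numbering for 0 (hcoll without SHARP) are assumptions made for illustration, while modes 1 and 2 map directly to the HCOLL_ENABLE_SHARP values described above.

```shell
#!/bin/sh
# Hypothetical helper: print the mpirun options for a given SHARP mode.
#   0 = hcoll only, no SHARP (illustrative mode, not a SHARP setting)
#   1 = SHARP enabled, job falls back to non-SHARP on errors
#   2 = SHARP forced, job fails on SHARP errors
sharp_mpirun_args() {
    case "$1" in
        0)   echo "-mca coll_hcoll_enable 1" ;;
        1|2) echo "-mca coll_hcoll_enable 1 -x HCOLL_ENABLE_SHARP=$1" ;;
        *)   echo "usage: sharp_mpirun_args 0|1|2" >&2
             return 1 ;;
    esac
}

# Example (./my_app is a placeholder for your application):
#   mpirun -np 32 $(sharp_mpirun_args 2) ./my_app
```

Using mode 2 during bring-up makes a misconfigured SHARP setup fail loudly instead of silently falling back.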
Verification
1. Run ibdiagnet --sharp and check the aggregation nodes
...
Code Block |
---|
$ mpirun -np 32 -npernode 1 -mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 -mca coll_fca_enable 0 -mca coll_hcoll_enable 0 osu_barrier -i 100000 |
# OSU MPI Barrier Latency Test v5.4.1 |
# Avg Latency(us) |
4.17 |
$ mpirun -np 32 -npernode 1 -mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 -mca coll_fca_enable 0 -mca coll_hcoll_enable 1 -x HCOLL_MAIN_IB=mlx5_0:1 osu_barrier -i 100000 |
# OSU MPI Barrier Latency Test v5.4.1 |
# Avg Latency(us) |
3.29 |
$ mpirun -np 32 -npernode 1 -mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 -mca coll_fca_enable 0 -mca coll_hcoll_enable 1 -x HCOLL_MAIN_IB=mlx5_0:1 -x HCOLL_ENABLE_SHARP=1 osu_barrier -i 100000 |
# OSU MPI Barrier Latency Test v5.4.1 |
# Avg Latency(us) |
1.64 |
Note: In the tests above, the first command runs osu_barrier without HCOLL, the second runs it with HCOLL but without SHARP, and the third runs it with both HCOLL and SHARP.
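To compare the three runs, the average latency can be pulled out of each osu_barrier output and turned into a ratio. The sketch below assumes the output layout shown in the sample above (comment lines starting with #, followed by a single latency value); the function names avg_latency and speedup are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers for comparing osu_barrier runs.

# Read osu_barrier output on stdin and print the average latency,
# taken as the first field of the last non-comment, non-empty line.
avg_latency() {
    awk '!/^#/ && NF { value = $1 } END { print value }'
}

# speedup BASE SHARP: print BASE / SHARP with two decimals.
speedup() {
    awk -v base="$1" -v sharp="$2" 'BEGIN { printf "%.2f\n", base / sharp }'
}

# Example (command lines abbreviated, see the runs above):
#   base=$(mpirun ... osu_barrier -i 100000 | avg_latency)
#   sharp=$(mpirun ... -x HCOLL_ENABLE_SHARP=1 osu_barrier -i 100000 | avg_latency)
#   speedup "$base" "$sharp"
```

With the sample figures above, speedup 4.17 1.64 prints 2.54 (SHARP versus no HCOLL) and speedup 3.29 1.64 prints 2.01 (SHARP versus HCOLL alone).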
Advanced Considerations
Multi-Channel
1. When using full PPN on the node (e.g. 32 on dual-CPU Broadwell servers or 40 on Skylake servers), it is recommended to set multi-channel.
In the 3-level hierarchy:
First, ranks are sub-grouped within each socket (e.g. 16 ranks), one group per socket. (-x HCOLL_SBGP=basesmsocket)
Second, the socket leaders form another group at the node level (per NUMA). (-x HCOLL_SBGP=basesmuma)
Third, there is one node leader per node (the leader of the second group), that is, one rank per node. (-x HCOLL_SBGP=p2p)
The collective algorithm takes advantage of this hierarchy.
SBGP: sub-grouping
The default is all 3 levels: -x HCOLL_SBGP=basesmsocket,basesmuma,p2p
In case of intra-socket noise (when using full PPN, for example), it is recommended to use a 2-level hierarchy:
With 2-level sub-grouping, the second level is omitted.
First we will sub-group in the socket (e.g. 16 ranks), one group per socket. (-x HCOLL_SBGP=basesmsocket)
Second, there will be one node leader per node (the leader of the second group) - one rank per node. (-x HCOLL_SBGP=p2p)
SBGP: sub-grouping
-x HCOLL_SBGP=basesmsocket,p2p
2. If HCOLL_SBGP is changed, HCOLL_BCOL (the communication channel) must be aligned with it.
For a two-level hierarchy, use:
-x HCOLL_BCOL=basesmuma,ucx_p2p
For a three-level hierarchy (default), use:
-x HCOLL_BCOL=basesmuma,basesmuma,ucx_p2p
basesmuma - uses shared memory for levels one and two.
ucx_p2p - handles communication outside the node at level three.
To summarize, for multi-channel allreduce with small messages (full PPN), use:
-x HCOLL_BCOL=basesmuma,ucx_p2p -x HCOLL_SBGP=basesmsocket,p2p
Note: This is not relevant if all ranks run within a single socket.
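Because HCOLL_SBGP and HCOLL_BCOL must always be changed together, it can help to generate both from a single choice of hierarchy depth. The following sketch uses a hypothetical function name, hcoll_hierarchy_args, and emits exactly the value pairs listed above for the two-level and three-level cases.

```shell
#!/bin/sh
# Hypothetical helper: print matching HCOLL_SBGP / HCOLL_BCOL options
# for a 2-level or 3-level hierarchy, using the values from this guide.
hcoll_hierarchy_args() {
    case "$1" in
        2) echo "-x HCOLL_SBGP=basesmsocket,p2p -x HCOLL_BCOL=basesmuma,ucx_p2p" ;;
        3) echo "-x HCOLL_SBGP=basesmsocket,basesmuma,p2p -x HCOLL_BCOL=basesmuma,basesmuma,ucx_p2p" ;;
        *) echo "usage: hcoll_hierarchy_args 2|3" >&2
           return 1 ;;
    esac
}

# Example (./my_app is a placeholder for your application):
#   mpirun -np 64 $(hcoll_hierarchy_args 2) ./my_app
```

Keeping the pairs in one place avoids a mismatch in which the number of BCOL entries differs from the number of SBGP levels.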
Fragmentation (allreduce tuning)
To allow larger messages over SHARP use:
-x HCOLL_BCOL_P2P_ALLREDUCE_SHARP_MAX=4096 (default 256)
Fragmentation performance depends on the number of OSTs assigned to the group.
1. SHARP resources for SwitchIB-2
For allreduce messages bigger than 256B (up to 4KB) you can use fragmentation on the message.
This is the fragment size:
-x SHARP_COLL_JOB_QUOTA_PAYLOAD_PER_OST=256 (sharp fragment size)
The default is 128B.
2. OST - outstanding transactions. The default is 16.
One OST is one credit (after sending, the sender waits for completion). For higher performance, you can increase it to 256.
-x SHARP_COLL_JOB_QUOTA_OSTS=256 ( # sharp osts)
3. Number of outstanding communicators that can use SHARP.
The default is 8, which means that by default each group gets 16/8=2 OSTs.
If the number of OSTs is 256, then each group gets 256/8=32 OSTs.
Each group gets #osts = (SHARP_COLL_JOB_QUOTA_OSTS / SHARP_COLL_JOB_QUOTA_MAX_GROUPS).
-x SHARP_COLL_JOB_QUOTA_MAX_GROUPS=8 ( #sharp groups).
For example, for osu_allreduce up to 4KB (256 payload * 16 OSTs per group), use the following:
-x HCOLL_BCOL_P2P_ALLREDUCE_SHARP_MAX=4096
-x SHARP_COLL_JOB_QUOTA_OSTS=128 (per job. We have 128/8=16 OSTs per group)
-x SHARP_COLL_JOB_QUOTA_PAYLOAD_PER_OST=256
Note: It is recommended to experiment with these parameters and compare the results against non-SHARP allreduce tests.
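The arithmetic behind these quotas can be checked before launching a job. The sketch below follows the relationship used in the example above (OSTs per group = job OSTs / max groups, and the largest message that fits = OSTs per group * payload per OST); the function name sharp_max_message is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: given the three SHARP quota values, print the
# OSTs each group receives and the largest message they can carry.
# Usage: sharp_max_message OSTS MAX_GROUPS PAYLOAD_PER_OST
sharp_max_message() (
    osts=$1
    max_groups=$2
    payload=$3
    per_group=$((osts / max_groups))   # integer division, as in 128/8=16
    echo "osts_per_group=$per_group max_bytes=$((per_group * payload))"
)

# Example, matching the osu_allreduce settings above:
#   sharp_max_message 128 8 256
```

With the settings from the example, sharp_max_message 128 8 256 prints osts_per_group=16 max_bytes=4096, which matches HCOLL_BCOL_P2P_ALLREDUCE_SHARP_MAX=4096. With the defaults (16 OSTs, 8 groups, 128B payload) it prints osts_per_group=2 max_bytes=256.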
Debugging SHARP
1. Add log level 3 (information) to get the log for communicator creation.
-x SHARP_COLL_LOG_LEVEL=3
References
- SHARP Documentation
- How to Boost the Performance of Your MPI and PGAS Applications with MVAPICH2 Libraries