Getting Started with SHARPv2

Network Plan

Dedicate a server to OpenSM and the SHARP Aggregation Manager (SHARP AM).

Note that sharpd (SHARP daemon) and SHARP AM cannot run on the same compute server.

MPI API calls supported by SHARP 

Note that SHARP is enabled only on MPI_Barrier and MPI_Allreduce for large message sizes.

Therefore, before testing it with an HPC application, profile the application and look for a high percentage of those MPI calls; otherwise, the benefit of SHARP may be negligible.
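
As a rough illustration of that profiling check, the sketch below computes the share of MPI time spent in the two calls SHARP accelerates. The input format (one "call name, seconds" pair per line) and the function name sharp_call_share are assumptions made for this example; real profilers export different layouts, and time share is only one possible metric.

```shell
# Sketch only. Hypothetical input on stdin: "<MPI call name> <seconds>" per line,
# e.g. as exported from an MPI profiler (the format is an assumption).
# Prints the percentage of total time spent in MPI_Allreduce and MPI_Barrier.
sharp_call_share() {
    LC_ALL=C awk '
        { total += $2 }
        $1 == "MPI_Allreduce" || $1 == "MPI_Barrier" { sharp += $2 }
        END {
            if (total > 0) printf "%.1f\n", 100 * sharp / total
            else print "0.0"
        }
    '
}
```

A low percentage suggests that the application is unlikely to benefit much from SHARP.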



Prerequisites 

Hardware needed:

  • Adapters: Mellanox ConnectX-6 adapters

  • Switch: Mellanox HDR Switch



Software and Drivers:



Note: HPC-X packages like openmpi, sharp, ucx, hcoll are also part of MLNX_OFED. However, it is recommended to build HPC-X as a module.



Run module load to use the module, for example:

$ wget http://bgate.mellanox.com/products/hpcx/custom/hpcx-v2.4.0-gcc-MLNX_OFED_LINUX-4.5-1.0.1.0-redhat7.5-x86_64.tbz
$ tar -xjf hpcx-v2.4.0-gcc-MLNX_OFED_LINUX-4.5-1.0.1.0-redhat7.5-x86_64.tbz
$ cd hpcx-v2.4.0-gcc-MLNX_OFED_LINUX-4.5-1.0.1.0-redhat7.5-x86_64
$ module load $PWD/modulefiles/hpcx





1. Verify that the device supports packet-based credits:

$ sudo ibv_devinfo -v -d mlx5_0 | grep EXP_PACKET_BASED_CREDIT_MODE
EXP_PACKET_BASED_CREDIT_MODE



2. If you are using an alpha version of the firmware, set immediate retransmissions on the relevant device:



$ sudo mst start
Starting MST (Mellanox Software Tools) driver set
Loading MST PCI module - Success
$ sudo mst status
...
------------
/dev/mst/mt4123_pciconf0 - PCI configuration cycles access.
                           domain:bus:dev.fn=0000:03:00.0 addr.reg=88 data.reg=92
                           Chip revision is: 00
/dev/mst/mt4123_pciconf1 - PCI configuration cycles access.
                           domain:bus:dev.fn=0000:82:00.0 addr.reg=88 data.reg=92
                           Chip revision is: 00
...
...
$ sudo mcra /dev/mst/mt4123_pciconf0 0x58318.1:1 0x1
$ sudo mcra /dev/mst/mt4123_pciconf0 0x58318.0:1 0x1
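
For scripted node checks, the capability test from step 1 can be wrapped in a small helper. This is a minimal sketch; the function name has_packet_based_credits is made up for this example, and it only looks for the flag name shown above in whatever text it receives on stdin.

```shell
# Sketch only. Reads `ibv_devinfo -v` output on stdin and succeeds (exit 0)
# if the packet-based credit capability flag appears anywhere in it.
has_packet_based_credits() {
    grep -q 'EXP_PACKET_BASED_CREDIT_MODE'
}
```

On a node it could be used as: sudo ibv_devinfo -v -d mlx5_0 | has_packet_based_credits && echo "supported"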



Set up the cluster and OpenSM

1. Make sure that your cluster meets the minimum requirements; see the deployment guide.



2. Log in to the head node and load the relevant modules.



3. Make sure that both directories (HPCX_SHARP_DIR, OMPI_HOME) are available on the launching node (login nodes).



4. Set the parameters sharp_enabled 2 and routing_engine ftree,updn (or another routing algorithm; not all are supported) in the OpenSM configuration file /etc/opensm/opensm.conf (or any other location).





$ cat /etc/opensm/opensm.conf | grep sharp
sharp_enabled 2



5. Set the parameter root_guid_file /etc/opensm/root_guid.cfg in the OpenSM configuration file /etc/opensm/opensm.conf (or any other location).



6. Add the switch GUIDs of all the root switches of your InfiniBand network to /etc/opensm/root_guid.cfg, one GUID per line.



7. Start OpenSM using the prepared opensm.conf file (you can add other flags as well, e.g. priority).



8. Make sure that this is the subnet manager (SM) running on the cluster.



9. Make sure that the aggregation nodes were activated by OpenSM.



Note: OpenSM v5.3 or later should be used.
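
The configuration steps above can be sanity-checked before starting OpenSM. The following is a minimal sketch, not part of OpenSM: the helper names (conf_value, check_opensm_conf, check_root_guids) are made up for this example, and the GUID check assumes each GUID is written as 0x followed by 16 hexadecimal digits.

```shell
# Sketch only. Print the value of parameter $2 from config file $1.
# Commented-out lines do not match; if the key is repeated, the last value is printed.
conf_value() {
    awk -v key="$2" '$1 == key { val = $2 } END { print val }' "$1"
}

# Check that config file $1 contains the SHARP-related settings from steps 4 and 5.
check_opensm_conf() {
    conf_status=0
    if [ "$(conf_value "$1" sharp_enabled)" != "2" ]; then
        echo "sharp_enabled is not set to 2" >&2
        conf_status=1
    fi
    if [ -z "$(conf_value "$1" routing_engine)" ]; then
        echo "routing_engine is not set" >&2
        conf_status=1
    fi
    if [ -z "$(conf_value "$1" root_guid_file)" ]; then
        echo "root_guid_file is not set" >&2
        conf_status=1
    fi
    return "$conf_status"
}

# Succeed only if file $1 has at least one GUID and every non-empty line is a
# single token of the form 0x + 16 hex digits (assumed format).
check_root_guids() {
    awk '
        NF == 0 { next }
        NF == 1 && $1 ~ /^0x[0-9a-fA-F]+$/ && length($1) == 18 { n++; next }
        { bad = 1 }
        END { exit (bad || n == 0) ? 1 : 0 }
    ' "$1"
}
```

For example: check_opensm_conf /etc/opensm/opensm.conf && check_root_guids /etc/opensm/root_guid.cfg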



Enable SHARP Daemons 

1. Set up the SHARP Aggregation Manager (sharp_am) on the OpenSM node.

To enable SHARPv2, set control_path_version to 2.

SHARP tree trimming (trimming_mode) should be disabled as well.



The parameters can also be updated in the configuration file located in:



Another option is to run SHARP AM as a service rather than from the command line:



3. Set up sharpd on all cluster nodes (using pdsh or any other method).



4. Start the SHARP daemon on all cluster nodes



Verification

1. Run ibdiagnet --sharp and check the aggregation nodes



SHARP Benchmark





 Example:



HCOLL/SHARP: 

1. HCOLL is enabled by default, and must be enabled to use SHARP. If you are not sure, add -mca coll_hcoll_enable 1 to your mpirun command.

2. SHARP admin mode: to enable SHARP, add -x HCOLL_ENABLE_SHARP=1 to your mpirun script. This enables SHARP, assuming HCOLL is enabled as well. If errors occur, SHARP is not used, but the job still runs without it (fallback is available).

To force SHARP, use -x HCOLL_ENABLE_SHARP=2. In this case, if SHARP errors occur, the job fails, so you can be sure that SHARP is enabled for your job. This setting forces SHARP on the default quota.

With -x HCOLL_ENABLE_SHARP=3, SHARP is forced on all communicators.

3. Enable the SHARP streaming aggregation tree (SAT) with -x SHARP_COLL_ENABLE_SAT=1 (a SHARPv2 feature).

4. Set the minimum message size for SHARP streaming aggregation; for example, for 1024 bytes set -x SHARP_COLL_SAT_THRESHOLD=1024.

5. Optionally set the payload quota per OST with -x SHARP_COLL_JOB_QUOTA_PAYLOAD_PER_OST=1024 (the default is 128). This applies to messages smaller than SHARP_COLL_SAT_THRESHOLD (1024, for example).

6. For PPN=1, also add the following:

-x HCOLL_BCOL_P2P_LARGE_ALLREDUCE_ALG=4 - make HCOLL use the SHARPv2 algorithm for allreduce (several algorithms are available).

-x HCOLL_ALLREDUCE_ZCOPY_TUNE=static - use the static algorithm instead of dynamic tuning at run time.

-x HCOLL_SBGP=p2p -x HCOLL_BCOL=ucx_p2p - disable the HCOLL hierarchy: all ranks form a single point-to-point group, with no shared-memory awareness.

7. For multi-PPN runs, set HCOLL_ALLREDUCE_HYBRID_LB=1. This enables in-node reduction for SAT. Note that this flag may harm bandwidth performance.

8. If the profile shows a message size of, say, 500KB, you can split the message into two messages using HCOLL_HYBRID_AR_LB_FRAG_THRESH=262144 (the default is 1MB). This allows some overlap between the ranks.
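
The arithmetic behind step 8 can be sketched as a ceiling division. This assumes the message is cut into pieces of at most HCOLL_HYBRID_AR_LB_FRAG_THRESH bytes, which is an interpretation of the text above rather than a documented HCOLL guarantee; the function name fragment_count is made up for this example.

```shell
# Sketch only. Number of fragments for a message of $1 bytes with a fragment
# threshold of $2 bytes, assuming pieces of at most the threshold size.
fragment_count() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

fragment_count 512000 262144    # 500KB message, threshold 262144
fragment_count 512000 1048576   # same message, default threshold of 1MB
```

With the 262144-byte threshold the 500KB message is split in two, while with the 1MB default it stays in one piece.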



HCOLL_ALLREDUCE_HYBRID_LB



Example:



Running allreduce on two nodes without SHARPv2 (HCOLL only) on messages of 4096 bytes and above.

Note:

-x HCOLL_ENABLE_SHARP=0 : Disable SHARP.

-x HCOLL_BCOL_P2P_LARGE_ALLREDUCE_ALG=2 : SHARP cannot be used; use the HCOLL algorithm for the allreduce.





Now enable SHARPv2 with streaming aggregation:

Note:

-x HCOLL_ENABLE_SHARP=3 : Force SHARP.

-x HCOLL_BCOL_P2P_LARGE_ALLREDUCE_ALG=4 : SHARPv2 can be used.
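
To keep the two runs comparable, the SHARP-related options from the notes above can be generated by one helper. This is a minimal sketch: the function name sharp_mpirun_flags and its mode names are made up for this example, and it emits only options that appear in this article. The rest of the mpirun command line (hosts, process counts, benchmark binary) is left to the reader.

```shell
# Sketch only. Print the mpirun -x options for the two runs compared here.
#   $1: "hcoll" (SHARP disabled) or "sharp" (SHARPv2 with streaming aggregation)
#   $2: SAT threshold in bytes (used only in "sharp" mode)
sharp_mpirun_flags() {
    case "$1" in
        hcoll)
            echo "-x HCOLL_ENABLE_SHARP=0 -x HCOLL_BCOL_P2P_LARGE_ALLREDUCE_ALG=2"
            ;;
        sharp)
            echo "-x HCOLL_ENABLE_SHARP=3 -x SHARP_COLL_ENABLE_SAT=1 -x SHARP_COLL_SAT_THRESHOLD=$2 -x HCOLL_BCOL_P2P_LARGE_ALLREDUCE_ALG=4"
            ;;
        *)
            echo "unknown mode: $1" >&2
            return 1
            ;;
    esac
}
```

The output can be spliced into an existing command line, for example with $(sharp_mpirun_flags sharp 4096) placed among the other mpirun options.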



Note that two trees were created:

  • LLT tree for messages up to 4096 bytes

  • SAT tree for messages of 4096 bytes and up.
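
The split between the two trees can be pictured as a simple threshold test. The sketch below assumes that a message exactly at the threshold goes to the SAT tree; the text above is ambiguous about that boundary, and the function name sharp_tree_for is made up for this example.

```shell
# Sketch only. Print which tree a message of $1 bytes would use for a
# SAT threshold of $2 bytes (boundary handling is an assumption).
sharp_tree_for() {
    if [ "$1" -ge "$2" ]; then
        echo SAT
    else
        echo LLT
    fi
}

sharp_tree_for 1024 4096
sharp_tree_for 65536 4096
```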









Debugging SHARP 

1. Add log level 3 (information) to get the log for communicator creation.

-x SHARP_COLL_LOG_LEVEL=3



Known issues 

  1. SHARP tree trimming is not supported; set trimming_mode 0.

  2. Switch reboot may be needed for the SHARPv2 alpha version after running jobs.
