Getting Started with SHARPv2
Network Plan
Have a dedicated server for OpenSM and the SHARP Aggregation Manager (SHARP AM).
Note that sharpd (SHARP daemon) and SHARP AM cannot run on the same compute server.
MPI API calls supported by SHARP
Note that SHARP is enabled only for MPI_Barrier and MPI_Allreduce (with SHARPv2 adding support for large message sizes).
Therefore, before testing it with an HPC application, profile the application and look for a high percentage of time spent in those MPI calls; otherwise, the benefit from SHARP may be negligible.
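One way to obtain such a profile is to preload an MPI profiling library. This is a minimal sketch assuming the mpiP profiler is installed; the library path, rank count, and application name are placeholders:
$ mpirun -np 64 -x LD_PRELOAD=/opt/mpiP/lib/libmpiP.so ./my_app
The *.mpiP report written at exit breaks down the time per MPI call; look for a large share of MPI_Allreduce and MPI_Barrier.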
Prerequisites
Hardware needed:
Adapters: Mellanox ConnectX-6 adapters
Switch: Mellanox HDR Switch
Software and Drivers:
OS: CentOS 7.5 (x86_64)
Mellanox OFED 4.5-1 (GA)
For GPU support:
Install the MLNX_OFED driver plug-in for GPUDirect RDMA: http://www.mellanox.com/downloads/ofed/nvidia-peer-memory_1.0-7.tar.gz
Install the GDRCopy library: https://github.com/NVIDIA/gdrcopy
HPC-X 2.5
Note: HPC-X packages such as Open MPI, SHARP, UCX, and HCOLL are also part of MLNX_OFED. However, it is recommended to install HPC-X and set it up as an environment module.
Run module load to use the module, for example:
$ wget http://bgate.mellanox.com/products/hpcx/custom/hpcx-v2.4.0-gcc-MLNX_OFED_LINUX-4.5-1.0.1.0-redhat7.5-x86_64.tbz
$ tar -xjf hpcx-v2.4.0-gcc-MLNX_OFED_LINUX-4.5-1.0.1.0-redhat7.5-x86_64.tbz
$ cd hpcx-v2.4.0-gcc-MLNX_OFED_LINUX-4.5-1.0.1.0-redhat7.5-x86_64
$ module load $PWD/modulefiles/hpcx
1. Verify that the device supports packet based credits:
$ sudo ibv_devinfo -v -d mlx5_0 | grep EXP_PACKET_BASED_CREDIT_MODE
EXP_PACKET_BASED_CREDIT_MODE
2. If you are using an alpha version of the firmware, set immediate retransmissions on the relevant device:
$ sudo mst start
Starting MST (Mellanox Software Tools) driver set
Loading MST PCI module - Success
$ sudo mst status
...
------------
/dev/mst/mt4123_pciconf0 - PCI configuration cycles access.
domain:bus:dev.fn=0000:03:00.0 addr.reg=88 data.reg=92
Chip revision is: 00
/dev/mst/mt4123_pciconf1 - PCI configuration cycles access.
domain:bus:dev.fn=0000:82:00.0 addr.reg=88 data.reg=92
Chip revision is: 00
...
...
$ sudo mcra /dev/mst/mt4123_pciconf0 0x58318.1:1 0x1
$ sudo mcra /dev/mst/mt4123_pciconf0 0x58318.0:1 0x1
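Optionally, read the bits back to confirm the setting (a quick check; mcra performs a read when no data argument is given):
$ sudo mcra /dev/mst/mt4123_pciconf0 0x58318.1:1
$ sudo mcra /dev/mst/mt4123_pciconf0 0x58318.0:1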
Set up the cluster and OpenSM
1. Make sure that your cluster meets the minimum requirements; see the deployment guide.
2. Log in to the head node and load the relevant modules.
$ module load intel/2018.1.163
$ module load hpcx/2.4.0
3. Make sure that both directories (HPCX_SHARP_DIR, OMPI_HOME) are available on the launching node (login node).
$ echo $HPCX_SHARP_DIR
/global/software/centos-7/modules/intel/2018.1.163/hpcx/2.4.0/sharp
$ echo $OMPI_HOME
/global/software/centos-7/modules/intel/2018.1.163/hpcx/2.4.0/ompi
4. Set the parameters sharp_enabled 2 and routing_engine ftree,updn (or another routing algorithm; note that not all are supported) in the OpenSM configuration file /etc/opensm/opensm.conf (or any other location):
sharp_enabled 2
routing_engine ftree,updn
$ cat /etc/opensm/opensm.conf | grep sharp
sharp_enabled 2
$ cat /etc/opensm/opensm.conf | grep routing_engine
routing_engine ftree,updn
5. Set the parameter root_guid_file /etc/opensm/root_guid.cfg in the opensm configuration file /etc/opensm/opensm.conf (or any other location)
root_guid_file /etc/opensm/root_guid.cfg
6. Add the switch GUIDs for all the root switches of your InfiniBand network. Put the root GUIDs in /etc/opensm/root_guid.cfg, e.g.,
0x7cfe900300a5a2c0
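If you are not sure which switches are the fabric roots, one way to list all switch GUIDs is with ibswitches from infiniband-diags (identifying the spine/root switches depends on your topology):
$ sudo ibswitches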
7. Start the OpenSM using the prepared opensm.conf file (you can add other flags as well, e.g. priority).
$ sudo opensm -F /etc/opensm/opensm.conf -g 0xe41d2d0300a3ab5c -p 3 -B
-------------------------------------------------
OpenSM 5.0.0.MLNX20180219.c610c42
Config file is `/etc/opensm/opensm.conf`:
Reading Cached Option File: /etc/opensm/opensm.conf
Loading Cached Option:routing_engine = ftree,updn
Loading Cached Option:root_guid_file = /etc/opensm/root_guid.cfg
Loading Cached Option:sharp_enabled = 2
Command Line Arguments:
guid = 0xe41d2d0300a3ab5c
Priority = 3
Daemon mode
Log File: /var/log/opensm.log
8. Make sure that this is the master SM running on the cluster:
$ sudo sminfo
sminfo: sm lid 11 sm guid 0xe41d2d0300a3ab5c, activity count 5303 priority 14 state 3 SMINFO_MASTER
9. Make sure that the aggregation nodes were activated by OpenSM
$ sudo ibnetdiscover | grep "Agg"
[41] "H-98039b03007ab860"[1](98039b03007ab860) # "Mellanox Technologies Aggregation Node" lid 7 4xHDR
Ca 1 "H-98039b03007ab860" # "Mellanox Technologies Aggregation Node"
Note: OpenSM v5.3 or later should be used.
$ opensm --version
-------------------------------------------------
OpenSM 5.3.0.MLNX20181108.33944a2
Enable SHARP Daemons
1. Set up the SHARP Aggregation Manager (sharp_am) on the OpenSM node.
To enable SHARPv2, set control_path_version to 2.
SHARP tree trimming (trimming_mode) should be disabled as well.
$ /global/software/centos-7/modules/gcc/4.8.5/hpcx/2.4.0-pre/sharp/bin/sharp_am --control_path_version 2 --trimming_mode 0 -B
The parameters can also be set in the configuration file located at:
/etc/sysconfig/sharp_am.cfg
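A minimal sketch of the equivalent entries in sharp_am.cfg, assuming the file accepts the same option names as the command line in a simple "option value" format:
control_path_version 2
trimming_mode 0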
Another option is to start SHARP AM as a service rather than from the command line:
$ sudo service sharp_am start
Redirecting to /bin/systemctl start sharp_am.service
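To confirm that sharp_am is running on the OpenSM node (a quick check; output will vary):
$ sudo service sharp_am status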
2. Set up sharpd on all cluster nodes (using pdsh or any other method):
$ sudo $HPCX_SHARP_DIR/sbin/sharp_daemons_setup.sh -s -d sharpd
Copying /global/software/centos-7/modules/intel/2018.1.163/hpcx/2.1.0/sharp/systemd/system/sharpd.service to /etc/systemd/system/sharpd.service
Service sharpd is installed
3. Start the SHARP daemon on all cluster nodes:
$ sudo service sharpd start
Redirecting to /bin/systemctl start sharpd.service
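For example, with pdsh (a hedged sketch: the host range is a placeholder for your node list, passwordless sudo is assumed, and $HPCX_SHARP_DIR must point to a path visible on all nodes):
$ pdsh -w node[001-032] "sudo $HPCX_SHARP_DIR/sbin/sharp_daemons_setup.sh -s -d sharpd"
$ pdsh -w node[001-032] "sudo service sharpd start"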
Verification
1. Run ibdiagnet --sharp and check the aggregation nodes
$ sudo ibdiagnet --sharp
...
$ cat /var/tmp/ibdiagnet2/ibdiagnet2.sharp
# This database file was automatically generated by IBDIAG
AN:Mellanox Technologies Aggregation Node, lid:7, node guid:0x98039b03007ab860
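You can also confirm that sharpd is active on all compute nodes (the host range below is a placeholder for your node list):
$ pdsh -w node[001-032] "systemctl is-active sharpd"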
SHARP Benchmark
$ $HPCX_SHARP_DIR/share/sharp/examples/mpi/coll/sharp_coll_test -h
Usage: sharp_mpi_test [OPTIONS]
Options:
-h, --help Show this help message and exit
-B, --perf_with_barrier Sync Allreduce with Barrier in Allreduce collective, Default:1
-c, --collectives Comma separated list of collectives:[allreduce|iallreduce|barrier|all|none]] to run. Default; run all blocking collectives
-C, --iov_count Number of entries in IOV list, if used. Default: 2 Max:SHARP_COLL_DATA_MAX_IOV
-d, --ib-dev Use IB device <dev:port> (default first device found)
-D, --data_layout Data layout (contig, iov) for sender and receiver side. Default: contig
-g, --max_groups Set value of max_groups (default:1 (COMM_WORLD)). For value > 1 , iterate over comm types
(comm world, comm world dup, comm world reverse, comm world odd even split, comm world odd even split reverse)
-i, --iters Number of iterations to run perf benchmark
-H, --host_alloc_type Host memory allocation method ( hugetlb, malloc)
-m, --mode Test modes: <basic|complex|perf|all> . Default: basic
-M, --mem_type Memory type(host,cuda) used in the communication buffers format: <src memtype>:<recv memtype>
-N, --nbc_count Number of non-blocking operation posted before waiting for completion (max: 512)
-s, --size Set the minimum and/or the maximum message size. format:[MIN:]MAX Default:<4:max_ost_payload>
-t, --group_type Set specific group type(world, world-dup, world-rev, slipt:<n> split-rev:<n>) to test
-x, --skips Number of warmup iterations to run perf benchmark
Example:
$ export LD_LIBRARY_PATH=/.autodirect/mtrswgwork/devendar/workspace/GIT/sharp-build/sharp-rhel7.4_cuda10.1_mofed_4_5-5fa9853-06242019/lib:$LD_LIBRARY_PATH
/usr/mpi/gcc/openmpi-4.0.0rc5/bin/mpirun -np 1 -map-by node -H vulcan03 --mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 -x SHARP_COLL_LOG_LEVEL=3 -x ENABLE_SHARP_COLL=1 -x SHARP_COLL_ENABLE_SAT=1 -x SHARP_COLL_SAT_THRESHOLD=1024 sharp-rhel7.4_cuda10.1_mofed_4_5-5fa9853-06242019/share/sharp/examples/mpi/coll/sharp_coll_test -d mlx5_0:1 --mode perf --collectives allreduce -H malloc -s :1000000 -x 100 -i 1000 -M cuda
HCOLL/SHARP:
1. HCOLL is enabled by default, and it must be enabled in order to use SHARP. If you are not sure, add -mca coll_hcoll_enable 1 to your mpirun command line.
2. SHARP admin mode - to enable SHARP, add -x HCOLL_ENABLE_SHARP=1 to your mpirun command line. This enables SHARP as long as HCOLL is enabled as well; if any SHARP errors occur, SHARP will not run, but the job will continue without it (fallback is available).
If you wish to force SHARP, use -x HCOLL_ENABLE_SHARP=2. In this case, if any SHARP errors occur, the job will fail, so you can be sure that SHARP is enabled for your job. This forces SHARP with the default quota.
With -x HCOLL_ENABLE_SHARP=3, SHARP is forced on all communicators.
3. Enable the SHARP streaming aggregation tree (SAT) with -x SHARP_COLL_ENABLE_SAT=1 (a SHARPv2 feature).
4. Set the minimum message size for SHARP streaming aggregation; for example, for 1024 bytes, set -x SHARP_COLL_SAT_THRESHOLD=1024.
5. Optionally, set the quota per OST with -x SHARP_COLL_JOB_QUOTA_PAYLOAD_PER_OST=1024 (the default is 128). This applies to messages smaller than SHARP_COLL_SAT_THRESHOLD (1024 in this example).
6. For PPN=1 (one process per node), also add the following:
-x HCOLL_BCOL_P2P_LARGE_ALLREDUCE_ALG=4 - make HCOLL use the SHARPv2 algorithm for allreduce (several algorithms are available).
-x HCOLL_ALLREDUCE_ZCOPY_TUNE=static - use static algorithm selection rather than dynamic tuning at run time.
-x HCOLL_SBGP=p2p -x HCOLL_BCOL=ucx_p2p - disable the HCOLL hierarchy; all ranks form one point-to-point group with no shared-memory awareness.
7. For multi-PPN runs, enable HCOLL_ALLREDUCE_HYBRID_LB=1. This enables in-node reduction for SAT with multiple processes per node. Note that this flag may hurt bandwidth performance.
8. If the profile shows, say, a 500 KB message size, you can split the message into two messages by setting HCOLL_HYBRID_AR_LB_FRAG_THRESH=262144 (the default is 1 MB). This allows some overlap between the ranks. A combined multi-PPN sketch follows this list.
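The following is a hedged sketch combining items 7 and 8 for a multi-PPN run; the rank counts and the osu_allreduce path are placeholders, and the SHARP/HCOLL flags are those discussed above:
$ mpirun -np 64 -npernode 32 -map-by node -mca pml ucx -mca coll_hcoll_enable 1 -x UCX_NET_DEVICES=mlx5_0:1 -x HCOLL_MAIN_IB=mlx5_0:1 -x HCOLL_ENABLE_SHARP=2 -x SHARP_COLL_ENABLE_SAT=1 -x SHARP_COLL_SAT_THRESHOLD=4096 -x HCOLL_ALLREDUCE_HYBRID_LB=1 -x HCOLL_HYBRID_AR_LB_FRAG_THRESH=262144 ./osu_allreduce -m 4096:4194304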
Example:
Running allreduce on two nodes without SHARPv2 (HCOLL only), on messages of 4096 bytes and above.
Note:
-x HCOLL_ENABLE_SHARP=0 : Disable SHARP
-x HCOLL_BCOL_P2P_LARGE_ALLREDUCE_ALG=2 : SHARP is not used; use the HCOLL algorithm for the allreduce.
$ mpirun -np 2 -npernode 1 -map-by node -mca pml ucx -mca coll_hcoll_enable 1 --report-bindings -x UCX_NET_DEVICES=mlx5_0:1 -x HCOLL_MAIN_IB=mlx5_0:1 -x HCOLL_ENABLE_SHARP=0 -x SHARP_COLL_LOG_LEVEL=3 -x SHARP_COLL_ENABLE_SAT=1 -x SHARP_COLL_SAT_THRESHOLD=4096 -x SHARP_COLL_JOB_QUOTA_PAYLOAD_PER_OST=1024 -x HCOLL_BCOL_P2P_LARGE_ALLREDUCE_ALG=2 -x HCOLL_BCOL=ucx_p2p -x HCOLL_ALLREDUCE_ZCOPY_TUNE=static -x HCOLL_SBGP=p2p /global/home/users/ophirm/hpcx/hpcx-v2.4.0-gcc-MLNX_OFED_LINUX-4.5-1.0.1.0-redhat7.5-x86_64/ompi/tests/osu-micro-benchmarks-5.3.2/osu_allreduce -m 4096:999999999
[thor035.hpcadvisorycouncil.com:20899] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././././././././././././.][./././././././././././././././.]
[thor036.hpcadvisorycouncil.com:177864] MCW rank 1 bound to socket 0[core 0[hwt 0]]: [B/././././././././././././././.][./././././././././././././././.]
# OSU MPI Allreduce Latency Test v5.3.2
# Size Avg Latency(us)
4096 5.22
8192 6.93
16384 10.19
32768 12.78
65536 23.26
131072 35.81
262144 52.59
524288 96.53
1048576 184.74
2097152 358.65
4194304 726.39
8388608 1477.21
16777216 3438.79
33554432 7217.48
67108864 17992.13
134217728 38130.87
268435456 77840.17
536870912 157383.25
Now enable SHARPv2 with streaming aggregation:
Note:
-x HCOLL_ENABLE_SHARP=3 : Force SHARP
-x HCOLL_BCOL_P2P_LARGE_ALLREDUCE_ALG=4 : SHARPv2 can be used.
Note that two trees were created:
LLT tree for messages up to 4096 bytes
SAT tree for messages of 4096 bytes and above.
$ mpirun -np 2 -npernode 1 -map-by node -mca pml ucx -mca coll_hcoll_enable 1 --report-bindings -x UCX_NET_DEVICES=mlx5_0:1 -x HCOLL_MAIN_IB=mlx5_0:1 -x HCOLL_ENABLE_SHARP=3 -x SHARP_COLL_LOG_LEVEL=3 -x SHARP_COLL_ENABLE_SAT=1 -x SHARP_COLL_SAT_THRESHOLD=4096 -x SHARP_COLL_JOB_QUOTA_PAYLOAD_PER_OST=1024 -x HCOLL_BCOL_P2P_LARGE_ALLREDUCE_ALG=4 -x HCOLL_BCOL=ucx_p2p -x HCOLL_ALLREDUCE_ZCOPY_TUNE=static -x HCOLL_SBGP=p2p /global/home/users/ophirm/hpcx/hpcx-v2.4.0-gcc-MLNX_OFED_LINUX-4.5-1.0.1.0-redhat7.5-x86_64/ompi/tests/osu-micro-benchmarks-5.3.2/osu_allreduce -m 4096:536870912
[thor035.hpcadvisorycouncil.com:21277] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././././././././././././.][./././././././././././././././.]
[thor036.hpcadvisorycouncil.com:178246] MCW rank 1 bound to socket 0[core 0[hwt 0]]: [B/././././././././././././././.][./././././././././././././././.]
[thor035:0:21288 - context.c:594] INFO job (ID: 2546139137) resource request quota: ( osts:0 user_data_per_ost:1024 max_groups:0 max_qps:1 max_group_channels:1, num_trees:1)
[thor035:0:21288 - context.c:765] INFO tree_info: type:LLT tree idx:0 treeID:0x0 caps:0x6 quota: ( osts:167 user_data_per_ost:1024 max_groups:167 max_qps:1 max_group_channels:1)
[thor035:0:21288 - context.c:769] INFO tree_info: type:SAT tree idx:1 treeID:0x3f caps:0x16
[thor035:0:21288 - comm.c:408] INFO [group#:0] group id:3b tree idx:0 tree_type:LLT rail_idx:0 group size:2 quota: (osts:2 user_data_per_ost:1024) mgid: (subnet prefix:0xff12a01bfe800000 interface id:0x5000000003b) mlid:c004
[thor035:0:21288 - comm.c:408] INFO [group#:1] group id:3b tree idx:1 tree_type:SAT rail_idx:0 group size:2 quota: (osts:64 user_data_per_ost:0) mgid: (subnet prefix:0x0 interface id:0x0) mlid:0
# OSU MPI Allreduce Latency Test v5.3.2
# Size Avg Latency(us)
4096 4.11
8192 4.69
16384 5.50
32768 8.12
65536 13.69
131072 18.38
262144 29.51
524288 52.78
1048576 97.50
2097152 189.34
4194304 371.71
8388608 733.02
16777216 1492.25
33554432 3018.73
67108864 6214.00
134217728 13005.74
268435456 27909.87
536870912 57544.79
Debugging SHARP
1. Set log level 3 (info) to get the log of communicator creation:
-x SHARP_COLL_LOG_LEVEL=3
Known issues
SHARP tree trimming is not supported; set trimming_mode to 0.
A switch reboot may be needed after running jobs with the SHARPv2 alpha version.