Getting Started with SHARPv2
Network Plan
Have a dedicated server for OpenSM and the SHARP Aggregation Manager (SHARP AM).
Note that sharpd (SHARP daemon) and SHARP AM cannot run on the same compute server.
MPI API calls supported by SHARP
Note that SHARP is enabled only for MPI_Barrier and MPI_Allreduce (with SHARPv2 adding support for large message sizes).
Therefore, before testing it with an HPC application, profile the application and look for a high percentage of time spent in those MPI calls; otherwise, the benefit from SHARP may be negligible.
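One way to obtain such a profile is to preload an MPI profiling library. This is a minimal sketch assuming the mpiP profiler is installed; the library path, rank count, and application name are placeholders:
$ mpirun -np 64 -x LD_PRELOAD=/opt/mpiP/lib/libmpiP.so ./my_app
The *.mpiP report written at exit breaks down the time per MPI call; look for a large share of MPI_Allreduce and MPI_Barrier.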
Prerequisites
Hardware needed:
Adapters: Mellanox ConnectX-6 adapters
Switch: Mellanox HDR Switch
Software and Drivers:
OS: CentOS 7.5 (x86_64)
Mellanox OFED 4.5-1 (GA)
For GPU support:
Install the MLNX_OFED driver plug-in for GPUDirect RDMA: http://www.mellanox.com/downloads/ofed/nvidia-peer-memory_1.0-7.tar.gz
Install the GDRCopy library: https://github.com/NVIDIA/gdrcopy
HPC-X 2.5
Note: HPC-X packages such as Open MPI, SHARP, UCX, and HCOLL are also part of MLNX_OFED. However, it is recommended to install HPC-X and set it up as an environment module.
Run module load to use the module, for example:
$ wget http://bgate.mellanox.com/products/hpcx/custom/hpcx-v2.4.0-gcc-MLNX_OFED_LINUX-4.5-1.0.1.0-redhat7.5-x86_64.tbz
$ tar -xjf hpcx-v2.4.0-gcc-MLNX_OFED_LINUX-4.5-1.0.1.0-redhat7.5-x86_64.tbz
$ cd hpcx-v2.4.0-gcc-MLNX_OFED_LINUX-4.5-1.0.1.0-redhat7.5-x86_64
$ module load $PWD/modulefiles/hpcx
1. Verify that the device supports packet based credits:
$ sudo ibv_devinfo -v -d mlx5_0 | grep EXP_PACKET_BASED_CREDIT_MODE
EXP_PACKET_BASED_CREDIT_MODE
2. If you are using an alpha version of the firmware, set immediate retransmissions on the relevant device:
$ sudo mst start
Starting MST (Mellanox Software Tools) driver set
Loading MST PCI module - Success
$ sudo mst status
...
------------
/dev/mst/mt4123_pciconf0 - PCI configuration cycles access.
domain:bus:dev.fn=0000:03:00.0 addr.reg=88 data.reg=92
Chip revision is: 00
/dev/mst/mt4123_pciconf1 - PCI configuration cycles access.
domain:bus:dev.fn=0000:82:00.0 addr.reg=88 data.reg=92
Chip revision is: 00
...
...
$ sudo mcra /dev/mst/mt4123_pciconf0 0x58318.1:1 0x1
$ sudo mcra /dev/mst/mt4123_pciconf0 0x58318.0:1 0x1
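Optionally, read the bits back to confirm the setting (a quick check; mcra performs a read when no data argument is given):
$ sudo mcra /dev/mst/mt4123_pciconf0 0x58318.1:1
$ sudo mcra /dev/mst/mt4123_pciconf0 0x58318.0:1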
Set up the cluster and OpenSM
1. Make sure that your cluster meets the minimum requirements; see the deployment guide.
2. Log in to the head node and load the relevant modules.
$ module load intel/2018.1.163
$ module load hpcx/2.4.0
3. Make sure that both directories (HPCX_SHARP_DIR, OMPI_HOME) are available on the launching node (login node).
$ echo $HPCX_SHARP_DIR
/global/software/centos-7/modules/intel/2018.1.163/hpcx/2.4.0/sharp
$ echo $OMPI_HOME
/global/software/centos-7/modules/intel/2018.1.163/hpcx/2.4.0/ompi
4. Set the parameters sharp_enabled 2 and routing_engine ftree,updn (or another routing algorithm; note that not all are supported) in the OpenSM configuration file /etc/opensm/opensm.conf (or any other location):
sharp_enabled 2
routing_engine ftree,updn
$ cat /etc/opensm/opensm.conf | grep sharp
sharp_enabled 2
$ cat /etc/opensm/opensm.conf | grep routing_engine
routing_engine ftree,updn
5. Set the parameter root_guid_file /etc/opensm/root_guid.cfg in the opensm configuration file /etc/opensm/opensm.conf (or any other location)
root_guid_file /etc/opensm/root_guid.cfg
6. Add the switch GUIDs for all the root switches of your InfiniBand network. Put the root GUIDs in /etc/opensm/root_guid.cfg, e.g.,
0x7cfe900300a5a2c0
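If you are not sure which switches are the fabric roots, one way to list all switch GUIDs is with ibswitches from infiniband-diags (identifying the spine/root switches depends on your topology):
$ sudo ibswitches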
7. Start the OpenSM using the prepared opensm.conf file (you can add other flags as well, e.g. priority).
$ sudo opensm -F /etc/opensm/opensm.conf -g 0xe41d2d0300a3ab5c -p 3 -B
-------------------------------------------------
OpenSM 5.0.0.MLNX20180219.c610c42
Config file is `/etc/opensm/opensm.conf`:
Reading Cached Option File: /etc/opensm/opensm.conf
Loading Cached Option:routing_engine = ftree,updn
Loading Cached Option:root_guid_file = /etc/opensm/root_guid.cfg
Loading Cached Option:sharp_enabled = 2
Command Line Arguments:
guid = 0xe41d2d0300a3ab5c
Priority = 3
Daemon mode
Log File: /var/log/opensm.log
8. Make sure that this is the master SM running on the cluster:
$ sudo sminfo
sminfo: sm lid 11 sm guid 0xe41d2d0300a3ab5c, activity count 5303 priority 14 state 3 SMINFO_MASTER
9. Make sure that the aggregation nodes were activated by OpenSM
$ sudo ibnetdiscover | grep "Agg"
[41] "H-98039b03007ab860"[1](98039b03007ab860) # "Mellanox Technologies Aggregation Node" lid 7 4xHDR
Ca 1 "H-98039b03007ab860" # "Mellanox Technologies Aggregation Node"
Note: OpenSM v5.3 or later should be used.
$ opensm --version
-------------------------------------------------
OpenSM 5.3.0.MLNX20181108.33944a2
Enable SHARP Daemons
1. Set up the SHARP Aggregation Manager (sharp_am) on the OpenSM node.
To enable SHARPv2, set control_path_version to 2.
SHARP tree trimming (trimming_mode) should be disabled as well.
$ /global/software/centos-7/modules/gcc/4.8.5/hpcx/2.4.0-pre/sharp/bin/sharp_am --control_path_version 2 --trimming_mode 0 -B
The parameters can also be set in the configuration file located at:
/etc/sysconfig/sharp_am.cfg
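A minimal sketch of the equivalent entries in sharp_am.cfg, assuming the file accepts the same option names as the command line in a simple "option value" format:
control_path_version 2
trimming_mode 0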
Another option is to start SHARP AM as a service rather than from the command line:
$ sudo service sharp_am start
Redirecting to /bin/systemctl start sharp_am.service
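To confirm that sharp_am is running on the OpenSM node (a quick check; output will vary):
$ sudo service sharp_am status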
2. Set up sharpd on all cluster nodes (using pdsh or any other method):
$ sudo $HPCX_SHARP_DIR/sbin/sharp_daemons_setup.sh -s -d sharpd
Copying /global/software/centos-7/modules/intel/2018.1.163/hpcx/2.1.0/sharp/systemd/system/sharpd.service to /etc/systemd/system/sharpd.service
Service sharpd is installed
3. Start the SHARP daemon on all cluster nodes:
$ sudo service sharpd start
Redirecting to /bin/systemctl start sharpd.service
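For example, with pdsh (a hedged sketch: the host range is a placeholder for your node list, passwordless sudo is assumed, and $HPCX_SHARP_DIR must point to a path visible on all nodes):
$ pdsh -w node[001-032] "sudo $HPCX_SHARP_DIR/sbin/sharp_daemons_setup.sh -s -d sharpd"
$ pdsh -w node[001-032] "sudo service sharpd start"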
Verification
1. Run ibdiagnet --sharp and check the aggregation nodes
$ sudo ibdiagnet --sharp
...
$ cat /var/tmp/ibdiagnet2/ibdiagnet2.sharp
# This database file was automatically generated by IBDIAG
AN:Mellanox Technologies Aggregation Node, lid:7, node guid:0x98039b03007ab860
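You can also confirm that sharpd is active on all compute nodes (the host range below is a placeholder for your node list):
$ pdsh -w node[001-032] "systemctl is-active sharpd"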
SHARP Benchmark
$ $HPCX_SHARP_DIR/share/sharp/examples/mpi/coll/sharp_coll_test -h
Usage: sharp_mpi_test [OPTIONS]
Options:
-h, --help Show this help message and exit
-B, --perf_with_barrier Sync Allreduce with Barrier in Allreduce collective, Default:1
-c, --collectives Comma separated list of collectives:[allreduce|iallreduce|barrier|all|none]] to run. Default; run all blocking collectives
-C, --iov_count Number of entries in IOV list, if used. Default: 2 Max:SHARP_COLL_DATA_MAX_IOV
-d, --ib-dev Use IB device <dev:port> (default first device found)
-D, --data_layout Data layout (contig, iov) for sender and receiver side. Default: contig
-g, --max_groups Set value of max_groups (default:1 (COMM_WORLD)). For value > 1 , iterate over comm types
(comm world, comm world dup, comm world reverse, comm world odd even split, comm world odd even split reverse)
-i, --iters Number of iterations to run perf benchmark
-H, --host_alloc_type Host memory allocation method ( hugetlb, malloc)
-m, --mode Test modes: <basic|complex|perf|all> . Default: basic
-M, --mem_type Memory type(host,cuda) used in the communication buffers format: <src memtype>:<recv memtype>
-N, --nbc_count Number of non-blocking operation posted before waiting for completion (max: 512)
-s, --size Set the minimum and/or the maximum message size. format:[MIN:]MAX Default:<4:max_ost_payload>
-t, --group_type Set specific group type(world, world-dup, world-rev, slipt:<n> split-rev:<n>) to test
-x, --skips Number of warmup iterations to run perf benchmark
Example:
$ export LD_LIBRARY_PATH=/.autodirect/mtrswgwork/devendar/workspace/GIT/sharp-build/sharp-rhel7.4_cuda10.1_mofed_4_5-5fa9853-06242019/lib:$LD_LIBRARY_PATH
/usr/mpi/gcc/openmpi-4.0.0rc5/bin/mpirun -np 1 -map-by node -H vulcan03 --mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 -x SHARP_COLL_LOG_LEVEL=3 -x ENABLE_SHARP_COLL=1 -x SHARP_COLL_ENABLE_SAT=1 -x SHARP_COLL_SAT_THRESHOLD=1024 sharp-rhel7.4_cuda10.1_mofed_4_5-5fa9853-06242019/share/sharp/examples/mpi/coll/sharp_coll_test -d mlx5_0:1 --mode perf --collectives allreduce -H malloc -s :1000000 -x 100 -i 1000 -M cuda
HCOLL/SHARP:
1. HCOLL is enabled by default, and it must be enabled in order to use SHARP. If you are not sure, add -mca coll_hcoll_enable 1 to your mpirun command line.
2. SHARP admin mode - to enable SHARP, add -x HCOLL_ENABLE_SHARP=1 to your mpirun command line. This enables SHARP as long as HCOLL is enabled as well; if any SHARP errors occur, SHARP will not run, but the job will continue without it (fallback is available).
If you wish to force SHARP, use -x HCOLL_ENABLE_SHARP=2. In this case, if any SHARP errors occur, the job will fail, so you can be sure that SHARP is enabled for your job. This forces SHARP with the default quota.
With -x HCOLL_ENABLE_SHARP=3, SHARP is forced on all communicators.
3. Enable the SHARP streaming aggregation tree (SAT) with -x SHARP_COLL_ENABLE_SAT=1 (a SHARPv2 feature).
4. Set the minimum message size for SHARP streaming aggregation; for example, for 1024 bytes, set -x SHARP_COLL_SAT_THRESHOLD=1024.
5. Optionally, set the quota per OST with -x SHARP_COLL_JOB_QUOTA_PAYLOAD_PER_OST=1024 (the default is 128). This applies to messages smaller than SHARP_COLL_SAT_THRESHOLD (1024 in this example).
6. For PPN=1 (one process per node), also add the following:
-x HCOLL_BCOL_P2P_LARGE_ALLREDUCE_ALG=4 - make HCOLL use the SHARPv2 algorithm for allreduce (several algorithms are available).
-x HCOLL_ALLREDUCE_ZCOPY_TUNE=static - use static algorithm selection rather than dynamic tuning at run time.
-x HCOLL_SBGP=p2p -x HCOLL_BCOL=ucx_p2p - disable the HCOLL hierarchy; all ranks form one point-to-point group with no shared-memory awareness.
7. For multi-PPN runs, enable HCOLL_ALLREDUCE_HYBRID_LB=1. This enables in-node reduction for SAT with multiple processes per node. Note that this flag may hurt bandwidth performance.
8. If the profile shows, say, a 500 KB message size, you can split the message into two messages by setting HCOLL_HYBRID_AR_LB_FRAG_THRESH=262144 (the default is 1 MB). This allows some overlap between the ranks. A combined multi-PPN sketch follows this list.
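The following is a hedged sketch combining items 7 and 8 for a multi-PPN run; the rank counts and the osu_allreduce path are placeholders, and the SHARP/HCOLL flags are those discussed above:
$ mpirun -np 64 -npernode 32 -map-by node -mca pml ucx -mca coll_hcoll_enable 1 -x UCX_NET_DEVICES=mlx5_0:1 -x HCOLL_MAIN_IB=mlx5_0:1 -x HCOLL_ENABLE_SHARP=2 -x SHARP_COLL_ENABLE_SAT=1 -x SHARP_COLL_SAT_THRESHOLD=4096 -x HCOLL_ALLREDUCE_HYBRID_LB=1 -x HCOLL_HYBRID_AR_LB_FRAG_THRESH=262144 ./osu_allreduce -m 4096:4194304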
Example:
Running allreduce on two nodes without SHARPv2 (HCOLL only), on messages of 4096 bytes and above.
Note:
-x HCOLL_ENABLE_SHARP=0 : Disable SHARP
-x HCOLL_BCOL_P2P_LARGE_ALLREDUCE_ALG=2 : SHARP is not used; use the HCOLL algorithm for the allreduce.
$ mpirun -np 2 -npernode 1 -map-by node -mca pml ucx -mca coll_hcoll_enable 1 --report-bindings -x UCX_NET_DEVICES=mlx5_0:1 -x HCOLL_MAIN_IB=mlx5_0:1 -x HCOLL_ENABLE_SHARP=0 -x SHARP_COLL_LOG_LEVEL=3 -x SHARP_COLL_ENABLE_SAT=1 -x SHARP_COLL_SAT_THRESHOLD=4096 -x SHARP_COLL_JOB_QUOTA_PAYLOAD_PER_OST=1024 -x HCOLL_BCOL_P2P_LARGE_ALLREDUCE_ALG=2 -x HCOLL_BCOL=ucx_p2p -x HCOLL_ALLREDUCE_ZCOPY_TUNE=static -x HCOLL_SBGP=p2p /global/home/users/ophirm/hpcx/hpcx-v2.4.0-gcc-MLNX_OFED_LINUX-4.5-1.0.1.0-redhat7.5-x86_64/ompi/tests/osu-micro-benchmarks-5.3.2/osu_allreduce -m 4096:999999999
[thor035.hpcadvisorycouncil.com:20899] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././././././././././././.][./././././././././././././././.]
[thor036.hpcadvisorycouncil.com:177864] MCW rank 1 bound to socket 0[core 0[hwt 0]]: [B/././././././././././././././.][./././././././././././././././.]
# OSU MPI Allreduce Latency Test v5.3.2
# Size Avg Latency(us)
4096 5.22
8192 6.93
16384 10.19
32768 12.78
65536 23.26
131072 35.81
262144 52.59
524288 96.53
1048576 184.74
2097152 358.65
4194304 726.39
8388608 1477.21
16777216 3438.79
33554432 7217.48
67108864 17992.13
134217728 38130.87
268435456 77840.17
536870912 157383.25
Now enable SHARPv2 with streaming aggregation:
Note:
-x HCOLL_ENABLE_SHARP=3 : Force SHARP
-x HCOLL_BCOL_P2P_LARGE_ALLREDUCE_ALG=4 : SHARPv2 can be used.
Note that two trees were created:
LLT tree for messages up to 4096 bytes
SAT tree for messages of 4096 bytes and above.
$ mpirun -np 2 -npernode 1 -map-by node -mca pml ucx -mca coll_hcoll_enable 1 --report-bindings -x UCX_NET_DEVICES=mlx5_0:1 -x HCOLL_MAIN_IB=mlx5_0:1 -x HCOLL_ENABLE_SHARP=3 -x SHARP_COLL_LOG_LEVEL=3 -x SHARP_COLL_ENABLE_SAT=1 -x SHARP_COLL_SAT_THRESHOLD=4096 -x SHARP_COLL_JOB_QUOTA_PAYLOAD_PER_OST=1024 -x HCOLL_BCOL_P2P_LARGE_ALLREDUCE_ALG=4 -x HCOLL_BCOL=ucx_p2p -x HCOLL_ALLREDUCE_ZCOPY_TUNE=static -x HCOLL_SBGP=p2p /global/home/users/ophirm/hpcx/hpcx-v2.4.0-gcc-MLNX_OFED_LINUX-4.5-1.0.1.0-redhat7.5-x86_64/ompi/tests/osu-micro-benchmarks-5.3.2/osu_allreduce -m 4096:536870912
[thor035.hpcadvisorycouncil.com:21277] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././././././././././././.][./././././././././././././././.]
[thor036.hpcadvisorycouncil.com:178246] MCW rank 1 bound to socket 0[core 0[hwt 0]]: [B/././././././././././././././.][./././././././././././././././.]
[thor035:0:21288 - context.c:594] INFO job (ID: 2546139137) resource request quota: ( osts:0 user_data_per_ost:1024 max_groups:0 max_qps:1 max_group_channels:1, num_trees:1)
[thor035:0:21288 - context.c:765] INFO tree_info: type:LLT tree idx:0 treeID:0x0 caps:0x6 quota: ( osts:167 user_data_per_ost:1024 max_groups:167 max_qps:1 max_group_channels:1)
[thor035:0:21288 - context.c:769] INFO tree_info: type:SAT tree idx:1 treeID:0x3f caps:0x16
[thor035:0:21288 - comm.c:408] INFO [group#:0] group id:3b tree idx:0 tree_type:LLT rail_idx:0 group size:2 quota: (osts:2 user_data_per_ost:1024) mgid: (subnet prefix:0xff12a01bfe800000 interface id:0x5000000003b) mlid:c004
[thor035:0:21288 - comm.c:408] INFO [group#:1] group id:3b tree idx:1 tree_type:SAT rail_idx:0 group size:2 quota: (osts:64 user_data_per_ost:0) mgid: (subnet prefix:0x0 interface id:0x0) mlid:0
# OSU MPI Allreduce Latency Test v5.3.2
# Size Avg Latency(us)
4096 4.11
8192 4.69
16384 5.50
32768 8.12
65536 13.69
131072 18.38
262144 29.51
524288 52.78
1048576 97.50
2097152 189.34
4194304 371.71
8388608 733.02
16777216 1492.25
33554432 3018.73
67108864 6214.00
134217728 13005.74
268435456 27909.87
536870912 57544.79
Debugging SHARP
1. Set log level 3 (info) to get the log of communicator creation:
-x SHARP_COLL_LOG_LEVEL=3
Known issues
SHARP tree trimming is not supported; set trimming_mode to 0.
A switch reboot may be needed after running jobs with the SHARPv2 alpha version.