...
Prerequisites
Hardware needed:
Adapters: Mellanox ConnectX-6 adapters
Switch: Mellanox HDR Switch
Software and Drivers:
OS: CentOS 7.5 (x86)
Mellanox OFED 4.5-1 (GA)
For GPU support:
Install the MLNX_OFED driver plug-in for GPUDirect RDMA: http://www.mellanox.com/downloads/ofed/nvidia-peer-memory_1.0-7.tar.gz
Install the GDRCopy library: https://github.com/NVIDIA/gdrcopy
HPC-X 2.5
Note: HPC-X components such as Open MPI, SHARP, UCX, and HCOLL are also shipped with MLNX_OFED. However, it is recommended to install HPC-X as a separate module.
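As one way to set this up (a sketch only; the install path below is an example, not taken from this document), the HPC-X tarball ships an hpcx-init.sh script that provides module-style load helpers:

```shell
# Sketch: loading HPC-X into the environment (the path is an example).
# hpcx-init.sh defines hpcx_load/hpcx_unload helpers; alternatively, the
# modulefiles/ directory inside the HPC-X tarball can be appended to
# MODULEPATH and loaded with `module load hpcx`.
export HPCX_HOME=/opt/hpcx-v2.5.0    # example install location
source $HPCX_HOME/hpcx-init.sh
hpcx_load                            # sets PATH, LD_LIBRARY_PATH, MPI vars
```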
...
-x HCOLL_ALLREDUCE_ZCOPY_TUNE=static - use the static algorithm rather than dynamic tuning at run time.
-x HCOLL_SBGP=p2p -x HCOLL_BCOL=ucx_p2p - disable the HCOLL hierarchy; all ranks form a single point-to-point group, with no shared-memory awareness.
...
8. If the profile shows, for example, a 500 KB message size, you can split the message into two fragments by setting HCOLL_HYBRID_AR_LB_FRAG_THRESH=262144 (the default is 1 MB). This allows some overlap between the ranks.
HCOLL_ALLREDUCE_HYBRID_LB
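To make the fragmentation arithmetic in step 8 concrete (a sketch; the 512 KB payload stands in for the ~500 KB message from the profile):

```shell
# How HCOLL_HYBRID_AR_LB_FRAG_THRESH splits a message: a 512 KB payload with
# the threshold lowered from the default 1 MB (1048576) to 262144 bytes is
# sent as two fragments, which can be pipelined/overlapped across ranks.
msg_size=524288       # ~500 KB message seen in the profile
frag_thresh=262144    # proposed HCOLL_HYBRID_AR_LB_FRAG_THRESH value
nfrags=$(( (msg_size + frag_thresh - 1) / frag_thresh ))   # ceiling division
echo "fragments: $nfrags"
# prints "fragments: 2"
```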
Example:
Running allreduce on two nodes without SHARPv2 (HCOLL only), on messages of 4096 bytes and above:
Note:
-x HCOLL_ENABLE_SHARP=0 : disable SHARP.
-x HCOLL_BCOL_P2P_LARGE_ALLREDUCE_ALG=2 : SHARP cannot be used, so use the HCOLL algorithm for the allreduce.
$ mpirun -np 2 -npernode 1 -map-by node -mca pml ucx -mca coll_hcoll_enable 1 --report-bindings \
    -x UCX_NET_DEVICES=mlx5_0:1 -x HCOLL_MAIN_IB=mlx5_0:1 -x HCOLL_ENABLE_SHARP=0 -x SHARP_COLL_LOG_LEVEL=3 \
    -x SHARP_COLL_ENABLE_SAT=1 -x SHARP_COLL_SAT_THRESHOLD=4096 -x SHARP_COLL_JOB_QUOTA_PAYLOAD_PER_OST=1024 \
    -x HCOLL_BCOL_P2P_LARGE_ALLREDUCE_ALG=2 -x HCOLL_BCOL=ucx_p2p -x HCOLL_ALLREDUCE_ZCOPY_TUNE=static -x HCOLL_SBGP=p2p \
    /global/home/users/ophirm/hpcx/hpcx-v2.4.0-gcc-MLNX_OFED_LINUX-4.5-1.0.1.0-redhat7.5-x86_64/ompi/tests/osu-micro-benchmarks-5.3.2/osu_allreduce -m 4096:999999999
[thor035.hpcadvisorycouncil.com:20899] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././././././././././././.][./././././././././././././././.]
[thor036.hpcadvisorycouncil.com:177864] MCW rank 1 bound to socket 0[core 0[hwt 0]]: [B/././././././././././././././.][./././././././././././././././.]

# OSU MPI Allreduce Latency Test v5.3.2
# Size       Avg Latency(us)
4096                    5.22
8192                    6.93
16384                  10.19
32768                  12.78
65536                  23.26
131072                 35.81
262144                 52.59
524288                 96.53
1048576               184.74
2097152               358.65
4194304               726.39
8388608              1477.21
16777216             3438.79
33554432             7217.48
67108864            17992.13
134217728           38130.87
268435456           77840.17
536870912          157383.25
...
Now enable SHARPv2 with streaming aggregation:
Note:
-x HCOLL_ENABLE_SHARP=3 : force SHARP.
-x HCOLL_BCOL_P2P_LARGE_ALLREDUCE_ALG=4 : SHARPv2 can be used.
Note that two trees were created:
LLT tree for messages up to 4096 bytes
SAT tree for messages of 4096 bytes and up
$ mpirun -np 2 -npernode 1 -map-by node -mca pml ucx -mca coll_hcoll_enable 1 --report-bindings \
    -x UCX_NET_DEVICES=mlx5_0:1 -x HCOLL_MAIN_IB=mlx5_0:1 -x HCOLL_ENABLE_SHARP=3 -x SHARP_COLL_LOG_LEVEL=3 \
    -x SHARP_COLL_ENABLE_SAT=1 -x SHARP_COLL_SAT_THRESHOLD=4096 -x SHARP_COLL_JOB_QUOTA_PAYLOAD_PER_OST=1024 \
    -x HCOLL_BCOL_P2P_LARGE_ALLREDUCE_ALG=4 -x HCOLL_BCOL=ucx_p2p -x HCOLL_ALLREDUCE_ZCOPY_TUNE=static -x HCOLL_SBGP=p2p \
    /global/home/users/ophirm/hpcx/hpcx-v2.4.0-gcc-MLNX_OFED_LINUX-4.5-1.0.1.0-redhat7.5-x86_64/ompi/tests/osu-micro-benchmarks-5.3.2/osu_allreduce -m 4096:536870912
[thor035.hpcadvisorycouncil.com:21277] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././././././././././././.][./././././././././././././././.]
[thor036.hpcadvisorycouncil.com:178246] MCW rank 1 bound to socket 0[core 0[hwt 0]]: [B/././././././././././././././.][./././././././././././././././.]
[thor035:0:21288 - context.c:594] INFO job (ID: 2546139137) resource request quota: ( osts:0 user_data_per_ost:1024 max_groups:0 max_qps:1 max_group_channels:1, num_trees:1)
[thor035:0:21288 - context.c:765] INFO tree_info: type:LLT tree idx:0 treeID:0x0 caps:0x6 quota: ( osts:167 user_data_per_ost:1024 max_groups:167 max_qps:1 max_group_channels:1)
[thor035:0:21288 - context.c:769] INFO tree_info: type:SAT tree idx:1 treeID:0x3f caps:0x16
[thor035:0:21288 - comm.c:408] INFO [group#:0] group id:3b tree idx:0 tree_type:LLT rail_idx:0 group size:2 quota: (osts:2 user_data_per_ost:1024) mgid: (subnet prefix:0xff12a01bfe800000 interface id:0x5000000003b) mlid:c004
[thor035:0:21288 - comm.c:408] INFO [group#:1] group id:3b tree idx:1 tree_type:SAT rail_idx:0 group size:2 quota: (osts:64 user_data_per_ost:0) mgid: (subnet prefix:0x0 interface id:0x0) mlid:0

# OSU MPI Allreduce Latency Test v5.3.2
# Size       Avg Latency(us)
4096                    4.11
8192                    4.69
16384                   5.50
32768                   8.12
65536                  13.69
131072                 18.38
262144                 29.51
524288                 52.78
1048576                97.50
2097152               189.34
4194304               371.71
8388608               733.02
16777216             1492.25
33554432             3018.73
67108864             6214.00
134217728           13005.74
268435456           27909.87
536870912           57544.79
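As a quick comparison of the two runs, the 512 MB latencies reported above (157383.25 us without SHARP vs. 57544.79 us with SAT streaming aggregation) imply roughly a 2.7x speedup; a one-liner to compute it from the printed values:

```shell
# Compare the 536870912-byte allreduce latency from the two runs above
# (values copied from the osu_allreduce output).
hcoll_us=157383.25   # SHARP disabled, HCOLL algorithm
sat_us=57544.79      # SHARPv2 SAT streaming aggregation
awk -v a="$hcoll_us" -v b="$sat_us" 'BEGIN { printf "speedup: %.1fx\n", a / b }'
# prints "speedup: 2.7x"
```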
...
-x SHARP_COLL_LOG_LEVEL=3 : increase SHARP logging verbosity; level 3 prints the INFO messages (tree and quota information) shown in the output above.
Known issues
SHARP tree trimming is not supported; set trimming_mode to 0.
A switch reboot may be needed after running jobs with the SHARPv2 alpha version.