...

Prerequisites 

Hardware needed:

  • Adapters: Mellanox ConnectX-6

  • Switch: Mellanox HDR Switch


Software and Drivers:


Note: HPC-X packages such as openmpi, sharp, ucx, and hcoll are also part of MLNX_OFED. However, it is recommended to build HPC-X as a module.
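
For example, a minimal sketch of loading HPC-X as a module, assuming the tarball was unpacked under the site-specific path used in the runs below:

Code Block
$ export HPCX_HOME=/global/home/users/ophirm/hpcx/hpcx-v2.4.0-gcc-MLNX_OFED_LINUX-4.5-1.0.1.0-redhat7.5-x86_64
$ module use $HPCX_HOME/modulefiles
$ module load hpcx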

...

-x HCOLL_ALLREDUCE_ZCOPY_TUNE=static   - Use the static algorithm instead of dynamic tuning at run time.

-x HCOLL_SBGP=p2p -x HCOLL_BCOL=ucx_p2p    - Disable the HCOLL hierarchy; all ranks form a single point-to-point group, with no shared-memory awareness.

...

8. If the profile shows, for example, a 500 KB message size, you can split the message into two fragments by setting HCOLL_HYBRID_AR_LB_FRAG_THRESH=262144 (the default is 1 MB). This allows some overlap between the ranks, as illustrated below.
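
A minimal sketch of the flag in use, assuming the same two-node setup as the runs below (the benchmark path is shortened here for readability):

Code Block
$ mpirun -np 2 -npernode 1 -map-by node -mca pml ucx -mca coll_hcoll_enable 1 \
    -x HCOLL_HYBRID_AR_LB_FRAG_THRESH=262144 \
    ./osu_allreduce -m 4096:536870912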


HCOLL_ALLREDUCE_HYBRID_LB


Example:


Running allreduce on two nodes - without SHARPv2 (HCOLL only) on messages of 4096 bytes and above

Note:

-x HCOLL_ENABLE_SHARP=0 : Disable SHARP

-x HCOLL_BCOL_P2P_LARGE_ALLREDUCE_ALG=2 : SHARP cannot be used; use the HCOLL algorithm for the allreduce.


Code Block
$ mpirun -np 2 -npernode 1  -map-by node  -mca pml ucx -mca coll_hcoll_enable 1  --report-bindings -x UCX_NET_DEVICES=mlx5_0:1 -x HCOLL_MAIN_IB=mlx5_0:1 -x HCOLL_ENABLE_SHARP=0 -x SHARP_COLL_LOG_LEVEL=3  -x  SHARP_COLL_ENABLE_SAT=1 -x SHARP_COLL_SAT_THRESHOLD=4096 -x SHARP_COLL_JOB_QUOTA_PAYLOAD_PER_OST=1024 -x HCOLL_BCOL_P2P_LARGE_ALLREDUCE_ALG=2 -x HCOLL_BCOL=ucx_p2p -x HCOLL_ALLREDUCE_ZCOPY_TUNE=static -x HCOLL_SBGP=p2p  /global/home/users/ophirm/hpcx/hpcx-v2.4.0-gcc-MLNX_OFED_LINUX-4.5-1.0.1.0-redhat7.5-x86_64/ompi/tests/osu-micro-benchmarks-5.3.2/osu_allreduce -m 4096:999999999
[thor035.hpcadvisorycouncil.com:20899] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././././././././././././.][./././././././././././././././.]
[thor036.hpcadvisorycouncil.com:177864] MCW rank 1 bound to socket 0[core 0[hwt 0]]: [B/././././././././././././././.][./././././././././././././././.]

# OSU MPI Allreduce Latency Test v5.3.2
# Size       Avg Latency(us)
4096                    5.22
8192                    6.93
16384                  10.19
32768                  12.78
65536                  23.26
131072                 35.81
262144                 52.59
524288                 96.53
1048576               184.74
2097152               358.65
4194304               726.39
8388608              1477.21
16777216             3438.79
33554432             7217.48
67108864            17992.13
134217728           38130.87
268435456           77840.17
536870912          157383.25


...

Now enable SHARPv2 with streaming aggregation:

Note:

-x HCOLL_ENABLE_SHARP=3 : Force SHARP

-x HCOLL_BCOL_P2P_LARGE_ALLREDUCE_ALG=4 : SHARPv2 can be used.


Note that two trees were created:

  • LLT (low-latency tree) for messages up to 4096 bytes

  • SAT (streaming-aggregation tree) for messages of 4096 bytes and above


Code Block
$ mpirun -np 2 -npernode 1  -map-by node  -mca pml ucx -mca coll_hcoll_enable 1  --report-bindings -x UCX_NET_DEVICES=mlx5_0:1 -x HCOLL_MAIN_IB=mlx5_0:1 -x HCOLL_ENABLE_SHARP=3 -x SHARP_COLL_LOG_LEVEL=3  -x  SHARP_COLL_ENABLE_SAT=1 -x SHARP_COLL_SAT_THRESHOLD=4096 -x SHARP_COLL_JOB_QUOTA_PAYLOAD_PER_OST=1024 -x HCOLL_BCOL_P2P_LARGE_ALLREDUCE_ALG=4 -x HCOLL_BCOL=ucx_p2p -x HCOLL_ALLREDUCE_ZCOPY_TUNE=static -x HCOLL_SBGP=p2p  /global/home/users/ophirm/hpcx/hpcx-v2.4.0-gcc-MLNX_OFED_LINUX-4.5-1.0.1.0-redhat7.5-x86_64/ompi/tests/osu-micro-benchmarks-5.3.2/osu_allreduce -m 4096:536870912

[thor035.hpcadvisorycouncil.com:21277] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././././././././././././.][./././././././././././././././.]
[thor036.hpcadvisorycouncil.com:178246] MCW rank 1 bound to socket 0[core 0[hwt 0]]: [B/././././././././././././././.][./././././././././././././././.]
[thor035:0:21288 - context.c:594] INFO job (ID: 2546139137) resource request quota: ( osts:0 user_data_per_ost:1024 max_groups:0 max_qps:1 max_group_channels:1, num_trees:1)
[thor035:0:21288 - context.c:765] INFO tree_info: type:LLT tree idx:0 treeID:0x0 caps:0x6 quota: ( osts:167 user_data_per_ost:1024 max_groups:167 max_qps:1 max_group_channels:1)
[thor035:0:21288 - context.c:769] INFO tree_info: type:SAT tree idx:1 treeID:0x3f caps:0x16
[thor035:0:21288 - comm.c:408] INFO [group#:0] group id:3b tree idx:0 tree_type:LLT rail_idx:0 group size:2 quota: (osts:2 user_data_per_ost:1024) mgid: (subnet prefix:0xff12a01bfe800000 interface id:0x5000000003b) mlid:c004
[thor035:0:21288 - comm.c:408] INFO [group#:1] group id:3b tree idx:1 tree_type:SAT rail_idx:0 group size:2 quota: (osts:64 user_data_per_ost:0) mgid: (subnet prefix:0x0 interface id:0x0) mlid:0

# OSU MPI Allreduce Latency Test v5.3.2
# Size       Avg Latency(us)
4096                    4.11
8192                    4.69
16384                   5.50
32768                   8.12
65536                  13.69
131072                 18.38
262144                 29.51
524288                 52.78
1048576                97.50
2097152               189.34
4194304               371.71
8388608               733.02
16777216             1492.25
33554432             3018.73
67108864             6214.00
134217728           13005.74
268435456           27909.87
536870912           57544.79
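
Comparing the two runs: at 536870912 bytes, latency drops from 157383.25 us (HCOLL only) to 57544.79 us with SHARPv2 streaming aggregation, roughly a 2.7x improvement; at 1048576 bytes the gain is close to 2x (184.74 us vs. 97.50 us).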

...

-x SHARP_COLL_LOG_LEVEL=3 : Set the SHARP library log level to 3 (info), which prints the tree and quota information seen in the output above.


Known issues 

  1. SHARP tree trimming is not supported; set trimming_mode to 0 (see the sketch after this list).

  2. A switch reboot may be needed with the SHARPv2 alpha version after running jobs.
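
For issue 1, a hedged sketch of disabling tree trimming in the SHARP aggregation manager configuration; the config file path and the service restart command are assumptions and depend on how sharp_am is deployed at your site:

Code Block
# Illustrative only: config location and service name are assumptions
$ echo "trimming_mode 0" >> $HPCX_SHARP_DIR/conf/sharp_am.cfg
$ systemctl restart sharp_am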

References