This post will help you tune OSU benchmarks for AMD 2^nd Generation EPYC™ CPU (formerly codenamed “Rome”) based servers to achieve maximum performance from ConnectX-6 HDR InfiniBand adapters using HPC-X MPI. This post was established based testing on the “Daytona_X” reference platform with 2^nd Gen AMD EPYC processors and ConnectX-6 HDR InfiniBand adapters.

The following tests were done with HPC-X version 2.7.0-pre, intel 2019 compilers

OSU Point to Point Tests

osu_latency

This is a point to point benchmark.

This micro-benchmarks runs on two cores only (basic latency and bandwidth)
Please use the local core to the adapter, in this case core 80.
HPC-X MPI version 2.7.0 was used
10000 iteration were used per test
OSU 5.6.2
MLNX_OFED 5.0.2

Command example:

mpirun -np 2 -map-by ppr:1:node  -bind-to cpu-list:ordered  -cpu-list 80   -x UCX_NET_DEVICES=mlx5_2:1 osu_latency -i 10000 -x 10000

Command output example on Rome (HDR):

$ mpirun -np 2 -map-by ppr:1:node  -bind-to cpu-list:ordered  -cpu-list 80   -x UCX_NET_DEVICES=mlx5_2:1 osu_latency -i 10000 -x 10000
# OSU MPI Latency Test v5.6.2
# Size          Latency (us)
0                       1.07
1                       1.07
2                       1.07
4                       1.07
8                       1.07
16                      1.07
32                      1.17
64                      1.26
128                     1.31
256                     1.70
512                     1.94
1024                    2.27
2048                    2.27
4096                    2.80
8192                    3.44
16384                   4.58
32768                   6.56
65536                   9.36
131072                 15.19
262144                 16.48
524288                 27.46
1048576                50.23
2097152                95.84
4194304               175.50

osu_bw

This is a point to point benchmark.

This micro-benchmarks runs on two cores only.
Please use the local core to the adapter, in this case core 80.
HPC-X MPI version 2.7.0 was used
10000 iteration were used per test
OSU 5.6.2
Set NPS=1 (or 2) on the BIOS to reach line rate (more memory channels).

Command example:

mpirun -np 2 -map-by ppr:1:node  -bind-to cpu-list:ordered  -cpu-list 80   -x UCX_NET_DEVICES=mlx5_2:1 osu_bw -i 10000 -x 10000

Command output example on Rome (HDR):

mpirun -np 2 -map-by ppr:1:node  -bind-to cpu-list:ordered  -cpu-list 80   -x UCX_NET_DEVICES=mlx5_2:1 osu_bw -i 10000 -x 10000
# OSU MPI Bandwidth Test v5.6.2
# Size      Bandwidth (MB/s)
1                       3.90
2                       7.80
4                      15.53
8                      31.20
16                     62.34
32                    124.11
64                    243.01
128                   477.91
256                   900.70
512                  1593.69
1024                 3103.96
2048                 5299.51
4096                 7513.63
8192                10371.66
16384               16105.36
32768               19001.13
65536               22253.28
131072              23313.10
262144              23997.89
524288              24349.10
1048576             24532.09
2097152             24614.44
4194304             24636.01

osu_bibw

This is a point to point benchmark.

This micro-benchmarks runs on two cores only.
Please use the local core to the adapter, in this case core 80.
HPC-X MPI version 2.7.0 was used
10000 iteration were used per test
OSU 5.6.2
Set NPS=1 (or 2) on the BIOS to reach line rate (more memory channels).

Command example:

mpirun -np 2 -map-by ppr:1:node  -bind-to cpu-list:ordered  -cpu-list 90   -x UCX_NET_DEVICES=mlx5_2:1 osu_bibw -i 10000 -x 10000 -W 512

Command output example on Rome (HDR):

$ mpirun -np 2 -map-by ppr:1:node  -bind-to cpu-list:ordered  -cpu-list 90   -x UCX_NET_DEVICES=mlx5_2:1 osu_bibw -i 10000 -x 10000 -W 512 
# OSU MPI Bi-Directional Bandwidth Test v5.6.2
# Size      Bandwidth (MB/s)
1                       5.82
2                      11.63
4                      23.27
8                      46.62
16                     93.07
32                    185.89
64                    285.26
128                   559.38
256                  1143.45
512                  1761.26
1024                 3385.42
2048                 5512.57
4096                 9142.15
8192                15138.53
16384               21865.95
32768               30857.67
65536               39546.48
131072              43946.92
262144              46488.86
524288              47851.40
1048576             48518.97
2097152             48831.65
4194304             48942.90

osu_mbw_mr

Multi Bandwidth Message Rate test creates multiple pairs that are sending traffic each other. Each rank of those pairs are located on difference node (otherwise, it will be shared memory test).

This test is a point to point benchmark.

Tunables

PPN (Processes per node) - the selection of cores to participate in the test, here we recommend to use 64 cores on the socket that local to the adapter.
Window Size, the default is 64, for better message rate, use larger window (e.g. 512), for better bandwidth use smaller window (e.g. 32).

HPC-X MPI version 2.7.0 was used
10000 iteration were used per test
OSU 5.6.2
Set NPS=1 (or 2) on the BIOS to reach line rate (more memory channels)

Command example to use 64 local cores (second socket):

mpirun -np 128 -map-by ppr:64:node -rank-by core -bind-to cpu-list:ordered -cpu-list 64-127  -mca pml ucx -x UCX_NET_DEVICES=mlx5_2:1 osu_mbw_mr   -W 512

Command output example on Rome (HDR):

In this case, we are using 64 cores (pairs) of the second socket CPU, cores 67-128 (The adapter is located on NUMA 5), over two nodes. Window size is set to 512.

$ mpirun  -np 128  -map-by ppr:64:node -rank-by core -bind-to cpu-list:ordered -cpu-list 64-127  -mca pml ucx -x UCX_NET_DEVICES=mlx5_2:1 osu_mbw_mr   -W 512
# OSU MPI Multiple Bandwidth / Message Rate Test v5.6.2
# [ pairs: 64 ] [ window size: 512 ]
# Size                  MB/s        Messages/s
1                     194.22	194215337.24
2                     387.23	193615177.36
4                     773.95	193486570.81
8                    1545.46	193182287.83
16                   2976.90	186056309.00
32                   4297.78	134305707.67
64                   6344.98	99140297.19
128                  9758.37	76237262.16
256                 13561.93	52976275.87
512                 17913.93	34988135.77
1024                21370.83	20869955.40
2048                23158.19	11307707.78
4096                23462.98	5728265.86
8192                24260.12	2961439.84
16384               23698.72	1446455.16
32768               23653.46	721846.43
65536               23803.63	363214.64
131072              24523.31	187097.99
262144              24546.99	93639.35
524288              24557.37	46839.47
1048576             24571.09	23432.82
2097152             24579.22	11720.29
4194304             24561.79	5855.99

OSU Collectives

osu_barrier

This is a collective benchmark.

This micro-benchmark runs over multiple amount of nodes.
Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) technology can improve performance

Common run options

1 PPN
Full PPN

SHARP flags

Description	SHARP Flag
Enable SHARP (enables only LLT)	x HCOLL_ENABLE_SHARP=3
Enable Logging	x SHARP_COLL_LOG_LEVEL=3
Enable SHARP starting from 2 Nodes	x HCOLL_SHARP_NP=2 (default 4)

Command output example on Rome (HDR) over 1PPN, 8 nodes.

mpirun -np 8 -map-by ppr:1:node -rank-by core -bind-to cpu-list:ordered -cpu-list 80 -mca pml ucx -x UCX_NET_DEVICES=mlx5_2:1 -mca coll_hcoll_enable 1 -x HCOLL_ENABLE_SHARP=3 -x SHARP_COLL_ENABLE_SAT=0 osu_barrier -f -i 1000000 -x 1000000

# OSU MPI Barrier Latency Test v5.6.2
# Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations
             2.28              2.28              2.28     1000000

osu_allreduce

This is a collective benchmark.

This micro-benchmark runs over multiple amount of nodes.
Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) technology can improve performance

Common run options

1 PPN
Full PPN

SHARP Flags

Description	SHARP Flag
Enable SHARP (enables only LLT)	x HCOLL_ENABLE_SHARP=3
Enable streaming aggregation tree (SAT)	x SHARP_COLL_ENABLE_SAT=1
Enable Logging	x SHARP_COLL_LOG_LEVEL=3
HCOLL default uses up to 256 for SHARP (LLT), we can increase with more resources (SHARP_COLL_OSTS_PER_GROUP)	x HCOLL_BCOL_P2P_ALLREDUCE_SHARP_MAX=4096 (default 256)
Number of simultaneous fragments that can be pushed to the network	x SHARP_COLL_OSTS_PER_GROUP=8 (default 2)
SHARP LLT fragmentation size	x SHARP_COLL_JOB_QUOTA_PAYLOAD_PER_OST=1024 (default 128)
Use SAT in multi ppn case.	x HCOLL_ALLREDUCE_HYBRID_LB=1
Enable SHARP starting from 2 Nodes (for demos/small setups)	x HCOLL_SHARP_NP=2 (default 4)

In this example, 1K fragmentation size, multiplied by number of OST per group (8) gives you the max message size without waiting for OST credit.

Setting the HCOLL_BCOL_P2P_ALLREDUCE_SHARP_MAX to 4K will assume that LLT and SAT are overlapping and SHARP is being done for all message sizes.

Command output example on Rome (HDR) over 1PPN 8 nodes.

 mpirun -np 8 -map-by ppr:1:node -rank-by core -bind-to cpu-list:ordered -cpu-list 80 -mca pml ucx -x UCX_NET_DEVICES=mlx5_2:1 -mca coll_hcoll_enable 1 -x HCOLL_ENABLE_SHARP=3 -x SHARP_COLL_ENABLE_SAT=1 osu_allreduce -f -i 100000 -x 100000                                                                                                                                                          

# OSU MPI Allreduce Latency Test v5.6.2
# Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations
4                       2.41              2.40              2.42      100000
8                       2.41              2.39              2.43      100000
16                      2.48              2.44              2.50      100000
32                      2.53              2.51              2.55      100000
64                      2.53              2.51              2.54      100000
128                     2.48              2.47              2.50      100000
256                     2.99              2.98              3.01      100000
512                     4.26              4.22              4.29      100000
1024                    7.25              7.21              7.35      100000
2048                    8.78              8.75              8.82      100000
4096                    4.75              4.73              4.77      100000
8192                    5.00              4.97              5.03      100000
16384                   5.41              5.36              5.43      100000
32768                   6.07              6.04              6.09      100000
65536                   7.40              7.38              7.41      100000
131072                 10.01              9.99             10.02      100000
262144                 15.33             15.31             15.35      100000
524288                 26.47             26.44             26.49      100000
1048576                49.00             48.98             49.01      100000

osu_allgather

This is a collective benchmark.

This micro-benchmark runs over multiple amount of nodes.

Common run options

1 PPN
Full PPN

Command output example on Rome (HDR) for 1PPN over 8 nodes.

$  mpirun -np 8   -map-by ppr:1:node -rank-by core -bind-to cpu-list:ordered -cpu-list 80  -mca coll_hcoll_enable 1  -x UCX_NET_DEVICES=mlx5_2:1 osu_allgather

# OSU MPI Allgather Latency Test v5.6.2
# Size       Avg Latency(us)
1                       3.83
2                       3.76
4                       3.69
8                       3.84
16                      4.19
32                      4.30
64                      5.02
128                     5.77
256                     6.12
512                     7.09
1024                    8.58
2048                   10.70
4096                   13.71
8192                   20.36
16384                  41.43
32768                  73.01
65536                  49.43
131072                 70.34
262144                114.36
524288                204.57
1048576               385.29

OSU Point to Point Tests

osu_latency

osu_bw

osu_bibw

osu_mbw_mr

OSU Collectives

osu_barrier

osu_allreduce

osu_allgather

References