OSU Benchmark Tuning for 2nd Gen AMD EPYC using HDR InfiniBand over HPC-X MPI
This post will help you tune OSU benchmarks on servers based on 2nd Gen AMD EPYC™ processors (formerly codenamed “Rome”) to achieve maximum performance from ConnectX-6 HDR InfiniBand adapters using HPC-X MPI. It is based on testing performed on the “Daytona_X” reference platform with 2nd Gen AMD EPYC processors and ConnectX-6 HDR InfiniBand adapters.
The following tests were run with HPC-X version 2.7.0-pre and the Intel 2019 compilers.
- 1 OSU Point-to-Point Tests
  - 1.1 osu_latency
  - 1.2 osu_bw
  - 1.3 osu_bibw
  - 1.4 osu_mbw_mr
- 2 OSU Collectives
  - 2.1 osu_barrier
  - 2.2 osu_allreduce
  - 2.3 osu_allgather
- 3 References
OSU Point-to-Point Tests
osu_latency
This is a point-to-point micro-benchmark (basic latency); it runs on two cores only, one per node.
- Use the core local to the adapter, in this case core 80 (a quick way to check adapter locality is shown after this list).
- HPC-X MPI version 2.7.0 was used.
- 10000 iterations were used per test.
- OSU micro-benchmarks 5.6.2.
- MLNX_OFED 5.0.2.
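To find the core local to the adapter, check which NUMA node the HCA is attached to and which cores belong to that node. A minimal sketch, assuming the adapter is mlx5_2 as in the examples below:

$ cat /sys/class/infiniband/mlx5_2/device/numa_node
$ numactl --hardware | grep cpus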
Command example:
mpirun -np 2 -map-by ppr:1:node -bind-to cpu-list:ordered -cpu-list 80 -x UCX_NET_DEVICES=mlx5_2:1 osu_latency -i 10000 -x 10000
Command output example on Rome (HDR):
$ mpirun -np 2 -map-by ppr:1:node -bind-to cpu-list:ordered -cpu-list 80 -x UCX_NET_DEVICES=mlx5_2:1 osu_latency -i 10000 -x 10000
# OSU MPI Latency Test v5.6.2
# Size Latency (us)
0 1.07
1 1.07
2 1.07
4 1.07
8 1.07
16 1.07
32 1.17
64 1.26
128 1.31
256 1.70
512 1.94
1024 2.27
2048 2.27
4096 2.80
8192 3.44
16384 4.58
32768 6.56
65536 9.36
131072 15.19
262144 16.48
524288 27.46
1048576 50.23
2097152 95.84
4194304 175.50
osu_bw
This is a point-to-point micro-benchmark; it runs on two cores only, one per node.
- Use the core local to the adapter, in this case core 80.
- HPC-X MPI version 2.7.0 was used.
- 10000 iterations were used per test.
- OSU micro-benchmarks 5.6.2.
- Set NPS=1 (or 2) in the BIOS to reach line rate (more memory channels per NUMA domain).
Command example:
mpirun -np 2 -map-by ppr:1:node -bind-to cpu-list:ordered -cpu-list 80 -x UCX_NET_DEVICES=mlx5_2:1 osu_bw -i 10000 -x 10000
Command output example on Rome (HDR):
osu_bibw
This is a point-to-point micro-benchmark; it runs on two cores only, one per node.
- Use the core local to the adapter, in this case core 80.
- HPC-X MPI version 2.7.0 was used.
- 10000 iterations were used per test.
- OSU micro-benchmarks 5.6.2.
- Set NPS=1 (or 2) in the BIOS to reach line rate (more memory channels per NUMA domain).
Command example:
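A sketch following the same pattern as the osu_bw command above, with the same placement assumptions (adapter mlx5_2, local core 80):

$ mpirun -np 2 -map-by ppr:1:node -bind-to cpu-list:ordered -cpu-list 80 -x UCX_NET_DEVICES=mlx5_2:1 osu_bibw -i 10000 -x 10000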
Command output example on Rome (HDR):
osu_mbw_mr
The Multiple Bandwidth / Message Rate test creates multiple pairs of ranks that send traffic to each other. The two ranks of each pair must be located on different nodes (otherwise it becomes a shared-memory test).
This is a point-to-point benchmark.
Tunables
- PPN (processes per node): the selection of cores that participate in the test; we recommend using the 64 cores of the socket local to the adapter.
- Window size: the default is 64. For a better message rate use a larger window (e.g. 512); for better bandwidth use a smaller window (e.g. 32).
- HPC-X MPI version 2.7.0 was used.
- 10000 iterations were used per test.
- OSU micro-benchmarks 5.6.2.
- Set NPS=1 (or 2) in the BIOS to reach line rate (more memory channels per NUMA domain).
Command example to use 64 local cores (second socket):
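A sketch of a possible command, assuming cores 64-127 make up the second socket on each node and the adapter is mlx5_2 (adjust the core range to your NUMA layout; -W sets the osu_mbw_mr window size):

$ mpirun -np 128 -map-by ppr:64:node -bind-to cpu-list:ordered -cpu-list 64-127 -x UCX_NET_DEVICES=mlx5_2:1 osu_mbw_mr -W 512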
Command output example on Rome (HDR):
In this case we use 64 cores per node (64 pairs) from the second-socket CPU, cores 67-128 (the adapter is located on NUMA node 5), across two nodes. The window size is set to 512.
OSU Collectives
osu_barrier
This is a collective micro-benchmark that runs over multiple nodes.
Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) technology can improve its performance.
Common run options:
- 1 PPN
- Full PPN
SHARP flags
Description | SHARP Flag |
---|---|
Enable SHARP (enables only LLT) | |
Enable Logging | |
Enable SHARP starting from 2 Nodes | |
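A sketch of a possible 1 PPN run over 8 nodes; HCOLL_ENABLE_SHARP enables SHARP in HCOLL, and the remaining SHARP flags from the table above can be added with further -x options:

$ mpirun -np 8 -map-by ppr:1:node -x UCX_NET_DEVICES=mlx5_2:1 -x HCOLL_ENABLE_SHARP=3 osu_barrier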
Command output example on Rome (HDR) over 1PPN, 8 nodes.
osu_allreduce
This is a collective micro-benchmark that runs over multiple nodes.
Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) technology can improve its performance.
Common run options:
- 1 PPN
- Full PPN
SHARP Flags
Description | SHARP Flag |
---|---|
Enable SHARP (enables only LLT) | |
Enable streaming aggregation tree (SAT) | |
Enable Logging | |
HCOLL defaults to using SHARP (LLT) for messages up to 256; this can be increased given more resources (SHARP_COLL_OSTS_PER_GROUP) | |
Number of simultaneous fragments that can be pushed to the network | |
SHARP LLT fragmentation size | |
Use SAT in the multi-PPN case | |
Enable SHARP starting from 2 Nodes (for demos/small setups) | |
In this example, the 1K fragmentation size multiplied by the number of OSTs per group (8) gives the maximum message size that can be sent without waiting for an OST credit.
Setting HCOLL_BCOL_P2P_ALLREDUCE_SHARP_MAX to 4K assumes that LLT and SAT overlap and that SHARP is used for all message sizes.
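Putting these together, a sketch of a possible 1 PPN run over 8 nodes; the values chosen here (SHARP enforcement level, 8 OSTs per group, 4K LLT maximum) are illustrative and should be tuned for your fabric:

$ mpirun -np 8 -map-by ppr:1:node -x UCX_NET_DEVICES=mlx5_2:1 -x HCOLL_ENABLE_SHARP=3 -x SHARP_COLL_OSTS_PER_GROUP=8 -x HCOLL_BCOL_P2P_ALLREDUCE_SHARP_MAX=4096 osu_allreduce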
Command output example on Rome (HDR) over 1PPN 8 nodes.
osu_allgather
This is a collective micro-benchmark that runs over multiple nodes.
Common run options:
- 1 PPN
- Full PPN
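A sketch of a possible 1 PPN run over 8 nodes, assuming the same adapter (mlx5_2) as in the earlier examples:

$ mpirun -np 8 -map-by ppr:1:node -x UCX_NET_DEVICES=mlx5_2:1 osu_allgather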
Command output example on Rome (HDR) for 1PPN over 8 nodes.
References