OSU Benchmark Tuning for 2nd Gen AMD EPYC using HDR InfiniBand over HPC-X MPI

This post will help you tune OSU benchmarks on servers based on 2nd Gen AMD EPYC™ processors (formerly codenamed “Rome”) to achieve maximum performance from ConnectX-6 HDR InfiniBand adapters using HPC-X MPI. The results below were established by testing on the “Daytona_X” reference platform with 2nd Gen AMD EPYC processors and ConnectX-6 HDR InfiniBand adapters.

 

The following tests were done with HPC-X version 2.7.0-pre and the Intel 2019 compilers.

 

 

OSU Point-to-Point Tests

osu_latency

 

This is a point-to-point benchmark.

  • This micro-benchmark runs on two cores only (basic latency and bandwidth).

  • Please use a core that is local to the adapter, in this case core 80 (a quick locality check is shown after this list).

  • HPC-X MPI version 2.7.0 was used

  • 10,000 iterations were used per test

  • OSU 5.6.2

  • MLNX_OFED 5.0.2
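
To find a core that is local to the adapter, the device's NUMA node and CPU list can be read from sysfs. This is a hedged check: mlx5_2 is the adapter name used throughout this post, adjust it to your setup.

$ cat /sys/class/infiniband/mlx5_2/device/numa_node
$ cat /sys/class/infiniband/mlx5_2/device/local_cpulist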

 

Command example:

mpirun -np 2 -map-by ppr:1:node -bind-to cpu-list:ordered -cpu-list 80 -x UCX_NET_DEVICES=mlx5_2:1 osu_latency -i 10000 -x 10000

 

 

Command output example on Rome (HDR):

$ mpirun -np 2 -map-by ppr:1:node -bind-to cpu-list:ordered -cpu-list 80 -x UCX_NET_DEVICES=mlx5_2:1 osu_latency -i 10000 -x 10000

# OSU MPI Latency Test v5.6.2
# Size          Latency (us)
0                       1.07
1                       1.07
2                       1.07
4                       1.07
8                       1.07
16                      1.07
32                      1.17
64                      1.26
128                     1.31
256                     1.70
512                     1.94
1024                    2.27
2048                    2.27
4096                    2.80
8192                    3.44
16384                   4.58
32768                   6.56
65536                   9.36
131072                 15.19
262144                 16.48
524288                 27.46
1048576                50.23
2097152                95.84
4194304               175.50

 

osu_bw

This is a point-to-point benchmark.

  • This micro-benchmark runs on two cores only.

  • Please use a core that is local to the adapter, in this case core 80.

  • HPC-X MPI version 2.7.0 was used

  • 10,000 iterations were used per test

  • OSU 5.6.2

  • Set NPS=1 (or 2) in the BIOS to reach line rate (more memory channels per NUMA domain); a quick way to verify the NPS setting is shown after this list.
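
The NPS setting can be verified from the OS with numactl (a hedged check: with NPS=1 each socket is reported as a single NUMA node, with NPS=4 it is reported as four):

$ numactl -H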

 

Command example:

mpirun -np 2 -map-by ppr:1:node -bind-to cpu-list:ordered -cpu-list 80 -x UCX_NET_DEVICES=mlx5_2:1 osu_bw -i 10000 -x 10000

 

Command output example on Rome (HDR):

mpirun -np 2 -map-by ppr:1:node -bind-to cpu-list:ordered -cpu-list 80 -x UCX_NET_DEVICES=mlx5_2:1 osu_bw -i 10000 -x 10000

# OSU MPI Bandwidth Test v5.6.2
# Size      Bandwidth (MB/s)
1                       3.90
2                       7.80
4                      15.53
8                      31.20
16                     62.34
32                    124.11
64                    243.01
128                   477.91
256                   900.70
512                  1593.69
1024                 3103.96
2048                 5299.51
4096                 7513.63
8192                10371.66
16384               16105.36
32768               19001.13
65536               22253.28
131072              23313.10
262144              23997.89
524288              24349.10
1048576             24532.09
2097152             24614.44
4194304             24636.01

 

osu_bibw

This is a point-to-point benchmark.

  • This micro-benchmark runs on two cores only.

  • Please use a core that is local to the adapter, in this case core 80.

  • HPC-X MPI version 2.7.0 was used

  • 10,000 iterations were used per test

  • OSU 5.6.2

  • Set NPS=1 (or 2) in the BIOS to reach line rate (more memory channels).

 

Command example:

mpirun -np 2 -map-by ppr:1:node -bind-to cpu-list:ordered -cpu-list 90 -x UCX_NET_DEVICES=mlx5_2:1 osu_bibw -i 10000 -x 10000 -W 512

 

Command output example on Rome (HDR):

$ mpirun -np 2 -map-by ppr:1:node -bind-to cpu-list:ordered -cpu-list 90 -x UCX_NET_DEVICES=mlx5_2:1 osu_bibw -i 10000 -x 10000 -W 512

# OSU MPI Bi-Directional Bandwidth Test v5.6.2
# Size      Bandwidth (MB/s)
1                       5.82
2                      11.63
4                      23.27
8                      46.62
16                     93.07
32                    185.89
64                    285.26
128                   559.38
256                  1143.45
512                  1761.26
1024                 3385.42
2048                 5512.57
4096                 9142.15
8192                15138.53
16384               21865.95
32768               30857.67
65536               39546.48
131072              43946.92
262144              46488.86
524288              47851.40
1048576             48518.97
2097152             48831.65
4194304             48942.90

 

osu_mbw_mr

The Multiple Bandwidth / Message Rate test creates multiple pairs that send traffic to each other. The two ranks of each pair are located on different nodes (otherwise, it becomes a shared-memory test).

This test is a point-to-point benchmark.

Tunables

  • PPN (processes per node) - the selection of cores participating in the test; here we recommend using the 64 cores on the socket that is local to the adapter.

  • Window size - the default is 64; for a better message rate use a larger window (e.g. 512), and for better bandwidth use a smaller window (e.g. 32). Both variants are shown below.

  • HPC-X MPI version 2.7.0 was used

  • 10,000 iterations were used per test

  • OSU 5.6.2

  • Set NPS=1 (or 2) in the BIOS to reach line rate (more memory channels)

 

Command example to use 64 local cores (second socket):

mpirun -np 128 -map-by ppr:64:node -rank-by core -bind-to cpu-list:ordered -cpu-list 64-127 -mca pml ucx -x UCX_NET_DEVICES=mlx5_2:1 osu_mbw_mr -W 512
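
For a bandwidth-oriented run of the same test, the window size can be reduced (a hedged variant of the command above, with the same mapping and binding assumptions):

mpirun -np 128 -map-by ppr:64:node -rank-by core -bind-to cpu-list:ordered -cpu-list 64-127 -mca pml ucx -x UCX_NET_DEVICES=mlx5_2:1 osu_mbw_mr -W 32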

 

Command output example on Rome (HDR):

In this case, we are using 64 pairs on the cores of the second-socket CPU, cores 64-127 (the adapter is located on NUMA node 5), over two nodes. Window size is set to 512.

$ mpirun -np 128 -map-by ppr:64:node -rank-by core -bind-to cpu-list:ordered -cpu-list 64-127 -mca pml ucx -x UCX_NET_DEVICES=mlx5_2:1 osu_mbw_mr -W 512

# OSU MPI Multiple Bandwidth / Message Rate Test v5.6.2
# [ pairs: 64 ] [ window size: 512 ]
# Size         MB/s        Messages/s
1            194.22     194215337.24
2            387.23     193615177.36
4            773.95     193486570.81
8           1545.46     193182287.83
16          2976.90     186056309.00
32          4297.78     134305707.67
64          6344.98      99140297.19
128         9758.37      76237262.16
256        13561.93      52976275.87
512        17913.93      34988135.77
1024       21370.83      20869955.40
2048       23158.19      11307707.78
4096       23462.98       5728265.86
8192       24260.12       2961439.84
16384      23698.72       1446455.16
32768      23653.46        721846.43
65536      23803.63        363214.64
131072     24523.31        187097.99
262144     24546.99         93639.35
524288     24557.37         46839.47
1048576    24571.09         23432.82
2097152    24579.22         11720.29
4194304    24561.79          5855.99

OSU Collectives

osu_barrier

 

This is a collective benchmark.

  • This micro-benchmark runs over multiple nodes.

  • Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) technology can improve performance

 

Common run options

  • 1 PPN

  • Full PPN

 

SHARP flags

  • Enable SHARP (enables only LLT): -x HCOLL_ENABLE_SHARP=3

  • Enable logging: -x SHARP_COLL_LOG_LEVEL=3

  • Enable SHARP starting from 2 nodes: -x HCOLL_SHARP_NP=2 (default 4)

 

Command output example on Rome (HDR) with 1 PPN over 8 nodes:

mpirun -np 8 -map-by ppr:1:node -rank-by core -bind-to cpu-list:ordered -cpu-list 80 -mca pml ucx -x UCX_NET_DEVICES=mlx5_2:1 -mca coll_hcoll_enable 1 -x HCOLL_ENABLE_SHARP=3 -x SHARP_COLL_ENABLE_SAT=0 osu_barrier -f -i 1000000 -x 1000000

# OSU MPI Barrier Latency Test v5.6.2
# Avg Latency(us)   Min Latency(us)   Max Latency(us)   Iterations
             2.28              2.28              2.28      1000000
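
A full-PPN run follows the same pattern. This is a hedged sketch, assuming 128 cores per node on the same 8 nodes; adjust -np and the ppr value to your core count:

mpirun -np 1024 -map-by ppr:128:node -rank-by core -bind-to core -mca pml ucx -x UCX_NET_DEVICES=mlx5_2:1 -mca coll_hcoll_enable 1 -x HCOLL_ENABLE_SHARP=3 osu_barrier -f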

 

osu_allreduce

 

This is a collective benchmark.

  • This micro-benchmark runs over multiple nodes.

  • Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) technology can improve performance

 

Common run options

  • 1 PPN

  • Full PPN

 

SHARP Flags

  • Enable SHARP (enables only LLT): -x HCOLL_ENABLE_SHARP=3

  • Enable streaming aggregation tree (SAT): -x SHARP_COLL_ENABLE_SAT=1

  • Enable logging: -x SHARP_COLL_LOG_LEVEL=3

  • HCOLL by default uses SHARP (LLT) for messages up to 256 bytes; this can be increased together with more resources (SHARP_COLL_OSTS_PER_GROUP): -x HCOLL_BCOL_P2P_ALLREDUCE_SHARP_MAX=4096 (default 256)

  • Number of simultaneous fragments that can be pushed to the network: -x SHARP_COLL_OSTS_PER_GROUP=8 (default 2)

  • SHARP LLT fragmentation size: -x SHARP_COLL_JOB_QUOTA_PAYLOAD_PER_OST=1024 (default 128)

  • Use SAT in the multi-PPN case: -x HCOLL_ALLREDUCE_HYBRID_LB=1

  • Enable SHARP starting from 2 nodes (for demos/small setups): -x HCOLL_SHARP_NP=2 (default 4)

In this example, the 1 KB fragmentation size multiplied by the number of OSTs per group (8) gives the maximum message size (8 KB) that can be pushed without waiting for an OST credit.

Setting HCOLL_BCOL_P2P_ALLREDUCE_SHARP_MAX to 4 KB assumes that LLT and SAT overlap, so SHARP is used for all message sizes.
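
A hedged example of combining these flags in one run, based on the 1 PPN command shown below; the additional -x variables are exactly the ones listed in the table above:

mpirun -np 8 -map-by ppr:1:node -rank-by core -bind-to cpu-list:ordered -cpu-list 80 -mca pml ucx -x UCX_NET_DEVICES=mlx5_2:1 -mca coll_hcoll_enable 1 -x HCOLL_ENABLE_SHARP=3 -x SHARP_COLL_ENABLE_SAT=1 -x HCOLL_BCOL_P2P_ALLREDUCE_SHARP_MAX=4096 -x SHARP_COLL_OSTS_PER_GROUP=8 -x SHARP_COLL_JOB_QUOTA_PAYLOAD_PER_OST=1024 osu_allreduce -f -i 100000 -x 100000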

 

Command output example on Rome (HDR) with 1 PPN over 8 nodes:

mpirun -np 8 -map-by ppr:1:node -rank-by core -bind-to cpu-list:ordered -cpu-list 80 -mca pml ucx -x UCX_NET_DEVICES=mlx5_2:1 -mca coll_hcoll_enable 1 -x HCOLL_ENABLE_SHARP=3 -x SHARP_COLL_ENABLE_SAT=1 osu_allreduce -f -i 100000 -x 100000

# OSU MPI Allreduce Latency Test v5.6.2
# Size    Avg Latency(us)   Min Latency(us)   Max Latency(us)   Iterations
4                    2.41              2.40              2.42       100000
8                    2.41              2.39              2.43       100000
16                   2.48              2.44              2.50       100000
32                   2.53              2.51              2.55       100000
64                   2.53              2.51              2.54       100000
128                  2.48              2.47              2.50       100000
256                  2.99              2.98              3.01       100000
512                  4.26              4.22              4.29       100000
1024                 7.25              7.21              7.35       100000
2048                 8.78              8.75              8.82       100000
4096                 4.75              4.73              4.77       100000
8192                 5.00              4.97              5.03       100000
16384                5.41              5.36              5.43       100000
32768                6.07              6.04              6.09       100000
65536                7.40              7.38              7.41       100000
131072              10.01              9.99             10.02       100000
262144              15.33             15.31             15.35       100000
524288              26.47             26.44             26.49       100000
1048576             49.00             48.98             49.01       100000

 

 

osu_allgather

 

This is a collective benchmark.

  • This micro-benchmark runs over multiple nodes.

 

Common run options

  • 1 PPN

  • Full PPN

 

Command output example on Rome (HDR) for 1PPN over 8 nodes.

$ mpirun -np 8 -map-by ppr:1:node -rank-by core -bind-to cpu-list:ordered -cpu-list 80 -mca coll_hcoll_enable 1 -x UCX_NET_DEVICES=mlx5_2:1 osu_allgather

# OSU MPI Allgather Latency Test v5.6.2
# Size    Avg Latency(us)
1                    3.83
2                    3.76
4                    3.69
8                    3.84
16                   4.19
32                   4.30
64                   5.02
128                  5.77
256                  6.12
512                  7.09
1024                 8.58
2048                10.70
4096                13.71
8192                20.36
16384               41.43
32768               73.01
65536               49.43
131072              70.34
262144             114.36
524288             204.57
1048576            385.29
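
A full-PPN run can be launched in the same way (a hedged sketch, assuming 128 cores per node on the same 8 nodes; adjust -np and the ppr value to your core count):

mpirun -np 1024 -map-by ppr:128:node -rank-by core -bind-to core -mca coll_hcoll_enable 1 -x UCX_NET_DEVICES=mlx5_2:1 osu_allgather -f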

 
