This post shows a few common RDMA benchmark examples for servers based on the AMD 2nd Generation EPYC™ CPU (formerly codenamed “Rome”), and how to achieve maximum performance with ConnectX-6 HDR InfiniBand adapters. It is based on testing performed on the AMD Daytona_X reference platform with 2nd Gen EPYC CPUs and ConnectX-6 HDR InfiniBand adapters.
Before starting, please follow the AMD Rome 2nd Gen EPYC CPU Tuning Guide for InfiniBand HPC to tune your cluster for best performance. Make sure the cluster parameters are set for high performance, use the latest firmware and driver, and pin the benchmark to a core close to the adapter on the local NUMA node; see HowTo Find the local NUMA node in AMD EPYC Servers and the quick sysfs check below.
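As a quick way to verify which NUMA node and cores are local to the adapter, the device sysfs entries can be read directly. This is a minimal sketch; mlx5_2 is the device name used throughout this post, so adjust it to your setup.
Code Block |
---|
# NUMA node the HDR adapter (mlx5_2) is attached to
cat /sys/class/infiniband/mlx5_2/device/numa_node
# CPU cores that belong to that NUMA node
cat /sys/class/infiniband/mlx5_2/device/local_cpulist
# full NUMA topology of the server
numactl --hardware |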
RDMA testing is important to run before any application or micro-benchmark testing, as it shows the low-level capabilities of your fabric.
RDMA Write Benchmarks
RDMA Write Latency (ib_write_lat)
To check RDMA Write latency, follow these notes:
Use the core local to the HCA; in this example the HDR InfiniBand adapter is local to core 80.
More iterations make the output smoother. In this example, we are using 10000 iterations.
...
Code Block | ||
---|---|---|
| ||
$ numactl --physcpubind=80 ib_write_lat -a -d mlx5_2 -i 1 --report_gbits -F -n 10000 & ssh rome008 numactl --physcpubind=80 ib_write_lat -a -d mlx5_2 -i 1 --report_gbits -F rome007 -n 10000
[1] 59440
************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
                    RDMA_Write Latency Test
 Dual-port       : OFF          Device         : mlx5_2
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 220[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0xba QPN 0x0134 PSN 0x597b96 RKey 0x01bb9f VAddr 0x002b1ec3800000
 remote address: LID 0xd1 QPN 0x0135 PSN 0x18ae22 RKey 0x019a74 VAddr 0x002ab499400000
---------------------------------------------------------------------------------------
 #bytes  #iterations  t_min[usec]  t_max[usec]  t_typical[usec]  t_avg[usec]  t_stdev[usec]  99% percentile[usec]  99.9% percentile[usec]
 ...
 32       10000        1.01         3.17         1.04             1.05         0.04           1.06                  2.00
 64       10000        1.02         2.95         1.05             1.05         0.05           1.06                  1.83
 128      10000        1.05         3.15         1.08             1.08         0.04           1.10                  1.60
 256      10000        1.50         3.69         1.55             1.55         0.04           1.57                  2.02
 512      10000        1.55         3.67         1.59             1.59         0.04           1.61                  2.29
 1024     10000        1.61         3.63         1.65             1.66         0.03           1.68                  2.14
 2048     10000        1.73         3.33         1.77             1.77         0.04           1.79                  2.26
 4096     10000        2.04         4.15         2.07             2.07         0.04           2.09                  2.86
 8192     10000        2.37         3.79         2.42             2.42         0.03           2.45                  2.99
 16384    10000        2.93         4.32         3.01             3.01         0.03           3.07                  3.50
 32768    10000        3.87         4.94         3.95             3.96         0.04           4.05                  4.29
 65536    10000        5.21         8.51         5.28             5.30         0.07           5.41                  6.34
 131072   10000        7.85         9.11         7.93             7.94         0.04           8.05                  8.33
 262144   10000        13.15        14.30        13.23            13.24        0.04           13.36                 13.68
 524288   10000        23.74        25.17        23.83            23.84        0.04           23.93                 24.30
 1048576  10000        44.92        46.46        45.01            45.03        0.05           45.16                 45.35
 2097152  10000        87.33        88.77        87.41            87.42        0.04           87.53                 87.88
 4194304  10000        172.10       180.25       172.21           172.27       0.48           173.21                180.00
 8388608  10000        342.33       394.65       342.52           345.79       6.22           356.91                380.58
--------------------------------------------------------------------------------------- |
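For large messages the RDMA Write latency is dominated by the time to serialize the message onto the wire, so it can be sanity-checked against the measured bandwidth. A rough estimate, assuming the ~197 Gb/s achieved in the ib_write_bw results below:
Code Block |
---|
# expected wire time for an 8 MiB (8388608 byte) RDMA Write at ~197 Gb/s, in microseconds
awk 'BEGIN { printf "%.1f usec\n", 8388608 * 8 / 197e9 * 1e6 }' |
This gives roughly 341 usec, in line with the ~342 usec typical latency reported for the 8 MiB message size above.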
RDMA Write Bandwidth (ib_write_bw)
To check RDMA Write bandwidth, follow these notes:
Make sure to use the core local to the HCA; in this example the HDR InfiniBand adapter is local to core 80.
More iterations make the output smoother. In this example, we are using 10000 iterations.
The NPS (NUMA nodes Per Socket) configuration should be set to 1 (or 2) for maximum HDR bandwidth; see the quick check after these notes.
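A quick way to confirm the NPS setting from the OS is to count the NUMA nodes the kernel exposes per socket. This is a minimal sketch; with NPS=1 a dual-socket Rome server reports 2 NUMA nodes, while NPS=4 reports 8.
Code Block |
---|
# sockets and NUMA nodes reported by the OS
lscpu | grep -E "Socket\(s\)|NUMA node\(s\)" |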
Command Example:
Code Block |
---|
# numactl --physcpubind=80 ib_write_bw -a -d mlx5_2 -i 1 --report_gbits -F -n 10000 &
ssh rome002 numactl --physcpubind=80 ib_write_bw -a -d mlx5_2 -i 1 --report_gbits -F rome001 -n 10000 |
Output example, tested on a Rome cluster.
Code Block | ||
---|---|---|
| ||
$ numactl --physcpubind=80 ib_write_bw -a -d mlx5_2 -i 1 --report_gbits -F -n 10000 & ssh rome008 numactl --physcpubind=80 ib_write_bw -a -d mlx5_2 -i 1 --report_gbits -F rome007 -n 10000
[1] 59777
************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : mlx5_2
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
CQ Moderation : 100
Mtu : 4096[B]
Link type : IB
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0xba QPN 0x0135 PSN 0xddd5cc RKey 0x01c3be VAddr 0x002b948b000000
remote address: LID 0xd1 QPN 0x0136 PSN 0x8fb73f RKey 0x019c76 VAddr 0x002ab363c00000
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps]
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : mlx5_2
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
TX depth : 128
CQ Moderation : 100
Mtu : 4096[B]
Link type : IB
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0xd1 QPN 0x0136 PSN 0x8fb73f RKey 0x019c76 VAddr 0x002ab363c00000
remote address: LID 0xba QPN 0x0135 PSN 0xddd5cc RKey 0x01c3be VAddr 0x002b948b000000
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps]
2 10000 0.066914 0.066468 4.154222
4 10000 0.13 0.13 4.157486
8 10000 0.27 0.27 4.164931
16 10000 0.53 0.53 4.147037
32 10000 1.07 1.07 4.163076
64 10000 2.13 2.12 4.132880
128 10000 4.27 4.26 4.157987
256 10000 8.55 8.51 4.157105
512 10000 17.07 17.00 4.150256
1024 10000 33.82 33.57 4.097722
2048 10000 67.27 66.95 4.086370
4096 10000 133.57 133.17 4.064149
8192 10000 186.65 186.58 2.846967
16384 10000 192.50 192.38 1.467769
32768 10000 197.07 197.06 0.751713
65536 10000 196.64 196.62 0.375025
131072 10000 197.48 197.47 0.188319
262144 10000 197.54 197.53 0.094191
524288 10000 197.57 197.54 0.047097
1048576 10000 197.57 197.54 0.023549
2097152 10000 197.58 197.56 0.011775
4194304 10000 197.58 197.57 0.005888
8388608 10000 197.55 197.53 0.002943
---------------------------------------------------------------------------------------
|
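The MsgRate column in the table above is simply the average bandwidth divided by the message size in bits, which is a convenient cross-check when reading the results. A small arithmetic sketch using the 32768-byte row:
Code Block |
---|
# MsgRate[Mpps] = BW_average[Gb/sec] * 1e9 / (message size in bytes * 8) / 1e6
awk 'BEGIN { printf "%.6f Mpps\n", 197.06e9 / (32768 * 8) / 1e6 }' |
This yields about 0.7517 Mpps, closely matching the 0.751713 Mpps reported for 32KB messages.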
RDMA Write Bi-Directional Bandwidth (ib_write_bw -b)
To check RDMA Write bi-directional bandwidth, follow these notes:
Make sure to use the core local to the HCA; in this example the HDR InfiniBand adapter is local to core 80.
More iterations make the output smoother. In this example, we are using 10000 iterations.
Expect RDMA Write bandwidth to reach line rate at around a 32KB message size for the Rome 7742 (2.25GHz) with an HDR InfiniBand adapter over a single HDR switch; a quick way to confirm the port is running at HDR speed is shown after these notes.
The NPS (NUMA nodes Per Socket) configuration should be set to 1 (or 2) for maximum HDR bandwidth.
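Before expecting close to 2x 200 Gb/s in the bi-directional test, it is worth confirming that the port is actually up at HDR speed. A minimal sketch, assuming the ibstat tool from infiniband-diags is installed and the device is mlx5_2:
Code Block |
---|
# link state and rate of the HDR port (expect State: Active and Rate: 200)
ibstat mlx5_2 | grep -E "State|Rate" |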
Command Example:
Code Block |
---|
# numactl --physcpubind=80 ib_write_bw -a -d mlx5_2 -i 1 --report_gbits -F -n 10000 -b &
ssh rome002 numactl --physcpubind=80 ib_write_bw -a -d mlx5_2 -i 1 --report_gbits -F rome001 -n 10000 -b |
Output example, tested on a Rome cluster.
Code Block | ||
---|---|---|
| ||
$ numactl --physcpubind=80 ib_write_bw -a -d mlx5_2 -i 1 --report_gbits -F -n 10000 -b
************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
                    RDMA_Write Bidirectional BW Test
 Dual-port       : OFF          Device         : mlx5_2
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 TX depth        : 128
 CQ Moderation   : 100
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0xba QPN 0x0138 PSN 0x210c6 RKey 0x01e1ef VAddr 0x002b7835800000
 remote address: LID 0xd1 QPN 0x0139 PSN 0x99d1e6 RKey 0x01dae9 VAddr 0x002abd1dc00000
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]
 2          10000          0.133457           0.132682             8.292601
 4          10000          0.27               0.27                 8.312058
 8          10000          0.54               0.53                 8.270148
 16         10000          1.07               1.06                 8.289442
 32         10000          2.13               2.12                 8.287524
 64         10000          4.24               4.22                 8.251245
 128        10000          8.53               8.50                 8.301570
 256        10000          16.99              16.93                8.264573
 512        10000          33.94              33.83                8.258309
 1024       10000          66.96              66.53                8.120855
 2048       10000          132.49             131.87               8.048677
 4096       10000          261.92             261.10               7.968046
 8192       10000          358.14             357.88               5.460775
 16384      10000          379.43             379.22               2.893244
 32768      10000          391.65             391.56               1.493699
 65536      10000          391.03             390.98               0.745735
 131072     10000          393.45             393.42               0.375196
 262144     10000          393.68             393.66               0.187714
 524288     10000          393.78             393.76               0.093879
 1048576    10000          393.85             393.80               0.046945
 2097152    10000          393.88             393.82               0.023473
 4194304    10000          393.87             393.82               0.011737
 8388608    10000          393.85             393.83               0.005869
--------------------------------------------------------------------------------------- |
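As expected, the bi-directional result is roughly twice the uni-directional one. A quick check against the ~197.5 Gb/s measured in the uni-directional test above:
Code Block |
---|
# ratio of bi-directional to uni-directional average bandwidth at 8 MiB
awk 'BEGIN { printf "%.2f\n", 393.83 / 197.53 }' |
This comes out to about 1.99.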
Note: All trademarks are property of their respective owners. All information is provided “As-Is” without any kind of warranty. The HPC-AI Advisory Council makes no representation to the accuracy and completeness of the information contained herein. HPC-AI Advisory Council undertakes no duty and assumes no obligation to update or correct any information presented herein.