This post shows a few common RDMA benchmark examples for servers based on the AMD 2nd Generation EPYC™ CPU (formerly codenamed “Rome”) to achieve maximum performance with ConnectX-6 HDR InfiniBand adapters. It is based on testing over the AMD Daytona_X reference platform with 2nd Gen EPYC CPUs and ConnectX-6 HDR InfiniBand adapters.
Before starting, please follow the AMD 2nd Gen EPYC CPU (Rome) Tuning Guide for InfiniBand HPC to set the cluster parameters for best performance. Make sure to use the latest firmware and driver, and pin the benchmark to a core close to the adapter on its local NUMA node; see HowTo Find the local NUMA node in AMD EPYC Servers. A quick sketch for locating such a core is shown below.
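As a quick illustration (a minimal sketch, assuming the adapter device name used later in this post, mlx5_2), the adapter's local NUMA node and the cores on that node can be read from sysfs and lscpu:
Code Block:
# Sketch: print the NUMA node local to the HCA and the CPU cores on that node.
# Assumes the device name used in this post (mlx5_2); adjust for your adapter.
DEV=mlx5_2
NODE=$(cat /sys/class/infiniband/${DEV}/device/numa_node)
echo "${DEV} is local to NUMA node ${NODE}"
# List the cores belonging to that NUMA node; pick one of them for numactl --physcpubind.
lscpu -p=CPU,NODE | awk -F, -v n="${NODE}" '$0 !~ /^#/ && $2 == n {printf "%s ", $1} END {print ""}'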
RDMA testing is important to run before any application or micro-benchmark testing, as it shows the low-level capabilities of your fabric.
RDMA Write Benchmarks
RDMA Write Latency (ib_write_lat)
To check RDMA Write latency, please follow these notes:
Use the core local to the HCA. In this example, the HDR InfiniBand adapter is local to core 80.
More iterations help smooth the output. In this example, we are using 10000 iterations.
Expected RDMA Write latency is around 1 usec (0.97-1.02 usec) for the Rome 7742 2.25GHz using an HDR InfiniBand adapter over a single HDR switch.
NPS configuration is not critical here.
Command Example:
Code Block:
# numactl --physcpubind=80 ib_write_lat -a -d mlx5_2 -i 1 --report_gbits -F -n 10000 &
ssh rome002 numactl --physcpubind=80 ib_write_lat -a -d mlx5_2 -i 1 --report_gbits -F rome001 -n 10000
...
Code Block:
$ numactl --physcpubind=80 ib_write_lat -a -d mlx5_2 -i 1 --report_gbits -F -n 10000 & ssh rome008 numactl --physcpubind=80 ib_write_lat -a -d mlx5_2 -i 1 --report_gbits -F rome007 -n 10000
[1] 59440
************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
                    RDMA_Write Latency Test
 Dual-port       : OFF          Device         : mlx5_2
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 220[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0xba QPN 0x0134 PSN 0x597b96 RKey 0x01bb9f VAddr 0x002b1ec3800000
 remote address: LID 0xd1 QPN 0x0135 PSN 0x18ae22 RKey 0x019a74 VAddr 0x002ab499400000
---------------------------------------------------------------------------------------
 #bytes #iterations  t_min[usec]  t_max[usec]  t_typical[usec]  t_avg[usec]  t_stdev[usec]  99% percentile[usec]  99.9% percentile[usec]
 16       10000        0.98         2.78         1.01             1.01         0.04           1.03                  1.50
 32       10000        1.01         3.17         1.04             1.05         0.04           1.06                  2.00
 64       10000        1.02         2.95         1.05             1.05         0.05           1.06                  1.83
 128      10000        1.05         3.15         1.08             1.08         0.04           1.10                  1.60
 256      10000        1.50         3.69         1.55             1.55         0.04           1.57                  2.02
 512      10000        1.55         3.67         1.59             1.59         0.04           1.61                  2.29
 1024     10000        1.61         3.63         1.65             1.66         0.03           1.68                  2.14
 2048     10000        1.73         3.33         1.77             1.77         0.04           1.79                  2.26
 4096     10000        2.04         4.15         2.07             2.07         0.04           2.09                  2.86
 8192     10000        2.37         3.79         2.42             2.42         0.03           2.45                  2.99
 16384    10000        2.93         4.32         3.01             3.01         0.03           3.07                  3.50
 32768    10000        3.87         4.94         3.95             3.96         0.04           4.05                  4.29
 65536    10000        5.21         8.51         5.28             5.30         0.07           5.41                  6.34
 131072   10000        7.85         9.11         7.93             7.94         0.04           8.05                  8.33
 262144   10000        13.15        14.30        13.23            13.24        0.04           13.36                 13.68
 524288   10000        23.74        25.17        23.83            23.84        0.04           23.93                 24.30
 1048576  10000        44.92        46.46        45.01            45.03        0.05           45.16                 45.35
 2097152  10000        87.33        88.77        87.41            87.42        0.04           87.53                 87.88
 4194304  10000        172.10       180.25       172.21           172.27       0.48           173.21                180.00
 8388608  10000        342.33       394.65       342.52           345.79       6.22           356.91                380.58
---------------------------------------------------------------------------------------
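To sanity-check a run against the ~1 usec expectation above, the t_typical column can be pulled out of a saved log with a short awk filter. This is only a sketch: it assumes the output was redirected to a file named write_lat.log, which is not part of the original commands.
Code Block:
# Sketch: extract message size and typical latency (usec) from a saved ib_write_lat run
# and flag small messages whose typical latency is far above the ~1 usec expectation.
awk '/^[[:space:]]*[0-9]+[[:space:]]+10000[[:space:]]/ {
         size = $1; typ = $5;    # columns: bytes, iterations, t_min, t_max, t_typical, ...
         if (size <= 128 && typ > 1.2)
             printf "WARNING: %s bytes -> %s usec typical\n", size, typ;
         else
             printf "OK: %s bytes -> %s usec typical\n", size, typ;
     }' write_lat.log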
RDMA Write Bandwidth (ib_write_bw)
To check RDMA Write bandwidth, follow these notes:
Make sure to use the core local to the HCA. In this example, the HDR InfiniBand adapter is local to core 80.
More iterations help smooth the output. In this example, we are using 10000 iterations.
Expected RDMA Write bandwidth reaches line rate at around 8-16KB message sizes for the Rome 7742 2.25GHz using an HDR InfiniBand adapter over a single HDR switch.
The NPS configuration should be set to 1 (or 2) for HDR to reach maximum bandwidth; a quick way to check the current setting is shown in the sketch below.
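As a rough check (a sketch, assuming NPS is configured uniformly across sockets), the effective NUMA-per-socket value can be derived from lscpu by dividing the NUMA node count by the socket count:
Code Block:
# Sketch: derive the effective NPS (NUMA nodes per socket) from lscpu output.
SOCKETS=$(lscpu | awk -F: '/^Socket\(s\)/ {gsub(/ /, "", $2); print $2}')
NODES=$(lscpu | awk -F: '/^NUMA node\(s\)/ {gsub(/ /, "", $2); print $2}')
echo "Sockets: ${SOCKETS}, NUMA nodes: ${NODES}, NPS: $((NODES / SOCKETS))"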
Command Example:
Code Block:
# numactl --physcpubind=80 ib_write_bw -a -d mlx5_2 -i 1 --report_gbits -F -n 10000 &
ssh rome002 numactl --physcpubind=80 ib_write_bw -a -d mlx5_2 -i 1 --report_gbits -F rome001 -n 10000
Output example, tested on the Rome cluster.
Code Block:
$ numactl --physcpubind=80 ib_write_bw -a -d mlx5_2 -i 1 --report_gbits -F -n 10000 & ssh rome008 numactl --physcpubind=80 ib_write_bw -a -d mlx5_2 -i 1 --report_gbits -F rome007 -n 10000
[1] 59777
************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF          Device         : mlx5_2
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 TX depth        : 128
 CQ Moderation   : 100
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0xba QPN 0x0135 PSN 0xddd5cc RKey 0x01c3be VAddr 0x002b948b000000
 remote address: LID 0xd1 QPN 0x0136 PSN 0x8fb73f RKey 0x019c76 VAddr 0x002ab363c00000
---------------------------------------------------------------------------------------
 #bytes  #iterations  BW peak[Gb/sec]  BW average[Gb/sec]  MsgRate[Mpps]
 2           10000       0.066914         0.066468           4.154222
 4           10000       0.13             0.13               4.157486
 8           10000       0.27             0.27               4.164931
 16          10000       0.53             0.53               4.147037
 32          10000       1.07             1.07               4.163076
 64          10000       2.13             2.12               4.132880
 128         10000       4.27             4.26               4.157987
 256         10000       8.55             8.51               4.157105
 512         10000       17.07            17.00              4.150256
 1024        10000       33.82            33.57              4.097722
 2048        10000       67.27            66.95              4.086370
 4096        10000       133.57           133.17             4.064149
 8192        10000       186.65           186.58             2.846967
 16384       10000       192.50           192.38             1.467769
 32768       10000       197.07           197.06             0.751713
 65536       10000       196.64           196.62             0.375025
 131072      10000       197.48           197.47             0.188319
 262144      10000       197.54           197.53             0.094191
 524288      10000       197.57           197.54             0.047097
 1048576     10000       197.57           197.54             0.023549
 2097152     10000       197.58           197.56             0.011775
 4194304     10000       197.58           197.57             0.005888
 8388608     10000       197.55           197.53             0.002943
---------------------------------------------------------------------------------------
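The MsgRate column follows directly from the average bandwidth and the message size (MsgRate[Mpps] = BW[Gb/s] * 1e9 / 8 / message_size / 1e6). As a quick worked check, the 8192-byte row at 186.58 Gb/s corresponds to about 2.85 Mpps; here is a minimal sketch of the arithmetic (not part of perftest):
Code Block:
# Sketch: message rate (Mpps) implied by an average bandwidth (Gb/s) and message size (bytes).
bw_gbps=186.58
size=8192
awk -v bw="${bw_gbps}" -v sz="${size}" 'BEGIN { printf "%.6f Mpps\n", bw * 1e9 / 8 / sz / 1e6 }'
# Prints roughly 2.847 Mpps, matching the 8192-byte row in the output above.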
RDMA Write Bi-Directional Bandwidth (ib_write_bw -b)
To check RDMA Write bi-directional bandwidth, follow these notes:
Please use the core local to the HCA. In this example, the HDR InfiniBand adapter is local to core 80.
More iterations help smooth the output. In this example, we are using 10000 iterations.
The NPS configuration should be set to 1 (or 2) for HDR to reach maximum bandwidth.
Command Example:
Code Block:
# numactl --physcpubind=80 ib_write_bw -a -d mlx5_2 -i 1 --report_gbits -F -n 10000 -b &
ssh rome002 numactl --physcpubind=80 ib_write_bw -a -d mlx5_2 -i 1 --report_gbits -F rome001 -n 10000 -b
Output example, tested on the Rome cluster.
Code Block:
$ numactl --physcpubind=80 ib_write_bw -a -d mlx5_2 -i 1 --report_gbits -F -n 10000 -b
************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
                    RDMA_Write Bidirectional BW Test
 Dual-port       : OFF          Device         : mlx5_2
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 TX depth        : 128
 CQ Moderation   : 100
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0xba QPN 0x0138 PSN 0x210c6 RKey 0x01e1ef VAddr 0x002b7835800000
 remote address: LID 0xd1 QPN 0x0139 PSN 0x99d1e6 RKey 0x01dae9 VAddr 0x002abd1dc00000
---------------------------------------------------------------------------------------
 #bytes  #iterations  BW average[Gb/sec]
 2           10000       0.132682
 4           10000       0.27
 8           10000       0.53
 16          10000       1.06
 32          10000       2.12
 64          10000       4.22
 128         10000       8.50
 256         10000       16.93
 512         10000       33.83
 1024        10000       66.53
 2048        10000       131.87
 4096        10000       261.10
 8192        10000       357.88
 16384       10000       379.22
 32768       10000       391.56
 65536       10000       390.98
 131072      10000       393.42
 262144      10000       393.66
 524288      10000       393.76
 1048576     10000       393.80
 2097152     10000       393.82
 4194304     10000       393.82
 8388608     10000       393.83
---------------------------------------------------------------------------------------
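For large messages, the bi-directional result should land close to twice the uni-directional line rate (roughly 2 x 197 Gb/s, i.e. about 394 Gb/s on HDR). A minimal sketch of that sanity check, using the 8388608-byte rows from the outputs above:
Code Block:
# Sketch: large-message bidirectional bandwidth should be close to 2x the unidirectional result.
uni=197.53    # 8388608-byte BW average from the unidirectional run above (Gb/s)
bidi=393.83   # 8388608-byte BW average from the bidirectional run above (Gb/s)
awk -v u="${uni}" -v b="${bidi}" 'BEGIN { printf "bidirectional/unidirectional ratio: %.2f (expect ~2.0)\n", b / u }'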
Note: All trademarks are property of their respective owners. All information is provided “As-Is” without any kind of warranty. The HPC-AI Advisory Council makes no representation to the accuracy and completeness of the information contained herein. The HPC-AI Advisory Council undertakes no duty and assumes no obligation to update or correct any information presented herein.