This post shows a few common RDMA benchmark examples for AMD Rome CPU-based platforms.
Before starting, please follow the AMD Rome Tuning Guide for InfiniBand HPC to set the cluster parameters for high performance. Use the latest firmware and driver, and pin the benchmark to a core on the NUMA node local to the adapter; see HowTo Find the local NUMA node in AMD EPYC Servers.
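As a quick check (a sketch, assuming the adapter is mlx5_2 as in the command examples below), the NUMA node local to the adapter and its CPU list can be read directly from sysfs; pick one of the listed cores for numactl --physcpubind:

# cat /sys/class/infiniband/mlx5_2/device/numa_node
# cat /sys/class/infiniband/mlx5_2/device/local_cpulist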
RDMA testing is important to run before any application or micro-benchmark testing, as it shows the low-level capabilities of your fabric.
RDMA Write Benchmarks
RDMA Write Latency (ib_write_lat)
To check the latency of RDMA Write, please follow these notes:
Please use a core local to the HCA; in this example, the HDR InfiniBand adapter is local to core 80.
More iterations help smooth the results. In this example, we are using 10,000 iterations.
Command Example:
# numactl --physcpubind=80 ib_write_lat -a -d mlx5_2 -i 1 --report_gbits -F -n 10000 & ssh rome002 numactl --physcpubind=80 ib_write_lat -a -d mlx5_2 -i 1 --report_gbits -F rome001 -n 10000
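The one-liner above starts the server side locally in the background and launches the client on the second node over ssh. Equivalently (a sketch using the same flags and the rome001/rome002 hostnames of the example), the two sides can be started from two separate shells:

On rome001 (server):
# numactl --physcpubind=80 ib_write_lat -a -d mlx5_2 -i 1 --report_gbits -F -n 10000

On rome002 (client):
# numactl --physcpubind=80 ib_write_lat -a -d mlx5_2 -i 1 --report_gbits -F rome001 -n 10000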
Output example, tested on a Rome cluster:
$ numactl --physcpubind=80 ib_write_lat -a -d mlx5_2 -i 1 --report_gbits -F -n 10000 & ssh rome008 numactl --physcpubind=80 ib_write_lat -a -d mlx5_2 -i 1 --report_gbits -F rome007 -n 10000
[1] 59440

************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
                    RDMA_Write Latency Test
 Dual-port       : OFF          Device         : mlx5_2
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 220[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address:  LID 0xba QPN 0x0134 PSN 0x597b96 RKey 0x01bb9f VAddr 0x002b1ec3800000
 remote address: LID 0xd1 QPN 0x0135 PSN 0x18ae22 RKey 0x019a74 VAddr 0x002ab499400000
---------------------------------------------------------------------------------------
 #bytes   #iterations   t_min[usec]   t_max[usec]   t_typical[usec]   t_avg[usec]   t_stdev[usec]   99% percentile[usec]   99.9% percentile[usec]
 2         10000         0.98          5.71          1.01              1.01          0.05            1.02                   1.87
 4         10000         0.98          3.35          1.01              1.01          0.05            1.02                   1.87
 8         10000         0.98          3.18          1.01              1.01          0.04            1.03                   2.04
 16        10000         0.98          2.78          1.01              1.01          0.04            1.03                   1.50
 32        10000         1.01          3.17          1.04              1.05          0.04            1.06                   2.00
 64        10000         1.02          2.95          1.05              1.05          0.05            1.06                   1.83
 128       10000         1.05          3.15          1.08              1.08          0.04            1.10                   1.60
 256       10000         1.50          3.69          1.55              1.55          0.04            1.57                   2.02
 512       10000         1.55          3.67          1.59              1.59          0.04            1.61                   2.29
 1024      10000         1.61          3.63          1.65              1.66          0.03            1.68                   2.14
 2048      10000         1.73          3.33          1.77              1.77          0.04            1.79                   2.26
 4096      10000         2.04          4.15          2.07              2.07          0.04            2.09                   2.86
 8192      10000         2.37          3.79          2.42              2.42          0.03            2.45                   2.99
 16384     10000         2.93          4.32          3.01              3.01          0.03            3.07                   3.50
 32768     10000         3.87          4.94          3.95              3.96          0.04            4.05                   4.29
 65536     10000         5.21          8.51          5.28              5.30          0.07            5.41                   6.34
 131072    10000         7.85          9.11          7.93              7.94          0.04            8.05                   8.33
 262144    10000         13.15         14.30         13.23             13.24         0.04            13.36                  13.68
 524288    10000         23.74         25.17         23.83             23.84         0.04            23.93                  24.30
 1048576   10000         44.92         46.46         45.01             45.03         0.05            45.16                  45.35
 2097152   10000         87.33         88.77         87.41             87.42         0.04            87.53                  87.88
 4194304   10000         172.10        180.25        172.21            172.27        0.48            173.21                 180.00
 8388608   10000         342.33        394.65        342.52            345.79        6.22            356.91                 380.58
---------------------------------------------------------------------------------------
RDMA Write Bandwidth (ib_write_bw)
To check the bandwidth of RDMA Write, please follow these notes:
Please use a core local to the HCA; in this example, the HDR InfiniBand adapter is local to core 80.
More iterations help smooth the results. In this example, we are using 10,000 iterations.
The NPS configuration should be set to 1 (or 2) for HDR to reach maximum bandwidth; a quick way to verify it is shown below.
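NPS (NUMA nodes Per Socket) is a BIOS setting; an indirect way to check it from the OS (a sketch, assuming a dual-socket Rome system, where NPS1 exposes 2 NUMA nodes and NPS4 exposes 8) is to count the NUMA nodes reported by lscpu or numactl:

# lscpu | grep "NUMA node(s)"
# numactl --hardware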
Command Example:
# numactl --physcpubind=80 ib_write_bw -a -d mlx5_2 -i 1 --report_gbits -F -n 10000 & ssh rome002 numactl --physcpubind=80 ib_write_bw -a -d mlx5_2 -i 1 --report_gbits -F rome001 -n 10000
Output example, tested on a Rome cluster:
$ numactl --physcpubind=80 ib_write_bw -a -d mlx5_2 -i 1 --report_gbits -F -n 10000 & ssh rome008 numactl --physcpubind=80 ib_write_bw -a -d mlx5_2 -i 1 --report_gbits -F rome007 -n 10000
[1] 59777

************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF          Device         : mlx5_2
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 CQ Moderation   : 100
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address:  LID 0xba QPN 0x0135 PSN 0xddd5cc RKey 0x01c3be VAddr 0x002b948b000000
 remote address: LID 0xd1 QPN 0x0136 PSN 0x8fb73f RKey 0x019c76 VAddr 0x002ab363c00000
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF          Device         : mlx5_2
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 TX depth        : 128
 CQ Moderation   : 100
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address:  LID 0xd1 QPN 0x0136 PSN 0x8fb73f RKey 0x019c76 VAddr 0x002ab363c00000
 remote address: LID 0xba QPN 0x0135 PSN 0xddd5cc RKey 0x01c3be VAddr 0x002b948b000000
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]
 2          10000          0.066914           0.066468             4.154222
 4          10000          0.13               0.13                 4.157486
 8          10000          0.27               0.27                 4.164931
 16         10000          0.53               0.53                 4.147037
 32         10000          1.07               1.07                 4.163076
 64         10000          2.13               2.12                 4.132880
 128        10000          4.27               4.26                 4.157987
 256        10000          8.55               8.51                 4.157105
 512        10000          17.07              17.00                4.150256
 1024       10000          33.82              33.57                4.097722
 2048       10000          67.27              66.95                4.086370
 4096       10000          133.57             133.17               4.064149
 8192       10000          186.65             186.58               2.846967
 16384      10000          192.50             192.38               1.467769
 32768      10000          197.07             197.06               0.751713
 65536      10000          196.64             196.62               0.375025
 131072     10000          197.48             197.47               0.188319
 262144     10000          197.54             197.53               0.094191
 524288     10000          197.57             197.54               0.047097
 1048576    10000          197.57             197.54               0.023549
 2097152    10000          197.58             197.56               0.011775
 4194304    10000          197.58             197.57               0.005888
 8388608    10000          197.55             197.53               0.002943
---------------------------------------------------------------------------------------
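As a sanity check, the average bandwidth, message size, and message rate in the table are consistent: BW[Gb/sec] ≈ MsgRate[Mpps] × message size[bytes] × 8 / 1000. For the 8388608-byte row, 0.002943 × 8388608 × 8 / 1000 ≈ 197.5 Gb/sec, which is close to the 200 Gb/sec line rate of an HDR link.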
RDMA Write Bi-Directional Bandwidth (ib_write_bw -b)
To check the bi-directional bandwidth of RDMA Write, please follow these notes:
Please use a core local to the HCA; in this example, the HDR InfiniBand adapter is local to core 80.
More iterations help smooth the results. In this example, we are using 10,000 iterations.
The NPS configuration should be set to 1 (or 2) for HDR to reach maximum bandwidth (see the check above).
Command Example:
# numactl --physcpubind=80 ib_write_bw -a -d mlx5_2 -i 1 --report_gbits -F -n 10000 -b & ssh rome002 numactl --physcpubind=80 ib_write_bw -a -d mlx5_2 -i 1 --report_gbits -F rome001 -n 10000 -b
Output example, tested on a Rome cluster (only the server side is shown):
$ numactl --physcpubind=80 ib_write_bw -a -d mlx5_2 -i 1 --report_gbits -F -n 10000 -b

************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
                    RDMA_Write Bidirectional BW Test
 Dual-port       : OFF          Device         : mlx5_2
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 TX depth        : 128
 CQ Moderation   : 100
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address:  LID 0xba QPN 0x0138 PSN 0x210c6 RKey 0x01e1ef VAddr 0x002b7835800000
 remote address: LID 0xd1 QPN 0x0139 PSN 0x99d1e6 RKey 0x01dae9 VAddr 0x002abd1dc00000
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]
 2          10000          0.133457           0.132682             8.292601
 4          10000          0.27               0.27                 8.312058
 8          10000          0.53               0.53                 8.270148
 16         10000          1.07               1.06                 8.289442
 32         10000          2.13               2.12                 8.287524
 64         10000          4.24               4.22                 8.251245
 128        10000          8.53               8.50                 8.301570
 256        10000          16.99              16.93                8.264573
 512        10000          33.94              33.83                8.258309
 1024       10000          66.96              66.53                8.120855
 2048       10000          132.49             131.87               8.048677
 4096       10000          261.92             261.10               7.968046
 8192       10000          358.14             357.88               5.460775
 16384      10000          379.43             379.22               2.893244
 32768      10000          391.65             391.56               1.493699
 65536      10000          391.03             390.98               0.745735
 131072     10000          393.45             393.42               0.375196
 262144     10000          393.68             393.66               0.187714
 524288     10000          393.78             393.76               0.093879
 1048576    10000          393.85             393.80               0.046945
 2097152    10000          393.88             393.82               0.023473
 4194304    10000          393.87             393.82               0.011737
 8388608    10000          393.85             393.83               0.005869
---------------------------------------------------------------------------------------
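In the bi-directional test the reported bandwidth is the aggregate of both directions, so the ~394 Gb/sec measured for large messages is close to twice the 200 Gb/sec HDR line rate.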
Note: All trademarks are property of their respective owners. All information is provided “As-Is” without any kind of warranty. The HPC-AI Advisory Council makes no representation to the accuracy and completeness of the information contained herein. HPC-AI Advisory Council undertakes no duty and assumes no obligation to update or correct any information presented herein.