Basic RDMA Benchmark Examples for AMD 2nd Gen EPYC CPUs over HDR InfiniBand

This post shows a few common RDMA benchmark examples for servers based on AMD 2nd Generation EPYC™ CPUs (formerly codenamed “Rome”) to achieve maximum performance with ConnectX-6 HDR InfiniBand adapters. It is based on testing over the AMD Daytona_X reference platform with 2nd Gen EPYC CPUs and ConnectX-6 HDR InfiniBand adapters.

 

Before starting, please follow https://hpcadvisorycouncil.atlassian.net/wiki/spaces/HPCWORKS/pages/1280442391 to set the cluster parameters for high performance. Please use the latest firmware and driver, and pick a core close to the adapter on the local NUMA node, see https://hpcadvisorycouncil.atlassian.net/wiki/spaces/HPCWORKS/pages/204668929.
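For example, one way to find which NUMA node the adapter sits on and which cores belong to that node (using the mlx5_2 device name from the examples below) is:

$ cat /sys/class/infiniband/mlx5_2/device/numa_node
$ lscpu | grep "NUMA node"

The first command prints the adapter's NUMA node number, and the second lists the CPU ranges of each NUMA node so you can pick a local core.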

RDMA testing is important to run before any application or micro-benchmark testing, as it shows the low-level capabilities of your fabric.
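As a quick sanity check before benchmarking, you can confirm that the port is up and running at the expected HDR rate, for example with the mlx5_2 device used in the examples below:

$ ibstat mlx5_2 | grep -E "State|Rate"

The port should report State: Active and Rate: 200 (200 Gb/s for HDR).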

 

 

RDMA Write Benchmarks

RDMA Write Latency (ib_write_lat)

 

To check RDMA Write latency, please follow these notes:

  • Please use a core local to the HCA; in this example the HDR InfiniBand adapter is local to core 80.

  • More iterations help to smooth the output. In this example, we are using 10000 iterations.

 

Command Example:

# numactl --physcpubind=80 ib_write_lat -a -d mlx5_2 -i 1 --report_gbits -F -n 10000 & ssh rome002 numactl --physcpubind=80  ib_write_lat -a -d mlx5_2 -i 1 --report_gbits -F  rome001 -n 10000
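In this command, -a sweeps all message sizes (2 bytes to 8 MB), -d mlx5_2 selects the HDR adapter, -i 1 selects port 1 of that adapter, -F skips the CPU-frequency check, and -n 10000 sets the number of iterations; --report_gbits makes the bandwidth tests below report in Gb/s. numactl --physcpubind=80 pins each side to the core local to the adapter, and the client side, launched over ssh on the second node, takes the server hostname as its last argument.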

 

Output example, tested on the https://hpcadvisorycouncil.atlassian.net/wiki/spaces/HPCWORKS/pages/1180336152 cluster.

 

$ numactl --physcpubind=80 ib_write_lat -a -d mlx5_2 -i 1 --report_gbits -F -n 10000 & ssh rome008 numactl --physcpubind=80 ib_write_lat -a -d mlx5_2 -i 1 --report_gbits -F rome007 -n 10000
[1] 59440
************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
                    RDMA_Write Latency Test
 Dual-port       : OFF          Device         : mlx5_2
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 220[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0xba QPN 0x0134 PSN 0x597b96 RKey 0x01bb9f VAddr 0x002b1ec3800000
 remote address: LID 0xd1 QPN 0x0135 PSN 0x18ae22 RKey 0x019a74 VAddr 0x002ab499400000
---------------------------------------------------------------------------------------
 #bytes   #iterations   t_avg[usec]
 2        10000         1.01
 4        10000         1.01
 8        10000         1.01
 16       10000         1.01
 32       10000         1.04
 64       10000         1.05
 128      10000         1.08
 256      10000         1.55
 512      10000         1.59
 1024     10000         1.65
 2048     10000         1.77
 4096     10000         2.07
 8192     10000         2.42
 16384    10000         3.01
 32768    10000         3.95
 65536    10000         5.28
 131072   10000         7.93
 262144   10000         13.23
 524288   10000         23.83
 1048576  10000         45.01
 2097152  10000         87.41
 4194304  10000         172.21
 8388608  10000         342.52
---------------------------------------------------------------------------------------

 

 

RDMA Write Bandwidth (ib_write_bw)

To check RDMA Write bandwidth, please follow these notes:

  • Please use a core local to the HCA; in this example the HDR InfiniBand adapter is local to core 80.

  • More iterations help to smooth the output. In this example, we are using 10000 iterations.

  • The NPS configuration should be set to 1 (or 2) for HDR to reach maximum bandwidth; a quick way to check the resulting NUMA layout is shown below.
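NPS (NUMA nodes Per Socket) is configured in the BIOS; one quick way to check the resulting layout from the OS is to count the NUMA nodes the system exposes, for example:

$ lscpu | grep "NUMA node(s)"
$ numactl --hardware | grep available

On a dual-socket system, NPS=1 reports 2 NUMA nodes and NPS=2 reports 4.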

 

Command Example:

# numactl --physcpubind=80 ib_write_bw -a -d mlx5_2 -i 1 --report_gbits -F -n 10000 & ssh rome002 numactl --physcpubind=80  ib_write_bw -a -d mlx5_2 -i 1 --report_gbits -F  rome001 -n 10000

 

Output example, tested on the cluster.

 

 

RDMA Write Bi-Directional Bandwidth (ib_write_bw -b)

To check RDMA Write bi-directional bandwidth, please follow these notes:

  • Please use a core local to the HCA; in this example the HDR InfiniBand adapter is local to core 80.

  • More iterations help to smooth the output. In this example, we are using 10000 iterations.

  • The NPS configuration should be set to 1 (or 2) for HDR to reach maximum bandwidth.

 

Command Example:
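For illustration, a command analogous to the uni-directional bandwidth test above would add the -b (bidirectional) flag on both sides, re-using the same core binding and hostnames:

# numactl --physcpubind=80 ib_write_bw -b -a -d mlx5_2 -i 1 --report_gbits -F -n 10000 & ssh rome002 numactl --physcpubind=80 ib_write_bw -b -a -d mlx5_2 -i 1 --report_gbits -F rome001 -n 10000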

 

Output example, tested on the cluster.

 

 

 

Note: All trademarks are property of their respective owners. All information is provided “As-Is” without any kind of warranty.   The HPC-AI Advisory Council makes no representation to the accuracy and completeness of the information contained herein.  HPC-AI Advisory Council undertakes no duty and assumes no obligation to update or correct any information presented herein.

References