This post shows a few common RDMA benchmark examples for servers based on the AMD 2nd Generation EPYC™ CPU (formerly codenamed “Rome”), and how to achieve maximum performance with ConnectX-6 HDR InfiniBand adapters. The post is based on testing over the AMD Daytona_X reference platform with 2nd Gen EPYC CPUs and ConnectX-6 HDR InfiniBand adapters.

Before starting, please follow the AMD 2nd Gen EPYC CPU Tuning Guide for InfiniBand HPC to set the cluster parameters for high performance. Use the latest firmware and driver, and pin the benchmark to a core close to the adapter on its local NUMA node; see HowTo Find the local NUMA node in AMD EPYC Servers.
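
A quick way to confirm the adapter's locality (a minimal sketch, assuming the device name is mlx5_2 as in the examples below) is to read it straight from sysfs and numactl:

$ cat /sys/class/infiniband/mlx5_2/device/numa_node
$ cat /sys/class/infiniband/mlx5_2/device/local_cpulist
$ numactl --hardware

The first command prints the NUMA node the HCA is attached to, the second the cores local to it, and numactl --hardware shows the full node-to-core layout; pick a core from that list for the --physcpubind pinning used below.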

RDMA testing is important to run before any application or micro-benchmark testing, as it establishes the low-level capabilities of your fabric.

RDMA Write Benchmarks

RDMA Write Latency (ib_write_lat)

To check the latency of RDMA Write, follow these notes:

  • Use a core local to the HCA; in this example, the HDR InfiniBand adapter is local to core 80.

  • More iterations help smooth the output. In this example, we use 10000 iterations.

Command Example:

# numactl --physcpubind=80 ib_write_lat -a -d mlx5_2 -i 1 --report_gbits -F -n 10000 & 
ssh rome002 numactl --physcpubind=80  ib_write_lat -a -d mlx5_2 -i 1 --report_gbits -F  rome001 -n 10000

Output example, tested on the Rome cluster:

$ numactl --physcpubind=80 ib_write_lat -a -d mlx5_2 -i 1 --report_gbits -F -n 10000 & ssh rome008 numactl --physcpubind=80  ib_write_lat -a -d mlx5_2 -i 1 --report_gbits -F  rome007 -n 10000
[1] 59440

************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
                    RDMA_Write Latency Test                                            
 Dual-port       : OFF          Device         : mlx5_2                                
 Number of qps   : 1            Transport type : IB                                    
 Connection type : RC           Using SRQ      : OFF                                   
 Mtu             : 4096[B]                                                             
 Link type       : IB                                                                  
 Max inline data : 220[B]                                                              
 rdma_cm QPs     : OFF                                                                 
 Data ex. method : Ethernet                                                            
---------------------------------------------------------------------------------------
 local address: LID 0xba QPN 0x0134 PSN 0x597b96 RKey 0x01bb9f VAddr 0x002b1ec3800000  
 remote address: LID 0xd1 QPN 0x0135 PSN 0x18ae22 RKey 0x019a74 VAddr 0x002ab499400000 
---------------------------------------------------------------------------------------
 #bytes #iterations    t_avg[usec]
 2       10000          1.01
 4       10000          1.01
 8       10000          1.01
 16      10000          1.01
 32      10000          1.04
 64      10000          1.05
 128     10000          1.08
 256     10000          1.55
 512     10000          1.59
 1024    10000          1.65
 2048    10000          1.77  
 4096    10000          2.07  
 8192    10000          2.42  
 16384   10000          3.01  
 32768   10000          3.95  
 65536   10000          5.28  
 131072  10000          7.93  
 262144  10000          13.23 
 524288  10000          23.83 
 1048576 10000          45.01 
 2097152 10000          87.41 
 4194304 10000          172.21
 8388608 10000          342.52
---------------------------------------------------------------------------------------
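
The -a flag sweeps all message sizes; if you only need one size, perftest also accepts -s <bytes>. A minimal sketch based on the same command, restricted to 8-byte messages (hostnames and core as in the example above):

# numactl --physcpubind=80 ib_write_lat -s 8 -d mlx5_2 -i 1 -F -n 10000 &
ssh rome002 numactl --physcpubind=80 ib_write_lat -s 8 -d mlx5_2 -i 1 -F rome001 -n 10000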

RDMA Write Bandwidth (ib_write_bw)

To check the bandwidth of RDMA Write, follow these notes:

  • Use a core local to the HCA; in this example, the HDR InfiniBand adapter is local to core 80.

  • More iterations help smooth the output. In this example, we use 10000 iterations.

  • The NPS configuration should be set to 1 (or 2) for HDR to reach maximum bandwidth (see the quick check after this list).
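
NPS is a BIOS setting, but you can confirm what the OS currently sees with a quick check (a minimal sketch; NPS=N exposes N NUMA nodes per socket, so a dual-socket system at NPS=1 reports 2 nodes):

$ lscpu | grep -i "numa node"
$ numactl --hardware | grep available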

Command Example:

# numactl --physcpubind=80 ib_write_bw -a -d mlx5_2 -i 1 --report_gbits -F -n 10000 & 
ssh rome002 numactl --physcpubind=80  ib_write_bw -a -d mlx5_2 -i 1 --report_gbits -F  rome001 -n 10000

Output example, tested on the Rome cluster:

$ numactl --physcpubind=80 ib_write_bw -a -d mlx5_2 -i 1 --report_gbits -F -n 10000 & ssh rome008 numactl --physcpubind=80  ib_write_bw -a -d mlx5_2 -i 1 --report_gbits -F  rome007 -n 10000                                                                                                                                                                                                             
[1] 59777                                                                                                                                                                                                     

************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test                                                 
 Dual-port       : OFF          Device         : mlx5_2                                
 Number of qps   : 1            Transport type : IB                                    
 Connection type : RC           Using SRQ      : OFF                                   
 CQ Moderation   : 100                                                                 
 Mtu             : 4096[B]                                                             
 Link type       : IB                                                                  
 Max inline data : 0[B]                                                                
 rdma_cm QPs     : OFF                                                                 
 Data ex. method : Ethernet                                                            
---------------------------------------------------------------------------------------
 local address: LID 0xba QPN 0x0135 PSN 0xddd5cc RKey 0x01c3be VAddr 0x002b948b000000  
 remote address: LID 0xd1 QPN 0x0136 PSN 0x8fb73f RKey 0x019c76 VAddr 0x002ab363c00000
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF          Device         : mlx5_2
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 TX depth        : 128
 CQ Moderation   : 100
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0xd1 QPN 0x0136 PSN 0x8fb73f RKey 0x019c76 VAddr 0x002ab363c00000
 remote address: LID 0xba QPN 0x0135 PSN 0xddd5cc RKey 0x01c3be VAddr 0x002b948b000000
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW average[Gb/sec]
 2          10000           0.066468
 4          10000           0.13  
 8          10000           0.27  
 16         10000           0.53  
 32         10000           1.07  
 64         10000           2.12  
 128        10000           4.26  
 256        10000           8.51  
 512        10000           17.00 
 1024       10000           33.57 
 2048       10000           66.95 
 4096       10000           133.17
 8192       10000           186.58
 16384      10000           192.38
 32768      10000           197.06
 65536      10000           196.62
 131072     10000           197.47
 262144     10000           197.53
 524288     10000           197.54
 1048576    10000           197.54
 2097152    10000           197.56
 4194304    10000           197.57
 8388608    10000           197.53
 ---------------------------------------------------------------------------------------
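
The ~197 Gb/sec average is close to the HDR line rate. To confirm the negotiated rate of the port under test (assuming mlx5_2 as above), check ibstat:

$ ibstat mlx5_2 | grep -i rate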

RDMA Write Bi-Directional Bandwidth (ib_write_bw -b)

To check the bi-directional bandwidth of RDMA Write, follow these notes:

  • Use a core local to the HCA; in this example, the HDR InfiniBand adapter is local to core 80.

  • More iterations help smooth the output. In this example, we use 10000 iterations.

  • The NPS configuration should be set to 1 (or 2) for HDR to reach maximum bandwidth.

Command Example:

# numactl --physcpubind=80 ib_write_bw -a -d mlx5_2 -i 1 --report_gbits -F -n 10000 -b & 
ssh rome002 numactl --physcpubind=80  ib_write_bw -a -d mlx5_2 -i 1 --report_gbits -F  rome001 -n 10000 -b

Output example, tested on the Rome cluster:

$ numactl --physcpubind=80 ib_write_bw -a -d mlx5_2 -i 1 --report_gbits -F -n 10000 -b

************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
                    RDMA_Write Bidirectional BW Test
 Dual-port       : OFF          Device         : mlx5_2
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 TX depth        : 128
 CQ Moderation   : 100
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0xba QPN 0x0138 PSN 0x210c6 RKey 0x01e1ef VAddr 0x002b7835800000
 remote address: LID 0xd1 QPN 0x0139 PSN 0x99d1e6 RKey 0x01dae9 VAddr 0x002abd1dc00000
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW average[Gb/sec]
 2          10000           0.132682
 4          10000           0.27   
 8          10000           0.53  
 16         10000           1.06  
 32         10000           2.12  
 64         10000           4.22  
 128        10000           8.50  
 256        10000           16.93 
 512        10000           33.83 
 1024       10000           66.53 
 2048       10000           131.87
 4096       10000           261.10
 8192       10000           357.88
 16384      10000           379.22
 32768      10000           391.56
 65536      10000           390.98
 131072     10000           393.42
 262144     10000           393.66
 524288     10000           393.76
 1048576    10000           393.80
 2097152    10000           393.82
 4194304    10000           393.82
 8388608    10000           393.83
---------------------------------------------------------------------------------------
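
The ~394 Gb/sec bidirectional average is roughly twice the unidirectional result, as expected. If you prefer a fixed-duration run over a fixed iteration count, perftest also supports -D <seconds> (it then runs at the default message size unless -s is given); a minimal sketch based on the same command:

# numactl --physcpubind=80 ib_write_bw -d mlx5_2 -i 1 --report_gbits -F -D 30 -b &
ssh rome002 numactl --physcpubind=80 ib_write_bw -d mlx5_2 -i 1 --report_gbits -F rome001 -D 30 -b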

Note: All trademarks are property of their respective owners. All information is provided “As-Is” without any kind of warranty. The HPC-AI Advisory Council makes no representation to the accuracy and completeness of the information contained herein. HPC-AI Advisory Council undertakes no duty and assumes no obligation to update or correct any information presented herein.
