Understanding the server topology is essential for performance benchmarking.

The nvidia-smi topo command shows how the GPUs, the HCAs, and the CPUs are connected to one another.

We will discuss several common topologies here and give an example of each.

The GPU and HCA are connected to the same PCIe switch

In the following example, GPU0 is connected to the mlx5_2 device via a single PCIe switch, marked PIX in the matrix.

In this case, traffic from the GPU does not reach the CPU; it passes through the PCIe switch directly to the adapter (GPUDirect RDMA).

$ nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    mlx5_0  mlx5_1  mlx5_2  mlx5_3  CPU Affinity
GPU0     X      PIX     PIX     PIX     NODE    NODE    PIX     PIX     0-19
GPU1    PIX      X      PIX     PIX     NODE    NODE    PIX     PIX     0-19
GPU2    PIX     PIX      X      PIX     NODE    NODE    PIX     PIX     0-19
GPU3    PIX     PIX     PIX      X      NODE    NODE    PIX     PIX     0-19
mlx5_0  NODE    NODE    NODE    NODE     X      PIX     NODE    NODE
mlx5_1  NODE    NODE    NODE    NODE    PIX      X      NODE    NODE
mlx5_2  PIX     PIX     PIX     PIX     NODE    NODE     X      PIX
mlx5_3  PIX     PIX     PIX     PIX     NODE    NODE    PIX      X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

In this case, the adapter is a ConnectX-6 HDR 100. When running a basic RDMA write or read device-to-device test, you should expect line rate: on a 100 Gb/s HDR 100 link, the ~96.8 Gb/s average seen below at large message sizes is effectively line rate once protocol overhead is accounted for.

To run this test (a sketch of both prerequisites follows the list):

  • Compile perftest with CUDA support.

  • Make sure the nv_peer_mem module is loaded.
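
As a rough sketch, assuming a build from the upstream perftest sources; exact configure options, paths, and the module name (nv_peer_mem vs. nvidia_peermem) depend on your perftest, CUDA, and driver versions:

$ git clone https://github.com/linux-rdma/perftest.git
$ cd perftest
$ ./autogen.sh
$ ./configure CUDA_H_PATH=/usr/local/cuda/include/cuda.h   # build with CUDA support
$ make -j

# Load the GPU peer-memory module and verify it is present
$ sudo modprobe nv_peer_mem
$ lsmod | grep -E "nv_peer_mem|nvidia_peermem"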

$ ./ib_write_bw -a -d mlx5_2 -i 1 --report_gbits -F -n 10000 tessa001 --use_cuda
initializing CUDA
There are 4 devices supporting CUDA, picking first...
[pid = 169102, dev = 0] device name = [Tesla T4]
creating CUDA Ctx
making it the current CUDA Ctx
cuMemAlloc() of a 16777216 bytes GPU buffer
allocated GPU buffer address at 00007f4207000000 pointer=0x7f4207000000
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF		Device         : mlx5_2
 Number of qps   : 1		Transport type : IB
 Connection type : RC		Using SRQ      : OFF
 TX depth        : 128
 CQ Moderation   : 100
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs	 : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0xbd QPN 0x008e PSN 0x93ed50 RKey 0x00895c VAddr 0x007f4207800000
 remote address: LID 0xa8 QPN 0x008e PSN 0x3dd06e RKey 0x00781b VAddr 0x007f45bf800000
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]
 2          10000            0.11               0.11   		   6.659070
 4          10000            0.22               0.21   		   6.705317
 8          10000            0.43               0.42   		   6.633025
 16         10000            0.82               0.79   		   6.188507
 ...
 1048576    10000            98.95              96.75  		   0.011534
 2097152    10000            98.26              96.77  		   0.005768
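
The command above is the client side; a matching server instance must already be listening on the remote host (tessa001 in this example). A minimal sketch, assuming the server also uses an mlx5_2 port and should place its buffer in GPU memory as well:

# On tessa001: same options, no peer hostname, started before the client
$ ./ib_write_bw -a -d mlx5_2 -i 1 --report_gbits -F -n 10000 --use_cuda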

The GPU and HCA are connected through the CPU root complex

This is another server from the OPS cluster, where the GPU is connected to the adapter via the CPU host bridge root complex (PHB). In this example, traffic flows from GPU memory through the CPU host bridge and on to the adapter.

$ nvidia-smi topo -m
        GPU0    mlx5_0  mlx5_1  CPU Affinity
GPU0     X      PHB     PHB     0-9
mlx5_0  PHB      X      PIX
mlx5_1  PHB     PIX      X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
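
When the path goes through the CPU host bridge, it also helps to keep the benchmark process on the CPUs listed under CPU Affinity (0-9 here). A minimal sketch, assuming numactl is installed, those cores belong to NUMA node 0, and <server> stands for your remote host:

$ numactl --cpunodebind=0 --membind=0 ./ib_write_bw -a -d mlx5_0 --report_gbits -F --use_cuda <server>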

The GPU and HCA are connected to two different NUMA nodes

In this example, traffic from GPU memory has to cross the interconnect between two NUMA nodes to reach the adapter (SYS).

This is the least desirable server architecture when GPUs and performance are required.

$ nvidia-smi topo -m
        GPU0    mlx5_0  mlx5_1  mlx5_2  mlx5_3  mlx5_4  CPU Affinity
GPU0     X      SYS     SYS     SYS     SYS     SYS     48-63
mlx5_0  SYS      X      PIX     SYS     SYS     SYS
mlx5_1  SYS     PIX      X      SYS     SYS     SYS
mlx5_2  SYS     SYS     SYS      X      PIX     SYS
mlx5_3  SYS     SYS     SYS     PIX      X      SYS
mlx5_4  SYS     SYS     SYS     SYS     SYS      X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
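
You can confirm this placement directly from sysfs. A sketch; the GPU bus ID below is illustrative, take the real one from nvidia-smi and convert it to the lowercase 0000:xx:00.0 form that sysfs uses:

# NUMA node of the HCA
$ cat /sys/class/infiniband/mlx5_0/device/numa_node

# Bus ID of the GPU, then the NUMA node of that PCI device
$ nvidia-smi --query-gpu=index,pci.bus_id --format=csv,noheader
$ cat /sys/bus/pci/devices/0000:b1:00.0/numa_node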

Useful commands

$ lstopo-no-graphics 
Machine (251GB total)
  NUMANode L#0 (P#0 125GB)
    Package L#0 + L3 L#0 (28MB)
      L2 L#0 (1024KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
      L2 L#1 (1024KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#1)
      L2 L#2 (1024KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 (P#2)
      L2 L#3 (1024KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 (P#3)
      L2 L#4 (1024KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4 + PU L#4 (P#4)
      L2 L#5 (1024KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5 + PU L#5 (P#5)
      L2 L#6 (1024KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6 + PU L#6 (P#6)
      L2 L#7 (1024KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7 + PU L#7 (P#7)
      L2 L#8 (1024KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8 + PU L#8 (P#8)
      L2 L#9 (1024KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9 + PU L#9 (P#9)
      L2 L#10 (1024KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10 + PU L#10 (P#10)
      L2 L#11 (1024KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11 + PU L#11 (P#11)
      L2 L#12 (1024KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12 + PU L#12 (P#12)
      L2 L#13 (1024KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13 + PU L#13 (P#13)
      L2 L#14 (1024KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14 + PU L#14 (P#14)
      L2 L#15 (1024KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15 + PU L#15 (P#15)
      L2 L#16 (1024KB) + L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16 + PU L#16 (P#16)
      L2 L#17 (1024KB) + L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17 + PU L#17 (P#17)
      L2 L#18 (1024KB) + L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18 + PU L#18 (P#18)
      L2 L#19 (1024KB) + L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19 + PU L#19 (P#19)
    HostBridge L#0
      PCI 8086:a1d2
        Block(Disk) L#0 "sda"
      PCI 8086:a182
      PCIBridge
        PCIBridge
          PCI 1a03:2000
            GPU L#1 "card0"
            GPU L#2 "controlD64"
    HostBridge L#3
      PCIBridge
        PCI 15b3:101b
          Net L#3 "ib0"
          OpenFabrics L#4 "mlx5_0"
        PCI 15b3:101b
          Net L#5 "ib1"
          OpenFabrics L#6 "mlx5_1"
    HostBridge L#5
      PCIBridge
        PCIBridge
          PCIBridge
            PCI 10de:1eb8
              GPU L#7 "card1"
              GPU L#8 "renderD128"
          PCIBridge
            PCI 15b3:101b
              Net L#9 "ib2"
              OpenFabrics L#10 "mlx5_2"
            PCI 15b3:101b
              Net L#11 "ib3"
              OpenFabrics L#12 "mlx5_3"
          PCIBridge
            PCI 10de:1eb8
              GPU L#13 "card2"
              GPU L#14 "renderD129"
          PCIBridge
            PCI 10de:1eb8
              GPU L#15 "card3"
              GPU L#16 "renderD130"
          PCIBridge
            PCI 10de:1eb8
              GPU L#17 "card4"
              GPU L#18 "renderD131"
  NUMANode L#1 (P#1 126GB)
    Package L#1 + L3 L#1 (28MB)
      L2 L#20 (1024KB) + L1d L#20 (32KB) + L1i L#20 (32KB) + Core L#20 + PU L#20 (P#20)
      L2 L#21 (1024KB) + L1d L#21 (32KB) + L1i L#21 (32KB) + Core L#21 + PU L#21 (P#21)
      L2 L#22 (1024KB) + L1d L#22 (32KB) + L1i L#22 (32KB) + Core L#22 + PU L#22 (P#22)
      L2 L#23 (1024KB) + L1d L#23 (32KB) + L1i L#23 (32KB) + Core L#23 + PU L#23 (P#23)
      L2 L#24 (1024KB) + L1d L#24 (32KB) + L1i L#24 (32KB) + Core L#24 + PU L#24 (P#24)
      L2 L#25 (1024KB) + L1d L#25 (32KB) + L1i L#25 (32KB) + Core L#25 + PU L#25 (P#25)
      L2 L#26 (1024KB) + L1d L#26 (32KB) + L1i L#26 (32KB) + Core L#26 + PU L#26 (P#26)
      L2 L#27 (1024KB) + L1d L#27 (32KB) + L1i L#27 (32KB) + Core L#27 + PU L#27 (P#27)
      L2 L#28 (1024KB) + L1d L#28 (32KB) + L1i L#28 (32KB) + Core L#28 + PU L#28 (P#28)
      L2 L#29 (1024KB) + L1d L#29 (32KB) + L1i L#29 (32KB) + Core L#29 + PU L#29 (P#29)
      L2 L#30 (1024KB) + L1d L#30 (32KB) + L1i L#30 (32KB) + Core L#30 + PU L#30 (P#30)
      L2 L#31 (1024KB) + L1d L#31 (32KB) + L1i L#31 (32KB) + Core L#31 + PU L#31 (P#31)
      L2 L#32 (1024KB) + L1d L#32 (32KB) + L1i L#32 (32KB) + Core L#32 + PU L#32 (P#32)
      L2 L#33 (1024KB) + L1d L#33 (32KB) + L1i L#33 (32KB) + Core L#33 + PU L#33 (P#33)
      L2 L#34 (1024KB) + L1d L#34 (32KB) + L1i L#34 (32KB) + Core L#34 + PU L#34 (P#34)
      L2 L#35 (1024KB) + L1d L#35 (32KB) + L1i L#35 (32KB) + Core L#35 + PU L#35 (P#35)
      L2 L#36 (1024KB) + L1d L#36 (32KB) + L1i L#36 (32KB) + Core L#36 + PU L#36 (P#36)
      L2 L#37 (1024KB) + L1d L#37 (32KB) + L1i L#37 (32KB) + Core L#37 + PU L#37 (P#37)
      L2 L#38 (1024KB) + L1d L#38 (32KB) + L1i L#38 (32KB) + Core L#38 + PU L#38 (P#38)
      L2 L#39 (1024KB) + L1d L#39 (32KB) + L1i L#39 (32KB) + Core L#39 + PU L#39 (P#39)
    HostBridge L#13
      PCIBridge
        PCI 8086:1528
          Net L#19 "eth0"
        PCI 8086:1528
          Net L#20 "eth1"
    HostBridge L#15
      PCI 8086:201d
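
A few other commands are often handy for mapping the same topology (availability depends on the installed packages; ibdev2netdev, for example, ships with MLNX_OFED):

# PCIe tree view, showing which devices sit behind which bridge
$ lspci -tv

# Quick filter for the GPUs and the ConnectX adapters
$ lspci | grep -iE "nvidia|mellanox"

# Map RDMA devices to their network interfaces
$ ibdev2netdev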