GPUDirect Benchmarking

This post walks through the configuration and execution of the GPUDirect RDMA and GDRCopy features.

 

Overview

The GPUDirect RDMA technology exposes GPU memory to I/O devices by enabling a direct communication path between GPUs in two remote systems. This feature eliminates the need to use the system CPUs to stage GPU data in and out of intermediate system memory buffers. As a result, the end-to-end latency is reduced and the sustained bandwidth is increased (depending on the PCIe topology).

The GDRCopy (GPUDirect RDMA Copy) library leverages the GPUDirect RDMA APIs to create CPU memory mappings of GPU memory. The advantage of a CPU-driven copy is its very small overhead, which is helpful when low latencies are required.

Getting Started with GPUDirect

Hardware Setup

In this post, the examples are taken from the Tessa cluster:

  • Colfax CX41060t-XK7 cluster

  • Dual Socket Intel(R) Xeon(R) Gold 6138 CPU @ 2GHz

  • NVIDIA ConnectX-6 HDR InfiniBand adapter (over PCIe gen3 link)

  • NVIDIA A100 Tensor Core GPU 40GB per node (PCIe Gen4 capable in gen3 system)

  • NVIDIA HDR Quantum Switch QM8700 40-Port HDR 200Gb/s InfiniBand

  • Memory: 256GB DDR4 2666MHz RDIMMs per node

Adapter/GPU Localization

For best GPUDirect benchmarking results, make sure that the adapter and the GPU are connected to the same PCIe switch whenever possible. Refer to https://hpcadvisorycouncil.atlassian.net/wiki/spaces/HPCWORKS/pages/1675034639 for some examples.

Check the PCIe topology using the nvidia-smi topo command:

$ nvidia-smi topo -m
        GPU0    mlx5_0  mlx5_1  mlx5_2  mlx5_3  mlx5_4  mlx5_5  CPU Affinity  NUMA Affinity
GPU0     X      NODE    NODE    PIX     PIX     PIX     PIX     0-19          0
mlx5_0  NODE     X      PIX     NODE    NODE    NODE    NODE
mlx5_1  NODE    PIX      X      NODE    NODE    NODE    NODE
mlx5_2  PIX     NODE    NODE     X      PIX     PIX     PIX
mlx5_3  PIX     NODE    NODE    PIX      X      PIX     PIX
mlx5_4  PIX     NODE    NODE    PIX     PIX      X      PIX
mlx5_5  PIX     NODE    NODE    PIX     PIX     PIX      X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

Make sure to use an adapter with the PIX attribute (a single PCIe bridge); in the example here we will use the mlx5_3 device (connected back to back). Note that if there is no PCIe switch in the server, you will not see the PIX option; use NODE instead.

Adapter Firmware Configuration

Set the PCIe Maximum Accumulated Outstanding Read bytes (this change persists across reboots, but you may need to reconfigure it after a firmware upgrade). This parameter affects bandwidth: if the value is too low, there are not enough outstanding read requests in flight. For higher-bandwidth systems (e.g. PCIe gen4 systems with HDR adapters), an even higher value such as MAX_ACC_OUT_READ=44 is suggested to reach full bandwidth on HDR links.

The configuration below (MAX_ACC_OUT_READ=32) was done on our Tessa cluster (Intel Skylake CPUs, PCIe gen3 with HDR links):

$ sudo mst start
$ sudo mlxconfig -d /dev/mst/mt4123_pciconf0 s ADVANCED_PCI_SETTINGS=1
$ sudo mlxconfig -d /dev/mst/mt4123_pciconf0 s MAX_ACC_OUT_READ=32
$ reboot

You will need to do this for each PCIe adapter in use, in case there is more than one.
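To verify the setting (for example after the reboot), the firmware configuration can be queried; a minimal sketch, assuming the same mst device path as above:

$ sudo mlxconfig -d /dev/mst/mt4123_pciconf0 q | grep MAX_ACC_OUT_READ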

Note: For NDR links, MAX_ACC_OUT_READ=128 is recommended (with the latest firmware).

PCIe Configuration

1. Set the PCIe Max Read Request size to 4KB. This is important for maximum bandwidth; the Linux default is normally 512 bytes.

$ setpci -d ::207 68.w=5000:f000

 

Verify that MaxReadReq is set to 4096:
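For example, using lspci against the adapter's PCIe address (the address 3b:00.0 below is only a placeholder; substitute your adapter's address):

$ sudo lspci -s 3b:00.0 -vvv | grep MaxReadReq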

Note: the setpci setting does not survive a reboot. For more information, see Understanding PCIe Configuration for Maximum Performance.

 

2. Access Control Services (ACS)

IO virtualization (also known as VT-d or IOMMU) can interfere with GPUDirect by redirecting all PCIe point-to-point traffic to the CPU root complex, causing a significant performance reduction or even a hang. Make sure that ACS is disabled on the PCIe bridges. A flag value of '+' means enabled, while '-' means disabled; we would like all ACS flags to be disabled:
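The flags can be checked with lspci; a minimal sketch that prints the ACS control registers of all devices that expose them:

$ sudo lspci -vvv | grep -i ACSCtl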

In case it is enabled, use the following command to disable it (for the relevant PCIe address, passed with "-s"):
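A sketch of such a command, assuming a pciutils version that understands the ECAP_ACS capability name (the address 85:00.0 is only a placeholder for the PCIe bridge in question):

$ sudo setpci -s 85:00.0 ECAP_ACS+0x6.w=0000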

 

See also https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html for more PCIe troubleshooting.

Software

  • CUDA 11.3

  • NV_PEER_MEM

NV_PEER_MEM is the name of the package for GPUDirect RDMA; it enables peer-to-peer data transport between the adapter and the GPU. This provides a significant decrease in GPU-GPU communication latency and completely offloads the CPU, removing it from all GPU-GPU communications across the network. If it is not installed, the data needs to be staged in host memory.

The Git repository for the software is located here: https://github.com/Mellanox/nv_peer_memory. The formal release is available on the website.

Note: A newer version of the NV peer memory module can be found here:

  • GDRCOPY

While GPUDirect RDMA is meant for direct access to GPU memory from third-party devices, GDRCopy uses the same APIs to create valid CPU mappings of the GPU memory.

The advantage of a CPU-driven copy is its very small overhead, which might be useful when low latencies are required. See the Git repository here:

  • UCX

UCX needs to be built with both CUDA (--with-cuda) and GDRCopy (--with-gdrcopy) support; a combined build sketch is shown after this list.

  • OpenMPI

Build OpenMPI with CUDA (--with-cuda) and UCX (--with-ucx) support:

  • OSU Microbenchmarks for openmpi

Build the OSU microbenchmarks with CUDA support (--enable-cuda):
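A combined build sketch for the UCX, OpenMPI, and OSU stack; the install prefixes and CUDA location below are assumptions, adjust them to your environment:

# UCX with CUDA and GDRCopy support
$ ./configure --prefix=/opt/ucx --with-cuda=/usr/local/cuda --with-gdrcopy=/usr/local
$ make -j && sudo make install

# OpenMPI with CUDA and UCX support
$ ./configure --prefix=/opt/openmpi --with-cuda=/usr/local/cuda --with-ucx=/opt/ucx
$ make -j && sudo make install

# OSU microbenchmarks with CUDA support, built against the OpenMPI above
$ ./configure CC=/opt/openmpi/bin/mpicc CXX=/opt/openmpi/bin/mpicxx --enable-cuda --with-cuda-include=/usr/local/cuda/include --with-cuda-libpath=/usr/local/cuda/lib64
$ make -j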

 

In case you wish to use MVAPICH2 MPI and not OpenMPI:

  • MVAPICH2-GDR

  • OSU Microbenchmarks for MVAPICH2

 

Setting up GPU Clock

For micro-benchmarks, and possibly for application benchmarks, set the clock speed to its peak value via the nvidia-smi command:

-lgc --lock-gpu-clocks=     Specifies <minGpuClock,maxGpuClock> clocks as a pair (e.g. 1500,1500) that defines the range of desired locked GPU clock speed in MHz. Setting this will supercede application clocks and take effect regardless if an app is running. Input can also be a singular desired clock value.

-ac --applications-clocks=  Specifies <memory,graphics> clocks as a pair (e.g. 2000,800) that defines GPU's speed in MHz while running applications on a GPU.

For example:
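A minimal sketch, assuming GPU index 0 and the A100's peak graphics clock of 1410 MHz (query the supported clocks on your own GPU first):

$ nvidia-smi -q -d SUPPORTED_CLOCKS
$ sudo nvidia-smi -i 0 -lgc 1410,1410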

 

Benchmarking

 

MPI Commands

Here are example MPI commands for our Tessa cluster (A100 GPUs with ConnectX-6 HDR adapters linked over PCIe gen3).

Open MPI

Generic OpenMPI and UCX Flags

  • UCX_NET_DEVICES=mlx5_3:1

This selects the network adapter to be used; it is important in cases where there is more than one adapter or more than one port. In this example, the mlx5_3 device is located on the same PCIe switch as the GPU (for more info, see the Adapter/GPU Localization section above).

  • -map-by dist -mca rmaps_dist_device

This option selects a CPU core located on the same NUMA node as the adapter. Some systems have multiple NUMA nodes, and for better latency it is important to select the closest one.

  • -x UCX_RNDV_THRESH=1024

This option sets the rendezvous message-size threshold (at which UCX switches from the eager protocol to the rendezvous protocol); an internal formula selects the default value per system. In some cases, manually tuning this number can overcome poor performance. A value of 1024 gave better performance on the Tessa cluster tested here.

GPUDirect RDMA + GDR Copy (both enabled)

UCX_IB_GPU_DIRECT_RDMA=1

  • forces the GPUDirect RDMA feature; if the package is not installed, the run will fail. A value of 0 disables it.

UCX_TLS=rc,cuda_copy,gdr_copy:

  • cuda_copy is the basic transport needed for CUDA memory support

  • gdr_copy enables GDRCopy

  • rc uses the InfiniBand RC (Reliable Connection) transport mode
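Putting the flags above together, a sketch of an osu_latency run with both GPUDirect RDMA and GDRCopy enabled (the host names and binary path are placeholders; adjust them to your setup):

$ mpirun -np 2 -H tessa001,tessa002 -mca pml ucx \
      -map-by dist -mca rmaps_dist_device mlx5_3 \
      -x UCX_NET_DEVICES=mlx5_3:1 -x UCX_RNDV_THRESH=1024 \
      -x UCX_IB_GPU_DIRECT_RDMA=1 -x UCX_TLS=rc,cuda_copy,gdr_copy \
      ./osu_latency -d cuda D D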

 

GPUDirect RDMA disabled and GDR Copy enabled
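For this case, only the following flags change relative to the sketch above (a sketch):

      -x UCX_IB_GPU_DIRECT_RDMA=0 -x UCX_TLS=rc,cuda_copy,gdr_copy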

 

GPUDirect and GDR Copy are disabled
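For this case, a sketch of the changed flags relative to the first command:

      -x UCX_IB_GPU_DIRECT_RDMA=0 -x UCX_TLS=rc,cuda_copy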

 

 

MVAPICH MPI

Command:

GPUDirect RDMA (enabled)

No GPUDirect RDMA (disabled)

No GPUDirect no GDR Copy
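The original command listings are not reproduced here; below is a sketch of how the three cases could be run with MVAPICH2-GDR using mpirun_rsh (the host names are placeholders, and the MV2 environment variables are taken from the MVAPICH2-GDR documentation rather than from this post):

# GPUDirect RDMA enabled (with GDRCopy)
$ mpirun_rsh -np 2 tessa001 tessa002 MV2_USE_CUDA=1 MV2_USE_GPUDIRECT=1 MV2_USE_GPUDIRECT_GDRCOPY=1 ./osu_latency D D

# GPUDirect RDMA disabled (GDRCopy still enabled)
$ mpirun_rsh -np 2 tessa001 tessa002 MV2_USE_CUDA=1 MV2_USE_GPUDIRECT=0 MV2_USE_GPUDIRECT_GDRCOPY=1 ./osu_latency D D

# No GPUDirect RDMA and no GDRCopy
$ mpirun_rsh -np 2 tessa001 tessa002 MV2_USE_CUDA=1 MV2_USE_GPUDIRECT=0 MV2_USE_GPUDIRECT_GDRCOPY=0 ./osu_latency D D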

 

Results

Back to Back Setup

The results below were achieved with the OpenMPI/UCX configuration specified above.

Note: In this setup the PCIe link is x16 gen3 while the adapter is InfiniBand HDR, so the bandwidth limit is the maximum PCIe speed.

Latency

A performance boost of over 8.5X is achieved for messages up to 128B when comparing runs with and without GPUDirect and GDRCopy, and over 5X for 128B-4KB messages.

Another clear observation is that GDRCopy provides a latency benefit for small messages.

For small messages, GPUDirect RDMA by itself is not enough for best performance.

Note that different systems may need different tuning of the UCX_RNDV_THRESH parameter for best performance.

 

mpirun commands:

 

For large message sizes, you can see similar behavior. The main benefit comes from enabling GPUDirect RDMA; GDRCopy has no effect.

 

Uni-Directional Bandwidth

GPUDirect pushes the GPU bandwidth to the maximum PCIe capacity. GDRCopy does not influence the bandwidth.

 

mpirun commands:

 

Bi-Directional Bandwidth

GPUDirect pushes the GPU bi-directional bandwidth to the maximum PCIe capacity.

 

mpirun commands:

 

Appendix: GPUDirect example over 200Gb/s HDR InfiniBand

 

The following benchmark was done with a similar configuration, using the following hardware:

  • Selene Cluster

    • CPU: AMD EPYC Rome 7742 (64 cores)

    • GPU: Nvidia DGX A100

    • Network: HDR InfiniBand (via switch)

 

Configuration

In this cluster, testing is done in a container with a setup similar to the one described above.

 

Run time parameters

The runtime parameters are similar, except that UCX_RNDV_THRESH=32768 is used instead of the 1024 tested above.

Benchmarking

Latency

Uni-Directional Bandwidth

 

Bi-Directional Bandwidth

Note: Without GPUDirect RDMA, the bandwidth reported by osu_bw, for example, is almost the same as that of osu_bibw, because the limitation is the PCIe x16 link to the CPU, which is shared by the adapter and the GPU working in parallel, creating congestion on the PCIe link. This is an artifact of this setup.

 

Application example

MILC

The MIMD Lattice Computation (MILC) represents part of a set of codes used to study quantum chromodynamics (QCD), the theory of the strong interactions of subatomic physics.

It performs simulations of four dimensional SU(3) lattice gauge theory on MIMD parallel machines. "Strong interactions" are responsible for binding quarks into protons and neutrons and holding them all together in the atomic nucleus.

The MILC collaboration has produced application codes to study several different QCD research areas.

Configuration

Here are the MPI flags used for this test:

GPUDirect Disabled:

  • QUDA_ENABLE_GDR=1

  • UCX_IB_GPU_DIRECT_RDMA=0

  • UCX_IB_PCI_RELAXED_ORDERING=on

  • UCX_MAX_RNDV_RAILS=1

  • UCX_MEMTYPE_CACHE=n

  • UCX_RNDV_THRESH=8192

  • UCX_TLS=rc,cuda_copy

GPUDirect Enabled:

  • QUDA_ENABLE_GDR=0

  • UCX_IB_GPU_DIRECT_RDMA=1

  • UCX_IB_PCI_RELAXED_ORDERING=on

  • UCX_MAX_RNDV_RAILS=1

  • UCX_MEMTYPE_CACHE=n

  • UCX_RNDV_SCHEME=get_zcopy

  • UCX_RNDV_THRESH=8192

  • UCX_TLS=rc,cuda_copy,cuda_ipc,gdr_copy,sm

 

The cluster used here is Selene, with 4 GPUs per node in use (GPUs 0, 2, 4, 6) so that each GPU has a dedicated x16 link to the CPU, and using the adapter closest to each GPU (on the same PCIe switch).

Results

The results show a 33% improvement when running on 256 GPUs.

 

 

References