This post walks through the configuration and execution of the GPUDirect RDMA and GDRCopy features.

Overview

The GPUDirect RDMA technology exposes GPU memory to I/O devices, enabling a direct communication path between GPUs in two remote systems. This feature eliminates the need to use the system CPUs to stage GPU data in and out of intermediate system memory buffers. As a result, the end-to-end latency is reduced and the sustained bandwidth is increased (depending on the PCIe topology).

The GDRCopy (GPUDirect RDMA Copy) library leverages the GPUDirect RDMA APIs to create CPU memory mappings of GPU memory. The advantage of a CPU-driven copy is its very small overhead, which is helpful when low latencies are required.

Getting Started with GPUDirect

Hardware Setup

In this post, the examples are taken from the Tessa cluster:

Adapter/GPU Localization

For best GPUDirect benchmarking results, make sure that the adapter and the GPU are connected to the same PCIe switch when possible. Refer to the PCI Switch, CPU and GPU Direct server topology documentation for some examples.

Check the PCIe topology using the nvidia-smi topo command:

$ nvidia-smi topo -m
        GPU0    mlx5_0  mlx5_1  mlx5_2  mlx5_3  mlx5_4  mlx5_5  CPU Affinity    NUMA Affinity
GPU0     X      NODE    NODE    PIX     PIX     PIX     PIX     0-19    0
mlx5_0  NODE     X      PIX     NODE    NODE    NODE    NODE
mlx5_1  NODE    PIX      X      NODE    NODE    NODE    NODE
mlx5_2  PIX     NODE    NODE     X      PIX     PIX     PIX
mlx5_3  PIX     NODE    NODE    PIX      X      PIX     PIX
mlx5_4  PIX     NODE    NODE    PIX     PIX      X      PIX
mlx5_5  PIX     NODE    NODE    PIX     PIX     PIX      X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

Make sure to use an adapter with the PIX attribute (single PCIe bridge); in our example we will use the mlx5_3 device (connected back to back). Note that if there is no PCIe switch on the server, you will not see the PIX option; use NODE instead.
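As an additional sanity check (a minimal sketch; the mlx5_3 device name is from this example, and the GPU bus address shown is a hypothetical placeholder), you can cross-check the PCIe address and NUMA node of the adapter and the GPU from sysfs:

# PCIe bus address and NUMA node of the adapter
$ readlink -f /sys/class/infiniband/mlx5_3/device
$ cat /sys/class/infiniband/mlx5_3/device/numa_node

# PCIe bus address and NUMA node of the GPU
$ nvidia-smi --query-gpu=index,pci.bus_id --format=csv
$ cat /sys/bus/pci/devices/0000:3b:00.0/numa_node   # replace 0000:3b:00.0 with the GPU bus ID reported above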

Adapter Firmware Configuration

Set the PCIe Max Accumulative Outstanding Read bytes (this change persists across reboots, but you may need to reconfigure it after a firmware upgrade). This parameter affects the bandwidth: if the value is too low, there are not enough outstanding read requests. For higher-bandwidth systems (e.g., PCIe Gen4 systems with HDR adapters), an even higher value is needed; MAX_ACC_OUT_READ=44 is suggested to reach full bandwidth for HDR links.

The configuration below (MAX_ACC_OUT_READ=32) is done on our Tessa cluster (Intel Skylake CPU, PCIe Gen3 with HDR links):

$ sudo mst start
$ sudo mlxconfig -d /dev/mst/mt4123_pciconf0 s ADVANCED_PCI_SETTINGS=1
$ sudo mlxconfig -d /dev/mst/mt4123_pciconf0 s MAX_ACC_OUT_READ=32
$ reboot

Repeat this for each PCIe adapter in use, in case there is more than one.

Note: For NDR links, MAX_ACC_OUT_READ=128 is recommended (with the latest firmware).
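To check the value currently programmed on the adapter (a minimal sketch; the /dev/mst device name follows the example above and may differ on your system), query the firmware configuration with mlxconfig:

$ sudo mst start
$ sudo mlxconfig -d /dev/mst/mt4123_pciconf0 q | grep MAX_ACC_OUT_READ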

PCIe Configuration

1. Set the PCIe Max Read Request size to 4KB. This is important for maximum bandwidth; the Linux default is normally 512 bytes.

$ setpci -d ::207 68.w=5000:f000

Verify that MaxReadReq is set to 4096:

sudo lspci -d ::207 -vvv | grep MaxReadReq
                        MaxPayload 256 bytes, MaxReadReq 4096 bytes
                        

Note: The setpci setting does not persist across reboots. For more information, see Understanding PCIe Configuration for Maximum Performance.
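Since the setting is lost on reboot, one option (a minimal sketch, reusing the same lspci class filter ::207 as above) is to re-apply it from a boot script and verify the result for every matching adapter:

# re-apply the 4KB Max Read Request to every matching adapter and verify
for bdf in $(lspci -d ::207 -D | awk '{print $1}'); do
    sudo setpci -s "$bdf" 68.w=5000:f000
    sudo lspci -s "$bdf" -vvv | grep MaxReadReq
done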

2. Access Control Services (ACS)

I/O virtualization (also known as VT-d or IOMMU) can interfere with GPUDirect by redirecting all PCIe point-to-point traffic to the CPU root complex, causing a significant performance reduction or even a hang. Make sure that ACS is disabled on the PCIe bridges. A flag value of '+' means enabled, while '-' means disabled. We would like to have all ACS flags disabled:

$ sudo lspci -s 0000:18:00.0  -vvv | grep ACSCtl
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-

In case it is enabled, use the following command to disable it (with the correct PCIe address after "-s"):

$ sudo setpci -s 0000:18:00.0 f2a.w=0000
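To scan the whole system rather than a single device (a minimal sketch that only reports devices with any ACS control bit enabled; it does not modify anything), you can loop over all PCIe functions:

# list every PCIe function that still has an ACS control bit enabled
for bdf in $(lspci -D | awk '{print $1}'); do
    if sudo lspci -s "$bdf" -vvv 2>/dev/null | grep "ACSCtl:" | grep -q "+"; then
        echo "ACS is enabled on $bdf"
    fi
done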

See also https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html for more PCIe troubleshooting.

Software

NV_PEER_MEM is the name of the package for GPUDirect RDMA. It enables peer-to-peer data transport between the adapter and the GPU, which provides a significant decrease in GPU-GPU communication latency and completely offloads the CPU, removing it from all GPU-GPU communications across the network. If it is not installed, the data needs to be staged through host memory.

The Git repository for the software is located here: https://github.com/Mellanox/nv_peer_memory. The formal release is on the website.

Note: A newer version of the peer memory module (nvidia-peermem, bundled with the NVIDIA driver) can be found here: https://download.nvidia.com/XFree86/Linux-x86_64/470.42.01/README/nvidia-peermem.html

export KERNEL_VER=4.18.0-305.3.1
export GPUDIRECT_VER=1.1
wget -c https://www.mellanox.com/downloads/ofed/nvidia-peer-memory_1.1.tar.gz
tar zxpvf nvidia-peer-memory_1.1.tar.gz
cd nvidia-peer-memory-${GPUDIRECT_VER}
make KVER=${KERNEL_VER}.el8.x86_64 all install
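After installation, it is worth confirming that the kernel module is actually loaded (a minimal sketch; the module is called nv_peer_mem for this package, or nvidia_peermem when the in-driver module from the note above is used):

# confirm the peer-memory kernel module is loaded
$ lsmod | grep -E "nv_peer_mem|nvidia_peermem"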

While GPUDirect RDMA is meant for direct access to GPU memory from third-party devices, GDRCopy uses these same APIs to create valid CPU mappings of the GPU memory.

The advantage of a CPU-driven copy is the very small overhead involved, which is useful when low latencies are required. See the Git repository here: https://github.com/NVIDIA/gdrcopy

1. Install the GDRCopy library
module load cuda/11.3
wget -c https://github.com/NVIDIA/gdrcopy/archive/v2.3.tar.gz
tar zxpvf v2.3.tar.gz
cd gdrcopy-2.3
make prefix=<path>/gdrcopy CUDA=$CUDA_DIR all install

2. Insert driver
./insmod.sh
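You can then verify that the gdrdrv kernel module is loaded and, optionally, run one of the tests shipped with GDRCopy (a minimal sketch; depending on the GDRCopy version the test binary is named copybw or gdrcopy_copybw):

# confirm the gdrdrv kernel module is loaded
$ lsmod | grep gdrdrv

# optional bandwidth test from the GDRCopy build tree
# (named copybw or gdrcopy_copybw, depending on the version)
$ ./tests/copybw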

Build UCX (https://github.com/openucx/ucx) with both CUDA (--with-cuda) and GDRCopy (--with-gdrcopy):

CUDA_VER=11.3
BASE=$PWD
module load cuda/$CUDA_VER
wget https://github.com/openucx/ucx/releases/download/v1.11.1/ucx-1.11.1.tar.gz
tar xfp ucx-1.11.1.tar.gz
cd ucx-1.11.1
INSDIR=$BASE/openmpi-4.1.1_cuda$CUDA_VER
./configure --enable-optimizations --disable-logging --disable-debug --disable-assertions \
            --disable-params-check --without-xpmem --without-java \
            --with-cuda=$CUDA_DIR --with-gdrcopy=<path>/gdrcopy --prefix=$INSDIR/ucx

make -j 32 install
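Once UCX is installed, a quick way to confirm that the CUDA and GDRCopy transports were actually built in (a minimal sketch using ucx_info from the installation above) is:

# check that cuda_copy and gdr_copy show up among the available transports
$ $INSDIR/ucx/bin/ucx_info -d | grep -iE "cuda|gdr"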

Build OpenMPI with CUDA (--with-cuda) and UCX (--with-ucx):

wget https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.1.tar.gz
tar xfzp openmpi-4.1.1.tar.gz
cd openmpi-4.1.1
OMPI_DIR=$BASE/openmpi-4.1.1_cuda$CUDA_VER
./configure --prefix=$OMPI_DIR \
            --with-platform=contrib/platform/mellanox/optimized \
            --with-cuda=$CUDA_DIR \
            --with-ucx=$OMPI_DIR/ucx \
            --with-slurm \
            --enable-mpi1-compatibility \
            --with-verbs

make -j32 all 
make install 
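A common way to verify that the resulting Open MPI build has CUDA support compiled in (a minimal sketch; run it against the installation prefix used above) is:

# should print: mca:mpi:base:param:mpi_built_with_cuda_support:value:true
$ $OMPI_DIR/bin/ompi_info --parsable --all | grep mpi_built_with_cuda_support:value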

Build OSU with CUDA (--enable-cuda):

wget https://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-5.8.tgz
tar xfp osu-micro-benchmarks-5.8.tgz
cd osu-micro-benchmarks-5.8
OSU_HOME=$OMPI_DIR/osu-micro-benchmarks-5.8

export PATH=$OMPI_DIR/bin:$PATH
export LD_LIBRARY_PATH=$OMPI_DIR/lib:$LD_LIBRARY_PATH
./configure CC=mpicc CXX=mpicxx --prefix=$OSU_HOME --enable-cuda
make install

In case you wish to use MVAPICH2 MPI and not OpenMPI:

wget http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm
rpm -ivh mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm

# rebuild OSU against MVAPICH2-GDR (MVAPICH2_GDR_DIR points at the MVAPICH2-GDR
# installation, and its mpicc/mpicxx must be first in PATH)
./configure CC=mpicc CXX=mpicxx --prefix=$MVAPICH2_GDR_DIR/osu-micro-benchmarks-5.8 --enable-cuda
make install

Setting up GPU Clock

For micro-benchmarks, and possibly for application benchmarks, set the GPU clock speed to its peak value via the nvidia-smi command:

-lgc, --lock-gpu-clocks=   Specifies <minGpuClock,maxGpuClock> clocks as a pair (e.g. 1500,1500) that defines the range of the desired locked GPU clock speed in MHz. Setting this will supersede the application clocks and take effect regardless of whether an app is running. Input can also be a singular desired clock value.

-ac, --applications-clocks=   Specifies <memory,graphics> clocks as a pair (e.g. 2000,800) that defines the GPU's speed in MHz while running applications on a GPU.

nvidia-smi -lgc <minGpuClock,maxGpuClock>
nvidia-smi -ac <memory,graphics>

For example:

nvidia-smi -lgc 1410,1410
nvidia-smi -ac 1593,1410
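To see which clock values a given GPU supports, and to undo the locking afterwards (a minimal sketch using standard nvidia-smi options), you can use:

# list the memory/graphics clock pairs supported by the GPU
$ nvidia-smi -q -d SUPPORTED_CLOCKS | less

# reset the locked GPU clocks and the application clocks to their defaults
$ sudo nvidia-smi -rgc
$ sudo nvidia-smi -rac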

Benchmarking

MPI Commands

Here are example MPI commands for our Tessa cluster (A100 GPUs with ConnectX-6 HDR adapters over PCIe Gen3).

Open MPI

Generic OpenMPI and UCX Flags

UCX_NET_DEVICES=mlx5_3:1: This selects the network adapter (and port) to be used. It is important in cases where there is more than one adapter and more than one port. In this example, the mlx5_3 device is located on the same PCIe switch as the GPU (for more information, see the Adapter/GPU Localization section above).

-map-by dist -mca rmaps_dist_device mlx5_3: This selects a CPU core located on the same NUMA node as the adapter. Some nodes have multiple NUMA nodes, and for better latency it is important to select the closest one.

UCX_RNDV_THRESH: This sets the rendezvous message-size threshold (where UCX switches from the eager to the rendezvous protocol). There is an internal formula that selects a default value per system; in some cases, manual tuning of this number can overcome poor performance. A value of 1024 gives better performance on the Tessa cluster tested here.

GPUDirect RDMA + GDR Copy (both enabled)

UCX_IB_GPU_DIRECT_RDMA=1
UCX_TLS=rc,cuda_copy,gdr_copy

mpirun -np 2 -host tessa003,tessa004 -x UCX_NET_DEVICES=mlx5_3:1 -map-by dist \
    -mca rmaps_dist_device mlx5_3 -x UCX_IB_GPU_DIRECT_RDMA=1 -x UCX_TLS=rc,cuda_copy,gdr_copy \
    -x UCX_RNDV_THRESH=1024 osu_latency -x 1000 -i 10000 -d cuda D D

GPUDirect RDMA disabled and GDR Copy enabled

mpirun -np 2 -host tessa003,tessa004 -x UCX_NET_DEVICES=mlx5_3:1 -map-by dist \
    -mca rmaps_dist_device mlx5_3 -x UCX_IB_GPU_DIRECT_RDMA=0 -x UCX_RNDV_THRESH=1024 \
    -x UCX_TLS=rc,cuda_copy,gdr_copy osu_latency -x 1000 -i 10000 -d cuda D D

GPUDirect and GDR Copy are disabled

mpirun -np 2 -host tessa003,tessa004 -x UCX_NET_DEVICES=mlx5_3:1 -map-by dist \
    -mca rmaps_dist_device mlx5_3 -x UCX_IB_GPU_DIRECT_RDMA=0 -x UCX_RNDV_THRESH=1024 \
    -x UCX_TLS=rc,cuda_copy osu_latency -x 1000 -i 10000 -d cuda D D

MVAPICH2 MPI

Command:

mpirun -np 2 -f machinefile.txt osu_latency -x 1000 -i 10000 -d cuda D D

GPUDirect RDMA (enabled)

MV2_DEFAULT_PORT=1
MV2_GPUDIRECT_GDRCOPY_LIB=/global/software/centos-8.x86_64/modules/cuda/11.3/gdrcopy/2.3/lib/libgdrapi.so
MV2_CPU_MAPPING=0
MV2_HOMOGENEOUS_CLUSTER=1
MV2_SHOW_CPU_BINDING=1
MV2_SUPPRESS_JOB_STARTUP_PERFORMANCE_WARNING=1
MV2_USE_GPUDIRECT_RDMA=1
MV2_IBA_HCA=
MV2_USE_CUDA=1
MV2_GPUDIRECT_LIMIT=4194304
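These variables can be exported in the shell before launching the osu_latency command shown above; a minimal sketch, assuming the launcher propagates the environment to the remote ranks (the Hydra launcher typically does so by default):

# export the MVAPICH2-GDR tuning variables, then launch as shown above
export MV2_USE_CUDA=1
export MV2_USE_GPUDIRECT_RDMA=1
export MV2_GPUDIRECT_GDRCOPY_LIB=/global/software/centos-8.x86_64/modules/cuda/11.3/gdrcopy/2.3/lib/libgdrapi.so
mpirun -np 2 -f machinefile.txt osu_latency -x 1000 -i 10000 -d cuda D D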

No GPUDirect RDMA (disabled)

MV2_DEFAULT_PORT=1
MV2_CPU_MAPPING=0
MV2_HOMOGENEOUS_CLUSTER=1
MV2_SHOW_CPU_BINDING=1
MV2_SUPPRESS_JOB_STARTUP_PERFORMANCE_WARNING=1
MV2_USE_GPUDIRECT_RDMA=0
MV2_IBA_HCA=
MV2_USE_CUDA=1
MV2_GPUDIRECT_LIMIT=4194304

No GPUDirect no GDR Copy

MV2_USE_GDRCOPY=0
MV2_DEFAULT_PORT=1
MV2_HOMOGENEOUS_CLUSTER=1
MV2_SHOW_CPU_BINDING=1
MV2_SUPPRESS_JOB_STARTUP_PERFORMANCE_WARNING=1
MV2_USE_GPUDIRECT_RDMA=0
MV2_IBA_HCA=
MV2_USE_CUDA=1
MV2_GPUDIRECT_LIMIT=4194304

Results

Back to Back Setup

The results below were achieved with OpenMPI/UCX configuration as specified above.

Note: In this setup, the PCIe speed is x16 Gen3 while the adapter is InfiniBand HDR, so the bandwidth limitation will be the maximum PCIe speed.

Latency

An over 8.5x latency improvement is achieved for messages up to 128B (comparing with and without GPUDirect RDMA and GDR Copy), and over 5x for 128B-4KB messages.

Another clear observation is that GDRCopy provides a latency benefit for small messages.

GPUDirect RDMA by itself is not enough for best small-message performance.

Note that different systems may need different tuning of the UCX_RNDV_THRESH parameter for best performance.
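One simple way to find a good value for a given system (a minimal sketch; the candidate thresholds, hostnames and device names follow the examples in this post) is to sweep the threshold and compare the osu_latency output:

# sweep UCX_RNDV_THRESH and compare the resulting latency curves
for thresh in 256 1024 8192 32768; do
    echo "=== UCX_RNDV_THRESH=$thresh ==="
    mpirun -np 2 -host tessa003,tessa004 -x UCX_NET_DEVICES=mlx5_3:1 -map-by dist \
        -mca rmaps_dist_device mlx5_3 -x UCX_IB_GPU_DIRECT_RDMA=1 \
        -x UCX_TLS=rc,cuda_copy,gdr_copy -x UCX_RNDV_THRESH=$thresh \
        osu_latency -x 1000 -i 10000 -d cuda D D
done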

mpirun commands:

# GPUDirect RDMA and GDRCopy enabled
mpirun -np 2 -host tessa003,tessa004 -x UCX_NET_DEVICES=mlx5_3:1 -map-by dist \
    -mca rmaps_dist_device mlx5_3 -x UCX_IB_GPU_DIRECT_RDMA=1 -x UCX_TLS=rc,cuda_copy,gdr_copy \
    -x UCX_RNDV_THRESH=1024 osu_latency -x 1000 -i 10000 -d cuda D D

# GPUDirect RDMA disabled and GDR Copy enabled
mpirun -np 2 -host tessa003,tessa004 -x UCX_NET_DEVICES=mlx5_3:1 -map-by dist \
    -mca rmaps_dist_device mlx5_3 -x UCX_IB_GPU_DIRECT_RDMA=0 -x UCX_RNDV_THRESH=32K \
    -x UCX_TLS=rc,cuda_copy,gdr_copy osu_latency -x 1000 -i 10000 -d cuda D D

# GPUDirect RDMA enabled and GDRCopy disabled
mpirun -np 2 -host tessa003,tessa004 -x UCX_NET_DEVICES=mlx5_3:1 -map-by dist \
    -mca rmaps_dist_device mlx5_3 -x UCX_IB_GPU_DIRECT_RDMA=1 -x UCX_TLS=rc,cuda_copy \
    -x UCX_RNDV_THRESH=128 osu_latency -x 1000 -i 10000 -d cuda D D

# GPUDirect and GDR Copy are disabled
mpirun -np 2 -host tessa003,tessa004 -x UCX_NET_DEVICES=mlx5_3:1 -map-by dist \
    -mca rmaps_dist_device mlx5_3 -x UCX_IB_GPU_DIRECT_RDMA=0 -x UCX_RNDV_THRESH=1024 \
    -x UCX_TLS=rc,cuda_copy osu_latency -x 1000 -i 10000 -d cuda D D

For large message sizes, you can see similar behavior: the main benefit comes from enabling GPUDirect RDMA, while GDRCopy does not have any effect.

Uni-Directional Bandwidth

GPUDirect RDMA manages to push the GPU bandwidth to the maximum PCIe capacity. GDRCopy does not influence bandwidth.

mpirun commands:

# GPUDirect RDMA and GDRCopy enabled
mpirun -np 2 -host tessa003,tessa004 -x UCX_NET_DEVICES=mlx5_3:1 -map-by dist \
    -mca rmaps_dist_device mlx5_3 -x UCX_IB_GPU_DIRECT_RDMA=1 -x UCX_TLS=rc,cuda_copy,gdr_copy \
    -x UCX_RNDV_THRESH=1024 osu_bw -x 1000 -i 10000 -d cuda D D

# GPUDirect RDMA disabled and GDR Copy enabled
mpirun -np 2 -host tessa003,tessa004 -x UCX_NET_DEVICES=mlx5_3:1 -map-by dist \
    -mca rmaps_dist_device mlx5_3 -x UCX_IB_GPU_DIRECT_RDMA=0 -x UCX_RNDV_THRESH=1024 \
    -x UCX_TLS=rc,cuda_copy,gdr_copy osu_bw -x 1000 -i 10000 -d cuda D D

# GPUDirect RDMA enabled and GDRCopy disabled
mpirun -np 2 -host tessa003,tessa004 -x UCX_NET_DEVICES=mlx5_3:1 -map-by dist \
    -mca rmaps_dist_device mlx5_3 -x UCX_IB_GPU_DIRECT_RDMA=1 -x UCX_TLS=rc,cuda_copy \
    -x UCX_RNDV_THRESH=1024 osu_bw -x 1000 -i 10000 -d cuda D D

# GPUDirect and GDR Copy are disabled
mpirun -np 2 -host tessa003,tessa004 -x UCX_NET_DEVICES=mlx5_3:1 -map-by dist \
    -mca rmaps_dist_device mlx5_3 -x UCX_IB_GPU_DIRECT_RDMA=0 -x UCX_RNDV_THRESH=1024 \
    -x UCX_TLS=rc,cuda_copy osu_bw -x 1000 -i 10000 -d cuda D D

Bi-Directional Bandwidth

GPUDirect RDMA manages to push the GPU bi-directional bandwidth to the maximum PCIe capacity.

mpirun commands:

# GPUDirect RDMA and GDRCopy enabled
mpirun -np 2 -host tessa003,tessa004 -x UCX_NET_DEVICES=mlx5_3:1 -map-by dist \
    -mca rmaps_dist_device mlx5_3 -x UCX_IB_GPU_DIRECT_RDMA=1 -x UCX_TLS=rc,cuda_copy,gdr_copy \
    -x UCX_RNDV_THRESH=1024 osu_bibw -x 1000 -i 10000 -d cuda D D

# GPUDirect RDMA disabled and GDR Copy enabled
mpirun -np 2 -host tessa003,tessa004 -x UCX_NET_DEVICES=mlx5_3:1 -map-by dist \
    -mca rmaps_dist_device mlx5_3 -x UCX_IB_GPU_DIRECT_RDMA=0 -x UCX_RNDV_THRESH=1024 \
    -x UCX_TLS=rc,cuda_copy,gdr_copy osu_bibw -x 1000 -i 10000 -d cuda D D

# GPUDirect RDMA enabled and GDRCopy disabled
mpirun -np 2 -host tessa003,tessa004 -x UCX_NET_DEVICES=mlx5_3:1 -map-by dist \
    -mca rmaps_dist_device mlx5_3 -x UCX_IB_GPU_DIRECT_RDMA=1 -x UCX_TLS=rc,cuda_copy \
    -x UCX_RNDV_THRESH=1024 osu_bibw -x 1000 -i 10000 -d cuda D D

# GPUDirect and GDR Copy are disabled
mpirun -np 2 -host tessa003,tessa004 -x UCX_NET_DEVICES=mlx5_3:1 -map-by dist \
    -mca rmaps_dist_device mlx5_3 -x UCX_IB_GPU_DIRECT_RDMA=0 -x UCX_RNDV_THRESH=1024 \
    -x UCX_TLS=rc,cuda_copy osu_bibw -x 1000 -i 10000 -d cuda D D

Appendix: GPUDirect example over 200Gb/s HDR InfiniBand

The following benchmark was done with similar configuration using the following hardware:

Configuration

In this cluster, testing is done in a container, with a setup similar to the one described above.

Run time parameters

The runtime parameters are similar, except that UCX_RNDV_THRESH=32768 is used instead of the 1024 tested above.

Benchmarking

Latency

Uni-Directional Bandwidth

Bi-Directional Bandwidth

Note: Without GPUDirect RDMA, the bandwidth reported (for example by osu_bw) is almost the same as that of osu_bibw, because the limitation is the PCIe x16 link to the CPU, which is shared and used by the adapter and the GPU together in parallel, creating congestion on the PCIe. This is an artifact of this setup.

Application example

MILC

The MIMD Lattice Computation (MILC) represents part of a set of codes used to study quantum chromodynamics (QCD), the theory of the strong interactions of subatomic physics.

It performs simulations of four dimensional SU(3) lattice gauge theory on MIMD parallel machines. "Strong interactions" are responsible for binding quarks into protons and neutrons and holding them all together in the atomic nucleus.

The MILC collaboration has produced application codes to study several different QCD research areas.

Configuration

Here are the MPI/UCX environment flags used for this test (an example of passing them on the mpirun command line is sketched after the two lists below):

GPUDirect Disabled

  • QUDA_ENABLE_GDR=0
  • UCX_IB_GPU_DIRECT_RDMA=0
  • UCX_IB_PCI_RELAXED_ORDERING=on
  • UCX_MAX_RNDV_RAILS=1
  • UCX_MEMTYPE_CACHE=n
  • UCX_RNDV_THRESH=8192
  • UCX_TLS=rc,cuda_copy

GPUDirect Enabled

  • QUDA_ENABLE_GDR=1
  • UCX_IB_GPU_DIRECT_RDMA=1
  • UCX_IB_PCI_RELAXED_ORDERING=on
  • UCX_MAX_RNDV_RAILS=1
  • UCX_MEMTYPE_CACHE=n
  • UCX_RNDV_SCHEME=get_zcopy
  • UCX_RNDV_THRESH=8192
  • UCX_TLS=rc,cuda_copy,cuda_ipc,gdr_copy,sm
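As a concrete illustration (a minimal sketch; the rank count, hostnames, and the MILC/QUDA binary name are hypothetical placeholders, not taken from the original test), these variables can be passed per rank with mpirun -x, for example for the GPUDirect-enabled case:

# hypothetical launch line for the GPUDirect-enabled configuration;
# replace hostnames, rank count, and ./milc_binary with the real job parameters
mpirun -np 8 -host node001,node002 \
    -x QUDA_ENABLE_GDR=1 -x UCX_IB_GPU_DIRECT_RDMA=1 \
    -x UCX_IB_PCI_RELAXED_ORDERING=on -x UCX_MAX_RNDV_RAILS=1 \
    -x UCX_MEMTYPE_CACHE=n -x UCX_RNDV_SCHEME=get_zcopy -x UCX_RNDV_THRESH=8192 \
    -x UCX_TLS=rc,cuda_copy,cuda_ipc,gdr_copy,sm \
    ./milc_binary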

The cluster used here is Selene, with 4 GPUs per node in use (GPUs 0, 2, 4, 6) so that each GPU has a dedicated x16 link to the CPU, and with the adapter closest to each GPU (same PCIe switch).

Results

The results show a 33% improvement at 256 GPUs.

References