GPUDirect Benchmarking
This post walks through the configuration and benchmarking of the GPUDirect RDMA and GDRCopy features.
- 1 Overview
- 2 Getting Started with GPUDirect
- 3 Benchmarking
- 3.1 MPI Commands
- 3.1.1 Open MPI
- 3.1.2 MVAPICH MPI
- 3.1.2.1 GPUDirect RDMA (enabled)
- 3.1.2.2 No GPUDirect RDMA (disabled)
- 3.1.2.3 No GPUDirect no GDR Copy
- 4 Results
- 4.1 Back to Back Setup
- 4.1.1 Latency
- 4.1.2 Uni-Directional Bandwidth
- 4.1.3 Bi-Directional Bandwidth
- 5 Appendix: GPUDirect example over 200Gb/s HDR InfiniBand
- 5.1 Configuration
- 5.2 Run time parameters
- 5.3 Benchmarking
- 5.3.1 Latency
- 5.3.2 Uni-Directional Bandwidth
- 5.3.3 Bi-Directional Bandwidth
- 6 Application example
- 6.1 MILC
- 6.1.1 Configuration
- 6.1.2 Results
- 7 References
Overview
The GPUDirect RDMA technology exposes GPU memory to I/O devices by enabling a direct communication path between GPUs in two remote systems. This feature eliminates the need to use the system CPUs to stage GPU data in and out of intermediate system memory buffers. As a result, the end-to-end latency is reduced and the sustained bandwidth is increased (depending on the PCIe topology).
The GDRCopy (GPUDirect RDMA Copy) library leverages the GPUDirect RDMA APIs to create CPU memory mappings of GPU memory. The advantage of a CPU-driven copy is the very small overhead involved, which is helpful when low latencies are required.
Getting Started with GPUDirect
Hardware Setup
In this post, the examples are taken from the Tessa cluster:
Colfax CX41060t-XK7 cluster
Dual Socket Intel(R) Xeon(R) Gold 6138 CPU @ 2GHz
NVIDIA ConnectX-6 HDR InfiniBand adapter (over PCIe gen3 link)
NVIDIA A100 Tensor Core GPU 40GB per node (PCIe Gen4 capable in gen3 system)
NVIDIA HDR Quantum Switch QM7800 40-Port HDR 200Gb/s InfiniBand
Memory: 256GB DDR4 2666MHz RDIMMs per node
Adapter/GPU Localization
For best GPUDirect benchmarking results, make sure that the adapter and the GPU are connected to the same PCIe switch when possible. Refer to the PCI Switch, CPU and GPU Direct server topology documentation for some examples.
Check the PCI topology using the nvidia-smi topo command:
$ nvidia-smi topo -m
GPU0 mlx5_0 mlx5_1 mlx5_2 mlx5_3 mlx5_4 mlx5_5 CPU Affinity NUMA Affinity
GPU0 X NODE NODE PIX PIX PIX PIX 0-19 0
mlx5_0 NODE X PIX NODE NODE NODE NODE
mlx5_1 NODE PIX X NODE NODE NODE NODE
mlx5_2 PIX NODE NODE X PIX PIX PIX
mlx5_3 PIX NODE NODE PIX X PIX PIX
mlx5_4 PIX NODE NODE PIX PIX X PIX
mlx5_5 PIX NODE NODE PIX PIX PIX X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
Make sure to use an adapter with the PIX attribute (at most a single PCIe bridge); in our example we will use the mlx5_3 device (connected back to back). Note that if there is no PCIe switch in the server, you will not see the PIX attribute; use NODE instead.
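As an additional locality check, you can read the adapter's NUMA node and PCIe address directly from sysfs and compare them with the CPU/NUMA affinity reported by nvidia-smi (a quick sketch, assuming the mlx5_3 device chosen above):
# NUMA node of the chosen InfiniBand adapter
$ cat /sys/class/infiniband/mlx5_3/device/numa_node
# PCIe address of the adapter (useful for the lspci/setpci steps below)
$ readlink -f /sys/class/infiniband/mlx5_3/device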
Adapter Firmware Configuration
Set the PCIe Max Accumulative Outstanding Read bytes (this change persists across reboots, but you may need to reconfigure it after a firmware upgrade). This parameter affects bandwidth: if the value is too low, there are not enough outstanding read requests. For higher-bandwidth systems (e.g., PCIe Gen4 systems with HDR adapters), an even higher value is needed; MAX_ACC_OUT_READ=44 is suggested to reach full bandwidth on HDR links.
The configuration below (MAX_ACC_OUT_READ=32) was done on our Tessa cluster (Intel Skylake CPUs, PCIe Gen3 with HDR links):
$ sudo mst start
$ sudo mlxconfig -d /dev/mst/mt4123_pciconf0 s ADVANCED_PCI_SETTINGS=1
$ sudo mlxconfig -d /dev/mst/mt4123_pciconf0 s MAX_ACC_OUT_READ=32
$ reboot
You will need to do this for each PCIe adapter in use, in case there is more than one.
Note: For NDR links, MAX_ACC_OUT_READ=128 is recommended (with the latest firmware).
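To verify the setting after the reboot, the firmware configuration can be queried (a sketch, assuming the same /dev/mst/mt4123_pciconf0 device as above):
$ sudo mst start
$ sudo mlxconfig -d /dev/mst/mt4123_pciconf0 q | grep MAX_ACC_OUT_READ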
PCIe Configuration
1. Set the PCIe Max Read Request size to 4KB. This is important for maximum bandwidth; the Linux default is normally 512 bytes.
$ setpci -d ::207 68.w=5000:f000
Verify that MaxReadReq is set to 4096:
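For example, the DevCtl section of the lspci output should report "MaxReadReq 4096 bytes" (a sketch; substitute your adapter's PCIe address):
$ sudo lspci -s <adapter PCIe address> -vvv | grep MaxReadReq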
Note: The setpci setting does not survive a reboot. For more information, see Understanding PCIe Configuration for Maximum Performance.
2. Access Control Services (ACS)
I/O virtualization (also known as VT-d or IOMMU) can interfere with GPUDirect by redirecting all PCIe point-to-point traffic to the CPU root complex, causing a significant performance reduction or even a hang. Make sure that ACS is disabled on the PCIe switches. A flag value of '+' means enabled, while '-' means disabled; we would like to have all ACS flags disabled.
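One way to inspect the ACS control flags across the PCIe bridges (a sketch; root privileges are needed for lspci to read the extended capabilities):
$ sudo lspci -vvv | grep -i "ACSCtl"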
In case ACS is enabled, use the following command to disable it (substituting the right PCIe address after "-s"):
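The exact command is not reproduced here; a commonly used form, as in the NCCL troubleshooting guide referenced below, is the following sketch (the PCIe bridge address is a placeholder):
$ sudo setpci -s <bridge PCIe address> ECAP_ACS+0x6.w=0000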
See also https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html for more PCIe troubleshooting.
Software
CUDA 11.3
NV_PEER_MEM
NV_PEER_MEM is the name of the package for GPUDirect RDMA; it enables peer-to-peer data transport between the adapter and the GPU. This provides a significant decrease in GPU-GPU communication latency and completely offloads the CPU, removing it from all GPU-GPU communications across the network. If it is not installed, the data needs to be staged through host memory.
The repository for the software is located here: https://github.com/Mellanox/nv_peer_memory. The formal release is available on the website.
Note: The newer nvidia-peermem module can be found here: https://download.nvidia.com/XFree86/Linux-x86_64/470.42.01/README/nvidia-peermem.html
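After installing either package, verify that the kernel module is loaded (a sketch; the module is named nv_peer_mem for the legacy package and nvidia_peermem for the in-driver version):
$ lsmod | grep -E "nv_peer_mem|nvidia_peermem"
# Load the in-driver module manually if it is not loaded
$ sudo modprobe nvidia-peermem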
GDRCOPY
While GPUDirect RDMA is meant for direct access to GPU memory from third-party devices, GDRCOPY uses these same APIs to create valid CPU mappings of the GPU memory.
The advantage of a CPU-driven copy is the very small overhead involved, which might be useful when low latencies are required. See the Git repository here: https://github.com/NVIDIA/gdrcopy
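Once GDRCopy is installed, a quick way to confirm it is functional (a sketch; the test binary names may differ slightly between gdrcopy releases):
# The gdrdrv kernel module must be loaded
$ lsmod | grep gdrdrv
# Sanity and CPU-driven copy bandwidth tests shipped with gdrcopy
$ gdrcopy_sanity
$ gdrcopy_copybw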
UCX
https://github.com/openucx/ucx
UCX needs to be built with both CUDA (--with-cuda) and GDRCopy (--with-gdrcopy) support:
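A minimal build sketch (the install prefix and the CUDA/GDRCopy paths are assumptions; adjust them to your system):
$ ./autogen.sh
$ ./configure --prefix=/opt/ucx --with-cuda=/usr/local/cuda --with-gdrcopy=/usr/local
$ make -j && sudo make install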
OpenMPI
Build OpenMPI with CUDA (--with-cuda) and UCX (--with-ucx) support:
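A minimal build sketch (the prefixes are assumptions; point --with-ucx at the UCX installation built above):
$ ./configure --prefix=/opt/openmpi --with-cuda=/usr/local/cuda --with-ucx=/opt/ucx
$ make -j && sudo make install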
OSU Microbenchmarks for OpenMPI
Build the OSU Micro-Benchmarks with CUDA support (--enable-cuda):
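A build sketch (assuming the OpenMPI compiler wrappers built above are in the PATH; the CUDA paths are assumptions):
$ ./configure CC=mpicc CXX=mpicxx --enable-cuda \
      --with-cuda-include=/usr/local/cuda/include \
      --with-cuda-libpath=/usr/local/cuda/lib64
$ make -j && sudo make install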
If you wish to use MVAPICH2 MPI instead of OpenMPI:
MVAPICH2-GDR
OSU Microbenchmarks for MVAPICH2
Setting up GPU Clock
For micro-benchmarks, and possibly for application benchmarks, set the GPU clock speed to its peak value via the nvidia-smi command:
-lgc --lock-gpu-clocks= Specifies <minGpuClock,maxGpuClock> clocks as a pair (e.g. 1500,1500) that defines the range of desired locked GPU clock speed in MHz. Setting this will supersede application clocks and take effect regardless of whether an app is running. Input can also be a singular desired clock value
-ac --applications-clocks= Specifies <memory,graphics> clocks as a pair (e.g. 2000,800) that defines GPU's speed in MHz while running applications on a GPU.
For example:
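A sketch for the A100 GPUs used here (1410 MHz is the A100 peak graphics clock; adjust the value for your GPU, and note that locking clocks requires root privileges):
# Lock the graphics clock of GPU 0 to its peak value
$ sudo nvidia-smi -i 0 -lgc 1410,1410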
Benchmarking
MPI Commands
Here are example MPI commands from our Tessa cluster (A100 GPUs with ConnectX-6 HDR adapters linked over PCIe Gen3).
Open MPI
Generic OpenMPI and UCX Flags
UCX_NET_DEVICES=mlx5_3:1
This selects the relevant network adapter to be used; it is important in cases where there is more than one adapter or more than one port. In this example, the mlx5_3 device is located on the same PCIe switch as the GPU (for more info, see the Adapter/GPU Localization section above).
-map-by dist -mca rmaps_dist_device
This option selects a CPU core located on the same NUMA node as the adapter. Some systems have multiple NUMA nodes, and for better latency it is important to select the closest one.
-x UCX_RNDV_THRESH=1024
This option sets the rendezvous message-size threshold (the point at which UCX switches from the eager protocol to the rendezvous protocol). An internal formula selects the default value per system; in some cases, manually tuning this number can overcome poor performance. A value of 1024 gave better performance on the Tessa cluster tested here.
GPUDirect RDMA + GDR Copy (both enabled)
UCX_IB_GPU_DIRECT_RDMA=1
This forces the GPUDirect RDMA feature; if the package is not installed, the run will fail. A value of 0 disables it.
UCX_TLS=rc,cuda_copy,gdr_copy:
cuda_copy is the basic transport needed to enable CUDA memory support
gdr_copy enables GDRCopy
rc will use the InfiniBand RC (Reliable Connection) transport mode
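Putting these flags together, a full osu_latency run for this case could look like the following sketch (the host names and the 'D D' arguments selecting device buffers in the OSU benchmark are assumptions based on the setup above):
$ mpirun -np 2 -H tessa001,tessa002 \
      -mca pml ucx -map-by dist -mca rmaps_dist_device mlx5_3 \
      -x UCX_NET_DEVICES=mlx5_3:1 -x UCX_RNDV_THRESH=1024 \
      -x UCX_IB_GPU_DIRECT_RDMA=1 -x UCX_TLS=rc,cuda_copy,gdr_copy \
      ./osu_latency D D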
GPUDirect RDMA disabled and GDR Copy enabled
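The original command line is not reproduced here; based on the flags above, the same mpirun command would be used with the UCX variables changed to (a sketch):
-x UCX_IB_GPU_DIRECT_RDMA=0 -x UCX_TLS=rc,cuda_copy,gdr_copy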
GPUDirect and GDR Copy are disabled
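Likewise, with both features disabled the UCX variables would become (a sketch):
-x UCX_IB_GPU_DIRECT_RDMA=0 -x UCX_TLS=rc,cuda_copy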
MVAPICH MPI
Command:
GPUDirect RDMA (enabled)
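The exact command line is not reproduced here; with MVAPICH2-GDR, a run with GPUDirect RDMA and GDRCopy enabled could look like the following sketch (host names and the benchmark path are assumptions; MV2_USE_CUDA, MV2_USE_GPUDIRECT and MV2_USE_GDRCOPY are standard MVAPICH2-GDR runtime parameters):
$ mpirun_rsh -np 2 tessa001 tessa002 \
      MV2_USE_CUDA=1 MV2_USE_GPUDIRECT=1 MV2_USE_GDRCOPY=1 \
      ./osu_latency D D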
No GPUDirect RDMA (disabled)
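For this case the same command would be used with GPUDirect RDMA turned off (a sketch):
MV2_USE_CUDA=1 MV2_USE_GPUDIRECT=0 MV2_USE_GDRCOPY=1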
No GPUDirect no GDR Copy
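And with both GPUDirect RDMA and GDRCopy turned off (a sketch):
MV2_USE_CUDA=1 MV2_USE_GPUDIRECT=0 MV2_USE_GDRCOPY=0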
Results
Back to Back Setup
The results below were achieved with OpenMPI/UCX configuration as specified above.
Note: In this setup, the PCIe speed is x16 Gen3 while the adapter is InfiniBand HDR, so the bandwidth limitation is the maximum PCIe speed.
Latency
An over 8.5X performance boost is achieved for messages up to 128B (comparing runs with and without GPUDirect RDMA and GDRCopy), and over 5X for 128B-4KB messages.
Another clear observation is that GDRCopy provides a latency benefit for small messages.
For small messages, GPUDirect RDMA by itself is not enough for best performance.
Note that different systems may need different tuning of the UCX_RNDV_THRESH parameter for best performance.
mpirun commands:
For large message sizes, similar behavior can be seen. The main benefit comes from enabling GPUDirect RDMA; GDRCopy has no effect there.
Uni-Directional Bandwidth
GPUDirect RDMA pushes the GPU bandwidth to the maximum PCIe capacity. GDRCopy does not influence bandwidth.
mpirun commands:
Bi-Directional Bandwidth
GPUDirect RDMA pushes the GPU bi-directional bandwidth to the maximum PCIe capacity.
mpirun commands:
Appendix: GPUDirect example over 200Gb/s HDR InfiniBand
The following benchmark was done with a similar configuration, using the following hardware:
CPU: AMD EPYC Rome 7742 (64 cores)
GPU: Nvidia DGX A100
Network: HDR InfiniBand (via switch)
Configuration
In this cluster, testing was done in a container with a setup similar to that described above.
Run time parameters
The runtime parameters are similar, except that UCX_RNDV_THRESH=32768 is used instead of the 1024 used above.
Benchmarking
Latency
Uni-Directional Bandwidth
Bi-Directional Bandwidth
Note: Without GPUDirect RDMA, the bandwidth reported by osu_bw, for example, is almost the same as that reported by osu_bibw, because the limitation is the shared PCIe x16 link to the CPU, used by the adapter and the GPU in parallel, which creates congestion on the PCIe. This is an artifact of this setup.
Application example
MILC
The MIMD Lattice Computation (MILC) represents part of a set of codes used to study quantum chromodynamics (QCD), the theory of the strong interactions of subatomic physics.
It performs simulations of four dimensional SU(3) lattice gauge theory on MIMD parallel machines. "Strong interactions" are responsible for binding quarks into protons and neutrons and holding them all together in the atomic nucleus.
The MILC collaboration has produced application codes to study several different QCD research areas.
Configuration
Here are the MPI flags used for this test:
GPUDirect Disabled | GPUDirect Enabled
The cluster used here is Selene, with 4 GPUs per node in use (GPUs 0, 2, 4, and 6) so that each GPU has a dedicated x16 link to the CPU, and using the adapter closest to each GPU (on the same PCIe switch).
Results
The results show a 33% improvement at 256 GPUs.
References
https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html
Benchmarking GPUDirect RDMA on Modern Server Platforms | NVIDIA Technical Blog