AMD 2nd Gen EPYC CPU Tuning Guide for InfiniBand HPC

This post is a setup guide for tuning servers based on AMD 2nd Generation EPYC™ CPUs (formerly codenamed “Rome”) to achieve maximum performance from ConnectX-6 HDR InfiniBand adapters. It is based on testing on the “Daytona” AMD 2nd Gen EPYC platform with ConnectX-6 HDR InfiniBand adapters.

Note: OEMs may implement the BIOS configuration differently.

 

AMD 2nd Generation EPYC Rome SOC

The second generation EPYC processor models are built from eight CCD (Core Cache Die) chiplets, grouped as four quadrants; each quadrant comprises two CCDs and two memory channels. This differs from the 1st Generation EPYC, which was composed of four SoCs connected only through the AMD Infinity Fabric (TM) technology.

 

 

Memory DIMMs

It is recommended to populate 8 memory DIMMs per socket (one per memory channel, i.e. 2 DIMMs per quadrant). The CPU will then be able to push enough memory bandwidth to reach HDR InfiniBand speed.
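
As a rough way to check the DIMM population from a running Linux host, the sketch below counts populated slots in the output of dmidecode -t memory. The helper name count_populated_dimms is hypothetical, and the sketch assumes empty slots are reported as "Size: No Module Installed", which may vary between dmidecode versions. The count covers the whole system, so a fully populated dual-socket server would be expected to report 16.

```shell
# Count populated DIMM slots in "dmidecode -t memory" output read from stdin.
# Assumption: empty slots are reported as "Size: No Module Installed".
# Only lines that start with "Size:" are counted, so fields such as
# "Volatile Size:" are ignored.
count_populated_dimms() {
    awk '/^[ \t]*Size:/ && !/No Module Installed/ { n++ } END { print n + 0 }'
}

# Usage on the target host (requires root):
#   dmidecode -t memory | count_populated_dimms
```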

 

AMD 2nd Generation EPYC BIOS Recommendations

 

NPS Configuration

 

2nd Gen EPYC processors introduce a feature called NUMA Nodes Per Socket (NPS). With this feature, a single socket can be divided into up to 4 NUMA nodes. Each NUMA node can only use its assigned memory controllers.

The configuration options for NPS are 1, 2, or 4.

With NPS=1, each AMD 2nd Gen CPU socket is a single NUMA domain containing all of its cores and all of its memory channels. Memory is interleaved across the eight memory channels.

 

The image below represents what a dual socket system NUMA configuration looks like with the default settings of NPS=1.

 

 

NPS2 partitions the CPU into two NUMA domains, with half the cores and half the memory channels on the socket in each NUMA domain. Memory is interleaved across the four memory channels in each NUMA domain.

 

 

NPS4 partitions the CPU into four NUMA domains. Each quadrant is its own NUMA domain, and memory is interleaved across the two memory channels of that quadrant. PCIe devices are local to one of the four NUMA domains on the socket, depending on which quadrant of the IO die has the PCIe root for that device.

 

Recommendations:

We recommend checking with the OEM system vendor for the best NPS configuration on your specific system.

  1. For maximum bandwidth in micro-benchmark tests over InfiniBand HDR (200Gb/s) on PCIe Gen4 with EPYC 7002 Series processors, NPS=1 or NPS=2 is recommended because each NUMA domain then has more memory channels (more memory channels = more bandwidth).

  2. For maximum bandwidth in micro-benchmark tests over InfiniBand HDR100 or EDR with 2nd Gen EPYC, any NPS configuration will reach line rate.

  3. For HPC application testing, the choice between NPS=2 and NPS=4 may depend on the application's ability to achieve better locality between the CPU cores and the memory.

  4. NPS=2 is the best generic default. However, specific applications might need more testing to tune their performance.

  5. For 100Gb/s, any NPS configuration will work.

 

To enable NPS2 (allocate memory locally to NUMA node where NIC resides), set the following on the BIOS:

Advanced → AMD CBS → DF Common Options → Memory Addressing → NUMA nodes per socket → NPS2

Note: This configuration is important to achieve the max 200Gb/s performance.
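
To confirm which NPS value is actually in effect after a reboot, one option is to divide the NUMA node count by the socket count reported by lscpu. The following is a minimal sketch; the helper name numa_nodes_per_socket is hypothetical. It assumes L3 as NUMA is disabled, since that option adds one NUMA node per L3 cache and makes the ratio larger than the NPS setting.

```shell
# Derive the number of NUMA nodes per socket from lscpu output on stdin.
# Leading whitespace is tolerated because newer lscpu versions indent fields.
numa_nodes_per_socket() {
    awk -F: '
        /^[ \t]*Socket\(s\):/    { sockets = $2 + 0 }
        /^[ \t]*NUMA node\(s\):/ { nodes = $2 + 0 }
        END {
            if (sockets > 0) {
                print nodes / sockets
            } else {
                print "unknown"
                exit 1
            }
        }'
}

# Usage on the target host:
#   lscpu | numa_nodes_per_socket
```

With NPS2 on a dual-socket system, lscpu would be expected to report 4 NUMA nodes and the helper to print 2.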

Preferred IO Device

 

There is an option to mark a specific PCI slot as preferred for scheduling.

If there is only one PCI device on the host (e.g. one HCA), it is recommended to set the preferred IO to that PCI slot. This reduces the scheduling overhead between the CPU and the PCI components and gives the adapter better performance.

If there is more than one PCI device (e.g. an HCA and a GPU, or multiple HCAs), this option should be tested first, since the other PCI devices may suffer from the preferred scheduling.

 

To enable preferred IO (improving PCIe ordering performance), do the following on the BIOS:

Advanced → AMD CBS → NBIO Common Options → Preferred IO → Manual

Advanced → AMD CBS → NBIO Common Options → Preferred IO Bus → <Bus Num>

 

Note: This can be configured to only one PCIe function per system.
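
The bus number for the Preferred IO Bus setting can be looked up from Linux before entering the BIOS. The sketch below extracts the bus field of Mellanox devices from lspci -D output; the helper name hca_pci_buses is hypothetical. lspci prints the bus in hexadecimal, and whether the BIOS field expects hexadecimal or decimal is an assumption to verify on the specific platform.

```shell
# Print the PCI bus number (hexadecimal, as shown by lspci) of every
# Mellanox device found in "lspci -D" output read from stdin.
# With -D the first field is domain:bus:device.function.
hca_pci_buses() {
    awk 'tolower($0) ~ /mellanox/ { split($1, a, ":"); print a[2] }'
}

# Usage on the target host:
#   lspci -D | hca_pci_buses
```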

BIOS Performance Mode

To enable performance mode, set the following in the BIOS:

Advanced → AMD CBS → NBIO Common Options → SMU Common Options → Determinism Control → Manual

Advanced → AMD CBS → NBIO Common Options → SMU Common Options → Determinism Slider → Performance

x2APIC MSI Interrupts

AMD has implemented an x2APIC controller. This has two benefits:

  1. It allows the operating systems to work with the 256 CPU threads.

  2. It provides improved performance over the legacy APIC.

AMD recommends, but does not require, that you enable the x2APIC mode in BIOS, even for lower core count parts. (The AMD BIOS will enable x2APIC automatically when two 64-core processors are installed.)

 

To enable x2APIC (the default is xAPIC MSI interrupts), set the following in the BIOS:

Advanced → AMD CBS → CPU Common Options → Local APIC Mode → x2APIC

 

APBDIS (P-States)

Like processor cores, the EPYC SOC can go into lower power states (P-states) when lightly used. This reduces the power consumption of the overall socket or allows power to be diverted to other portions of the processor. By default, P-states are enabled in the processor to provide the best performance per watt. For high performance, disable P-state switching and force P0 at all times by setting APBDIS to 1 in the system BIOS. Higher-bandwidth adapters may require forcing the P0 state to maintain the highest bandwidth.

To disable P-state switching (set APBDIS to 1) and fix the SOC P-state to P0, set the following in the BIOS:

Advanced → AMD CBS → NBIO Common Options → SMU Common Options → APBDIS → 1

Advanced → AMD CBS → NBIO Common Options → SMU Common Options → Fixed SOC Pstate=P0

L3 as NUMA (For TCP Applications)

 

The EPYC processors have multiple L3 caches. While operating systems can handle multiple last-level caches (LLCs) and schedule jobs accordingly, AMD has created a BIOS option that describes each LLC as its own NUMA domain. This can help the operating system scheduler maintain locality to the LLC without causing unnecessary cache-to-cache transactions. L3 as NUMA does not affect memory interleaving or bandwidth compared to NPS1, but it gives the kernel scheduler hints to place tasks based on L3 proximity.

 

To enable L3 as NUMA (expose all CCX and their L3), do the following:

Advanced → AMD CBS → DF Common Options → ACPI → ACPI SRAT L3 Cache as NUMA Domain → Enabled

Note: This configuration is optional but may yield better results depending on the application and the underlying operating system. L3 as NUMA creates additional NUMA nodes and therefore may not be needed for HPC applications. See the example below.

$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                128
On-line CPU(s) list:   0-127
Thread(s) per core:    1
Core(s) per socket:    64
Socket(s):             2
NUMA node(s):          32
Vendor ID:             AuthenticAMD
CPU family:            23
Model:                 49
Model name:            AMD EPYC 7742 64-Core Processor
Stepping:              0
CPU MHz:               2250.000
CPU max MHz:           2250.0000
CPU min MHz:           1500.0000
BogoMIPS:              4491.36
Virtualization:        AMD-V
L1d cache:             32K
L1i cache:             32K
L2 cache:              512K
L3 cache:              16384K
NUMA node0 CPU(s):     0-3
NUMA node1 CPU(s):     4-7
NUMA node2 CPU(s):     8-11
NUMA node3 CPU(s):     12-15
NUMA node4 CPU(s):     16-19
NUMA node5 CPU(s):     20-23
NUMA node6 CPU(s):     24-27
NUMA node7 CPU(s):     28-31
NUMA node8 CPU(s):     32-35
NUMA node9 CPU(s):     36-39
NUMA node10 CPU(s):    40-43
NUMA node11 CPU(s):    44-47
NUMA node12 CPU(s):    48-51
NUMA node13 CPU(s):    52-55
NUMA node14 CPU(s):    56-59
NUMA node15 CPU(s):    60-63
NUMA node16 CPU(s):    64-67
NUMA node17 CPU(s):    68-71
NUMA node18 CPU(s):    72-75
NUMA node19 CPU(s):    76-79
NUMA node20 CPU(s):    80-83
NUMA node21 CPU(s):    84-87
NUMA node22 CPU(s):    88-91
NUMA node23 CPU(s):    92-95
NUMA node24 CPU(s):    96-99
NUMA node25 CPU(s):    100-103
NUMA node26 CPU(s):    104-107
NUMA node27 CPU(s):    108-111
NUMA node28 CPU(s):    112-115
NUMA node29 CPU(s):    116-119
NUMA node30 CPU(s):    120-123
NUMA node31 CPU(s):    124-127
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc art rep_good nopl xtopology nonstop_tsc extd_apicid aperfmperf eagerfpu pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_l2 cpb cat_l3 cdp_l3 hw_pstate sme retpoline_amd ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif umip overflow_recov succor smca

 

Simultaneous Multithreading (SMT)

For HPC applications, disable multithreading if it is not needed:

Advanced → AMD CBS → CPU Common → Performance → CCD/CORE/Thread → Accept → SMT Control → SMT = Disabled
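
Whether SMT is really off after the BIOS change can be checked from the "Thread(s) per core" field of lscpu, as in the lscpu listing earlier in this post where the value is 1. A minimal sketch, with the hypothetical helper name smt_state:

```shell
# Report whether SMT is enabled, based on lscpu output read from stdin.
# More than one thread per core means SMT is active.
smt_state() {
    awk -F: '
        /^[ \t]*Thread\(s\) per core:/ {
            print ((($2 + 0) > 1) ? "enabled" : "disabled")
            found = 1
        }
        END { if (!found) exit 1 }'
}

# Usage on the target host:
#   lscpu | smt_state
```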

 

Transparent Secure Memory Encryption (TSME)

While SME provides a lot of flexibility for managing main memory encryption, it does require support in the OS/HV. Systems that need only the physical protection of SME, but run legacy OS or HV software, may use a mode called Transparent SME (TSME). In TSME, all memory is encrypted regardless of the value of the C-bit on any particular page. This mode provides a simple method to enable encryption without requiring software modifications.

Disable TSME if it is not needed:

Advanced → AMD CBS → UMC → DDR4 → Security → TSME = Disabled

IOMMU

Disable IOMMU for bare-metal installations:

Advanced → AMD CBS → NBIO → IOMMU = Disabled

If you need to keep the IOMMU enabled because of VFIO or VF requirements, add iommu=pt to the kernel command line.
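
When the IOMMU has to stay enabled, it is worth verifying that the running kernel was really booted with iommu=pt. The sketch below checks a kernel command line for that exact token; the helper name has_iommu_pt is hypothetical, and the optional file argument exists only so the check can be exercised on a saved copy.

```shell
# Succeed (exit status 0) if the kernel command line contains iommu=pt.
# $1: file holding the command line; defaults to /proc/cmdline.
has_iommu_pt() {
    tr ' ' '\n' < "${1:-/proc/cmdline}" | grep -qx 'iommu=pt'
}

# Usage on the target host:
#   has_iommu_pt && echo "iommu=pt is set" || echo "iommu=pt is missing"
```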

Configurable Thermal Design Power (cTDP)

Configurable TDP (cTDP) is a mechanism to change the standard TDP of a processor to a lower TDP that requires less power and cooling compared to the standard TDP. 

For high performance, disable cTDP (set it to 255) so that the processor power is not reduced:

Advanced → AMD CBS → NBIO → SMU → cTDP = 255

Power Limit Control

Disable Power limit control for high performance (set to 255):

Advanced → AMD CBS → NBIO → SMU → Package Power Limit = 255

DF C States

Disable DF C-states for high performance applications:

Advanced → AMD CBS → NBIO → SMU → DF Cstates = Disabled

Relaxed Ordering

Using relaxed ordering for PCI operations is a key mechanism to get maximum performance when targeting memory attached to AMD 2nd Gen EPYC CPUs. There are two ways to achieve this: using communication libraries that enable relaxed ordering through a new API, or forcing all traffic to use relaxed ordering. The former is preferred, since the latter can break libraries whose optimized communication relies on memory being written and becoming visible in a given order.

By default, the PCI_WR_ORDERING firmware parameter is set to "per_mkey":

# mlxconfig -d /dev/mst/mt4123_pciconf0 q | grep PCI_WR_ORDERING
         PCI_WR_ORDERING                     per_mkey(0)

In this configuration, applications can enable relaxed ordering per memory region.

  • UCX since v1.9 can use this feature when UCX_IB_PCI_RELAXED_ORDERING is set to "on". The default is "auto", which enables it on specific architectures and may change.

#
# Enable relaxed ordering for PCIe transactions to improve performance on some systems.
#
# syntax: <on|off|auto>
#
UCX_IB_PCI_RELAXED_ORDERING=auto

 

  • Perftests embedded in MLNX_OFED also use relaxed ordering by default since version 5.1-0.6.6.0.
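
For the UCX case above, the variable can be set explicitly in the job environment instead of relying on the "auto" default. This is a minimal sketch; it assumes the ucx_info utility from the UCX installation is on the PATH and skips the check otherwise. The mpirun line in the comment assumes an Open MPI based launcher such as HPC-X, and my_app is a placeholder application name.

```shell
# Request relaxed ordering for UCX memory registrations in this shell.
export UCX_IB_PCI_RELAXED_ORDERING=on

# If ucx_info is available, show the value UCX picked up from the environment.
if command -v ucx_info >/dev/null 2>&1; then
    ucx_info -c | grep PCI_RELAXED_ORDERING || true
fi

# With an Open MPI based launcher, the variable can be forwarded to all ranks:
#   mpirun -x UCX_IB_PCI_RELAXED_ORDERING=on ./my_app
```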

 

If using UCX is not possible, one can force all traffic to use relaxed ordering:

# mlxconfig -d /dev/mst/mt4123_pciconf0 q | grep ORDER
PCI_WR_ORDERING per_mkey(0)

# mlxconfig -d /dev/mst/mt4123_pciconf0 -s PCI_WR_ORDERING=1
…

# mlxconfig -d /dev/mst/mt4123_pciconf0 q | grep ORDER
PCI_WR_ORDERING force_relax(1)

 

For dual-port devices, set the second port to 1 as well.

# mlxconfig -d /dev/mst/mt4123_pciconf0.1 q | grep ORDER
PCI_WR_ORDERING per_mkey(0)

# mlxconfig -d /dev/mst/mt4123_pciconf0.1 -s PCI_WR_ORDERING=1
…

# mlxconfig -d /dev/mst/mt4123_pciconf0.1 q | grep ORDER
PCI_WR_ORDERING force_relax(1)

 

Note: Forcing relaxed ordering breaks NCCL up to version 2.7. Make sure to upgrade to NCCL 2.8 or later.
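
When several adapters or ports have to be audited, the mode can be extracted from the mlxconfig query output shown above rather than read by eye. The sketch below is one way to do that; the helper name pci_wr_ordering_mode is hypothetical, and it assumes the output format matches the listings in this post (parameter name followed by a value such as per_mkey(0) or force_relax(1)).

```shell
# Print the PCI_WR_ORDERING mode name from mlxconfig query output on stdin,
# e.g. "per_mkey" or "force_relax". Fails if the parameter is not present.
pci_wr_ordering_mode() {
    awk '
        $1 == "PCI_WR_ORDERING" {
            sub(/\(.*/, "", $2)   # drop the numeric suffix such as "(0)"
            print $2
            found = 1
        }
        END { exit !found }'
}

# Usage on the target host:
#   mlxconfig -d /dev/mst/mt4123_pciconf0 q | pci_wr_ordering_mode
```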

Benchmarking

Basic RDMA Testing

Follow https://hpcadvisorycouncil.atlassian.net/wiki/spaces/HPCWORKS/pages/1284603909/Basic+RDMA+Benchmark+Examples+for+AMD+2nd+Gen+EPYC+CPUs+over+HDR+InfiniBand to perform basic RDMA write latency and bandwidth testing.

OSU MPI Testing

Follow https://hpcadvisorycouncil.atlassian.net/wiki/spaces/HPCWORKS/pages/1284538459/OSU+Benchmark+Tuning+for+2nd+Gen+AMD+EPYC+using+HDR+InfiniBand+over+HPC-X+MPI to perform basic OSU testing.

 

References