Getting Started with MILC

The MIMD Lattice Computation (MILC) represents part of a set of codes used to study quantum chromodynamics (QCD), the theory of the strong interactions of subatomic physics. It performs simulations of four dimensional SU(3) lattice gauge theory on MIMD parallel machines. The MILC collaboration has produced application codes to study several different QCD research areas.

Strong interactions are responsible for binding quarks into protons and neutrons and holding them all together in the atomic nucleus.

This post will walk you through installation, best practices of configuration and performance.

1 Build and Install
2 Performance Benchmarks
- 2.1 Helios Cluster Performance
- 2.2 Thor Cluster Performance
3 Profile
- 3.1 MPI Communication
- 3.2 MPI Message Sizes
- 3.3 MPI time
- 3.4 Communication Matrix
- 3.5 Memory Footprint
4 Summary

Build and Install

The build described below was done with commit 77d89f04bdc8fb55ebd40d555cb1f54c4b39d105 dated May 29, 2020 of MILC.

git clone https://github.com/milc-qcd/milc_qcd.git
cd milc_qcd
git checkout develop

Set up the build environment with Intel compilers, MKL library and HPC-X (Open MPI):

module load intel/2020.1.217 mkl/2020.1.217 hpcx/2.6.0

The model of interest is called “ks_imp_rhmc”:

cd ks_imp_rhmc
cp ../Makefile .

After copying the upper-level Makefile into the model’s subdirectory, it needs to be edited to reflect the compilers being used and certain options. For the clusters available in the HPC-AI Advisory Council Cluster Center, we did two separate builds, with the following selections in the Makefile:

Build for Intel Haswell and Broadwell processors, as well as for AMD EPYC processors:

COMPILER=intel ARCH=hsw ## NOTE: change -xCORE-AVX2 to -march=core-avx2 in both Makefile ## and ../libraries/Make_vanilla, for use with EPYC processors WANTQUDA=false MPP=true PRECISION=2 OMP=true
Build for Intel Skylake and Cascade Lake processors:

COMPILER=intel ARCH=skx WANTQUDA=false MPP=true PRECISION=2 OMP=true

After doing the appropriate edits to the Makefile(s), build the executables:

## Edit Makefiles for HSW/BDW/EPYC, build the HSW/BDW/EPYC executable:
make su3_rhmd_hisq 2>&1 | tee make_i201h260_hsw.log
## Clean up, edit Makefiles for an SKX build, build the SKX executable
make clean
rm ../libraries/*.o
make su3_rhmd_hisq 2>&1 | tee make_i201h260_skx.log

After the builds complete, there will be two executables distinguished by a suffix determined by the “ARCH” setting from the Makefile:

[gerardo@login02 ks_imp_rhmc]$ ls -l su3_*.???
-rwxrwxr-x 1 gerardo hpcperf 3016096 Jun  5 10:43 su3_rhmd_hisq.hsw
-rwxrwxr-x 1 gerardo hpcperf 3028256 Jun  5 10:46 su3_rhmd_hisq.skx

Performance Benchmarks

We ran MILC with medium size benchmark (36x26x36x72.chklat) input.

Software used:

OS: RHEL 7.7, MLNX_OFED 4.7.3
MPI: HPC-X 2.6.0
MILC: develop branch of https://github.com/milc-qcd/milc_qcd, commit 77d89f04bdc8fb55ebd40d555cb1f54c4b39d105 (May 29 1 2020)

Cluster used:

https://hpcadvisorycouncil.atlassian.net/wiki/spaces/HPCWORKS/pages/7962701 Cluster with HDR InfiniBand
https://hpcadvisorycouncil.atlassian.net/wiki/spaces/HPCWORKS/pages/7864401 Cluster with HDR100 InfiniBand

Helios Cluster Performance

95% scalability between 16 to 32 nodes is observed. Slight better performance when using 5 OpenMP threads comparing to 10 OpenMP threads.

Thor Cluster Performance

Super Linear scalability between 16 to 32 nodes is observed. 15% performance is seen when using 2 OpenMP threads comparing to 8 OpenMP threads on 32 nodes.

Profile

MILC Profile based on 32 Nodes https://hpcadvisorycouncil.atlassian.net/wiki/spaces/HPCWORKS/pages/7962701 cluster.

MPI Communication

MPI Message Sizes

91% of MPI Communication spent on MPI_Wait while 5% MPI Allreduce 8 bytes. In addition, we see async send and receive MPI communication.

MPI time

MPI time among the 256 ranks sorted by time spent in MPI shows that rhere is a load imbalance of about 20% between the ranks that spend the most and the least time in MPI:

MPI time ordered by rank, shows imbalance between sockets (4 MPI ranks per socket, 5 OpenMP threads per MPI rank).

Communication Matrix

Memory Footprint

Summary

95% scaling was achieved from 16 to 32 nodes for the medium benchmark on https://hpcadvisorycouncil.atlassian.net/wiki/spaces/HPCWORKS/pages/7962701 Cluster, using HDR InfiniBand network. A 3% difference was seen comparing 5 OpenMP threads comparing to 10 OpenMP threads on https://hpcadvisorycouncil.atlassian.net/wiki/spaces/HPCWORKS/pages/7962701.

On https://hpcadvisorycouncil.atlassian.net/wiki/spaces/HPCWORKS/pages/7864401 Cluster using HDR100 InfiniBand network, Super linear scaling can be seen on the medium benchmark when scaling from 16 to 32 nodes. A 15% difference was seen comparing 2 OpenMP threads to 8 OpenMP threads.

On both clusters, low number of OpenMP threads did better than high number of OpenMP threads.

While running MILC Profile, we’ve noticed async P2P communication with 8 byte MPI Allreduce collective. In addition, there is high percentage of MPI wait due to imbalance of the application that uses 4D communication matrix.

HPC-Works