Getting Started with MILC
The MIMD Lattice Computation (MILC) code is part of a set of codes used to study quantum chromodynamics (QCD), the theory of the strong interactions of subatomic physics. It performs simulations of four-dimensional SU(3) lattice gauge theory on MIMD parallel machines. The MILC collaboration has produced application codes to study several different QCD research areas.
Strong interactions are responsible for binding quarks into protons and neutrons and holding them all together in the atomic nucleus.
This post walks you through installation, configuration best practices, and performance results.
Build and Install
The build described below was done with commit 77d89f04bdc8fb55ebd40d555cb1f54c4b39d105 dated May 29, 2020 of MILC.
git clone https://github.com/milc-qcd/milc_qcd.git
cd milc_qcd
git checkout develop
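The develop branch moves over time, so to reproduce the exact state described in this post you can optionally check out the commit noted above:
git checkout 77d89f04bdc8fb55ebd40d555cb1f54c4b39d105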
Set up the build environment with Intel compilers, MKL library and HPC-X (Open MPI):
module load intel/2020.1.217 mkl/2020.1.217 hpcx/2.6.0
The model of interest is called “ks_imp_rhmc”:
cd ks_imp_rhmc
cp ../Makefile .
After the upper-level Makefile is copied into the model's subdirectory, it needs to be edited to reflect the compilers being used and certain build options. For the clusters available in the HPC-AI Advisory Council Cluster Center, we did two separate builds, with the following selections in the Makefile:
Build for Intel Haswell and Broadwell processors, as well as for AMD EPYC processors:
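The settings below are an illustrative sketch rather than the verbatim Makefile we used; the variable names follow the stock MILC Makefile, while the specific flag choices (for example -march=core-avx2, so that the same binary runs on both Intel and AMD parts) are our assumption:
# build the MPI (message-passing) version
MPP = true
# HPC-X MPI compiler wrapper around the Intel compiler
CC = mpicc
# AVX2 code path, usable on Haswell/Broadwell and on AMD EPYC
OPT = -O3 -march=core-avx2
# enable OpenMP threading
OMP = true
# used as the suffix of the executable name
ARCH = avx2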
Build for Intel Skylake and Cascade Lake processors:
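Similarly, a sketch of the AVX-512 build (again illustrative, not the verbatim Makefile):
# build the MPI (message-passing) version
MPP = true
# HPC-X MPI compiler wrapper around the Intel compiler
CC = mpicc
# AVX-512 code path for Skylake / Cascade Lake
OPT = -O3 -xCORE-AVX512
# enable OpenMP threading
OMP = true
# used as the suffix of the executable name
ARCH = skx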
After doing the appropriate edits to the Makefile(s), build the executables:
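For example, to build the HISQ RHMD executable commonly used for benchmarking (target name as found in the MILC source tree; pick a different target if you need another application):
make su3_rhmd_hisq
Since both builds come from the same source tree, you would typically run make clean (or keep separate copies of the tree) between the two builds so that the object files are recompiled with the new flags.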
After the builds complete, there will be two executables distinguished by a suffix determined by the “ARCH” setting from the Makefile:
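For example, with the illustrative ARCH values sketched above and the su3_rhmd_hisq target, the two binaries would end up with names along the lines of su3_rhmd_hisq_avx2 and su3_rhmd_hisq_skx (the exact naming depends on how your Makefile appends the suffix).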
Performance Benchmarks
We ran MILC with the medium-size benchmark input (36x36x36x72.chklat).
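For reference, a Helios-style launch of the medium benchmark (32 nodes, 4 MPI ranks per socket, 5 OpenMP threads per rank, 256 ranks in total, matching the profile discussed below) might look roughly like the following; the parameter file name medium.in is a placeholder for an input file that references the medium .chklat lattice, and the executable name follows the illustrative naming above (use whichever binary matches the cluster's processors):
mpirun -np 256 --map-by ppr:4:socket:PE=5 --bind-to core -x OMP_NUM_THREADS=5 ./su3_rhmd_hisq_avx2 < medium.in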
Software used:
OS: RHEL 7.7, MLNX_OFED 4.7.3
MPI: HPC-X 2.6.0
MILC: develop branch of https://github.com/milc-qcd/milc_qcd, commit 77d89f04bdc8fb55ebd40d555cb1f54c4b39d105 (May 29, 2020)
Cluster used:
Helios Cluster Performance
95% scalability is observed between 16 and 32 nodes. Slightly better performance is seen when using 5 OpenMP threads compared to 10 OpenMP threads.
Thor Cluster Performance
Super-linear scalability is observed between 16 and 32 nodes. About 15% better performance is seen when using 2 OpenMP threads compared to 8 OpenMP threads on 32 nodes.
Profile
The MILC profile below is based on a 32-node run on the Helios cluster.
MPI Communication
MPI Message Sizes
91% of MPI communication time is spent in MPI_Wait, while 5% is spent in 8-byte MPI_Allreduce calls. In addition, we see asynchronous (non-blocking) send and receive MPI communication.
MPI time
MPI time among the 256 ranks, sorted by time spent in MPI, shows that there is a load imbalance of about 20% between the ranks that spend the most and the least time in MPI:
MPI time ordered by rank shows the imbalance between sockets (4 MPI ranks per socket, 5 OpenMP threads per MPI rank).
Communication Matrix
Memory Footprint
Summary
95% scaling was achieved from 16 to 32 nodes for the medium benchmark on the Helios cluster, which uses an HDR InfiniBand network. A 3% difference was seen when comparing 5 OpenMP threads to 10 OpenMP threads on Helios.
On the Thor cluster, which uses an HDR100 InfiniBand network, super-linear scaling was seen on the medium benchmark when scaling from 16 to 32 nodes. A 15% difference was seen when comparing 2 OpenMP threads to 8 OpenMP threads.
On both clusters, a lower number of OpenMP threads per rank performed better than a higher number.
While profiling MILC, we noticed asynchronous point-to-point communication along with an 8-byte MPI_Allreduce collective. In addition, a high percentage of time is spent in MPI_Wait, due to load imbalance in the application, which communicates on a 4D grid (as reflected in the communication matrix).