Getting Started with MILC

The MIMD Lattice Computation (MILC) code is part of a set of codes used to study quantum chromodynamics (QCD), the theory of the strong interactions of subatomic physics. It performs simulations of four-dimensional SU(3) lattice gauge theory on MIMD parallel machines. The MILC collaboration has produced application codes to study several different QCD research areas.

Strong interactions are responsible for binding quarks into protons and neutrons and holding them all together in the atomic nucleus.

This post walks you through installation, configuration best practices, and performance benchmarking.


Build and Install

The build described below was done with MILC commit 77d89f04bdc8fb55ebd40d555cb1f54c4b39d105, dated May 29, 2020.

git clone https://github.com/milc-qcd/milc_qcd.git
cd milc_qcd
git checkout develop
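
To pin the build to the exact commit noted above, rather than the tip of the develop branch, that commit can be checked out directly; a minimal sketch:

git checkout 77d89f04bdc8fb55ebd40d555cb1f54c4b39d105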

Set up the build environment with Intel compilers, MKL library and HPC-X (Open MPI):

module load intel/2020.1.217 mkl/2020.1.217 hpcx/2.6.0

The model of interest is called “ks_imp_rhmc”:

cd ks_imp_rhmc
cp ../Makefile .

After copying the upper-level Makefile into the model’s subdirectory, edit it to reflect the compilers being used and certain build options. For the clusters available in the HPC-AI Advisory Council Cluster Center, we did two separate builds, with the following selections in the Makefile:

  1. Build for Intel Haswell and Broadwell processors, as well as for AMD EPYC processors.

  2. Build for Intel Skylake and Cascade Lake processors.
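
For these two builds, here is a minimal sketch of the kind of Makefile selections involved, assuming the usual MILC Makefile variables (MPP, CC, OMP, OPT, ARCH); the ARCH labels and optimization flags shown are illustrative assumptions, not the exact values used:

# Common settings (sketch): MPI build with OpenMP enabled
MPP = true
CC = mpicc
OMP = true

# 1. Intel Haswell/Broadwell and AMD EPYC build (hypothetical label and flags)
ARCH = avx2                    # the ARCH value becomes the executable suffix
OPT = -O3 -xCORE-AVX2          # on AMD EPYC, -march=core-avx2 may be preferable

# 2. Intel Skylake/Cascade Lake build (hypothetical label and flags)
ARCH = avx512
OPT = -O3 -xCORE-AVX512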

After making the appropriate edits to the Makefile(s), build the executables:
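
As a sketch, assuming the su3_rhmd_hisq target commonly used for MILC benchmarking, with one build per Makefile variant above:

make su3_rhmd_hisq    # the binary name is suffixed according to the ARCH setting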

After the builds complete, there will be two executables, distinguished by a suffix determined by the “ARCH” setting from the Makefile.

Performance Benchmarks

We ran MILC with the medium-size benchmark input (36x26x36x72.chklat).
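
For reference, here is a minimal sketch of one possible launch line for the 32-node Helios configuration profiled below (256 MPI ranks, 4 ranks per socket, 5 OpenMP threads per rank); the executable name, input file name, and binding options are assumptions:

# 32 nodes x 8 ranks/node = 256 MPI ranks; placeholder executable and input names
mpirun -np 256 --map-by ppr:4:socket:PE=5 --bind-to core \
       -x OMP_NUM_THREADS=5 \
       ./su3_rhmd_hisq_avx512 < medium_benchmark.in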

Software used:


Cluster used:


Helios Cluster Performance

95% scalability is observed between 16 and 32 nodes. Slightly better performance is seen when using 5 OpenMP threads compared to 10 OpenMP threads.


Thor Cluster Performance

Superlinear scalability is observed between 16 and 32 nodes. A 15% performance gain is seen when using 2 OpenMP threads compared to 8 OpenMP threads on 32 nodes.


Profile

The MILC profile below is based on a 32-node run on the Helios cluster.

MPI Communication


MPI Message Sizes

91% of MPI communication time is spent in MPI_Wait, while 5% is spent in MPI_Allreduce with 8-byte messages. In addition, we see asynchronous (non-blocking) send and receive communication.


MPI time

MPI time among the 256 ranks, sorted by time spent in MPI, shows a load imbalance of about 20% between the ranks that spend the most and the least time in MPI:

MPI time ordered by rank shows an imbalance between sockets (4 MPI ranks per socket, 5 OpenMP threads per MPI rank).

Communication Matrix

Memory Footprint

Summary

95% scaling was achieved from 16 to 32 nodes for the medium benchmark on the Helios cluster, using an HDR InfiniBand network. A 3% difference was seen when comparing 5 OpenMP threads to 10 OpenMP threads on Helios.

On the Thor cluster, using an HDR100 InfiniBand network, superlinear scaling was seen for the medium benchmark when scaling from 16 to 32 nodes. A 15% difference was seen when comparing 2 OpenMP threads to 8 OpenMP threads.

On both clusters, a lower number of OpenMP threads performed better than a higher number of OpenMP threads.

While profiling MILC, we noticed asynchronous point-to-point communication along with an 8-byte MPI_Allreduce collective. In addition, there is a high percentage of MPI_Wait time due to load imbalance in the application, which uses a 4D communication matrix.