The Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS) is a classical molecular dynamics code. LAMMPS has potentials for solid-state materials (metals, semiconductors), soft matter (biomolecules, polymers), and coarse-grained or mesoscopic systems. It can be used to model atoms or, more generically, as a parallel particle simulator at the atomic, meso, or continuum scale. LAMMPS runs on single processors or in parallel using message-passing techniques and a spatial decomposition of the simulation domain. LAMMPS is distributed as open-source code under the terms of the GPL. More information can be found at the LAMMPS web site: http://lammps.sandia.gov.

Introduction to LAMMPS

Slides are here:

Build LAMMPS for CPUs

This is an example that uses HPC-X MPI.

  1. Load the relevant modules

module load intel/2020.4.304
module load mkl/2020.4.304
module load tbb/2020.4.304
module load hpcx-2.7.0

  2. Make

# Download latest stable version from https://lammps.sandia.gov/download.html.
tar xf lammps-stable.tar.gz
cd lammps-29Oct20/src
TARGET=intel_cpu_openmpi
sed -e "s/xHost/xCORE-AVX512/g" -i MAKE/OPTIONS/Makefile.$TARGET

make clean-all
make no-all
make no-lib

make yes-manybody yes-molecule yes-replica yes-kspace yes-asphere yes-rigid yes-snap yes-user-omp yes-user-reaxc
make yes-user-intel
make -j 32 $TARGET
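
As a quick sanity check, the help output of the new binary lists the packages that were compiled in (the binary name follows the $TARGET used above; the exact wording of the help text is an assumption for this LAMMPS version):

# Verify that the expected packages (USER-INTEL, USER-OMP, USER-REAXC, ...) were built in
./lmp_intel_cpu_openmpi -h | grep -i -A 15 "installed packages"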

Input example

For the “3d Lennard-Jones melt” benchmark, see https://lammps.sandia.gov/bench/in.lj.txt
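
A simple way to fetch that reference input (the local file name in.lj is just a choice here):

# Download the 3d LJ melt benchmark input and save it locally as in.lj
wget https://lammps.sandia.gov/bench/in.lj.txt -O in.lj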

Run example

mpirun -np 160 lammps/lmp_mpi-hpcx-2.7.0.AVX512 -in in.lj.all.inp -pk intel 0 omp 1 -sf intel
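
Below is a minimal batch-script sketch for a Slurm-based cluster; the node count, tasks per node, walltime, and binary path are placeholders and should be adapted to the cluster you run on:

#!/bin/bash
#SBATCH --nodes=4                # up to 4 CPU nodes are allowed for this run
#SBATCH --ntasks-per-node=40     # placeholder: match the number of cores per node
#SBATCH --time=00:30:00

module load intel/2020.4.304 mkl/2020.4.304 tbb/2020.4.304 hpcx-2.7.0
# Without -np, mpirun launches one rank per allocated Slurm task
mpirun lammps/lmp_mpi-hpcx-2.7.0.AVX512 -in in.lj.all.inp -pk intel 0 omp 1 -sf intel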

Output example

...
Neighbor list info ...
  update every 20 steps, delay 0 steps, check no
  max neighbors/atom: 2000, page size: 100000
  master list distance cutoff = 2.8
  ghost atom cutoff = 2.8
  binsize = 1.4, bins = 24 24 24
  1 neighbor lists, perpetual/occasional/extra = 1 0 0
  (1) pair lj/cut/intel, perpetual
      attributes: half, newton on, intel
      pair build: half/bin/newton/intel
      stencil: half/bin/3d/newton
      bin: intel
Per MPI rank memory allocation (min/avg/max) = 2.826 | 3.045 | 3.384 Mbytes
Step Temp E_pair E_mol TotEng Press 
       0         1.44   -6.7733683            0   -4.6134358    -5.019707 
     100   0.75745333   -5.7585066            0   -4.6223621   0.20725895 
Loop time of 0.0174524 on 640 procs for 100 steps with 32000 atoms

Performance: 2475308.243 tau/day, 5729.880 timesteps/s
94.1% CPU use with 640 MPI tasks x no OpenMP threads

MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
Pair    | 0.0010798  | 0.0013214  | 0.0016188  |   0.2 |  7.57
Neigh   | 0.00021591 | 0.00024434 | 0.0003079  |   0.0 |  1.40
Comm    | 0.015171   | 0.015479   | 0.01573    |   0.1 | 88.69
Output  | 9.0258e-05 | 0.00011218 | 0.00014501 |   0.0 |  0.64
Modify  | 0.00017915 | 0.00018453 | 0.00020567 |   0.0 |  1.06
Other   |            | 0.0001111  |            |       |  0.64

Nlocal:        50.0000 ave          61 max          40 min
Histogram: 9 36 73 97 133 133 96 37 18 8
Nghost:        725.892 ave         741 max         706 min
Histogram: 1 8 23 52 121 142 159 81 42 11
Neighs:        1879.43 ave        2521 max        1383 min

We will be looking at the performance in terms of the tau/day and timesteps/s values.
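
A quick way to pull those numbers out of the run (assuming the default log file name log.lammps):

# Extract the performance summary (tau/day and timesteps/s) from the log
grep "Performance:" log.lammps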

For GPUs

LAMMPS Accelerator Package Documentation:

https://lammps.sandia.gov/doc/Speed_packages.html
https://lammps.sandia.gov/doc/Build_extras.html

On GPUs, the timing breakdown won’t be accurate unless CUDA_LAUNCH_BLOCKING=1 is set (this will slow down the simulation, though).

By default with Kokkos, KSpace (including the FFTs) runs on the GPU, but it can be changed to run on the CPU and overlap with the bonded force interactions.
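
As a rough sketch, a Kokkos/CUDA build and run with the classic makefiles could look like the following; the GPU architecture (Volta70), the GPU count, and the input file name are assumptions for your system:

# Build the KOKKOS package for CUDA (adjust KOKKOS_ARCH to the GPUs on your cluster)
cd lammps-29Oct20/src
make yes-kokkos
make -j 32 kokkos_cuda_mpi KOKKOS_ARCH=Volta70

# Set this only when you need the accurate timing breakdown mentioned above (it slows the run)
export CUDA_LAUNCH_BLOCKING=1

# One MPI rank per GPU; -k/-sf enable the Kokkos accelerator styles at run time
mpirun -np 4 ./lmp_kokkos_cuda_mpi -in in.lj.all.inp -k on g 4 -sf kk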

Configuration Changes allowed

  • Number of MPI ranks

  • MPI and thread affinity (see the mpirun sketch after this list)

  • Number of OpenMP threads per MPI rank

  • Compiler optimization flags

  • Can compile with “-default-stream per-thread”

  • FFT library

  • MPI library

  • Compiler version

  • CUDA version

  • CUDA-aware flags

  • CUDA MPS on/off

  • Can use any LAMMPS accelerator package

  • Any package option (see https://lammps.sandia.gov/doc/package.html), except precision

  • Coulomb cutoff

  • Can use atom sorting

  • Newton flag on/off

  • Can add load balancing

  • Can use LAMMPS “processors” command

  • Can turn off tabulation in pair_style (i.e., “pair_modify table 0”)

  • Can use multiple Kokkos backends (e.g. CUDA + OpenMP)

  • Can use “kk/device” or “kk/host” suffix for any kernel to run on CPU or GPU
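
To illustrate a few of the items above, here is a hedged example that changes the rank count, the OpenMP thread count, and the binding (Open MPI/HPC-X mpirun syntax; all counts and the binary path are placeholders):

# 4 nodes, 4 MPI ranks per node, 10 OpenMP threads per rank, ranks bound to their cores
export OMP_NUM_THREADS=10
mpirun -np 16 --map-by ppr:4:node:PE=10 --bind-to core \
    lammps/lmp_mpi-hpcx-2.7.0.AVX512 -in in.lj.all.inp -sf omp -pk omp 10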

Configuration Changes not allowed

  • Modifying any style: pair, fix, kspace, etc.

  • Number of atoms

  • Timestep value

  • Number of timesteps

  • Neighborlist parameters (except binsize)

  • Changing precision (must use double precision FP64)

  • LJ charmm cutoff

Visualizing the results

Here is an example:

Input Files

LJ and Rhodo input files can be downloaded here.

Tasks and Submissions

Note: you need to run this on both clusters, using up to 4 CPU nodes or 4 GPUs, and to supply two results, one per cluster.

  1. Run LAMMPS with the given inputs. On both the Niagara and NSCC clusters, you can use up to 4 nodes or 4 GPUs for this run. Submit the results to the OneDrive team folder. Change the tunable parameters and see what gives you the best performance.

  2. Run an IPM profile of LAMMPS on one of the clusters on 4 nodes. What are the main MPI calls used? Submit the profile results in PDF format (see the IPM sketch after this list).

  3. Visualize the input files. Generate a video or an image from the run.

  4. For teams with a Twitter account, tweet your video or image, tagged with your team name/university and the hashtags #ISC21 #ISC21_SCC #LAMMPS.
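
For task 2, here is a sketch of collecting the IPM profile via preloading; the module name, library path, and report tooling are assumptions that depend on how IPM is installed on the cluster:

# Collect an IPM profile of the same run (paths and module names are placeholders)
module load ipm                      # assumption: an IPM module is available
export IPM_REPORT=full               # print the full MPI call breakdown at the end of the run
export IPM_LOG=full                  # also write the XML log that ipm_parse can post-process
LD_PRELOAD=$IPM_LIBDIR/libipm.so \
    mpirun -np 160 lammps/lmp_mpi-hpcx-2.7.0.AVX512 -in in.lj.all.inp -pk intel 0 omp 1 -sf intel
# ipm_parse -html <xml_log>          # converts the XML log into an HTML report (export to PDF)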
