LAMMPS for ISC21

Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS) is a classical molecular dynamics code. LAMMPS has potentials for solid-state materials (metals, semiconductors), soft matter (biomolecules, polymers), and coarse-grained or mesoscopic systems. It can be used to model atoms or, more generically, as a parallel particle simulator at the atomic, meso, or continuum scale. LAMMPS runs on single processors or in parallel using message-passing techniques and a spatial decomposition of the simulation domain. LAMMPS is distributed as open-source code under the terms of the GPL. More information on LAMMPS can be found at the LAMMPS web site: http://lammps.sandia.gov.

 

 

Introduction to LAMMPS

 

Slides are here:

 

Build LAMMPS for CPUs

This is an example that uses HPC-X MPI.

  1. Load the relevant modules

module load intel/2020.4.304
module load mkl/2020.4.304
module load tbb/2020.4.304
module load hpcx-2.7.0

  2. Make

# Download the latest stable version from https://lammps.sandia.gov/download.html
tar xf lammps-stable.tar.gz
cd lammps-29Oct20/src
TARGET=intel_cpu_openmpi
sed -e "s/xHost/xCORE-AVX512/g" -i MAKE/OPTIONS/Makefile.$TARGET
make clean-all
make no-all
make no-lib
make yes-manybody yes-molecule yes-replica yes-kspace yes-asphere yes-rigid yes-snap yes-user-omp yes-user-reaxc
make yes-user-intel
make -j 32 $TARGET
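Once the build finishes, it is worth a quick sanity check before submitting jobs. LAMMPS names the binary after the Makefile target (here lmp_intel_cpu_openmpi), and the -h flag prints the version and the list of installed packages without running a simulation, so you can confirm USER-INTEL and the other packages made it in:

```shell
# Run from the src directory after "make -j 32 $TARGET".
# -h prints version, installed packages, and available styles, then exits.
./lmp_intel_cpu_openmpi -h | head -n 20
```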

Input example

For “3d Lennard-Jones melt”, see https://lammps.sandia.gov/bench/in.lj.txt

 

Run example

mpirun -np 160 lammps/lmp_mpi-hpcx-2.7.0.AVX512 -in in.lj.all.inp -pk intel 0 omp 1 -sf intel
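HPC-X is Open MPI based, so the usual Open MPI binding options apply when experimenting with affinity (one of the allowed tunables below). As a sketch, the same run with explicit core binding and the rank-to-core map printed for verification:

```shell
# Same run as above, with explicit binding; --report-bindings prints the
# rank-to-core map to stderr so you can verify the affinity you asked for.
mpirun -np 160 --bind-to core --report-bindings \
  lammps/lmp_mpi-hpcx-2.7.0.AVX512 -in in.lj.all.inp -pk intel 0 omp 1 -sf intel
```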

Output example

...
Neighbor list info ...
  update every 20 steps, delay 0 steps, check no
  max neighbors/atom: 2000, page size: 100000
  master list distance cutoff = 2.8
  ghost atom cutoff = 2.8
  binsize = 1.4, bins = 24 24 24
  1 neighbor lists, perpetual/occasional/extra = 1 0 0
  (1) pair lj/cut/intel, perpetual
      attributes: half, newton on, intel
      pair build: half/bin/newton/intel
      stencil: half/bin/3d/newton
      bin: intel
Per MPI rank memory allocation (min/avg/max) = 2.826 | 3.045 | 3.384 Mbytes
Step Temp E_pair E_mol TotEng Press
       0         1.44   -6.7733683            0   -4.6134358    -5.019707
     100   0.75745333   -5.7585066            0   -4.6223621   0.20725895
Loop time of 0.0174524 on 640 procs for 100 steps with 32000 atoms

Performance: 2475308.243 tau/day, 5729.880 timesteps/s
94.1% CPU use with 640 MPI tasks x no OpenMP threads

MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
Pair    | 0.0010798  | 0.0013214  | 0.0016188  |   0.2 |  7.57
Neigh   | 0.00021591 | 0.00024434 | 0.0003079  |   0.0 |  1.40
Comm    | 0.015171   | 0.015479   | 0.01573    |   0.1 | 88.69
Output  | 9.0258e-05 | 0.00011218 | 0.00014501 |   0.0 |  0.64
Modify  | 0.00017915 | 0.00018453 | 0.00020567 |   0.0 |  1.06
Other   |            | 0.0001111  |            |       |  0.64

Nlocal:    50.0000 ave 61 max 40 min
Histogram: 9 36 73 97 133 133 96 37 18 8
Nghost:    725.892 ave 741 max 706 min
Histogram: 1 8 23 52 121 142 159 81 42 11
Neighs:    1879.43 ave 2521 max 1383 min

We will be looking at the tau/day and timesteps/s performance values.
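When comparing many runs, it helps to pull those two figures out of each log automatically. A minimal sketch, shown here on the sample line above; on a real run you would replace the echo with `grep '^Performance:' log.lammps`:

```shell
# Extract the two tracked figures (tau/day and timesteps/s) from a LAMMPS
# "Performance:" line. The line format is:
#   Performance: <tau/day> tau/day, <steps/s> timesteps/s
perf_line='Performance: 2475308.243 tau/day, 5729.880 timesteps/s'
tau_per_day=$(echo "$perf_line" | awk '{print $2}')
steps_per_s=$(echo "$perf_line" | awk '{print $4}')
echo "tau/day = $tau_per_day, timesteps/s = $steps_per_s"
```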

 

For GPUs

LAMMPS Accelerator Package Documentation:

https://lammps.sandia.gov/doc/Speed_packages.html
https://lammps.sandia.gov/doc/Build_extras.html

On GPUs, the timing breakdown won’t be accurate without CUDA_LAUNCH_BLOCKING=1 (though setting it will slow down the simulation).

By default with Kokkos, KSpace (including FFTs) runs on the GPU, but it can be changed to run on the CPU and overlap with the bonded force computation.
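As a sketch, a Kokkos GPU launch might look like the following; the binary name lmp_kokkos_cuda is a placeholder for your own KOKKOS-enabled build, and the package options shown are a starting point, not a tuned configuration:

```shell
# Hypothetical Kokkos GPU run on one node with 4 GPUs:
#   -k on g 4       -> enable Kokkos with 4 GPUs per node (one per MPI rank)
#   -sf kk          -> apply the kk suffix to all supported styles
#   -pk kokkos ...  -> package options; cuda/aware on uses CUDA-aware MPI
mpirun -np 4 ./lmp_kokkos_cuda -k on g 4 -sf kk \
  -pk kokkos cuda/aware on neigh half newton on -in in.lj
```

Per the note above, individual styles can instead be forced onto the host with the kk/host suffix (e.g. for KSpace) while the rest stays on the device.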

Configuration Changes allowed

  • Number of MPI ranks

  • MPI and thread affinity

  • Number of OpenMP threads per MPI rank

  • Compiler optimization flags

  • Can compile with “-default-stream per-thread”

  • FFT library

  • MPI library

  • Compiler version

  • CUDA version

  • CUDA-aware flags

  • CUDA MPS on/off

  • Can use any LAMMPS accelerator package

  • Any package option (see https://lammps.sandia.gov/doc/package.html), except precision

  • Coulomb cutoff

  • Can use atom sorting

  • Newton flag on/off

  • Can add load balancing

  • Can use LAMMPS “processors” command

  • Can turn off tabulation in pair_style (i.e. “pair_modify table 0”)

  • Can use multiple Kokkos backends (e.g. CUDA + OpenMP)

  • Can use “kk/device” or “kk/host” suffix for any kernel to run on CPU or GPU
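Several of the allowed knobs above (number of MPI ranks, OpenMP threads per rank, affinity) are cheap to sweep with a small script. A minimal sketch, assuming an Open MPI based launcher and a CPU build named lmp_mpi (both placeholders for your own setup):

```shell
# Hypothetical sweep over two allowed tunables: MPI ranks x OpenMP threads.
# --map-by socket:PE=$omp reserves $omp cores per rank for its OpenMP threads.
for np in 40 80 160; do
  for omp in 1 2 4; do
    mpirun -np $np --map-by socket:PE=$omp \
      ./lmp_mpi -in in.lj -pk intel 0 omp $omp -sf intel \
      > log.np${np}.omp${omp}
    echo "np=$np omp=$omp: $(grep '^Performance:' log.np${np}.omp${omp})"
  done
done
```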

 

Configuration Changes not allowed

  • Modifying any style: pair, fix, kspace, etc.

  • Number of atoms

  • Timestep value

  • Number of timesteps

  • Neighborlist parameters (except binsize)

  • Changing precision (must use double precision FP64)

  • LJ charmm cutoff

Visualizing the results

 

Here is an example:

 

Input Files

LJ and Rhodo input files can be downloaded here.

 

Tasks and Submissions

 

Note: You need to run this on both clusters, using up to 4 CPU nodes or 4 GPUs, and supply two results, one per cluster.

 

  1. Run LAMMPS with the given inputs. On both the Niagara and NSCC clusters, you can use up to 4 nodes or 4 GPUs for this run. Submit the results to the OneDrive team folder. Change the tunable parameters and see what gives you the best performance.

  2. Run an IPM profile of LAMMPS on one of the clusters on 4 nodes. What are the main MPI calls used? Submit the profile results in PDF format, one profile per input.
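One common way to collect an IPM profile without rebuilding LAMMPS is to preload the IPM library so that MPI calls are intercepted at run time. A sketch, with the library path as a placeholder for your site's IPM install:

```shell
# Intercept MPI calls via LD_PRELOAD (path is a placeholder); -x exports the
# variable to all ranks with Open MPI / HPC-X.
export IPM_REPORT=full
mpirun -x LD_PRELOAD=/path/to/libipm.so -np 160 \
  lammps/lmp_mpi-hpcx-2.7.0.AVX512 -in in.lj.all.inp
# IPM prints a summary at MPI_Finalize and writes an XML file, which
# ipm_parse -html can turn into an HTML report (then print to PDF).
```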

  3. Visualize the input files and generate a video or an image from the run.

  4. For teams with a Twitter account, tweet your video or image, tagged with your team name/university and the hashtags: #ISC21 #ISC21_SCC #LAMMPS