...
Some test cases to start with can be found at ftp://ftp.gromacs.org/pub/benchmarks/gmxbench-3.0.tar.gz.
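For example, the archive can be downloaded and unpacked as follows (a minimal sketch; it assumes wget is available and that outbound FTP access is allowed from the build host):
Code Block |
---|
# Download and unpack the GROMACS benchmark suite (assumes wget and FTP access)
wget ftp://ftp.gromacs.org/pub/benchmarks/gmxbench-3.0.tar.gz
tar xfz gmxbench-3.0.tar.gz
ls                      # each benchmark case is extracted into its own directory |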
References
Build and Install Gromacs
...
Code Block |
---|
tar xfz gromacs-2020.2.tar.gz
cd gromacs-2020.2
module load intel/2019.5.281
module load mkl/2019.5.281
module load gcc/8.4.0
module load cmake/3.13.4
module load hpcx/2.6.0
mkdir build
mkdir install run
cd build
cmake .. -DGMX_FFT_LIBRARY=mkl -DMKL_LIBRARIES=-mkl \
-DMKL_INCLUDE_DIR=$MKLROOT/include \
-DGMX_SIMD=AVX2_256 \
-DGMX_MPI=ON \
-DGMX_BUILD_MDRUN_ONLY=on \
-DBUILD_SHARED_LIBS=on \
-DGMX_HWLOC=off \
-DCMAKE_INSTALL_PREFIX=../install \
-DCMAKE_C_COMPILER=mpicc -DCMAKE_CXX_COMPILER=mpicxx
make -j 16 install |
|
Note: if you get an hwloc error, comment out the following lines in src/gromacs/hardware/hardwaretopology.cpp (a sed one-liner that does this is sketched after the snippet):
Code Block |
---|
//# if GMX_HWLOC_API_VERSION < 0x00020000
//# error "HWLOC library major version set during configuration is 1, but currently using version 2 headers"
//# endif |
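If you prefer not to edit the file by hand, a one-liner like the sketch below can comment out the check (it assumes GNU sed and that the file still matches the snippet above; review the change before rebuilding):
Code Block |
---|
# Comment out the hwloc API version check plus the two lines that follow it (GNU sed)
sed -i '/GMX_HWLOC_API_VERSION < 0x00020000/,+2 s|^|//|' \
    src/gromacs/hardware/hardwaretopology.cpp
# Inspect the result before re-running make
grep -n -A 2 'GMX_HWLOC_API_VERSION < 0x00020000' src/gromacs/hardware/hardwaretopology.cpp |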
Check the install directory for the mdrun_mpi binary (a PATH setup sketch follows the listing):
Code Block |
---|
$ ls ../install/
bin include lib64 share
$ ls ../install/bin
gmx-completion-mdrun_mpi.bash mdrun_mpi |
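So that the run commands later on this page can call mdrun_mpi directly, the install bin directory can be added to PATH (a minimal sketch; the prefix below is an assumption, adjust it to wherever CMAKE_INSTALL_PREFIX points on your system):
Code Block |
---|
# Put mdrun_mpi on PATH (the prefix is illustrative; use your actual install location)
export PATH=$HOME/gromacs-2020.2/install/bin:$PATH
mdrun_mpi -version          # should print the GROMACS version and build details |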
Build Gromacs with GPU support
Code Block |
---|
tar xfz gromacs-2020.2.tar.gz
cd gromacs-2020.2
module load intel/2019.5.281
module load mkl/2019.5.281
module load gcc/8.4.0
module load cmake/3.13.4
module load hpcx/2.6.0
module load cuda/10.1
mkdir build install run
cd build
cmake .. -DGMX_FFT_LIBRARY=mkl -DMKL_LIBRARIES=-mkl \
-DMKL_INCLUDE_DIR=$MKLROOT/include \
-DGMX_SIMD=AVX2_256 \
-DGMX_MPI=ON \
-DGMX_GPU=ON \
-DGMX_BUILD_MDRUN_ONLY=on \
-DBUILD_SHARED_LIBS=on \
-DGMX_HWLOC=off \
-DCMAKE_INSTALL_PREFIX=../install \
-DCMAKE_C_COMPILER=mpicc -DCMAKE_CXX_COMPILER=mpicxx
make -j 16 install |
|
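Before running, it is worth confirming that this build really has CUDA support compiled in (a quick sanity-check sketch; the exact wording of the -version output can differ between GROMACS releases):
Code Block |
---|
# Check for CUDA support in the build and list the GPUs visible on this node
../install/bin/mdrun_mpi -version | grep -i "gpu support"
nvidia-smi -L |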
Run Gromacs
To run Gromacs on CPUs only using the stmv case:
Code Block |
---|
% mpirun -np 40 -x UCX_NET_DEVICES=mlx5_0:1 -bind-to core -report-bindings \
mdrun_mpi -v -s stmv.tpr -nsteps 10000 -noconfout -nb cpu -pin on
Command line:
mdrun_mpi -v -s stmv.tpr -nsteps 10000 -noconfout -nb cpu -pin on
Reading file stmv.tpr, VERSION 2018.1 (single precision)
Note: file tpx version 112, software tpx version 119
Overriding nsteps with value passed on the command line: 10000 steps, 20 ps
Changing nstlist from 10 to 80, rlist from 1.2 to 1.316
Using 40 MPI processes
Using 1 OpenMP thread per MPI process
...
step 9900, remaining wall clock time: 6 s
vol 0.96 imb F 1% pme/F 0.81 step 10000, remaining wall clock time: 0 s
Dynamic load balancing report:
DLB was turned on during the run due to measured imbalance.
Average load imbalance: 0.9%.
The balanceable part of the MD step is 94%, load imbalance is computed from this.
Part of the total run time spent waiting due to load imbalance: 0.8%.
Steps where the load balancing was limited by -rdd, -rcon and/or -dds: X 0 % Y 0 % Z 0 %
Average PME mesh/force load: 0.790
Part of the total run time spent waiting due to PP/PME imbalance: 2.0 %
Core t (s) Wall t (s) (%)
Time: 27556.777 688.921 4000.0
(ns/day) (hour/ns)
Performance: 2.509 9.567 |
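The same CPU run can also be submitted through a batch scheduler; the sketch below assumes a Slurm cluster with 40 cores per node, and the job name, partition, and node count are placeholders:
Code Block |
---|
#!/bin/bash
#SBATCH -J stmv-cpu                 # placeholder job name
#SBATCH -N 1                        # placeholder node count
#SBATCH --ntasks-per-node=40
#SBATCH -p compute                  # placeholder partition name
module load intel/2019.5.281 mkl/2019.5.281 gcc/8.4.0 hpcx/2.6.0
export PATH=$HOME/gromacs-2020.2/install/bin:$PATH   # adjust to your install prefix
mpirun -np $SLURM_NTASKS -x UCX_NET_DEVICES=mlx5_0:1 -bind-to core -report-bindings \
    mdrun_mpi -v -s stmv.tpr -nsteps 10000 -noconfout -nb cpu -pin on |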
To run Gromacs with GPUs using the stmv case:
Code Block |
---|
% export OMP_NUM_THREADS=2
% export KMP_AFFINITY=verbose,compact
% mpirun -np 4 -x UCX_NET_DEVICES=mlx5_0:1 -map-by node:PE=$OMP_NUM_THREADS \
mdrun_mpi -v -s stmv.tpr -nsteps 10000 -noconfout -nb gpu -pin on \
-ntomp $OMP_NUM_THREADS
Command line:
mdrun_mpi -v -s stmv.tpr -nsteps 10000 -noconfout -nb gpu -pin on
Reading file stmv.tpr, VERSION 2018.1 (single precision)
Note: file tpx version 112, software tpx version 119
Overriding nsteps with value passed on the command line: 10000 steps, 20 ps
Changing nstlist from 10 to 100, rlist from 1.2 to 1.339
On host ops003.hpcadvisorycouncil.com 2 GPUs selected for this run.
Mapping of GPU IDs to the 2 GPU tasks in the 2 ranks on this node:
PP:0,PP:1
PP tasks will do (non-perturbed) short-ranged and most bonded interactions on the GPU
PP task will update and constrain coordinates on the CPU
Using 4 MPI processes
Using 2 OpenMP threads per MPI process
...
imb F 0% step 9900, remaining wall clock time: 3 s
imb F 0% step 10000, remaining wall clock time: 0 s
Dynamic load balancing report:
DLB was off during the run due to low measured imbalance.
Average load imbalance: 0.2%.
The balanceable part of the MD step is 59%, load imbalance is computed from this.
Part of the total run time spent waiting due to load imbalance: 0.1%.
Core t (s) Wall t (s) (%)
Time: 2581.428 322.681 800.0
(ns/day) (hour/ns)
Performance: 5.356 4.481 |
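Beyond offloading only the non-bonded work, the PME and bonded calculations can also be moved to the GPU with the flags listed in the next section; a hedged variant of the run above is sketched below (with -pme gpu a single separate PME rank is used, and whether this is faster depends on the case and on the node's CPU/GPU balance):
Code Block |
---|
# Offload PME (on one dedicated rank) and bonded interactions to the GPU as well
export OMP_NUM_THREADS=2
mpirun -np 4 -x UCX_NET_DEVICES=mlx5_0:1 -map-by node:PE=$OMP_NUM_THREADS \
    mdrun_mpi -v -s stmv.tpr -nsteps 10000 -noconfout -pin on -ntomp $OMP_NUM_THREADS \
    -nb gpu -pme gpu -npme 1 -bonded gpu |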
Other command options
Code Block |
---|
OPTIONS
Options to specify input files:
-s [<.tpr>] (topol.tpr)
Portable xdr run input file
-cpi [<.cpt>] (state.cpt) (Opt.)
Checkpoint file
-table [<.xvg>] (table.xvg) (Opt.)
xvgr/xmgr file
-tablep [<.xvg>] (tablep.xvg) (Opt.)
xvgr/xmgr file
-tableb [<.xvg> [...]] (table.xvg) (Opt.)
xvgr/xmgr file
-rerun [<.xtc/.trr/...>] (rerun.xtc) (Opt.)
Trajectory: xtc trr cpt gro g96 pdb tng
-ei [<.edi>] (sam.edi) (Opt.)
ED sampling input
-multidir [<dir> [...]] (rundir) (Opt.)
Run directory
-awh [<.xvg>] (awhinit.xvg) (Opt.)
xvgr/xmgr file
-membed [<.dat>] (membed.dat) (Opt.)
Generic data file
-mp [<.top>] (membed.top) (Opt.)
Topology file
-mn [<.ndx>] (membed.ndx) (Opt.)
Index file
Options to specify output files:
-o [<.trr/.cpt/...>] (traj.trr)
Full precision trajectory: trr cpt tng
-x [<.xtc/.tng>] (traj_comp.xtc) (Opt.)
Compressed trajectory (tng format or portable xdr format)
-cpo [<.cpt>] (state.cpt) (Opt.)
Checkpoint file
-c [<.gro/.g96/...>] (confout.gro)
Structure file: gro g96 pdb brk ent esp
-e [<.edr>] (ener.edr)
Energy file
-g [<.log>] (md.log)
Log file
-dhdl [<.xvg>] (dhdl.xvg) (Opt.)
xvgr/xmgr file
-field [<.xvg>] (field.xvg) (Opt.)
xvgr/xmgr file
-tpi [<.xvg>] (tpi.xvg) (Opt.)
xvgr/xmgr file
-tpid [<.xvg>] (tpidist.xvg) (Opt.)
xvgr/xmgr file
-eo [<.xvg>] (edsam.xvg) (Opt.)
xvgr/xmgr file
-px [<.xvg>] (pullx.xvg) (Opt.)
xvgr/xmgr file
-pf [<.xvg>] (pullf.xvg) (Opt.)
xvgr/xmgr file
-ro [<.xvg>] (rotation.xvg) (Opt.)
xvgr/xmgr file
-ra [<.log>] (rotangles.log) (Opt.)
Log file
-rs [<.log>] (rotslabs.log) (Opt.)
Log file
-rt [<.log>] (rottorque.log) (Opt.)
Log file
-mtx [<.mtx>] (nm.mtx) (Opt.)
Hessian matrix
-if [<.xvg>] (imdforces.xvg) (Opt.)
xvgr/xmgr file
-swap [<.xvg>] (swapions.xvg) (Opt.)
xvgr/xmgr file
Other options:
-deffnm <string>
Set the default filename for all file options
-xvg <enum> (xmgrace)
xvg plot formatting: xmgrace, xmgr, none
-dd <vector> (0 0 0)
Domain decomposition grid, 0 is optimize
-ddorder <enum> (interleave)
DD rank order: interleave, pp_pme, cartesian
-npme <int> (-1)
Number of separate ranks to be used for PME, -1 is guess
-nt <int> (0)
Total number of threads to start (0 is guess)
-ntmpi <int> (0)
Number of thread-MPI ranks to start (0 is guess)
-ntomp <int> (0)
Number of OpenMP threads per MPI rank to start (0 is guess)
-ntomp_pme <int> (0)
Number of OpenMP threads per MPI rank to start (0 is -ntomp)
-pin <enum> (auto)
Whether mdrun should try to set thread affinities: auto, on, off
-pinoffset <int> (0)
The lowest logical core number to which mdrun should pin the first
thread
-pinstride <int> (0)
Pinning distance in logical cores for threads, use 0 to minimize
the number of threads per physical core
-gpu_id <string>
List of unique GPU device IDs available to use
-gputasks <string>
List of GPU device IDs, mapping each PP task on each node to a
device
-[no]ddcheck (yes)
Check for all bonded interactions with DD
-rdd <real> (0)
The maximum distance for bonded interactions with DD (nm), 0 is
determine from initial coordinates
-rcon <real> (0)
Maximum distance for P-LINCS (nm), 0 is estimate
-dlb <enum> (auto)
Dynamic load balancing (with DD): auto, no, yes
-dds <real> (0.8)
Fraction in (0,1) by whose reciprocal the initial DD cell size will
be increased in order to provide a margin in which dynamic load
balancing can act while preserving the minimum cell size.
-nb <enum> (auto)
Calculate non-bonded interactions on: auto, cpu, gpu
-nstlist <int> (0)
Set nstlist when using a Verlet buffer tolerance (0 is guess)
-[no]tunepme (yes)
Optimize PME load between PP/PME ranks or GPU/CPU
-pme <enum> (auto)
Perform PME calculations on: auto, cpu, gpu
-pmefft <enum> (auto)
Perform PME FFT calculations on: auto, cpu, gpu
-bonded <enum> (auto)
Perform bonded calculations on: auto, cpu, gpu
-update <enum> (auto)
Perform update and constraints on: auto, cpu, gpu
-[no]v (no)
Be loud and noisy
-pforce <real> (-1)
Print all forces larger than this (kJ/mol nm)
-[no]reprod (no)
Try to avoid optimizations that affect binary reproducibility
-cpt <real> (15)
Checkpoint interval (minutes)
-[no]cpnum (no)
Keep and number checkpoint files
-[no]append (yes)
Append to previous output files when continuing from checkpoint
instead of adding the simulation part number to all file names
-nsteps <int> (-2)
Run this number of steps (-1 means infinite, -2 means use mdp
option, smaller is invalid)
-maxh <real> (-1)
Terminate after 0.99 times this time (hours)
-replex <int> (0)
Attempt replica exchange periodically with this period (steps)
-nex <int> (0)
Number of random exchanges to carry out each exchange interval (N^3
is one suggestion). -nex zero or not specified gives neighbor
replica exchange.
-reseed <int> (-1)
Seed for replica exchange, -1 is generate a seed |
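As a worked example of a few of these options, the sketch below uses -deffnm to name all outputs, checkpoints every 5 minutes, stops cleanly before a one-hour wall-time limit, and then restarts from the checkpoint (the file names are illustrative):
Code Block |
---|
# First leg: checkpoint every 5 minutes and stop before the 1 hour limit
mpirun -np 40 mdrun_mpi -s stmv.tpr -deffnm stmv_run -cpt 5 -maxh 1.0
# Continuation: read the checkpoint and append to the existing output files
mpirun -np 40 mdrun_mpi -s stmv.tpr -deffnm stmv_run -cpi stmv_run.cpt -append |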