Getting Started with NWChem for ISC22 SCC
This page provides an overview, configuration notes, and tasks for ISC22 Student Cluster Competition teams.
- 1 References
- 2 Overview
- 3 Introduction to NWChem
- 4 Downloading and Compiling NWChem
- 5 Running NWChem
- 6 Tunables/Non-Tunables
- 6.1 MPI and Global Arrays (GA)
- 6.1.1 Details
- 6.2 Tuning Parallel Execution
- 6.3 NWChem input file tuning
- 6.3.1 DFT module
- 6.3.2 TCE CCSD(T) module
- 6.3.3 Semidirect CCSD(T) module
- 7 Generating input files
- 8 Tasks and Submissions
References
Overview
NWChem is a widely used open-source computational chemistry software package, written primarily in Fortran. It supports scalable parallel implementations of atomic-orbital and plane-wave density-functional theory, many-body methods from Møller-Plesset perturbation theory to coupled-cluster with quadruple excitations, and a number of multiscale methods and methods for computing molecular and macroscopic properties. It is used on computers from Apple M1 laptops to the largest supercomputers, supporting multicore CPUs and GPUs with OpenMP and other programming models. NWChem uses MPI for parallelism, usually hidden by the Global Arrays programming model, which uses one-sided communication to support a data-centric abstraction of multidimensional arrays across shared and distributed memory protocols. NWChem was created by Pacific Northwest National Laboratory in the 1990s, and has been under continuous development by a team based at national laboratories, universities, and industry for 25 years. NWChem has been cited thousands of times and was a finalist for the Gordon Bell Prize in 2009.
Introduction to NWChem
The slides:
Downloading and Compiling NWChem
NWChem is available for download from GitHub:
git clone https://github.com/nwchemgit/nwchem
cd nwchem/src/tools && ./get-tools-github
Once you have downloaded NWChem, the environment needs to be set up before building a working executable for your cluster. In the following example, we show how to compile it using Intel compilers, the Intel Math Kernel Library (MKL) and Open MPI from HPC-X:
module load intel/2021.3
module load mkl/2021.3.0
module load hpcx/2.9.0
export NWCHEM_TOP=/path-to-directory-where-you-ran-git/nwchem
export NWCHEM_TARGET=LINUX64
export ARMCI_NETWORK=ARMCI-MPI
export USE_MPI=y
export NWCHEM_MODULES=qm
export BLASOPT="-mkl"
export BLAS_SIZE=8
export LAPACK_LIB="-mkl"
export USE_SCALAPACK=y
export SCALAPACK="-mkl -lmkl_scalapack_ilp64 -lmkl_blacs_openmpi_ilp64"
export SCALAPACK_SIZE=8
export FC=ifort
Full details can be found in the NWChem documentation page "Compiling NWChem from source".
Using ARMCI-MPI requires you to build it before compiling NWChem. This requires the Linux packages autoconf, automake, m4 and libtool to be installed. If they aren’t present and you can’t install them via a package manager, ARMCI-MPI provides a link to instructions on how to do it manually.
cd $NWCHEM_TOP/src/tools
./install-armci-mpi
This step should succeed, but if it does not, it means that neither mpicc nor $MPICC is available in your environment. Note also that you need to make sure that LIBMPI matches the output of mpifort -show for NWChem to link correctly.
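The two checks above can be sketched as a small pre-flight function. This is only an illustration: the function name check_mpi_wrappers is hypothetical, and the assumption that the installer looks for $MPICC first and mpicc otherwise is inferred from the text above, not taken from the ARMCI-MPI scripts.

```shell
# Hypothetical helper: verify the MPI wrappers before running install-armci-mpi.
check_mpi_wrappers() {
    # Assumption: the C wrapper is $MPICC if set, otherwise mpicc.
    for tool in "${MPICC:-mpicc}" mpifort; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            return 1
        fi
    done
    # Print both values so they can be compared by eye.
    echo "mpifort -show: $(mpifort -show)"
    echo "LIBMPI:        ${LIBMPI:-<unset>}"
}
```

Calling check_mpi_wrappers after loading the modules either reports the first missing wrapper or prints the two strings side by side.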
After the environment has been set up, NWChem is built in two steps: building the configuration file nwchem_config.h, and making the executable.
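A minimal sketch of those two steps, assuming the variables above are exported in the current shell. The wrapper function build_nwchem and the log file name make.log are illustrative choices, not part of NWChem.

```shell
# Hypothetical wrapper around the two build steps.
build_nwchem() {
    cd "$NWCHEM_TOP/src" || return 1
    # Step 1: generate nwchem_config.h for the modules in NWCHEM_MODULES.
    make nwchem_config || return 1
    # Step 2: compile and link the executable, keeping a log for debugging.
    make > make.log 2>&1
}
```

Running build_nwchem from any directory performs both steps and leaves the compiler output in $NWCHEM_TOP/src/make.log.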
The last make command takes a while to complete; when it is done, there should be an executable called nwchem under $NWCHEM_TOP/bin/$NWCHEM_TARGET.
Running NWChem
Once you have an executable, running NWChem is as simple as choosing a suitable input and running it with your chosen MPI application launcher:
The above (small) test problem runs in about 20 seconds on one node of the HPC-AI Advisory Council’s helios cluster. If your test goes well, the end of your output should look as follows:
(Except for the values in the very last line, of course.)
The performance metric you will be trying to minimize is the wall time in the last line.
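To track that metric across runs, the wall time can be pulled out of each output file. The sketch below assumes the final timing line has the form "Total times  cpu: ...s  wall: ...s"; the function name and the file names in the usage comment are placeholders.

```shell
# Print the wall time in seconds from an NWChem output file.
# Assumed format of the timing line:
#   Total times  cpu:       18.2s     wall:       20.4s
nwchem_wall_time() {
    awk '/Total times/ {
             for (i = 1; i < NF; i++)
                 if ($i == "wall:") { t = $(i + 1); sub(/s$/, "", t) }
         }
         END { if (t != "") print t; else exit 1 }' "$1"
}

# Usage (hypothetical file names and process count):
#   mpirun -np 40 "$NWCHEM_TOP/bin/$NWCHEM_TARGET/nwchem" input.nw > output.log
#   nwchem_wall_time output.log
```

The function exits with a non-zero status when no timing line is found, which is a simple way to detect a job that did not finish.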
Tunables/Non-Tunables
There are a number of tunable parameters that affect the performance of NWChem. One of them is the MPI library used, and how Global Arrays uses MPI. Unlike many HPC codes, Global Arrays is based on a one-sided communication model rather than message passing, and is therefore sensitive to implementation details such as RDMA and asynchronous progress. Another key issue is the parallel execution configuration. NWChem often runs well with flat MPI parallelism, although with larger core counts and/or smaller memory capacities, a mixture of MPI and OpenMP can improve the performance of some modules.
Neither of the above requires source code changes. Additional tuning opportunities exist if one modifies the source code of bottleneck kernels, especially to make better use of fine-grained parallelism, either on CPUs or GPUs.
All of these issues will be described in the following sections.
If you would like to understand more about how NWChem works, https://www.nersc.gov/assets/Uploads/Hammond-NERSC-OpenMP-August-2019-1.pdf may be useful.
MPI and Global Arrays (GA)
The short version of tuning MPI and GA in NWChem is:
- Try ARMCI_NETWORK=ARMCI-MPI, ARMCI_NETWORK=MPI-PR, and ARMCI_NETWORK=MPI-TS. Which one performs better varies by use case. Note that you must launch the MPI-PR binary with more than 1 MPI process, because the runtime devotes one MPI process to communication. The other two implementations allow binaries to run with one MPI process. Please remember this when computing parallel efficiency: N vs N-1 is significant for smaller N.
- Try at least two different MPI libraries, particularly with ARMCI-MPI. The performance of ARMCI-MPI with Open MPI, Intel MPI and MVAPICH2 will often be noticeably different.
There are more ways to tune NWChem than these, but these are the straightforward ones for you to consider first.
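The N vs N-1 remark can be made concrete with a little arithmetic. The sketch below takes the statement above at face value (one of the N launched processes is devoted to communication under MPI-PR) and prints the fraction of processes left for computation. The function name is hypothetical.

```shell
# Fraction of launched MPI processes that compute under MPI-PR,
# assuming exactly one process is devoted to communication.
mpi_pr_compute_fraction() {
    awk -v n="$1" 'BEGIN { printf "%.3f\n", (n - 1) / n }'
}

for n in 2 4 8 64; do
    echo "N=$n compute fraction: $(mpi_pr_compute_fraction "$n")"
done
# N=2  -> 0.500
# N=4  -> 0.750
# N=8  -> 0.875
# N=64 -> 0.984
```

With 4 processes a quarter of the launched processes do no computation, while with 64 the loss is under 2%, so small-N efficiency numbers should be normalized accordingly.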
Details
The following contains the full details of what goes on inside of Global Arrays. You do not need to study all of it. The important parts to understand are:
- There are two implementations of the ARMCI API: the ARMCI/ComEx library distributed with Global Arrays, and the ARMCI-MPI library distributed separately.
- Global Arrays can use Send-Recv or RMA (one-sided communication) internally to implement its one-sided operations. While mapping one-sided to one-sided is more natural, some MPI libraries implement Send-Recv much better than RMA, in which case the less natural mapping of one-sided communication to message-passing is more effective.
- The performance of different MPI libraries can vary significantly for the communication patterns in NWChem. The choice of MPI library is an important tunable parameter for NWChem.
Tuning Parallel Execution
The presence of OpenMP in a few modules of NWChem presents an opportunity for tuning. Both of the CCSD(T) modules (TCE and semidirect) contain OpenMP in the bottlenecks.
Assuming you have 64 cores, you can run anything from 64 MPI processes with 1 OpenMP thread each (64x1) to 1 MPI process with 64 OpenMP threads (1x64). You will find that NWChem does not scale to more than 8 threads per process, because of Amdahl’s Law, as well as the NUMA properties of some modern servers. However, you may find that 16x4 is better than 64x1, for example. This is particularly true when file I/O is happening, because Linux serializes it. The CCSD(T) semidirect module does nontrivial file I/O in some scenarios, so it benefits from OpenMP threading even if the compute efficiency is imperfect.
Below is a simple example of a script that could be useful for performing MPI x OpenMP scaling studies.
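A minimal sketch of such a sweep follows. It assumes an Open MPI-style launcher (the -x option), an executable reachable as nwchem, an input named input.nw, and an output that ends with a "Total times" line. All of these are placeholders to adapt, and the function names are hypothetical.

```shell
#!/bin/bash
# Sketch of an MPI x OpenMP sweep; all defaults below are placeholders.
CORES=${CORES:-64}            # cores available to the job
INPUT=${INPUT:-input.nw}      # hypothetical input file name
NWCHEM=${NWCHEM:-nwchem}      # assumes the executable is on PATH

# Print "ranks threads" pairs with ranks * threads = cores, up to 8 threads.
mpi_omp_configs() {
    for threads in 1 2 4 8; do
        if [ $(( $1 % threads )) -eq 0 ]; then
            echo "$(( $1 / threads )) $threads"
        fi
    done
}

# Run every configuration and print the timing line of each log.
run_sweep() {
    mpi_omp_configs "$CORES" | while read -r ranks threads; do
        log="nwchem_${ranks}x${threads}.log"
        export OMP_NUM_THREADS="$threads"
        # -x forwards the variable to every rank (Open MPI syntax).
        # stdin is redirected so mpirun does not swallow the config list.
        mpirun -np "$ranks" -x OMP_NUM_THREADS "$NWCHEM" "$INPUT" \
            < /dev/null > "$log" 2>&1
        echo "${ranks}x${threads}: $(grep 'Total times' "$log")"
    done
}

mpi_omp_configs "$CORES"   # preview the sweep; call run_sweep to execute it
```

The sweep stops at 8 threads per process, following the scaling limit described above. Process and thread binding is left at the launcher defaults here and usually needs tuning as well.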
NWChem input file tuning
The CCSD(T) semidirect and TCE modules each have at least one important tuning parameter.
DFT module
The DFT module contains no OpenMP and has limited tuning options, most of which relate to quantum chemistry algorithms that can be ignored for the SCC. The most important tuning options relate to ARMCI and MPI library choice, as described above.
TCE CCSD(T) module
See Tensor Contraction Engine Module: CI, MBPT, and CC - NWChem for details. The most important tuning parameters are:
- 2eorb: always use this.
- 2emet: set to 13 and see how that works.
- tilesize: the default for CCSD(T) should be 20. If you set it to a much larger value, the job will crash. Smaller values are usually less efficient. You can also try Tensor Contraction Engine Module: CI, MBPT, and CC - NWChem instead and see if that helps performance.
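When experimenting with tilesize, it can help to generate one input per candidate value rather than editing files by hand. The sketch below assumes the input already contains a line of the form "tilesize <integer>". The function name, file names and candidate values are illustrative only.

```shell
# Print a copy of an NWChem input with the TCE tilesize replaced by $2.
# Assumes the input already has a line of the form "tilesize <integer>".
set_tilesize() {
    sed -E "s/^([[:space:]]*tilesize)[[:space:]]+[0-9]+/\1 $2/" "$1"
}

# Usage (hypothetical file names and values): one input per tile size.
#   for ts in 16 20 24; do
#       set_tilesize w5_ccsd-t.nw "$ts" > "w5_ts${ts}.nw"
#   done
```

Keeping the original input untouched makes it easy to confirm later that only the permitted lines were changed.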
Semidirect CCSD(T) module
The following settings may be useful:
Advanced users may attempt to improve the OpenMP code in CCSD for additional performance.
Generating input files
Please use the NWChem input generator (GitHub - NWChem/input-generator: a script to generate NWChem input files for use in performance and scalability experiments) to generate NWChem input files for the competition. For example, you can generate the input files for 7 water molecules with the two CCSD(T) modules and 21 water molecules with the DFT module, together with the cc-pVTZ basis set, like this:
./make_nwchem_input.py w7 rccsd-t cc-pvtz energy
./make_nwchem_input.py w7 ccsd-t cc-pvtz energy
./make_nwchem_input.py w21 b3lyp cc-pvtz energy
It is always a good idea to set permanent_dir and scratch_dir appropriately for your system. The former path needs to be a shared folder that all nodes can see. The latter can be a local private scratch, such as /tmp. Depending on your input, you may generate large files in both, so it is not a good idea to set them to slow and/or very limited filesystems.
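One way to apply these two directives consistently is to rewrite each generated input before running it. The sketch below is an assumption-laden helper, not part of the input generator: the function name and the example paths are hypothetical, and it simply drops any existing permanent_dir or scratch_dir lines and puts new ones at the top of the file.

```shell
# Print a copy of an NWChem input with permanent_dir and scratch_dir set.
# $1 = input file, $2 = shared directory, $3 = node-local scratch directory.
set_nwchem_dirs() {
    echo "permanent_dir $2"
    echo "scratch_dir $3"
    # Drop any directory directives already present in the input.
    grep -v -i -E '^[[:space:]]*(permanent_dir|scratch_dir)[[:space:]]' "$1"
}

# Usage (hypothetical paths):
#   set_nwchem_dirs w21_b3lyp.nw /shared/nwchem_run /tmp > w21_b3lyp_run.nw
```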
For CCSD(T), a smaller molecule can be used for testing, because this method is more expensive. The w5 configuration runs in approximately 10 minutes on a modern server node. For debugging, use w1 or w2.
For the SCC, a good DFT benchmark problem for 64 to 128 cores is w21. The input file for this is generated as shown above. This should run in approximately 10 minutes on 40 cores of a modern x86 server, and should scale to 4 nodes, albeit imperfectly. For debugging, teams can use a much smaller input, such as w5, while for performance experiments, w12 is a decent proxy for w21 that runs in less time.
Tasks and Submissions
1. Run the application with the given input file and submit the results, assuming a 4-node cluster. Run on both Niagara and Bridges-2 using 4 CPU-only nodes.
a. Semidirect CCSD(T)
The input file that follows (generated with ./make_nwchem_input.py w7 rccsd-t cc-pvtz energy) is the one that must be used. The only lines that you can change are noted as such.
The correct result is determined by comparison with the following. The last decimal might vary.
This job should run in less than 5000 seconds of wall time on 4 nodes.
You should practice on the w5 input first, since it runs faster and allows easier experiments. The following are the reference energies for this configuration.
b. TCE CCSD(T)
The input file that follows (generated with ./make_nwchem_input.py w7 ccsd-t cc-pvtz energy) is the one that must be used. The only lines that you can change are noted as such.
This input should produce the same answers as 1a but uses a completely different implementation, with different algorithms.
c. Density functional theory
The input file that follows (generated with ./make_nwchem_input.py w21 b3lyp cc-pvtz energy) is the one that must be used. The only lines that you can change are noted as such.
The correct result is determined by comparison with the following. The last decimal might vary.
This job should run in less than 600 seconds of wall time on 40 cores. There is no GPU support for DFT so do not try to measure anything related to GPUs here.
2. Obtain an IPM profile for the application and submit the results as a PDF file (do not submit the raw ipm_parse -html data). What are the top three MPI calls used?
3. Visualize the results, create a figure or short video, and submit the results. If the team has a Twitter account, publish the figure or video with the hashtags #ISC22, #ISC22_SCC, #NWChem (mark/tag the figure or video with your team name/university).
4. Run a 4-GPU job on the Bridges-2 cluster using V100 nodes, and submit the results for either Semidirect CCSD(T) or TCE CCSD(T). You may submit both inputs, but only one is required.