Getting Started with NWChem for ISC22 SCC - Onsite

This page provides an overview of NWChem, configuration instructions, and the tasks for ISC22 Student Cluster Competition teams.

 


Overview

NWChem is a widely used open-source computational chemistry software package, written primarily in Fortran. It provides scalable parallel implementations of atomic-orbital and plane-wave density-functional theory, many-body methods ranging from Møller-Plesset perturbation theory to coupled-cluster with quadruple excitations, and a number of multiscale methods and methods for computing molecular and macroscopic properties. It runs on computers from Apple M1 laptops to the largest supercomputers, supporting multicore CPUs and GPUs with OpenMP and other programming models. NWChem uses MPI for parallelism, usually hidden behind the Global Arrays programming model, which uses one-sided communication to provide a data-centric abstraction of multidimensional arrays across shared and distributed memory. NWChem was created at Pacific Northwest National Laboratory in the 1990s and has been under continuous development for 25 years by a team based at national laboratories, universities, and industry. It has been cited thousands of times and was a finalist for the Gordon Bell Prize in 2009.

 


Downloading and Compiling NWChem

NWChem is available for download from GitHub:

git clone https://github.com/nwchemgit/nwchem
cd nwchem/src/tools && ./get-tools-github

Once you have downloaded NWChem, the environment needs to be set up before you can build a working executable for your cluster. The following example shows how to compile it using the Intel compilers, the Intel Math Kernel Library (MKL), and Open MPI from HPC-X:

module load intel/2021.3
module load mkl/2021.3.0
module load hpcx/2.9.0
export NWCHEM_TOP=/path-to-directory-where-you-ran-git/nwchem
export NWCHEM_TARGET=LINUX64
export ARMCI_NETWORK=ARMCI-MPI
export USE_MPI=y
export NWCHEM_MODULES=qm
export BLASOPT="-mkl"
export BLAS_SIZE=8
export LAPACK_LIB="-mkl"
export USE_SCALAPACK=y
export SCALAPACK="-mkl -lmkl_scalapack_ilp64 -lmkl_blacs_openmpi_ilp64"
export SCALAPACK_SIZE=8
export FC=ifort

Full details can be found at https://nwchemgit.github.io/Compiling-NWChem.html

Using ARMCI-MPI requires building it before compiling NWChem. This in turn requires the Linux packages autoconf, automake, m4, and libtool. If they are not present and you cannot install them via a package manager, ARMCI-MPI provides a link to instructions for installing them manually.

cd $NWCHEM_TOP/src/tools
./install-armci-mpi

This step should succeed; if it does not, the most likely cause is that mpicc (or the compiler named in $MPICC) is not available in your environment. Note also that you need to make sure that LIBMPI matches the output of mpifort -show for NWChem to link correctly.
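
For example, with Open MPI you can inspect the wrapper's link line and set LIBMPI to match. The library list below is typical of an Open MPI installation but is only an illustration; copy the -l flags from your own mpifort -show output:

mpifort -show
export LIBMPI="-lmpi_usempif08 -lmpi_usempi_ignore_tkr -lmpi_mpifh -lmpi"   # illustrative; must match your mpifort -show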

After the environment has been set up, NWChem is built in two steps: building the configuration file nwchem_config.h, and making the executable.
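
A minimal sketch of those two steps, following the standard procedure from the compilation guide:

cd $NWCHEM_TOP/src
make nwchem_config    # generates nwchem_config.h from NWCHEM_MODULES and the other variables set above
make                  # compiles the libraries and links the nwchem executable; this is the long step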

The last make command takes a while to complete; when it is done, there should be an executable called nwchem under $NWCHEM_TOP/bin/$NWCHEM_TARGET.

Running NWChem

Once you have an executable, running NWChem is as simple as choosing a suitable input and running it with your chosen MPI application launcher:
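
For example, with Open MPI on a single 32-core node (the input file name used here is a placeholder; substitute whatever input you generated):

mpirun -np 32 $NWCHEM_TOP/bin/$NWCHEM_TARGET/nwchem w2_rccsd-t_cc-pvtz.nw | tee w2.log   # placeholder input name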

A small test problem like this runs in about 20 seconds on one node of the HPC-AI Advisory Council's Helios cluster. If your test goes well, the last line of your output should look as follows:
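
NWChem ends its output with an aggregate timing line; as a rough sketch (the spacing and numbers shown here are illustrative only), it has the form:

Total times  cpu:       20.5s     wall:       20.8s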

(Except for the values in the very last line, of course.)

The performance metric you will be trying to minimize is the wall time in the last line.

Tunables/Non-Tunables

There are a number of tunable parameters that affect the performance of NWChem. One of them is the MPI library used, and how Global Arrays uses MPI. Unlike many HPC codes, Global Arrays is based on a one-sided communication model - not message-passing - and thus is sensitive to implementation details, like RDMA and asynchronous progress. Another key issue is parallel execution configuration. NWChem often runs well with flat MPI parallelism, although with larger core counts and/or smaller memory capacities, a mixture of MPI and OpenMP can improve performance of some modules.

Neither of the above requires source code changes. Additional tuning opportunities exist if you modify the source code of the bottleneck kernels, especially to make better use of fine-grained parallelism, either on CPUs or GPUs.

All of these issues will be described in the following sections.

If you would like to understand more about how NWChem works, https://www.nersc.gov/assets/Uploads/Hammond-NERSC-OpenMP-August-2019-1.pdf may be useful.

MPI and Global Arrays (GA)

The short version of tuning MPI and GA in NWChem is:

  • Try ARMCI_NETWORK=ARMCI-MPI, ARMCI_NETWORK=MPI-PR, and ARMCI_NETWORK=MPI-TS. Which one performs better varies by use case. Note that you must launch the MPI-PR binary with more than 1 MPI process, because the runtime devotes one MPI process to communication. The other two implementations allow binaries to run with one MPI process. Please remember this when computing parallel efficiency: N vs. N-1 is significant for smaller N.

  • Try at least two different MPI libraries, particularly with ARMCI-MPI. The performance of ARMCI-MPI with Open MPI, Intel MPI, and MVAPICH2 will often be noticeably different.

There are more ways to tune NWChem than these, but these are the straightforward ones for you to consider first.
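
Switching the ARMCI implementation requires rebuilding the runtime and relinking the executable. A minimal sketch, assuming the Global Arrays tools bundled with NWChem (the exact rebuild steps can vary; see the compilation guide linked above):

export ARMCI_NETWORK=MPI-PR      # or MPI-TS; for ARMCI-MPI, run install-armci-mpi as described earlier
cd $NWCHEM_TOP/src/tools
make clean && make               # rebuild Global Arrays against the chosen runtime
cd $NWCHEM_TOP/src
make                             # relink nwchem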

Details

The following contains the full details of what goes on inside of Global Arrays. You do not need to study all of it. The important parts to understand are:

  • There are two implementations of the ARMCI API: the ARMCI/ComEx library distributed with Global Arrays, and the ARMCI-MPI library distributed separately.

  • Global Arrays can use Send-Recv or RMA (one-sided communication) internally to implement its one-sided operations. While mapping one-sided to one-sided is more natural, some MPI libraries implement Send-Recv much better than RMA, in which case the less natural mapping of one-sided communication onto message-passing is more effective.

  • The performance of different MPI libraries can vary significantly for the communication patterns in NWChem. The choice of MPI library is an important tunable parameter for NWChem.

 

Tuning Parallel Execution

The presence of OpenMP in a few modules of NWChem presents an opportunity for tuning. Both of the CCSD(T) modules (TCE and semidirect) use OpenMP in their bottleneck kernels.

Assuming you have 64 cores, you can run anywhere from 64 MPI processes with 1 OpenMP thread each (i.e. 64x1) to 1 MPI process with 64 OpenMP threads (1x64). You will find that NWChem does not scale to more than about 8 threads per process, because of Amdahl's Law as well as the NUMA properties of some modern servers. However, you may find that, for example, 16x4 is better than 64x1. This is particularly true when file I/O is happening, because Linux serializes it. The CCSD(T) semidirect module does nontrivial file I/O in some scenarios, so it benefits from OpenMP threading even if the compute efficiency is imperfect.

Below is a simple example of a script that could be useful for performing MPI x OpenMP scaling studies.
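
A minimal sketch, assuming one 64-core node, Open MPI's mpirun, and a hypothetical input file named w5_rccsd-t_cc-pvtz.nw; adjust the core count, binding options, and paths for your cluster:

#!/bin/bash
# Sweep MPI ranks x OpenMP threads at a fixed total core count.
NWCHEM=$NWCHEM_TOP/bin/$NWCHEM_TARGET/nwchem
INPUT=w5_rccsd-t_cc-pvtz.nw            # hypothetical input name; substitute your own
CORES=64
for RANKS in 64 32 16 8; do
    THREADS=$((CORES / RANKS))
    export OMP_NUM_THREADS=$THREADS
    # If using ARMCI_NETWORK=MPI-PR, remember that one rank is devoted to communication.
    mpirun -np $RANKS --map-by slot:PE=$THREADS --bind-to core \
        $NWCHEM $INPUT > nwchem_${RANKS}x${THREADS}.log 2>&1
done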

 

NWChem input file tuning

Semidirect CCSD(T) module

The following settings may be useful:
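
One such setting is the memory directive, which controls how each process's memory is divided between stack, heap, and the Global Arrays region; giving the semidirect module more memory lets it keep more integrals in memory rather than on disk, without changing the chemistry. The values below are placeholders to be sized for your nodes and process count:

memory stack 1000 mb heap 200 mb global 2800 mb   # placeholder split; the total is per MPI process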

Advanced users may attempt to improve the OpenMP code in CCSD for additional performance.

 

Generating input files

Please use the make_nwchem_input.py script to generate NWChem input files for the competition. For example, you can generate the input file for 7 water molecules with the CCSD(T) module and the cc-pVTZ basis set like this:
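
./make_nwchem_input.py w7 rccsd-t cc-pvtz energy

This is the same command used to generate the competition input in the Tasks and Submissions section below.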

It is always a good idea to set permanent_dir and scratch_dir appropriately for your system. The former path needs to be a shared folder that all nodes can see. The latter can be a local private scratch, such as /tmp. Depending on your input, you may generate large files in both, so it is not a good idea to set them to slow and/or very limited filesystems.
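
A sketch of the corresponding input lines, with placeholder paths that you should replace with real locations on your cluster:

permanent_dir /shared/scratch/nwchem   # placeholder: must be visible from every node
scratch_dir /tmp                       # placeholder: node-local scratch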

For CCSD(T), a smaller molecule can be used for testing, because this method is more expensive. The w5 configuration runs in approximately 10 minutes on a modern server node. For debugging, use w1 or w2.

Tasks and Submissions

Run the application with the input below and submit all logs and the build & run scripts from your best run.

Semidirect CCSD(T)

The input file that follows (generated with ./make_nwchem_input.py w7 rccsd-t cc-pvtz energy) is the one that must be used. The only lines that you can change are noted as such.

The correct result is determined by comparison with the following reference values; the last decimal might vary.

This job should run in less than 5000 seconds of wall time on 4 nodes.

You should practice on the w5 input first, since it runs faster and allows easier experiments. The following are the reference energies for this configuration.