WRF with Single Domain - Practice case for ISC21 SCC

Overview

The Weather Research and Forecasting (WRF, pronounced “worf”) model is a primarily-Fortran (yes, we know, it is surprising that we are still alive), highly parallel, CPU-based package for numerical weather prediction (NWP). This is a command-line-driven, finite-difference NWP model that utilizes distributed-memory parallelism with the MPI API. The WRF model runs on a Raspberry Pi 4, on laptops, within Docker containers, and has run on nearly half a million processors for an enormous job. The purpose of selecting this software package for the competition is to allow you to run a very popular program as a big parallel job on multiple CPU nodes. Arguably, four nodes is not massive, but we are trying to emulate the maximum power consumption restrictions from previous competitions. Two objective metrics will be used for this portion of the competition. The first is the time to solution (how fast were you able to get the WRF model to complete?). The second is a validation test that you conduct with the model output (we need to know: “are the results correct?”).

The timing aspect is simple: the goal is to complete the computations as quickly as possible. We add up the list of reported times from a standard output file. Secondly, the model results have to be reasonable to be eligible for consideration. This “reasonable results” requirement brings up the very complicated idea of validating forecast data.

 

One need only think of the weather, in which case the prediction even for a few days ahead is impossible.

– Albert Einstein

 

Ever since Edward Lorenz (1963), there has been widespread appreciation that a chaotic system (such as the output from the WRF model, and in fact the output from any weather or climate model) is sensitive to small perturbations in the initial state, such as a rounding difference due to an optimization flag (you have probably heard of the “butterfly effect”). A statistical analysis of the first few time steps of model output is sufficient to determine the validity of the solution (any farther than a few time steps, and the validation is just not possible). Note that the amount of time that is required to generate and output the validation data is not part of the timing evaluation.

For the single domain practice case, there will be two separate model simulations that are conducted: the first will be to generate validating data, and the second will be used for the timing run. (Note: for the actual competition, a single run of the WRF model will generate the validating data and the information used to determine the elapsed time.) A script that is included in the test kit will use the generated text output and gridded binary files to provide you the information to return to us. Of course, during this practice phase, you do not need to return anything.

During the actual competition, the run-time configuration of the WRF model will be modified to allow a completely different simulation (different day, different location, different run-time options). However, the EXACT SAME source code and executable can be used to run the competition case, so any work in optimizing the WRF model that you have done can be directly used during the actual competition. While the validating data and the internal workings of the scripts will be different, the user interfaces will be similar. To repeat, the WRF model source code that you use in this practice will be identical to the code that you will use during the actual competition (yes, even the executables will be the same). The steps that you take to determine the timing and simulation validity will be nearly the same.

Requirements

To build and run WRF you will need the WRF model source code, compilers (to convert the source code into object code), and external libraries (WRF requires parallel communication and I/O libraries).

The WRF model is maintained in an open-development GitHub repository. You will pull the source code from that repository.

  • WRF 4.2.1 from GitHub. A specific branch for ISC21 SCC has been created.

The WRF model source code requires compilers to build executables. Both a C and a Fortran compiler are required. It is not surprising that a compiler with a license ($$) tends to produce faster executables than a free compiler.

  • Fortran and C compilers. The Fortran compiler is used for most of the computational work, and the C code is used for most of the MPI communication. The Fortran compiler must support Fortran 2008 features. The WRF model may optionally be built with hybrid OpenMP+MPI support for possible performance benefits. The code is CPU-based; no GPU performance optimization is sought.

Only two external libraries are required by the WRF model. The NetCDF (Network Common Data Format) library is used extensively in both the weather and climate communities when dealing with gridded model output or with observational data. The NetCDF library sits on top of the HDF5 (Hierarchical Data Format, version 5) library, which we use to compress the binary output. The WRF model, as with all other weather and climate models, is designed to run with multiple processors, using the MPI (Message Passing Interface) API.

  • HDF5. This is required in order to have NetCDF4 compression support in the NetCDF library. We used version 1.10.5.

  • NetCDF. We used version 4.6.3.

  • MPI. The WRF model uses very simple MPI features. MPICH, MVAPICH, and Open MPI have all been used without any trouble.

Finally, after the model simulations complete, a script is used to compute the validation output. This script includes shell, Python, and Fortran components.

 

All of these requirements (except for accessing the source code) make the WRF model fairly tricky to use because it is difficult to get the build exactly correct. On most modern supercomputers, the system administrators have helped everyone out by supplying access to a great tool: software modules. Issuing a few “module” commands is sufficient to get the WRF model built and running without much trouble. On the niagara machine, we will absolutely use the “module” commands for assistance.

Retrieving the model source code

Another aspect of supercomputers is how they handle disk I/O. It is not unusual for only certain disk partitions to be “writable” from your executable. For example, on niagara, the home directory is readable but not writable from a queued job; however, the scratch disk is always writable. Additionally, more than 50 GB of output will be generated when running WRF, which is also something that supercomputer centers prefer to have on scratch space and NOT on the home disk. With this knowledge, in your preferred installation directory, run the following three commands to retrieve the source code branch that has been designed for the competition:

git clone https://github.com/wrf-model/WRF WRF_ISC21
cd WRF_ISC21
git checkout tags/ISC21 -b ISC21-branch

Environment setup

The niagara supercomputer is set up to use modules, which is going to be a huge help. If your system is set up to use modules, the required ones are typically the compiler and the two mandatory libraries: NetCDF and MPI. During the build process, the WRF model specifically wants an environment variable named NETCDF. This environment variable points to the single directory that contains the include, lib, and bin directories for NetCDF (typically, the separate C and Fortran builds are both installed under a single location, and niagara has them nicely packaged together). The netcdf4 capability (HDF5 compression) is required to read in the data. On the SciNet niagara machine, type the following:

module purge
module load NiaEnv/2019b
module load intel/2019u4 openmpi/4.0.1 hdf5/1.10.5
module load netcdf/4.6.3
module load intelpython3/2019u5
module load udunits/2.2.26
module load ncview/2.1.7
module save

These commands only have to be entered once. In every other window, or when logging in again, just enter:

module restore

The “module load” command for intelpython3 gives a consistent NetCDF for Fortran and Python (bonus!). Make sure to include the intelpython module in your Slurm scripts if you intend to include any Python processing from a queued job. The “module load” commands for udunits and ncview allow you to actually look at the generated data (otherwise, you are just blindly generating files filled with ones and zeroes, and bringing on the heat death of the universe, right?).

Once we have the modules all set up how we want, we need to use those to fill in an environment variable. We want the location where all of the NetCDF system is stored. We cheat and ask where a known NetCDF executable (ncdump) is located, and then use that info to set our env variable.

Once the OS tells us where the executable ncdump is located, we plug most of that answer into an assignment for an env variable:
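For example, assuming a bash-like shell (the path shown is illustrative; use whatever location your modules report):

which ncdump
# e.g. /software/netcdf/4.6.3/bin/ncdump (illustrative path)
export NETCDF=$(dirname $(dirname $(which ncdump)))
# NETCDF now points at the directory holding include, lib, and bin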

Note that this environment variable is required in every window in which you will build the WRF source code. If you log out, you’ll need to reset this environment variable again if you need to rebuild the WRF executable. Given the nature of the optimizations that you likely will undertake, there will be MANY rebuilds of the WRF executable.

To get the necessary python pieces for NetCDF, we suggest that you just use what we eventually figured out:
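The exact location is supplied with the practice kit; as a purely hypothetical illustration, making a pre-built set of site-packages visible to Python looks something like this (the path below is a placeholder, not the real one):

export PYTHONPATH=/path/to/provided/site-packages:$PYTHONPATH   # placeholder path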

 

Note: if you don’t have permissions to the file, copy it from here.

 

As one of the contest coordinators said, “… installing those Python packages from scratch is anything but trivial.” While we know that you could configure the various commands to get NetCDF and Intel and Fortran and HDF5 and MPI and Python libraries synchronized, we want you to spend time on fun things - like computational performance! These site-packages for Python in no way result in any performance degradation, as they are ONLY used to supply the NetCDF I/O capability to an after-the-fact validation Python script.

Basic Configuration

While in the top-level WRF_ISC21 directory where you cloned the source tree, run the following commands:
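Those two commands are the standard WRF clean and configure pair (the same ones discussed below):

./clean -a
./configure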

The first command makes sure that the configuration is going to have a clean directory structure, with no old object code lying around. If you ever run the ./configure command, make sure to also run the ./clean command!

If you set up the NETCDF environment variable above correctly, you should see output similar to the following, where the last line prompts you for a choice:

Each line refers to a type of compiler and type of architecture, and each numbered column refers to a type of parallel build. Column one is a serial build, where only one processor is used – not much good for this competition. Column two is OpenMP (threaded, shared-memory), which is limited to the cores of a single node (40 on niagara) – also not much good for this competition. Columns three (MPI only) and four (MPI + OpenMP) are likely the best bets. Let’s assume we have recent Intel Xeon or AMD EPYC processors, and we are using Intel compilers. That means that a good starting reply to the prompt could be “67”. (We could choose “66”, but then we would not have the option of running hybrid MPI+OpenMP jobs.) Part of the optimization exercise will be to find an optimal mix of MPI processes and OpenMP threads (perhaps no OpenMP is better with your setup!).

The next prompt requires that we choose which kind of nesting will be allowed. Since we will not be running with moving nested domains in either the practice case (this single-domain case) or the actual competition case (the 3-domain case), the default (1=basic) will suffice. Just press Enter to continue. You should see output similar to the following:

We really only care about the last few lines. The initial configuration step, what you just did, only takes a few seconds to complete, as it is only setting up a text file that will be included in a series of hierarchical Makefiles used by the WRF model’s various directories.

Tuning the configure.wrf file

After this basic configuration step, there will be a configure.wrf file that can be edited to change compilation options. The following is suggested as a way to consider further performance enhancements:

  • If you are using AMD EPYC processors, remove the -xHost option everywhere it appears, and change -xCORE-AVX2 to -march=core-avx2

  • If you are using Intel Xeon Skylake, Cascade Lake, or more recent Intel Xeon processors, you may choose to change -xCORE-AVX2 to -xCORE-AVX512, but that is not guaranteed to give you the best performance. You will want to try separate builds.

Every time the ./configure command is issued, it will overwrite the existing configure.wrf file. If a configure.wrf file already exists, it will be renamed configure.wrf.backup. You may choose to save your hand-tuned optimization efforts by renaming configure.wrf to something else before rerunning ./configure. You should re-run the ./configure command every time you choose to change the compiler; to modify only the compiler options, you may simply edit the configure.wrf file. Remember, always run the “./clean -a” command prior to a new ./configure command.

When scientists run WRF under normal conditions, WRF with MPI produces as many rsl.out.XXXX and rsl.error.XXXX files as there are MPI ranks. For the competition we are only interested in the logs produced by MPI rank 0, i.e., files rsl.out.0000 and rsl.error.0000. Thus, we ask that you apply the following edit to your configure.wrf file:

In practice, the output from the different MPI ranks is important for debugging. For the competition, the provided simulations are well behaved, so there is no need to monitor the numerical stability for each of your MPI processes.

Another small tweak is required to the configure.wrf file - we need to tell the Makefiles where cpp (the C preprocessor) is located on the niagara supercomputer.
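As a sketch (the stock configure.wrf typically carries a line of the form CPP = /lib/cpp -P), locate cpp and point that line at it; the path below is illustrative, not necessarily niagara’s:

which cpp
# e.g. /usr/bin/cpp; then edit the CPP line in configure.wrf, for example:
# CPP = /usr/bin/cpp -P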

Compile

Building the WRF executable is done with the following ./compile command. You will want to save both the standard output and standard error, as most compilers send parts of their messages exclusively to one or the other. The time to construct the executable is reduced by allowing more processors to assist in the build process. The example shows allowing six processes to participate. It is not unexpected for a complete build to take about an hour with an Intel compiler. The build command is able to run interactively on niagara (from the command line).
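A sketch of that command, assuming the standard em_real target for real-data cases and csh-style redirection into the log file mentioned below:

./compile -j 6 em_real >& build_wrf.log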

The end of the build_wrf.log file indicates that the WRF executable, wrf.exe, has been successfully generated, and lists the time taken to generate the executables. For the competition, only the wrf.exe file is required from among the generated executables.

As shown above, when the build is successfully finished, there will be a wrf.exe executable under main. A link to that executable under the run directory is provided by the build script.

If a new compiler or compiler version is to be tested, or if a new set of compiler optimization flags is required, the WRF executable must be rebuilt from source from the very beginning. In the top-level directory of the WRF model:
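In other words, something like the following (the same clean/configure/compile sequence as before):

./clean -a
./configure
./compile -j 6 em_real >& build_wrf.log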

Consider keeping multiple directories, as the executables take quite a while to rebuild, and you may want to keep older versions around.

Running WRF

The input files for your practice runs can be downloaded from here. The set of data for the practice kit (this single domain case) is larger than the files for use during the competition (the 3-domain case). The practice data is a public benchmark that cannot easily be restructured. This is why we will have slightly different instructions for the practice WRF example and the actual competition case.

After you have downloaded the tarfile and extracted the files, you should have a WRF_practice_kit directory. You also need to download the VALIDATION directory and its contents from the same shared location, placing it under WRF_practice_kit. After that, the directory will contain two types of files: those that are required to actually run the simulation (input data), and those that are required for the validation step.

Files required for validating the simulation results:

  • The Makefile is intended for building an anova executable from the anova.f90 source; anova is built and used by the validation script validate.csh. Currently, this is set up to use the Intel compiler (likely, also your choice for the WRF build).

  • The Python scripts f2p.py and wrf_bench_checker.py, as well as timing.csh, are also used by validate.csh. The Python scripts assume Python 3. The required libraries for Python are handled entirely by the modules that we set up earlier (see - these ARE convenient!).

  • The VALIDATION directory contains exemplar output files that are used for comparisons.

Files required for running the actual WRF simulation:

  • run_wrf_large.csh-004 is a sample PBS script showing how WRF is expected to be run; it runs WRF twice: once for timing and once for validation. A simple version to start with on niagara is RUN.slurm. RUN_wrf_practice_avx2_oob.slurm gives good examples of how to compute and set the number of OpenMP threads and MPI processes.

  • The data files required by WRF are namelist.input, wrfinput_d01 and wrfbdy_d01; the run script copies namelist.input-VALIDATE and namelist.input-TIMING in turn to namelist.input before running WRF. The three files with the .dat suffix are also read by WRF; they are not strictly necessary because they would be computed if not present, but we provide them to avoid the extra computation.

  • WRF also requires a set of fixed, auxiliary data files that are found in the run directory of the main source directory, WRF_ISC21. You may copy those files to WRF_practice_kit or make symbolic links to them. Let's assume WRF_ISC21 is at the same level as the directory WRF_practice_kit, and that you are currently “in” the WRF_practice_kit directory. Then, the following will create the symbolic links sufficient for you to run the WRF simulation, including a symbolic link to the WRF executable:
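A sketch of those links, assuming a stock WRF run directory (the exact set of table and data files can vary slightly by WRF version):

ln -sf ../WRF_ISC21/run/*.TBL .      # land-use, soil, vegetation tables
ln -sf ../WRF_ISC21/run/RRTM* .      # radiation scheme data files
ln -sf ../WRF_ISC21/run/ozone* .     # ozone climatology files
ln -sf ../WRF_ISC21/main/wrf.exe .   # the executable you built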

After this, you may adapt one of the provided sample run scripts (run_wrf_large.csh-004, RUN.slurm, RUN_wrf_practice_avx2_oob.slurm) for your particular workflow on niagara, and just run it (or, more likely, submit it to the queue on niagara).

You can check on the job status on niagara with:
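Since niagara schedules jobs with Slurm, the standard query applies:

squeue -u $USER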

Note that the execution of timing.csh or validate.csh is commented out in some of the run scripts; these quick timing and validation scripts may be run separately. Each of these scripts requires some directory information as a command line argument.

For the timing script, the single argument to the script is the location of the text-based rsl.out.0000 output file specifically, which we are using for the timing test.
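For example (the path is hypothetical; point the script at wherever your run left rsl.out.0000):

./timing.csh /scratch/$USER/WRF_practice_kit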

For the validation script, two directories are required as command line arguments: where the exemplar data that we provide lives (VALIDATION), and where the data that you generated lives (VALIDATE). We apologize for the nearly identical names - we did better with the 3-domain case.
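For example, assuming the argument order is the exemplar directory first and your directory second:

./validate.csh VALIDATION VALIDATE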

Example output and its interpretation are shown below in the Submissions section.

The sample batch scripts have two separate executions of the WRF model. Note that in the actual competition for the 3-domain case, only a single WRF run is required (not two, as in this practice case), so that the timing and validation are both done within that single execution of the WRF model.

 

Just for fun, to “look” at the model data (either what you created or what is in the exemplar directory), issue the following command:
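That command is ncview pointed at a WRF output file; the file name below is a glob, so substitute the actual wrfout file from your run or from the exemplar directory:

ncview wrfout_d01_*   # substitute the actual output file name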

Under “3d vars”, select “QVAPOR”. From the “Opts” button, select “USA states” and click “OK” at the bottom of that new window. This shows really cold, dry air in south-central Canada sagging into the upper Midwest of the US, and warm, moist air over the Gulf of Mexico. When the code is building for nearly an hour, clicking through these plots is an easy way to pass the time.

 

Suggestions for Optimization

Part of the optimization is determining the number and aspect ratio of the MPI processes, and determining the optimal mix of MPI processes and OpenMP threads. In the namelist.input file, in the &domains namelist record, the default values for these settings are given here:
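The relevant entries look like this (the values shown are what we believe the WRF defaults to be; confirm against your own namelist.input). A value of -1 for nproc_x and nproc_y tells WRF to choose the MPI decomposition itself:

&domains
 nproc_x    = -1
 nproc_y    = -1
 numtiles_x = 1
 numtiles_y = 1
/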

The numtiles_x and numtiles_y control the number of OpenMP threads used (and their orientation) within each MPI process. The nproc_x and nproc_y control the number of MPI ranks (and their orientation).

The horizontally decomposed dimensions may not be < 10 (in either the x- or y-direction) for MPI. The total number of MPI processes is nproc_x * nproc_y. So for a total of 40 processes, for example, you could have 20 MPI processes, and two OpenMP threads per MPI process. With 20 MPI processes, you could have 20 decompositions in the south-north direction and one decomposition in the west-east direction (or 10x2, 5x4, 4x5, 2x10, 1x20). For the OpenMP threads (controlled by numtiles_x and numtiles_y), you may similarly choose how the threads are laid out. For a total of the two threads that we are using for an example: numtiles_x=1 and numtiles_y=2 (or 2x1). There is a LARGE performance space to consider with just the choice of MPI and OpenMP options.
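As a concrete instance of the 20-rank, 2-thread example above (one possible orientation of the 5x4 split; the launch lines are illustrative Open MPI/bash syntax):

&domains
 nproc_x    = 4
 nproc_y    = 5
 numtiles_x = 1
 numtiles_y = 2
/

export OMP_NUM_THREADS=2
mpirun -np 20 ./wrf.exe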

Submissions

There is no need to submit anything for this case; it is only for practice.

For the validation step, you want to see three “thumbs up” signs. This is a statistical evaluation of three of the important physical quantities in the model: a combination of the temperature and pressure (referred to as potential temperature, “theta”), the moisture (the mixing ratio, “qv”), and a component of the horizontal wind (grid-relative left to right, not necessarily west to east: “u”). The evaluation is an ANOVA test (here’s a quick read on ANalysis Of VAriance testing).

We want to ask a simple question that is hard to answer: “are the exemplar data and your data the same?” The ANOVA method is a statistical technique widely utilized in industrial manufacturing to determine if populations are the same, though of course it is couched in the statistical jargon of “do we accept the null hypothesis”. There are a number of issues to consider, which are referred to as factors in ANOVA testing. We want to account for all possible factors, and then see which factors in a process are important when considering reproducibility and overall quality. In our usage, we know a priori that a number of factors will produce different results (when comparing the exemplar vs. your data).

For example, we know the following are important factors that cause a difference in the simulation:

  • the time of the forecast: We use several time periods in the validation; the initial time of the forecast is, of course, different from time period four.

  • the geophysical location of validation: We use 13 small boxes in the domain, as we expect the simulation to have different answers in different locations.

  • which vertical level: We use one at the surface, one about a kilometer above ground, and one level towards the top of the model’s lid. (Again, the weather will be different at ground level vs. where jets fly.)

  • the variable of interest: We separate wind, temperature, and moisture, as we would expect the temperature to behave differently than the wind.

  • exemplar vs. your data: The important remaining factor is your data vs. our data – and we do not want that factor to be a large source of difference. That (hopefully unconfounded) difference is what we specifically test.

There are many other factors that would impact the model results in meaningful ways, but we are not allowing you to exercise those options (such as different physical parameterization schemes, or different numerical schemes). However, an optimized library for intrinsic functions would likely be entirely OK, as would using real*8 vs. real*4 floats (that would be slower, though); enhanced optimization is OK, and using a different compiler or chip is OK, etc. The ANOVA test, as long as we stay within the first few time steps, is quite adept at identifying acceptable data and rejecting incorrect data.

The VALIDATION directory is the exemplar directory that is filled with data that we provide. The VALIDATE directory is populated by you with each WRF simulation (by both the run_wrf_large.csh-004 and RUN.slurm example scripts). Again, sorry for the too-similar naming convention!

The timing test is very quick. A short shell script uses the typical Linux “grep/cat/awk” set of commands to give us the time to solution for the various components of the model:
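As a sketch of the kind of pipeline involved (the provided timing.csh is the authoritative script), the “Timing for main” lines that WRF writes to rsl.out.0000 can be summed like so:

grep "Timing for main" rsl.out.0000 | awk '{sum += $(NF-2)} END {printf "total elapsed: %.3f seconds\n", sum}'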

HAPPY COMPUTING!

Once you are comfortable running through these steps to build and run the code, and can generate the output for timing and validation, you may move on to the 3-domain case. Good luck!