Getting Started with the Coding Challenge for ISC25 SCC
Overview
The ICON (ICOsahedral Nonhydrostatic) earth system model is a unified next-generation global numerical weather prediction and climate modelling system. To prepare the model for the exascale era of supercomputing systems, ICON is currently undergoing a major refactoring. Given the heterogeneity of the hardware, performance portability is crucial. For this purpose, the code base is being converted from a monolithic code into a modularized, scalable and flexible one.
In the new ICON software design, the model consists of several encapsulated modules, and each module can be independently ported to new architectures using different parallel programming paradigms. Standard C++ includes parallel execution policies in the language starting with C++17; more features were added with C++20 and C++23, and a complete definition of parallelism is expected with C++26. Many C++ compilers have started to implement the support needed for these features, with NVIDIA's nvc++ being one of the most advanced in this regard. More documentation about the std::par support in nvc++ can be found online.
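To give a flavour of what standard parallelism looks like in practice, here is a minimal, generic example (our own illustration, not taken from the ICON code) of a loop written with a C++17 parallel algorithm; with nvc++ it can be offloaded to a GPU by compiling with `-stdpar=gpu`:

```cpp
// Minimal std::par illustration (generic example, not ICON code):
// a SAXPY-style update expressed as a standard parallel algorithm.
// Build for GPU offload with: nvc++ -stdpar=gpu -O3 saxpy.cpp
#include <algorithm>
#include <execution>
#include <vector>
#include <cstdio>

int main() {
    const std::size_t n = 1 << 20;
    std::vector<double> x(n, 1.0), y(n, 2.0);
    const double a = 3.0;

    // With nvc++ -stdpar=gpu, parallel algorithms run on the GPU and the
    // vectors' storage is handled through CUDA Unified Memory.
    std::transform(std::execution::par_unseq, x.begin(), x.end(), y.begin(),
                   y.begin(), [a](double xi, double yi) { return a * xi + yi; });

    std::printf("y[0] = %f\n", y[0]);  // expected: 5.000000
    return 0;
}
```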
Looking at the current TOP500 list, one can see that the three exascale systems have a power consumption between 24.7 MW and 38.7 MW. High-resolution simulations can only be done by using large parts of such a system, if not the whole system. This makes energy efficiency a crucial parameter when running high-resolution climate simulations at exascale.
Note: This page may change until the competition starts; make sure to check back until the opening ceremony.
The Task
Your task is to parallelize and optimize the micro-physics (μphys) standalone module extracted from the ICON model for heterogeneous CPU+GPU platforms. Starting from a C++ serial implementation:

- Implement an MPI-parallel version that uses ISO C++ standard parallelism (std::par) with compiler-assisted offloading to GPU (a minimal, generic sketch of this combination is shown after the table below), such that:
  - the wall-clock runtime is faster than the benchmarks (provided at the beginning),
  - the results from different profiles are visualized and explained,
  - efficient run configurations (e.g. for a V100: N blocks x M threads x P registers/thread) are found for GPU offloading through std::par.
- Optimize the code to achieve the best energy efficiency (fastest execution times at the lowest energy consumption). Run the parallel version on up to 4 nodes (with 4 GPUs each) and with different input files, which differ in grid resolution, i.e. mesh size.
- Find a threshold, in terms of number of cells per GPU, that gives the most energy-efficient results. For this assessment, several input files with different mesh sizes are provided.
| Implementation | Platforms | Compilers |
|---|---|---|
| MPI & C++ std::par | x86_64 CPU, NVIDIA GPU | nvc++ (nvhpc@24.7) |
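The following is a hedged sketch (our own illustration, not the actual μphys code) of how MPI domain decomposition can be combined with std::par offloading: each rank owns a contiguous slice of grid cells and processes it with a parallel algorithm, which nvc++ offloads to the rank's GPU when built with `-stdpar=gpu`. The kernel `graupel_step`, the cell count and the data layout are hypothetical placeholders.

```cpp
// Hedged sketch: MPI domain decomposition over grid cells + std::par offload.
// Hypothetical placeholders: graupel_step, ncells_global, the field q.
#include <algorithm>
#include <execution>
#include <numeric>
#include <vector>
#include <mpi.h>

// Hypothetical per-cell kernel standing in for the real micro-physics update.
inline void graupel_step(double& q) { q *= 0.99; }

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const long ncells_global = 1000000;               // assumed total number of grid cells
    const long chunk = (ncells_global + size - 1) / size;
    const long begin = rank * chunk;                   // this rank's first cell
    const long end   = std::min(ncells_global, begin + chunk);

    std::vector<double> q(end - begin, 1.0);           // local slice of one prognostic field

    // Each rank processes only its own cells; with nvc++ -stdpar=gpu this
    // loop is offloaded to the GPU used by the rank.
    std::for_each(std::execution::par_unseq, q.begin(), q.end(),
                  [](double& qi) { graupel_step(qi); });

    // Simple cross-rank reduction, e.g. to sanity-check the global result.
    const double local_sum = std::reduce(std::execution::par_unseq, q.begin(), q.end(), 0.0);
    double global_sum = 0.0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```

How ranks are mapped to the 4 GPUs of a node (e.g. via the SLURM job script) is part of the run configuration you are asked to tune.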
Optimisations could include, but are not restricted to:

- different cache configurations
- different compilation flags
- different CPU-GPU communication patterns
- GPU-to-GPU communication via MPI (see the sketch after this list)
- configurable workload distribution per thread/block
- etc.
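For the GPU-to-GPU item above, one option worth measuring is to pass device-accessible buffers directly to MPI. With nvc++ `-stdpar=gpu`, heap allocations such as `std::vector` storage are typically placed in CUDA Unified Memory, so a CUDA-aware MPI library can exchange them without an explicit host staging copy. The sketch below rests on that assumption; whether it is actually faster on Levante than staging through host buffers is something to benchmark, not a given.

```cpp
// Hedged sketch: direct exchange of device-accessible (unified-memory) buffers
// with a CUDA-aware MPI. Buffer size and ring pattern are illustrative only.
#include <vector>
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 1024;                                  // assumed exchange-buffer size
    std::vector<double> send(n, static_cast<double>(rank)), recv(n, -1.0);

    const int right = (rank + 1) % size;
    const int left  = (rank - 1 + size) % size;

    // The vectors' storage is device-accessible when built with -stdpar=gpu;
    // a CUDA-aware MPI can then move the data GPU-to-GPU without an explicit
    // copy into separate host buffers.
    MPI_Sendrecv(send.data(), n, MPI_DOUBLE, right, 0,
                 recv.data(), n, MPI_DOUBLE, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}
```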
Cluster Access
Cluster access will be granted after March 15th.
The captain of each team will need to register for an account at DKRZ here:
In case your e-mail is not accepted, you should write an email to support@dkrz.de. We will then activate your e-mail address so that you can register.
After your account has been set up, you can log in and request membership in project 1273 (ICON at student cluster competition isc25) at the following link.
In the Project* field, enter 1273; the message should contain the name of the captain and the team (see below).
As a member of project 1273, you can access the source code on GitLab here:
<The link will be given 1-2 weeks before the competition starts>
Levante Hardware for the coding challenge
- 1 GPU node for development
- up to 4 GPU nodes for testing, allocated through SLURM jobs of at most 30 min
Prerequisites
Levante nodes have all the dependencies available to users. The README file in the code repository contains detailed information about the dependencies; there is already a script (`scripts/levante-setup.sh`) which configures the environment for code development.
In case teams prefer to develop on their laptops, the following tools/libraries are needed:
NVIDIA C++ compiler & libc:

- from sources: NVIDIA HPC SDK 24.7 Downloads
- or use the installed software stack on Levante:

  module load nvhpc/24.7-gcc-11.2.0
  export LD_LIBRARY_PATH=/sw/spack-levante/gcc-13.3.0-s2dxrt/lib64/:$LD_LIBRARY_PATH
NetCDF - this can be installed in several ways:

- from sources: https://downloads.unidata.ucar.edu/netcdf-c/4.9.2/netcdf-c-4.9.2.tar.gz
- [for macOS]: using https://formulae.brew.sh/formula/netcdf-cxx
- using spack: `spack install netcdf-cxx4`
- on Levante, make use of the pre-installed NetCDF library, which can be loaded with spack: `spack load netcdf-cxx4@4.3.1`
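In case you want to inspect `output.nc` programmatically, in addition to the `cdo` comparison described further below, a minimal netcdf-cxx4 read could look like the following sketch; the variable name `ta` is only a placeholder and not necessarily a field written by μphys. Link with `-lnetcdf_c++4 -lnetcdf`.

```cpp
// Minimal netcdf-cxx4 read sketch (the variable name "ta" is a placeholder).
#include <netcdf>
#include <vector>
#include <iostream>

int main() {
    netCDF::NcFile file("output.nc", netCDF::NcFile::read);
    netCDF::NcVar var = file.getVar("ta");
    if (var.isNull()) {
        std::cerr << "variable not found\n";
        return 1;
    }
    // Compute the total number of values from the variable's dimensions.
    std::size_t n = 1;
    for (const auto& dim : var.getDims()) n *= dim.getSize();

    std::vector<double> data(n);
    var.getVar(data.data());                 // read the whole variable
    std::cout << "read " << n << " values, first = " << data[0] << "\n";
    return 0;
}
```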
CDO:

- from sources: https://code.mpimet.mpg.de/attachments/28882
- [for macOS]: using https://code.mpimet.mpg.de/projects/cdo/wiki/MacOS_Platform
- using spack: `spack install cdo`
- on Levante, make use of the pre-installed CDO: `module load cdo`
Tasks
Each team has a programming task and an optimization task.
Programming task (50%)
- the code compiles on all supported platforms
- all existing unit tests pass
- the results are numerically CORRECT (compared to a benchmark)
- the input files are in the `tasks/` folder; in the `reference_results/` folder we provide reference results for different input files and platforms (CPU/GPU). Use the `cdo infon -sub` command to compare your results, which are written by the μphys application to the file `output.nc`, with the reference results, as follows: `cdo infon -sub output.nc reference_results/seq_dbg_double.nc`
- your results must be within the intervals defined for each variable shown in the figure below
Optimisation task (50%)
- best energy efficiency counts, i.e. the wall-clock runtime is faster than the benchmarks (provided at the beginning) at the lowest possible energy consumption
- the energy consumption measurements are carried out using the included run script. This script starts the job and also the tool nvidia-smi. At the end of the job, one file per node, named `nvsmi.log.jobid.nodenumber`, is generated, which contains runtime and energy measurements
- using the python script `process_and_plot_nvsmi_02.py`, which takes the `nvsmi.log.*` files as input, 3 graphs are generated:
  - speedup (Y-axis) over number of nodes (X-axis)
  - energy consumption in Wh (Y-axis) over nodes (X-axis)
  - relative-energy/speedup (Y-axis) over nodes (X-axis)
- results from different profiles are visualized and explained
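As a rough reading of these three plots (our interpretation, not an official definition from the organisers), with $T(N)$ the wall-clock time and $E(N)$ the energy accumulated from the nvidia-smi power samples on $N$ nodes:

$$
S(N) = \frac{T(1)}{T(N)}, \qquad
E(N) = \sum_{\text{GPUs}} \int P(t)\,\mathrm{d}t, \qquad
R(N) = \frac{E(N)/E(1)}{S(N)}
$$

Under this reading, a configuration is more energy-efficient when it reaches a given speedup with a smaller $R(N)$.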
Submissions
For evaluation, each team needs to submit in their associated working directory on Levante:

- Your implementation, added to the `<repository>/implementation/std` folder. Provide a link to your repository on Levante (make sure you give us read and execute rights over the folder structure).
- Scripts to build the CPU/GPU executables for correctness/performance (e.g. O0 vs O3).
  - For checking the correctness of the results, your implementation compiled with nvc++ 24.7 (and no optimisation flags) should produce bit-identical (or close to bit-identical) results to `<repository>/implementations/seq/graupel.cpp` (which was provided at the beginning) in both single and double precision, when run on a CPU node. For this, you should provide the scripts which produce these two builds, similar to this.
  - For performance, you can make ANY source code changes and use ANY compile flags, while making sure that the results are within the tolerance intervals defined above with respect to the CPU results. You need to provide the associated scripts, but with your own setup (e.g. compiler & flags). These should run on a GPU node of Levante, because we will only time the runs on the GPU. Use the files from the `scripts/` folder to guide you.
- Slurm logs to confirm results.