
Getting Started with the Coding Challenge for ISC25 SCC

 

Overview

The ICON (ICOsahedral Nonhydrostatic) earth system model is a unified next-generation global numerical weather prediction and climate modelling system. To prepare the model for the exascale era of supercomputing systems, ICON is currently undergoing a major refactoring. Given the heterogeneous hardware of these systems, performance portability is crucial. For this purpose, the code base is being converted from a monolithic code into a modularized, scalable and flexible one.

Structure of the new ICON-C model

In the new ICON software design, the model consists of several encapsulated modules, and each module can be independently ported to new architectures, thereby using different parallel programming paradigms. Standard C++ has included parallel execution policies in the language since C++17; more features were added with C++20 and C++23, while a complete parallel definition will be achieved with C++26. Many C++ compilers have started to implement the support needed for these features, with NVIDIA's nvc++ being one of the most advanced in this regard. More documentation about the std::par support in nvc++ can be found online.
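
As an illustration (not taken from the ICON code base), here is a minimal sketch of a standard algorithm with a parallel execution policy; compiled with nvc++ -stdpar=gpu the loop can be offloaded to the GPU, with -stdpar=multicore it runs on CPU threads. The field name and the update rule are placeholders.

  #include <algorithm>
  #include <execution>
  #include <vector>

  int main() {
      // Placeholder field; with nvc++ -stdpar=gpu this allocation is placed in CUDA managed memory.
      std::vector<double> q(1'000'000, 1.0);
      // Apply a per-element update in parallel (offloaded to the GPU when -stdpar=gpu is used).
      std::transform(std::execution::par_unseq, q.begin(), q.end(), q.begin(),
                     [](double x) { return 0.5 * x + 1.0; });
      return 0;
  }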

Looking at the current TOP500 list, one can see that the three exascale systems have an energy consumption between 24.7 MW and 38.7 MW. High-resolution simulations can only be done by using large parts of such a system, if not the whole system. This makes energy efficiency a crucial parameter when running high-resolution climate simulations at exascale.

 

Note: This page may change until the competition starts; make sure to follow up until the opening ceremony.

 

The Task

Your task is to parallelize and optimize the micro-physics (μphys) standalone module extracted from the ICON model for heterogeneous CPU+GPU platforms. Starting from a serial C++ implementation:

  1. Implement an MPI-parallel version that uses ISO C++ std::par compiler-assisted offloading to the GPU (a minimal sketch follows this list), such that:

    1. wall clock runtime is faster than benchmarks (provided at the beginning)

    2. results from different profiles are visualized and explained

    3. efficient run configurations are used for the GPU offloading through std::par (e.g. for a V100, use N blocks x M threads x P registers/thread)

  2. Optimize the code to achieve the best energy efficiency (fastest execution times at the lowest energy consumption). Run the parallel version on up to 4 nodes (with 4 GPUs each) and with different input files, which differ in the grid resolution, i.e. mesh size.

  3. Find a threshold, in terms of the number of cells per GPU, that gives the most energy-efficient results. Several input files with different mesh sizes are provided for this assessment.
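
A minimal sketch of the pattern asked for in point 1, assuming an even split of cells across MPI ranks and a placeholder per-cell update instead of the actual μphys kernel; compile e.g. with an MPI wrapper around nvc++ and -stdpar=gpu.

  #include <mpi.h>
  #include <algorithm>
  #include <cstdio>
  #include <execution>
  #include <numeric>
  #include <vector>

  int main(int argc, char** argv) {
      MPI_Init(&argc, &argv);
      int rank = 0, nranks = 1;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &nranks);

      const long ncells_global = 1'000'000;               // placeholder mesh size
      const long ncells_local  = ncells_global / nranks;  // even split for brevity

      std::vector<double> field(ncells_local, 1.0);
      std::vector<long>   cell(ncells_local);
      std::iota(cell.begin(), cell.end(), 0L);

      // GPU-offloaded loop over the local cells (stand-in for the graupel kernel).
      double* f = field.data();
      std::for_each(std::execution::par_unseq, cell.begin(), cell.end(),
                    [f](long i) { f[i] = 0.5 * f[i] + 1.0; });

      // Reduce a per-rank diagnostic for a quick sanity check on rank 0.
      double local_sum  = std::reduce(std::execution::par_unseq,
                                      field.begin(), field.end(), 0.0);
      double global_sum = 0.0;
      MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
      if (rank == 0) std::printf("global sum = %f\n", global_sum);

      MPI_Finalize();
      return 0;
  }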


Implementation      | Platforms               | Compilers
MPI & C++ std::par  | x86_64 CPU, NVIDIA GPU  | nvc++ (nvhpc@24.7)

Optimisations could include, but are not restricted to:

  • different cache configurations

  • different compilation flags

  • different communication patterns CPU-GPU

  • GPU to GPU communication via MPI (see the sketch after this list)

  • configurable workload distribution per thread/block

  • etc.
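
As referenced above, a hedged sketch of GPU-to-GPU communication via MPI. It assumes a CUDA-aware MPI build (to be verified for the MPI modules available on Levante); with nvc++ -stdpar=gpu, stdpar allocations live in CUDA managed memory, so the halo buffers can be handed to MPI without explicit host staging.

  #include <mpi.h>
  #include <vector>

  // Exchange the boundary cells with one neighbouring rank.
  void exchange_halo(std::vector<double>& send_halo,
                     std::vector<double>& recv_halo,
                     int neighbour, MPI_Comm comm) {
      MPI_Request req[2];
      // Post the receive first, then the send; wait for both to complete.
      MPI_Irecv(recv_halo.data(), static_cast<int>(recv_halo.size()), MPI_DOUBLE,
                neighbour, 0, comm, &req[0]);
      MPI_Isend(send_halo.data(), static_cast<int>(send_halo.size()), MPI_DOUBLE,
                neighbour, 0, comm, &req[1]);
      MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
  }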

Cluster Access

 

Cluster Access will be given after March 15th.

 

  • The team captain of each team will need to register for an account at DKRZ here:

  • In case your e-mail is not accepted, you should write an email to support@dkrz.de. We will then activate your e-mail address so that you can register.

  • After your account has been set up, you can log in and request membership in project 1273 (ICON at student cluster competition isc25) at the following link.

In the Project* field enter 1273, and the message should contain the name of the captain and the team (see the screenshot below).

[Screenshot: project membership request form]

  • Being a member of project 1273, you can access the source code on GitLab here

    • <The link will be given 1-2 weeks before the competition starts>

  • Levante Hardware for the coding challenge

    • 1 GPU node for development

    • up to 4 GPU nodes for testing, allocated through SLURM jobs of max. 30 min

Prerequisites

Levante nodes have all the dependencies available to the users. The README file from the code repository contains detailed information about dependencies; there is already a script (scripts/levante-setup.sh) which configures the environment for code development.

In case teams prefer to do the development on their own laptops, the following tools/libraries are needed:

  1. NVIDIA C++ compiler & libc:

    1. from sources NVIDIA HPC SDK 24.7 Downloads

    2. use the installed software stack on Levante:

      module load nvhpc/24.7-gcc-11.2.0
      export LD_LIBRARY_PATH=/sw/spack-levante/gcc-13.3.0-s2dxrt/lib64/:$LD_LIBRARY_PATH
  2. NETCDF - This can be installed in several ways (a minimal usage sketch follows this list):

    1. from sources https://downloads.unidata.ucar.edu/netcdf-c/4.9.2/netcdf-c-4.9.2.tar.gz

    2. [for MacOS] : using https://formulae.brew.sh/formula/netcdf-cxx

    3. using spack:

      spack install netcdf-cxx4
    4. on Levante make use of the pre-installed NETCDF lib, which can be loaded with spack

      spack load netcdf-cxx4@4.3.1
  3. CDO:

    1. from sources https://code.mpimet.mpg.de/attachments/28882

    2. [for MacOS] using https://code.mpimet.mpg.de/projects/cdo/wiki/MacOS_Platform

    3. using spack:

      spack install cdo
    4. on Levante make use of the pre-installed CDO tool:

      module load cdo
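
As mentioned under the NETCDF item above, a minimal sketch of reading a field back with netcdf-cxx4. The file name and the variable name "qv" are hypothetical; use the names actually written by the μphys application. Link e.g. with -lnetcdf_c++4 -lnetcdf.

  #include <netcdf>
  #include <cstddef>
  #include <iostream>
  #include <vector>

  int main() {
      netCDF::NcFile file("output.nc", netCDF::NcFile::read);
      netCDF::NcVar  var = file.getVar("qv");       // hypothetical variable name

      std::size_t n = 1;                            // total number of values
      for (const auto& dim : var.getDims()) n *= dim.getSize();

      std::vector<double> data(n);
      var.getVar(data.data());                      // read the whole variable
      std::cout << "read " << n << " values, first = " << data[0] << "\n";
      return 0;
  }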

Tasks

Each team has a programming task and an optimization task.

  1. Programming task (50%)

    1. the code compiles on all supported platforms

    2. all existing unit-tests are passed

    3. the results are numerically CORRECT (compared to a benchmark)

      1. the input files are in the tasks/ folder;

      2. in the folder reference_results/ we provide reference results for different input files and platforms (CPU/GPU). Use the cdo infon -sub command to compare your results, which are written by the μphys application to the file output.nc, with the reference results as follows:

      3. cdo infon -sub output.nc reference_results/seq_dbg_double.nc

      4. Your results must be within the intervals defined for each variable shown in the figure below

[Screenshot: tolerance intervals for each output variable]

  2. Optimisation task (50%)

    1. best energy efficiency counts, i.e. a wall clock runtime faster than the benchmarks (provided at the beginning) at the lowest possible energy consumption.

    2. the energy consumption measurements are carried out using the included run script. This script starts the job and also the tool nvidia-smi. At the end of the job, one file per node, named nvsmi.log.jobid.nodenumber, is generated, which contains run time and energy measurements.

    3. using the Python script process_and_plot_nvsmi_02.py, which takes the nvsmi.log.* files as input, three graphs are generated:

      1. speedup (Y-axis) over number of nodes (X-axis)

      2. power consumption shown as a plot of Wh (Y-axis) over nodes (X-axis) (see the energy sketch after this list)

      3. the relative-energy/speedup (Y-axis) over nodes (X-axis)

    4. results from different profiles are visualized and explained
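
For orientation, a hedged sketch of the arithmetic behind the Wh plot mentioned above: the power samples (W) that nvidia-smi logs at a fixed interval are integrated over time into energy in watt-hours. The sampling interval is an assumption here; the provided Python script remains the reference for the actual evaluation.

  #include <numeric>
  #include <vector>

  // Energy in Wh from equally spaced power samples (rectangle rule).
  double energy_wh(const std::vector<double>& power_watts,
                   double sample_interval_s) {          // e.g. 1.0 s
      const double joules =
          std::accumulate(power_watts.begin(), power_watts.end(), 0.0)
          * sample_interval_s;                          // W * s = J
      return joules / 3600.0;                           // 1 Wh = 3600 J
  }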

Submissions

For evaluation, each team needs to submit in their associated working directory on Levante:

  1. Add your implementation to the <repository>/implementation/std folder. Provide a link to your repository on Levante (make sure you give us read and execute rights on the folder structure).

  2. Scripts to build the CPU/GPU executables for correctness/performance (e.g. -O0 vs -O3)

    1. For checking the correctness of the results, your implementation compiled with nvc++ 24.7 (and no optimisation flags) should produce bit-identical (or close to bit-identical) results to <repository>/implementations/seq/graupel.cpp (which was provided at the beginning) in both single and double precision, when run on a CPU node. For this, you should provide the scripts which produce these two builds, similar to this.

    2. For performance, you can make ANY source code changes and use ANY compiler flags, while making sure that the results are within the tolerance intervals defined above with respect to the CPU results. You need to provide the associated scripts, but with your own setup (e.g. compiler & flags). These should run on a GPU node of Levante, because we will only time the runs on the GPU. Use the files from the scripts/ folder to guide you.

  3. Slurm logs to confirm results

 


 

 

 
