Getting Started with the Coding Challenge for ISC24 SCC

 

Overview

The ICON (ICOsahedral Nonhydrostatic) earth system model is a unified next-generation global numerical weather prediction and climate modelling system. To prepare the model for the exascale era of supercomputing systems, ICON is currently undergoing a major refactoring. Given the heterogeneity of modern hardware, performance portability is crucial. To this end, the code base is being converted from a monolithic code into a modular, scalable and flexible one.

Structure of the new ICON-C model

In the new ICON Consolidated (ICON-C) software design, the model consists of several encapsulated modules, each of which can be ported independently to new architectures using different parallel programming paradigms.

 

Note: This page may change until the competition starts, so make sure to check back until the opening ceremony.

 

Coding Challenge presentation to the teams:

 

Presentation file:

The Task

Your task is to parallelize and optimize the micro-physics (μphys) standalone module extracted from the ICON model. Starting from a serial C++ implementation, extend it with a directive-based programming API to target heterogeneous platforms, and optimise it to achieve the best execution times.

GPU-enabled parallel versions of μphys

The implementation must use either OpenACC* or OpenMP**, and should target the following platforms:

Implementation    Platforms                 Compilers
OpenACC           x86_64 CPU, NVIDIA GPU    nvhpc++, GNU/g++
OpenMP            x86_64 CPU, NVIDIA GPU    GNU/g++, LLVM/clang++

Optimisations could include, but are not restricted to:

  • new data layouts (e.g. AoS/SoA)

  • different cache configurations

  • different compilation flags

  • different CPU-GPU communication patterns

  • GPU-to-GPU communication via MPI

  • configurable workload distribution per thread/block

  • etc.

Cluster Access

  • Each team's captain will need to register for an account at DKRZ here:

https://luv.dkrz.de/projects/newuser/

  • In case your e-mail is not accepted, write an email to support@dkrz.de; we will then activate your e-mail address so that you can register.

  • After your account has been set up, you can log in and request membership in project 1273 (ICON at student cluster competition ISC24) at the following link.

In the field Project* enter 1273; the message should contain the names of the captain and the team (see below).

  • As a member of project 1273 you can access the source code on GitLab here.

  • Levante Hardware for the coding challenge

    • 1 GPU node for development

    • 4 GPU nodes for testing allocated through SLURM jobs of max 30 min

Prerequisites

All dependencies are available to users on the Levante nodes. If a team prefers to develop on their own laptops, the following tools/libraries are needed:

  1. C++ compiler & libc; to use a GCC compiler that is able to do offloading, you can either use your own Spack clone or load one of the following pre-installed compilers:

    module load gcc/.12.3.0-gcc-11.2.0-nvptx
    module load gcc/.13.2.0-gcc-11.2.0-nvptx
    module load nvhpc/22.5-gcc-11.2.0
  2. NETCDF - This can be installed in several ways:

    1. from sources https://downloads.unidata.ucar.edu/netcdf-c/4.9.2/netcdf-c-4.9.2.tar.gz

    2. [for MacOS] : using https://formulae.brew.sh/formula/netcdf-cxx

    3. using spack:

      spack install netcdf-cxx4
    4. on Levante make use of the pre-installed NETCDF lib, which can be loaded with spack

      spack load netcdf-cxx4@4.3.1
  3. CDO:

    1. from sources https://code.mpimet.mpg.de/attachments/28882

    2. [for MacOS] using https://code.mpimet.mpg.de/projects/cdo/wiki/MacOS_Platform

    3. using spack:

      spack install cdo
    4. on Levante make use of the pre-installed cdo tool, which can be loaded with spack

Tasks

Each team has a programming task and an optimisation task; extra points are awarded for an optional task (only if the first two are at least partially addressed).

  1. Programming task (50%)

    1. the code compiles on all supported platforms

    2. all existing unit tests pass

    3. code readability following C++ code guidelines

    4. the results are numerically CORRECT (compared to a benchmark)

      1. the input files are in tasks/ folder; larger file for optimisation purposes can be found [here]

      2. in the file path-to-muphys-cpp/reference_results/sequential_double_output.nc, results from the sequential run are provided. Use the cdo infon -sub command to compare your results, which the μphys application writes to output.nc, against the reference results, e.g.:

        cdo infon -sub output.nc path-to-muphys-cpp/reference_results/sequential_double_output.nc

      3. for the comparison between CPU and GPU runs we expect small differences: in the column Mean of the cdo infon output, the average difference for each field (Parameter_name) should be less than or equal to 10e-11.

      4. for CPU-only runs we expect bit-identical results, which can be checked with the cdo diffn command; its output should look like this:

cdo diffn: Processed 156800 values from 14 variables over 2 timesteps [0.05s 24MB]

  2. Optimisation task (50%)

    1. wall-clock runtime is faster than the benchmarks (provided at the beginning)

    2. results from different profiling runs are visualized and explained

    3. efficient run configurations (e.g. for V100: N blocks x M threads x P registers/thread)

  3. (optional) Maintainability task (for extra points, up to 20%)

    1. the code is covered by unit tests

    2. the code is well documented in doxygen style

    3. the final solution is more portable than expected

    4. an experience report about working with OpenMP/OpenACC

 

 

Submissions

For evaluation, each team needs to submit, in their associated working directory on Levante:

  1. Path to the implementation (e.g. impl/openX folder)

  2. Scripts to build CPU/GPU executable for correctness/performance (e.g. O0 vs O3)

    1. For checking the correctness of the results, your implementation compiled with GCC (and the -O0 flag) should produce bit-identical (or close to bit-identical) results to scc_at_isc24/implementations/seq/graupel.cpp (which was provided at the beginning) in both single and double precision, when run on a CPU node. For this, you should provide the scripts that produce these two builds, similar to this:

    2. For performance, you can make ANY source code changes and use ANY compiler and ANY compile flags, as long as the results stay within the threshold (1e-11) with respect to the CPU results. You need to provide the associated scripts, similar to the one above, but with your own setup (e.g. compiler & flags). These should run on a GPU node of Levante, because only the runs on the GPU will be timed. Use the files from the scripts folder to guide you.

  3. Slurm logs to confirm results

  4. Summary list of optimizations performed

  5. Plots to confirm performance results

  6. (opt) Profiler analysis output & interpretation

  7. (opt) Experience report for using OpenMP/OpenACC

 


* OpenACC version: min 2.6

** OpenMP version: min 4.5, recommended: 5.0