Coding Challenge for ISC22 SCC

For the ISC22 Student Cluster Competition, we will have a coding challenge for the participating teams!

Cluster and Software to be used

The Mission

  • Alter the Xcompact3d code to support the NVIDIA BlueField-2 DPU with non-blocking collective offload.

  • Try to achieve the best performance by overlapping communication and computation.

Getting Started

  • Each team will get access to the HPC-AI Advisory Council Thor cluster.

  • Once the team has access to the login node, use Slurm to allocate Thor nodes; see Getting Started with HPC-AI AC Clusters for an example.

  • Each team will receive a license file from X-ScaleSolutions to be able to use the MVAPICH-DPU software. You will also receive from X-ScaleSolutions the User Manual and examples of how to use the MPI library.

 

Coding Challenge Overview

 

 

Slides:

 

Tasks and Submissions

 

Input file: can be downloaded here.

0. Task “Getting familiar with the BlueField-2 DPU”

[practice task, no grading] – Run the host-based OSU_alltoall and OSU_ialltoall benchmarks over 8 nodes on Thor, using the BlueField-2 adapter. For this test, use MVAPICH-DPU MPI to get a baseline.

  1. OSU_alltoall 8 PPN, 8 nodes

  2. OSU_alltoall 32 PPN, 8 nodes

  3. OSU_ialltoall 8 PPN, 8 nodes, no DPU offload

  4. OSU_ialltoall 8 PPN, 8 nodes, DPU offload

  5. OSU_ialltoall 32 PPN, 8 nodes, no DPU offload

  6. OSU_ialltoall 32 PPN, 8 nodes, DPU offload

Sample script: RUN-osu.slurm

#!/bin/bash -l
#SBATCH -p thor
#SBATCH --nodes=16
#SBATCH -J osu
#SBATCH --time=15:00
#SBATCH --exclusive

module load gcc/8.3.1 mvapich2-dpu/2021.08

srun -l hostname -s | awk '{print $2}' | grep -v bf | sort > hostfile
srun -l hostname -s | awk '{print $2}' | grep bf | sort | uniq > dpufile
NPROC=$(cat hostfile | wc -l)
EXE=$MVAPICH2_DPU_DIR/libexec/osu-micro-benchmarks/osu_ialltoall

# No DPU offload
mpirun_rsh -np $NPROC -hostfile hostfile MV2_USE_DPU=0 $EXE

# DPU offload
mpirun_rsh -np $NPROC -hostfile hostfile -dpufile dpufile $EXE

Please note that processes run only on the hosts. MVAPICH takes care of the DPU offloading.

Job submission:

sbatch -N 16 -w thor0[25-32],thor-bf[25-32] --ntasks-per-node=8 RUN-osu.slurm
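The practice runs differ only in the benchmark and the number of processes per node, so the submissions can be generated in a loop. The following is a minimal dry-run sketch that only prints the sbatch commands. It assumes that RUN-osu.slurm has been changed to read the benchmark name from an environment variable called BENCH; that variable is hypothetical and is not part of the sample script. Whether --ntasks-per-node=32 is accepted for the BlueField-2 nodes depends on the cluster configuration, so check the printed commands before running them.

```shell
#!/bin/bash
# Dry run: print one sbatch command per (benchmark, PPN) combination.
# BENCH is a hypothetical variable; RUN-osu.slurm would need to use it, e.g.
#   EXE=$MVAPICH2_DPU_DIR/libexec/osu-micro-benchmarks/$BENCH
# The node list is copied from the job submission example above.
NODELIST='thor0[25-32],thor-bf[25-32]'

print_submissions() {
    local bench ppn
    for bench in osu_alltoall osu_ialltoall; do
        for ppn in 8 32; do
            echo "sbatch -N 16 -w $NODELIST --ntasks-per-node=$ppn --export=ALL,BENCH=$bench RUN-osu.slurm"
        done
    done
}

print_submissions
```

This prints four commands rather than six, because the sample script already runs the benchmark once without and once with DPU offload inside a single job.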

Question: What is the minimum message size at which the MPI library performs DPU offload for the MPI_Ialltoall call?

1. Task “Modifying the given application codebase to leverage NVIDIA DPU”

[contribution to total score 30%] – Modify the code to use a non-blocking strategy (call MPI_Ialltoall instead of MPI_Alltoall), and run the modified xcompact3d application using DPU offload mode.

Application source code: https://github.com/xcompact3d/Incompact3d (tag: v4.0)

Submission criteria:

  • Submit the entire modified code plus build and submission scripts to Filippo Spiga (as a zip file)

    • Bonus points for readability and level of comments

What is allowed?

  • Modify any portion of the source code as long as the results are correct and reproducible

What is NOT allowed?

  • Artificially reducing the I/O

  • Changing the physics of the input files

  • Changing the version of the code

Suggestions:

  • The 2decomp sub-folder contains a library that performs transpositions (files transpose_*_to_*) of the decomposed domain

  • Focus on changing the main application files, replacing relevant and expensive transpose calls (transpose_*_to_*) with the appropriate non-blocking pair (transpose_*_to_*_start and transpose_*_to_*_wait)
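To decide which transpose calls are worth replacing, it helps to first see where the blocking variants are called. The following is a minimal sketch that counts, per file, the lines containing a blocking transpose call. It assumes the Fortran sources use the .f90 extension and takes the directory to search as an argument; the path in the usage comment is hypothetical.

```shell
#!/bin/bash
# Count lines with blocking transpose calls per Fortran source file.
# Matches "call transpose_x_to_y(" and similar, but not the _start/_wait
# variants, because the pattern requires "(" right after the axis letter.
# Assumption: sources end in .f90 (adjust --include if they do not).
count_blocking_transposes() {
    local src_dir=$1
    grep -rciE --include='*.f90' \
        'call[[:space:]]+transpose_[xyz]_to_[xyz][[:space:]]*\(' "$src_dir" \
        | grep -v ':0$' \
        | sort -t: -k2,2nr
}

# Hypothetical usage from the top of the source tree:
# count_blocking_transposes src
```

The output is one "file:count" line per file, with the most call-heavy files first. A call count is only a hint: a profiler is still needed to tell which of these calls are expensive.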

2. Task “Performance assessment of original versus modified application”

[contribution to total score 30%] – Run the original and modified xcompact3d application using the cylinder input case (/global/home/groups/isc_scc/coding-challenge/input.i3d). You are not allowed to change the problem size, but you should adjust “Domain decomposition” in the input file. Obtain performance measurements using 8 nodes with and without the DPU adapter (note: Thor servers are equipped with 2 adapters, ConnectX-6 and BlueField-2; mlx5_2 should be used on the host), and make sure to vary PPN (4, 8, 16, 32). Run an MPI profiler (mpiP or IPM) to understand whether MPI overlap is happening and how the parallel behaviour of the application has changed.
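Each PPN value gives a different total number of MPI ranks on 8 nodes, so the domain decomposition has to be adjusted for every run. The following is a minimal sketch that lists the candidate decompositions. It assumes the decomposition is given as two integers whose product equals the number of ranks; the names p_row and p_col are used here on that assumption and should be checked against the input file. The application may also restrict which pairs are valid for a given grid.

```shell
#!/bin/bash
# List candidate 2D decompositions for each PPN on 8 nodes.
# Assumption: the input file takes two integers (called p_row and p_col
# here) whose product must equal the total number of MPI ranks.
NODES=8

# Print every "p_row p_col" pair with p_row * p_col == nranks.
factor_pairs() {
    local nranks=$1 r
    for ((r = 1; r <= nranks; r++)); do
        if ((nranks % r == 0)); then
            echo "$r $((nranks / r))"
        fi
    done
}

for ppn in 4 8 16 32; do
    nranks=$((NODES * ppn))
    echo "PPN=$ppn ranks=$nranks"
    factor_pairs "$nranks" | while read -r p_row p_col; do
        echo "  p_row=$p_row p_col=$p_col"
    done
done
```

Different pairs change the message sizes and the number of ranks taking part in each all-to-all exchange, so it is worth measuring more than one pair per PPN.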

Submission criteria:

  • Submit all building scripts and outputs

  • Submit baseline scaling results (graph) using unmodified application

  • Submit baseline scaling results (graph) using modified application

    • What is the message size used for MPI_Alltoall / MPI_Ialltoall? How is message size related to performance improvements?

  • The modified code must be correct and return exactly the same results for any given input and set of execution parameters (number of MPI processes, number of MPI processes per node, grid size, grid decomposition)
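One way to check the "exactly the same results" requirement is a byte-for-byte comparison of the files written by the original and the modified runs. The following is a minimal sketch, assuming each run writes its result files into its own directory; the directory names in the usage comment are hypothetical. Log files that contain timings or host names will differ between runs and should be kept out of the compared directories.

```shell
#!/bin/bash
# Compare every regular file in a reference directory against the file
# of the same name in a candidate directory, byte for byte.
# Returns 0 only if every reference file exists in the candidate
# directory with identical content.
same_results() {
    local ref_dir=$1 new_dir=$2 f name status=0
    for f in "$ref_dir"/*; do
        [ -f "$f" ] || continue
        name=$(basename "$f")
        if ! cmp -s "$f" "$new_dir/$name"; then
            echo "DIFFERS: $name"
            status=1
        fi
    done
    return $status
}

# Hypothetical usage; adjust to wherever each run writes its output:
# same_results out_original out_modified && echo "results identical"
```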

3. Task “Summarize all your findings and results”

[contribution to total score 20%] – Submit a report of what you managed to achieve and what you learned. Include a description of the code changes (better if also done as comments in the modified code) and all meaningful comparisons of results with and without DPU offload. Elaborate on why you did (or did not) get performance improvements.

Submission criteria:

  • Report - a document or slides that explain what you did and what was learned

    • Tables, graphs and MPI traces are welcome, alongside a clear description of what has been done

    • Try to highlight clearly which team member contributed to which task

  • Performance improvement of your modified code vs. the original code provided

    • A ranking will be created listing all teams who successfully submitted a working code

4. Bonus Task “Explore different inputs and configurations”

[contribution up to 20%] – Modify the input (or use new ones) to create larger or smaller problems, change the number of iterations or change the node count, and compare results with and without offload.

Submission criteria:

  • Submit all building scripts and outputs

  • Submit baseline scaling results (graph) using unmodified application

  • Submit baseline scaling results (graph) using modified application

    • What is the message size used for MPI_Alltoall / MPI_Ialltoall? How is message size related to performance improvements?

  • Wrap up your findings in a separate annex of the main report

  • The modified code must be correct and return exactly the same results for any given input and set of execution parameters (number of MPI processes, number of MPI processes per node, grid size, grid decomposition)

References