For the ISC22 Student Cluster Competition, we will have a coding challenge for the participating teams!

Cluster and Software to be used

The Mission

Getting Started

Coding Challenge Overview

Slides:

Tasks and Submissions

Input file: Can be downloaded here.

0. Task “Getting familiar with the BlueField-2 DPU”

[practice task, no grading] – Run host-based OSU_alltoall and OSU_ialltoall over 8 nodes on Thor, using the BlueField-2 adapter. For this test, use MVAPICH2-DPU MPI to get a baseline.

  1. OSU_alltoall 8 PPN, 8 nodes

  2. OSU_alltoall 32 PPN, 8 nodes

  3. OSU_ialltoall 8 PPN, 8 nodes, no DPU offload

  4. OSU_ialltoall 8 PPN, 8 nodes, DPU offload

  5. OSU_ialltoall 32 PPN, 8 nodes, no DPU offload

  6. OSU_ialltoall 32 PPN, 8 nodes, DPU offload

Sample script: RUN-osu.slurm

#!/bin/bash -l
#SBATCH -p thor
#SBATCH --nodes=16
#SBATCH -J osu
#SBATCH --time=15:00
#SBATCH --exclusive

module load gcc/8.3.1 mvapich2-dpu/2021.08

# Build the per-rank host list and the unique BlueField-2 (DPU) node list from the allocation
srun -l hostname -s | awk '{print $2}' | grep -v bf | sort > hostfile
srun -l hostname -s | awk '{print $2}' | grep bf | sort | uniq > dpufile
NPROC=$(cat hostfile | wc -l)

EXE=$MVAPICH2_DPU_DIR/libexec/osu-micro-benchmarks/osu_ialltoall
# No DPU offload
mpirun_rsh -np $NPROC -hostfile hostfile MV2_USE_DPU=0 $EXE
# DPU offload
mpirun_rsh -np $NPROC -hostfile hostfile -dpufile dpufile $EXE

Please note that we are running processes only on the hosts; MVAPICH2-DPU will take care of the DPU offloading.
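The sample script only runs osu_ialltoall; the blocking osu_alltoall baseline (items 1 and 2 above) can be launched with the same hostfile. A minimal sketch, assuming osu_alltoall is installed alongside osu_ialltoall in the same libexec directory:

# Blocking baseline: osu_alltoall on the hosts only (DPU offload does not apply to the blocking call)
EXE=$MVAPICH2_DPU_DIR/libexec/osu-micro-benchmarks/osu_alltoall
mpirun_rsh -np $NPROC -hostfile hostfile $EXE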

Job submission:

sbatch -N 16 -w thor0[25-32],thor-bf[25-32] --ntasks-per-node=8 RUN-osu.slurm
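The listed configurations differ only in the number of processes per node, which is set by --ntasks-per-node at submission time; for example, the 32 PPN runs could be submitted as:

sbatch -N 16 -w thor0[25-32],thor-bf[25-32] --ntasks-per-node=32 RUN-osu.slurm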

Question: What is the minimum message size at which the MPI library performs DPU offload for the MPI_Ialltoall call?
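One way to narrow the threshold down is to restrict the benchmark to a message-size window and compare the offloaded and non-offloaded output side by side. A sketch, reusing $EXE, $NPROC, hostfile and dpufile from the script above and assuming the standard OSU -m min:max message-size option:

# Compare osu_ialltoall with and without offload over a message-size window
mpirun_rsh -np $NPROC -hostfile hostfile MV2_USE_DPU=0 $EXE -m 1024:1048576 > ialltoall-no-offload.out
mpirun_rsh -np $NPROC -hostfile hostfile -dpufile dpufile $EXE -m 1024:1048576 > ialltoall-offload.out
paste ialltoall-no-offload.out ialltoall-offload.out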

1. Task “Modifying the given application codebase to leverage NVIDIA DPU”

[contribution to total score 30%] – Modify the code to use a non-blocking strategy (call MPI_Ialltoall instead of MPI_Alltoall), and run the modified xcompact3d application in DPU offload mode.

Application source code: https://github.com/xcompact3d/Incompact3d (tag: v4.0)
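A minimal build-and-run sketch, assuming the v4.0 build picks up the MPI wrappers from the loaded mvapich2-dpu module and that hostfile/dpufile are generated as in RUN-osu.slurm; adjust the make invocation, binary name and input argument to whatever the tag actually provides:

git clone --branch v4.0 https://github.com/xcompact3d/Incompact3d.git
cd Incompact3d
module load gcc/8.3.1 mvapich2-dpu/2021.08
make      # assumption: the tag's Makefile builds with the MPI Fortran wrapper from the module

# Run the modified binary in DPU offload mode; drop "-dpufile dpufile" for the non-offloaded reference run
mpirun_rsh -np $NPROC -hostfile hostfile -dpufile dpufile ./xcompact3d input.i3d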

Submission criteria:

What is allowed?

What is NOT allowed?

Suggestions:

2. Task “Performance assessment of original versus modified application”

[contribution to total score 30%] – Run the original and the modified xcompact3d application using the cylinder input case (/global/home/groups/isc_scc/coding-challenge/input.i3d). You are not allowed to change the problem size, but you should adjust the “Domain decomposition” in the input file. Obtain performance measurements using 8 nodes with and without the DPU adapter (note: the Thor servers are equipped with two adapters, ConnectX-6 and BlueField-2; mlx5_2 should be used on the host), and make sure to vary the PPN (4, 8, 16, 32). Run an MPI profiler (mpiP or IPM) to understand whether MPI overlap is happening and how the parallel behaviour of the application has changed.
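A sketch of one measurement point (8 nodes, 16 PPN, offload enabled) is shown below. MV2_IBA_HCA is the usual MVAPICH2 parameter for pinning host traffic to a specific adapter; the libmpiP.so path and the xcompact3d binary name are placeholders to adapt to the actual installation:

# 8 nodes x 16 PPN = 128 ranks, host traffic on mlx5_2, mpiP attached via LD_PRELOAD
mpirun_rsh -np 128 -hostfile hostfile -dpufile dpufile \
    MV2_IBA_HCA=mlx5_2 LD_PRELOAD=/path/to/libmpiP.so ./xcompact3d input.i3d
# For the no-offload comparison, drop "-dpufile dpufile" and add MV2_USE_DPU=0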

Submission criteria:

3. Task “Summarize all your findings and results”

[contribution to total score 20%] – Submit a report on what you managed to achieve and what you learned. Include a description of the code changes (better if also done as comments in the modified code) and all meaningful comparisons of results with and without DPU offload. Elaborate on why you did (or did not) get performance improvements.

Submission criteria:

4. Bonus Task “Explore different inputs and configurations”

[contribution up to 20%] – Modify the input (or use new ones) to create larger or smaller problems, change the number of iterations, or change the node count, and compare with and without offload.
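For example, the grid size and the number of time steps live in the &BasicParam block of input.i3d; the parameter names below (nx, ilast) are assumptions about the xcompact3d input format, so verify them against the actual file before editing:

# Copy the cylinder case and adjust it (parameter names assumed; check input.i3d)
cp /global/home/groups/isc_scc/coding-challenge/input.i3d my-case.i3d
sed -i 's/^nx *=.*/nx = 481/' my-case.i3d        # example: finer grid in x (assumed parameter name)
sed -i 's/^ilast *=.*/ilast = 200/' my-case.i3d  # example: different number of time steps (assumed parameter name)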

Submission criteria:

References