Coding Challenge for ISC22 SCC
For the ISC22 Student Cluster Competition, we will have a coding challenge for the participating teams!
Cluster and Software to be used
The teams will be using 8 nodes of the HPC-AI Advisory Council Thor cluster, equipped with NVIDIA BlueField-2 Data Processing Units (DPUs).
The MPI library to be used is MVAPICH-DPU MPI (only).
The teams should use SLURM to submit their jobs and share the cluster with others.
The Mission
Alter the Xcompact3d code to support NVIDIA BlueField-2 DPU with non-blocking collective offload.
Try to gain the best performance by overlapping communication and computation.
Getting Started
Each team will get access to the HPC-AI Advisory Council Thor cluster.
Once the team has access to the login node, use SLURM to allocate Thor nodes; see Getting Started with HPC-AI AC Clusters for an example.
Each team will get a license file from X-ScaleSolutions to be able to use the MVAPICH-DPU software. You will also receive from X-ScaleSolutions the user manual and examples of how to use the MPI library.
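As an illustration, a first session could look like the sketch below (the module names are copied from the OSU sample script later in this page, and mpiname is the version-reporting tool shipped with MVAPICH2; adjust both to what is actually installed on the cluster):
salloc -p thor -N 2 --time=30:00            # interactive allocation on the thor partition
module load gcc/8.3.1 mvapich2-dpu/2021.08
mpiname -a                                  # verify that the MVAPICH2-DPU build is picked up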
Coding Challenge Overview
Slides:
Tasks and Submissions
Input file: can be downloaded here.
0. Task “Getting familiar with the BlueField-2 DPU”
[practice task, no grading] – Run host-based OSU_alltoall and OSU_ialltoall over 8 nodes on Thor, using the BlueField-2 adapter. For this test use MVAPICH-DPU MPI to get a baseline.
OSU_alltoall 8 PPN, 8 nodes
OSU_alltoall 32 PPN, 8 nodes
OSU_ialltoall 8 PPN, 8 nodes, no DPU offload
OSU_ialltoall 8 PPN, 8 nodes, DPU offload
OSU_ialltoall 32 PPN, 8 nodes, no DPU offload
OSU_ialltoall 32 PPN, 8 nodes, DPU offload
Sample script: RUN-osu.slurm
#!/bin/bash -l
#SBATCH -p thor
#SBATCH --nodes=16
#SBATCH -J osu
#SBATCH --time=15:00
#SBATCH --exclusive
module load gcc/8.3.1 mvapich2-dpu/2021.08
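# Build a hostfile listing the x86 hosts (one entry per task, i.e. PPN entries per node)
# and a dpufile listing each BlueField node once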
srun -l hostname -s | awk '{print $2}' | grep -v bf | sort > hostfile
srun -l hostname -s | awk '{print $2}' | grep bf | sort | uniq > dpufile
NPROC=$(cat hostfile | wc -l)
EXE=$MVAPICH2_DPU_DIR/libexec/osu-micro-benchmarks/osu_ialltoall
# No DPU offload
mpirun_rsh -np $NPROC -hostfile hostfile MV2_USE_DPU=0 $EXE
# DPU offload
mpirun_rsh -np $NPROC -hostfile hostfile -dpufile dpufile $EXE
Please note that we are running processes only on the hosts; MVAPICH will take care of the DPU offloading.
Job submission:
sbatch -N 16 -w thor0[25-32],thor-bf[25-32] --ntasks-per-node=8 RUN-osu.slurm
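The command above submits one PPN value; the different PPN values listed in this task can be submitted with a simple sweep, as in the sketch below (node list copied from the example above; adjust it and the task counts to your reservation):
for ppn in 8 32; do
    # note: check that the BlueField nodes accept the higher tasks-per-node request
    sbatch -N 16 -w thor0[25-32],thor-bf[25-32] --ntasks-per-node=$ppn RUN-osu.slurm
done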
Question: What is the minimum message size at which the MPI library performs DPU offload for the MPI_Ialltoall call?
1. Task “Modifying the given application codebase to leverage NVIDIA DPU”
[contribution to total score 30%] – Modify the code to use a non-blocking strategy (call MPI_Ialltoall instead of MPI_Alltoall), and run the modified xcompact3d application using DPU offload mode.
Application source code: https://github.com/xcompact3d/Incompact3d (tag: v4.0)
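One possible way to build the v4.0 tag against MVAPICH-DPU is sketched below; it assumes a Makefile-based build and the module names from the OSU sample script, so check the repository's own build instructions and the X-ScaleSolutions user manual for the authoritative steps:
module load gcc/8.3.1 mvapich2-dpu/2021.08
git clone https://github.com/xcompact3d/Incompact3d.git
cd Incompact3d
git checkout v4.0
# Point the build at the MVAPICH-DPU Fortran wrapper; the variable that selects
# the compiler may be named differently in the v4.0 Makefile
make FC=mpif90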
Submission criteria:
Submit the entire modified code plus building and submission scripts to Filippo Spiga (zip file)
Bonus points for readability and level of comments
What is allowed?
Modify any portion of the source code as long as the results are correct and reproducible
What is NOT allowed?
Artificially reducing the I/O
Changing the physics of the input files
Changing the version of the code
Suggestions:
The 2decomp sub-folder contains a library that performs the transpositions (files transpose_*_to_*) of the decomposed domain
Focus on changing the main application files, replacing relevant and expensive transpose calls (transpose_*_to_*) with an appropriate non-blocking pair (transpose_*_to_*_start and transpose_*_to_*_wait); a sketch for locating these call sites follows
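To locate the call sites to replace, a simple search from the repository root is usually enough (a sketch, assuming Fortran sources with a .f90 extension):
grep -rn --include='*.f90' 'transpose_._to_.' .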
2. Task “Performance assessment of original versus modified application”
[contribution to total score 30%] – Run the original and the modified xcompact3d application using the cylinder input case (/global/home/groups/isc_scc/coding-challenge/input.i3d). You are not allowed to change the problem size, but you should adjust the “Domain decomposition” settings in the input file. Obtain performance measurements using 8 nodes with and without the DPU adapter (note: the Thor servers are equipped with two adapters, ConnectX-6 and BlueField-2; mlx5_2 should be used on the host), making sure to vary the PPN (4, 8, 16, 32). Run an MPI profiler (mpiP or IPM) to understand whether MPI overlap is happening and how the parallel behaviour of the application has changed.
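A run script for this task can mirror the OSU sample above. The sketch below is illustrative only: RUN-x3d.slurm, the executable/input paths and the mpiP library location are placeholders, and the MV2_USE_DPU / MV2_IBA_HCA runtime parameters should be checked against the MVAPICH-DPU user manual.
Sample script (hypothetical): RUN-x3d.slurm
#!/bin/bash -l
#SBATCH -p thor
#SBATCH --nodes=16
#SBATCH -J x3d
#SBATCH --time=60:00
#SBATCH --exclusive
module load gcc/8.3.1 mvapich2-dpu/2021.08
srun -l hostname -s | awk '{print $2}' | grep -v bf | sort > hostfile
srun -l hostname -s | awk '{print $2}' | grep bf | sort | uniq > dpufile
NPROC=$(cat hostfile | wc -l)
EXE=./xcompact3d                 # original or modified binary (placeholder path)
INPUT=/global/home/groups/isc_scc/coding-challenge/input.i3d
# If your build expects input.i3d in the working directory instead of an argument, copy it there
# Baseline: no DPU offload, host HCA pinned to mlx5_2 (assumption: MV2_IBA_HCA selects the adapter)
mpirun_rsh -np $NPROC -hostfile hostfile MV2_USE_DPU=0 MV2_IBA_HCA=mlx5_2 \
    LD_PRELOAD=/path/to/libmpiP.so $EXE $INPUT
# DPU offload
mpirun_rsh -np $NPROC -hostfile hostfile -dpufile dpufile MV2_IBA_HCA=mlx5_2 \
    LD_PRELOAD=/path/to/libmpiP.so $EXE $INPUT
Job submission (vary --ntasks-per-node to cover the requested PPN values):
sbatch -N 16 -w thor0[25-32],thor-bf[25-32] --ntasks-per-node=8 RUN-x3d.slurm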
Submission criteria:
Submit all building scripts and outputs
Submit baseline scaling results (graph) using the unmodified application
Submit scaling results (graph) using the modified application
What is the message size used for MPI_Alltoall / MPI_Ialltoall? How is the message size related to the performance improvements?
The modified code must be correct and return exactly the same results for any given input and set of execution parameters (number of MPI processes, number of MPI processes per node, grid size, grid decomposition)
3. Task “Summarize all your findings and results”
[contribution to total score 20%] – Submit a report of what you managed to achieve and learned; include a description of the code changes (better if also documented as comments in the modified code) and all meaningful comparisons of results with and without DPU offload. Elaborate on why you did (or did not) get performance improvements.
Submission criteria:
Report - a document or slides that explain what you did and what was learned
Tables, graphs and MPI traces are welcome, alongside a clear description of what has been done
Try to clearly highlight the contribution made by each team member to each task
Performance improvement of your modified code vs. the original code provided
A ranking will be created listing all teams who successfully submitted a working code
4. Bonus Task “Explore different inputs and configurations”
[contribution up to 20%] – Modify the input (or use new ones) to create larger or smaller problems, change the number of iterations, or change the node count, and compare with and without offload.
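For the node-count part, a sweep can be submitted along the lines of the sketch below (RUN-x3d.slurm is the hypothetical job script from Task 2; the node lists are illustrative and each host node must be paired with its BlueField node):
for last in 26 28 32; do      # 2, 4 and 8 host nodes plus their BlueField counterparts
    sbatch -N $(( 2*(last - 24) )) -w "thor0[25-${last}],thor-bf[25-${last}]" \
        --ntasks-per-node=8 RUN-x3d.slurm
done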
Submission criteria:
Submit all building scripts and outputs
Submit baseline scaling results (graph) using the unmodified application
Submit scaling results (graph) using the modified application
What is the message size used for MPI_Alltoall / MPI_Ialltoall? How is the message size related to the performance improvements?
Wrap up your findings in a separate annex of the main report
The modified code must be correct and return exactly the same results for any given input and set of execution parameters (number of MPI processes, number of MPI processes per node, grid size, grid decomposition)