...

Code Block
$ sinfo -p gh
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
gh         up   infinite      8   idle  gh[001-008]

...

Running Jobs

Slurm is the system job scheduler. Each job has a maximum walltime of 12 hours, and nodes are allocated in exclusive mode by default (a user always allocates a full node; no sharing). The GPU is always visible once a job has been allocated a node, so there is no need to use any GRES options.
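
As a quick check (a minimal sketch based on the gh partition shown above; adjust the time limit as needed), you can allocate a node interactively and confirm that the GPU is visible without any --gres option:

Code Block
salloc -n 1 -N 1 -p gh -t 0:10:00
srun nvidia-smi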

...

Code Block
#!/bin/bash -l
#SBATCH --ntasks=64
#SBATCH --ntasks-per-node=16
#SBATCH --cpus-per-task=9
#SBATCH --nodes=4
#SBATCH --partition=gg
#SBATCH --time=1:00:00
#SBATCH --exclusive

. /global/scratch/groups/gh/bootstrap-gh-env.sh
module purge
module load openmpi/4.1.6-gcc-12.3.0-wftkmyd

export OMP_NUM_THREADS=9
mpirun -np 64 --map-by ppr:16:node:PE=9 \
   --report-bindings uname -a
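
Assuming the script above is saved as, for example, job.sh (the file name is illustrative), it can be submitted and monitored with the standard Slurm commands:

Code Block
sbatch job.sh
squeue -u $USER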

Working with Singularity containers

Singularity is the only container engine available at the moment. Docker or enroot workflows need to be adapted to run (as an unprivileged user) on Thea.

Example 1: Run pre-staged Singularity containers interactively

(1) Allocate an interactive node

Code Block
salloc -n 1 -N 1 -p gh -t 1:00:00

(2) Select a container and invoke singularity run

Code Block
export CONT="/global/scratch/groups/gh/sif_images/pytorch-23.12-py3.sif"
singularity run --nv "${CONT}"

NOTE - Accessing a SIF container is usually fast enough even when the file is located on the Lustre filesystem. Copying it to /local improves the bootstrap time only marginally.
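
If you nevertheless want to stage the image on node-local storage first, a minimal sketch (the /local destination follows the conventions used in the examples below) is:

Code Block
cp "${CONT}" /local/
export CONT="/local/pytorch-23.12-py3.sif"
singularity run --nv "${CONT}"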

Example 2: Run a script inside a pre-staged Singularity container via srun

Code Block
export CONT="/global/scratch/groups/gh/sif_images/pytorch-23.12-py3.sif"
srun --mpi=pmi2 -N 1 -n 1 --ntasks-per-node=1 -p gh -t 4:00:00 \
    singularity -v run --nv "${CONT}" python my_benchmark_script.sh

NOTE - The current working directory where srun and singularity are executed is automatically exposed inside the container.
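
If additional host directories are needed inside the container, they can be bind-mounted explicitly with the --bind option (a sketch; the source and destination paths are illustrative):

Code Block
singularity run --nv --bind /global/scratch/users/$USER:/scratch "${CONT}" \
    python my_benchmark_script.sh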

Example 3: How to pull an NGC container and squash it into a new read-only Singularity image

TIP - Building a container is a very I/O-intensive operation, so it is better to use /local when possible. Remember to copy your SIF image or sandbox folder back to ${SCRATCH} before the job completes, otherwise all files are lost.

1. Allocate an interactive node

Code Block
salloc -n 1 -N 1 -p gh -t 1:00:00

2. Set additional env variables

Make sure singularity pull operates entirely from /local, both for performance reasons and because of capacity constraints.

Code Block
mkdir /local/tmp_singularity
mkdir /local/tmp_singularity_cache
export APPTAINER_TMPDIR=/local/tmp_singularity
export APPTAINER_CACHEDIR=/local/tmp_singularity_cache

3. Pull the Singularity image locally

Code Block
singularity pull pytorch-23.12-py3.sif docker://nvcr.io/nvidia/pytorch:23.12-py3
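
As noted in the TIP above, copy the resulting SIF image back to persistent storage before the job ends (a sketch; the destination directory is illustrative):

Code Block
cp pytorch-23.12-py3.sif /global/scratch/users/$USER/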

Example 4: How to create a Singularity Sandbox and run / repackage a new container image

1. Grab one node in interactive mode

Code Block
salloc -n 1 -N 1 -p gh -t 2:00:00

2. Identify which container to extend via a sandbox and prep the environment

Code Block
export CONT_DIR=/global/scratch/groups/gh/sif_images
export CONT_NAME="pytorch-23.12-py3.sif"
mkdir /local/$SLURM_JOBID
export APPTAINER_TMPDIR=/local/$SLURM_JOBID/_tmp_singularity
export APPTAINER_CACHEDIR=/local/$SLURM_JOBID/_cache_singularity
rm -rf ${APPTAINER_TMPDIR} && mkdir -p ${APPTAINER_TMPDIR}
rm -rf ${APPTAINER_CACHEDIR} && mkdir -p ${APPTAINER_CACHEDIR}

3. Make a copy of the base container, as reading and verifying it is faster on local disk

Code Block
cp ${CONT_DIR}/${CONT_NAME} /local/$SLURM_JOBID/

4. Create a Singularity definition file

Start with the original NGC container as the base image and add extra packages in the %post phase.

Code Block
cat > custom-pytorch.def << EOF
Bootstrap: localimage
From: /local/${SLURM_JOBID}/${CONT_NAME}
 
%post
    apt-get update
    apt-get -y install python3-venv
    pip install --upgrade pip
    pip install transformers accelerate huggingface_hub
EOF

After this there are two options:

5A. Create the sandbox on persistent storage

TIP - Use this method if you want to customise your image by building software manually or debugging a failing pip command.

Code Block
cd /global/scratch/users/$USER
singularity build --sandbox custom-python-sandbox custom-pytorch.def

When completed, open a writable shell in the sandbox on an interactive node via

Code Block
singularity shell --nv --writable custom-python-sandbox
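
Once the sandbox has been customised, it can be repackaged into a read-only SIF image (a sketch; the image name mirrors the one used in option 5B):

Code Block
singularity build custom-python.sif custom-python-sandbox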

5B. Create a new SIF image

TIP - Use this method if you want to create a read-only image to run workloads and you are confident all %post steps can run successfully without manual intervention.

Code Block
cd /global/scratch/users/$USER
singularity build custom-python.sif custom-pytorch.def

When completed, run on an interactive node via

Code Block
singularity run --nv custom-python.sif

Storage

When you log in, you start in your $HOME directory. Additional scratch space is also available.
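
The user scratch area used in the examples above can be reached at a path such as (illustrative, following the paths used elsewhere on this page):

Code Block
cd /global/scratch/users/$USER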

...