Getting Started with Thea Clusters
Cluster access request
To request access to the Thea clusters, fill in this form.
Once you have a username and can access the login nodes, follow the example in Getting Started with HPC-AI AC Clusters to allocate GH nodes.
Connect to the lab
Once you have your username, log in to the clusters:
$ ssh <userid>@gw.hpcadvisorycouncil.com
To check the available GH nodes, use Slurm commands:
$ sinfo -p gh
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
gh up infinite 8 idle gh[001-008]
Running Jobs
Slurm is the system job scheduler. Each job has a maximum walltime of 12 hours, and nodes are allocated in exclusive mode by default (a user always allocates a full node, no sharing). The GPU is always visible once a job has allocated a node, so there is no need to use any gres options.
Please avoid allocating nodes interactively if possible, or keep the time limit short, because the resources are shared among multiple users.
Allocation Examples
How to allocate one GH200 node:
salloc -n 72 -N 1 -p gh -t 1:00:00
How to allocate two GH200 nodes:
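For example (a sketch following the single-node command above, assuming 72 cores per GH200 node):
salloc -n 144 -N 2 -p gh -t 1:00:00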
How to allocate one specific GH200 node:
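For example, to request node gh002 (the node name is illustrative):
salloc -n 72 -N 1 -p gh -w gh002 -t 1:00:00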
How to allocate two specific GH200 nodes:
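For example, to request gh001 and gh002 (node names are illustrative):
salloc -n 144 -N 2 -p gh -w gh[001-002] -t 1:00:00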
How to allocate 4 GH200 nodes but force to exclude a specific one (gh001):
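For example (a sketch using Slurm's --exclude option):
salloc -N 4 -p gh -x gh001 -t 1:00:00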
How to allocate one Grace-only node:
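For example (a sketch; the Grace-only partition name below is a placeholder, check sinfo for the actual one):
salloc -N 1 -p grace -t 1:00:00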
How to allocate four Grace-only nodes:
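For example (same placeholder partition name as above):
salloc -N 4 -p grace -t 1:00:00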
How to submit a batch job
Batch job
Example of a batch job script running on 2 GH200 nodes with 2 tasks per node via mpirun
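A minimal sketch of such a script; the application binary ./myapp is a placeholder, and the mapping flags below use Open MPI syntax (adjust for other MPI libraries):

#!/bin/bash
#SBATCH --job-name=gh-mpirun
#SBATCH --partition=gh
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --time=01:00:00

# 4 ranks in total, 2 per node; Open MPI picks up the Slurm allocation
mpirun -np 4 --map-by ppr:2:node ./myapp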
Example of a batch job script running on 2 GH200 nodes with 2 tasks per node via srun
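A minimal sketch of the srun variant; ./myapp is again a placeholder:

#!/bin/bash
#SBATCH --job-name=gh-srun
#SBATCH --partition=gh
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --time=01:00:00

# srun launches the tasks directly through Slurm
srun ./myapp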
Example of a batch job script running on 2 Grace-only nodes, MPI-only across all cores, via mpirun
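A minimal sketch; the partition name and the 144-cores-per-node count are assumptions to verify with sinfo, and ./myapp is a placeholder:

#!/bin/bash
#SBATCH --job-name=grace-mpi
#SBATCH --partition=grace
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=144
#SBATCH --time=01:00:00

# one MPI rank per core, 288 ranks in total
mpirun -np 288 ./myapp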
Example of a batch job script running on 4 Grace-only nodes with a hybrid MPI+OpenMP combination via mpirun
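A minimal sketch of a hybrid layout; the 8 ranks x 18 threads split assumes 144 cores per node, and both the partition name and ./myapp are placeholders:

#!/bin/bash
#SBATCH --job-name=grace-hybrid
#SBATCH --partition=grace
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=18
#SBATCH --time=01:00:00

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
# 32 ranks in total; the mapping/binding syntax below is Open MPI's, adjust for other MPI libraries
mpirun -np 32 --map-by ppr:8:node:PE=18 ./myapp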
Working with Singularity containers
Singularity is the only container engine available at the moment. Docker or enroot workflows need to be adapted to run (as a regular user) on Thea.
Example 1: Run pre-staged Singularity containers interactively
(1) Allocate an interactive node
(2) Select container and invoke singularity run
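For example (a sketch; the image path is illustrative and --nv makes the GPU visible inside the container):

salloc -n 72 -N 1 -p gh -t 1:00:00
# once on the allocated node, run the container
singularity run --nv ${SCRATCH}/containers/myapp.sif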
NOTE - Accessing a SIF container is usually fast enough even when the file is located on the Lustre filesystem. Copying it to /local will only marginally improve the bootstrap time.
Example 2: Run pre-staged Singularity containers via srun
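For example (a sketch; the image path and the command run inside the container are illustrative):

srun -N 1 -p gh -t 1:00:00 singularity exec --nv ${SCRATCH}/containers/myapp.sif nvidia-smi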
NOTE - The current working directory from which srun and singularity are executed is automatically exposed inside the container.
Example 3: How to squash and run an NGC container into a new read-only Singularity image
TIP - Building a container is a very I/O-intensive operation; it is better to leverage /local when possible, but remember to copy your SIF image or sandbox folder back to ${SCRATCH} before the job completes, otherwise all files are lost.
1. Allocate an interactive node
2. Set additional environment variables to make sure singularity pull operates entirely from /local, for performance reasons and capacity constraints
3. Pull the Singularity image locally
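A sketch of steps 2 and 3; SINGULARITY_TMPDIR and SINGULARITY_CACHEDIR are standard Singularity variables, while the /local/$USER layout and the NGC image tag are illustrative:

export SINGULARITY_TMPDIR=/local/$USER/tmp
export SINGULARITY_CACHEDIR=/local/$USER/cache
mkdir -p ${SINGULARITY_TMPDIR} ${SINGULARITY_CACHEDIR}

cd /local/$USER
singularity pull pytorch.sif docker://nvcr.io/nvidia/pytorch:24.01-py3

# copy the image back to persistent storage before the job ends
cp pytorch.sif ${SCRATCH}/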
Example 4: How to create a Singularity Sandbox and run / repackage a new container image
1. Grab one node in interactive mode
2. Identify which container to extend via a sandbox and prep the environment
3. Make a copy of the base container, since reading and verifying it is faster on local disk
4. Create a Singularity definition file
Start with the original NGC container as the base image and add extra packages in the %post phase.
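A sketch of such a definition file; the NGC base image and the packages added in %post are illustrative:

Bootstrap: docker
From: nvcr.io/nvidia/pytorch:24.01-py3

%post
    # extra packages layered on top of the NGC base image
    apt-get update && apt-get install -y --no-install-recommends htop
    pip install --no-cache-dir einops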
After this there are two options:
5A. Create the sandbox on the persistent storage
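For example (a sketch; mycontainer.def is the definition file from step 4 and the sandbox path is illustrative):

singularity build --sandbox ${SCRATCH}/mycontainer_sandbox mycontainer.def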
TIP - Use this method if you want to customise your image by building software manually or debugging a failing pip command.
When completed, run on an interactive node via
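For example (same illustrative sandbox path as above; --writable lets you modify the sandbox, e.g. to rerun a failing pip install):

singularity shell --writable ${SCRATCH}/mycontainer_sandbox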
5B. Create a new SIF image
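For example (a sketch; building on /local and copying the result back to ${SCRATCH}, with illustrative names):

singularity build /local/$USER/mycontainer.sif mycontainer.def
cp /local/$USER/mycontainer.sif ${SCRATCH}/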
TIP - Use this method if you want to create a read-only image to run workloads and you are confident all %post steps can run successfully without manual intervention.
When completed, run on an interactive node via
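For example (same illustrative image path; --nv enables GPU access):

singularity run --nv ${SCRATCH}/mycontainer.sif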
Storage
When you log in you start in your $HOME directory. There is also extra scratch space. Please run jobs from the scratch partition; it is a Lustre filesystem mounted over InfiniBand.
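For example (assuming the scratch area is exposed via the ${SCRATCH} variable used above, and my_job.sbatch is a placeholder script):

cd ${SCRATCH}
mkdir -p my_project && cd my_project
sbatch my_job.sbatch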