Getting Started with HPC-AI AC Clusters
This post will help you get started with the clusters at the HPC-AI Advisory Council cluster center. We use the helios cluster in this document.
Once you have received your username, log in to the cluster:
$ ssh <userid>@gw.hpcadvisorycouncil.com
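Optionally, add an entry to your local ~/.ssh/config so the gateway can be reached with a short alias; the alias name "hpcac" below is just an example:

Host hpcac
    HostName gw.hpcadvisorycouncil.com
    User <userid>

After that, "ssh hpcac" is enough to log in.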
To check available helios nodes using Slurm commands
$ sinfo -p helios
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
helios up infinite 4 alloc helios[011-012],heliosbf2a[011-012]
helios up infinite 76 idle helios[001-010,013-032],heliosbf2a[001-012,013-016]
$ squeue -p helios
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
494552 helios interact ... R 25:51 4 helios[011-012],heliosbf2a[011-012]
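If you need more detail on a particular node, the standard Slurm queries work as usual; the node name below is only an illustration:

$ sinfo -N -l -p helios          # one line per node, with state details
$ scontrol show node helios001   # full attributes of a single node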
To allocate nodes interactively
Please avoid allocating nodes interactively if possible, or keep the time limit short, because the resources are shared among multiple users.
# CPU nodes only
$ salloc -N 2 -p helios --time=1:00:00 -w helios001,helios002
# CPU and BlueField nodes
$ salloc -N 4 -p helios --time=1:00:00 -w helios00[1-2],heliosbf2a00[1-2]
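Once the allocation is granted, run commands on the allocated nodes with srun from the shell that salloc opens, and release the nodes as soon as you are done, for example:

$ srun -N 2 hostname   # quick check that the allocated nodes respond
$ exit                 # releases the allocation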
To submit a batch job
# CPU nodes only
$ sbatch -N 4 -p helios --time=1:00:00 -w helios00[1-4] <slurm script>
# CPU and BlueField nodes
$ sbatch -N 4 -p helios --time=1:00:00 -w helios00[1-2],heliosbf2a00[1-2] <slurm script>
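For reference, a minimal sketch of what <slurm script> could look like; the job name, module choices, working directory, and application line are placeholders to adapt to your own job:

#!/bin/bash
#SBATCH -p helios
#SBATCH -N 4
#SBATCH --time=1:00:00
#SBATCH -J myjob
#SBATCH -o %x-%j.out

module load gcc/8.3.1
module load hpcx/2.12.0

# Run from the scratch filesystem (see the Storage section below)
cd /global/scratch/users/$USER/myjob
mpirun ./my_app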
Note: the helios cluster hosts have NVIDIA BlueField-2 cards with Arm processors on them. These adapters also appear in Slurm as “nodes”, named heliosbf2a0[01-16], while the hosts themselves are named helios[001-032].
Storage
When you log in, you start in your $HOME directory. There is also additional scratch space:
NFS home -> /global/home/users/$USER/
Lustre scratch -> /global/scratch/users/$USER/
Please run jobs from the scratch space. It is a Lustre filesystem mounted over InfiniBand on every compute node.
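A typical pattern, for illustration (the directory name run01 is arbitrary):

$ mkdir -p /global/scratch/users/$USER/run01
$ cd /global/scratch/users/$USER/run01
$ sbatch <slurm script>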
Basic environment
Depending on the order in which modules are loaded and unloaded, additional dependent modules become visible to the user. Remember to load the compilers first.
Loading HPC-X with the Intel 2022 compiler:
module load intel/2022.1.2
module load compiler/2022.1.0
module load mkl/2022.0.2
module load hpcx/2.12.0
# To find modules for tools and libraries that depend on HPC-X
module avail | grep hpcx
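A quick sanity check that the expected toolchain is now in the path (the exact output depends on the installed versions):

$ module list          # lists the modules loaded above
$ which mpicc mpirun   # should resolve to the HPC-X installation
$ mpirun --version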
Loading HPC-X with the GNU compiler:
module load gcc/8.3.1
module load hpcx/2.12.0
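As with the Intel stack, you can verify which compiler the MPI wrappers will call; -show is an Open MPI wrapper option, and HPC-X is based on Open MPI:

$ which gcc mpicc
$ mpicc -show   # prints the underlying gcc compile/link command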
Running OSU latency using HPC-X
$ mpirun -np 2 -H host1,host2 -map-by node -mca coll_hcoll_enable 0 -x UCX_NET_DEVICES=mlx5_0:1 osu_latency -i 10000 -x 10000
# OSU MPI Latency Test v5.8
# Size Latency (us)
0 0.91
1 0.92
2 0.91
4 0.92
8 0.92
16 0.92
32 0.94
64 1.01
128 1.05
256 1.27
512 1.36
1024 1.50
2048 2.07
4096 2.85
8192 4.44
16384 5.88
32768 8.17
65536 12.06
131072 18.61
262144 18.54
524288 29.06
1048576 50.34
2097152 95.72
4194304 183.98
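The same benchmark can also be submitted as a batch job instead of an interactive run; a minimal sketch, assuming osu_latency is in your PATH once the modules are loaded (adjust the module lines and the benchmark location to your setup):

#!/bin/bash
#SBATCH -p helios
#SBATCH -N 2
#SBATCH --ntasks-per-node=1
#SBATCH --time=0:10:00
#SBATCH -o osu_latency-%j.out

module load gcc/8.3.1
module load hpcx/2.12.0

# Same flags as the interactive run; Slurm supplies the host list, so -H is not needed
mpirun -np 2 -map-by node -mca coll_hcoll_enable 0 \
       -x UCX_NET_DEVICES=mlx5_0:1 osu_latency -i 10000 -x 10000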