This post will help you get started with the clusters in the HPC-AI Advisory Council cluster center. We use the helios cluster in this document.

Once you have received your username, log in to the clusters:

$ ssh <userid>@gw.hpcadvisorycouncil.com
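
If you log in often, an entry in the ~/.ssh/config on your local machine can save some typing (a minimal sketch, assuming a standard OpenSSH client; the alias name "hpcac" is just an example):

# ~/.ssh/config on your local machine
Host hpcac
    HostName gw.hpcadvisorycouncil.com
    User <userid>

# then connect with
$ ssh hpcac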

To check the available helios nodes using Slurm

$ sinfo -p helios
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
helios         up   infinite      4  alloc helios[011-012],heliosbf2a[011-012]
helios         up   infinite     76   idle helios[001-010,013-032],heliosbf2a[001-012,013-016]
$ squeue -p helios
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
494552   helios    interact   ...    R      25:51      4 helios[011-012],heliosbf2a[011-012]
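
For more detail on a particular node (CPUs, memory, state, and reason), you can query Slurm directly, for example:

$ scontrol show node helios001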

To allocate nodes interactively

Please avoid allocating nodes interactively if possible, or keep the time limit short, because the resources are shared among multiple users.

# CPU nodes only 
$ salloc -N 2 -p helios --time=1:00:00 -w helios001,helios002
# CPU and BlueField nodes
$ salloc -N 4 -p helios --time=1:00:00 -w helios00[1-2],heliosbf2a00[1-2]
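
Once the allocation is granted, commands can be launched on the allocated nodes from the same shell with srun, for example:

# prints the hostname of every node in the current allocation
$ srun hostname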

To submit a batch job

# CPU nodes only
$ sbatch -N 4 -p helios --time=1:00:00 -w helios00[1-4] <slurm script>
# CPU and BlueField nodes
$ sbatch -N 4 -p helios --time=1:00:00 -w helios00[1-2],heliosbf2a00[1-2] <slurm script>
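
For reference, here is a minimal sketch of what <slurm script> could contain; the job name, output file, and module versions are only examples, adjust them to your workload (node count and partition come from the sbatch command line above):

#!/bin/bash
#SBATCH --job-name=example       # example job name
#SBATCH --output=%x-%j.out       # stdout/stderr, named after job name and job id
#SBATCH --time=1:00:00

# Load the toolchain (see the Basic environment section below)
module load gcc/8.3.1
module load hpcx/2.12.0

# Launch across the allocated nodes
mpirun hostname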

Note: The helios cluster has NVIDIA BlueField-2 cards with Arm processors on them. These adapters also appear in Slurm as "nodes" named heliosbf2a[001-016], while the hosts are named helios[001-032].
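
A quick way to tell which kind of "node" you are on is to check the processor architecture; the BlueField-2 Arm cores report aarch64, while the hosts report the host CPU architecture. For example, inside an allocation that contains both:

$ srun uname -m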

Storage

When you log in, you start in your $HOME directory. There is also extra scratch space:
NFS home       -> /global/home/users/$USER/
Lustre scratch -> /global/scratch/users/$USER/
Please run jobs from the scratch space. It is a Lustre filesystem mounted over InfiniBand on every compute node.
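
A common pattern is to create a per-job working directory under scratch and submit from there (the directory name is just an example):

$ mkdir -p /global/scratch/users/$USER/my_run
$ cd /global/scratch/users/$USER/my_run
$ sbatch -N 2 -p helios --time=1:00:00 <slurm script>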

Basic environment

Depending on the order in which modules are loaded and unloaded, additional modules become available to the user. Remember to load a compiler first.

# Intel compilers and MKL with HPC-X
module load intel/2022.1.2
module load compiler/2022.1.0
module load mkl/2022.0.2
module load hpcx/2.12.0

# To find modules for tools and libraries that depend on HPC-X:
module avail 2>&1 | grep hpcx

# GCC toolchain with HPC-X
module load gcc/8.3.1
module load hpcx/2.12.0
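
To verify what ended up in your environment, list the loaded modules and check which MPI launcher is first in your PATH:

$ module list
$ which mpirun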

Running OSU latency using HPC-X

$ mpirun -np 2 -H host1,host2  -map-by node -mca coll_hcoll_enable 0 -x UCX_NET_DEVICES=mlx5_0:1 osu_latency -i 10000 -x 10000
# OSU MPI Latency Test v5.8
# Size          Latency (us)
0                       0.91
1                       0.92
2                       0.91
4                       0.92
8                       0.92
16                      0.92
32                      0.94
64                      1.01
128                     1.05
256                     1.27
512                     1.36
1024                    1.50
2048                    2.07
4096                    2.85
8192                    4.44
16384                   5.88
32768                   8.17
65536                  12.06
131072                 18.61
262144                 18.54
524288                 29.06
1048576                50.34
2097152                95.72
4194304               183.98
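
The osu_latency binary used above is one of the prebuilt OSU micro-benchmarks shipped with HPC-X. The exact location depends on the installation, so the sketch below only assumes that the hpcx module exports its usual HPCX_* variables; check your own environment if the variable is named differently:

# list the HPC-X related variables set by the module
$ env | grep HPCX
# on most installations the benchmarks live under the directory in $HPCX_OSU_DIR
$ ls $HPCX_OSU_DIR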