This post will help you get started with the clusters at the HPCAC-AI cluster center. We use the helios cluster throughout this document.
Once you have received your username, log in to the clusters:
$ ssh <userid>@gw.hpcadvisorycouncil.com
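If you log in often, an entry in your SSH client configuration saves retyping the gateway address. A minimal sketch (the hpcac alias is illustrative; <userid> stays your assigned username):

```
# ~/.ssh/config -- the "hpcac" alias is illustrative
Host hpcac
    HostName gw.hpcadvisorycouncil.com
    User <userid>
```

After that, `ssh hpcac` is equivalent to the full command above.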
To check the available helios nodes, use the Slurm commands:
$ sinfo -p helios
PARTITION AVAIL  TIMELIMIT NODES STATE NODELIST
helios    up     infinite  4     alloc helios[011-012],heliosbf2a[011-012]
helios    up     infinite  76    idle  helios[001-010,013-032],heliosbf2a[001-012,013-016]

$ squeue -p helios
JOBID   PARTITION NAME     USER ST TIME  NODES NODELIST(REASON)
494552  helios    interact ...  R  25:51 4     helios[011-012],heliosbf2a[011-012]
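For scripting, you can pull numbers out of saved sinfo output with awk. A minimal sketch (the idle_nodes helper name is illustrative, and the column positions assume sinfo's default layout shown above):

```shell
# idle_nodes: sum the NODES column (4th field) for rows whose STATE column
# (5th field) is "idle", reading `sinfo -p helios` output on stdin.
# The helper name is illustrative; columns assume sinfo's default layout.
idle_nodes() {
  awk '$5 == "idle" { n += $4 } END { print n+0 }'
}

# Usage: sinfo -p helios | idle_nodes
```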
Please avoid allocating nodes interactively if possible, or set a short time limit, because the resources are shared among multiple users.
# CPU nodes only
$ salloc -N 2 -p helios --time=1:00:00 -w helios001,helios002

# CPU and BlueField nodes
$ salloc -N 4 -p helios --time=1:00:00 -w helios00[1-2],heliosbf2a00[1-2]
# CPU nodes only
$ sbatch -N 4 -p helios --time=1:00:00 -w helios00[1-4] <slurm script>

# CPU and BlueField nodes
$ sbatch -N 4 -p helios --time=1:00:00 -w helios00[1-2],heliosbf2a00[1-2] <slurm script>
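A minimal batch script for the sbatch lines above might look like the following sketch (the job name, output pattern, and the `srun hostname` step are illustrative placeholders for your real workload):

```
#!/bin/bash
#SBATCH -J mytest            # job name (illustrative)
#SBATCH -o slurm-%j.out      # %j expands to the job ID

# Your workload goes here; srun launches it on the allocated nodes.
srun hostname
```

Partition, node count, time limit, and node list are supplied on the sbatch command line as shown above, so they are omitted from the script here.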
Note: the helios cluster has NVIDIA BlueField-2 cards with Arm processors on them. These adapters also appear in Slurm as "nodes" named heliosbf2a[001-016], while the hosts are named helios[001-032].
When you log in, you are in $HOME. There is also extra scratch space:

NFS home -> /global/home/users/$USER/
Lustre   -> /global/scratch/users/$USER/

Please run jobs from the scratch partition. It is a Lustre filesystem and is mounted over InfiniBand on every compute node.
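A small helper can standardize creating and entering a run directory under scratch. A sketch (the stage_run helper name and the myjob directory are illustrative):

```shell
# stage_run: create (if needed) and enter a run directory under a given
# scratch root. The helper name is illustrative.
stage_run() {
  mkdir -p "$1/$2" && cd "$1/$2"
}

# Typical use on helios, with the scratch path from above:
# stage_run /global/scratch/users/$USER myjob
```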
Depending on the order of module load/unload operations, additional modules become available to the user. Remember to load compilers first.
Loading HPC-X with the Intel 2022 compiler:
module load intel/2022.1.2
module load compiler/2022.1.0
module load mkl/2022.0.2
module load hpcx/2.12.0

# To find modules for tools and libraries that depend on HPC-X:
module available | grep hpcx
Loading HPC-X with the GNU compiler:
module load gcc/8.3.1
module load hpcx/2.12.0
Running an OSU latency test between two nodes:

$ mpirun -np 2 -H host1,host2 -map-by node -mca coll_hcoll_enable 0 \
    -x UCX_NET_DEVICES=mlx5_0:1 osu_latency -i 10000 -x 10000

# OSU MPI Latency Test v5.8
# Size      Latency (us)
0           0.91
1           0.92
2           0.91
4           0.92
8           0.92
16          0.92
32          0.94
64          1.01
128         1.05
256         1.27
512         1.36
1024        1.50
2048        2.07
4096        2.85
8192        4.44
16384       5.88
32768       8.17
65536       12.06
131072      18.61
262144      18.54
524288      29.06
1048576     50.34
2097152     95.72
4194304     183.98
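If you save the benchmark output to a file, a small helper makes it easy to pull out the latency for one message size. A sketch (the osu_pick helper name is illustrative):

```shell
# osu_pick: print the latency column for a given message size from
# osu_latency output read on stdin. Comment lines starting with "#" are
# skipped automatically because their first field never matches a size.
# The helper name is illustrative.
osu_pick() {
  awk -v size="$1" '$1 == size { print $2 }'
}

# Usage: mpirun ... osu_latency | tee osu.out
#        osu_pick 8192 < osu.out
```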