Thea - System Architecture
The Thea system is seamlessly integrated into the HPC Advisory Council infrastructure. There is a single entry point and a single Slurm instance; the multiple "clusters" are de facto separate partitions.
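In practice, jobs are directed to a specific cluster by selecting the corresponding Slurm partition. A minimal sketch, assuming the partition names gh and gg introduced in the Software section below:

# List the available partitions and their nodes
sinfo
# Submit a job script to the GH200 partition (use gg for the Grace-Grace nodes)
sbatch --partition=gh myjob.sh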
Hardware
8 x Quanta S74G-2U GH200
Grace CPU 72 cores @ 3.4GHz with 480 GB LPDDR5X
Hopper GPU with 96 GB HBM3
NDR 400Gb/s InfiniBand network, 1 adapter per node
8 x Supermicro ARS-221GL-NR
Dual Grace CPU 72 cores @ 3.4GHz with 480 GB LPDDR5X
NDR 400Gb/s InfiniBand network, 1 adapter per node (attached to socket-0)
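Since the gg nodes have two sockets but a single InfiniBand adapter on socket-0, NUMA placement matters for network-bound runs. A minimal sketch for checking and exploiting adapter locality; the device name mlx5_0 is an assumption (list /sys/class/infiniband to confirm):

# Report the NUMA node the InfiniBand adapter hangs off (expected: 0, i.e. socket-0)
cat /sys/class/infiniband/mlx5_0/device/numa_node
# Pin CPU and memory to socket-0 so the process sits next to the adapter
numactl --cpunodebind=0 --membind=0 ./my_app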
Software
Both the GH200 partition gh and the Grace-Grace partition gg boot the same compute image:
Rocky Linux 9.3 with a custom hybrid 6.2.0-1013-nvidia-64k kernel (64k pages enabled; see the check after this list)
CUDA driver 550.54.14 (applicable only on gh)
MLNX_OFED_LINUX-23.10-1.1.9.0
GNU GCC 11.4.1 (version 12.3 available via modulefiles, as shown below)
GNU libc 2.34
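A quick sanity check of the 64k-page kernel and of the newer GCC; the module name gcc/12.3 is an assumption, confirm it with module avail:

# Verify the 64k page size (expected output: 65536)
getconf PAGE_SIZE
# Locate and load the newer GCC provided via modulefiles
module avail gcc
module load gcc/12.3   # hypothetical module name, check the module avail output
gcc --version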
Storage
Home:
/global/home/users/$USER ($HOME)
Provided via NFS
Accessible by all nodes
Quota is enforced
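Since quotas are enforced on the NFS home, usage can be checked with the standard quota tool:

# Show current usage and limits in human-readable units
quota -s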
Scratch:
/global/scratch/users/$USER
Provided via Lustre
Persistent but no backup
Accessible by all nodes
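Scratch being Lustre, the usual lfs tooling applies. A generic sketch; the stripe count of 4 is purely illustrative, and the right value depends on the OST layout, which is not documented here:

# Show free space across the Lustre targets backing scratch
lfs df -h /global/scratch
# Stripe a directory for large-file I/O (stripe count 4 is illustrative)
lfs setstripe -c 4 /global/scratch/users/$USER/bigfiles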
Local:
/local (available only on gh nodes)
Provided via NVMe local disk
Automatically purged when a job ends (see the job script sketch below)
Capacity ~830 GBytes usable
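Because /local is purged at job end, the typical pattern is to stage input in and copy results back inside the job script. A minimal sketch, assuming the SCRATCH variable defined at the end of this page and illustrative file names:

#!/bin/bash
#SBATCH --partition=gh
#SBATCH --job-name=local-nvme-demo
# Stage input onto the fast node-local NVMe
cp $SCRATCH/input.dat /local/
# Run from /local, then copy results back before the job (and /local) ends
cd /local
./my_app input.dat
cp results.dat $SCRATCH/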
We suggest adding the following to your shell init script (e.g. .bashrc) for easy and fast access to scratch (adapt if another shell is used):
export SCRATCH=/global/scratch/users/$USER
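After reloading the shell (e.g. with source ~/.bashrc), scratch is then one command away:

# Jump straight to your scratch directory
cd $SCRATCH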