Thea - System Architecture
The Thea system is seamlessly integrated into the HPC Advisory Council infrastructure. There is a single entry point and a single Slurm instance; the multiple "clusters" are de facto separate partitions.
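A minimal batch-script sketch showing how a partition is selected at submission time (the gh and gg partition names come from the Software section below; node count and time limit are illustrative):

    #!/bin/bash
    #SBATCH --partition=gh     # GH200 nodes; use gg for the Grace-Grace nodes
    #SBATCH --nodes=1          # illustrative
    #SBATCH --time=00:10:00    # illustrative
    srun hostname              # print the allocated node's hostname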
Hardware
8 x Quanta S74G-2U GH200
- Grace CPU, 72 cores @ 3.4 GHz, with 480 GB LPDDR5X
- Hopper GPU with 96 GB HBM3
- NDR 400 Gb/s InfiniBand network, 1 adapter per node
8 x Supermicro ARS-221GL-NR
- Dual Grace CPU, 72 cores @ 3.4 GHz, with 480 GB LPDDR5X
- NDR 400 Gb/s InfiniBand network, 1 adapter per node (attached to socket-0)
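Since both node types share one Slurm instance, the hardware above can be checked interactively per partition; a minimal sketch (resource options beyond the partition name are illustrative):

    # gh partition: show the Hopper GPU and its 96 GB HBM3
    srun --partition=gh --nodes=1 nvidia-smi
    # gg partition: confirm the two 72-core Grace sockets
    srun --partition=gg --nodes=1 lscpu | grep -E 'Socket\(s\)|Core\(s\)'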
Software
Both the GH200 partition (gh) and the Grace-Grace partition (gg) boot the same compute image:
- Rocky Linux 9.3 with a custom hybrid 6.2.0-1013-nvidia-64k kernel (64k pages enabled)
- CUDA driver 550.54.14 (applicable only on gh)
- MLNX_OFED_LINUX-23.10-1.1.9.0
- GNU GCC 11.4.1 (version 12.3 available via modulefiles; see the sketch below)
- GNU libc 2.34
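To switch from the default GCC 11.4.1 to 12.3, load it through the module system; a minimal sketch (the exact module name gcc/12.3 is an assumption, so check the listing first):

    module avail gcc       # list the GCC versions installed as modulefiles
    module load gcc/12.3   # assumed module name; adjust to the listing above
    gcc --version          # should now report 12.3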
Storage
Home:
- /global/home/users/$USER ($HOME)
- Provided via NFS
- Accessible by all nodes
- Quota is enforced
Scratch:
- /global/scratch/users/$USER
- Provided via Lustre
- Persistent, but not backed up
- Accessible by all nodes
Local:
- /local (available only on gh nodes)
- Provided via local NVMe disk
- Automatically purged when a job ends (see the job-script sketch at the end of this section)
- Capacity ~830 GB usable
We suggest adding the following to your shell init script (e.g. .bashrc; adapt if you use another shell) for easy and fast access to scratch:
export SCRATCH=/global/scratch/users/$USER
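Because /local is wiped when a job ends, a common pattern is to stage data in, compute on the fast local NVMe, and copy results back to scratch before the job finishes. A minimal sketch, assuming a hypothetical application my_app and hypothetical input/output file names:

    #!/bin/bash
    #SBATCH --partition=gh              # /local exists only on gh nodes
    #SBATCH --nodes=1
    export SCRATCH=/global/scratch/users/$USER
    cp "$SCRATCH/input.dat" /local/     # hypothetical input file staged to local NVMe
    cd /local
    srun "$SCRATCH/my_app" input.dat    # hypothetical application, run with /local as working dir
    cp results.dat "$SCRATCH/"          # copy results back before /local is purged at job end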