Thea - System Architecture

The Thea system is seamlessly integrated into the HPC Advisory Council infrastructure. There is a single entry point and a single Slurm instance; the multiple "clusters" are de facto separate partitions.
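
For example, to list the partitions and target one at submission time (job.sh is a placeholder for your batch script):

sinfo                          # list partitions and node states
sbatch --partition=gh job.sh   # submit job.sh to the gh partition described below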

 

Hardware

  • 8 x Quanta S74G-2U GH200

    • Grace CPU 72 cores @ 3.4 GHz with 480 GB LPDDR5X

    • Hopper GPU with 96 GB HBM3

    • NDR 400Gb/s InfiniBand network, 1 adapter per node

  • 8 x Supermicro ARS-221GL-NR

    • dual Grace CPU 72 cores @ 3.4 GHz with 480 GB LPDDR5X

    • NDR 400Gb/s InfiniBand network, 1 adapter per node (attached to socket-0)
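
The hardware can be verified from a shell on a compute node; a minimal sketch (nvidia-smi applies only to gh nodes):

lscpu | grep -E 'Model name|Socket|Core'               # CPU model, sockets, cores per socket
free -g                                                # installed memory in GiB
nvidia-smi --query-gpu=name,memory.total --format=csv  # GPU model and HBM3 size (gh only)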

 

Software

Both the GH200 partition (gh) and the Grace-Grace partition (gg) boot the same compute image:

  • Rocky Linux 9.3 with a custom hybrid 6.2.0-1013-nvidia-64k kernel (64k pages enabled)

  • CUDA driver 550.54.14 (applicable only on gh)

  • MLNX_OFED_LINUX-23.10-1.1.9.0

  • GNU GCC 11.4.1 (version 12.3 available via modulefiles)

  • GNU libc 2.34
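
For example, to switch to the newer GCC via modulefiles (the module name gcc/12.3 is an assumption; module avail gcc shows what is actually installed):

module avail gcc       # list available GCC modules
module load gcc/12.3   # load GCC 12.3 (exact module name may differ)
gcc --version          # verify the active compiler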

 

Storage

  • Home: /global/home/users/$USER ($HOME)

    • Provided via NFS

    • Accessible by all nodes

    • Quota is enforced

  • Scratch: /global/scratch/users/$USER

    • Provided via Lustre

    • Persistent but no backup

    • Accessible by all nodes

  • Local: /local (available only on gh nodes)

    • Provided via NVMe local disk

    • Automatically purged when a job ends (see the staging sketch after this list)

    • Capacity: ~830 GB usable
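
Because /local is purged at job end, anything worth keeping must be copied back to a persistent filesystem before the job finishes. A minimal job-script sketch (my_input, my_app, and results are placeholders):

#!/bin/bash
#SBATCH --partition=gh

cp -r /global/scratch/users/$USER/my_input /local/   # stage input onto the fast local NVMe
cd /local
./my_app my_input                                    # run against local storage
cp -r results /global/scratch/users/$USER/           # save results before /local is purged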

We suggest adding the following to your shell init script (e.g. .bashrc) for easy and fast access to scratch (adapt it if you use another shell):

export SCRATCH=/global/scratch/users/$USER
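
New shells then reach the scratch area in one command, for example:

cd $SCRATCH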