Aspire-1 (NSCC) Cluster Access for ISC21 SCC

The ISC21 Committee will supply your team captain with the username and password required to connect to the Aspire-1 cluster of the National Supercomputing Centre Singapore (NSCC): aspire.nscc.sg

Follow the Getting Started guide to connect to the VPN and then to the cluster login nodes: https://help.nscc.sg/wp-content/uploads/Getting-Started-ASPIRE-1-v1.08-final.pdf

Get to know the PBS scheduler: see https://help.nscc.sg/pbspro-quickstartguide/ and other online resources. If you are familiar with Slurm, look up a PBS/Slurm command comparison.
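
For readers coming from Slurm, the most common commands map roughly as in the sketch below. The helper name slurm_to_pbs is hypothetical, and the pairs are approximate equivalents only: options and output formats differ between the two schedulers, so check the PBS Pro quick start guide before relying on them.

```shell
#!/bin/bash
# Rough Slurm -> PBS Pro command equivalents (approximate; options differ).
# slurm_to_pbs is a hypothetical helper written for this page.
slurm_to_pbs() {
    case "$1" in
        sbatch)  echo "qsub" ;;         # submit a batch script
        squeue)  echo "qstat" ;;        # list queued and running jobs
        scancel) echo "qdel" ;;         # cancel a job
        sinfo)   echo "pbsnodes -a" ;;  # show node status
        *)       echo "unknown"; return 1 ;;
    esac
}

# Print the small lookup table.
for cmd in sbatch squeue scancel sinfo; do
    printf '%-8s -> %s\n' "$cmd" "$(slurm_to_pbs "$cmd")"
done
```

The function only prints names; it does not run any scheduler command, so it can be tried on a laptop before logging in to the cluster.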

Once connected, create a sample job script and submit it. Here is a basic example that reports the CPU type and network adapter of the compute node.

$ cat submit.pbs
#!/bin/bash
#PBS -q normal
#PBS -l select=1:ncpus=1:mem=100M
#PBS -l walltime=00:10:00
#PBS -P <your project ID>
#PBS -o outputfile.o
#PBS -e errorfile.e

echo "Checking The CPU and Network"
echo "lscpu"
lscpu
# Quote the pipeline so echo prints it literally instead of piping into grep
echo "lspci | grep Mel"
lspci | grep Mell
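
A common mistake is submitting the script with the project placeholder still in place. The sketch below is one possible pre-submit check; the function name check_pbs_script is hypothetical, and the two rules it enforces (no leftover placeholder, a walltime request present) are assumptions about what is worth checking, not NSCC requirements.

```shell
#!/bin/bash
# Hypothetical pre-submit check for a PBS job script.
# Fails if the "-P <...>" placeholder was not replaced or if no walltime is requested.
check_pbs_script() {
    local script="$1"
    if grep -q '^#PBS -P <' "$script"; then
        echo "error: replace <your project ID> in $script" >&2
        return 1
    fi
    if ! grep -q '^#PBS -l walltime=' "$script"; then
        echo "error: no walltime requested in $script" >&2
        return 1
    fi
    echo "ok: $script"
}

# Demonstration on a throwaway file that still contains the placeholder.
demo=$(mktemp)
printf '%s\n' '#!/bin/bash' '#PBS -l walltime=00:10:00' '#PBS -P <your project ID>' > "$demo"
check_pbs_script "$demo" || echo "fix the script before running qsub"
rm -f "$demo"
```

On the cluster this could be chained as: check_pbs_script submit.pbs && qsub submit.pbs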


Submit the job:

$ qsub submit.pbs
9954607.wlm01
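
qsub prints the job identifier on standard output, so it can be captured and reused to query the job. The following is a minimal sketch; job_number is a hypothetical helper, and the qsub/qstat branch only runs where those commands are installed (that is, on a login node).

```shell
#!/bin/bash
# Strip the server suffix from a PBS job identifier, e.g. 9954607.wlm01 -> 9954607.
# job_number is a hypothetical helper written for this page.
job_number() {
    echo "${1%%.*}"
}

if command -v qsub >/dev/null 2>&1; then
    jobid=$(qsub submit.pbs)   # qsub prints the job identifier on stdout
    echo "submitted $jobid (job number $(job_number "$jobid"))"
    qstat "$jobid"             # one-line status of this job
else
    echo "qsub not found; run this on a cluster login node" >&2
fi
```

Keeping the identifier in a variable also makes it easy to cancel the job later with qdel "$jobid".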


Check the output:

$ cat outputfile.o
Checking The CPU and Network
lscpu
Architecture:         x86_64
CPU op-mode(s):       32-bit, 64-bit
Byte Order:           Little Endian
CPU(s):               24
On-line CPU(s) list:  0-23
Thread(s) per core:   1
Core(s) per socket:   12
Socket(s):            2
NUMA node(s):         4
Vendor ID:            GenuineIntel
CPU family:           6
Model:                63
Model name:           Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz
Stepping:             2
CPU MHz:              1200.000
BogoMIPS:             5187.61
Virtualization:       VT-x
L1d cache:            32K
L1i cache:            32K
L2 cache:             256K
L3 cache:             15360K
NUMA node0 CPU(s):    0-5
NUMA node1 CPU(s):    6-11
NUMA node2 CPU(s):    12-17
NUMA node3 CPU(s):    18-23
lspci | grep Mel
81:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
======================================================================================
                Resource Usage on 2020-04-25 10:08:30.888043:
    JobId: 9954616.wlm01
    Project: 21120227
    Exit Status: 0
    NCPUs Requested: 1                  NCPUs Used: 1
                                        CPU Time Used: 00:00:00
    Memory Requested: 100mb             Memory Used: 0kb
                                        Vmem Used: 0kb
    Walltime requested: 00:10:00        Walltime Used: 00:00:00
    Execution Nodes Used: (std1708:ncpus=1:mem=102400kb)
======================================================================================
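
The lscpu fields in this output are related: sockets times cores per socket times threads per core gives the number of logical CPUs (2 x 12 x 1 = 24 on this node). The sketch below computes that product from lscpu-style text; logical_cpus is a hypothetical helper, and it assumes the English field labels shown in the output above.

```shell
#!/bin/bash
# Multiply sockets, cores per socket and threads per core from lscpu-style text on stdin.
# logical_cpus is a hypothetical helper; it assumes the English lscpu field labels.
logical_cpus() {
    awk -F: '
        /^Socket\(s\)/          { s = $2 + 0 }
        /^Core\(s\) per socket/ { c = $2 + 0 }
        /^Thread\(s\) per core/ { t = $2 + 0 }
        END { print s * c * t }
    '
}

# Values copied from the job output above.
logical_cpus <<'EOF'
Thread(s) per core:   1
Core(s) per socket:   12
Socket(s):            2
EOF
```

On a node, the same function can read live data with: lscpu | logical_cpus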


For DGX access, follow this guide:

https://help.nscc.sg/wp-content/uploads/AI_System_QuickStart.pdf

DGX HW description:

$ cat dgx4105.txt
Architecture:         x86_64
CPU op-mode(s):       32-bit, 64-bit
Byte Order:           Little Endian
CPU(s):               80
On-line CPU(s) list:  0-79
Thread(s) per core:   2
Core(s) per socket:   20
Socket(s):            2
NUMA node(s):         2
Vendor ID:            GenuineIntel
CPU family:           6
Model:                79
Model name:           Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
Stepping:             1
CPU MHz:              2794.907
CPU max MHz:          3600.0000
CPU min MHz:          1200.0000
BogoMIPS:             4390.10
Virtualization:       VT-x
L1d cache:            32K
L1i cache:            32K
L2 cache:             256K
L3 cache:             51200K
NUMA node0 CPU(s):    0-19,40-59
NUMA node1 CPU(s):    20-39,60-79
Flags:                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap intel_pt xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts md_clear flush_l1d

              total        used        free      shared  buff/cache   available
Mem:            503          58         340           0         105         442
Swap:             0           0           0

OFED-internal-4.4-2.0.7:
Ubuntu 18.04.2 LTS \n \l
Linux dgx4105 4.15.0-54-generic #58-Ubuntu SMP Mon Jun 24 10:55:24 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

Filesystem                                        Size  Used Avail Use% Mounted on
udev                                              252G     0  252G   0% /dev
tmpfs                                              51G  3.2M   51G   1% /run
/dev/sda2                                         440G  395G   22G  95% /
tmpfs                                             252G   12K  252G   1% /dev/shm
tmpfs                                             5.0M     0  5.0M   0% /run/lock
tmpfs                                             252G     0  252G   0% /sys/fs/cgroup
/dev/sda1                                         487M  6.1M  481M   2% /boot/efi
/dev/sdb1                                         7.0T  4.9T  1.8T  74% /raid
192.168.160.101:/home                             3.4P  2.1P  1.4P  61% /home
192.168.156.29@o2ib,192.168.156.30@o2ib:/scratch  2.8P  1.8P  993T  65% /scratch
tmpfs                                              51G     0   51G   0% /run/user/0

Sat May 23 06:15:29 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:0B:00.0 Off |                    0 |
| N/A   35C    P0    43W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

06:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1)
07:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1)
0a:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1)
0b:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1)
85:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1)
86:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1)
89:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1)
8a:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1)

05:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
0c:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
84:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
8b:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]

hca_id: mlx5_1
    transport:        InfiniBand (0)
    fw_ver:           12.23.1020
    node_guid:        ec0d:9a03:00a4:bbde
    sys_image_guid:   ec0d:9a03:00a4:bbde
    vendor_id:        0x02c9
    vendor_part_id:   4115
    hw_ver:           0x0
    board_id:         MT_2180110032
    phys_port_cnt:    1
    Device ports:
        port: 1
            state:        PORT_ACTIVE (4)
            max_mtu:      4096 (5)
            active_mtu:   4096 (5)
            sm_lid:       251
            port_lid:     1417
            port_lmc:     0x00
            link_layer:   InfiniBand

hca_id: mlx5_3
    transport:        InfiniBand (0)
    fw_ver:           12.23.1020
    node_guid:        ec0d:9a03:00aa:2960
    sys_image_guid:   ec0d:9a03:00aa:2960
    vendor_id:        0x02c9
    vendor_part_id:   4115
    hw_ver:           0x0
    board_id:         MT_2180110032
    phys_port_cnt:    1
    Device ports:
        port: 1
            state:        PORT_ACTIVE (4)
            max_mtu:      4096 (5)
            active_mtu:   4096 (5)
            sm_lid:       251
            port_lid:     1419
            port_lmc:     0x00
            link_layer:   InfiniBand

hca_id: mlx5_0
    transport:        InfiniBand (0)
    fw_ver:           12.23.1020
    node_guid:        ec0d:9a03:00aa:29b8
    sys_image_guid:   ec0d:9a03:00aa:29b8
    vendor_id:        0x02c9
    vendor_part_id:   4115
    hw_ver:           0x0
    board_id:         MT_2180110032
    phys_port_cnt:    1
    Device ports:
        port: 1
            state:        PORT_ACTIVE (4)
            max_mtu:      4096 (5)
            active_mtu:   4096 (5)
            sm_lid:       251
            port_lid:     1416
            port_lmc:     0x00
            link_layer:   InfiniBand

hca_id: mlx5_2
    transport:        InfiniBand (0)
    fw_ver:           12.23.1020
    node_guid:        ec0d:9a03:00aa:2988
    sys_image_guid:   ec0d:9a03:00aa:2988
    vendor_id:        0x02c9
    vendor_part_id:   4115
    hw_ver:           0x0
    board_id:         MT_2180110032
    phys_port_cnt:    1
    Device ports:
        port: 1
            state:        PORT_ACTIVE (4)
            max_mtu:      4096 (5)
            active_mtu:   4096 (5)
            sm_lid:       251
            port_lid:     1422
            port_lmc:     0x00
            link_layer:   InfiniBand
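
The lscpu part of this description lists the DGX NUMA layout as CPU ranges (node0: 0-19,40-59; node1: 20-39,60-79), which matters when pinning processes close to a GPU or HCA. As a minimal sketch, the hypothetical helper cpu_in_list below tests whether a CPU id falls inside such a range list; the device-to-NUMA-node mapping itself is not shown in the dump and is not assumed here.

```shell
#!/bin/bash
# Return success if CPU id $1 is inside the comma-separated range list $2,
# written in the lscpu style, e.g. "0-19,40-59".
# cpu_in_list is a hypothetical helper written for this page.
cpu_in_list() {
    local cpu="$1" list="$2" range lo hi
    local IFS=','
    for range in $list; do
        lo="${range%-*}"   # text before the dash (or the whole value if no dash)
        hi="${range#*-}"   # text after the dash (or the whole value if no dash)
        if [ "$cpu" -ge "$lo" ] && [ "$cpu" -le "$hi" ]; then
            return 0
        fi
    done
    return 1
}

# The DGX has two NUMA nodes, so anything outside node0 belongs to node1.
node0="0-19,40-59"
for cpu in 0 25 45 70; do
    if cpu_in_list "$cpu" "$node0"; then node=0; else node=1; fi
    echo "CPU $cpu -> NUMA node $node"
done
```

With the ranges above, CPUs 0 and 45 land on node 0 and CPUs 25 and 70 on node 1, which shows that the two hyperthreads of a core share a node.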
