...
```
$ echo $HPCX_SHARP_DIR
/global/software/centos-7/modules/intel/2018.1.163/hpcx/2.1.0/sharp
$ echo $OMPI_HOME
/global/software/centos-7/modules/intel/2018.1.163/hpcx/2.1.0/ompi
```
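How these variables get set depends on the site; a minimal sketch, assuming the HPC-X install prefix shown above and that its bundled hpcx-init.sh script (or an hpcx environment module) is available:

```bash
# Assumed HPC-X install prefix for this example; adjust to your site's path
HPCX_HOME=/global/software/centos-7/modules/intel/2018.1.163/hpcx/2.1.0

# Load the HPC-X environment via the init script shipped with HPC-X
source "$HPCX_HOME/hpcx-init.sh"
hpcx_load

# Alternatively, if your site provides an environment module:
#   module load hpcx

# Either way, HPCX_SHARP_DIR and OMPI_HOME should end up pointing into $HPCX_HOME
echo "$HPCX_SHARP_DIR" "$OMPI_HOME"
```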
4. Set the parameters sharp_enabled 2, virt_enabled 2, and routing_engine ftree,updn (or any other routing algorithm) in the OpenSM configuration file /etc/opensm/opensm.conf (or any other location).
```
sharp_enabled 2
virt_enabled 2
routing_engine ftree,updn
```
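One way to put these options in place is simply to append them to the configuration file; a minimal sketch, assuming /etc/opensm/opensm.conf already exists and does not already define these keys:

```bash
# Back up the OpenSM config, then append the SHARP, virtualization,
# and routing options to it
sudo cp /etc/opensm/opensm.conf /etc/opensm/opensm.conf.bak
sudo tee -a /etc/opensm/opensm.conf > /dev/null <<'EOF'
sharp_enabled 2
virt_enabled 2
routing_engine ftree,updn
EOF
```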
5. Set the parameter root_guid_file to /etc/opensm/root_guid.cfg in the same OpenSM configuration file /etc/opensm/opensm.conf (or any other location).
```
root_guid_file /etc/opensm/root_guid.cfg
```
6. Put the root GUIDs in /etc/opensm/root_guid.cfg, e.g.:
```
0x7cfe900300a5a2c0
```
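If you are unsure which node GUIDs belong to your root (spine) switches, the standard InfiniBand diagnostics can list them; a minimal sketch, assuming infiniband-diags is installed on a host attached to the fabric:

```bash
# List every switch in the fabric with its node GUID and description;
# the GUIDs of the topmost (root/spine) switches go into root_guid.cfg
sudo ibswitches

# Or dump the whole topology and keep only the switch records
sudo ibnetdiscover | grep '^Switch'
```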
7. Start OpenSM using the prepared opensm.conf file (you can add other flags as well, e.g., priority).
```
$ sudo opensm -F /etc/opensm/opensm.conf -g 0xe41d2d0300a3ab5c -p 3 -B
-------------------------------------------------
OpenSM 5.0.0.MLNX20180219.c610c42
Config file is `/etc/opensm/opensm.conf`:
Reading Cached Option File: /etc/opensm/opensm.conf
Loading Cached Option:sharp_enabled = 2
Loading Cached Option:virt_enabled = 2
Loading Cached Option:routing_engine = ftree,updn
Loading Cached Option:root_guid_file = /etc/opensm/root_guid.cfg
Command Line Arguments:
guid = 0xe41d2d0300a3ab5c
Priority = 3
Daemon mode
Log File: /var/log/opensm.log
-------------------------------------------------
OpenSM 5.0.0.MLNX20180219.c610c42

Using default GUID 0x248a070300964da6
Entering DISCOVERING state
Entering MASTER state
```
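To double-check that the options were actually picked up and that nothing failed at startup, you can inspect the log file reported above; a minimal sketch, assuming the default log location /var/log/opensm.log:

```bash
# Look for the SHARP-related options, the MASTER transition, and any errors
sudo grep -i -E 'sharp|master|err' /var/log/opensm.log | tail -n 20
```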
...
8. Make sure that this is the SM running on the cluster and that no other SM is running.
```
$ sudo sminfo
sminfo: sm lid 11 sm guid 0xe41d2d0300a3ab5c, activity count 5303 priority 14 state 3 SMINFO_MASTER
```
9. Make sure that the aggregation nodes were activated by OpenSM.
...
```
$ sudo service sharpd start
Redirecting to /bin/systemctl start sharpd.service
```
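To confirm that the daemon actually started on each compute node, you can query systemd directly; a minimal sketch, assuming sharpd is managed by systemd as the redirect above suggests:

```bash
# Verify the service is active and inspect its most recent log lines
sudo systemctl status sharpd --no-pager
sudo journalctl -u sharpd -n 20 --no-pager
```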
Verification
1. Run ibdiagnet --sharp and check the aggregation nodes
```
$ sudo ibdiagnet --sharp
...
$ cat /var/tmp/ibdiagnet2/ibdiagnet2.sharp
# This database file was automatically generated by IBDIAG
AN:Mellanox Technologies Aggregation Node, lid:73, node guid:0x7cfe900300a5a2c8
AN:Mellanox Technologies Aggregation Node, lid:70, node guid:0xec0d9a03001c7068
```
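On a larger fabric it can be handy to count the activated aggregation nodes rather than read the file by eye; a minimal sketch based on the AN: record format shown above:

```bash
# Each activated SHARP aggregation node appears as one "AN:" line
grep -c '^AN:' /var/tmp/ibdiagnet2/ibdiagnet2.sharp
```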
2. Run a simple microbenchmark, such as osu_barrier, to confirm that SHARP is working.
```
$ mpirun -np 32 -npernode 1 -mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 -mca coll_fca_enable 0 -mca coll_hcoll_enable 0 osu_barrier -i 100000
# OSU MPI Barrier Latency Test v5.4.1
# Avg Latency(us)
4.17
$ mpirun -np 32 -npernode 1 -mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 -mca coll_fca_enable 0 -mca coll_hcoll_enable 1 -x HCOLL_MAIN_IB=mlx5_0:1 osu_barrier -i 100000
# OSU MPI Barrier Latency Test v5.4.1
# Avg Latency(us)
3.29
$ mpirun -np 32 -npernode 1 -mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 -mca coll_fca_enable 0 -mca coll_hcoll_enable 1 -x HCOLL_MAIN_IB=mlx5_0:1 -x HCOLL_ENABLE_SHARP=1 osu_barrier -i 100000
# OSU MPI Barrier Latency Test v5.4.1
# Avg Latency(us)
1.64
```
Note: in the testing above, the first command runs osu_barrier without HCOLL, the second runs it with HCOLL but without SHARP, and the third runs it with HCOLL and SHARP enabled.