...

Code Block
$ echo $HPCX_SHARP_DIR
/global/software/centos-7/modules/intel/2018.1.163/hpcx/2.1.0/sharp
$ echo $OMPI_HOME
/global/software/centos-7/modules/intel/2018.1.163/hpcx/2.1.0/ompi
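
If the variables above are empty, the HPC-X environment has probably not been loaded yet. On a modules-based system that might look like the following sketch (the module names are inferred from the install paths above and are an assumption; adjust for your site):

Code Block
# Assumed module names, based on the install paths shown above
$ module load intel/2018.1.163 hpcx/2.1.0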


4. Set the parameters sharp_enabled 2 and routing_engine ftree,updn (or any other routing algorithm) in the opensm configuration file /etc/opensm/opensm.conf (or any other location)

Code Block
sharp_enabled 2
routing_engine ftree,updn


5. Set the parameter root_guid_file /etc/opensm/root_guid.cfg in the opensm configuration file /etc/opensm/opensm.conf (or any other location)

Code Block
root_guid_file /etc/opensm/root_guid.cfg


6. Put the root GUIDs in /etc/opensm/root_guid.cfg, e.g.,

Code Block
0x7cfe900300a5a2c0
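
If you are not sure which GUIDs belong to the root (spine) switches, ibswitches from infiniband-diags lists the node GUID of every switch in the fabric. A sketch follows; the output is illustrative and the switch descriptions are assumptions:

Code Block
$ sudo ibswitches
Switch : 0x7cfe900300a5a2c0 ports 36 "MF0;root-switch:SX6036/U1" enhanced port 0 lid 3 lmc 0
Switch : 0xec0d9a03001c7060 ports 36 "MF0;leaf-switch:SX6036/U1" enhanced port 0 lid 5 lmc 0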


7. Start OpenSM using the prepared opensm.conf file (you can add other flags as well, e.g., priority).

Code Block
sudo opensm -F /etc/opensm/opensm.conf -g 0xe41d2d0300a3ab5c -p 3 -B
-------------------------------------------------
OpenSM 5.0.0.MLNX20180219.c610c42
Config file is `/etc/opensm/opensm.conf`:
 Reading Cached Option File: /etc/opensm/opensm.conf
 Loading Cached Option:routing_engine = ftree,updn
 Loading Cached Option:root_guid_file = /etc/opensm/root_guid.cfg
 Loading Cached Option:sharp_enabled = 2
Command Line Arguments:
 guid = 0xe41d2d0300a3ab5c
 Priority = 3
 Daemon mode
 Log File: /var/log/opensm.log
-------------------------------------------------
OpenSM 5.0.0.MLNX20180219.c610c42

Using default GUID 0x248a070300964da6
Entering DISCOVERING state

Entering MASTER state

...


8. Make sure that this is the SM running on the cluster, and that there is no other SM.

Code Block
$ sudo sminfo
sminfo: sm lid 11 sm guid 0xe41d2d0300a3ab5c, activity count 5303 priority 14 state 3 SMINFO_MASTER


9. Make sure that the aggregation nodes were activated by OpenSM

...

Code Block
$ sudo service sharpd start
Redirecting to /bin/systemctl start sharpd.service
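
To confirm that the daemon actually came up, you can query systemd (output below is illustrative):

Code Block
$ systemctl status sharpd.service
● sharpd.service - SHARP daemon
   Active: active (running)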


Verification

1. Run ibdiagnet --sharp and check the aggregation nodes

Code Block
$ sudo ibdiagnet --sharp 
...

$ cat /var/tmp/ibdiagnet2/ibdiagnet2.sharp
# This database file was automatically generated by IBDIAG

AN:Mellanox Technologies Aggregation Node, lid:73, node guid:0x7cfe900300a5a2c8

AN:Mellanox Technologies Aggregation Node, lid:70, node guid:0xec0d9a03001c7068
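
Each SHARP-capable switch is expected to contribute one aggregation node, so a quick sanity check is to count the AN entries in the generated file (the count below matches the two-switch example above):

Code Block
$ grep -c '^AN' /var/tmp/ibdiagnet2/ibdiagnet2.sharp
2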


2. Run a simple microbenchmark, such as osu_barrier, to confirm that SHARP is working

Code Block
$ mpirun -np 32 -npernode 1 -mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 -mca coll_fca_enable 0 -mca coll_hcoll_enable 0 osu_barrier -i 100000

# OSU MPI Barrier Latency Test v5.4.1
# Avg Latency(us)
             4.17

$ mpirun -np 32 -npernode 1 -mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 -mca coll_fca_enable 0  -mca coll_hcoll_enable 1 -x HCOLL_MAIN_IB=mlx5_0:1 osu_barrier -i 100000

# OSU MPI Barrier Latency Test v5.4.1
# Avg Latency(us)
             3.29

$ mpirun -np 32 -npernode 1 -mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 -mca coll_fca_enable 0 -mca coll_hcoll_enable 1 -x HCOLL_MAIN_IB=mlx5_0:1 -x HCOLL_ENABLE_SHARP=1 osu_barrier -i 100000

# OSU MPI Barrier Latency Test v5.4.1
# Avg Latency(us)
             1.64

Note: in the testing above, the first command runs osu_barrier without HCOLL, the second runs it with HCOLL but without SHARP, and the third runs it with HCOLL and SHARP.
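
The same flags can be reused for other collectives where SHARP typically helps, e.g., small-message allreduce. A sketch using osu_allreduce from the same OSU suite (output omitted):

Code Block
$ mpirun -np 32 -npernode 1 -mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 -mca coll_fca_enable 0 -mca coll_hcoll_enable 1 -x HCOLL_MAIN_IB=mlx5_0:1 -x HCOLL_ENABLE_SHARP=1 osu_allreduce -i 10000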