
Set up the cluster and OpenSM

1. Make sure that your cluster meets the minimum requirements; see the deployment guide.


2. Log in to the head node and load the relevant modules.

$ module load intel/2018.1.163
$ module load hpcx/2.1.0


3. Make sure that both environment variables (HPCX_SHARP_DIR, OMPI_HOME) are set and point to available directories.

$ echo $HPCX_SHARP_DIR
/global/software/centos-7/modules/intel/2018.1.163/hpcx/2.1.0/sharp
$ echo $OMPI_HOME
/global/software/centos-7/modules/intel/2018.1.163/hpcx/2.1.0/ompi
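The check above can be scripted. The following is a minimal sketch; the helper name check_dir_var is hypothetical and not part of HPC-X. It only tests that a variable is set and that its value is an existing directory.

```shell
#!/bin/sh
# check_dir_var NAME: succeed only if the variable NAME is set and its value
# is an existing directory. (Hypothetical helper, not part of HPC-X.)
check_dir_var() {
    name=$1
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
        echo "$name is not set"
        return 1
    fi
    if [ ! -d "$value" ]; then
        echo "$name=$value is not a directory"
        return 1
    fi
    echo "$name=$value OK"
}

# Report on both variables used in this guide.
for v in HPCX_SHARP_DIR OMPI_HOME; do
    check_dir_var "$v" || true
done
```

If either line reports a problem, reload the modules from step 2 before continuing.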


4. Set the parameters sharp_enabled 2 and routing_engine ftree,updn (or any other routing algorithm) in the OpenSM configuration file, /etc/opensm/opensm.conf (or wherever your configuration file is located).

sharp_enabled 2
routing_engine ftree,updn


5. Set the parameter root_guid_file /etc/opensm/root_guid.cfg in the same OpenSM configuration file.

root_guid_file /etc/opensm/root_guid.cfg


6. Put the root GUIDs in /etc/opensm/root_guid.cfg, e.g.,

0x7cfe900300a5a2c0
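Before starting OpenSM, it may help to confirm that the options from steps 4 and 5 are actually present in the configuration file. This is a sketch under the assumption that each option sits on its own uncommented "key value" line, as shown above; conf_has is a hypothetical helper, and CONF can be overridden if your file lives elsewhere.

```shell
#!/bin/sh
# conf_has FILE KEY VALUE: succeed if FILE has an uncommented line "KEY VALUE".
# (Hypothetical helper; assumes whitespace-separated key/value lines.)
conf_has() {
    [ -r "$1" ] || return 1
    awk -v k="$2" -v v="$3" '
        $1 == k && $2 == v { found = 1 }
        END { exit !found }
    ' "$1"
}

CONF=${CONF:-/etc/opensm/opensm.conf}
for pair in "sharp_enabled 2" \
            "routing_engine ftree,updn" \
            "root_guid_file /etc/opensm/root_guid.cfg"; do
    set -- $pair    # split the pair into key ($1) and value ($2)
    if conf_has "$CONF" "$1" "$2"; then
        echo "ok: $pair"
    else
        echo "MISSING: $pair"
    fi
done
```

Adjust the expected values if you chose a different routing algorithm or root GUID file location.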


7. Start OpenSM with the prepared opensm.conf file (you can add other flags as well, e.g. -p to set the priority).

$ sudo opensm -F /etc/opensm/opensm.conf -g 0xe41d2d0300a3ab5c -p 3 -B
-------------------------------------------------
OpenSM 5.0.0.MLNX20180219.c610c42
Config file is `/etc/opensm/opensm.conf`:
 Reading Cached Option File: /etc/opensm/opensm.conf
 Loading Cached Option:routing_engine = ftree,updn
 Loading Cached Option:root_guid_file = /etc/opensm/root_guid.cfg
 Loading Cached Option:sharp_enabled = 2
Command Line Arguments:
 guid = 0xe41d2d0300a3ab5c
 Priority = 3
 Daemon mode
 Log File: /var/log/opensm.log


8. Make sure that this is the subnet manager (SM) running on the cluster, and that no other SM is active.

$ sudo sminfo
sminfo: sm lid 11 sm guid 0xe41d2d0300a3ab5c, activity count 5303 priority 14 state 3 SMINFO_MASTER
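The sminfo check can be automated by comparing the reported GUID with the one passed to opensm via -g. The sketch below parses a line in the format shown above; sm_guid and sm_is_expected are hypothetical helper names. It checks only the SM that sminfo reports, so it does not by itself rule out additional standby SMs.

```shell
#!/bin/sh
# sm_guid: read sminfo output on stdin and print the SM GUID.
sm_guid() {
    sed -n 's/.*sm guid \(0x[0-9a-fA-F]*\).*/\1/p'
}

# sm_is_expected GUID: read sminfo output on stdin; succeed only if the
# reported SM is in MASTER state and its GUID equals GUID.
sm_is_expected() {
    line=$(cat)
    guid=$(printf '%s\n' "$line" | sm_guid)
    case $line in
        *SMINFO_MASTER*) ;;
        *) echo "SM is not in MASTER state"; return 1 ;;
    esac
    if [ "$guid" != "$1" ]; then
        echo "unexpected SM guid: $guid"
        return 1
    fi
    echo "master SM is $guid"
}

# Usage on the cluster (GUID taken from the opensm command in step 7):
#   sudo sminfo | sm_is_expected 0xe41d2d0300a3ab5c
```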


9. Make sure that the aggregation nodes were activated by OpenSM.

$ sudo ibnetdiscover | grep "Agg"
[37]    "H-7cfe900300a5a2c8"[1](7cfe900300a5a2c8)               # "Mellanox Technologies Aggregation Node" lid 73 4xEDR
[37]    "H-ec0d9a03001c7068"[1](ec0d9a03001c7068)               # "Mellanox Technologies Aggregation Node" lid 70 4xEDR
Ca      1 "H-7cfe900300a5a2c8"          # "Mellanox Technologies Aggregation Node"
Ca      1 "H-ec0d9a03001c7068"          # "Mellanox Technologies Aggregation Node"
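If you know how many SHARP-capable switches the fabric has, you can compare that number with the count of aggregation nodes found. A minimal sketch, assuming the ibnetdiscover output format shown above (one "Ca" line per aggregation node); count_agg_nodes is a hypothetical helper name.

```shell
#!/bin/sh
# count_agg_nodes: read ibnetdiscover output on stdin and print the number
# of "Ca" entries described as an Aggregation Node.
count_agg_nodes() {
    grep -c '^Ca.*Aggregation Node' || true   # grep -c exits 1 on zero matches
}

# Usage on the cluster:
#   sudo ibnetdiscover | count_agg_nodes
```

For the sample output in this step, the count is 2.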


Note: With OpenSM v4.9 or later, fat-tree topologies require no special configuration in the Aggregation Manager. For other topologies, or for older OpenSM versions, refer to the deployment guide.


Enable SHARP Daemons

1. Set up the SHARP Aggregation Manager (sharp_am) on the OpenSM node.

$ sudo $HPCX_SHARP_DIR/sbin/sharp_daemons_setup.sh -s -d sharp_am
Copying /global/software/centos-7/modules/intel/2018.1.163/hpcx/2.1.0/sharp/systemd/system/sharp_am.service to /etc/systemd/system/sharp_am.service
Service sharp_am is installed


2. Start the SHARP AM service.

$ sudo service sharp_am start
Redirecting to /bin/systemctl start sharp_am.service


3. Set up sharpd on all cluster nodes (using pdsh or any other method).

$ sudo $HPCX_SHARP_DIR/sbin/sharp_daemons_setup.sh -s -d sharpd
Copying /global/software/centos-7/modules/intel/2018.1.163/hpcx/2.1.0/sharp/systemd/system/sharpd.service to /etc/systemd/system/sharpd.service
Service sharpd is installed


4. Start the SHARP daemon on all cluster nodes

$ sudo service sharpd start
Redirecting to /bin/systemctl start sharpd.service
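Steps 3 and 4 have to run on every node. One possible way to do that with pdsh is sketched below. The host list in NODES is a hypothetical placeholder, and the sketch assumes that the HPC-X path is the same on all nodes and that sudo works there without a password prompt. With DRY_RUN=1 the commands are only printed, so you can review them first.

```shell
#!/bin/sh
# NODES is a hypothetical pdsh host list; replace it with your own node names.
NODES=${NODES:-"node[01-32]"}
SETUP="${HPCX_SHARP_DIR:-}/sbin/sharp_daemons_setup.sh"

# run_all CMD...: run CMD on every node in NODES with pdsh,
# or only print the pdsh command line when DRY_RUN=1.
run_all() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "pdsh -w $NODES $*"
    else
        pdsh -w "$NODES" "$@"
    fi
}

DRY_RUN=1    # set to 0 to actually execute on the nodes
run_all sudo "$SETUP" -s -d sharpd
run_all sudo service sharpd start
```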


Verification

1. Run ibdiagnet --sharp and check the aggregation nodes

$ sudo ibdiagnet --sharp 
...

$ cat /var/tmp/ibdiagnet2/ibdiagnet2.sharp
# This database file was automatically generated by IBDIAG

AN:Mellanox Technologies Aggregation Node, lid:73, node guid:0x7cfe900300a5a2c8

AN:Mellanox Technologies Aggregation Node, lid:70, node guid:0xec0d9a03001c7068
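To compare the aggregation nodes against what OpenSM reported earlier, the LID and GUID of each entry can be pulled out of the file. This is a sketch that assumes the "AN:" line format shown above; list_sharp_ans is a hypothetical helper name.

```shell
#!/bin/sh
# list_sharp_ans FILE: print "LID GUID" for every "AN:" record in FILE.
list_sharp_ans() {
    sed -n 's/^AN:.*, lid:\([0-9]*\), node guid:\(0x[0-9a-fA-F]*\).*/\1 \2/p' "$1"
}

# Usage on the cluster:
#   list_sharp_ans /var/tmp/ibdiagnet2/ibdiagnet2.sharp
```

The LIDs and GUIDs printed here should match the aggregation nodes listed by ibnetdiscover.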

2. Run a simple microbenchmark such as osu_barrier to confirm that SHARP is working.

$ mpirun -np 32 -npernode 1 -mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 -mca coll_fca_enable 0 -mca coll_hcoll_enable 0 osu_barrier -i 100000

# OSU MPI Barrier Latency Test v5.4.1
# Avg Latency(us)
             4.17

$ mpirun -np 32 -npernode 1 -mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 -mca coll_fca_enable 0  -mca coll_hcoll_enable 1 -x HCOLL_MAIN_IB=mlx5_0:1 osu_barrier -i 100000

# OSU MPI Barrier Latency Test v5.4.1
# Avg Latency(us)
             3.29

$ mpirun -np 32 -npernode 1 -mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 -mca coll_fca_enable 0 -mca coll_hcoll_enable 1 -x HCOLL_MAIN_IB=mlx5_0:1 -x HCOLL_ENABLE_SHARP=1 osu_barrier -i 100000

# OSU MPI Barrier Latency Test v5.4.1
# Avg Latency(us)
             1.64

Note: In the tests above, the first command runs osu_barrier without HCOLL, the second runs it with HCOLL but without SHARP, and the third runs it with both HCOLL and SHARP.
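To put the three latencies in perspective, the ratios between them can be computed directly. The sketch below uses the average latencies from the sample runs above; speedup is a hypothetical helper, and your own numbers will differ with cluster size and topology.

```shell
#!/bin/sh
# speedup BASE NEW: print BASE / NEW with two decimals.
speedup() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

speedup 4.17 1.64   # no HCOLL     vs. HCOLL + SHARP  -> 2.54
speedup 3.29 1.64   # HCOLL only   vs. HCOLL + SHARP  -> 2.01
```

In this sample, enabling SHARP roughly halves the barrier latency compared with HCOLL alone.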
