Set up the cluster and OpenSM
1. Make sure that your cluster meets the minimum requirements; see the deployment guide.
2. Log in to the head node and load the relevant modules.
```
$ module load intel/2018.1.163
$ module load hpcx/2.1.0
```
3. Make sure that both environment variables (HPCX_SHARP_DIR, OMPI_HOME) are set and point to valid directories.
```
$ echo $HPCX_SHARP_DIR
/global/software/centos-7/modules/intel/2018.1.163/hpcx/2.1.0/sharp
$ echo $OMPI_HOME
/global/software/centos-7/modules/intel/2018.1.163/hpcx/2.1.0/ompi
```
4. Set the parameters sharp_enabled 2 and routing_engine ftree,updn (or any other routing algorithm) in the OpenSM configuration file /etc/opensm/opensm.conf (or any other location):
```
sharp_enabled 2
routing_engine ftree,updn
```
5. Set the parameter root_guid_file /etc/opensm/root_guid.cfg in the OpenSM configuration file /etc/opensm/opensm.conf (or any other location):
```
root_guid_file /etc/opensm/root_guid.cfg
```
6. Put the root GUIDs in /etc/opensm/root_guid.cfg, e.g.:
```
0x7cfe900300a5a2c0
```
7. Start OpenSM using the prepared opensm.conf file (you can add other flags as well, e.g. priority).
```
$ sudo opensm -F /etc/opensm/opensm.conf -g 0xe41d2d0300a3ab5c -p 3 -B
-------------------------------------------------
OpenSM 5.0.0.MLNX20180219.c610c42
Config file is `/etc/opensm/opensm.conf`:
 Reading Cached Option File: /etc/opensm/opensm.conf
 Loading Cached Option:routing_engine = ftree,updn
 Loading Cached Option:root_guid_file = /etc/opensm/root_guid.cfg
 Loading Cached Option:sharp_enabled = 2
Command Line Arguments:
 guid = 0xe41d2d0300a3ab5c
 Priority = 3
 Daemon mode
 Log File: /var/log/opensm.log
```
8. Make sure that this is the master SM running on the cluster and that no other SM is active.
```
$ sudo sminfo
sminfo: sm lid 11 sm guid 0xe41d2d0300a3ab5c, activity count 5303 priority 14 state 3 SMINFO_MASTER
```
9. Make sure that the aggregation nodes were activated by OpenSM.
```
$ sudo ibnetdiscover | grep "Agg"
[37] "H-7cfe900300a5a2c8"[1](7cfe900300a5a2c8) # "Mellanox Technologies Aggregation Node" lid 73 4xEDR
[37] "H-ec0d9a03001c7068"[1](ec0d9a03001c7068) # "Mellanox Technologies Aggregation Node" lid 70 4xEDR
Ca 1 "H-7cfe900300a5a2c8" # "Mellanox Technologies Aggregation Node"
Ca 1 "H-ec0d9a03001c7068" # "Mellanox Technologies Aggregation Node"
```
Note: OpenSM v4.9 or later does not require any special Aggregation Manager configuration for fat-tree topologies. For other topologies, or for older OpenSM versions, refer to the deployment guide.
Enable SHARP Daemons
1. Set up the SHARP Aggregation Manager (sharp_am) on the OpenSM node.
```
$ sudo $HPCX_SHARP_DIR/sbin/sharp_daemons_setup.sh -s -d sharp_am
Copying /global/software/centos-7/modules/intel/2018.1.163/hpcx/2.1.0/sharp/systemd/system/sharp_am.service to /etc/systemd/system/sharp_am.service
Service sharp_am is installed
```
2. Start the SHARP AM service.
```
$ sudo service sharp_am start
Redirecting to /bin/systemctl start sharp_am.service
```
3. Set up sharpd on all cluster nodes (using pdsh or any other method).
```
$ sudo $HPCX_SHARP_DIR/sbin/sharp_daemons_setup.sh -s -d sharpd
Copying /global/software/centos-7/modules/intel/2018.1.163/hpcx/2.1.0/sharp/systemd/system/sharpd.service to /etc/systemd/system/sharpd.service
Service sharpd is installed
```
4. Start the SHARP daemon on all cluster nodes.
```
$ sudo service sharpd start
Redirecting to /bin/systemctl start sharpd.service
```
Verification
1. Run ibdiagnet --sharp and check the aggregation nodes:
```
$ sudo ibdiagnet --sharp
...
$ cat /var/tmp/ibdiagnet2/ibdiagnet2.sharp
# This database file was automatically generated by IBDIAG
AN:Mellanox Technologies Aggregation Node, lid:73, node guid:0x7cfe900300a5a2c8
AN:Mellanox Technologies Aggregation Node, lid:70, node guid:0xec0d9a03001c7068
```
2. Run a simple microbenchmark such as osu_barrier to confirm that SHARP is working:
```
$ mpirun -np 32 -npernode 1 -mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 -mca coll_fca_enable 0 -mca coll_hcoll_enable 0 osu_barrier -i 100000
# OSU MPI Barrier Latency Test v5.4.1
# Avg Latency(us)
4.17

$ mpirun -np 32 -npernode 1 -mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 -mca coll_fca_enable 0 -mca coll_hcoll_enable 1 -x HCOLL_MAIN_IB=mlx5_0:1 osu_barrier -i 100000
# OSU MPI Barrier Latency Test v5.4.1
# Avg Latency(us)
3.29

$ mpirun -np 32 -npernode 1 -mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 -mca coll_fca_enable 0 -mca coll_hcoll_enable 1 -x HCOLL_MAIN_IB=mlx5_0:1 -x HCOLL_ENABLE_SHARP=1 osu_barrier -i 100000
# OSU MPI Barrier Latency Test v5.4.1
# Avg Latency(us)
1.64
```
Note: in the test above, the first command runs osu_barrier without HCOLL, the second runs it with HCOLL but without SHARP, and the third runs it with both HCOLL and SHARP.
Advanced Considerations
Multi-Channel
1. When using full PPN on the node (e.g. 32 ranks on dual-CPU Broadwell servers, or 40 on Skylake servers), it is recommended to use multi-channel.
In the 3-level hierarchy:
First, ranks are sub-grouped within each socket (e.g. 16 ranks), one group per socket (basesmsocket).
Second, the socket leaders form another, node-level group, one per node (basesmuma).
Third, there is one node leader per node (the leader of the second group), one rank per node, forming the inter-node group (p2p).
The collective algorithm takes advantage of this hierarchy.
SBGP stands for sub-grouping.
The default is all 3 levels: -x HCOLL_SBGP=basesmsocket,basesmuma,p2p
In the presence of intra-socket noise (when using full PPN, for example), it is recommended to use a 2-level hierarchy, which drops the middle (node-level) group:
First, ranks are sub-grouped within each socket (e.g. 16 ranks), one group per socket (basesmsocket).
Second, there is one node leader per node (the leader of the first-level group), one rank per node, forming the inter-node group (p2p).
-x HCOLL_SBGP=basesmsocket,p2p
2. If HCOLL_SBGP is changed, HCOLL_BCOL (the communication channel used at each level) must be aligned with it.
For the 2-level hierarchy use:
-x HCOLL_BCOL=basesmuma,ucx_p2p
For the 3-level hierarchy (the default) use:
-x HCOLL_BCOL=basesmuma,basesmuma,ucx_p2p
basesmuma - shared memory, used for the intra-node levels.
ucx_p2p - inter-node communication, used for the last level.
To summarize, use the following multi-channel settings for small-message allreduce at full PPN:
-x HCOLL_BCOL=basesmuma,ucx_p2p -x HCOLL_SBGP=basesmsocket,p2p
Note: if the ranks used are all within one socket, multi-channel is not relevant.
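As a rough illustration (plain Python, not HCOLL code; the node, socket, and rank counts are example assumptions), the 2-level sub-grouping described above can be modeled like this:

```python
# Illustrative model of 2-level sub-grouping (basesmsocket,p2p).
# Not HCOLL internals: ranks are assumed to be laid out node-major,
# filling each socket in order.

def subgroup(n_nodes, sockets_per_node, ranks_per_socket):
    """Return (socket_groups, node_leaders) for a 2-level hierarchy."""
    ranks_per_node = sockets_per_node * ranks_per_socket
    socket_groups = []   # level 1: one group of ranks per socket
    node_leaders = []    # level 2: one leader rank per node
    for node in range(n_nodes):
        base = node * ranks_per_node
        for s in range(sockets_per_node):
            socket_groups.append(list(range(base + s * ranks_per_socket,
                                            base + (s + 1) * ranks_per_socket)))
        # node leader: the lowest rank on the node (leader of its first socket group)
        node_leaders.append(base)
    return socket_groups, node_leaders

# 2 nodes, 2 sockets each, 4 ranks per socket
groups, leaders = subgroup(n_nodes=2, sockets_per_node=2, ranks_per_socket=4)
print(groups)   # four socket groups of four ranks each
print(leaders)  # one leader rank per node
```

The intra-socket groups communicate over shared memory (basesmuma channel), while only the node leaders talk across the network (ucx_p2p channel).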
Fragmentation (allreduce tuning)
To allow larger messages over SHARP, use:
-x HCOLL_BCOL_P2P_ALLREDUCE_SHARP_MAX=4096 (default: 256)
Fragmentation performance depends on the number of OSTs assigned to the group.
1. SHARP resources for Switch-IB 2
For allreduce messages larger than 256B (up to 4KB), you can use fragmentation on the message.
The fragment size is set with:
-x SHARP_COLL_JOB_QUOTA_PAYLOAD_PER_OST=256 (SHARP fragment size)
The default is 128B.
2. OST - outstanding transactions. The default is 16.
One OST is one credit (after sending, the sender waits for a completion). For performance, you can enlarge it to 256:
-x SHARP_COLL_JOB_QUOTA_OSTS=256 (number of SHARP OSTs)
3. Number of outstanding communicators that can use SHARP.
The default is 8, which means that by default each group gets 16/8 = 2 OSTs.
If the number of OSTs is 256, then each group gets 256/8 = 32 OSTs.
Each group gets #OSTs = SHARP_COLL_JOB_QUOTA_OSTS / SHARP_COLL_JOB_QUOTA_MAX_GROUPS.
-x SHARP_COLL_JOB_QUOTA_MAX_GROUPS=8 (number of SHARP groups)
For example, for osu_allreduce up to 4KB (256B payload * 16 OSTs per group), use the following:
-x HCOLL_BCOL_P2P_ALLREDUCE_SHARP_MAX=4096
-x SHARP_COLL_JOB_QUOTA_OSTS=128 (per job; 128/8 = 16 OSTs per group)
-x SHARP_COLL_JOB_QUOTA_PAYLOAD_PER_OST=256
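The quota arithmetic above can be sanity-checked with a short sketch (plain Python; sharp_quota is an illustrative helper name, not a real API - it only encodes the formula stated in item 3):

```python
# Illustrative sketch of the SHARP quota arithmetic described above.

def sharp_quota(osts, max_groups, payload_per_ost):
    """Return (OSTs per group, max bytes coverable by fragmentation)."""
    # OSTs per group = SHARP_COLL_JOB_QUOTA_OSTS / SHARP_COLL_JOB_QUOTA_MAX_GROUPS
    osts_per_group = osts // max_groups
    # Largest fragmented message = fragment size * fragments in flight per group
    max_fragmented = osts_per_group * payload_per_ost
    return osts_per_group, max_fragmented

# Example from the text: 128 OSTs per job, 8 groups, 256B payload per OST
print(sharp_quota(128, 8, 256))  # (16, 4096)
# Defaults: 16 OSTs, 8 groups, 128B payload per OST
print(sharp_quota(16, 8, 128))   # (2, 256)
```

Note that with the default values (16 OSTs, 8 groups, 128B payload) this yields 2 * 128 = 256B, which matches the default HCOLL_BCOL_P2P_ALLREDUCE_SHARP_MAX of 256; the 128/8/256 example yields 4096, matching the tuned value above.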
Note: it is recommended to experiment with these parameters and compare the results against the non-SHARP allreduce tests.
Debugging SHARP
1. Set log level 3 (info) to get logs for communicator creation:
-x SHARP_COLL_LOG_LEVEL=3