...
Note: in the above testing, the first command runs osu_barrier without HCOLL, the second runs it with HCOLL but without SHARP, and the third runs it with both HCOLL and SHARP.
Advanced Considerations
Multi-Channel
1. When using full PPN on the node (e.g. 32 ranks on a dual-CPU Broadwell server or 40 on a Skylake server) it is recommended to use multi-channel.
In the 3-level hierarchy:
First, we sub-group within each socket (e.g. 16 ranks), one group per socket. (-x HCOLL_SBGP=basesmsocket)
Second, the socket leaders form another group at the node (NUMA) level. (-x HCOLL_SBGP=basesmuma)
Third, there is one node leader per node (the leader of the second-level group) - one rank per node. (-x HCOLL_SBGP=p2p)
The collective algorithm takes advantage of this hierarchy.
SBGP: sub-grouping
The default is all 3 levels: -x HCOLL_SBGP=basesmsocket,basesmuma,p2p
In case of intra-socket noise (for example when using full PPN) it is recommended to use a 2-level hierarchy:
With 2-level sub-grouping, the second (NUMA) level is omitted.
First, we sub-group within each socket (e.g. 16 ranks), one group per socket. (-x HCOLL_SBGP=basesmsocket)
Second, the leader of each socket group joins the inter-node p2p group, so each socket provides its own channel out of the node. (-x HCOLL_SBGP=p2p)
The 2-level sub-grouping is therefore:
-x HCOLL_SBGP=basesmsocket,p2p
2. If HCOLL_SBGP is changed, HCOLL_BCOL (which selects the communication channel) must be aligned with it.
For the two-level hierarchy use:
-x HCOLL_BCOL=basesmuma,ucx_p2p
For the three-level hierarchy (the default) use:
-x HCOLL_BCOL=basesmuma,basesmuma,ucx_p2p
basesmuma - uses shared memory for levels one and two.
ucx_p2p - communication out of the node at the top (inter-node) level.
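For reference, spelling out the aligned pair for the default three-level hierarchy gives the following (these are the defaults, so they normally do not need to be set explicitly):
-x HCOLL_BCOL=basesmuma,basesmuma,ucx_p2p -x HCOLL_SBGP=basesmsocket,basesmuma,p2p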
To summarize, for multi-channel with small-message allreduce (full PPN), use:
-x HCOLL_BCOL=basesmuma,ucx_p2p -x HCOLL_SBGP=basesmsocket,p2p
Note: if all of the ranks fit within a single socket, this is not relevant.
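A complete launch line for this multi-channel configuration might look like the following sketch; the rank count, mapping and the path to the OSU benchmark are placeholders for your own setup, and -mca coll_hcoll_enable 1 simply makes sure HCOLL is active (it usually is by default in HPC-X):
mpirun -np 64 --map-by ppr:32:node -mca coll_hcoll_enable 1 -x HCOLL_BCOL=basesmuma,ucx_p2p -x HCOLL_SBGP=basesmsocket,p2p <path-to-osu>/osu_allreduce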
Fragmentation (allreduce tuning)
To allow larger messages over SHARP use:
-x HCOLL_BCOL_P2P_ALLREDUCE_SHARP_MAX=4096 (default 256)
Fragmentation performance depends on the number of OSTs assigned to the group.
1. SHARP resources for SwitchIB-2
For allreduce messages larger than 256B (up to 4KB) you can fragment the message.
This is the fragment size:
-x SHARP_COLL_JOB_QUOTA_PAYLOAD_PER_OST=256 (SHARP fragment size)
The default is 128B.
2. OST - outstanding transactions. The default is 16.
One OST is one credit (after sending, the sender waits for completion); for higher performance you can enlarge it to 256.
-x SHARP_COLL_JOB_QUOTA_OSTS=256 (# of SHARP OSTs)
3. Number of outstanding communicators that can use SHARP.
The default is 8, which means that with the default of 16 OSTs each group gets 16/8 = 2 OSTs.
If the number of OSTs is 256, then each group gets 256/8 = 32 OSTs.
Each group gets #OSTs = (SHARP_COLL_JOB_QUOTA_OSTS / SHARP_COLL_JOB_QUOTA_MAX_GROUPS).
-x SHARP_COLL_JOB_QUOTA_MAX_GROUPS=8 (# of SHARP groups)
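Putting the three quotas together, the largest allreduce message that can be fragmented over SHARP for a single group works out to roughly SHARP_COLL_JOB_QUOTA_PAYLOAD_PER_OST * (SHARP_COLL_JOB_QUOTA_OSTS / SHARP_COLL_JOB_QUOTA_MAX_GROUPS), and HCOLL_BCOL_P2P_ALLREDUCE_SHARP_MAX should be kept consistent with it, as in the example below.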
For example, for osu_allreduce up to 4KB (256B payload * 16 OSTs per group) use the following:
-x HCOLL_BCOL_P2P_ALLREDUCE_SHARP_MAX=4096
-x SHARP_COLL_JOB_QUOTA_OSTS=128 (per job; 128/8 = 16 OSTs per group)
-x SHARP_COLL_JOB_QUOTA_PAYLOAD_PER_OST=256
Note: It is recommended to experiment with these parameters and compare the results against the non-SHARP allreduce tests.
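Putting it together, a sketch of a launch line for the fragmented-SHARP test above might look like this (the rank count, mapping and benchmark path are placeholders; HCOLL_ENABLE_SHARP=2 is shown on the assumption that SHARP is enabled the same way as in the earlier commands - check the exact enable flag against your HPC-X version):
mpirun -np 64 --map-by ppr:32:node -mca coll_hcoll_enable 1 -x HCOLL_ENABLE_SHARP=2 -x HCOLL_BCOL_P2P_ALLREDUCE_SHARP_MAX=4096 -x SHARP_COLL_JOB_QUOTA_OSTS=128 -x SHARP_COLL_JOB_QUOTA_PAYLOAD_PER_OST=256 <path-to-osu>/osu_allreduce
Running the same line with SHARP disabled (for example, by dropping the SHARP-related variables) gives the non-SHARP baseline to compare against.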