Note: In the tests above, the first command runs osu_barrier without HCOLL, the second runs it with HCOLL but without SHARP, and the third runs it with both HCOLL and SHARP.


Advanced Considerations


Multi-Channel 

1. When using full PPN (processes per node) on the node (e.g. 32 on dual-CPU Broadwell servers or 40 on Skylake servers), it is recommended to set multi-channel.

In the 3 level hierarchy:

First we will sub-group in the socket (e.g. 16 ranks), one group per socket. (-x HCOLL_SBGP=basesmsocket)

Second, the socket leaders form another group, which is a node-level group per NUMA (-x HCOLL_SBGP=basesmuma)

Third, there will be one node leader per node (the leader of the second group) - one rank per node. (-x HCOLL_SBGP=p2p)

The collective algorithm takes advantage of this hierarchy.

SBGP: sub grouping

The default is all three levels: -x HCOLL_SBGP=basesmsocket,basesmuma,p2p


In case of intra-socket noise (for example, when using full PPN), it is recommended to use a two-level hierarchy:

Two-level subgrouping omits the second level.

First we will sub-group in the socket (e.g. 16 ranks), one group per socket. (-x HCOLL_SBGP=basesmsocket)

Second, there will be one node leader per node (the leader of the second group) - one rank per node. (-x HCOLL_SBGP=p2p)


SBGP: sub grouping 

-x HCOLL_SBGP=basesmsocket,p2p


2. If HCOLL_SBGP is changed, HCOLL_BCOL (the communication channel) must be aligned with it.

For a two-level hierarchy, use:

-x HCOLL_BCOL=basesmuma,ucx_p2p


For a three-level hierarchy (default), use:

-x HCOLL_BCOL=basesmuma,basesmuma,ucx_p2p


basesmuma - uses shared memory for levels one and two.

ucx_p2p - used for communication outside the node at level three.


To summarize, use the following multi-channel settings for small-message allreduce (full PPN):

-x HCOLL_BCOL=basesmuma,ucx_p2p -x HCOLL_SBGP=basesmsocket,p2p


Note: If all ranks used are within one socket, this setting is not relevant.
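
The pairing of HCOLL_SBGP and HCOLL_BCOL described above can be sketched as a small shell helper. This is only an illustration: the function name hcoll_hierarchy_flags is hypothetical, the flag values are copied from this page, and the check that both lists have the same number of entries is an assumption inferred from the two examples rather than a documented rule. The helper only prints the flags; it does not call mpirun.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print matching HCOLL_BCOL / HCOLL_SBGP flags
# for a 2-level or 3-level hierarchy (values taken from this page).
hcoll_hierarchy_flags() {
    local levels="$1" sbgp bcol
    case "$levels" in
        2) sbgp="basesmsocket,p2p"
           bcol="basesmuma,ucx_p2p" ;;
        3) sbgp="basesmsocket,basesmuma,p2p"
           bcol="basesmuma,basesmuma,ucx_p2p" ;;
        *) echo "levels must be 2 or 3" >&2
           return 1 ;;
    esac

    # Assumption: one BCOL entry per SBGP level, as in both examples above.
    local -a sbgp_list bcol_list
    IFS=',' read -ra sbgp_list <<< "$sbgp"
    IFS=',' read -ra bcol_list <<< "$bcol"
    if [ "${#sbgp_list[@]}" -ne "${#bcol_list[@]}" ]; then
        echo "SBGP and BCOL lists differ in length" >&2
        return 1
    fi

    echo "-x HCOLL_BCOL=$bcol -x HCOLL_SBGP=$sbgp"
}

# Print the flags for the two-level hierarchy recommended for full PPN.
hcoll_hierarchy_flags 2
```

The printed string can be pasted into an existing mpirun command line; passing 3 prints the default three-level values instead.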


Fragmentation (allreduce tuning)

To allow larger messages over SHARP use:

-x HCOLL_BCOL_P2P_ALLREDUCE_SHARP_MAX=4096 (default 256)

Fragmentation performance depends on the number of OSTs assigned to the group.


1. SHARP resources for SwitchIB-2

For allreduce messages larger than 256B (up to 4KB), you can fragment the message.

This is the fragment size:

-x SHARP_COLL_JOB_QUOTA_PAYLOAD_PER_OST=256 (sharp fragment size)

The default is 128B.


2. OST - outstanding transactions. The default is 16.

One OST is one credit (after sending, the sender waits for completion). For higher performance, you can increase the number of OSTs to 256.


-x SHARP_COLL_JOB_QUOTA_OSTS=256 ( # sharp osts)


3. Number of outstanding communicators that can use SHARP.

The default is 8, which means that by default each group gets 16/8=2 OSTs.

If the number of OSTs is 256, then each group gets 256/8=32 OSTs.

Each group gets #osts = (SHARP_COLL_JOB_QUOTA_OSTS / SHARP_COLL_JOB_QUOTA_MAX_GROUPS).


-x SHARP_COLL_JOB_QUOTA_MAX_GROUPS=8 ( #sharp groups).
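
The per-group OST formula above can be checked with a short shell calculation. This is a minimal sketch: the function name osts_per_group is hypothetical, and it assumes plain integer division, which matches the two worked cases on this page.

```shell
#!/usr/bin/env bash
# Hypothetical helper: OSTs available to each SHARP group.
#   $1 = SHARP_COLL_JOB_QUOTA_OSTS
#   $2 = SHARP_COLL_JOB_QUOTA_MAX_GROUPS
osts_per_group() {
    local total_osts="$1" max_groups="$2"
    echo $(( total_osts / max_groups ))   # integer division
}

osts_per_group 16 8     # defaults from the text: prints 2
osts_per_group 256 8    # enlarged OST quota: prints 32
```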


For example, for osu_allreduce up to 4KB (256B payload * 16 OSTs per group), use the following:

-x HCOLL_BCOL_P2P_ALLREDUCE_SHARP_MAX=4096

-x SHARP_COLL_JOB_QUOTA_OSTS=128      (per job. We have 128/8=16 OSTs per group)

-x SHARP_COLL_JOB_QUOTA_PAYLOAD_PER_OST=256
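
The arithmetic behind this example can be sanity-checked before launching a job. The sketch below assumes, as the "256 payload * 16 OST per group" note suggests, that the largest message a group can cover is (OSTs per group) * (payload per OST). The helper name group_capacity_bytes and the shell variable names are hypothetical, and the group count of 8 is the default stated earlier.

```shell
#!/usr/bin/env bash
# Values copied from the example above; MAX_GROUPS is the default from the text.
SHARP_MAX=4096          # HCOLL_BCOL_P2P_ALLREDUCE_SHARP_MAX
JOB_OSTS=128            # SHARP_COLL_JOB_QUOTA_OSTS
PAYLOAD_PER_OST=256     # SHARP_COLL_JOB_QUOTA_PAYLOAD_PER_OST
MAX_GROUPS=8            # SHARP_COLL_JOB_QUOTA_MAX_GROUPS

# Hypothetical helper: bytes one group can cover,
# assumed to be (OSTs per group) * (payload per OST).
group_capacity_bytes() {
    local osts="$1" groups="$2" payload="$3"
    echo $(( (osts / groups) * payload ))
}

capacity=$(group_capacity_bytes "$JOB_OSTS" "$MAX_GROUPS" "$PAYLOAD_PER_OST")

# Compare the per-group capacity with the requested SHARP message limit.
if [ "$capacity" -ge "$SHARP_MAX" ]; then
    status="ok"
else
    status="too small"
fi
echo "per-group capacity: ${capacity} bytes, SHARP max: ${SHARP_MAX} bytes (${status})"
```

With the values from the example, the capacity is 16 * 256 = 4096 bytes, which matches the 4096-byte limit.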


Note: It is recommended to experiment with these parameters and compare the results against non-SHARP allreduce tests.