...
Note: in the above testing, the first command runs osu_barrier without HCOLL, the second runs it with HCOLL but without SHARP, and the third runs it with both HCOLL and SHARP.
Advanced Considerations
Multi-Channel
1. When using full PPN on the node (e.g. 32 ranks on a dual-CPU Broadwell server or 40 on a Skylake server) it is recommended to use multi-channel.
In the 3-level hierarchy:
First, we sub-group within each socket (e.g. 16 ranks), one group per socket. (-x HCOLL_SBGP=basesmsocket)
Second, the socket leaders form another group at the node (NUMA) level. (-x HCOLL_SBGP=basesmuma)
Third, there is one node leader per node (the leader of the second-level group) - one rank per node. (-x HCOLL_SBGP=p2p)
The collective algorithm takes advantage of this hierarchy.
SBGP: sub-grouping
The default is all 3 levels: -x HCOLL_SBGP=basesmsocket,basesmuma,p2p
In case of intra-socket noise (for example when using full PPN) it is recommended to use a 2-level hierarchy:
With 2-level sub-grouping, the second (NUMA) level is omitted.
First, we sub-group within each socket (e.g. 16 ranks), one group per socket. (-x HCOLL_SBGP=basesmsocket)
Second, the leader of each socket group joins the inter-node p2p group, so each socket provides its own channel out of the node. (-x HCOLL_SBGP=p2p)
The 2-level sub-grouping is therefore:
-x HCOLL_SBGP=basesmsocket,p2p
2. If HCOLL_SBGP is changed, HCOLL_BCOL (which selects the communication channel) must be aligned with it.
For the two-level hierarchy use:
-x HCOLL_BCOL=basesmuma,ucx_p2p
For the three-level hierarchy (the default) use:
-x HCOLL_BCOL=basesmuma,basesmuma,ucx_p2p
basesmuma - uses shared memory for levels one and two.
ucx_p2p - communication out of the node at the top (inter-node) level.
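For reference, spelling out the aligned pair for the default three-level hierarchy gives the following (these are the defaults, so they normally do not need to be set explicitly):
-x HCOLL_BCOL=basesmuma,basesmuma,ucx_p2p -x HCOLL_SBGP=basesmsocket,basesmuma,p2p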
To summarize, for multi-channel with small-message allreduce (full PPN), use:
-x HCOLL_BCOL=basesmuma,ucx_p2p -x HCOLL_SBGP=basesmsocket,p2p
Note: if all of the ranks fit within a single socket, this is not relevant.
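A complete launch line for this multi-channel configuration might look like the following sketch; the rank count, mapping and the path to the OSU benchmark are placeholders for your own setup, and -mca coll_hcoll_enable 1 simply makes sure HCOLL is active (it usually is by default in HPC-X):
mpirun -np 64 --map-by ppr:32:node -mca coll_hcoll_enable 1 -x HCOLL_BCOL=basesmuma,ucx_p2p -x HCOLL_SBGP=basesmsocket,p2p <path-to-osu>/osu_allreduce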
Fragmentation (allreduce tuning)
To allow larger messages over SHARP use:
-x HCOLL_BCOL_P2P_ALLREDUCE_SHARP_MAX=4096 (default 256)
Fragmentation performance depends on the number of OSTs assigned to the group.
1. SHARP resources for SwitchIB-2
For allreduce messages larger than 256B (up to 4KB) you can fragment the message.
This is the fragment size:
-x SHARP_COLL_JOB_QUOTA_PAYLOAD_PER_OST=256 (SHARP fragment size)
The default is 128B.
2. OST - outstanding transactions. The default is 16.
One OST is one credit (after sending, the sender waits for completion); for higher performance you can enlarge it to 256.
-x SHARP_COLL_JOB_QUOTA_OSTS=256 (# of SHARP OSTs)
3. Number of outstanding communicators that can use SHARP.
The default is 8, which means that with the default of 16 OSTs each group gets 16/8 = 2 OSTs.
If the number of OSTs is 256, then each group gets 256/8 = 32 OSTs.
Each group gets #OSTs = (SHARP_COLL_JOB_QUOTA_OSTS / SHARP_COLL_JOB_QUOTA_MAX_GROUPS).
-x SHARP_COLL_JOB_QUOTA_MAX_GROUPS=8 (# of SHARP groups)
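Putting the three quotas together, the largest allreduce message that can be fragmented over SHARP for a single group works out to roughly SHARP_COLL_JOB_QUOTA_PAYLOAD_PER_OST * (SHARP_COLL_JOB_QUOTA_OSTS / SHARP_COLL_JOB_QUOTA_MAX_GROUPS), and HCOLL_BCOL_P2P_ALLREDUCE_SHARP_MAX should be kept consistent with it, as in the example below.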
For example, for osu_allreduce up to 4KB (256B payload * 16 OSTs per group) use the following:
-x HCOLL_BCOL_P2P_ALLREDUCE_SHARP_MAX=4096
-x SHARP_COLL_JOB_QUOTA_OSTS=128 (per job; 128/8 = 16 OSTs per group)
-x SHARP_COLL_JOB_QUOTA_PAYLOAD_PER_OST=256
Note: It is recommended to experiment with these parameters and compare the results against the non-SHARP allreduce tests.
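Putting it together, a sketch of a launch line for the fragmented-SHARP test above might look like this (the rank count, mapping and benchmark path are placeholders; HCOLL_ENABLE_SHARP=2 is shown on the assumption that SHARP is enabled the same way as in the earlier commands - check the exact enable flag against your HPC-X version):
mpirun -np 64 --map-by ppr:32:node -mca coll_hcoll_enable 1 -x HCOLL_ENABLE_SHARP=2 -x HCOLL_BCOL_P2P_ALLREDUCE_SHARP_MAX=4096 -x SHARP_COLL_JOB_QUOTA_OSTS=128 -x SHARP_COLL_JOB_QUOTA_PAYLOAD_PER_OST=256 <path-to-osu>/osu_allreduce
Running the same line with SHARP disabled (for example, by dropping the SHARP-related variables) gives the non-SHARP baseline to compare against.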