Random Ring Effective Bandwidth (b_eff) with Adaptive Routing Analysis

The effective bandwidth (b_eff) benchmark measures the accumulated bandwidth of the communication network of a parallel and/or distributed computing system. It uses several message sizes, communication patterns, and methods. The algorithm averages the results to account for the fact that real applications transfer short and long messages at different bandwidths.
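As a rough illustration of why the averaging matters, the sketch below takes a plain arithmetic mean of bandwidth over a set of message sizes. The function name and the sample values are hypothetical placeholders, not measurements, and the benchmark's actual averaging scheme (described at the HLRS link further down) may differ from this simplification.

```python
def average_bandwidth(bandwidth_by_size):
    """Arithmetic mean of bandwidth over all measured message sizes.

    bandwidth_by_size maps a message size in bytes to a bandwidth value.
    This is an assumed simplification, not the benchmark's exact formula.
    """
    if not bandwidth_by_size:
        raise ValueError("need at least one measurement")
    return sum(bandwidth_by_size.values()) / len(bandwidth_by_size)


# Placeholder values only: small messages achieve far less bandwidth than
# large ones, so a single large-message figure would overstate the result.
samples = {1: 2.0, 1024: 800.0, 1048576: 11000.0}
print(average_bandwidth(samples))
```

A single peak-bandwidth number would hide the low bandwidth of short messages; averaging across sizes pulls the result toward what a mixed workload would see.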

In this post we review the value of adaptive routing for the b_eff benchmark.

Natural Ring and Random Ring Tests

The test generates several output tables; one of them is the random ring.

Random Ring Bandwidth – Creates a random communication ring: Rank i communicates with Rank j and Rank k, where j and k are randomly selected ranks
…
More overlaps between the routes
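The difference between the two ring patterns can be sketched as follows. This is a minimal illustration, not the benchmark's own code: the function names are hypothetical, and it assumes the random ring is built by shuffling the rank order and connecting each rank to its two neighbors in the shuffled order.

```python
import random


def natural_ring_neighbors(num_ranks):
    """Each rank talks to rank-1 and rank+1 (with wraparound)."""
    return {r: ((r - 1) % num_ranks, (r + 1) % num_ranks)
            for r in range(num_ranks)}


def random_ring_neighbors(num_ranks, seed=None):
    """Ring over a random permutation of the ranks (assumed construction)."""
    if num_ranks < 2:
        raise ValueError("a ring needs at least two ranks")
    rng = random.Random(seed)
    order = list(range(num_ranks))
    rng.shuffle(order)
    neighbors = {}
    for pos, rank in enumerate(order):
        left = order[(pos - 1) % num_ranks]
        right = order[(pos + 1) % num_ranks]
        neighbors[rank] = (left, right)
    return neighbors


if __name__ == "__main__":
    print(natural_ring_neighbors(8))
    print(random_ring_neighbors(8, seed=1))
```

In the natural ring, neighboring ranks are usually placed close together in the fabric, whereas in the random ring the two partners of a rank can sit anywhere, which is why the routes overlap more.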

For more details of the test, refer to https://fs.hlrs.de/projects/par/mpi/b_eff/

Adaptive Routing

Adaptive routing enables network status to be taken into consideration when choosing the route for a network packet, providing an opportunity for improved fabric utilization. Adaptive routing also enhances the RAS features of the overall system and is used to route around failed links and switches. When enabled, the leaf switch selects the egress port from among the best available routes, based on the load on each route.
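The port selection described above can be modeled with a toy example. This is only a conceptual sketch: the function name, the port numbers, and the load metric are assumptions for illustration, and it does not represent the actual switch hardware logic.

```python
def select_egress_port(candidate_ports, port_load):
    """Pick the least-loaded port among equally good candidate routes.

    candidate_ports: ports that all lie on a best route to the destination.
    port_load: mapping of port -> current load (hypothetical metric, e.g.
               queued bytes); ports missing from the map count as idle.
    """
    if not candidate_ports:
        raise ValueError("no candidate ports")
    return min(candidate_ports, key=lambda port: port_load.get(port, 0))


# Three uplinks reach the destination; port 2 is currently the least busy.
print(select_egress_port([1, 2, 3], {1: 5, 2: 1, 3: 7}))
```

With static routing the egress port for a destination is fixed in advance, so two heavy flows can land on the same uplink while others sit idle; choosing by current load is what lets the fabric spread overlapping routes.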

 

Cluster Configuration

Setup-1

 

Cluster Hardware:
Dual Socket Intel Xeon Platinum 8260L CPU @ 2.40GHz
Mellanox ConnectX-6 HDR InfiniBand
Mellanox Quantum Switch HDR InfiniBand
Memory: 192GB DDR4 2677MHz RDIMMs per node

Software:
OS: CentOS 7.7, MLNX_OFED 4.7-3
MPI: HPC-X 2.6.0, UCX 1.8

Setup-2

Cluster Hardware:
Dual Socket Intel Xeon Platinum 8280 CPU @ 2.60GHz
Mellanox ConnectX-6 HDR100 InfiniBand
Mellanox Quantum Switch HDR InfiniBand
Memory: 192GB DDR4 2677MHz RDIMMs per node

Software:
OS: CentOS 7.6, MLNX_OFED 4.6-1
MPI: HPC-X 2.6.0, UCX 1.8

Performance Analysis

1. With adaptive routing, the effective bandwidth measured on 32 nodes was 99% of the effective bandwidth measured on 2 nodes on Setup-1.

 

2. With adaptive routing, the effective bandwidth measured on 1024 nodes was 90% of the effective bandwidth measured on 2 nodes on Setup-2.
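The percentages above compare a large run against the 2-node baseline. A minimal sketch of that arithmetic follows, assuming both figures are the same kind of bandwidth value in the same units; the function name and the input numbers are placeholders, not the measured results.

```python
def relative_bandwidth_percent(bw_at_scale, bw_baseline):
    """Bandwidth at scale expressed as a percentage of the baseline run."""
    if bw_baseline <= 0:
        raise ValueError("baseline bandwidth must be positive")
    return 100.0 * bw_at_scale / bw_baseline


# Placeholder inputs: a value of 45.0 at scale against a 2-node baseline
# of 50.0 corresponds to 90% of the baseline bandwidth.
print(relative_bandwidth_percent(45.0, 50.0))
```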

 

References