How to set up IntelMPI over RoCEv2

Overview 

This post shows how to set up and run Intel MPI applications over RoCEv2 devices, using osu_bw as an example on a pair of nodes (jupiter[002-003]) with the RoCE device mlx5_1 exposed as the Ethernet interface "enp5s0f1".

Note: Before you start, make sure that QoS is configured on the network (e.g. Flow Control, PFC, or RoCE Congestion Control/ECN); a sketch of one common PFC setup follows below.
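
A minimal PFC sketch, assuming a Mellanox adapter with the MLNX_OFED mlnx_qos tool and that RoCE traffic is mapped to priority 3 (a common but site-specific choice; check how your fabric is actually configured):

$ # Assumption: MLNX_OFED is installed and priority 3 carries RoCE traffic on this fabric
$ sudo mlnx_qos -i enp5s0f1 --pfc 0,0,0,1,0,0,0,0
$ mlnx_qos -i enp5s0f1    # review the resulting QoS settings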

Configuration 

1. Identify the RoCE device name to configure with ibdev2netdev, then confirm with ibstat that its link layer is Ethernet (a sysfs-based alternative is sketched after the output).

$ ibdev2netdev 
mlx5_1 port 1 ==> enp5s0f1 (Up)
$ ibstat mlx5_1
CA 'mlx5_1'
        CA type: MT4115
        Number of ports: 1
        Firmware version: 12.20.1010
        Hardware version: 0
        Node GUID: 0x7cfe9003005d7e53
        System image GUID: 0x7cfe9003005d7e52
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 100
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x04010000
                Port GUID: 0x7efe90fffe5d7e53
                Link layer: Ethernet

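If ibdev2netdev is not available, the same check can be done through the kernel's RDMA sysfs tree; a sketch, assuming the standard /sys/class/infiniband layout:

$ # A RoCE port reports "Ethernet" here; an InfiniBand port reports "InfiniBand"
$ cat /sys/class/infiniband/mlx5_1/ports/1/link_layer
Ethernet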

2. With cma_roce_mode, switch the RDMA CM default from "IB/RoCE v1" to "RoCE v2" mode and confirm the change (Ref. 1); a sysfs check is sketched after the output.

$ sudo cma_roce_mode -d mlx5_1 -p 1
IB/RoCE v1
$ sudo cma_roce_mode -d mlx5_1 -p 1 -m 2
RoCE v2
$ sudo cma_roce_mode -d mlx5_1 -p 1
RoCE v2

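The per-GID RoCE type can also be read back from sysfs; a sketch, assuming a kernel that exposes gid_attrs (GID index 1 is commonly the RoCE v2 entry on mlx5 devices, but indices vary by setup):

$ # Each GID table entry reports either "IB/RoCE v1" or "RoCE v2"
$ cat /sys/class/infiniband/mlx5_1/ports/1/gid_attrs/types/1
RoCE v2
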
3. Prepare a DAPL user-level DAT RDMA provider file (dat.conf) with the following content. Note you will need to replace the Ethernet device name "enp5s0f1" with your actual device name; a small generator sketch follows the example.

$ cat dat.conf
ofa-v2-cma-roe-enp5s0f1 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "enp5s0f1 0" ""

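A hedged convenience sketch that generates the same one-line dat.conf for whatever interface you are using (DEV is just a placeholder variable introduced here):

$ DEV=enp5s0f1    # replace with your Ethernet device name
$ cat > dat.conf <<EOF
ofa-v2-cma-roe-${DEV} u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "${DEV} 0" ""
EOF
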
4. Run the OSU bandwidth benchmark (osu_bw) with the DAPL fabric and the provider prepared above.

$ module load osu/5.4-impi-2018.1.163-gcc-4.8.5
$ mpirun -n 2 -ppn 1 -hosts jupiter002,jupiter003 -genv I_MPI_DEBUG 4 -genv I_MPI_FALLBACK 0 -genv I_MPI_FABRICS shm:dapl -genv DAT_OVERRIDE ./dat.conf -genv I_MPI_DAT_LIBRARY /usr/lib64/libdat2.so -genv I_MPI_DAPL_PROVIDER ofa-v2-cma-roe-enp5s0f1 osu_bw

Note: it is important to use -genv I_MPI_FABRICS shm:dapl together with -genv I_MPI_FALLBACK 0, not just -dapl; this guarantees that no fabric fallback happens. Using -dapl alone allows Intel MPI to fall back to another DAPL-capable device. An equivalent invocation using exported environment variables is sketched below.

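Equivalently, the same settings can be exported once as Intel MPI environment variables and picked up by mpirun; a sketch under the same assumptions (Intel MPI propagates the environment to remote ranks by default; if yours does not, keep the -genv form):

$ export I_MPI_FABRICS=shm:dapl
$ export I_MPI_FALLBACK=0
$ export DAT_OVERRIDE=$PWD/dat.conf
$ export I_MPI_DAT_LIBRARY=/usr/lib64/libdat2.so
$ export I_MPI_DAPL_PROVIDER=ofa-v2-cma-roe-enp5s0f1
$ export I_MPI_DEBUG=4
$ mpirun -n 2 -ppn 1 -hosts jupiter002,jupiter003 osu_bw
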
5. To confirm that traffic is going over RoCEv2, follow the guidance in Ref. 2; a tcpdump filter sketch is shown below.
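
As a quick sanity check (a sketch only; it assumes the capture/sniffer interface from Ref. 2 is already enabled, since offloaded RDMA traffic is not visible to a plain tcpdump on the regular Ethernet netdev), RoCEv2 frames can be isolated by their IANA-assigned UDP destination port 4791:

$ # <capture-interface> is the sniffer interface set up per Ref. 2
$ sudo tcpdump -i <capture-interface> -nn 'udp dst port 4791' -c 20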

References

  1. How To Set the Default RoCE Mode When Using RDMA CM
  2. How-To Dump RDMA traffic Using the Inbox tcpdump tool (ConnectX-4)