How to set up Intel MPI over RoCEv2
This post shows how to set up and run applications with Intel MPI over RoCEv2 devices, using osu_bw as the example. The test setup is a pair of nodes (jupiter[002-003]), each with the RoCEv2 device mlx5_1 and the Ethernet device name "enp5s0f1".
Note: Before you start, make sure that QoS is configured on the network (e.g. Flow Control, PFC or RoCC Congestion Control/ECN).
1. Use ibdev2netdev to find the RoCE device and its Ethernet device name, then use ibstat to confirm that the link layer is Ethernet.
$ ibdev2netdev
mlx5_1 port 1 ==> enp5s0f1 (Up)
$ ibstat mlx5_1
CA 'mlx5_1'
    CA type: MT4115
    Number of ports: 1
    Firmware version: 12.20.1010
    Hardware version: 0
    Node GUID: 0x7cfe9003005d7e53
    System image GUID: 0x7cfe9003005d7e52
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 100
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x04010000
        Port GUID: 0x7efe90fffe5d7e53
        Link layer: Ethernet
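The two checks in step 1 can be scripted. The following is a minimal sketch, assuming the output layouts shown above; the helper names netdev_for_ibdev and is_ethernet_link_layer are hypothetical and not part of any Mellanox or Intel tool. Both helpers read text from stdin, so they can be tried on saved output as well as on a live node.

```shell
# Print the net device that an ibdev2netdev-style listing maps to an RDMA device.
# Expects lines such as "mlx5_1 port 1 ==> enp5s0f1 (Up)" on stdin.
netdev_for_ibdev() {
    awk -v dev="$1" '$1 == dev && $4 == "==>" { print $5; exit }'
}

# Succeed only if ibstat-style text on stdin reports "Link layer: Ethernet".
is_ethernet_link_layer() {
    grep -Eq 'Link layer:[[:space:]]*Ethernet'
}
```

On a node, this might be used as `ibdev2netdev | netdev_for_ibdev mlx5_1` and `ibstat mlx5_1 | is_ethernet_link_layer && echo "link layer OK"`.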
2. Use cma_roce_mode to check the current mode, switch the port from "IB/RoCE v1" to "RoCE v2" if needed, and confirm the change (Ref. 1).
$ sudo cma_roce_mode -d mlx5_1 -p 1
IB/RoCE v1
$ sudo cma_roce_mode -d mlx5_1 -p 1 -m 2
RoCE v2
$ sudo cma_roce_mode -d mlx5_1 -p 1
RoCE v2
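When the same check has to run on several nodes, the final query in step 2 can be wrapped in a small guard. This is only a sketch: is_rocev2_mode is a hypothetical helper name, and it assumes cma_roce_mode prints exactly the mode strings shown in step 2.

```shell
# Succeed only if the text on stdin is exactly the "RoCE v2" mode string
# that cma_roce_mode prints when queried without -m.
is_rocev2_mode() {
    mode=$(cat)
    [ "$mode" = "RoCE v2" ]
}
```

For example: `sudo cma_roce_mode -d mlx5_1 -p 1 | is_rocev2_mode || echo "mlx5_1 port 1 is not in RoCE v2 mode" >&2`.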
3. Create a dat.conf file that defines the DAPL user-level DAT RDMA provider, with the following content. Replace the Ethernet device name "enp5s0f1" with your actual device name.
$ cat dat.conf
ofa-v2-cma-roe-enp5s0f1 u2.0 nonthreadsafe default dapl.2.0 "enp5s0f1 0" ""
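The device name appears twice in that entry: once in the provider name and once in the quoted parameter string. One way to swap both at once is sketched below. retarget_dat_conf and its parameter names are hypothetical, and the sketch assumes the device names contain no characters that are special to sed.

```shell
# Rewrite every occurrence of one Ethernet device name in a dat.conf-style
# file read from stdin, printing the result to stdout.
# old_dev and new_dev are illustrative parameter names for this sketch.
retarget_dat_conf() {
    old_dev=$1
    new_dev=$2
    sed "s/${old_dev}/${new_dev}/g"
}
```

For example, `retarget_dat_conf enp5s0f1 eth2 < dat.conf > dat.conf.eth2` would produce an entry named ofa-v2-cma-roe-eth2. The provider name changes along with the device name, so the value passed to I_MPI_DAPL_PROVIDER in the next step has to change to match.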
4. Run the OSU bandwidth benchmark (osu_bw) with the DAPL fabric and the provider prepared above.
$ module load osu/5.4-impi-2018.1.163-gcc-4.8.5
$ mpirun -n 2 -ppn 1 -hosts jupiter002,jupiter003 \
    -genv I_MPI_DEBUG 4 \
    -genv I_MPI_FALLBACK 0 \
    -genv I_MPI_FABRICS shm:dapl \
    -genv DAT_OVERRIDE ./dat.conf \
    -genv I_MPI_DAT_LIBRARY /usr/lib64/ \
    -genv I_MPI_DAPL_PROVIDER ofa-v2-cma-roe-enp5s0f1 \
    osu_bw
Note: It is important to use -genv I_MPI_FABRICS shm:dapl together with -genv I_MPI_FALLBACK 0, and not just -dapl. These two settings guarantee that no fabric fallback will happen. If you simply use -dapl, the fabric is allowed to fall back to another DAPL-capable device.
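To reuse the step 4 command with other host pairs or providers, it can be assembled from variables. The sketch below only prints the command line and does not execute it. build_osu_bw_cmd is a hypothetical helper, the option values are copied from step 4, and every environment variable is passed in the two-argument `-genv NAME value` form.

```shell
# Print the mpirun command line from step 4 for a given host list,
# dat.conf path, and DAPL provider name. Nothing is executed here.
build_osu_bw_cmd() {
    hosts=$1      # e.g. jupiter002,jupiter003
    dat_conf=$2   # e.g. ./dat.conf
    provider=$3   # e.g. ofa-v2-cma-roe-enp5s0f1
    printf '%s ' mpirun -n 2 -ppn 1 -hosts "$hosts" \
        -genv I_MPI_DEBUG 4 \
        -genv I_MPI_FALLBACK 0 \
        -genv I_MPI_FABRICS shm:dapl \
        -genv DAT_OVERRIDE "$dat_conf" \
        -genv I_MPI_DAT_LIBRARY /usr/lib64/ \
        -genv I_MPI_DAPL_PROVIDER "$provider"
    printf '%s\n' osu_bw
}
```

Calling `build_osu_bw_cmd jupiter002,jupiter003 ./dat.conf ofa-v2-cma-roe-enp5s0f1` prints the command so it can be reviewed before it is run. Keeping I_MPI_FALLBACK 0 and I_MPI_FABRICS shm:dapl fixed in the helper means the no-fallback behavior described in the note cannot be dropped by accident.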
5. To confirm that traffic is going through RoCEv2, follow the guidance in Ref. 2.
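As a rough complementary check, and not the procedure from Ref. 2, the RDMA port counters can be compared before and after a run. The sketch below assumes the counter is exposed at /sys/class/infiniband/mlx5_1/ports/1/counters/port_xmit_data on the local node; counter_delta is a hypothetical helper. A growing counter only shows that RDMA traffic left the port. It does not by itself distinguish RoCE v1 from RoCE v2.

```shell
# Print how much a sysfs counter file grew while a command ran.
# $1 is the counter file; the remaining arguments are the command to run.
counter_delta() {
    counter_file=$1
    shift
    before=$(cat "$counter_file")
    "$@" >/dev/null
    after=$(cat "$counter_file")
    echo $((after - before))
}
```

For example: `counter_delta /sys/class/infiniband/mlx5_1/ports/1/counters/port_xmit_data mpirun ...`, with the full argument list from step 4 in place of the dots. The port_xmit_data counter is reported in units of four bytes, so multiply the delta by 4 to get octets.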