Getting Started with InfiniBand QoS
Getting Started with Basic QoS Test (Strict Priority)
Before you start, make sure you understand the concepts; see Understanding Basic InfiniBand QoS.
For a basic test, you can have two hosts connected via an InfiniBand switch, sending RDMA traffic on two different service levels:
SL 0 - to be used for best effort traffic
SL 1 - to be used for high priority traffic
In this test we use two CPU cores (core 0 and core 1), each running ib_write_bw on a different SL; we expect the high priority traffic to reach maximum performance while the best effort traffic is starved.
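Before changing anything, it can help to confirm the HCA name, port state, rate, and LID on each host. A minimal check (mlx5_0 and port 1 below match the commands used throughout this guide; adjust for your setup):
# Show state, rate and Base lid of port 1 - the LID is used by smpquery/perfquery below
$ ibstat mlx5_0 1
# List the local port GUIDs - handy for the "opensm -g <guid>" argument used later
$ ibstat -p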
Configuration
Check the SL to VL mapping using smpquery sl2vl (addressed by LID, for example):
$ sudo smpquery sl2vl -L 141
# SL2VL table: Lid 141
# SL: | 0| 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15|
ports: in 0, out 0: | 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0|
To change the mapping configuration, edit the OpenSM config file:
$ diff /etc/opensm/opensm.conf /etc/opensm/opensm.conf.orig
< qos TRUE
> qos FALSE
< qos_max_vls 2
< qos_high_limit 255
< qos_vlarb_high 1:192
< qos_vlarb_low 0:64
< qos_sl2vl 0,1
> qos_max_vls 0
> qos_high_limit -1
> qos_vlarb_high (null)
> qos_vlarb_low (null)
> qos_sl2vl (null)
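For reference, after this change the QoS-related part of /etc/opensm/opensm.conf would look roughly as follows (a sketch reconstructed from the diff above; the leading comments are explanatory):
# Strict priority: high-priority table limit of 255 means it is unbounded,
# so VL1 always wins over VL0 while it has traffic to send
qos TRUE
qos_max_vls 2
qos_high_limit 255
qos_vlarb_high 1:192
qos_vlarb_low 0:64
qos_sl2vl 0,1
qos_sl2vl lists one VL per SL starting from SL 0; listing only "0,1" triggers the "< 16 VLs listed" warning seen in the OpenSM output, and the remaining SLs stay on VL 0, as the SL2VL query below shows.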
Start OpenSM:
$ sudo opensm -g 0x98039b03009fcfd6 -F /etc/opensm/opensm.conf -B
-------------------------------------------------
OpenSM 5.4.0.MLNX20190422.ed81811
Config file is `/etc/opensm/opensm.conf`:
Reading Cached Option File: /etc/opensm/opensm.conf
Loading Cached Option:qos = TRUE
Loading Changed QoS Cached Option:qos_max_vls = 2
Loading Changed QoS Cached Option:qos_high_limit = 255
Loading Changed QoS Cached Option:qos_vlarb_low = 0:64
Loading Changed QoS Cached Option:qos_vlarb_high = 1:192
Loading Changed QoS Cached Option:qos_sl2vl = 0,1
Warning: Cached Option qos_sl2vl: < 16 VLs listed
Command Line Arguments:
Guid <0x98039b03009fcfd6>
Daemon mode
Log File: /var/log/opensm.log
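To verify that this OpenSM instance became the master SM for the fabric, you can query the subnet manager info (sminfo ships with infiniband-diags, alongside smpquery and perfquery):
$ sudo sminfo
The output should show the SM's LID and GUID with state SMINFO_MASTER.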
Check the SL2VL mapping table:
$ sudo smpquery sl2vl -L 141
# SL2VL table: Lid 141
# SL: | 0| 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15|
ports: in 0, out 0: | 0| 1| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0|
Check the VL arbitration tables:
$ sudo smpquery vlarb 141
# VLArbitration tables: Lid 141 port 0 LowCap 8 HighCap 8
# Low priority VL Arbitration Table:
VL : |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |
WEIGHT: |0x40|0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |
# High priority VL Arbitration Table:
VL : |0x1 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |
WEIGHT: |0xC0|0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |
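Note that smpquery reports the weights in hex: 0x40 and 0xC0 are the 64 and 192 configured in qos_vlarb_low and qos_vlarb_high above.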
Check the SL port counters:
$ sudo perfquery -X 141 1
# PortXmitDataSL counters: Lid 141 port 1
PortSelect:......................1
CounterSelect:...................0x0000
XmtDataSL0:......................3677098498
XmtDataSL1:......................2771713603
XmtDataSL2:......................0
XmtDataSL3:......................0
XmtDataSL4:......................0
XmtDataSL5:......................0
XmtDataSL6:......................0
XmtDataSL7:......................0
XmtDataSL8:......................0
XmtDataSL9:......................0
XmtDataSL10:.....................0
XmtDataSL11:.....................0
XmtDataSL12:.....................0
XmtDataSL13:.....................0
XmtDataSL14:.....................0
XmtDataSL15:.....................0
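During a run, sampling these counters repeatedly shows which SL the traffic is actually leaving on; one simple way to do this (LID 141 and port 1 as above):
# Sample the per-SL transmit counters once per second during the test
$ watch -n 1 'sudo perfquery -X 141 1 | grep -E "XmtDataSL[01]:"'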
Run RDMA Traffic
Low priority traffic (core 0, SL 0):
$ numactl --cpunodebind=0 ib_write_bw -d mlx5_0 -i 1 --report_gbits -F --sl=0 -D 10
High priority traffic (core 1, SL 1):
$ numactl --cpunodebind=1 ib_write_bw -d mlx5_0 -i 1 --report_gbits -F --sl=1 -D 10
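Note that ib_write_bw runs as a client/server pair: one host runs the command without a hostname (the listener), and the second host runs the same command with the first host's name appended. A possible layout, assuming hosts host1/host2 and using -p only to keep the two connection-setup sockets apart (hostnames and port numbers are illustrative):
# On host1 (server side) - one listener per SL:
$ numactl --cpunodebind=0 ib_write_bw -d mlx5_0 -i 1 --report_gbits -F --sl=0 -D 10 -p 18515
$ numactl --cpunodebind=1 ib_write_bw -d mlx5_0 -i 1 --report_gbits -F --sl=1 -D 10 -p 18516
# On host2 (client side) - connect to the matching ports on host1:
$ numactl --cpunodebind=0 ib_write_bw -d mlx5_0 -i 1 --report_gbits -F --sl=0 -D 10 -p 18515 host1
$ numactl --cpunodebind=1 ib_write_bw -d mlx5_0 -i 1 --report_gbits -F --sl=1 -D 10 -p 18516 host1
The same pairing applies to the WRR test below.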
Make sure you get 0 Gb/s on SL 0 (no packets can be sent; with qos_high_limit set to 255 the high priority table is unbounded, so low priority traffic is completely starved while SL 1 is active):
$ numactl --cpunodebind=0 ib_write_bw -d mlx5_0 -i 1 --report_gbits -F --sl=0 -D 10
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : mlx5_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
CQ Moderation : 100
Mtu : 4096[B]
Link type : IB
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0x8f QPN 0xdd16 PSN 0x25f4a4 RKey 0x0e1848 VAddr 0x002b65b2130000
remote address: LID 0x8d QPN 0x02c6 PSN 0xdb2c00 RKey 0x17d997 VAddr 0x002b8263ed0000
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps]
65536 0 0.000000 0.000000 0.000000
---------------------------------------------------------------------------------------
Make sure you get close to full link speed (~100 Gb/s here) on SL 1:
$ numactl --cpunodebind=1 ib_write_bw -d mlx5_0 -i 1 --report_gbits -F --sl=1 -D 10
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : mlx5_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
CQ Moderation : 100
Mtu : 4096[B]
Link type : IB
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps]
65536 1104900 0.00 96.55 0.184149
---------------------------------------------------------------------------------------
Getting Started with Basic QoS Test (WRR)
Before you start, make sure you understand the concepts; see Understanding Basic InfiniBand QoS.
The Weighted Round Robin (WRR) arbiter allows splitting the available bandwidth between high and low priority traffic, without the starvation that can occur in the strict priority example. With WRR, you can give each SL a different weight, and the arbiter performs weighted round robin between them.
For this basic test, you can have two hosts connected via an InfiniBand switch, sending RDMA traffic on two different service levels:
SL 0 - to be used for best effort traffic (1/4 of the traffic in this example)
SL 1 - to be used for high priority traffic (3/4 of the traffic in this example)
In this test we use two CPU cores (core 0 and core 1), each running ib_write_bw on a different SL; we expect the high priority traffic to reach about 3/4 of the link speed.
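The 1/4 and 3/4 figures follow from the arbitration weights configured below: VL0 (SL 0) is given weight 64 and VL1 (SL 1) weight 192, so when both streams saturate the link, SL 0 ends up with roughly 64/(64+192) = 1/4 of the bandwidth and SL 1 with 192/(64+192) = 3/4, as the results confirm.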
Configuration
Check the SL to VL mapping using smpquery sl2vl (addressed by LID, for example):
$ sudo smpquery sl2vl -L 141
# SL2VL table: Lid 141
# SL: | 0| 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15|
ports: in 0, out 0: | 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0|
To change the mapping configuration, edit the OpenSM config file:
$ diff /etc/opensm/opensm.conf /etc/opensm/opensm.conf.orig
< qos TRUE
> qos FALSE
< qos_max_vls 2
< qos_high_limit -1
< qos_vlarb_high 1:192
< qos_vlarb_low 0:64
< qos_sl2vl 0,1
> qos_max_vls 0
> qos_high_limit -1
> qos_vlarb_high (null)
> qos_vlarb_low (null)
> qos_sl2vl (null)
Start OpenSM:
$ sudo opensm -g 0x98039b03009fcfd6 -F /etc/opensm/opensm.conf -B
-------------------------------------------------
OpenSM 5.4.0.MLNX20190422.ed81811
Config file is `/etc/opensm/opensm.conf`:
Reading Cached Option File: /etc/opensm/opensm.conf
Loading Cached Option:qos = TRUE
Loading Changed QoS Cached Option:qos_max_vls = 2
Loading Changed QoS Cached Option:qos_vlarb_low = 0:64
Loading Changed QoS Cached Option:qos_vlarb_high = 1:192
Loading Changed QoS Cached Option:qos_sl2vl = 0,1
Warning: Cached Option qos_sl2vl: < 16 VLs listed
Command Line Arguments:
Guid <0x98039b03009fcfd6>
Daemon mode
Log File: /var/log/opensm.log
Check the SL2VL mapping table:
$ sudo smpquery sl2vl -L 141
# SL2VL table: Lid 141
# SL: | 0| 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15|
ports: in 0, out 0: | 0| 1| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0|
Check the VL arbitration tables:
$ sudo smpquery vlarb 141
# VLArbitration tables: Lid 141 port 0 LowCap 8 HighCap 8
# Low priority VL Arbitration Table:
VL : |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |
WEIGHT: |0x40|0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |
# High priority VL Arbitration Table:
VL : |0x1 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |
WEIGHT: |0xC0|0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |
Check the SL port counters:
$ sudo perfquery -X 141 1
# PortXmitDataSL counters: Lid 141 port 1
PortSelect:......................1
CounterSelect:...................0x0000
XmtDataSL0:......................3677098498
XmtDataSL1:......................2771713603
XmtDataSL2:......................0
XmtDataSL3:......................0
XmtDataSL4:......................0
XmtDataSL5:......................0
XmtDataSL6:......................0
XmtDataSL7:......................0
XmtDataSL8:......................0
XmtDataSL9:......................0
XmtDataSL10:.....................0
XmtDataSL11:.....................0
XmtDataSL12:.....................0
XmtDataSL13:.....................0
XmtDataSL14:.....................0
XmtDataSL15:.....................0
Run RDMA Traffic
Low priority traffic (core 0, SL 0):
$ numactl --cpunodebind=0 ib_write_bw -d mlx5_0 -i 1 --report_gbits -F --sl=0 -D 10
High priority traffic (core 1, SL 1):
$ numactl --cpunodebind=1 ib_write_bw -d mlx5_0 -i 1 --report_gbits -F --sl=1 -D 10
Make sure you get ~1/4 of the link speed on SL 0:
$ numactl --cpunodebind=0 ib_write_bw -d mlx5_0 -i 1 --report_gbits -F --sl=0 -D 10
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : mlx5_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
CQ Moderation : 100
Mtu : 4096[B]
Link type : IB
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0x8f QPN 0xdd16 PSN 0x25f4a4 RKey 0x0e1848 VAddr 0x002b65b2130000
remote address: LID 0x8d QPN 0x02c6 PSN 0xdb2c00 RKey 0x17d997 VAddr 0x002b8263ed0000
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps]
65536 276300 0.00 24.14 0.046050
---------------------------------------------------------------------------------------
Make sure you get close to 3/4 of the link speed on SL 1:
$ numactl --cpunodebind=1 ib_write_bw -d mlx5_0 -i 1 --report_gbits -F --sl=1 -D 10
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : mlx5_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
CQ Moderation : 100
Mtu : 4096[B]
Link type : IB
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps]
65536 828900 0.00 72.43 0.138149
---------------------------------------------------------------------------------------
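As a sanity check, 24.14 + 72.43 ≈ 96.6 Gb/s, about the same aggregate bandwidth as in the strict priority test, now split roughly 1:3 between SL 0 and SL 1, in line with the 64:192 arbitration weights.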
Useful Commands
Kill and restart OpenSM:
$ sudo kill $(ps -ef | grep opensm | grep root | awk '{print $2}') ; sudo opensm -g 0x98039b03009fcfd6 -F /etc/opensm/opensm.conf -B
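OpenSM runs as a daemon here (-B), so its messages go to the log file; tailing it is a quick way to confirm the QoS options were picked up after a restart:
$ sudo tail -f /var/log/opensm.log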
References
https://wiki.whamcloud.com/pages/viewpage.action?pageId=72713941
https://lustre-discuss.lustre.narkive.com/FBoojkAs/infiniband-qos-with-lustre-ko2iblnd
http://www.mellanox.com/pdf/whitepapers/deploying_qos_wp_10_19_2005.pdf