Understanding Tag Matching Offload on ConnectX-5 Adapters

Hardware Tag Matching (TM) is a technology, available only on Mellanox ConnectX-5 adapters, that offloads the processing of point-to-point MPI messages from the host machine onto the adapter card. It enables zero-copy delivery of MPI messages: messages are written directly to the user's buffer without intermediate buffering or copies. It also lets the adapter progress the rendezvous protocol to completion on its own. This overlap capability allows the CPU to proceed with computation while the remote data is gathered by the adapter.

The following configuration recommendations apply to HPC-X v2.1:

  • TM offload is supported for rc, rc_x, dc and dc_x transports.
  • TM offload is disabled by default.
  • To enable TM for all transports, use UCX_RC_TM_ENABLE.
  • MLNX_OFED_LINUX-4.1-4.1.1.0 or higher is required for TM with DC; earlier MLNX_OFED versions only support RC TM.
  • The command "ucx_info -f" prints out a full list of environment variables that can be used to modify UCX behavior.
  • For advanced users, there are three TM-offload-related variables that may help in tuning an application that shows no improvement with offload enabled:
    • UCX_TM_THRESH: The threshold for using TM offload. Messages smaller than this value are handled in software. The default value is 1024 bytes, because TM offload incurs noticeable overhead that is better avoided for small messages.
    • UCX_TM_MAX_BCOPY: The maximum message size for the bounce-buffer optimization. For messages that are larger than UCX_TM_THRESH but smaller than this value, UCP offloads an internal preregistered buffer instead of the user buffer; when a message arrives, it is copied from that buffer into the user buffer. The default value is 1024 bytes, the same as UCX_TM_THRESH, which means that this optimization is disabled by default.
    • UCX_TM_FORCE_THRESH: The threshold for forcing tag matching offload mode. To preserve message ordering, UCP does not offload any message while a non-offloaded receive operation is still uncompleted (for instance, a receive smaller than UCX_TM_THRESH). However, when a receive operation whose buffer is larger than UCX_TM_FORCE_THRESH is posted, UCP tries to offload all uncompleted receive operations together with the one being processed.
      If the application sends many relatively small messages, it makes sense to set UCX_TM_THRESH to a small value (say, 8). More messages are then offloaded, and the overhead of offloading small messages is offset by the bounce-buffer optimization.
      Depending on the application's communication pattern, adjusting UCX_TM_FORCE_THRESH may help as well.
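
The settings above can be sketched as a small shell fragment. This is an illustration under stated assumptions, not a recommended configuration: the value "y" for UCX_RC_TM_ENABLE is assumed to be accepted as the boolean "on" value, the threshold values only mirror the small-message example in the text, tm_path is a hypothetical helper that models the two size thresholds (it is not part of UCX and ignores UCX_TM_FORCE_THRESH and the ordering rule), and the mpirun line assumes the Open MPI launcher with a placeholder application name.

```shell
#!/bin/sh
# Sketch: enable TM offload and apply the small-message tuning described above.
# All values are illustrative assumptions.

export UCX_RC_TM_ENABLE=y      # assumption: "y" is accepted as the "on" value
export UCX_TM_THRESH=8         # messages smaller than 8 bytes stay in software
export UCX_TM_MAX_BCOPY=1024   # bounce buffers are used below this size (the default)

# Hypothetical reading aid: report how a receive of the given size (in bytes)
# would be handled according to the two thresholds. Handling of sizes exactly
# equal to a threshold is an assumption; this is not UCX's actual logic.
tm_path() {
    if [ "$1" -lt "$UCX_TM_THRESH" ]; then
        echo "software"
    elif [ "$1" -lt "$UCX_TM_MAX_BCOPY" ]; then
        echo "offload, bounce buffer"
    else
        echo "offload, zero copy"
    fi
}

# List the TM-related variables known to the installed UCX, if ucx_info exists.
if command -v ucx_info >/dev/null 2>&1; then
    ucx_info -f | grep 'TM_' || true
fi

# Hypothetical launch line; ./my_mpi_app and the rank count are placeholders.
# mpirun -np 2 -mca pml ucx -x UCX_RC_TM_ENABLE -x UCX_TM_THRESH -x UCX_TM_MAX_BCOPY ./my_mpi_app
```

With these values, calling tm_path with 4, 64, and 4096 shows the three cases in turn: software handling, offload through a bounce buffer, and offload directly into the user buffer. With both thresholds left at the default of 1024 bytes, the middle case never occurs, which matches the statement that the bounce-buffer optimization is disabled by default.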


For an example of an application that uses TM and benefits from it, see "HPC-X 2.0 Boosts Performance of Grid Benchmark".


Additional references for Tag Matching can be found here.