This article discusses a best-known method (BKM) for using the libfabric1 infrastructure on a cluster computing system. The focus is on how to transition from the Open Fabrics Alliance (OFA) framework to the Open Fabrics Interfaces1 (OFI) framework, and a description of the fabric providers that support libfabric is given. In general, the OFA to OFI transition has a charter to make the Intel® MPI Library software layer lighter, with most of the network communication controls being shifted to a lower level (for example, the OFI provider level). For more information, see the Libfabric OpenFabrics site.1 The reader should note that the following information is based on the Open Fabrics Interfaces Working Group, and hence this document is heavily annotated with citations to the URL https://github.com/ofiwg/libfabric so that the reader can obtain more detailed information when needed.
What is libfabric?
The Open Fabrics Interfaces1 (OFI) is a framework focused on exporting fabric communication services to applications.
See the OFI web site for more details. This URL includes a description and overview of the project and detailed documentation for the libfabric APIs.
Building and installing libfabric from the source
Distribution tar packages are available from the GitHub* releases tab.1 If you are building libfabric from a developer Git clone, you must first run the autogen.sh script. This invokes the GNU* Autotools to bootstrap libfabric's configuration and build mechanisms. If you are building libfabric from an official distribution tarball, you do not need to run autogen.sh; distribution tarballs are already bootstrapped for you.
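For example, a build from a developer Git clone might proceed as follows. This is a minimal sketch: the installation prefix /opt/libfabric and the parallel make job count are placeholders to adjust for your system.
$ git clone https://github.com/ofiwg/libfabric.git
$ cd libfabric
$ ./autogen.sh
$ ./configure --prefix=/opt/libfabric
$ make -j 8
$ sudo make install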
Libfabric currently supports GNU/Linux*, FreeBSD*, and OS X*. Although OS X* is mentioned here, the Intel® MPI Library does not support OS X.
Configuration options1
The configure script has many built-in command-line options. The reader should issue the command:
./configure --help
to view those options. Some useful configuration switches are:
--prefix=<directory>
Throughout this article, <directory> should be interpreted as a meta-symbol for the actual directory path that is to be supplied by the user. By default, make install places the files in the /usr directory tree. If the --prefix option is used, it indicates that libfabric files should be installed into the directory tree specified by <directory>. The executables that are built from the configure command will be placed into <directory>/bin.
--with-valgrind=<directory>
The meta-symbol <directory> is the directory where valgrind is installed. If valgrind is found, valgrind annotations are enabled. This may incur a performance penalty.
--enable-debug
Enable debug code paths. This enables various extra checks and allows for using the highest verbosity logging output that is normally compiled out in production builds.
--enable-<provider>=[yes | no | auto | dl | <directory>]
--disable-<provider>
This enables or disables the fabric provider referenced by the meta-symbol <provider>. Valid options are:
- auto (the default if the --enable-<provider> option is not specified): The provider will be enabled if all of its requirements are satisfied. If one of the requirements cannot be satisfied, the provider is disabled.
- yes (the default if the --enable-<provider> option is specified): The configure script will abort if the provider cannot be enabled (for example, due to some of its requirements not being available).
- no: Disable the provider. This is synonymous with --disable-<provider>.
- dl: Enable the provider and build it as a loadable library.
- <directory>: Enable the provider and use the installation given in <directory>.
Providers1 are gni*, mxm*, psm, psm2, sockets, udp, usnic*, and verbs.
Examples1
Consider the following example:
$ ./configure --prefix=/opt/libfabric --disable-sockets && make -j 32 && sudo make install
This tells libfabric to disable the sockets provider and install libfabric in the /opt/libfabric tree. All other providers will be enabled if possible, and all debug features will be disabled.
Alternatively:
$ ./configure --prefix=/opt/libfabric --enable-debug --enable-psm=dl && make -j 32 && sudo make install
This tells libfabric to enable the psm provider as a loadable library, enable all debug code paths, and install libfabric to the /opt/libfabric tree. All other providers will be enabled if possible.
Validate installation1
The fi_info utility can be used to validate the libfabric and provider installation, as well as provide details about provider support and available interfaces. See the fi_info(1) man page for details on using the fi_info utility. fi_info is installed as part of the libfabric package.
A more comprehensive test suite is available via the fabtests software package. In addition, fi_pingpong, a ping-pong test that transmits data between two processes, may be used for validation.
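As an example, a quick check of an installation rooted at /opt/libfabric might look like the following sketch. The install prefix and the <server-node> address are placeholders, and the exact command-line options can vary between libfabric versions; consult the fi_info(1) and fi_pingpong(1) man pages.
$ /opt/libfabric/bin/fi_info                 # list all discovered providers and interfaces
$ /opt/libfabric/bin/fi_info -p sockets      # restrict the query to the sockets provider
$ fi_pingpong -p sockets                     # start the ping-pong server on one node
$ fi_pingpong -p sockets <server-node>       # run the client on another node, giving the server address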
Who are the libfabric1 providers?
gni*1
The Generic Network Interface (GNI) provider runs on Cray XC* systems utilizing the user-space Generic Network Interface (uGNI), which provides low-level access to the Aries* interconnect. Aries is the Cray custom interconnect ASIC (Application-Specific Integrated Circuit). The Aries interconnect is designed for low-latency, one-sided messaging and also includes direct hardware support for common atomic operations and optimized collectives. Note, however, that OFI does not provide an API for collectives. A path toward optimizing collectives is available through the fi_trigger APIs, described at https://ofiwg.github.io/libfabric/master/man/fi_trigger.3.html. However, as of this writing, the Intel MPI Library does not use fi_trigger (triggered operations).
See the fi_gni(7) man page for more details.
Dependencies1
- The GNI provider requires GCC version 4.9 or higher.
mxm*1
The MXM provider has been deprecated and was removed after the libfabric 1.4.0 release.
psm1
The psm (Performance Scaled Messaging) provider runs over the PSM 1.x interface that is currently supported by the Intel® True Scale Fabric. PSM provides tag-matching message queue functions that are optimized for MPI implementations. PSM also has limited Active Message support, which is not officially published but is quite stable and well documented in the source code (part of the OFED release). The psm provider makes use of both the tag-matching message queue functions and the Active Message functions to support a variety of libfabric data transfer APIs, including tagged message queue, message queue, RMA (Remote Memory Access), and atomic operations.
The psm provider can work with the psm2-compat library, which exposes a PSM 1.x interface over the Intel® Omni-Path Fabric.
See the fi_psm(7) man page for more details.
psm21
The psm2 provider runs over the PSM 2.x interface that is supported by the Intel Omni-Path Fabric. PSM 2.x has all the PSM 1.x features plus a set of new functions with enhanced capabilities. Since PSM 1.x and PSM 2.x are not application binary interface (ABI) compatible, the psm2 provider only works with PSM 2.x and does not support Intel True Scale Fabric.
See the fi_psm2(7) man page for more details.
sockets1
The sockets provider is a general purpose provider that can be used on any system that supports TCP sockets. The provider is not intended to provide performance improvements over regular TCP sockets but rather to allow developers to write, test, and debug application code even on platforms that do not have high-performance fabric hardware. The sockets provider supports all libfabric provider requirements and interfaces.
See the fi_sockets(7) man page for more details.
udp1
The udp (user datagram protocol) provider is a basic provider that can be used on any system that supports User Datagram Protocol (UDP) sockets. UDP is an alternative communications protocol to Transmission Control Protocol (TCP), used primarily for establishing low-latency, loss-tolerating connections between applications on the Internet. The provider is not intended to provide performance improvements over regular UDP sockets but rather to allow application and provider developers to write, test, and debug their code. The udp provider forms the foundation of a utility provider that enables the implementation of libfabric features over any hardware. Intel MPI Library does not support the udp provider.
See the fi_udp(7) man page for more details.
usnic*1
The usnic provider is designed to run over the Cisco VIC* (virtualized NIC) hardware on Cisco UCS* (Unified Computing System) servers. It utilizes the Cisco usnic (userspace NIC) capabilities of the VIC to enable ultra-low latency and other offload capabilities on Ethernet networks. Intel MPI Library does not support the usnic provider.
See the fi_usnic(7) man page for more details.
Dependencies1
The usnic provider depends on library files from either libnl version 1 (sometimes known as libnl or libnl1) or version 3 (sometimes known as libnl3). If you are compiling libfabric from source and want to enable usnic support, you will also need the matching libnl header files (for example, if you are building with libnl version 3, you need both the header and library files from version 3).
Configure options1
--with-libnl=<directory>
If specified, look for libnl support. If it is not found, the usnic provider will not be built. If <directory> is specified, check for libnl version 3 in that directory; if version 3 is not found, check for version 1. If no <directory> argument is specified, this option is redundant with --with-usnic.
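As an illustration, a configuration that builds the usnic provider against a libnl version 3 installation might look like the following sketch, where /opt/libnl3 is a placeholder for the actual libnl installation directory:
$ ./configure --prefix=/opt/libfabric --enable-usnic --with-libnl=/opt/libnl3 && make -j 8 && sudo make install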
verbs*1
The verbs provider enables applications using OFI to be run over any verbs hardware (InfiniBand*, iWarp*, and so on). It uses the Linux Verbs API for network transport and provides a translation of OFI calls to appropriate verbs API calls. It uses librdmacm for communication management and libibverbs for other control and data transfer operations.
See the fi_verbs(7) man page for more details.
Dependencies1
The verbs provider requires libibverbs (v1.1.8 or newer) and librdmacm (v1.0.16 or newer). If you are compiling libfabric from source and want to enable verbs support, you will also need the matching header files for the above two libraries. If the libraries and header files are not in default paths, specify them in the CFLAGS, LDFLAGS, and LD_LIBRARY_PATH environment variables.
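For instance, if libibverbs and librdmacm are installed under a non-default prefix such as /opt/rdma (a placeholder path), the build environment might be set up as in the following sketch:
$ export CFLAGS="-I/opt/rdma/include"
$ export LDFLAGS="-L/opt/rdma/lib"
$ export LD_LIBRARY_PATH=/opt/rdma/lib:$LD_LIBRARY_PATH
$ ./configure --prefix=/opt/libfabric --enable-verbs && make -j 8 && sudo make install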
Selecting a fabric provider within OFI when using the Intel® MPI Library
For OFI when using Intel MPI Library, the selection of a provider from the libfabric library is done through the environment variable called I_MPI_OFI_PROVIDER, which defines the name of the OFI provider to load.
Syntax
export I_MPI_OFI_PROVIDER=<name>
where <name> is the OFI provider to load. Figure 1 shows a list of OFI providers1 in the row of rectangles that are second from the bottom.
Figure 1. The libfabric* architecture under Open Fabrics Interfaces1 (OFI).
The discussion that follows provides a description of OFI providers that can be selected with the I_MPI_OFI_PROVIDER environment variable.
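As an example, a launch that explicitly selects the psm2 provider might look like the following sketch. The shm:ofi fabric setting, host file, rank counts, and program name are placeholders to adapt to your installation, and the accepted I_MPI_FABRICS values depend on the Intel MPI Library version; I_MPI_DEBUG is set only so that startup information, including the selected fabric and provider, is printed.
export I_MPI_FABRICS=shm:ofi
export I_MPI_OFI_PROVIDER=psm2
export I_MPI_DEBUG=5
mpirun -f <host-file> -n 16 -ppn 4 ./program.exe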
Using a DAPL or a DAPL UD equivalent when migrating to OFI
DAPL is an acronym for Direct Access Programming Library. In DAPL UD, the acronym UD stands for the User Datagram protocol, and this data transfer mode is a more memory-efficient alternative to the standard Reliable Connection (RC) transfer. UD implements a connectionless, many-to-one model in which communication is managed using a fixed number of connection pairs, even as more MPI ranks are launched.
At the moment, there is no DAPL UD equivalent within OFI.
Using gni* under OFI
To use the gni provider under OFI, set the following environment variable:
export I_MPI_OFI_PROVIDER=gni
OVERVIEW1
The GNI provider runs on Cray XC systems utilizing the user-space Generic Network Interface (uGNI), which provides low-level access to the Aries interconnect. The Aries interconnect is designed for low-latency, one-sided messaging and also includes direct hardware support for common atomic operations and optimized collectives. Intel MPI Library works with the GNI provider on an “as is” basis.
REQUIREMENTS1
The GNI provider runs on Cray XC systems that run the Cray Linux Environment 5.2 UP04 or higher using gcc version 4.9 or higher.
The article by Lubin2 talks about using the gni fabric.
Using mxm* under OFI
As of this writing, the MXM provider has been deprecated and was removed after the libfabric 1.4.0 release.
Using TCP (Transmission Control Protocol) under OFI
To use the sockets provider under OFI set the following environment variable:
export I_MPI_OFI_PROVIDER=sockets
OVERVIEW1
The sockets provider is a general purpose provider that can be used on any system that supports TCP sockets. The provider is not intended to provide performance improvements over regular TCP sockets but rather to allow developers to write, test, and debug application code even on platforms that do not have high-performance fabric hardware. The sockets provider supports all libfabric provider requirements and interfaces.
SUPPORTED FEATURES1
The sockets provider supports all the features defined for the libfabric API. Key features include:
Endpoint types1
The provider supports all endpoint types: FI_EP_MSG, FI_EP_RDM, and FI_EP_DGRAM.
Endpoint capabilities1
The following data transfer interface is supported for all endpoint types: fi_msg. Additionally, these interfaces are supported for reliable endpoints (FI_EP_MSG and FI_EP_RDM): fi_tagged, fi_atomic, and fi_rma.
Modes1
The sockets provider supports all operational modes including FI_CONTEXT and FI_MSG_PREFIX.
Progress1
The sockets provider supports both FI_PROGRESS_AUTO and FI_PROGRESS_MANUAL, with the default set to auto. When progress is set to auto, a background thread runs to ensure that progress is made for asynchronous requests.
LIMITATIONS1
The sockets provider attempts to emulate the entire API set, including all defined options. In order to support development on a wide range of systems, it is implemented over TCP sockets. As a result, the performance numbers are lower compared to other providers implemented over high-speed fabrics and lower than what an application might see implementing sockets directly.
Using UDP under OFI
As of this writing, the UDP provider is not supported by Intel MPI Library because of the lack of required capabilities within the provider.
Using usnic* under OFI
As of this writing, Intel® MPI Library does not work with usnic*.
Using TMI under OFI
The Tag Matching Interface (TMI) provider was developed for Performance Scaled Messaging (PSM) and Performance Scaled Messaging 2 (PSM2). Under OFI, use the psm provider as the alternative to TMI/PSM by setting the following environment variable:
export I_MPI_OFI_PROVIDER=psm
OVERVIEW1
The psm provider runs over the PSM 1.x interface that is currently supported by the Intel True Scale Fabric. PSM provides tag-matching message queue functions that are optimized for MPI implementations. PSM also has limited Active Message support, which is not officially published, but is quite stable and is well documented in the source code (part of the OFED release). The psm provider makes use of both the tag-matching message queue functions and the Active Message functions to support a variety of libfabric data transfer APIs, including tagged message queue, message queue, RMA (Remote Memory Access), and atomic operations.
The psm provider can work with the psm2-compat library, which exposes a PSM 1.x interface over the Intel Omni-Path Fabric.
LIMITATIONS1
The psm provider does not support all the features defined in the libfabric API. Here are some of the limitations:
Endpoint types1
Only the non-connection based types FI_DGRAM and FI_RDM are supported.
Endpoint capabilities1
Endpoints can support any combination of data transfer capabilities FI_TAGGED, FI_MSG, FI_ATOMICS, and FI_RMA. These capabilities can be further refined by FI_SEND, FI_RECV, FI_READ, FI_WRITE, FI_REMOTE_READ, and FI_REMOTE_WRITE to limit the direction of operations. The limitation is that no two endpoints can have overlapping receive or RMA (Remote Memory Access) target capabilities in any of the above categories. For example, it is fine to have two endpoints with FI_TAGGED | FI_SEND, one endpoint with FI_TAGGED | FI_RECV, one endpoint with FI_MSG, one endpoint with FI_RMA | FI_ATOMICS. But, it is not allowed to have two endpoints with FI_TAGGED, or two endpoints with FI_RMA.
FI_MULTI_RECV is supported for the non-tagged message queue only.
Other supported capabilities include FI_TRIGGER.
Modes1
FI_CONTEXT is required for the FI_TAGGED and FI_MSG capabilities. This means that any request belonging to these two categories that generates a completion must pass, as the operation context, a valid pointer to a struct fi_context, and the space referenced by the pointer must remain untouched until the request has completed. If neither FI_TAGGED nor FI_MSG is requested, the FI_CONTEXT mode is not required.
Progress1
The psm provider requires manual progress. The application is expected to call the fi_cq_read or fi_cntr_read function from time to time when no other libfabric function is called, to ensure that progress is made in a timely manner. The provider does support the auto progress mode; however, performance can be significantly impacted if the application relies purely on the provider to make progress.
Unsupported features1
These features are unsupported: connection management, scalable endpoint, passive endpoint, shared receive context, and send/inject with immediate data.
Using PSM2 under OFI
To use the psm2 provider under OFI, set the following environment variable:
export I_MPI_OFI_PROVIDER=psm2
OVERVIEW1
The psm2 provider runs over the PSM 2.x interface that is supported by the Intel Omni-Path Fabric. PSM 2.x has all the PSM 1.x features plus a set of new functions with enhanced capabilities. Since PSM 1.x and PSM 2.x are not ABI compatible, the psm2 provider only works with PSM 2.x and does not support the Intel True Scale Fabric. If you have Intel® Omni-Path Architecture, use the psm2 provider.
LIMITATIONS1
The psm2 provider does not support all of the features defined in the libfabric API. Here are some of the limitations:
Endpoint types1
The only supported non-connection based types are FI_DGRAM and FI_RDM.
Endpoint capabilities1
Endpoints can support any combination of data transfer capabilities FI_TAGGED, FI_MSG, FI_ATOMICS, and FI_RMA. These capabilities can be further refined by FI_SEND, FI_RECV, FI_READ, FI_WRITE, FI_REMOTE_READ, and FI_REMOTE_WRITE to limit the direction of operations.
FI_MULTI_RECV is supported for non-tagged message queue only.
Other supported capabilities include FI_TRIGGER, FI_REMOTE_CQ_DATA, and FI_SOURCE.
Modes1
FI_CONTEXT is required for the FI_TAGGED and FI_MSG capabilities. This means that any request belonging to these two categories that generates a completion must pass, as the operation context, a valid pointer to a struct fi_context, and the space referenced by the pointer must remain untouched until the request has completed. If neither FI_TAGGED nor FI_MSG is requested, the FI_CONTEXT mode is not required.
Progress1
The psm2 provider requires manual progress. The application is expected to call the fi_cq_read or fi_cntr_read function from time to time when no other libfabric function is called, to ensure that progress is made in a timely manner. The provider does support the auto progress mode; however, performance can be significantly impacted if the application relies purely on the provider to make progress.
Unsupported features1
These features are unsupported: connection management, scalable endpoint, passive endpoint, shared receive context, and send/inject with immediate data over tagged message queue.
Using verbs under OFI
To use the verbs provider under OFI, set the following environment variable:
export I_MPI_OFI_PROVIDER=verbs
OVERVIEW1
The verbs provider enables applications using OFI to be run over any verbs hardware (InfiniBand, iWarp, and so on). It uses the Linux Verbs API for network transport and provides a translation of OFI calls to appropriate verbs API calls. It uses librdmacm for communication management, and libibverbs for other control and data transfer operations.
SUPPORTED FEATURES1
The verbs provider supports a subset of OFI features.
Endpoint types1
Only FI_EP_MSG (Reliable Connection-Oriented) and FI_EP_RDM (Reliable Datagram) endpoints are supported, but the official OFI documentation declares FI_EP_RDM support experimental because it is under active development, including its wire protocols. Intel MPI Library works over RDM endpoints. Note that changes in the wire protocol typically mean that all peers must run in an aligned environment; therefore, different versions of libfabric are not compatible with each other.
Endpoint capabilities1
FI_MSG, FI_RMA, FI_ATOMIC.
Modes1
A verbs provider requires applications to support the following modes: FI_LOCAL_MR for all applications, and FI_RX_CQ_DATA for applications that want to use RMA (Remote Memory Access). Applications must take responsibility for posting receives for any incoming CQ (Completion Queue) data.
Progress1
A verbs provider supports FI_PROGRESS_AUTO: Asynchronous operations make forward progress automatically.
Operation flags1
A verbs provider supports FI_INJECT, FI_COMPLETION, FI_REMOTE_CQ_DATA.
Msg Ordering1
A verbs provider supports the following messaging ordering on the TX side:
- Read after Read
- Read after Write
- Read after Send
- Write after Write
- Write after Send
- Send after Write
- Send after Send
Is the multi-rail feature supported under OFI?
When using multi-rail under OFA, the command-line syntax for invoking “mpirun” with Intel MPI Library might look something like:
export I_MPI_FABRICS=ofa:ofa
mpirun -n 8 -env I_MPI_OFA_ADAPTER_NAME adapter1 ./program.exe : -n 8 -env I_MPI_OFA_ADAPTER_NAME adapter2 ./program.exe
For the command-line above, 8 MPI ranks use the host channel adapter (HCA) called adapter1 and the other 8 MPI ranks use the HCA named adapter2.
Another multi-rail common case under OFA is to have every MPI rank use all the available host channel adapters and all the open ports from every HCA. Suppose the cluster system has 4 nodes where each system has 2 HCAs with 2 open ports each. Then every MPI rank may use 4 hardware cables for communication. The command-line syntax for invoking “mpirun” with Intel MPI Library might look something like:
export I_MPI_FABRICS=ofa:ofa
mpirun -f <host-file> -n 16 -ppn 4 -env I_MPI_OFA_NUM_ADAPTERS 2 -env I_MPI_OFA_NUM_PORTS 2 ./program.exe
where there are 4 MPI ranks associated with each node, and <host-file> is a meta-symbol for a file name that contains the names of the 4 compute servers. The environment variable setting I_MPI_OFA_NUM_ADAPTERS=2 enables utilization of 2 HCAs, and the environment variable setting I_MPI_OFA_NUM_PORTS=2 enables utilization of 2 ports.
For using multi-rail under OFI, the Unified Communication X (UCX) working group has defined a framework that will support multi-rail semantics.3 UCX is a collaboration between industry, laboratories, and academia to create an open-source, production-grade communication framework for data-centric and high-performance computing applications (Figure 2).
Figure 2. The Unified Communication X framework.3
Regarding the current status of UCX and the multi-rail fabric: as of this writing, multi-rail is not implemented yet for OFI.
References
1. Open Fabrics Interfaces Working Group, https://github.com/ofiwg/libfabric
2. M. Lubin, “Intel® Cluster Tools in a Cray* environment. Part 1.”
3. Unified Communication X