Why Is InfiniBand (and other interconnects) So Fast?

August 26th, 2009 10:16 am
Posted by Douglas Eadline

The advent of user-space protocols provides a fast way to move data

The above title is misleading because "fast" can mean many different things. In the case of HPC, "fast" means whatever it takes to keep the cores busy! In a previous post, I mentioned four parameters that are used to define an interconnect (throughput, latency, N/2, and messaging rate). Of course, applications are the best way to evaluate an interconnect.

The most popular interconnects for HPC are Ethernet (GigE and 10-GigE), InfiniBand, and Myrinet. (At this point, many people lump Myrinet into the 10-GigE category as it supports the standard protocol as well as the Myricom protocols.) Each of these interconnects is used in both mainstream and HPC applications, but one usage mode sets HPC applications apart from almost all others.

When interconnects are used in HPC, the best performance comes from a "user space" mode. Communication over a network normally takes place through the kernel (i.e., the kernel manages, and in a sense guarantees, that data will get to where it is supposed to go). This communication path, however, requires memory to be copied from the user's program space to a kernel buffer, after which the kernel manages the communication. On the receiving node, the kernel accepts the data, places it in a kernel buffer, and then copies it to the user's program space. This extra copying adds to the latency of a given network, and the kernel must also process the TCP/IP stack for each communication. For applications that require low latency, the copy from user space to kernel buffer on the sending node, followed by the copy from kernel buffer to user space on the receiving node, can be very inefficient.
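To make the kernel path concrete, here is a minimal sketch (not from the original article) of sending a buffer over an ordinary TCP socket in C; the address and port are placeholders. The key point is that send() hands a buffer in the process's memory to the kernel, which copies it into a socket buffer and runs the TCP/IP stack before anything reaches the wire.

/* Minimal sketch of the "through the kernel" path: send() copies the user
 * buffer into a kernel socket buffer, then the kernel does TCP/IP processing.
 * The peer address and port below are placeholders. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    struct sockaddr_in peer = {0};
    peer.sin_family = AF_INET;
    peer.sin_port = htons(5000);                         /* placeholder port */
    inet_pton(AF_INET, "192.168.1.10", &peer.sin_addr);  /* placeholder node */

    if (connect(fd, (struct sockaddr *)&peer, sizeof(peer)) < 0) {
        perror("connect");
        return 1;
    }

    char msg[] = "hello from user space";
    /* send() copies msg from this process's memory into a kernel buffer;
     * the kernel then handles TCP/IP processing and transmission. */
    send(fd, msg, strlen(msg), 0);

    close(fd);
    return 0;
}

The same double copy happens in reverse on the receiving side when recv() pulls data out of a kernel buffer into the application's memory.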

To improve latency, many vendors of high performance interconnects provide a "user space" protocol that bypasses the kernel. Figure One illustrates the difference. The solid lines indicate a standard Ethernet MPI connection; note that the communication passes through the kernel on both the send and receive sides. Interconnects like Myrinet and InfiniBand provide a low latency user space protocol that does not use the kernel or incur any TCP/IP overhead. Instead, data moves directly from the memory of one process to the memory of the other process (dashed lines). The fast interconnects also provide TCP and UDP layers so they can be used with regular through-the-kernel network services as well (e.g., to run NFS).

[Image: kernel-user-space]
Figure One: Kernel Space vs User Space Transfer
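As a rough illustration of what "user space" communication looks like underneath an MPI library, the sketch below uses the OpenFabrics verbs API to register a user buffer with an InfiniBand adapter. This is only the registration step, not a complete program (queue pair setup and the actual send/receive are omitted), and it assumes libibverbs is installed. It is meant to show that the adapter is given direct access to pinned user memory, so no kernel buffer copy is required for the data transfer itself.

/* Registration-only sketch of the user-space (verbs) path; not a complete
 * RDMA program. Compile with: gcc reg.c -libverbs */
#include <stdio.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

int main(void)
{
    struct ibv_device **dev_list = ibv_get_device_list(NULL);
    if (!dev_list || !dev_list[0]) {
        fprintf(stderr, "No RDMA-capable devices found\n");
        return 1;
    }

    struct ibv_context *ctx = ibv_open_device(dev_list[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    /* An ordinary user buffer that will be sent/received without kernel copies */
    size_t len = 4096;
    void *buf = malloc(len);

    /* Registration pins the pages and gives the adapter direct access to them */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) {
        perror("ibv_reg_mr");
        return 1;
    }
    printf("Registered %zu bytes at %p (lkey=0x%x)\n", len, buf, mr->lkey);

    /* A real program would now create queue pairs and post send/receive
     * work requests; an MPI library does all of this for you. */
    ibv_dereg_mr(mr);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(dev_list);
    free(buf);
    return 0;
}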

Special libraries must be used to access the user space protocol. Users generally do not write code at this level. Instead, virtually all MPI libraries support either the OpenFabrics Enterprise Distribution (OFED) OpenIB interface or the Myrinet MX interface. Users need to relink their applications to a "user space" MPI library to get the performance improvement. Some MPI libraries (e.g., Intel MPI and Open MPI) allow run-time selection of the actual interconnect and thus avoid relinking or recompiling codes.
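The benefit of hiding the transport under MPI is that application code does not change. The minimal ping-pong sketch below (not from the original article) runs unmodified over kernel TCP or a user-space transport; only the MPI library it is linked against, or a run-time option, differs.

/* Minimal MPI ping-pong between ranks 0 and 1. The same source works over
 * kernel TCP or a user-space transport; the choice is made by the MPI
 * library, not the application. Compile with: mpicc pingpong.c */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char buf[64];
    if (rank == 0) {
        snprintf(buf, sizeof(buf), "ping from rank 0");
        MPI_Send(buf, sizeof(buf), MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(buf, sizeof(buf), MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 0 got: %s\n", buf);
    } else if (rank == 1) {
        MPI_Recv(buf, sizeof(buf), MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        snprintf(buf, sizeof(buf), "pong from rank 1");
        MPI_Send(buf, sizeof(buf), MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}

With Open MPI, for example, the transport can be chosen at run time with MCA parameters (something like mpirun --mca btl self,openib -np 2 ./pingpong for the InfiniBand path versus --mca btl self,tcp for kernel TCP; the exact component names depend on your installation and version).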

What about Ethernet?

In the past, almost all user space implementations were done for high speed (i.e., expensive) networks. Users of Ethernet were confined to kernel based (TCP/IP) MPI implementations. There are now three Linux projects that bring user-space communications to Ethernet. The first and oldest is the Genoa Active Message MAchine, or GAMMA. GAMMA is famous for achieving latencies of less than 10 microseconds over GigE. It does require a patch to the Ethernet driver and only supports certain Intel Ethernet chip-sets. Results have been impressive.

Another optimized communication protocol is Intel® Direct Ethernet Transport (DET), which provides an InfiniBand-style uDAPL interface over GigE. uDAPL is the User Direct Access Programming Library, which defines a single set of user APIs for all RDMA-capable transports. DET includes a kernel module and a uDAPL library for Ethernet and will work on almost any Ethernet NIC. It can be linked with any software that requires the uDAPL library.

A newer and popular effort is the Open-MX project. Open-MX is based on the Myrinet MX protocol; essentially, any software that links to the Myricom MX library should be able to link with Open-MX. Currently, Open MPI, MPICH2, and the PVFS2 file system have all been shown to work with Open-MX. While Open-MX works with almost all GigE and 10-GigE chip-sets without modified drivers, it does require kernel 2.6.15 or newer. Depending on the chip-set, Open-MX latencies as low as 10 microseconds over GigE have been reported.

In terms of 10 Gigabit Ethernet, the processor overhead needed to keep the pipe full and manage TCP/IP communications has become quite excessive. To offload work from the processor, the iWARP protocol is used. iWARP-enabled hardware allows TCP/IP-based Ethernet to address the three major sources of networking overhead: transport (TCP/IP) processing, intermediate buffer copies, and application context switch overhead.

From the user's perspective, user-space protocols are hidden under the MPI layer, so there is almost no programming price to pay for the better performance. If your cluster has InfiniBand or Myrinet, chances are you are already running in user space, but it is always good to check. Ask your system administrator or consult your cluster documentation. And, stay out of the kernel!



Author Info


Dr. Douglas Eadline has worked with parallel computers since 1988 (anyone remember the Inmos Transputer?). After co-authoring the original Beowulf How-To, he continued to write extensively about Linux HPC clustering and parallel software issues. Much of Doug's early experience has been in software tools and application performance. He has been building and using Linux clusters since 1995. Doug holds a Ph.D. in Chemistry from Lehigh University.