Cluster Interconnects: Messaging Rate

July 1st, 2009 10:37 am
Posted by Douglas Eadline

If the interconnect creates the cluster, then how do we measure it?

The cluster interconnect is a big contributor to application performance. Use a slow interconnect and your processors may end up idle while they wait for data. A balanced CPU/interconnect combination is important for getting the most from your cluster. There is a trade-off, however: the better the interconnect, the more it costs, which means the less you can spend on node or storage hardware. If your applications are embarrassingly parallel, then money spent on a fast interconnect may be better used for more nodes. The converse is true as well - a slower, cheaper network may throttle communication-intensive applications, in which case fewer nodes and a fast network will work better.

It is often said that the best benchmark of a cluster is your application(s). While this is certainly true, it is not always possible. Looking at performance data for applications similar to yours is also an option. While this can be helpful, many users often start with micro or single-point measurements that help qualify a given interconnect.

Traditionally these micro measurements have been bandwidth, latency, and N/2. Bandwidth, or throughput, is probably the most often quoted performance metric. This number is also used to identify networks (e.g., Gigabit Ethernet is one billion bits per second (bps), 10-GigE is 10 billion bps). The throughput does vary by payload size (message size), so the maximum possible data rate is often reported for large payloads. Latency affects throughput and is often an important feature in HPC networks. Latency can be thought of as the set-up and tear-down time required for a message. For example, traveling by plane is very fast, but the time spent at the airport prior to and after the flight is the "latency" of the flight. The smaller the message, the more the latency matters. Using our airplane analogy, if you are flying from New York to Boston, a 2-hour airport latency is a large part of your travel time. If you were traveling to Tokyo, then the airport time contributes much less to the overall trip time. Many HPC applications require low latency because they use many short messages.
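The latency/bandwidth interaction can be captured in a simple toy model (a sketch, not a measurement of any real network): assume the time to move an n-byte message is a fixed latency plus the payload divided by peak bandwidth. The GigE-like numbers below are illustrative assumptions, not benchmark results.

```python
# Toy model: transfer time = fixed latency + serialization time.
def transfer_time(n_bytes, latency_s, bandwidth_bps):
    """Seconds to move n_bytes: setup cost plus payload/bandwidth."""
    return latency_s + n_bytes / bandwidth_bps

def effective_throughput(n_bytes, latency_s, bandwidth_bps):
    """Achieved bytes/second for a single n-byte message."""
    return n_bytes / transfer_time(n_bytes, latency_s, bandwidth_bps)

# Assumed GigE-like numbers: 50 microsecond latency, 125 MB/s peak (1 Gbps).
lat, bw = 50e-6, 125e6
for n in (64, 64 * 1024, 16 * 1024 * 1024):
    print(f"{n:>10} bytes: {effective_throughput(n, lat, bw) / 1e6:8.2f} MB/s")
```

Running this shows the airport effect in numbers: small messages achieve only a tiny fraction of peak bandwidth because latency dominates, while large messages approach the wire speed.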

Because latency is so important in HPC, many interconnects report what is known as single-byte latency (i.e., the latency or overhead required to send a single byte of data). Indeed, the competition to produce the lowest HPC latency is quite fierce. Currently, a low latency interconnect has a single-byte latency of between 1 and 3 μseconds using specialized interconnect protocols (i.e., not standard kernel networking). For comparison, GigE has a latency range of 20-80 μseconds using TCP/IP.

When evaluating networks, however, throughput and latency are not the whole story. The N/2 point is also included in the list of performance numbers. N/2 is defined as the payload size at which the bandwidth is at half its maximum. Recall that bandwidth is dependent on payload size: due to latency, the smaller the payload, the smaller the bandwidth. As the payload size increases, a maximum throughput is achieved. An example curve is shown in Figure One below.
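Under the simple latency-plus-serialization model sketched earlier (an assumption, not a property of any particular network), N/2 has a closed form: setting throughput n/(L + n/B) equal to B/2 and solving gives n = L × B, the latency-bandwidth product.

```python
# Solve n / (L + n/B) = B/2 for n in the toy model => n = L * B.
# N/2 equals the latency-bandwidth product under this model's assumptions.
def n_half(latency_s, bandwidth_bps):
    """Payload size (bytes) at which modeled throughput is half of peak."""
    return latency_s * bandwidth_bps

# Assumed GigE-like numbers: 50 us latency, 125 MB/s peak.
lat, bw = 50e-6, 125e6
n = n_half(lat, bw)
print(n)                        # payload size at half bandwidth, in bytes
print(n / (lat + n / bw) / bw)  # fraction of peak achieved at N/2
```

This also gives intuition for why low latency and high bandwidth pull N/2 in opposite directions: a faster wire with the same latency pushes the half-bandwidth point out to larger payloads.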

Figure One: example throughput N/2 curve

Traditionally, the above information was often considered a guide to determining a good cluster interconnect (i.e., it should have high bandwidth, low latency, and a low N/2 value). And there are benchmarks that you can run to determine the values mentioned above. Most HPC interconnect benchmarks are written in MPI (Message Passing Interface), but other protocols can be used as well. The following are some freely available benchmarks that can be used to measure an interconnect.

  • NetPIPE is a protocol-independent (i.e., TCP, MPI, MPI-2, SHMEM, TCGMSG, PVM, and others) performance tool that visually represents the network performance under a variety of conditions.
  • The Intel MPI Benchmarks are a set of MPI benchmarks that will thoroughly exercise a network interconnect.
  • OMB (OSU Micro-Benchmarks) is a set of MPI benchmarks that address point-to-point communication and some multiple-communication patterns common to most multi-core platforms.

Using throughput, latency, and N/2 for evaluation worked rather well before the multi-core revolution. The ability to accurately assess how a network performs rests on the assumption that cluster nodes work in a point-to-point fashion. That is, a single process on one cluster node is communicating with a single process on another node. With today's multi-core nodes, the number of processes on a given node can easily be eight or more. The increased number of processes means that a single point-to-point number may not reflect the real performance of a cluster because there is now more contention for the interconnect - the node is sending and receiving more messages.

To help address this problem, the "message rate" metric was developed. Message rate is defined as the number of messages transmitted in a period of time and is determined by dividing the measured bandwidth (bytes/second) by the length of the message (typically done for very small messages, e.g., 0 or 2 bytes), resulting in a messages-per-second metric. The larger the number of messages a multi-core node can send, the better it is expected to work on HPC codes. Note that high bandwidth, low latency, and low N/2 do not necessarily imply a good messaging rate.
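The arithmetic behind the metric is straightforward; a minimal sketch follows (the 10 MB/s figure is an illustrative assumption, not a measured result, and the sketch assumes a nonzero message length since zero-byte benchmarks count header-only transfers rather than payload bytes):

```python
# Message rate as defined above: achieved bandwidth divided by message
# length, yielding messages per second.
def message_rate(bandwidth_bps, msg_bytes):
    """Messages/second for fixed-size messages at a given achieved bandwidth."""
    if msg_bytes <= 0:
        raise ValueError("message length must be positive for this formula")
    return bandwidth_bps / msg_bytes

# Hypothetical example: 10 MB/s achieved with 2-byte messages
# works out to 5 million messages per second.
print(message_rate(10e6, 2))
```

Note how the metric rewards small-message performance: the same achieved bandwidth at a smaller message size means proportionally more messages per second, which is exactly the regime where multi-core contention shows up.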

In closing, even with a good messaging rate, single point numbers are not always the best indicator for interconnect performance. As stated, your application is still the best measure of performance. And remember, one good benchmark that mirrors your applications is worth a hundred opinions.





Author Info

Dr. Douglas Eadline has worked with parallel computers since 1988 (anyone remember the Inmos Transputer?). After co-authoring the original Beowulf How-To, he continued to write extensively about Linux HPC clustering and parallel software issues. Much of Doug's early experience has been in software tools and application performance. He has been building and using Linux clusters since 1995. Doug holds a Ph.D. in Chemistry from Lehigh University.