<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Cluster Connection &#187; Gigabit Ethernet</title>
	<atom:link href="http://www.clusterconnection.com/tag/gigabit-ethernet/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.clusterconnection.com</link>
	<description>Simplify HPC. Share the knowledge.</description>
	<lastBuildDate>Fri, 30 Dec 2011 21:23:37 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1</generator>
		<item>
		<title>Why Is InfiniBand (and other interconnects) So Fast?</title>
		<link>http://www.clusterconnection.com/2009/08/why-is-infiniband-and-other-interconnects-so-fast/</link>
		<comments>http://www.clusterconnection.com/2009/08/why-is-infiniband-and-other-interconnects-so-fast/#comments</comments>
		<pubDate>Wed, 26 Aug 2009 17:16:12 +0000</pubDate>
		<dc:creator>Douglas Eadline</dc:creator>
				<category><![CDATA[Blogs]]></category>
		<category><![CDATA[Featured]]></category>
		<category><![CDATA[10 Gigabit Ethernet]]></category>
		<category><![CDATA[Gigabit Ethernet]]></category>
		<category><![CDATA[InfiniBand]]></category>
		<category><![CDATA[kernel]]></category>
		<category><![CDATA[MPI]]></category>
		<category><![CDATA[Myrinet]]></category>
		<category><![CDATA[Open-MX]]></category>
		<category><![CDATA[TCP]]></category>
		<category><![CDATA[UDP]]></category>
		<category><![CDATA[user-space]]></category>

		<guid isPermaLink="false">http://www.clusterconnection.com/2009/08/why-is-infiniband-and-other-interconnects-so-fast/</guid>
		<description><![CDATA[The advent of user-space protocols provides a fast way to move data The above title is misleading because "fast" can mean many different things. In the case of HPC, "fast" means whatever it takes to keep the cores busy! In a previous post, I mentioned four parameters that are used to define an interconnect (throughput, [...]]]></description>
			<content:encoded><![CDATA[<p><em>The advent of user-space protocols provides a fast way to move data</em></p>
<p>The above title is misleading because "fast" can mean many different things. In the case of HPC, "fast" means whatever it takes to keep the cores busy! In a previous <a href="/2009/07/cluster-interconnects-messaging-rate/">post</a>, I mentioned four parameters that are used to define an interconnect (throughput, latency, N/2, and messaging rate). Of course, applications are the best way to evaluate an interconnect.</p>
<p>The most popular interconnects for HPC are Ethernet (GigE and 10 GigE), InfiniBand, and Myrinet. (At this point, many people lump Myrinet into the 10 GigE category because it supports the standard protocol as well as the Myricom protocols.) Each of these interconnects is used in both mainstream and HPC applications, but one usage mode sets HPC applications apart from almost all others.</p>
<p>When interconnects are used in HPC, the best performance comes from a "user space" mode. Communication over a network normally takes place through the kernel (i.e., the kernel manages, and in a sense guarantees, that data will get to where it is supposed to go). This communication path, however, requires memory to be copied from the user's program space to a kernel buffer. The kernel then manages the communication. On the receiving node, the kernel accepts the data and places it in a kernel buffer. That buffer is then copied to the user's program space. The extra copying adds to the latency of a given network. In addition, the kernel must process the TCP/IP stack for each communication. For applications that require low latency, copying from user program space to a kernel buffer on the sending node, and then from a kernel buffer to user program space on the receiving node, can be very inefficient.</p>
<p>To improve latency, many vendors of high performance interconnects use a "user space" protocol instead of the kernel. <em>Figure One</em> illustrates this difference. The solid lines indicate a standard Ethernet MPI connection. Note that the communication passes through the kernel on both send and receive. Interconnects like Myrinet and InfiniBand provide a low latency user space protocol that does not use the kernel or incur any TCP/IP overhead. Instead, it moves data directly from the memory of one process to the memory of the other process (dashed lines). The fast interconnects also provide a TCP and UDP layer so that they can be used with regular through-the-kernel network services as well (e.g., to run NFS).<br />
<center><img class="aligncenter size-full wp-image-1378" title="kernel-user-space" src="/wordpress/wp-content/uploads/2009/07/kernel-user-space.png" alt="kernel-user-space" width="608" height="378" /><br />
<em>Figure One: Kernel Space vs User Space Transfer</em></center></p>
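<p>The difference between the two paths in <em>Figure One</em> can be sketched as a toy model. This is only an illustration and not real networking code: the function names (<code>wire_transfer</code>, <code>kernel_path</code>, <code>user_space_path</code>) are hypothetical, and ordinary Python buffers stand in for kernel buffers and process memory. It simply counts the extra memory copies that each path adds on top of the transfer over the wire.</p>

```python
# Toy model of Figure One (an assumption for illustration, not a real protocol).
# Both paths pay for one transfer over the wire; the kernel path adds two copies.

def wire_transfer(data):
    """Stand-in for the network hop between the two nodes."""
    return bytes(data)


def kernel_path(user_send_buf):
    """Through-the-kernel path: user buffer, kernel buffer, wire, kernel buffer, user buffer."""
    extra_copies = 0

    # Sending node: copy from the user's program space into a kernel buffer.
    kernel_send_buf = bytearray(user_send_buf)
    extra_copies += 1

    # The kernel manages the transfer; data lands in a kernel buffer on the receiver.
    kernel_recv_buf = wire_transfer(kernel_send_buf)

    # Receiving node: copy from the kernel buffer into the user's program space.
    user_recv_buf = bytearray(kernel_recv_buf)
    extra_copies += 1

    return bytes(user_recv_buf), extra_copies


def user_space_path(user_send_buf):
    """User-space path: data moves from one process's memory to the other's."""
    user_recv_buf = wire_transfer(user_send_buf)
    return user_recv_buf, 0


message = b"x" * 1024
for path in (kernel_path, user_space_path):
    received, extra_copies = path(message)
    print(f"{path.__name__}: {extra_copies} extra copies, intact={received == message}")
```

<p>In this sketch the kernel path reports two extra copies and the user-space path reports none. The model ignores TCP/IP stack processing, which the kernel path also has to pay for.</p>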
<p>Special libraries must be used to access the user space protocol. Users generally do not write code at this level. Instead, virtually all MPI libraries support either the <a href="http://www.openfabrics.org/">Open Fabrics Enterprise Distribution</a> OpenIB interface or the Myrinet MX interface. Users need to relink their applications to a "user space" MPI library to improve performance. Some MPI libraries (e.g., <a href="http://software.intel.com/en-us/articles/intel-mpi-library/">Intel MPI</a> and <a href="http://www.open-mpi.org/">Open MPI</a>) allow run-time selection of the actual interconnect and thus avoid relinking or recompiling codes.</p>
<h3>What about Ethernet?</h3>
<p>In the past, almost all user space implementations were done for high speed (i.e. expensive) networks. Users of Ethernet were confined to kernel-based (TCP/IP) MPI implementations. There are now three Linux projects that bring user-space communications to Ethernet. The first, and oldest, is the <a href="http://www.disi.unige.it/project/gamma/">Genoa Active Message MAchine</a>, or GAMMA. GAMMA is famous for achieving latencies of less than 10 μseconds over GigE. It does require a patch to the Ethernet driver and only supports certain Intel Ethernet chip-sets. Results have been impressive.</p>
<p>Another optimized communication protocol is <a href="http://software.intel.com/en-us/articles/intel-direct-ethernet-transport/">Intel® Direct Ethernet Transport</a> (DET), which works by providing a uDAPL-like InfiniBand interface over GigE. uDAPL is the User Direct Access Programming Library, which defines a single set of user APIs for all RDMA-capable transports. DET includes a kernel module and a uDAPL library for Ethernet and will work on almost any Ethernet NIC. It can be linked with any software that requires the uDAPL library.</p>
<p>A newer and popular effort is the <a href="http://open-mx.gforge.inria.fr/">Open-MX</a> project. Open-MX is based on the Myrinet MX protocol. Essentially, any software that links to the Myricom MX library should be able to link with Open-MX. Currently, Open MPI, MPICH2, and the PVFS2 file system have all been shown to work with Open-MX. While Open-MX will work with almost all GigE and 10 GigE chip-sets without modifying drivers, it does require kernel 2.6.15 or higher. Depending on the chip-set, Open-MX latencies as low as 10 μseconds for GigE have been reported.</p>
<p>With 10 Gigabit Ethernet, the processor overhead needed to keep the pipe full and manage TCP/IP communications has become quite excessive. To offload work from the processor, the <a href="http://en.wikipedia.org/wiki/IWARP">iWARP</a> protocol is used. iWARP-enabled hardware allows TCP/IP-based Ethernet to address the three major sources of networking overhead: transport (TCP/IP) processing, intermediate buffer copies, and application context switch overhead.</p>
<p>From the user's perspective, user-space protocols are hidden under the MPI layer. Thus, there is almost no programming price to pay for better performance. If your cluster has InfiniBand or Myrinet, chances are you are already running in user-space, but it is always good to check. Ask your system administrator or consult your cluster documentation. And, stay out of the kernel!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.clusterconnection.com/2009/08/why-is-infiniband-and-other-interconnects-so-fast/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Interconnects: 10GigE and InfiniBand</title>
		<link>http://www.clusterconnection.com/2009/07/interconnects-10gige-and-infiniband/</link>
		<comments>http://www.clusterconnection.com/2009/07/interconnects-10gige-and-infiniband/#comments</comments>
		<pubDate>Wed, 08 Jul 2009 17:29:38 +0000</pubDate>
		<dc:creator>Douglas Eadline</dc:creator>
				<category><![CDATA[Blogs]]></category>
		<category><![CDATA[Featured]]></category>
		<category><![CDATA[10 Gigabit Ethernet]]></category>
		<category><![CDATA[Gigabit Ethernet]]></category>
		<category><![CDATA[InfiniBand]]></category>
		<category><![CDATA[Top500]]></category>

		<guid isPermaLink="false">http://www.clusterconnection.com/?p=1206</guid>
		<description><![CDATA[It is a two-horse race, but one horse is still in the barn Ask anyone, "What are the two choices for HPC interconnects?" and they will tell you "InfiniBand and 10 Gigabit Ethernet (10 GigE)." For the most part they are correct, but 10 GigE is just entering the HPC market. It has not even landed [...]]]></description>
			<content:encoded><![CDATA[<p><em>It is a two-horse race, but one horse is still in the barn</em></p>
<p>Ask anyone, "What are the two choices for HPC interconnects?" and they will tell you "<a href="http://en.wikipedia.org/wiki/InfiniBand">InfiniBand</a> and <a href="http://en.wikipedia.org/wiki/10_Gigabit_Ethernet">10 Gigabit Ethernet</a> (10 GigE)." For the most part they are correct, but 10 GigE is just entering the HPC market. It has not even landed in the <a href="http://www.top500.org/">Top500</a> arena, although users are still confident it will show up soon. On the other hand, InfiniBand use is climbing steadily.</p>
<p>The following table shows the share (in percent of systems) of each interconnect family on the June 2008 and June 2009 Top500 lists.<br />
I grouped any interconnect that had less than a 1% share in 2009 into the "other" category.</p>
<p align="center">
<table border="1" cellpadding="5">
<tbody>
<tr>
<th>Interconnect</th>
<th>6/2008 (%)</th>
<th>6/2009 (%)</th>
</tr>
<tr>
<td>Myrinet</td>
<td>2.4</td>
<td>2.0</td>
</tr>
<tr>
<td>GigE</td>
<td>56.6</td>
<td>56.4</td>
</tr>
<tr>
<td>IB</td>
<td>24.2</td>
<td>30.2</td>
</tr>
<tr>
<td>Proprietary</td>
<td>8.2</td>
<td>8.4</td>
</tr>
<tr>
<td>Other</td>
<td>8.6</td>
<td>3.0</td>
</tr>
</tbody>
</table>
<p>The first thing to notice is that GigE still dominates the list with a 56% share, as it did a year ago. Also of note, number 16 on the list, from the University of Toronto, used Quad Xeon E55xx and GigE! The only big changes are the continued growth of InfiniBand (from 24% to 30%) and the decrease of the "other" category.</p>
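<p>The year-over-year changes can be checked with a few lines of arithmetic. The sketch below uses only the numbers from the table above; the variable and function names (<code>shares</code>, <code>share_changes</code>) are hypothetical and chosen for illustration.</p>

```python
# Top500 interconnect shares in percent, copied from the table above:
# family name mapped to (June 2008, June 2009).
shares = {
    "Myrinet": (2.4, 2.0),
    "GigE": (56.6, 56.4),
    "IB": (24.2, 30.2),
    "Proprietary": (8.2, 8.4),
    "Other": (8.6, 3.0),
}


def share_changes(table):
    """Return the change in percentage points from 2008 to 2009 per family."""
    return {
        name: round(y2009 - y2008, 1)
        for name, (y2008, y2009) in table.items()
    }


changes = share_changes(shares)
for name, delta in changes.items():
    print(f"{name:12s} {delta:+.1f}")

# Sanity check: each column should account for the whole list.
total_2008 = round(sum(y2008 for y2008, _ in shares.values()), 1)
total_2009 = round(sum(y2009 for _, y2009 in shares.values()), 1)
print(f"totals: {total_2008} {total_2009}")
```

<p>InfiniBand gains 6.0 points and "other" loses 5.6 points, while GigE moves by only 0.2 points. Both columns sum to 100.</p>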
<p>So why the confidence in 10 GigE? Simple: at one point GigE was as expensive as 10 GigE is today, but due to commodity uptake, the price came down to the "it is free on the motherboard" option. Plus, many users like the "plug and play" nature of Ethernet, as it is a well-understood technology.</p>
<p>In closing, the Top500 is a single benchmark and not the only measure of HPC interconnects, but it does provide an interesting snapshot of what people are using. Right now it seems the biggest competitor to 10 GigE might be GigE.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.clusterconnection.com/2009/07/interconnects-10gige-and-infiniband/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

