<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Cluster Connection &#187; MPI</title>
	<atom:link href="http://www.clusterconnection.com/tag/mpi/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.clusterconnection.com</link>
	<description>Simplify HPC. Share the knowledge.</description>
	<lastBuildDate>Fri, 30 Dec 2011 21:23:37 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1</generator>
		<item>
		<title>Why Is InfiniBand (and other interconnects) So Fast?</title>
		<link>http://www.clusterconnection.com/2009/08/why-is-infiniband-and-other-interconnects-so-fast/</link>
		<comments>http://www.clusterconnection.com/2009/08/why-is-infiniband-and-other-interconnects-so-fast/#comments</comments>
		<pubDate>Wed, 26 Aug 2009 17:16:12 +0000</pubDate>
		<dc:creator>Douglas Eadline</dc:creator>
				<category><![CDATA[Blogs]]></category>
		<category><![CDATA[Featured]]></category>
		<category><![CDATA[10 Gigabit Ethernet]]></category>
		<category><![CDATA[Gigabit Ethernet]]></category>
		<category><![CDATA[InfiniBand]]></category>
		<category><![CDATA[kernel]]></category>
		<category><![CDATA[MPI]]></category>
		<category><![CDATA[Myrinet]]></category>
		<category><![CDATA[Open-MX]]></category>
		<category><![CDATA[TCP]]></category>
		<category><![CDATA[UDP]]></category>
		<category><![CDATA[user-space]]></category>

		<guid isPermaLink="false">http://www.clusterconnection.com/2009/08/why-is-infiniband-and-other-interconnects-so-fast/</guid>
		<description><![CDATA[The advent of user-space protocols provides a fast way to move data The above title is misleading because "fast" can mean many different things. In the case of HPC, "fast" means whatever it takes to keep the cores busy! In a previous post, I mentioned four parameters that are used to define an interconnect (throughput, [...]]]></description>
			<content:encoded><![CDATA[<p><em>The advent of user-space protocols provides a fast way to move data</em></p>
<p>The above title is misleading because "fast" can mean many different things. In the case of HPC, "fast" means whatever it takes to keep the cores busy! In a previous <a href="/2009/07/cluster-interconnects-messaging-rate/">post</a>, I mentioned four parameters that are used to define an interconnect (throughput, latency, N/2, and messaging rate). Of course, applications are the best way to evaluate an interconnect.</p>
<p>The most popular interconnects for HPC are Ethernet (GigE and 10-GigE), InfiniBand, and Myrinet. (At this point, many people lump Myrinet into the 10 GigE category as it supports the standard protocol as well as the Myricom protocols.) Each of these interconnects is used in both mainstream and HPC applications, but one usage mode sets HPC applications apart from almost all others.</p>
<p>When interconnects are used in HPC, the best performance comes from a "user space" mode. Communication over a network normally takes place through the kernel (i.e. the kernel manages, and in a sense guarantees, that data will get to where it is supposed to go). This communication path, however, requires memory to be copied from the user's program space to a kernel buffer. The kernel then manages the communication. On the receiver node, the kernel will accept the data and place it in a kernel buffer. The buffer is then copied to the user's program space. The excess copying often adds to the latency for a given network. In addition, the kernel must process the TCP/IP stack for each communication. For applications that require low latency, the extra copying from user program space to kernel buffer on the sending node and then from kernel buffer to user program space on the receiving node can be very inefficient.</p>
<p>To improve latency, many vendors of high performance interconnects use a "user space" protocol instead of the kernel. <em>Figure One</em> illustrates this difference. The solid lines indicate a standard Ethernet MPI connection. Note that the communication passes through the kernel on both send and receive. Interconnects like Myrinet and InfiniBand provide a low latency user space protocol that does not use the kernel or incur any TCP/IP overhead. Instead, it moves data from the memory of one process to the memory of the other process (dashed lines). The fast interconnects also provide TCP and UDP layers so that they can be used with regular through-kernel network services as well (e.g. to run NFS).<br />
<center><img class="aligncenter size-full wp-image-1378" title="kernel-user-space" src="/wordpress/wp-content/uploads/2009/07/kernel-user-space.png" alt="kernel-user-space" width="608" height="378" /><br />
<em>Figure One: Kernel Space vs User Space Transfer</em></center></p>
<p>Special libraries must be used to access the user space protocol. Users generally do not write code at this level. Instead, virtually all MPI libraries support either the <a href="http://www.openfabrics.org/">Open Fabrics Enterprise Distribution</a> OpenIB interface or the Myrinet MX interface. Users need to relink their applications to a "user space" MPI library to improve performance. Some MPI libraries (e.g. <a href="http://software.intel.com/en-us/articles/intel-mpi-library/">Intel MPI</a> and <a href="http://www.open-mpi.org/">Open MPI</a>) allow run-time selection of the actual interconnect and thus avoid relinking/recompiling of codes.</p>
<h3>What about Ethernet?</h3>
<p>In the past, almost all user space implementations were done for high speed (i.e. expensive) networks. Users of Ethernet were confined to using kernel based (TCP/IP) MPI implementations. There are now three Linux projects that bring user-space communications to Ethernet. The first and oldest is the <a href="http://www.disi.unige.it/project/gamma/">Genoa Active Message MAchine</a> or GAMMA. GAMMA is famous for achieving latencies of less than 10 μseconds over GigE. It does require a patch to the Ethernet driver and only supports certain Intel Ethernet chip-sets. Results have been impressive.</p>
<p>Another optimized communication protocol is <a href="http://software.intel.com/en-us/articles/intel-direct-ethernet-transport/">Intel® Direct Ethernet Transport</a> (DET), which works by providing a uDAPL-like InfiniBand interface over GigE. uDAPL is the User Direct Access Programming Library that defines a single set of user APIs for all RDMA-capable transports. DET includes a kernel module and a uDAPL library for Ethernet and will work on almost any Ethernet NIC. It can be linked with any software requiring a uDAPL library.</p>
<p>A newer and popular effort is the <a href="http://open-mx.gforge.inria.fr/">Open-MX</a> project. Open-MX is based on the Myrinet MX protocol. Essentially, any software that links to the Myricom MX library should be able to link with Open-MX. Currently, Open MPI, MPICH2, and the PVFS2 file system have all been shown to work with Open-MX. While Open-MX will work with almost all GigE and 10-GigE chip-sets without modifying drivers, it does require kernel 2.6.15 or higher to work. Depending on the chip-set, Open-MX latencies as low as 10 μseconds for GigE have been reported.</p>
<p>In terms of 10 Gigabit Ethernet, the processor overhead to keep the pipe full and manage TCP/IP communications has become quite excessive. In order to offload work from the processor, the <a href="http://en.wikipedia.org/wiki/IWARP">iWARP</a> protocol is used. iWARP-enabled hardware allows TCP/IP-based Ethernet to address the three major sources of networking overhead -- transport (TCP/IP) processing, intermediate buffer copies, and application context switch overhead.</p>
<p>From the user's perspective, user-space protocols are hidden under the MPI layer. Thus, there is almost no programming price to pay for better performance. If your cluster has InfiniBand or Myrinet, chances are you are already running in user-space, but it is always good to check. Ask your system administrator or consult your cluster documentation. And, stay out of the kernel!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.clusterconnection.com/2009/08/why-is-infiniband-and-other-interconnects-so-fast/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Comparing MPI and OpenMP</title>
		<link>http://www.clusterconnection.com/2009/08/comparing-mpi-and-openmp/</link>
		<comments>http://www.clusterconnection.com/2009/08/comparing-mpi-and-openmp/#comments</comments>
		<pubDate>Thu, 20 Aug 2009 19:48:06 +0000</pubDate>
		<dc:creator>Douglas Eadline</dc:creator>
				<category><![CDATA[Blogs]]></category>
		<category><![CDATA[Featured]]></category>
		<category><![CDATA[Hardware]]></category>
		<category><![CDATA[API]]></category>
		<category><![CDATA[Intel Cluster Ready]]></category>
		<category><![CDATA[MPI]]></category>
		<category><![CDATA[OpenMP]]></category>

		<guid isPermaLink="false">http://www.clusterconnection.com/2009/08/comparing-mpi-and-openmp/</guid>
		<description><![CDATA[The hardware environment may determine the best parallel programming tool to use The advent of multi-core processors has increased the need for parallel programs on the largest to the smallest of systems (clusters to laptops). There are many ways to express parallelism in a program. In HPC, the MPI (Message Passing Interface) has been the [...]]]></description>
			<content:encoded><![CDATA[<p><em>The hardware environment may determine the best parallel programming tool to use</em></p>
<p>The advent of multi-core processors has increased the need for parallel programs on the largest to the smallest of systems (clusters to laptops). There are many ways to express parallelism in a program. In HPC, the MPI (Message Passing Interface) has been the main tool of most programmers. MPI is often talked about as though it is a computer language on its own. In reality, MPI is an API (Applications Programming Interface), or programming library that allows Fortran and C (and sometimes C++) programs to send messages to each other.</p>
<p>Another method to express parallelism is OpenMP. Unlike MPI, OpenMP is not primarily a library of calls, but an extension to a compiler. To use OpenMP, the programmer adds "pragmas" (compiler directives, which take the form of special comments in Fortran) to the program that tell the compiler which regions to run in parallel. The resulting program uses operating system threads to run in parallel. Operating system threads can be thought of as separate subroutines running at the same time that share the same memory space. In addition to the fact that "MP" is in both the names of these methods, there is often some confusion about how each of these parallel paradigms works and where/when they should be applied. This article will explain the differences and provide a better understanding of these two powerful technologies.</p>
<p>While some programming issues will be discussed, this article is not a programming tutorial. If you are interested in using MPI and OpenMP, please see the <a href="http://www.linux-mag.com/id/5759">MPI In 30 Minutes</a> and <a href="http://www.linux-mag.com/id/4609">OpenMP in 30 Minutes</a> tutorials. These tutorials will get you started quickly. We will also be talking about some of the example programs mentioned in these articles. There are links to the source code in both articles should you wish to try some things on your own.</p>
<h3>Background</h3>
<p>Many people are surprised to learn that MPI has not been certified by any major standards organization. Instead, the <a href="http://www.mpi-forum.org/">MPI Forum</a> creates and maintains the MPI standard. You can find more background and where to obtain both open and commercial MPI versions in the article <a href="/2009/06/mpi-choices/">MPI Choices</a>. When writing MPI codes, the programmer must explicitly add message passing calls to a program. Quite often, existing sequential programs are modified, but new "parallel applications" can be written as well. In terms of programming difficulty, MPI is conceptually straightforward, but it is possible to build programs that are hard to follow as there are many things happening at the same time. In addition, starting/monitoring/debugging MPI programs across a cluster can sometimes lead to extra work not found when running on a single server. There are also tools to assist with MPI programs such as <a href="http://software.intel.com/en-us/articles/intel-trace-analyzer/">Intel Trace Analyzer</a>.</p>
<p>OpenMP was developed because programming with native operating system threads (often referred to as POSIX threads or Pthreads) can be cumbersome. To aid with thread programming, a higher level of abstraction was developed and called OpenMP. As with all higher level approaches, there is the sacrifice of flexibility for the ease of coding. At its core, OpenMP uses threads, but the details are hidden from the programmer. As mentioned, OpenMP is implemented as compiler directives (<tt>#pragma</tt> lines in C, special comments in Fortran). Typically, computationally heavy loops are augmented with OpenMP directives that the compiler uses to automatically "thread the loop." This type of approach has the distinct advantage that it may be possible to leave the original program "untouched" (except for directives) and provide simple recompilation for a sequential (non-threaded) version where the OpenMP directives are ignored. More information can be found at the <a href="http://openmp.org/wp/">OpenMP</a> website.</p>
<p>OpenMP is supported by all major Fortran and C compilers (including <a href="http://gcc.gnu.org/">gcc</a>/<a href="http://gcc.gnu.org/fortran/">gfortran</a> and the <a href="http://software.intel.com/en-us/articles/intel-compilers/">Intel Compilers</a>). From a programmer's standpoint, working with OpenMP is easier than MPI -- at least initially. Adding pragmas to a program allows it to still function as a sequential (single core) program, thus programmers can incrementally add parallelism. Users can still create complex, hard to understand programs, but as with MPI there are tools, like <a href="http://software.intel.com/en-us/articles/intel-vtune-amplifier-xe/">Intel VTune</a>, to assist the programmer with OpenMP.</p>
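<p>As a minimal sketch of what such directives look like in C, the two loops below each carry a single OpenMP pragma. The function names <tt>scale_add</tt> and <tt>array_sum</tt> are made up for illustration and do not come from the tutorial programs. Built with <tt>gcc -fopenmp</tt>, the loop iterations are divided among threads; built without that option, the pragmas are ignored and the same source runs sequentially.</p>
<pre>
/* Illustrative only: scale_add and array_sum are hypothetical names. */

/* y[i] = a * x[i] + y[i] for every element.  Each iteration is
 * independent, so the loop can be split among threads as is. */
void scale_add(int n, double a, const double *x, double *y)
{
    int i;
#pragma omp parallel for
    for (i = 0; i &lt; n; i++) {
        y[i] = a * x[i] + y[i];
    }
}

/* Sum of all elements.  The reduction clause gives every thread a
 * private partial sum and combines the partial sums at the end. */
double array_sum(int n, const double *x)
{
    double total = 0.0;
    int i;
#pragma omp parallel for reduction(+:total)
    for (i = 0; i &lt; n; i++) {
        total += x[i];
    }
    return total;
}
</pre>
<p>Note that the second loop needs the <tt>reduction</tt> clause because every iteration updates the same variable; the first loop does not, since each iteration writes to a different array element.</p>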
<h3>Copy or Share</h3>
<p>When discussing MPI or message passing methods, one obvious aspect is often overlooked - message passing is basically memory copying. Let's consider a simple MPI message from one program to another.</p>
<p>The sending program sends the text message "Hello over there" and the receiving program responds with "What's up?". The sending program will construct the "Hello over there" string in memory then send it to the receiving program. The receiving program will take the string and place it into its own memory. There are now two copies of the string. The reply works exactly the same way. This type of communication is best for distributed memory systems like clusters. Note that I did not state where the processes were located. By design, MPI processes can be located on the same server or on a separate server. Regardless of where it runs, each MPI process has its own memory space from which messages are copied.</p>
<p>In contrast, in a threaded or OpenMP environment, communication happens differently. If one thread wants to communicate with another thread, it would <em>say</em>, there is a message at this memory location (i.e. "Hello over there"). The receiving thread would look at the memory location and then <em>tell</em> the sender my response is here. There is no copied data, there is one copy and it is shared between threads. (Note, this is not strictly how it happens, but the "no copying" is what is important).</p>
<p>As mentioned, MPI can run across distributed servers and on SMP (multi-core) servers. OpenMP, however, is best run on a single SMP server or on multiple servers using something like <a href="http://www.scalemp.com/">ScaleMP</a>. There is also a product called <a href="http://software.intel.com/en-us/whatif/">Cluster OpenMP*</a> that can run OpenMP applications across a cluster. For this reason, MPI codes usually scale to larger numbers of servers, while OpenMP is restricted to a single operating system domain (e.g. a single server).</p>
<p>There is another subtle difference between OpenMP and MPI applications that run on a single server. In OpenMP, communication is through shared memory, which means <em>threads share access</em> to a memory location. MPI programs on SMP systems also communicate through shared memory, but <em>processes send messages by reading and writing to shared memory</em>. The messages are still copied from one process space to another. Obviously, sharing memory locations seems more efficient than sending copies of memory locations to other processes, but it all depends. In the MPI process model, single processes have exclusive access to all their process memory. For some programs, this situation may be more efficient because it is better to copy data (send a message) than to wait for shared memory access. On the other hand, in the OpenMP model, threads can share access to all memory in the process space. In this case, some programs may be much more efficient as the overhead of copying memory is not needed.</p>
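<p>The copy-versus-share distinction can be sketched in plain C. This is neither MPI nor OpenMP code; the two helper functions are hypothetical stand-ins. Here <tt>strncpy</tt> plays the role of the transfer a message passing library would perform, and a returned pointer plays the role of one thread telling another where to look.</p>
<pre>
#include &lt;string.h&gt;

#define MSG_MAX 64   /* size of every message buffer in this sketch */

/* Message passing style: the receiver ends up with its own copy.
 * Both buffers are assumed to hold MSG_MAX bytes. */
void deliver_by_copy(char *receiver_buf, const char *sender_buf)
{
    strncpy(receiver_buf, sender_buf, MSG_MAX - 1);
    receiver_buf[MSG_MAX - 1] = '\0';
}

/* Shared memory style: nothing is copied, the receiver is only
 * told where the sender's data lives. */
const char *deliver_by_reference(const char *sender_buf)
{
    return sender_buf;
}
</pre>
<p>After <tt>deliver_by_copy</tt>, a later change to the sender's buffer does not affect the receiver's copy. After <tt>deliver_by_reference</tt>, it does, because both sides are looking at the same memory. That is also why threads that share data may have to wait for one another.</p>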
<h3>Compiling and Running Code</h3>
<p>We are going to use three programs from the tutorial articles mentioned above. The programs are simple matrix multiplications with the same underlying code:</p>
<ul>
<li><tt>matmul.c</tt> - a sequential (runs on a single core) version of the matrix multiplication program</li>
<li><tt>matmul_omp.c</tt> - an OpenMP version of <tt>matmul.c</tt></li>
<li><tt>matmul_mpi.c</tt> - an MPI version of <tt>matmul.c</tt></li>
</ul>
<p>In order to extend the execution times, I increased the array dimension from 1000 to 2000 in all the programs.  I'm using an Intel Q6600 quad-core running at 2.40GHz and my gcc version is 4.3.3. The first thing we will do is build the sequential version.</p>
<pre>$ gcc -g -O3 -o matmul.exe matmul.c -lm</pre>
<p>Next, we will run the program and record the time.</p>
<pre>$ time ./matmul.exe&gt;matmul.out

real	1m59.011s
user	1m58.931s
sys	0m0.084s</pre>
<p>The program took 119 seconds to run. The OpenMP version was built with the following command; note the use of the <tt>-fopenmp</tt> option. This option tells the compiler to use the OpenMP pragmas to build a threaded version. Indeed, it is possible to create a sequential or single core version by not using the <tt>-fopenmp</tt> option.</p>
<pre>$ gcc -g -O3  -fopenmp -o matmul_omp.exe matmul_omp.c -lm</pre>
<p>Running the program produces the following times</p>
<pre>$ time ./matmul_omp.exe&gt;matmul_omp.out

real	0m31.304s
user	2m3.460s
sys	0m0.080s</pre>
<p>The OpenMP version reduced the wall clock time from 119 seconds to 31 (a speed up of 3.8). If you look at the user time you will see there were 123 seconds used! That is because four cores were used and user time is the combined time of all the cores running at the same time. There is also an environment variable called <tt>OMP_NUM_THREADS</tt> that will tell OpenMP binaries how many threads to use. If this is not defined, one thread per core is used. The maximum number of threads may be defined by the program as well.</p>
<p>Turning to MPI there are a few differences in our compilation process. First, we have to make sure we have a version of MPI installed on our machine. In this case we are using Open MPI 1.3.1. To build an MPI program a wrapper script/program is often used that makes sure that the paths and names of include files and libraries are specified. In our case, we will use the MPI <tt>cc</tt> wrapper called <tt>mpicc</tt>.</p>
<pre>$ mpicc -g -O3 -o matmul_mpi.exe matmul_mpi.c -lm</pre>
<p>To run the resultant binary, we need to use an MPI starter program often referred to as <tt>mpirun</tt> or <tt>mpiexec</tt>. We also add an argument (<tt>-np</tt>) to tell MPI how many copies of the program to run.</p>
<pre>$ time mpirun -np 4 matmul_mpi.exe&gt;matmul_mpi.out

real	0m32.662s
user	2m5.824s
sys	0m0.556s</pre>
<p>Note that similar to the OpenMP example, the real time was about 33 seconds (a speed-up of 3.6) while the user time was about 126 seconds. Both methods produced excellent speed-up.</p>
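<p>The speed-up figures quoted above follow directly from the wall clock ("real") times. As a small worked sketch, the functions below spell out the arithmetic. The function names are hypothetical, and parallel efficiency (speed-up divided by the number of cores) is a common yardstick that is not otherwise used in this article.</p>
<pre>
/* Speed-up: sequential wall clock time divided by parallel wall clock time. */
double speedup(double t_sequential, double t_parallel)
{
    return t_sequential / t_parallel;
}

/* Parallel efficiency: speed-up divided by the number of cores used.
 * A value of 1.0 would mean perfect scaling. */
double efficiency(double t_sequential, double t_parallel, int cores)
{
    return speedup(t_sequential, t_parallel) / cores;
}
</pre>
<p>With the times measured here, <tt>speedup(119.011, 31.304)</tt> is about 3.8 for OpenMP and <tt>speedup(119.011, 32.662)</tt> is about 3.6 for MPI. On four cores that corresponds to efficiencies of roughly 0.95 and 0.91.</p>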
<h3>Processes or Threads</h3>
<p>As mentioned, the big difference between MPI and OpenMP is the way programs are run. OpenMP programs run as a single process and the parallelism is expressed as threads (i.e. the program is started as one binary which then separates into individual "threads" which are run on the available cores on a server). This behavior can be viewed quite clearly when reviewing an OpenMP program using <tt>top</tt>. As an example, consider <em>Figure One</em> where a single OpenMP binary is running on an eight-core server. Notice that the cores are all busy, but there is one process running with a CPU utilization rate of 788 percent!</p>
<div id="attachment_1460" class="wp-caption aligncenter" style="width: 398px"><img class="size-full wp-image-1460" title="figure-one-omp-top-485" src="/wordpress/wp-content/uploads/2009/08/figure-one-omp-top-485.png" alt="Figure One: OpenMP program (cg.B) running on eight cores" width="388" height="272" /><p class="wp-caption-text">OpenMP program (cg.B) running on eight cores</p></div>
<p>In contrast to OpenMP, MPI actually starts one process per core using the <tt>mpirun -np 8 ...</tt> command. This situation is shown in <em>Figure Two</em> where an MPI version of the same program is now running. Note the number of processes is now eight and each process has a 100 percent utilization rate. The processor (core) loads are about the same.</p>
<div id="attachment_1461" class="wp-caption aligncenter" style="width: 412px"><img class="size-full wp-image-1461" title="figure-two-mpi-top-485" src="/wordpress/wp-content/uploads/2009/08/figure-two-mpi-top-485.png" alt="Figure Two: MPI program (cg.B.8) running on eight cores" width="402" height="289" /><p class="wp-caption-text">MPI program (cg.B.8) running on eight cores</p></div>
<p>We will not be making any statements about which method is better. In some cases OpenMP works much better than MPI on a multi-core server for the same application. In other cases, MPI has been shown to run faster. The good news is if you already have an MPI version of your program, you can easily try it on a multi-core server. You can also run an MPI program "by hand" across multiple servers by using various start-up methods. Most batch scheduling packages (Torque, Moab, SGE, Platform) support multi-node MPI runs as well. Similarly, you can request a node to run your OpenMP application, but make sure you get exclusive control of the number of cores you need.</p>
<h3>Hybrid Approaches</h3>
<p>Astute readers may wonder, can I use OpenMP and MPI in the same program? The answer is yes. Since MPI is an API and OpenMP is based on threads, there is no reason, other than bad programming, that the two methods cannot be used in the same application. Indeed, HPL (the Top500 Benchmark) is run as a single instance on each node, but on the node the program is threaded to use the individual cores. One way to envision a hybrid program is to use MPI for the outer loops and OpenMP for the inner loops. Thus, an MPI program could be augmented with OpenMP pragmas so it could take advantage of all the cores on any one node, if they are available. Of course, the program/algorithm would have to support this level of parallelism to run efficiently.</p>
<p>MPI and OpenMP are well tested and robust technologies for creating parallel programs. Understanding these differences is key to creating applications that meet your expectations and run on the types of hardware available to you. Now that multi-core systems are everywhere, getting started is easy even on that new desktop.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.clusterconnection.com/2009/08/comparing-mpi-and-openmp/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Automatic SLES 11 deployment of an ICR certified HPC cluster</title>
		<link>http://www.clusterconnection.com/2009/07/automatic-sles-11-deployment-of-an-icr-certified-hpc-cluster/</link>
		<comments>http://www.clusterconnection.com/2009/07/automatic-sles-11-deployment-of-an-icr-certified-hpc-cluster/#comments</comments>
		<pubDate>Tue, 14 Jul 2009 17:26:01 +0000</pubDate>
		<dc:creator>Oliver Tennert</dc:creator>
				<category><![CDATA[Blogs]]></category>
		<category><![CDATA[Featured]]></category>
		<category><![CDATA[error messages]]></category>
		<category><![CDATA[Intel Cluster Checker]]></category>
		<category><![CDATA[Intel Cluster Ready]]></category>
		<category><![CDATA[Intel MPI]]></category>
		<category><![CDATA[Mesa]]></category>
		<category><![CDATA[MPI]]></category>
		<category><![CDATA[sles11]]></category>

		<guid isPermaLink="false">http://www.clusterconnection.com/2009/07/automatic-sles-11-deployment-of-an-icr-certified-hpc-cluster/</guid>
		<description><![CDATA[The other day we had to ship an ICR certified HPC cluster based on SLES 11, the latest SUSE Enterprise Distribution. We have used Intel Cluster Runtime version 2.0-1. As SLES 11 has to that date been out only for a couple of weeks, we didn't expect everything to run as smoothly as for SLES [...]]]></description>
			<content:encoded><![CDATA[<p>The other day we had to ship an ICR certified HPC cluster based on SLES 11, the latest SUSE Enterprise Distribution. We used Intel Cluster Runtime version 2.0-1. As SLES 11 had at that point been out for only a couple of weeks, we didn't expect everything to run as smoothly as for SLES 10, and indeed the challenge turned out to be getting SLES 11 to behave in a way compatible with Intel Cluster Ready.</p>
<p>As we found out on the Web, the Intel MPI implementation has a known bug: the number of available cores is not detected correctly. But it took some time to connect this circumstance to one of our problems, because the original error did not immediately point to it:</p>
<p>Intel(R) MPI Library Runtime Environment (Single-node), (intel_mpi_rt).........................................................FAILED<br />
subtest 'MPI Hello World! (I_MPI_DEVICE = sock)' failed<br />
- failing All hosts returned: 'No one returned Hello World!'<br />
subtest 'mpd shutdown' failed<br />
- failing All hosts returned: 'mpdallexit: cannot connect to local mpd (/tmp/mpd2.console_transtec); possible causes:<br />
1. no mpd is running on this host<br />
2. an mpd is running but was started without a "console" (-n option)'</p>
<p>As it turned out, the real problem was the program "cpuinfo" of Intel MPI, reporting a wrong number of cores in the system. Setting an environment variable,</p>
<p>"export I_MPI_CPUINFO=proc"</p>
<p>fixed this issue, however.</p>
<p>Another issue with Intel MPI is that it seems to be incompatible with Python 2.6, which comes along with SLES 11:</p>
<p>Intel® MPI Library Runtime Environment (Single-node), (intel_mpi_rt).........................................................FAILED<br />
subtest 'mpd startup' failed<br />
- failing All hosts returned:<br />
'/opt/intel/impi/3.2.0.011/bin64/mpdlib.py:27: DeprecationWarning: The popen2 module is deprecated.  Use the subprocess module.<br />
import sys, os, signal, popen2, socket, select, inspect<br />
/opt/intel/impi/3.2.0.011/bin64/mpdlib.py:37: DeprecationWarning: the md5 module is deprecated; use hashlib instead<br />
from  md5       import  new as md5new'</p>
<p>Although this should not constitute a real problem, cluster checker seems to be sensitive enough to end with an error code even for warnings like this. We could fix it by installing Python 2.4, which, unfortunately, does not exist as a SLES 11 package, so we had to take a source tarball and recompile the package. It is strange, though, that SLES 11 does not include Python 2.4 as a fallback package for compatibility reasons, as there are many Python programs out there based on that version.</p>
<p>In my opinion, this just demonstrates the power of Intel Cluster Ready, or cluster checker, respectively: Intel MPI causes a problem, but the cluster checker does its job and identifies a compatibility issue, impressively enough! Thus, the ICR program enables us to catch the problem before it reaches the customer level.</p>
<p>A strange error that occurred was this one:</p>
<p>X11 runtime libraries are provided, (X11_libs).........................FAILED<br />
subtest 'libGLw.so (x86-64) &gt;= version 1' failed<br />
- failing All hosts returned: 'missing'</p>
<p>Obviously, the missing library is part of the Mesa package, which is a necessary HPC cluster ingredient from Intel Cluster Ready's point of view. Doing a little bit of research, we found out that nearly all current Linux distributions explicitly remove this library from the standard Mesa package, for whatever reason.</p>
<p>We could solve that in an elegant way by replacing the SLES Mesa package with a recompiled one, having adapted the SPEC file so that it includes the libGLw libraries.</p>
<p>A funny thing that happened was that Korn shell missed the locales:</p>
<p>Korn Shell, (ksh)......................................................FAILED<br />
subtest 'Hello World!' failed<br />
- failing hosts node1 - node3 returned: 'en_US.UTF-8: unknown locale'</p>
<p>But the package glibc-locale seemed to be installed:</p>
<p>root@node1 # rpm -q glibc-locale<br />
gcc-locale-4.3-62.198</p>
<p>What was going on?</p>
<p>It turned out to have nothing to do with ICR incompatibilities of SLES 11, but to be a problem with "xCAT 2", which we use for cluster deployment. As we were testing diskless installations, xCAT had defined a number of files and directories to be deleted before creating the compressed root image. In the configuration file</p>
<p>/opt/xcat/share/xcat/netboot/sles/compute.exlist</p>
<p>the locale files are explicitly listed to be deleted before image generation. Having found that out, it was an easy thing to fix.</p>
<p>After solving all these issues, we finally succeeded in developing a fully automatic deployment procedure for ICR certified HPC clusters based on SLES 11.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.clusterconnection.com/2009/07/automatic-sles-11-deployment-of-an-icr-certified-hpc-cluster/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Sharing The Load: Cluster Resource Schedulers</title>
		<link>http://www.clusterconnection.com/2009/07/sharing-the-load-cluster-resource-schedulers/</link>
		<comments>http://www.clusterconnection.com/2009/07/sharing-the-load-cluster-resource-schedulers/#comments</comments>
		<pubDate>Fri, 10 Jul 2009 18:17:32 +0000</pubDate>
		<dc:creator>Douglas Eadline</dc:creator>
				<category><![CDATA[Blogs]]></category>
		<category><![CDATA[Featured]]></category>
		<category><![CDATA[backfill]]></category>
		<category><![CDATA[batch schedulers]]></category>
		<category><![CDATA[fair share]]></category>
		<category><![CDATA[job scheduers]]></category>
		<category><![CDATA[MPI]]></category>
		<category><![CDATA[queue]]></category>
		<category><![CDATA[resource managers]]></category>
		<category><![CDATA[scheduler]]></category>

		<guid isPermaLink="false">http://www.clusterconnection.com/2009/07/sharing-the-load-cluster-resource-schedulers/</guid>
		<description><![CDATA[Job Schedulers are the workhorse of cluster computing One thing virtually every cluster has in common is a batch or job scheduler. The job scheduler sits between the user and the cluster and manages the cluster resources. All user programs that are run on the cluster are under the control of the resource scheduler (we [...]]]></description>
			<content:encoded><![CDATA[<p><em>Job Schedulers are the workhorse of cluster computing </em></p>
<p>One thing virtually every cluster has in common is a batch or job scheduler. The job scheduler sits between the user and the cluster and manages the cluster resources. All user programs that run on the cluster are under the control of the resource scheduler (we will use the word <em>scheduler</em> to refer to the resource manager/scheduler). Users submit a "job" to the scheduler, which in turn adds the job to the work queue of the cluster. The work queue can be thought of as the line at the supermarket. This type of computing is often called "batch mode" processing and is sometimes considered a throwback to the days of the shared mainframe. Because HPC jobs usually run for long periods of time (days or weeks), interactive execution becomes somewhat cumbersome. With a batch scheduler, the user can <em>submit and forget</em>. When the job finishes or stops, the user is often notified by email.</p>
<p>The goal of the scheduler is to allow multiple users to submit jobs (programs) to the cluster and then run them on the requested resources -- when the resources become available. That last part about resource availability is what makes the scheduler's job so hard. We'll take a deeper look at this issue in a moment, but first let's look at the cluster from the user's perspective.</p>
<h3>The User View</h3>
<p>Let's be honest, the cluster user is rather selfish! As a user you want to submit a job or program to the cluster and have it run right away. Even if you are the only user, you still want to use a job scheduler because it will keep track of jobs and make sure you don't over-subscribe nodes (place too many programs on a node). The main problem a cluster user faces is the presence of other users! It is those <em>other users</em> that lead to the most common question facing the cluster administrator, "Why is my job not running?" Thus the user's goal is rather simple -- run my code now and tell me when you are done.</p>
<p>The user can usually interact with the scheduler from the command line, but most users write <em>submit scripts</em> (shell scripts) that allow options to be re-used for subsequent jobs. In particular, when using MPI the scheduler usually requires a large number of options in addition to the <tt>mpirun</tt> command itself. The resource manager also provides a list of nodes that can be used by an MPI starter program. Thus, instead of entering a command line with a large number of options, a simple script can be submitted to the scheduler.</p>
<h3>The Scheduler View</h3>
<p>The scheduler has a more difficult task. Each job or program has a number of parameters that can make scheduling rather difficult. In a general sense, each job requires resources. Examples of resources include the number of cores, location of cores, amount of memory, amount of execution time, etc. The resource can also be a particular node owned by a research group or a group of nodes connected by a specific network or storage technology. The list of resources can be rather large and depends on the cluster. At a minimum, a user will need to request the number of cores they need. In the past, when nodes had at most two cores, users would request nodes. With multi-core, however, the node request has given way to the number of cores or slots, which usually translates into the number of individual MPI processes the user wants to run. Single or sequential jobs will require only one core. In addition, multi-threaded programs will need to request a number of cores on a specific node. As you can see, the scheduler has quite a lot to consider.</p>
<p>Because users share the cluster, the scheduler should at least be fair (unless there is some other policy in place). A naive way to be fair is to use a first-come, first-served policy. While this seems like a good idea, this approach may lead to poor utilization. Consider a cluster with 16 cores. If a user submits a job requesting 12 cores, then there are four cores remaining for other users. If someone then submits a job requiring 14 cores, this job will have to wait until the 12-core job is finished, leaving the four cores idle. Now suppose that in the meantime a user submits a four-core job. Obviously, a smart scheduler would notice this and try to "back fill" to use the idle four cores. Utilization is much better, but what happens if the four-core job continues to run after the initial 12-core job completes? Now there are 12 cores available, but not enough to run the "next in line" 14-core job. Of course the batch scheduler wants to use the idle cores, but now it runs the risk of continually pushing back the 14-core job.</p>
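<p>The 16-core example above can be worked through in a few lines of Python. This is only a sketch of the arithmetic: the helper name <tt>idle_cores</tt> is made up for this illustration and does not come from any real scheduler.</p>
<pre>
```python
TOTAL_CORES = 16

def idle_cores(running):
    """Cores left over once the listed jobs (given as core counts) are running."""
    return TOTAL_CORES - sum(running)

# Strict first-come, first-served: the 12-core job runs, the 14-core job waits.
fcfs_idle = idle_cores([12])

# Back fill: the later 4-core job is allowed to jump ahead and use the gap.
backfill_idle = idle_cores([12, 4])

# The 12-core job finishes while the 4-core job is still running:
# 12 cores are free, but the 14-core job at the head of the queue needs 14.
after_finish_idle = idle_cores([4])
head_job_can_start = after_finish_idle >= 14

print(fcfs_idle, backfill_idle, after_finish_idle, head_job_can_start)
```
</pre>
<p>The last value is the problem described above: the cluster is mostly idle, yet the job that has waited longest still cannot start.</p>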
<p>If you include other resource requirements (e.g. large memory size), the fact that users often omit or incorrectly specify the amount of run time they need, and the constant stream of new job submissions, you can appreciate that <em>fair and efficient</em> cluster resource scheduling is a hard problem. From the user's perspective, it all seems rather simple -- <em>submit and forget</em>. From the scheduler's perspective it is one headache after another! Thankfully the scheduler does not complain all that much.</p>
<p>Another interesting note about resource schedulers is the fact that modern multi-core/multi-processor motherboards run their own scheduler as part of the OS. In particular, the OS must decide where to place a process and at the same time balance the processes so all cores are equally busy, while trying to keep each process on the same core so that data in the processor cache can be reused. This scheduling is independent of the global resource scheduler, but some co-ordination would be useful.</p>
<h3>Resource Scheduler Background</h3>
<p>Because resource schedulers can be complex and vary from one to another, <span style="color: black;">understanding some concepts may be helpful</span> when evaluating various solutions. In general a resource scheduler has two major components:</p>
<ul>
<li>The <strong>Resource Manager</strong> keeps track of node state (load, number of jobs, memory use, physical availability) and manages the jobs that it runs on the nodes. Most resource managers <span style="color: black;">use a daemon (resident program) on each node that reports status</span> (both job and node) and starts and stops jobs (see <em>Job Execution Daemon</em> below). Most resource managers use a database to keep track of all resources, submitted requests and running jobs. In addition, not all resources are homogeneous clusters. Indeed, there can be multiple clusters or computers under the control of a single resource manager.</li>
<li>The <strong>Scheduler</strong> does the hard work by determining when and where a job will run. It can be naive or very complicated. The design of the scheduler determines how efficiently the cluster is utilized. Some resource managers provide a modular interface so different schedulers can be used. In addition, many schedulers have various policies and modes that can be adapted to a site's workflow requirements. <span style="color: black;">Schedulers track job priority and compute resource availability, can manage license</span> keys if a job is using licensed software, keep track of the execution time allocated to the user, manage the number of simultaneous jobs allowed for a user, and track estimated and elapsed execution time. Finally, schedulers may also manage multiple queues. There are two basic designs used for job schedulers:
<ul>
<li>A master/worker model, where the job scheduling software is installed on a single machine (the master) and only a very small component (a daemon) is installed on the production machines (nodes). The daemon waits for commands from the master, executes them, and returns the exit code back to the master.</li>
<li>A cooperative model, a decentralized design where each machine is capable of helping with scheduling and can offload locally scheduled jobs to other cooperating machines.</li>
</ul>
</li>
</ul>
<h3>Resource Scheduler Terminology</h3>
<p>While a full discussion of resource scheduling is beyond the scope of this article, some basic concepts can be presented. A good reference can be found at <a href="http://www.scl.ameslab.gov/Publications/Halstead/usenix_2k.pdf">The Portable Batch Scheduler and the Maui Scheduler on Linux Clusters</a>.</p>
<p><strong>Utilization and Turnaround </strong></p>
<p>The goal of any scheduler is to run as many jobs as possible in as short a time as possible while using all resources in a fair way. Depending on your job mix and policies, the utilization rate may vary. Heavily used clusters tend to pay more attention to utilization. Keep in mind that, on a large 512-core cluster, increasing the utilization rate by even 10% can be like adding more than 50 cores (10% of 512 is about 51)!</p>
<p><strong>Prioritization and Fairness </strong></p>
<p>It is the goal of a scheduler administrator to balance resource consumption amongst competing parties and implement policies that address a large number of political concerns.  Fairness can mean different things to different people. It is important that users understand the "fairness" policy of your site as it can depend on many factors, the least of which is your place in the queue.</p>
<p><strong>Fair Share Policy</strong></p>
<p>A fair share policy tracks each user's usage over a set period (a week, month, quarter) and ensures they receive their allocated percentage of computing resources. Tracking historical resource utilization for each user makes it possible to modify job priority. In a fair share mode, users, groups, or projects that give up time to others (because they do not need the cluster) can later get their time back when cluster contention is high.</p>
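<p>As a rough illustration of how tracked usage can feed back into priority, the sketch below compares a user's allocated share with the share they actually consumed. The function name, the parameters, and the simple subtraction are assumptions made for this example; real schedulers use their own (usually more elaborate) formulas.</p>
<pre>
```python
def fair_share_adjustment(target_share, usage, total_usage):
    """Positive when a user is below their allocated share, negative when above.

    target_share -- fraction of the cluster allocated to the user (0.0 to 1.0)
    usage        -- core-hours the user consumed in the tracking window
    total_usage  -- core-hours consumed by everyone in the same window
    """
    if total_usage == 0:
        return 0.0  # nothing has run yet, so nobody is ahead or behind
    actual_share = usage / total_usage
    return target_share - actual_share

# Hypothetical window: 1000 core-hours used in total, both users allocated 50%.
heavy_user = fair_share_adjustment(0.5, 800, 1000)  # consumed 80% of the total
light_user = fair_share_adjustment(0.5, 200, 1000)  # consumed 20% of the total

print(round(heavy_user, 2), round(light_user, 2))
```
</pre>
<p>In this toy model the light user ends up with a positive adjustment and the heavy user with a negative one, which is the "get their time back" behavior described above.</p>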
<p><strong>Urgency Policy</strong></p>
<p>An urgency policy usually associates an urgency value with each job. This urgency value can include resource requirements, resource attributes, a deadline weight and a waiting time weight. For instance, with this type of policy a job waiting in the queue will increase in urgency and move up in priority. Priorities can be based on users, projects or departments. That is, if a project has a deadline, then it may be given higher priority.</p>
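<p>One way to picture an urgency value is as a weighted sum of the contributions listed above. The sketch below is a guess at the general shape only: the function name, the weights, and the units are all hypothetical and not taken from any particular scheduler.</p>
<pre>
```python
def urgency(resource_request, deadline_pressure, wait_seconds,
            w_resource=1.0, w_deadline=1.0, w_wait=0.01):
    """Weighted sum of resource, deadline, and waiting-time contributions."""
    return (w_resource * resource_request
            + w_deadline * deadline_pressure
            + w_wait * wait_seconds)

# The same hypothetical job, scored at submission and after an hour in the queue.
at_submit = urgency(resource_request=4, deadline_pressure=0, wait_seconds=0)
after_hour = urgency(resource_request=4, deadline_pressure=0, wait_seconds=3600)

print(at_submit, after_hour)
```
</pre>
<p>Because the waiting-time term keeps growing, a job that sits in the queue gains urgency even though nothing else about it changes.</p>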
<p><strong>Functional Policy</strong></p>
<p>Functional scheduling is sometimes called priority scheduling. A functional policy setup ensures that a defined share is guaranteed to each user, project, job, or department at any time because shares or tickets are assigned to the user. Unlike a fair share policy, the scheduler does not remember past use. If the cluster is underutilized, jobs run as they enter the queue. If the cluster is full then users, groups, or projects are guaranteed their functional fraction of the cluster.</p>
<p><strong>Reservation </strong></p>
<p>Resource reservation allows constructing a flexible, policy based batch scheduling environment. The user must specify the needed resources (nodes, cores, etc.) and the execution time.</p>
<p><strong>Job Execution Daemon </strong></p>
<p>Virtually all resource schedulers rely on a remote program to manage jobs sent by the master resource manager to the cluster nodes. These programs are referred to as "daemons", which run in the background. The daemon is responsible for starting remote programs, monitoring loads and health, and stopping programs (if necessary). Note that the job execution daemon usually has tighter control over remotely started jobs (those started with <em>rsh</em> or <em>ssh</em>). Many MPI starter programs (e.g. <tt>mpirun</tt>) use <em>rsh</em> or <em>ssh</em> to start remote jobs; however, some MPI versions will now work directly with the resource manager.</p>
<p><strong>Backfill Scheduling </strong></p>
<p>Backfill is a key component of many schedulers. It allows for the periodic analysis of the running queue and the execution of lower priority jobs if it is determined that running these jobs will not delay jobs higher in the queue. For instance, single core jobs can be sprinkled almost anywhere there are open cores. Most small jobs (small number of cores) benefit from this feature and it improves utilization.</p>
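<p>The "will not delay jobs higher in the queue" test can be sketched as follows. This is a simplified check, assuming every job comes with a run time estimate and that the scheduler has already worked out when the job at the head of the queue is expected to start; the function and parameter names are hypothetical.</p>
<pre>
```python
def can_backfill(job_cores, job_runtime, free_cores, now, reserved_start):
    """True if the candidate fits in the idle cores and is expected to end
    before the start time reserved for the job at the head of the queue."""
    fits = free_cores >= job_cores
    ends_in_time = reserved_start >= now + job_runtime
    return fits and ends_in_time

# Hypothetical numbers: 4 idle cores now, and the head-of-queue job is
# expected to start at t=100 when a running job releases its cores.
short_job = can_backfill(job_cores=4, job_runtime=60, free_cores=4,
                         now=0, reserved_start=100)
long_job = can_backfill(job_cores=4, job_runtime=500, free_cores=4,
                        now=0, reserved_start=100)

print(short_job, long_job)
```
</pre>
<p>The check is deliberately conservative: it rejects any job that would still be running at the reserved start time, even if that job would use cores the waiting job does not need. It also shows why run time estimates matter, since a job that runs longer than it claimed can still push back the queue.</p>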
<p><strong>Robust Availability </strong></p>
<p>Because the user is assigned cores or nodes at the time their job runs, they have no control or knowledge of the actual hardware that is used (unless they request a specialized node or nodes). This provides a layer of abstraction that protects the user from having to manage hardware failures. Of course if a job is running and a node crashes, the job will stop, but other jobs not using the node will continue to run. This feature also allows the scheduler or administrator to take nodes out of the active queue for maintenance or repair without the users ever knowing.</p>
<p><strong>Failed Job Restart</strong></p>
<p>Many schedulers have the capability to restart a failed job. This can be helpful if hardware fails during a run, as the job will be restarted. This feature can be dangerous, however, because jobs that fail due to a programmer error will also restart and thus can create a cycle of continuous job failure and re-submission. In addition, failed job restart can also be a problem if changes are made to permanent files or other resources, like a database.</p>
<p><strong>SMP Aware Queuing </strong></p>
<p>The advent of multi-core has made SMP (Symmetric Multi-Processing) scheduling an important issue. Additionally, processor/core efficiency is a concern with today's multi-core designs. There is some question as to where this control should be based. Currently most MPI packages can provide some control, but ultimately the scheduler should be in charge of cores and processor affinity. This level of scheduling will require co-ordination between the scheduler and the MPI starter program. In addition, applications may request multiple cores for multiple processes on multiple nodes or a number of cores for a single multi-threaded process.</p>
<h3>Popular Open Source Resource Schedulers</h3>
<p>The following is a listing of the popular "open" resource schedulers. These packages are freely available, but check the license terms for each package before you deploy it on your cluster. In general, the open packages work quite well for the average cluster installation. It should be noted that these are not cut-down or miniature versions of commercial products but real production level packages. As such, configuration and administration are often not trivial. All the packages have excellent documentation and support communities so it is possible to <em>learn your way</em> to success. Please see the individual package web sites for more information. The following summaries are based on the package descriptions.</p>
<ul>
<li><a href="http://www.hpccommunity.org/index.php?pageid=lava">Platform Lava</a> is an open source entry-level workload scheduler designed to meet a wide range of workload scheduling needs for clusters of up to 512 nodes. Scalability is important for a workload manager in several dimensions, including the number of physical hosts in the system, the number of CPUs managed, the number of users with work in the system, and the number of jobs running and pending. Lava was designed not to lose a job once submitted to the system. Lava will reliably continue operation even under conditions of very heavy load without losing any pending or active jobs. In addition, as long as one host in the Lava cluster is operational, the system will continue to run. Lava supports multiple candidate master hosts and fails back seamlessly when the master has recovered. This gives Lava tremendous fault tolerance compared to a traditional master-slave setup.</li>
<li><a href="http://www.collab.net/">Sun Grid Engine</a> (SGE) is a Distributed Resource Management (DRM) software system (i.e. a resource scheduler). Previously known as CODINE (COmputing in DIstributed Networked Environments) or GRD (Global Resource Director), SGE software aggregates compute power and delivers it as a network service. Grid Engine software is used with a set of computers to create powerful compute farms and clusters which are used in a wide range of technical computing applications. SGE presents users with a seamless, integrated computing capability where they can start interactive jobs, batch jobs, and sets of repetitious jobs (parametric processing). SGE supports a wide variety of scheduling policies and is almost identical to the commercial version offered by Sun Microsystems.</li>
<li><a href="https://computing.llnl.gov/linux/slurm/">SLURM</a> (Simple Linux Utility for Resource Management) is an open-source resource manager designed for Linux clusters of all sizes. It provides three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (typically a parallel job) on a set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.<br />
Optional plugins can be used for accounting, advanced reservation, gang scheduling (time sharing for parallel jobs), backfill scheduling, resource limits by user or bank account, and sophisticated multifactor job prioritization algorithms.</li>
<li><a href="http://www.cs.wisc.edu/condor/">Condor</a> is a high-throughput computing software framework for coarse-grained distributed parallelization of computationally intensive tasks. It can be used to manage workloads on a dedicated cluster of computers, and/or to farm out work to idle desktop computers -- so-called cycle scavenging. Condor runs on Linux, Unix, Mac OS X, FreeBSD, and contemporary Windows operating systems. Condor can seamlessly integrate both dedicated resources (rack-mounted clusters) and non-dedicated desktop machines (cycle scavenging) into one computing environment.</li>
<li><a href="http://www.clusterresources.com/products/torque-resource-manager.php">Torque</a> is an open source resource manager providing control over batch jobs and distributed compute nodes. It is a community effort based on the original PBS project (Portable Batch System) and, with more than 1,200 patches, has incorporated significant advances in the areas of scalability, fault tolerance, and feature extensions contributed by NCSA, OSC, USC, the U.S. Dept of Energy, Sandia, PNNL, U of Buffalo, TeraGrid, and many other leading edge HPC organizations. This version may be freely modified and redistributed subject to the constraints of the included license. Torque is designed to work with the Maui Scheduler (below).</li>
<li><a href="http://www.clusterresources.com/products/maui-cluster-scheduler.php/">Maui Cluster Scheduler</a> is an open source job scheduler for clusters and supercomputers. It is an optimized, configurable tool capable of supporting an array of scheduling policies, dynamic priorities, extensive reservations, and fairshare capabilities. It is currently in use at hundreds of government, academic, and commercial sites throughout the world. Maui must be used with a resource manager such as Torque (above). Note, this Maui is not associated with the <a href="http://mauischeduler.sourceforge.net/">Maui Scheduler Molokini Edition</a> which was developed as a project on the SourceForge site independent of the original Maui scheduler.</li>
</ul>
<h3>Commercial Resource Schedulers</h3>
<p>While the freely available resource schedulers are quite robust, there are many users who require commercial support and advanced features not found in the open versions. The following are the popular commercial schedulers for x86 platforms.</p>
<ul>
<li><a href="http://www.platform.com/workload-management/high-performance-computing/lp">Platform LSF</a> allows you to manage and accelerate batch workload processing for mission-critical compute- or data-intensive application workloads. With Platform LSF you can intelligently schedule and guarantee the completion of the batch workload across your distributed, virtualized, High Performance Computing (HPC) environment. The benefits include maximum resource utilization and a commercially supported product.</li>
<li><a href="http://www.pbsworks.com/(X(1)S(kkqbzh45oo35bsb2djeqvfby))/Default.aspx?AspxAutoDetectCookieSupport=1">PBSPro</a> is based on the popular PBS (Portable Batch System) and is considered a service-oriented architecture, field-proven grid infrastructure software that increases productivity even in the most complex computing environments. It efficiently distributes workloads across cluster, SMP, and hybrid configurations, scaling easily to hundreds or even thousands of processors. PBSPro is in active use at more than 1400 sites worldwide.</li>
<li><a href="http://www.oracle.com/us/products/tools/oracle-grid-engine-075549.html">Sun N1 Grid Engine</a> is based on the open source Grid Engine project (see above), and as such, is focused on development of future versions of Grid Engine software through the community process. In addition to the open source functionality, the Sun-branded N1 Grid Engine software suite offers: ARCo (Accounting and Reporting Console), a web-based interface to accounting and monitoring data; GEMM (Grid Engine Management Module), an SCS (Sun Control Station) plugin for automatic deployment and monitoring of an N1 Grid Engine cluster; Microsoft Windows support for job execution; and a commercial support contract and services available from Sun.</li>
<li><a href="http://www.clusterresources.com/products/moab-cluster-suite/workload-manager.php">Moab Workload Manager</a> is a cluster workload management package that integrates the scheduling, managing, monitoring and reporting of cluster workloads. The Moab Cluster Suite simplifies and unifies management across one or multiple hardware, operating system, storage, network, license and resource manager environments. Moab's development was based on the Open Source Maui job scheduling package (see above).</li>
<li> <a href="http://www-03.ibm.com/systems/software/loadleveler/index.html">Tivoli Workload Scheduler LoadLeveler</a> is a parallel job scheduling system that allows users to run more jobs in less time by matching each job's processing needs and priority with the available resources, thereby maximizing resource utilization. LoadLeveler also provides a single point of control for effective workload management, offers detailed accounting of system utilization for tracking or chargeback and supports high availability configurations.</li>
</ul>
<h3>Conclusion</h3>
<p>Resource schedulers are the backbone of HPC clusters. They represent an important layer between the user and the cluster hardware. Indeed, without a resource scheduler, using a large cluster would be almost impossible and terribly inefficient. Thankfully, there are many options, both freely available and from the commercial sector, that help deliver the full potential of your cluster.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.clusterconnection.com/2009/07/sharing-the-load-cluster-resource-schedulers/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Cluster Interconnects: Messaging Rate</title>
		<link>http://www.clusterconnection.com/2009/07/cluster-interconnects-messaging-rate/</link>
		<comments>http://www.clusterconnection.com/2009/07/cluster-interconnects-messaging-rate/#comments</comments>
		<pubDate>Wed, 01 Jul 2009 17:37:33 +0000</pubDate>
		<dc:creator>Douglas Eadline</dc:creator>
				<category><![CDATA[Blogs]]></category>
		<category><![CDATA[Featured]]></category>
		<category><![CDATA[bandwidth]]></category>
		<category><![CDATA[benchmark]]></category>
		<category><![CDATA[interconnect]]></category>
		<category><![CDATA[latency]]></category>
		<category><![CDATA[message rate]]></category>
		<category><![CDATA[MPI]]></category>
		<category><![CDATA[N/2]]></category>

		<guid isPermaLink="false">http://www.clusterconnection.com/2009/07/cluster-interconnects-messaging-rate/</guid>
		<description><![CDATA[If the interconnect creates the cluster, then how do we measure it? The cluster interconnect is a big contributor to application performance. Use a slow interconnect and your processors may end up idle while they wait for data. A balanced CPU/Interconnect is important for getting the most from your cluster. For example, the better the interconnect, [...]]]></description>
			<content:encoded><![CDATA[<p><em>If the interconnect creates the cluster, then how do we measure it?</em></p>
<p>The cluster interconnect is a big contributor to application performance. Use a slow interconnect and your processors may end up idle while they wait for data. A balanced CPU/interconnect combination is important for getting the most from your cluster. For example, the better the interconnect, the more it costs, which means the less you can spend on node or storage hardware. If your applications are embarrassingly parallel, then money spent on a fast interconnect may be better used for more nodes. The converse is true as well -- using a slower, cheaper network may throttle high throughput applications, so fewer nodes and a faster network will work better.</p>
<p>It is often said that the best benchmark of a cluster is your application(s). While this is certainly true, it is not always possible. Looking at performance data for applications similar to yours is also an option. While this can be helpful, many users often start with micro or single point measurements that help qualify a given interconnect.</p>
<p>Traditionally these micro measurements have been bandwidth, latency, and N/2. Bandwidth, or throughput, is probably the most often quoted performance metric. This number is also used to identify networks (e.g. Gigabit Ethernet is one billion bits per second (bps), 10-GigE is 10 billion bps). The throughput does vary by payload size (message size), so the maximum possible data rate is often reported for large payloads. Latency affects throughput and is often an important feature in HPC networks. Latency can be thought of as the set-up and tear-down time required for a message. For example, traveling by plane is very fast, but the time spent at the airport prior to and after the flight is the "latency" of the flight. The smaller the message, the more the latency matters. Using our airplane analogy, if you are flying from New York to Boston, a 2 hour airport latency is a large part of your travel time. If you were traveling to Tokyo, then the airport time contributes much less to the overall trip time. Many HPC applications require lower latency because they use many shorter messages.</p>
<p>Because latency is so important in HPC, many interconnects report what is known as <em>single byte latency</em> (i.e. the latency or overhead required to send a single byte of data). Indeed, the competition to produce the lowest HPC latency is quite fierce. Currently, a low latency interconnect has a single byte latency of between 1 and 3 microseconds using specialized interconnect protocols (i.e. not standard kernel networking). For comparison, GigE has a latency range of 20-80 microseconds using TCP/IP.</p>
<p>When evaluating networks, throughput and latency are not the whole story, however. The N/2 point is also included in the list of performance numbers. N/2 is defined as the payload size at which the bandwidth is at half its maximum. Recall that bandwidth is dependent on payload size and, due to latency, the smaller the payload, the smaller the bandwidth. As the payload size increases, a maximum throughput is achieved. An example curve is shown in <em>Figure One</em> below.</p>
<p><center><img class="aligncenter size-full wp-image-1156" title="example-bandwidth-n2" src="/wordpress/wp-content/uploads/2009/06/example-bandwidth-n2.png" alt="example-bandwidth-n2" width="480" height="370" /><br /><em>Figure One: example throughput N/2 curve</em></center></p>
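<p>The relationship between latency, bandwidth, and N/2 can be illustrated with a very simple cost model in which every message takes a fixed latency plus the payload size divided by the peak bandwidth. This model is an assumption made for the sketch (real curves such as the one in Figure One are measured, not computed), and the latency and peak values below are hypothetical.</p>
<pre>
```python
def bandwidth(payload_bytes, latency_s, peak_bytes_per_s):
    """Effective bandwidth when each message costs latency plus transfer time."""
    transfer_time = latency_s + payload_bytes / peak_bytes_per_s
    return payload_bytes / transfer_time

def n_half(latency_s, peak_bytes_per_s):
    """Payload size at which this model reaches half the peak bandwidth."""
    return latency_s * peak_bytes_per_s

latency = 2e-6  # assumed: 2 microseconds
peak = 1e9      # assumed: one billion bytes per second

n2 = n_half(latency, peak)
print(n2, bandwidth(n2, latency, peak) / peak)
```
</pre>
<p>In this model N/2 is simply latency multiplied by peak bandwidth, so at a fixed peak bandwidth a lower latency gives a lower N/2. That is why a low N/2 value is taken as a sign that small messages can still use a good fraction of the link.</p>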
<p>Traditionally, the above information was often considered a guide to determine a <em>good</em> cluster interconnect (i.e. it should be high bandwidth, low latency, and low N/2 value). There are also benchmarks that you can run to determine the values mentioned above. Most HPC interconnect benchmarks are written in <a href="http://www.mcs.anl.gov/research/projects/mpi/">MPI</a> (Message Passing Interface), but other protocols can be used as well. The following are some freely available benchmarks that can be used to measure an interconnect.</p>
<ul>
<li><a href="http://www.scl.ameslab.gov/netpipe/">NetPIPE</a> is a protocol-independent (e.g. TCP, MPI, MPI-2, SHMEM, TCGMSG, PVM, and others) performance tool that visually represents the network performance under a variety of conditions.</li>
<li><a href="http://software.intel.com/en-us/articles/intel-mpi-benchmarks/">Intel MPI benchmarks</a> are a set of MPI benchmarks that will thoroughly exercise a network interconnect.</li>
<li><a href="http://mvapich.cse.ohio-state.edu/benchmarks/">OMB</a> (OSU Micro-Benchmarks) are a set of MPI benchmarks that address point-to-point communication and some multiple communication patterns common on most multi-core platforms.</li>
</ul>
<p>Using throughput, latency, and N/2 to evaluate an interconnect worked rather well before the multi-core revolution. The ability to accurately assess how a network performs with these metrics rests on the assumption that cluster nodes communicate in a <em>point-to-point</em> fashion. That is, a single process on one cluster node is communicating with a single process on another node. With today's multi-core nodes, the number of processes on a given node can easily be eight or more. The increased number of processes means that a single point-to-point number may not reflect the real performance of a cluster because there is now more contention for the interconnect -- the node is sending and receiving more messages.</p>
<p>To help address this problem, the "message rate" metric was developed. Message rate is defined as the number of messages transmitted in a period of time and is determined by taking the measured bandwidth (bytes/second) and dividing it by the length of the message (typically done for 0 or 2 bytes), resulting in a messages-per-second metric. The larger the number of messages a multi-core node can send, the better it is expected to work on HPC codes. Note that high bandwidth, low latency, and low N/2 do not necessarily imply a good messaging rate.</p>
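<p>The calculation described above is small enough to write out. The sketch below uses hypothetical numbers and function names. Because dividing by a zero-byte message length is not defined, the second helper assumes that in that case the messages are simply counted over the elapsed time.</p>
<pre>
```python
def message_rate(bandwidth_bytes_per_s, message_bytes):
    """Messages per second from a measured bandwidth at a given message size."""
    if message_bytes == 0:
        raise ValueError("count messages directly for zero-byte payloads")
    return bandwidth_bytes_per_s / message_bytes

def message_rate_from_count(messages_sent, elapsed_s):
    """Messages per second from a direct count, usable for any payload size."""
    return messages_sent / elapsed_s

# Hypothetical measurement: 2-byte messages moving 4 million bytes per second.
print(message_rate(4_000_000, 2))

# Hypothetical measurement: one million messages sent in half a second.
print(message_rate_from_count(1_000_000, 0.5))
```
</pre>
<p>Both hypothetical measurements work out to two million messages per second. On a multi-core node the interesting number is the aggregate rate with all processes sending at once, not the rate of a single pair.</p>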
<p>In closing, even with a good messaging rate, single point numbers are not always the best indicator for interconnect performance. As stated, your application is still the best measure of performance. And remember, one good benchmark that mirrors your applications is worth a hundred opinions.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.clusterconnection.com/2009/07/cluster-interconnects-messaging-rate/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Cluster or Constellation</title>
		<link>http://www.clusterconnection.com/2009/06/cluster-or-constellation/</link>
		<comments>http://www.clusterconnection.com/2009/06/cluster-or-constellation/#comments</comments>
		<pubDate>Wed, 24 Jun 2009 16:50:57 +0000</pubDate>
		<dc:creator>Douglas Eadline</dc:creator>
				<category><![CDATA[Blogs]]></category>
		<category><![CDATA[Featured]]></category>
		<category><![CDATA[Cluster]]></category>
		<category><![CDATA[constellation]]></category>
		<category><![CDATA[cores]]></category>
		<category><![CDATA[HPC]]></category>
		<category><![CDATA[MPI]]></category>

		<guid isPermaLink="false">http://www.clusterconnection.com/2009/06/cluster-or-constellation/</guid>
		<description><![CDATA[Is there really a difference between cluster designs? In a previous entry, I talked about Capacity vs Capability Clusters. Some may argue, "What is the difference?" As I mentioned, there are some design differences, but my main point was to sort out how people use clusters. That is, not everybody is running their code on [...]]]></description>
			<content:encoded><![CDATA[<p><em>Is there really a difference between cluster designs?</em></p>
<p>In a previous entry, I talked about <a href="/2009/06/capacity-vs-capability-clusters/">Capacity vs Capability Clusters</a>. Some may argue, "What is the difference?" As I mentioned, there are some design differences, but my main point was to sort out how people use clusters. That is, not everybody is running their code on 25,000 cores.</p>
<p>Indeed, in a pedantic sense, you may not even be using a cluster; you may actually be running on a constellation. Please note, I do not mean the Sun "Constellation" brand cluster, but rather I am talking about a definition used by Beowulf pioneer Tom Sterling:</p>
<p><em>A constellation is a cluster of large SMP nodes scaled such that the number of processors per node is greater than the number of nodes.</em></p>
<p>Using this definition, a constellation becomes a cluster when the number of nodes equals or exceeds the number of cores per node. Today, the eight core node is standard which implies that you need eight nodes (or more) to be called a cluster. Therefore, a modern day cluster should have 64 or more cores. There was a time when 64 processors (single core) was considered a large cluster. By today's standards and Sterling's taxonomy, it may not even be a cluster!</p>
<p>Again, users may ask, "What is the point? My codes run on either system". I would agree to a point. If your code uses eight cores or fewer, then you probably want to run on a single node. And, as core counts and processor sockets increase, the cores per node will easily double. Thus, if your code requires 16 cores or fewer, you will again stay within a single node. In terms of programming, this may swing many decisions away from MPI toward a threaded solution. This could be a major break in the way HPC is done on clustered systems.</p>
<p>By the way, for a system built from 16 core nodes (e.g. four socket quad-core) to be considered a cluster, it must have at least 16 nodes, which implies 256 cores. Interestingly, using the definition above, doubling the number of cores per node quadruples the minimum total core count required by our cluster definition.</p>
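<p>The taxonomy is simple enough to write down as a short Python sketch. The function names are invented for this illustration; the rule itself is just the definition quoted above.</p>

```python
def classify(nodes, cores_per_node):
    """Apply the definition above: more cores per node than nodes is a constellation."""
    if cores_per_node > nodes:
        return "constellation"
    return "cluster"


def min_cores_for_cluster(cores_per_node):
    """Smallest total core count that still counts as a cluster.

    A cluster needs at least as many nodes as cores per node,
    so the minimum is cores_per_node squared.
    """
    return cores_per_node * cores_per_node


print(classify(4, 8))             # four eight-core nodes
print(classify(8, 8))             # eight eight-core nodes
print(min_cores_for_cluster(8))   # eight-core nodes
print(min_cores_for_cluster(16))  # sixteen-core nodes
```

<p>Because the minimum grows as the square of the cores per node, doubling the node size multiplies the threshold by four, which is the 64 versus 256 core jump described above.</p>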
]]></content:encoded>
			<wfw:commentRss>http://www.clusterconnection.com/2009/06/cluster-or-constellation/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>MPI Choices</title>
		<link>http://www.clusterconnection.com/2009/06/mpi-choices/</link>
		<comments>http://www.clusterconnection.com/2009/06/mpi-choices/#comments</comments>
		<pubDate>Tue, 16 Jun 2009 16:34:55 +0000</pubDate>
		<dc:creator>Douglas Eadline</dc:creator>
				<category><![CDATA[Blogs]]></category>
		<category><![CDATA[Featured]]></category>
		<category><![CDATA[API]]></category>
		<category><![CDATA[ICR]]></category>
		<category><![CDATA[Intel Cluster Ready]]></category>
		<category><![CDATA[Message Passing]]></category>
		<category><![CDATA[MPI]]></category>
		<category><![CDATA[MPICH2]]></category>
		<category><![CDATA[MVAPICH]]></category>
		<category><![CDATA[Open MPI]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[Programming]]></category>

		<guid isPermaLink="false">http://www.clusterconnection.com/2009/06/mpi-choices/</guid>
		<description><![CDATA[Choices, choices, choices. What should you expect from parallel computing software? Talk to anyone that works with high-performance computing clusters and at some point the conversation is bound to turn to MPI. MPI stands for "Message Passing Interface". As an "Interface" -- or API (Application Programming Interface) -- MPI does not work any magic on [...]]]></description>
			<content:encoded><![CDATA[<p>Choices, choices, choices. What should you expect from parallel computing software?</p>
<p>Talk to anyone that works with high-performance computing clusters and at some point the conversation is bound to turn to MPI.</p>
<p>MPI stands for "Message Passing Interface". As an "Interface" -- or API (Application Programming Interface) -- MPI does not work any magic on its own and must be coupled with a programming language like Fortran or C/C++. Once a program is MPI-enabled (i.e. written in a way to pass messages/data between computing processes) it can run on a cluster or even on a multi-core node.</p>
<p>MPI was initially developed because vendors were each creating custom message passing libraries, leading to non-portable applications and extra coding. To resolve this, everyone needed to be working off a global template, or standard. With a standard, users could move programs from one parallel platform to another without any major work (at least in theory).</p>
<p>Today, MPI is a standard maintained by the <a href="http://www.mpi-forum.org/">MPI Forum</a>. Because of this, there are a plethora of MPI choices for cluster users.</p>
<p>MPI versions generally fall into one of two camps: Open Source and commercial.</p>
<p>Historically, many of the open source versions of MPI were shared among early HPC users and thus became very popular.</p>
<p><strong>Open Source MPI</strong></p>
<p>The first two open source MPI versions were <a href="http://www.mcs.anl.gov/research/projects/mpich2/">MPICH</a> and <a href="http://www.lam-mpi.org/">LAM/MPI</a>. MPICH was developed by Argonne National Lab while LAM/MPI grew out of a number of university efforts. As great as these packages are, their use is now discouraged as no maintenance or code updates are planned for either version.</p>
<p>In their place, users should consider <a href="http://www.mcs.anl.gov/research/projects/mpich2/">MPICH2</a> and <a href="http://www.open-mpi.org/">Open MPI</a>. If you are using InfiniBand, you may also want to look at <a href="http://mvapich.cse.ohio-state.edu/">MVAPICH</a>. (MPICH2 and Open MPI provide InfiniBand support as well.) These versions are under active development. There are other open source MPI projects, but these are the most popular.</p>
<p><strong>Commercial MPI</strong></p>
<p>In terms of commercial MPI versions the most popular are <a href="http://software.intel.com/en-us/articles/intel-mpi-library/">Intel MPI</a>, <a href="http://www.scali.com/cluster-computing/platform-mpi/platform-mpi">Platform MPI</a> (formerly Scali MPI), and <a href="https://h20392.www2.hp.com/portal/swdepot/displayProductInfo.do?productNumber=MPISW">HP-MPI</a>. Each has a variety of custom features, including support, that are worth considering for your cluster.</p>
<p><strong>Picking an MPI</strong></p>
<p>With all the choices, which MPI is right for you?</p>
<p>One would think that because it is a standard, all MPI implementations are compatible and work the same. As for compatibility, most programs written with one MPI <em>should</em> be able to be recompiled using another MPI library. There are some issues that crop up, but for the most part the API is portable.</p>
<p>Implementation, however, can vary quite a bit. Indeed, there are many details of the MPI implementation that are not covered by the specification (this was intentional). Various MPIs offer different features like better performance, run-time interconnect choice, or start-up methods.</p>
<p><strong>One Solution</strong></p>
<p>Although the multiple MPI versions are welcomed by users, Independent Software Vendors (ISVs) find the multiple MPI versions difficult to support. One solution to this problem is the <a href="http://software.intel.com/en-us/cluster-ready/">Intel Cluster Ready</a> (ICR) initiative. ICR ensures that vendors will find a standard software environment on every ICR certified cluster. Check with your cluster vendor about ICR options.</p>
<p>Intel Cluster Ready® will not limit your MPI choices; it will just make sure the ISV codes can run on your cluster.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.clusterconnection.com/2009/06/mpi-choices/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Dell and Intel® Cluster Ready Tech Brief: Simplifying HPC Clusters</title>
		<link>http://www.clusterconnection.com/2009/04/dell-and-intel-cluster-ready-tech-brief-simplifying-hpc-clusters/</link>
		<comments>http://www.clusterconnection.com/2009/04/dell-and-intel-cluster-ready-tech-brief-simplifying-hpc-clusters/#comments</comments>
		<pubDate>Wed, 01 Apr 2009 22:40:09 +0000</pubDate>
		<dc:creator>Marcel Van Drunen</dc:creator>
				<category><![CDATA[Briefs]]></category>
		<category><![CDATA[Clusters]]></category>
		<category><![CDATA[Dell]]></category>
		<category><![CDATA[high performance computing]]></category>
		<category><![CDATA[HPC]]></category>
		<category><![CDATA[ICR]]></category>
		<category><![CDATA[Intel Cluster Ready]]></category>
		<category><![CDATA[Intel MPI]]></category>
		<category><![CDATA[MPI]]></category>
		<category><![CDATA[simplify]]></category>
		<category><![CDATA[supercomputing]]></category>

		<guid isPermaLink="false">http://www.clusterconnection.com/2009/04/dell-and-intel-cluster-ready-tech-brief-simplifying-hpc-clusters/</guid>
		<description><![CDATA[[Excerpt] Many organizations have departments and workgroups that could benefit from high performance computing (HPC) resources to analyze, model, and visualize the growing volumes of data they need to conduct business. Unfortunately, these groups often do not have sufficient IT support and may lack the specialized IT skills required to run their own HPC clusters. [...]]]></description>
			<content:encoded><![CDATA[<p>[Excerpt] Many organizations have departments and workgroups that could benefit from high performance computing (HPC) resources to analyze, model, and visualize the growing volumes of data they need to conduct business. Unfortunately, these groups often do not have sufficient IT support and may lack the specialized IT skills required to run their own HPC clusters.</p>
<p><a href="/wordpress/wp-content/uploads/2009/04/ziff-davis-dell-icr-wp.pdf">Click here to download pdf</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.clusterconnection.com/2009/04/dell-and-intel-cluster-ready-tech-brief-simplifying-hpc-clusters/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

