<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Cluster Connection &#187; API</title>
	<atom:link href="http://www.clusterconnection.com/tag/api/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.clusterconnection.com</link>
	<description>Simplify HPC. Share the knowledge.</description>
	<lastBuildDate>Fri, 30 Dec 2011 21:23:37 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1</generator>
		<item>
		<title>Comparing MPI and OpenMP</title>
		<link>http://www.clusterconnection.com/2009/08/comparing-mpi-and-openmp/</link>
		<comments>http://www.clusterconnection.com/2009/08/comparing-mpi-and-openmp/#comments</comments>
		<pubDate>Thu, 20 Aug 2009 19:48:06 +0000</pubDate>
		<dc:creator>Douglas Eadline</dc:creator>
				<category><![CDATA[Blogs]]></category>
		<category><![CDATA[Featured]]></category>
		<category><![CDATA[Hardware]]></category>
		<category><![CDATA[API]]></category>
		<category><![CDATA[Intel Cluster Ready]]></category>
		<category><![CDATA[MPI]]></category>
		<category><![CDATA[OpenMP]]></category>

		<guid isPermaLink="false">http://www.clusterconnection.com/2009/08/comparing-mpi-and-openmp/</guid>
		<description><![CDATA[The hardware environment may determine the best parallel programming tool to use The advent of multi-core processors has increased the need for parallel programs on the largest to the smallest of systems (clusters to laptops). There are many ways to express parallelism in a program. In HPC, the MPI (Message Passing Interface) has been the [...]]]></description>
			<content:encoded><![CDATA[<p><em>The hardware environment may determine the best parallel programming tool to use</em></p>
<p>The advent of multi-core processors has increased the need for parallel programs on systems from the largest to the smallest (clusters to laptops). There are many ways to express parallelism in a program. In HPC, MPI (Message Passing Interface) has been the main tool of most programmers. MPI is often talked about as though it is a computer language on its own. In reality, MPI is an API (Application Programming Interface), or programming library, that allows Fortran and C (and sometimes C++) programs to send messages to each other.</p>
<p>Another method to express parallelism is OpenMP. Unlike MPI, OpenMP is not primarily a library of calls but an extension to the compiler. To use OpenMP, the programmer adds directives to the program (<tt>#pragma</tt> lines in C, specially formatted comments in Fortran) that tell the compiler which regions to run in parallel. The resulting program uses operating system threads to run in parallel. Operating system threads can be thought of as separate subroutines running at the same time that share the same memory space. In addition to the fact that "MP" is in both the names of these methods, there is often some confusion about how each of these parallel paradigms works and where/when they should be applied. This article will explain the differences and provide a better understanding of these two powerful technologies.</p>
<p>While some programming issues will be discussed, this article is not a programming tutorial. If you are interested in using MPI and OpenMP, please see the <a href="http://www.linux-mag.com/id/5759">MPI In 30 Minutes</a> and <a href="http://www.linux-mag.com/id/4609">OpenMP in 30 Minutes</a> tutorials. These tutorials will get you started quickly. We will also be talking about some of the example programs mentioned in these articles. There are links to the source code in both articles should you wish to try some things on your own.</p>
<h3>Background</h3>
<p>Many people are surprised to learn that MPI has not been certified by any major standards organization. Instead, the <a href="http://www.mpi-forum.org/">MPI Forum</a> creates and maintains the MPI standard. You can find more background and where to obtain both open and commercial MPI versions in the article <a href="/2009/06/mpi-choices/">MPI Choices</a>. When writing MPI codes, the programmer must explicitly add message passing calls to a program. Quite often, existing sequential programs are modified, but new "parallel applications" can be written as well. In terms of programming difficulty, MPI is conceptually straightforward, but it is possible to build programs that are hard to follow because many things are happening at the same time. In addition, starting, monitoring, and debugging MPI programs across a cluster can sometimes lead to extra work not found when running on a single server. There are also tools to assist with MPI programs, such as <a href="http://software.intel.com/en-us/articles/intel-trace-analyzer/">Intel Trace Analyzer</a>.</p>
<p>OpenMP was developed because programming with native operating system threads (often referred to as POSIX threads or Pthreads) can be cumbersome. To aid with thread programming, a higher level of abstraction was developed and called OpenMP. As with all higher-level approaches, some flexibility is sacrificed for ease of coding. At its core, OpenMP uses threads, but the details are hidden from the programmer. As mentioned, OpenMP is implemented as compiler directives (pragmas in C, specially formatted comments in Fortran). Typically, computationally heavy loops are augmented with OpenMP directives that the compiler uses to automatically "thread the loop." This type of approach has the distinct advantage that it may be possible to leave the original program "untouched" (except for directives) and provide simple recompilation for a sequential (non-threaded) version where the OpenMP directives are ignored. More information can be found at the <a href="http://openmp.org/wp/">OpenMP</a> website.</p>
<p>OpenMP is supported by all major Fortran and C compilers (including <a href="http://gcc.gnu.org/">gcc</a>/<a href="http://gcc.gnu.org/fortran/">gfortran</a> and the <a href="http://software.intel.com/en-us/articles/intel-compilers/">Intel Compilers</a>). From a programmer's standpoint, working with OpenMP is easier than working with MPI, at least initially. A program with added pragmas can still function as a sequential (single core) program, so programmers can add parallelism incrementally. Users can still create complex, hard to understand programs, but as with MPI there are tools, like <a href="http://software.intel.com/en-us/articles/intel-vtune-amplifier-xe/">Intel VTune</a>, to assist the programmer with OpenMP.</p>
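<p>As a rough illustration of what "threading a loop" can look like in C, here is a minimal sketch. The function name <tt>axpy</tt> and its arguments are assumptions made for this example; they are not taken from the tutorial programs. The only OpenMP-specific line is the pragma.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Computes y = a*x + y for n elements.
 * Built with OpenMP enabled (for example, gcc -fopenmp), the loop
 * iterations are divided among threads. Built without it, the pragma
 * is ignored and the very same loop runs sequentially. */
void axpy(long n, double a, const double *x, double *y)
{
    long i;

    assert(x != NULL && y != NULL);

    #pragma omp parallel for
    for (i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}
```

<p>Each iteration writes only its own <tt>y[i]</tt>, which is why the loop can be split among threads without any further synchronization.</p>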
<h3>Copy or Share</h3>
<p>When discussing MPI or message passing methods, one obvious aspect is often overlooked: message passing is basically memory copying. Let's consider a simple MPI message from one program to another.</p>
<p>The sending program sends the text message "Hello over there" and the receiving program responds with "What's up?". The sending program will construct the "Hello over there" string in memory, then send it to the receiving program. The receiving program will take the string and place it into its own memory. There are now two copies of the string. The reply works exactly the same way. This type of communication is best for distributed memory systems like clusters. Note that I did not state where the processes were located. By design, MPI processes can be located on the same server or on separate servers. Regardless of where it runs, each MPI process has its own memory space from which messages are copied.</p>
<p>In contrast, in a threaded or OpenMP environment, communication happens differently. If one thread wants to communicate with another thread, it would <em>say</em>, there is a message at this memory location (i.e. "Hello over there"). The receiving thread would look at the memory location and then <em>tell</em> the sender my response is here. There is no copied data; there is one copy and it is shared between threads. (Note, this is not strictly how it happens, but the "no copying" is what is important.)</p>
<p>As mentioned, MPI can run across distributed servers and on SMP (multi-core) servers. OpenMP, however, is best run on a single SMP server or on multiple servers using something like <a href="http://www.scalemp.com/">ScaleMP</a>. There is also a product called <a href="http://software.intel.com/en-us/whatif/">Cluster OpenMP*</a> that can run OpenMP applications across a cluster. Without such tools, OpenMP is restricted to a single operating system domain (e.g., a single server), which is why MPI codes usually scale to larger numbers of servers.</p>
<p>There is another subtle difference between OpenMP and MPI applications that run on a single server. In OpenMP, communication is through shared memory, which means <em>threads share access</em> to a memory location. MPI programs on SMP systems also communicate through shared memory, but <em>processes send messages by reading and writing to shared memory</em>. The messages are still copied from one process space to another. Obviously, sharing memory locations seems more efficient than sending copies of memory locations to other processes, but it all depends. In the MPI process model, each process has exclusive access to all of its own memory. For some programs, this situation may be more efficient because it is better to copy data (send a message) than to wait for shared memory access. On the other hand, in the OpenMP model, threads can share access to all memory in the process space. In this case, some programs may be much more efficient as the overhead of copying memory is not needed.</p>
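<p>The copy-versus-share distinction can be sketched in plain C. This is only an analogy inside a single process: it uses no MPI or OpenMP calls, and the names <tt>deliver_by_copy</tt>, <tt>deliver_by_sharing</tt>, and <tt>MSG_MAX</tt> are made up for the illustration.</p>

```c
#include <assert.h>
#include <string.h>

#define MSG_MAX 64

/* Message-passing style: the receiver ends up with its own copy of the
 * bytes, so two independent copies of the string exist afterwards. */
void deliver_by_copy(const char *sent, char received[MSG_MAX])
{
    assert(strlen(sent) < MSG_MAX);
    memcpy(received, sent, strlen(sent) + 1);
}

/* Shared-memory style: the receiver is only told where the message
 * lives, so both sides refer to the same single copy. */
const char *deliver_by_sharing(const char *sent)
{
    return sent;
}
```

<p>After <tt>deliver_by_copy</tt>, changing the sender's buffer leaves the receiver's copy untouched. After <tt>deliver_by_sharing</tt>, any change the sender makes is visible to the receiver, which is both the benefit and the hazard of shared memory.</p>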
<h3>Compiling and Running Code</h3>
<p>We are going to use three programs from the tutorial articles mentioned above. The programs are simple matrix multiplications with the same underlying code:</p>
<ul>
<li><tt>matmul.c</tt> - a sequential (runs on a single core) version of the matrix multiplication program</li>
<li><tt>matmul_omp.c</tt> - an OpenMP version of <tt>matmul.c</tt></li>
<li><tt>matmul_mpi.c</tt> - an MPI version of <tt>matmul.c</tt></li>
</ul>
<p>In order to extend the execution times, I increased the array dimension from 1000 to 2000 in all the programs.  I'm using an Intel Q6600 quad-core running at 2.40GHz and my gcc version is 4.3.3. The first thing we will do is build the sequential version.</p>
<pre>$ gcc -g -O3 -o matmul.exe matmul.c -lm</pre>
<p>Next, we will run the program and record the time.</p>
<pre>$ time ./matmul.exe&gt;matmul.out

real	1m59.011s
user	1m58.931s
sys	0m0.084s</pre>
<p>The program took 119 seconds to run. The OpenMP version was built with the following command. Note the use of the <tt>-fopenmp</tt> option, which tells the compiler to use the OpenMP pragmas to build a threaded version. Indeed, it is possible to create a sequential or single core version by not using the <tt>-fopenmp</tt> option.</p>
<pre>$ gcc -g -O3  -fopenmp -o matmul_omp.exe matmul_omp.c -lm</pre>
<p>Running the program produces the following times:</p>
<pre>$ time ./matmul_omp.exe&gt;matmul_omp.out

real	0m31.304s
user	2m3.460s
sys	0m0.080s</pre>
<p>The OpenMP version reduced the wall clock time from 119 seconds to 31 (a speed-up of 3.8). If you look at the user time you will see there were 123 seconds used! That is because four cores were used and user time is the combined time of all the cores running at the same time. There is also an environment variable called <tt>OMP_NUM_THREADS</tt> that tells OpenMP binaries how many threads to use. If it is not defined, the default depends on the implementation; typically one thread per core is used. The program can also set the number of threads itself.</p>
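<p>To check how many threads a binary will actually use, a small helper along these lines may be handy. It is a sketch, and the name <tt>available_threads</tt> is an assumption for this example; <tt>omp_get_max_threads()</tt> and the <tt>_OPENMP</tt> macro come from the OpenMP specification.</p>

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Upper bound on the number of threads the next parallel region may use.
 * The compiler defines _OPENMP only when OpenMP is enabled, so a build
 * without -fopenmp falls back to 1 (the sequential case). */
int available_threads(void)
{
#ifdef _OPENMP
    int n = omp_get_max_threads();
#else
    int n = 1;
#endif
    assert(n >= 1);
    return n;
}
```

<p>In an OpenMP build, the returned value reflects <tt>OMP_NUM_THREADS</tt> when that variable is set, so running the same binary with different settings is a quick way to see the variable take effect.</p>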
<p>Turning to MPI there are a few differences in our compilation process. First, we have to make sure we have a version of MPI installed on our machine. In this case we are using Open MPI 1.3.1. To build an MPI program a wrapper script/program is often used that makes sure that the paths and names of include files and libraries are specified. In our case, we will use the MPI <tt>cc</tt> wrapper called <tt>mpicc</tt>.</p>
<pre>$ mpicc -g -O3 -o matmul_mpi.exe matmul_mpi.c -lm</pre>
<p>To run the resultant binary, we need to use an MPI starter program often referred to as <tt>mpirun</tt> or <tt>mpiexec</tt>. We also add an argument (<tt>-np</tt>) to tell MPI how many copies of the program to run.</p>
<pre>$ time mpirun -np 4 matmul_mpi.exe&gt;matmul_mpi.out

real	0m32.662s
user	2m5.824s
sys	0m0.556s</pre>
<p>Note that similar to the OpenMP example, the real time was about 33 seconds (a speed-up of 3.6) while the user time was about 126 seconds. Both methods produced excellent speed-up.</p>
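<p>The speed-up figures quoted above are simply the sequential <tt>real</tt> time divided by the parallel <tt>real</tt> time. The arithmetic can be sketched as follows; the function names <tt>speedup</tt> and <tt>efficiency</tt> are chosen for this illustration.</p>

```c
#include <assert.h>

/* Speed-up: sequential wall clock time divided by parallel wall clock time. */
double speedup(double t_seq, double t_par)
{
    assert(t_seq > 0.0 && t_par > 0.0);
    return t_seq / t_par;
}

/* Parallel efficiency: speed-up per core, where 1.0 would be ideal. */
double efficiency(double t_seq, double t_par, int cores)
{
    assert(cores > 0);
    return speedup(t_seq, t_par) / cores;
}
```

<p>With the measured times, <tt>speedup(119.011, 31.304)</tt> is about 3.8 for OpenMP and <tt>speedup(119.011, 32.662)</tt> is about 3.6 for MPI, which works out to roughly 95 and 91 percent efficiency on four cores.</p>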
<h3>Processes or Threads</h3>
<p>As mentioned, the big difference between MPI and OpenMP is the way programs are run. OpenMP programs run as a single process and the parallelism is expressed as threads (i.e., the program is started as one binary, which then separates into individual "threads" that run on the available cores of a server). This behavior can be seen quite clearly by watching an OpenMP program with <tt>top</tt>. As an example, consider <em>Figure One</em>, where a single OpenMP binary is running on an eight-core server. Notice that the cores are all busy, but there is one process running with a CPU utilization rate of 788 percent!</p>
<div id="attachment_1460" class="wp-caption aligncenter" style="width: 398px"><img class="size-full wp-image-1460" title="figure-one-omp-top-485" src="/wordpress/wp-content/uploads/2009/08/figure-one-omp-top-485.png" alt="Figure One: OpenMP program (cg.B) running on eight cores" width="388" height="272" /><p class="wp-caption-text">OpenMP program (cg.B) running on eight cores</p></div>
<p>In contrast to OpenMP, MPI actually starts one process per core using the <tt>mpirun -np 8 ...</tt> command. This situation is shown in <em>Figure Two</em>, where an MPI version of the same program is now running. Note the number of processes is now eight and each process has a 100 percent utilization rate. The processor (core) loads are about the same.</p>
<div id="attachment_1461" class="wp-caption aligncenter" style="width: 412px"><img class="size-full wp-image-1461" title="figure-two-mpi-top-485" src="/wordpress/wp-content/uploads/2009/08/figure-two-mpi-top-485.png" alt="Figure Two: MPI program (cg.B.8) running on eight cores" width="402" height="289" /><p class="wp-caption-text">MPI program (cg.B.8) running on eight cores</p></div>
<p>We will not be making any statements about which method is better. In some cases OpenMP works much better than MPI on a multi-core server for the same application. In other cases, MPI has been shown to run faster. The good news is if you already have an MPI version of your program, you can easily try it on a multi-core server. You can also run an MPI program "by hand" across multiple servers by using various start-up methods. Most batch scheduling packages (Torque, Moab, SGE, Platform) support multi-node MPI runs as well. Similarly, you can request a node to run your OpenMP application, but make sure you get exclusive control of the number of cores you need.</p>
<h3>Hybrid Approaches</h3>
<p>Astute readers may wonder: can I use OpenMP and MPI in the same program? The answer is yes. Since MPI is an API and OpenMP is based on threads, there is no reason, other than bad programming, that the two methods cannot be used in the same application. Indeed, HPL (the Top500 benchmark) can be run as a single instance on each node, with the program threaded on the node to use the individual cores. One way to envision a hybrid program is to use MPI for the outer loops and OpenMP for the inner loops. Thus, an MPI program could be augmented with OpenMP pragmas so it could take advantage of all the cores on any one node, if they are available. Of course, the program/algorithm would have to support this level of parallelism to run efficiently.</p>
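<p>The "MPI outside, OpenMP inside" idea can be sketched without any MPI calls by treating the process rank and process count as plain arguments. In a real MPI program they would come from <tt>MPI_Comm_rank()</tt> and <tt>MPI_Comm_size()</tt>. The function names <tt>row_block</tt> and <tt>matvec_block</tt> and the block partitioning scheme are assumptions for this example, not the method used by HPL or the tutorial programs.</p>

```c
#include <assert.h>

/* Outer level: give each of nprocs processes a contiguous block of the
 * n rows. Rows are [*first, *last), so *last is an exclusive bound. */
void row_block(long n, int rank, int nprocs, long *first, long *last)
{
    long base, extra;

    assert(nprocs > 0 && rank >= 0 && rank < nprocs);
    base = n / nprocs;   /* rows every process gets */
    extra = n % nprocs;  /* leftover rows, one each for the lowest ranks */
    *first = rank * base + (rank < extra ? rank : extra);
    *last = *first + base + (rank < extra ? 1 : 0);
}

/* Inner level: within one process, OpenMP threads share the rows of the
 * block. Computes y = A*x for rows [first, last) of a row-major
 * n x n matrix a. Without OpenMP enabled the pragma is ignored. */
void matvec_block(long n, const double *a, const double *x, double *y,
                  long first, long last)
{
    long i, j;

    #pragma omp parallel for private(j)
    for (i = first; i < last; i++) {
        double sum = 0.0;
        for (j = 0; j < n; j++) {
            sum += a[i * n + j] * x[j];
        }
        y[i] = sum;
    }
}
```

<p>Each process would call <tt>row_block</tt> with its own rank and then <tt>matvec_block</tt> on the resulting range, so the number of MPI processes can drop to one per node while the threads keep every core busy.</p>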
<p>MPI and OpenMP are well tested and robust technologies for creating parallel programs. Understanding these differences is key to creating applications that meet your expectations and run on the types of hardware available to you. Now that multi-core systems are everywhere, getting started is easy even on that new desktop.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.clusterconnection.com/2009/08/comparing-mpi-and-openmp/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>MPI Choices</title>
		<link>http://www.clusterconnection.com/2009/06/mpi-choices/</link>
		<comments>http://www.clusterconnection.com/2009/06/mpi-choices/#comments</comments>
		<pubDate>Tue, 16 Jun 2009 16:34:55 +0000</pubDate>
		<dc:creator>Douglas Eadline</dc:creator>
				<category><![CDATA[Blogs]]></category>
		<category><![CDATA[Featured]]></category>
		<category><![CDATA[API]]></category>
		<category><![CDATA[ICR]]></category>
		<category><![CDATA[Intel Cluster Ready]]></category>
		<category><![CDATA[Message Passing]]></category>
		<category><![CDATA[MPI]]></category>
		<category><![CDATA[MPICH2]]></category>
		<category><![CDATA[MVAPICH]]></category>
		<category><![CDATA[Open MPI]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[Programming]]></category>

		<guid isPermaLink="false">http://www.clusterconnection.com/2009/06/mpi-choices/</guid>
		<description><![CDATA[Choices, choices, choices. What should you expect from parallel computing software? Talk to anyone that works with high-performance computing clusters and at some point the conversation is bound to turn to MPI. MPI stands for "Message Passing Interface". As an "Interface" -- or API (Application Programming Interface) -- MPI does not work any magic on [...]]]></description>
			<content:encoded><![CDATA[<p>Choices, choices, choices. What should you expect from parallel computing software?</p>
<p>Talk to anyone who works with high-performance computing clusters and at some point the conversation is bound to turn to MPI.</p>
<p>MPI stands for "Message Passing Interface". As an "Interface" -- or API (Application Programming Interface) -- MPI does not work any magic on its own and must be coupled with a programming language like Fortran or C/C++. Once a program is MPI-enabled (i.e. written in a way to pass messages/data between computing processes) it can run on a cluster or even on a multi-core node.</p>
<p>MPI was initially developed because vendors were each creating custom message passing libraries, leading to non-portable applications and extra coding. To resolve this, everyone needed to be working off a global template, or standard. With a standard, users could move programs from one parallel platform to another without any major work (at least in theory).</p>
<p>Today, MPI is a standard maintained by the <a href="http://www.mpi-forum.org/">MPI Forum</a>. Because the standard is open for anyone to implement, there is a plethora of MPI choices for cluster users.</p>
<p>MPI versions generally fall into one of two camps: Open Source and commercial.</p>
<p>Historically, many of the open source versions of MPI were shared among early HPC users and thus became very popular.</p>
<p><strong>Open Source MPI</strong></p>
<p>The first two open MPI versions were <a href="http://www.mcs.anl.gov/research/projects/mpich2/">MPICH</a> and <a href="http://www.lam-mpi.org/">LAM/MPI</a>. MPICH was developed by Argonne National Lab, while LAM/MPI grew out of a number of university efforts. As great as these packages are, their use is now discouraged because no maintenance or code updates are planned for either version.</p>
<p>In their place, users should consider <a href="http://www.mcs.anl.gov/research/projects/mpich2/">MPICH2</a> and <a href="http://www.open-mpi.org/">Open MPI</a>. If you are using InfiniBand, you may also want to look at <a href="http://mvapich.cse.ohio-state.edu/">MVAPICH</a>. (MPICH2 and Open MPI provide InfiniBand support as well.) These versions are under active development. There are other open MPI projects, but these are the most popular.</p>
<p><strong>Commercial MPI</strong></p>
<p>In terms of commercial MPI versions the most popular are <a href="http://software.intel.com/en-us/articles/intel-mpi-library/">Intel MPI</a>, <a href="http://www.scali.com/cluster-computing/platform-mpi/platform-mpi">Platform MPI</a> (formerly Scali MPI), and <a href="https://h20392.www2.hp.com/portal/swdepot/displayProductInfo.do?productNumber=MPISW">HP-MPI</a>. Each has a variety of custom features, including support, that are worth considering for your cluster.</p>
<p><strong>Picking an MPI</strong></p>
<p>With all the choices, which MPI is right for you?</p>
<p>One would think that because it is a standard, all MPI implementations are compatible and work the same. As for compatibility, most programs written with one MPI <em>should</em> be able to be recompiled using another MPI library. There are some issues that crop up, but for the most part the API is portable.</p>
<p>Implementation, however, can vary quite a bit. Indeed, there are many details of the MPI implementation that are not covered by the specification (this was intentional). Various MPIs offer different features like better performance, run-time interconnect choice, or start-up methods.</p>
<p><strong>One Solution</strong></p>
<p>Although the multiple MPI versions are welcomed by users, Independent Software Vendors (ISVs) find them difficult to support. One solution to this problem is the <a href="http://software.intel.com/en-us/cluster-ready/">Intel Cluster Ready</a> (ICR) initiative. ICR ensures that vendors will find a standard software environment on every ICR-certified cluster. Check with your cluster vendor about ICR options.</p>
<p>Intel Cluster Ready® will not limit your MPI choices; it will just make sure the ISV codes can run on your cluster.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.clusterconnection.com/2009/06/mpi-choices/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>

