August 20th, 2009 12:48 pm
Posted by Douglas Eadline
Tags: API, Hardware, Intel Cluster Ready, MPI, OpenMP
The hardware environment may determine the best parallel programming tool to use
The advent of multi-core processors has increased the need for parallel programs on systems large and small (clusters to laptops). There are many ways to express parallelism in a program. In HPC, MPI (the Message Passing Interface) has been the main tool for most programmers. MPI is often talked about as though it were a computer language in its own right. In reality, MPI is an API (Application Programming Interface), a programming library that allows Fortran and C (and sometimes C++) programs to send messages to each other.
Another method to express parallelism is OpenMP. Unlike MPI, OpenMP is not a library but an extension to the compiler. To use OpenMP, the programmer adds directives ("pragmas" in C/C++, special comments in Fortran) to the program that serve as hints to the compiler. The resulting program uses operating system threads to run in parallel. Operating system threads can be thought of as separate subroutines running at the same time that share the same memory space. Beyond the fact that "MP" appears in both names, there is often confusion about how each of these parallel paradigms works and where/when each should be applied. This article will explain the differences and provide a better understanding of these two powerful technologies.
While some programming issues will be discussed, we will not present full programming walk-throughs here. If you are interested in using MPI and OpenMP, please see the MPI In 30 Minutes and OpenMP In 30 Minutes tutorials, which will get you started quickly. We will also be talking about some of the example programs mentioned in those articles. Both articles link to the source code should you wish to try some things on your own.
Many people are surprised to learn that MPI has not been certified by any major standards organization. Instead, the MPI Forum creates and maintains the MPI standard. You can find more background and where to obtain both open and commercial MPI versions in the article MPI Choices. When writing MPI codes, the programmer must explicitly add message passing calls to a program. Quite often, existing sequential programs are modified, but new "parallel applications" can be written from scratch as well. In terms of programming difficulty, MPI is conceptually straightforward, but it is possible to build programs that are hard to follow because many things are happening at the same time. In addition, starting/monitoring/debugging MPI programs across a cluster can sometimes lead to extra work not found when running on a single server. There are also tools to assist with MPI programs, such as Intel Trace Analyzer.
OpenMP was developed because programming native operating system threads (often referred to as POSIX threads or Pthreads) can be cumbersome. To aid with thread programming, a higher level of abstraction was developed and called OpenMP. As with all higher-level approaches, flexibility is sacrificed for ease of coding. At its core, OpenMP uses threads, but the details are hidden from the programmer. As mentioned, OpenMP is implemented as compiler directives (pragmas in C/C++, directive comments in Fortran). Typically, computationally heavy loops are augmented with OpenMP directives that the compiler uses to automatically "thread the loop." This approach has the distinct advantage that it may be possible to leave the original program "untouched" (except for directives) and recompile a sequential (non-threaded) version in which the OpenMP directives are simply ignored. More information can be found at the OpenMP website.
OpenMP is supported by all major Fortran and C compilers (including gcc/gfortran and the Intel compilers). From a programmer's standpoint, working with OpenMP is easier than MPI -- at least initially. Because a program with added pragmas still functions as a sequential (single core) program, programmers can introduce parallelism incrementally. Users can still create complex, hard-to-understand programs, but as with MPI there are tools, like Intel VTune, to assist the programmer with OpenMP.
Copy or Share
When discussing MPI or message passing methods, one obvious aspect is often overlooked - message passing is basically memory copying. Let's consider a simple MPI message from one program to another.
The sending program constructs the text message "Hello over there" in its own memory, then sends it to the receiving program. The receiving program takes the string and places it into its own memory, and may respond with "What's up?". There are now two copies of the string. The reply works exactly the same way. This type of communication is best for distributed memory systems like clusters. Note that I did not state where the processes were located. By design, MPI processes can be located on the same server or on separate servers. Regardless of where it runs, each MPI process has its own memory space from which messages are copied.
In contrast, in a threaded or OpenMP environment, communication happens differently. If one thread wants to communicate with another thread, it essentially says, "there is a message at this memory location" (i.e. "Hello over there"). The receiving thread looks at that memory location and then tells the sender where its response is. No data is copied; there is one copy, and it is shared between threads. (This is not strictly how it happens, but the "no copying" is what matters.)
As mentioned, MPI can run across distributed servers as well as on a single SMP (multi-core) server. OpenMP, however, is best run on a single SMP server, or on multiple servers using something like ScaleMP. There is also a product called Cluster OpenMP* that can run OpenMP applications across a cluster. For this reason, MPI codes usually scale to larger numbers of servers, while OpenMP is restricted to a single operating system domain (e.g., a single server).
There is another subtle difference between OpenMP and MPI applications that run on a single server. In OpenMP, communication is through shared memory, which means threads share access to a memory location. MPI programs on SMP systems also use shared memory for communication, but there the processes pass messages by writing to and reading from a shared buffer -- the data is still copied from one process space to another. Obviously, sharing memory locations seems more efficient than sending copies of memory locations to other processes, but it all depends. In the MPI process model, each process has exclusive access to all of its own memory. For some programs, this situation may be more efficient because it is better to copy data (send a message) than to wait for shared memory access. On the other hand, in the OpenMP model, threads share access to all memory in the process space. In this case, some programs may be much more efficient because the overhead of copying memory is avoided.
Compiling and Running Code
We are going to use three programs from the tutorial articles mentioned above. The programs are simple matrix multiplications with the same underlying code:
- matmul.c - a sequential (runs on a single core) version of the matrix multiplication program
- matmul_omp.c - an OpenMP version of matmul.c
- matmul_mpi.c - an MPI version of matmul.c
In order to extend the execution times, I increased the array dimension from 1000 to 2000 in all the programs. I'm using an Intel Q6600 quad-core running at 2.40GHz and my gcc version is 4.3.3. The first thing we will do is build the sequential version.
$ gcc -g -O3 -o matmul.exe matmul.c -lm
Next, we will run the program and record the time.
$ time ./matmul.exe>matmul.out
The program took 119 seconds to run. The OpenMP version was built with the following command, note the use of the -fopenmp option. This option tells the compiler to use the OpenMP pragmas to build a threaded version. Indeed, it is possible to create a sequential or single core version by not using the -fopenmp option.
$ gcc -g -O3 -fopenmp -o matmul_omp.exe matmul_omp.c -lm
Running the program produces the following times
$ time ./matmul_omp.exe>matmul_omp.out
The OpenMP version reduced the wall clock time from 119 seconds to 31 (a speed-up of 3.8). If you look at the user time you will see 123 seconds were used! That is because four cores were busy, and user time is the combined time of all the cores running at the same time. There is also an environment variable called OMP_NUM_THREADS that tells OpenMP binaries how many threads to use. If it is not defined, one thread per core is used. The maximum number of threads may be set by the program itself as well.
Turning to MPI there are a few differences in our compilation process. First, we have to make sure we have a version of MPI installed on our machine. In this case we are using Open MPI 1.3.1. To build an MPI program a wrapper script/program is often used that makes sure that the paths and names of include files and libraries are specified. In our case, we will use the MPI cc wrapper called mpicc.
$ mpicc -g -O3 -o matmul_mpi.exe matmul_mpi.c -lm
To run the resultant binary, we need to use an MPI starter program often referred to as mpirun or mpiexec. We also add an argument (-np) to tell MPI how many copies of the program to run.
$ time mpirun -np 4 matmul_mpi.exe>matmul_mpi.out
Note that similar to the OpenMP example, the real time was about 33 seconds (a speed-up of 3.6) while the user time was about 126 seconds. Both methods produced excellent speed-up.
Processes or Threads
As mentioned, the big difference between MPI and OpenMP is the way programs are run. OpenMP programs run as a single process, and the parallelism is expressed as threads (i.e. the program is started as one binary, which then separates into individual "threads" that run on the available cores of a server). This behavior can be seen quite clearly when watching an OpenMP program with top. As an example, consider Figure One, where a single OpenMP binary is running on an eight-core server. Notice that the cores are all busy, yet there is one process running with a CPU utilization rate of 788 percent!
OpenMP program (cg.B) running on eight cores
In contrast to OpenMP, MPI actually starts one process per core using the mpirun -np 8 ... command. This situation is shown in Figure Two, where an MPI version of the same program is running. Note that there are now eight processes, each with a 100 percent utilization rate. The processor (core) loads are about the same.
MPI program (cg.B.8) running on eight cores
We will not be making any statements about which method is better. In some cases OpenMP works much better than MPI on a multi-core server for the same application. In other cases, MPI has been shown to run faster. The good news is if you already have an MPI version of your program, you can easily try it on a multi-core server. You can also run an MPI program "by hand" across multiple servers by using various start-up methods. Most batch scheduling packages (Torque, Moab, SGE, Platform) support multi-node MPI runs as well. Similarly, you can request a node to run your OpenMP application, but make sure you get exclusive control of the number of cores you need.
Astute readers may wonder, can I use OpenMP and MPI in the same program? The answer is yes. Since MPI is an API and OpenMP is based on threads, there is no reason, other than bad programming, that the two methods cannot be used in the same application. Indeed, HPL (the Top500 Benchmark) is run as a single instance on each node, but on the node the program is threaded to use the individual cores. One way to envision a hybrid program is to use MPI for the outer loops and OpenMP for the inner loops. Thus, an MPI program could be augmented with OpenMP pragmas so it could take advantage of all the cores on any one node, if they are available. Of course, the program/algorithm would have to support this level of parallelism to run efficiently.
MPI and OpenMP are well tested and robust technologies for creating parallel programs. Understanding these differences is key to creating applications that meet your expectations and run on the types of hardware available to you. Now that multi-core systems are everywhere, getting started is easy even on that new desktop.