Automatic SLES 11 deployment of an ICR certified HPC cluster

July 14th, 2009 10:26 am
Posted by Oliver Tennert

The other day we had to ship an ICR certified HPC cluster based on SLES 11, the latest SUSE enterprise distribution, using Intel Cluster Runtime version 2.0-1. As SLES 11 had been out for only a couple of weeks at that point, we didn't expect everything to run as smoothly as with SLES 10, and indeed the challenge turned out to be getting SLES 11 to behave in a way compatible with Intel Cluster Ready.

As we found out on the Web, the Intel MPI implementation has a known bug: the number of available cores is not detected correctly. It took some time to connect this to one of our problems, though, because the original error did not immediately point to it:

Intel(R) MPI Library Runtime Environment (Single-node), (intel_mpi_rt).........................................................FAILED
subtest 'MPI Hello World! (I_MPI_DEVICE = sock)' failed
- failing All hosts returned: 'No one returned Hello World!'
subtest 'mpd shutdown' failed
- failing All hosts returned: 'mpdallexit: cannot connect to local mpd (/tmp/mpd2.console_transtec); possible causes:
1. no mpd is running on this host
2. an mpd is running but was started without a "console" (-n option)'

As it turned out, the real problem was Intel MPI's "cpuinfo" program, which reported a wrong number of cores in the system. Setting an environment variable,

"export I_MPI_CPUINFO=proc"

fixed this issue, however.
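
To cross-check what "cpuinfo" reports, the CPU count can be read directly from /proc/cpuinfo. The following is a minimal sketch, not part of Intel MPI: the helper name count_procs is made up for illustration, and it counts logical CPUs (one "processor" entry each), which may differ from the number of physical cores when Hyper-Threading is enabled.

```shell
#!/bin/sh
# count_procs: hypothetical helper, not part of Intel MPI.
# Counts the "processor" entries in a cpuinfo-style file.
# Without an argument it reads /proc/cpuinfo.
count_procs() {
    grep -c '^processor[[:space:]]*:' "${1:-/proc/cpuinfo}"
}

# Tell Intel MPI to take its CPU information from /proc/cpuinfo,
# as described above.
export I_MPI_CPUINFO=proc

echo "logical CPUs according to /proc/cpuinfo: $(count_procs)"
```

If this number differs from what "cpuinfo" prints without the variable set, the detection bug is a likely suspect.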

Another issue with Intel MPI is that it seems to be incompatible with Python 2.6, which comes along with SLES 11:

Intel® MPI Library Runtime Environment (Single-node), (intel_mpi_rt).........................................................FAILED
subtest 'mpd startup' failed
- failing All hosts returned:
'/opt/intel/impi/3.2.0.011/bin64/mpdlib.py:27: DeprecationWarning: The popen2 module is deprecated.  Use the subprocess module.
import sys, os, signal, popen2, socket, select, inspect
/opt/intel/impi/3.2.0.011/bin64/mpdlib.py:37: DeprecationWarning: the md5 module is deprecated; use hashlib instead
from  md5       import  new as md5new'

Although this should not constitute a real problem, cluster checker seems to be sensitive enough to end with an error code even for warnings like these. We could fix it by installing Python 2.4, which, unfortunately, does not exist as a SLES 11 package, so we had to take a source tarball and build the package ourselves. It is strange that SLES 11 does not include Python 2.4 as a fallback package for compatibility reasons, as there are many Python programs out there based on that version.

In my opinion, this just demonstrates the power of Intel Cluster Ready, or rather of cluster checker: Intel MPI causes a problem, and cluster checker does its job and identifies a compatibility issue, impressively enough! Thus, the ICR program enabled us to catch the problem before it reached the customer.

A strange error that occurred was this one:

X11 runtime libraries are provided, (X11_libs).........................FAILED
subtest 'libGLw.so (x86-64) >= version 1' failed
- failing All hosts returned: 'missing'

The missing library is part of the Mesa package, which is a necessary HPC cluster ingredient from Intel Cluster Ready's point of view. Doing a little bit of research, we found out that nearly all current Linux distributions explicitly remove this library from the standard Mesa package, for whatever reason.

We solved this in an elegant way by replacing the SLES Mesa package with a recompiled one, with the SPEC file adapted to include the libGLw libraries.
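
A quick way to verify on a node that the rebuilt package really ships the library is to look for it in the library directories. This is only a sketch: find_lib is a made-up helper name, and the two directories are assumptions about where an x86-64 SLES system keeps its libraries.

```shell
#!/bin/sh
# find_lib: hypothetical helper. Prints every file whose name matches
# the given pattern in the given directories (non-recursive).
# Directories that do not exist are skipped silently.
find_lib() {
    pattern=$1
    shift
    for dir in "$@"; do
        [ -d "$dir" ] && find "$dir" -maxdepth 1 -name "$pattern" -print
    done
    return 0
}

# Assumed library locations; adjust to the system at hand.
find_lib 'libGLw.so*' /usr/lib64 /usr/lib
```

Empty output means the library is still missing, which matches the 'missing' result reported by cluster checker.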

A funny thing that happened was that the Korn shell could not find its locales:

Korn Shell, (ksh)......................................................FAILED
subtest 'Hello World!' failed
- failing hosts node1 - node3 returned: 'en_US.UTF-8: unknown locale'

But the package glibc-locale seemed to be installed:

root@node1 # rpm -q glibc-locale
gcc-locale-4.3-62.198

What was going on?

It turned out to have nothing to do with ICR incompatibilities of SLES 11, but to be a problem with "xCAT 2", which we use for cluster deployment. We were testing diskless installations, and for those xCAT defines a number of files and directories to be deleted before creating the compressed root image. In the configuration file

/opt/xcat/share/xcat/netboot/sles/compute.exlist

the locale files are explicitly listed to be deleted before image generation. Having found that out, it was an easy thing to fix.
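
To see which entries are responsible, the exclude list can simply be searched for the word "locale". The sketch below assumes nothing about the exact entries in the file; list_locale_excludes is a made-up helper name.

```shell
#!/bin/sh
# list_locale_excludes: hypothetical helper. Prints, with line numbers,
# every entry of an exclude list that mentions "locale".
list_locale_excludes() {
    grep -n 'locale' "$1"
}

# Path of the xCAT exclude list named above.
exlist=/opt/xcat/share/xcat/netboot/sles/compute.exlist
if [ -r "$exlist" ]; then
    list_locale_excludes "$exlist"
fi
```

The lines it prints are the ones to remove or comment out, after which the root image has to be generated again.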

After solving all these issues, we finally succeeded in developing a fully automatic deployment procedure for ICR certified HPC clusters based on SLES 11.

Author Info


Oliver Tennert started his professional life as a theoretical physicist at Tübingen University in Southern Germany. After a while he finished his research on quantum field theory, lattice QCD, and cosmology and turned his hobby into his profession. He then worked as a Senior Solution Engineer at a small company providing IT services and solutions. Since 2008 he has been working with transtec in Tübingen, a medium-sized vendor of customized hardware and one of the leading European specialists in HPC and storage solutions.