Intel Cluster Ready, going further with Platform Computing

April 23rd, 2009 1:26 pm
Posted by Mehdi Bozzo-Rey

The goal of Intel Cluster Ready (ICR) certification is pretty simple: making sure that an initial combination of hardware, software and configuration files (the recipe) can be replicated over and over, ensuring the same behavior and, more importantly, the same performance for every cluster shipped. This is good but not enough: you don’t buy a cluster just for those performance numbers, you buy it because you want to run applications on it. Here comes the second part of the certification: compliance with the ISVs’ prerequisites, which ensures, for example, that your preferred application (if it is part of the ICR framework) will simply be waiting for your input file after installation. No extra configuration and no additional software to install: a nice amount of time (and money) saved.

Now, as the OS is anything but static, you will have to patch your system at some point. Maybe you can afford a “real” test cluster: apply the patches, run the cluster-checker on the updated nodes, check that everything is fine, schedule a maintenance period, update the whole cluster and put it back in production. This sounds quite expensive in terms of the time and hardware (depreciation and electricity) needed just for testing.

You can also take advantage of the Open Cluster Stack 5 framework that we developed at Platform Computing: take a snapshot of the repository used to provision the nodes, update this snapshot, decouple a few nodes from the production cluster and let them use the updated repository (all of this is done live). You can then validate the updated snapshot. Is everything fine? Perfect, you can now apply the patches to your production cluster (no interruption needed, except for a reboot that can be scheduled via your scheduler in case of a kernel update) and, of course, put the few nodes you used back into production. Sounds complicated? Not really, when you can use a GUI to perform all those operations. Again, a nice amount of time (and money) can be saved and, more importantly, the certification process is no longer tied to a single point in time (the installation) but becomes part of the lifecycle of the cluster.
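The snapshot workflow described above can be sketched as a small in-memory model. This is only an illustration, not the Open Cluster Stack 5 implementation or its API: the `Cluster` class, the `staged_update` function, the `-snapshot` naming and the `validate` callback (which stands in for a cluster-checker run on the test nodes) are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Cluster:
    """Hypothetical model of a cluster, for illustration only."""
    # node name -> name of the repository the node is provisioned from
    nodes: Dict[str, str]
    # repository name -> {package name: version}
    repos: Dict[str, Dict[str, str]] = field(default_factory=dict)


def staged_update(
    cluster: Cluster,
    prod_repo: str,
    patches: Dict[str, str],
    test_nodes: List[str],
    validate: Callable[[List[str]], bool],
) -> bool:
    """Validate patches on a few nodes before touching production.

    `validate` is an assumption: it stands in for whatever check you run
    on the test nodes and returns True when they pass.
    """
    snapshot = prod_repo + "-snapshot"

    # 1. Snapshot the production repository and patch only the copy.
    cluster.repos[snapshot] = {**cluster.repos[prod_repo], **patches}

    # 2. Decouple a few nodes and point them at the patched snapshot.
    for node in test_nodes:
        cluster.nodes[node] = snapshot

    # 3. Validate the updated snapshot on those nodes.
    ok = validate(test_nodes)

    # 4. Only on success, promote the patched content to production.
    if ok:
        cluster.repos[prod_repo] = dict(cluster.repos[snapshot])

    # 5. Either way, return the test nodes to production and clean up.
    for node in test_nodes:
        cluster.nodes[node] = prod_repo
    del cluster.repos[snapshot]
    return ok
```

The point of the design is that the production repository is never modified until the validation step has passed, and a failed validation leaves the cluster exactly as it was.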

Let’s have a look at the problem from an IHV’s point of view. How do you get your hardware/software combination certified? The answer is simple: you run the cluster-checker against a recipe.

So the first step is to build a recipe. This can be an iterative process: install the stack, run the cluster-checker, update the recipe based on the cluster-checker results, and start over again. It is time consuming and can lead to highly complex recipes, so at the end of the day you will probably end up with a nice set of scripts that must be executed before the cluster-checker can run successfully. Of course, your next step is to write a master script that will run all these small scripts sequentially: the last thing you want is a complex procedure on the assembly line.
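To make the “master script” idea concrete, here is a minimal sketch in Python. It assumes each small script is an ordinary command that signals success with a zero exit code; `run_command` and `run_recipe` are hypothetical names for this example, and the script paths in the usage note are placeholders, not part of any real recipe.

```python
import subprocess
from typing import Callable, List, Sequence


def run_command(cmd: Sequence[str]) -> int:
    """Run one command and return its exit code."""
    return subprocess.run(list(cmd)).returncode


def run_recipe(
    steps: List[Sequence[str]],
    runner: Callable[[Sequence[str]], int] = run_command,
) -> List[Sequence[str]]:
    """Run the steps in order and stop at the first failure.

    Returns the steps that completed successfully, so the caller can
    see exactly where the sequence stopped.
    """
    completed = []
    for step in steps:
        if runner(step) != 0:
            # A failed step usually invalidates the ones after it,
            # so do not continue.
            break
        completed.append(step)
    return completed
```

A call such as `run_recipe([["./configure-network.sh"], ["./install-runtimes.sh"]])` would run the two (placeholder) scripts in order. Passing the runner as a parameter keeps the sequencing logic testable without executing anything on a real node.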

Imagine now that most of the scripts needed are not only embedded in the software stack, but that the certification process is also part of the development process and, more precisely, part of the QA process. For an IHV, this means that the software to be installed is already “certifiable”, so the cost of writing a recipe becomes a lot more affordable. This is exactly what we’ve done at Platform Computing: we use the ICR certification process as part of our QA process. Some steps needed for a successful cluster-checker run are part of the core of our clustering solution; there is no need to write scripts, you just enable a software component in one click. Of course, this can be done through an intuitive GUI. Customization can then be added to the software stack in order to handle, for example, hardware deviations from an original recipe.

In conclusion, we’ve just demonstrated how easy it is to save time (and money) for both IHVs and users by integrating the ICR certification process not only in the cluster’s lifecycle (Platform Open Cluster Stack 5) but also in the development process (Platform Computing) in order to get a scalable certified solution.



Author Info


Mehdi Bozzo-Rey graduated from University Paris XIII, France, in Physics / Mathematics and Computer Science in 1996 and received his M.Sc. in theoretical Physics from the University of Sherbrooke, Canada, in 2003. As part of the HPC team at the University of Sherbrooke, he designed, deployed and benchmarked in 2005 the most powerful cluster in Canada, ranked #1 in Canada for six consecutive Top500 lists, and worked on diskless node provisioning as part of the Thin-Oscar group, prior to joining Platform Computing in 2006.