Automating Cluster Maintenance - Part 1

June 22nd, 2011 1:31 pm
Posted by Patrick Ryan

Easily Keep Your HPC Cluster in Great Shape

Whether you're a cluster user who expects optimal performance and functionality every time you run a job, or a system administrator who needs to keep the cluster in perfect working order for your users, running checks on a regular basis is important.

A great way to ensure that your certified Intel Cluster Ready HPC system stays in the same shape as when it was first built is to run the Intel® Cluster Checker tool regularly.

While it's easy enough to run Intel Cluster Checker on your cluster(s) once a week, there have been times when I was too busy or simply forgot to run a check. For this reason, and because I have multiple clusters to check, I developed a method to automate Intel Cluster Checker runs. The automated solution runs Intel Cluster Checker and reports any errors directly to me and/or the system administrator. Automating this process saves the administrator a substantial amount of time each week, eliminates the chance of missing a run, and helps keep the cluster in optimal health.

A Cluster That Checks Itself

Before automating the process, make sure your cluster passes a manual Intel Cluster Checker run. To get there, tune the configuration file so that it reflects the cluster operating at its best. With that configuration in hand, you then need a script to set up the system, run Intel Cluster Checker, and report the results.

For the first phase of the process, I wrote a script that sets up the environment and runs the most in-depth wellness check in Intel Cluster Checker. I used cron to schedule the script to execute once a week. Once the check completes, the script updates the message of the day to show when the last check ran and what the result was. If the check fails, the log file created by Intel Cluster Checker is copied to a specified directory where it can be accessed and analyzed.
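The flow described above can be sketched roughly as follows. This is not my actual script, only a minimal illustration in Python under several assumptions: the check command is passed in as a list (the real Intel Cluster Checker invocation and its options are not shown here), a non-zero exit status is taken to mean the check failed, and the function name, paths, and message format are all hypothetical.

```python
import shutil
import subprocess
from datetime import datetime
from pathlib import Path


def run_wellness_check(command, motd_path, log_path, failure_dir, now=None):
    """Run a check command, record the outcome in the MOTD file,
    and copy the log aside if the check failed.

    Assumptions (not taken from Intel Cluster Checker documentation):
    - `command` is whatever command line starts the check on your system.
    - A non-zero exit status means the check failed.
    - The MOTD file is overwritten with a single status line.
    """
    now = now or datetime.now()

    # Run the check and capture its output so cron does not mail it by default.
    result = subprocess.run(command, capture_output=True, text=True)
    passed = result.returncode == 0
    status = "PASSED" if passed else "FAILED"

    # Update the message of the day with the time and result of the last check.
    stamp = now.strftime("%Y-%m-%d %H:%M")
    Path(motd_path).write_text(f"Last cluster check: {stamp} - {status}\n")

    # On failure, keep a timestamped copy of the log for later analysis.
    archived = None
    if not passed and Path(log_path).exists():
        Path(failure_dir).mkdir(parents=True, exist_ok=True)
        name = f"{now.strftime('%Y%m%d-%H%M')}-{Path(log_path).name}"
        archived = Path(failure_dir) / name
        shutil.copy(log_path, archived)

    return passed, archived
```

A weekly schedule could then be expressed with a crontab entry such as `0 2 * * 0 /usr/bin/python3 /opt/cluster/weekly_check.py`, which runs at 02:00 every Sunday; the script path is a placeholder. Overwriting the MOTD keeps the sketch short, and a real script would more likely replace a single status line and leave the rest of the file alone.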

The cluster is now set up to check itself in the middle of the night and report its status upon login!

Read Part 2 for more about scheduling test runs, and the different levels of wellness tests in Intel Cluster Checker.






Author Info
Patrick Ryan

Patrick Ryan is a High Performance Computing (HPC) Cluster Systems Engineer in the Software and Services Group at Intel. He has been part of the Cluster Software & Technologies Organization since joining Intel in December of 2010.

Patrick builds and maintains HPC clusters as part of efforts in systems integration and research. He specializes in dependencies between HPC applications, operating systems, and resource management components, and regularly draws on his 5+ years of experience in datacenter operations, Windows and Linux systems administration, and user support. In addition, Patrick served as a Staff Sergeant in the United States Air Force from 2000 to 2006, where he specialized in F-16 avionics.

Patrick holds a B.S. in Computer Science, and is currently working towards his M.S. in Computer Science, both from the University of Illinois at Springfield.