June 22nd, 2011 1:31 pm
Posted by Patrick Ryan
Tags: Automating cluster tests, Cluster, cluster management, cluster test runs, high performance computing, HPC, HPCC, Intel Cluster Checker, Intel Cluster Ready, Intel MPI, simplify
Easily Keep Your HPC Cluster in Great Shape
Whether you're a cluster user who expects optimal performance and functionality every time you run a job, or a system administrator who needs to keep the cluster in perfect working order for your users, running checks on a regular basis is important!
A great way to ensure that your Intel Cluster Ready certified HPC system remains in the same great shape as when it was first built is to run the Intel® Cluster Checker tool regularly.
While it's easy enough to run Intel Cluster Checker on your cluster(s) once a week, I've found there are times when I am too busy, or simply forget, to run a check. For this reason, and because I have multiple clusters to check, I developed a method to automate Intel Cluster Checker runs. This automated solution runs Intel Cluster Checker and reports any errors directly to me and/or the system administrator. Automating this process not only saves the administrator a substantial amount of time each week, it also eliminates the chance of missing a run and helps ensure that the cluster remains in optimal health.
A Cluster That Checks Itself
Before automating the process, make sure Intel Cluster Checker passes a manual run on your cluster. To get there, tune the configuration file so that it reflects the cluster operating at its best. With a known-good configuration in hand, you then need a script that sets up the system, runs Intel Cluster Checker, and reports the results.
For the first phase of the process, I wrote a script that sets up the environment and runs the most in-depth wellness check in Intel Cluster Checker. I used cron to schedule my script to execute once a week. Once the check completes, the script updates the message of the day to show when the last check ran and whether it passed. If the check fails, the log file created by Intel Cluster Checker is copied to a specified directory where it can be accessed and analyzed.
The cluster is now set up to check itself in the middle of the night and report its status upon login!
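The steps described above can be sketched roughly as follows. This is not the author's actual script, only a minimal Python illustration of the same flow. The function name `run_check`, its parameters, and the status line format are all hypothetical, and the checker command line is passed in by the caller rather than hard-coded, since the exact Intel Cluster Checker invocation depends on your installation. The sketch also assumes that a nonzero exit code means the check failed.

```python
import os
import shutil
import subprocess
from datetime import datetime
from pathlib import Path


def run_check(command, log_file, motd_file, failed_log_dir, extra_env=None):
    """Run a checker command, record the outcome in a message-of-the-day
    file, and keep a copy of the log if the check fails.

    All names here are illustrative. `command` is the full checker command
    line as a list of strings, supplied by the caller.
    Returns True if the check passed, False otherwise.
    """
    # Set up the environment: start from the current one and add overrides.
    env = {**os.environ, **(extra_env or {})}

    # Run the check. Assumption: a nonzero exit code signals a failed check.
    result = subprocess.run(command, env=env)
    passed = result.returncode == 0

    # Write a one-line status message. Note that this overwrites the whole
    # file; append or merge instead if your MOTD carries other content.
    stamp = datetime.now().strftime("%Y-%m-%d %H:%M")
    status = "PASSED" if passed else "FAILED"
    Path(motd_file).write_text(f"Last cluster check: {stamp} - {status}\n")

    # On failure, copy the checker's log somewhere it can be analyzed later.
    if not passed and Path(log_file).exists():
        Path(failed_log_dir).mkdir(parents=True, exist_ok=True)
        shutil.copy(log_file, failed_log_dir)

    return passed
```

A small wrapper script would call `run_check` with your own checker command, log location, and MOTD path. To run such a wrapper weekly, a crontab entry like `0 2 * * 0 /usr/local/sbin/weekly_cluster_check.py` (a hypothetical path) would start it at 02:00 every Sunday.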
Read Part 2 for more about scheduling test runs, and the different levels of wellness tests in Intel Cluster Checker.
>> About Intel Cluster Checker
>> Intel Cluster Checker Knowledge Base