Automating Cluster Maintenance - Part 2

August 4th, 2011 8:26 am
Posted by Patrick Ryan

Levels of Wellness Tests and Automation Scheduling

Intel Cluster Checker can run a variety of tests depending on what the user is trying to accomplish. For general wellness of a cluster, Intel Cluster Checker offers five levels of thoroughness. For automation, we'll focus on levels one, three, and five.

Wellness Level One is a very short run of tests that checks basic connectivity throughout the cluster and basic uniformity among the nodes. It focuses on BIOS settings and on processor, memory, and system configurations. The level one tests are quick and show that the cluster is online and ready for use.
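To make the idea of a uniformity check concrete, here is a minimal Python sketch. It is not how Intel Cluster Checker implements its tests; the function name `find_outliers`, the node names, and the core counts are all made up for illustration. It assumes one value per node has already been collected and flags any node that differs from the most common value.

```python
from collections import Counter
from typing import Dict, Hashable


def find_outliers(node_values: Dict[str, Hashable]) -> Dict[str, Hashable]:
    """Return the nodes whose value differs from the most common value.

    Illustrative only: `node_values` maps a node name to one collected
    setting (for example, a core count or a BIOS version string).
    """
    if not node_values:
        return {}
    # The value reported by the largest number of nodes is treated as "expected".
    expected_value, _ = Counter(node_values.values()).most_common(1)[0]
    return {node: value for node, value in node_values.items()
            if value != expected_value}


if __name__ == "__main__":
    # Hypothetical data: node03 reports fewer cores than its peers.
    cores = {"node01": 12, "node02": 12, "node03": 8, "node04": 12}
    print(find_outliers(cores))
```

Treating the majority value as the expected one is a simplification; it says which nodes disagree, not which value is actually correct.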

Wellness Level Three is the default run level for Intel Cluster Checker. It builds on level one and adds more rigorous modules that test parameters such as disk and memory bandwidth, MFLOPS, and network performance. It also performs an in-depth hardware uniformity test along with an Intel MPI collectives and message integrity test. This check takes a bit longer to run, but it assures that the hardware is performing up to par.

Wellness Level Five adds a packages test, which compares the currently installed packages with a list of expected packages generated at a given time. It also runs the HPCC module (a performance benchmark test). The level five tests take longer still, but they assure users that all the pieces of the system are working in harmony. This test helps the admin make sure a user hasn’t installed or uninstalled anything that may affect the cluster.
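The comparison behind a packages test can be pictured as a pair of set differences. The following Python sketch is only an illustration of that idea, not Intel Cluster Checker's own code; the function name `compare_packages` and the package names are assumptions, and how the two lists are gathered is left out.

```python
from typing import Dict, List, Set


def compare_packages(expected: Set[str], installed: Set[str]) -> Dict[str, List[str]]:
    """Report packages that were removed or added relative to the expected list."""
    return {
        # In the expected list but no longer installed.
        "missing": sorted(expected - installed),
        # Installed now but absent from the expected list.
        "unexpected": sorted(installed - expected),
    }


if __name__ == "__main__":
    # Hypothetical package sets for illustration.
    expected = {"gcc", "openmpi", "python"}
    installed = {"gcc", "python", "valgrind"}
    report = compare_packages(expected, installed)
    print("missing:", report["missing"])
    print("unexpected:", report["unexpected"])
```

Both lists coming back empty would mean the installed packages still match the expected list.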

For my clusters, I have set up the following schedule that runs these wellness levels at various times.

  • Level One - set to run each weekday
  • Level Three - set to run once a week, on Saturday
  • Level Five - set to run monthly, on the first Sunday

For our lab, with multiple smaller clusters, this automation schedule is perfect: users can log in to one of the clusters, see when each of the wellness levels passed, and know that their jobs will run as expected.
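The schedule above can be expressed as a small date rule. This Python sketch is one possible way to encode it, under the assumption that only one level runs on any given day and that Sundays other than the first of the month are idle. The function name `wellness_level_for` is hypothetical, and the actual Intel Cluster Checker invocation is deliberately left as a placeholder.

```python
import datetime
from typing import Optional


def wellness_level_for(day: datetime.date) -> Optional[int]:
    """Return the wellness level scheduled for `day`, or None if nothing runs."""
    weekday = day.weekday()  # Monday is 0, Sunday is 6
    if weekday < 5:
        return 1  # Monday through Friday: level one
    if weekday == 5:
        return 3  # Saturday: level three
    if day.day <= 7:
        return 5  # first Sunday of the month: level five
    return None  # remaining Sundays: no run


if __name__ == "__main__":
    level = wellness_level_for(datetime.date.today())
    if level is None:
        print("No wellness run scheduled today")
    else:
        # Placeholder: start the wellness run for this level here.
        print(f"Would run wellness level {level}")
```

A script like this could be started once a day by a scheduler such as cron, at an hour when the cluster is not busy with user jobs.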

Every cluster has a different purpose and a different load. Running tests at the intervals I have set may not be feasible or needed for every cluster. However, automating the test runs on a schedule that meets your needs will help assure that your cluster performance remains optimal, and it frees up the cluster(s) for jobs during normal business hours.

References:

>> Review Part 1

>> About Intel Cluster Checker

>> Intel Cluster Checker Knowledge Base



Author Info
Patrick Ryan


Patrick Ryan is a High Performance Computing (HPC) Cluster Systems Engineer in the Software and Services Group at Intel. He has been part of the Cluster Software & Technologies Organization since joining Intel in December of 2010.

Patrick builds and maintains HPC clusters as part of efforts in systems integration and research. He specializes in dependencies between HPC applications, operating systems, and resource management components, and he regularly draws on his 5+ years of experience in datacenter operations, Windows and Linux systems administration, and user support. In addition, Patrick served as a Staff Sergeant in the United States Air Force from 2000 to 2006, where he specialized in F-16 avionics.

Patrick holds a B.S. in Computer Science and is currently working towards his M.S. in Computer Science, both at the University of Illinois at Springfield.