August 4th, 2011 8:26 am
Posted by Patrick Ryan
Tags: Automating cluster tests, Cluster, cluster management, cluster test runs, high performance computing, HPC, HPCC, Intel Cluster Checker, Intel Cluster Ready, Intel MPI, simplify
Levels of Wellness Tests and Automation Scheduling
Intel Cluster Checker can run a variety of tests depending on what the user is trying to accomplish. For general wellness of a cluster, Intel Cluster Checker offers five levels of thoroughness. For automation, we'll focus on levels one, three, and five.
Wellness Level One is a very short run of tests that checks basic connectivity throughout the cluster and basic uniformity among the nodes. It focuses on BIOS settings and on processor, memory, and system configurations. The level one tests are quick and show that the cluster is online and ready for use.
Wellness Level Three is the default run level for Intel Cluster Checker. It builds on level one and adds more rigorous modules that test parameters such as disk and memory bandwidth, MFLOPS, and network performance. It also performs an in-depth hardware uniformity test along with an Intel MPI collectives and message integrity test. This level takes a bit longer to run, and it confirms that the hardware is performing up to par.
Wellness Level Five adds a packages test, which compares the currently installed packages with a list of expected packages generated at a given time. It also runs the HPCC module (a performance benchmark test). The level five tests take longer still, but they assure users that all the pieces of the system are working in harmony. The packages test is useful for the admin to make sure a user hasn’t installed or uninstalled anything that may affect the cluster.
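To illustrate the idea behind the packages test, here is a minimal Python sketch of comparing an expected package list against what is currently installed. It is not how Intel Cluster Checker implements its module; the function name `compare_packages` and the package names in the usage lines are hypothetical.

```python
from typing import Iterable, List, Tuple


def compare_packages(expected: Iterable[str],
                     installed: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (missing, unexpected) package names, each sorted.

    missing    -- on the expected list but not currently installed
    unexpected -- currently installed but not on the expected list
    """
    expected_set = set(expected)
    installed_set = set(installed)
    missing = sorted(expected_set - installed_set)
    unexpected = sorted(installed_set - expected_set)
    return missing, unexpected


# Hypothetical package lists, for illustration only.
baseline = ["gcc-4.4.5", "openssh-5.3", "intel-mpi-4.0"]
current = ["gcc-4.4.5", "openssh-5.3", "some-user-tool-1.0"]

missing, unexpected = compare_packages(baseline, current)
print("missing:", missing)
print("unexpected:", unexpected)
```

A node passes this kind of check only when both lists come back empty, which is what tells the admin that nothing was installed or removed since the expected list was generated.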
For my clusters, I have set up the following schedule that runs these wellness levels at various times.
- Level One - set to run each weekday
- Level Three - set to run once a week, on Saturday
- Level Five - set to run monthly, on the first Sunday
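The schedule above can be sketched as a small Python helper that a nightly job could call to decide which wellness level, if any, is due. This is only a sketch under stated assumptions: the function name `wellness_level_for` is hypothetical, and actually launching Intel Cluster Checker at the chosen level is deliberately left out.

```python
from datetime import date
from typing import Optional


def wellness_level_for(day: date) -> Optional[int]:
    """Map a calendar date to the wellness level scheduled for it.

    Weekdays            -> level 1
    Saturdays           -> level 3
    First Sunday/month  -> level 5
    Other Sundays       -> None (nothing scheduled)
    """
    weekday = day.weekday()  # Monday is 0, Sunday is 6
    if weekday <= 4:
        return 1
    if weekday == 5:
        return 3
    # The first Sunday of a month always falls on day 1 through 7.
    if day.day <= 7:
        return 5
    return None


if __name__ == "__main__":
    level = wellness_level_for(date.today())
    if level is None:
        print("No wellness run scheduled today")
    else:
        print(f"Wellness level {level} is scheduled today")
```

Keeping the decision in one function makes it easy to change the intervals for a cluster with a different load without touching whatever job launches the tests.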
For our lab, which has multiple smaller clusters, this automation schedule is perfect: users can log in to one of the clusters, see when each of the wellness levels last passed, and know that their jobs will run as expected.
Every cluster has a different purpose and a different load. Running tests at the intervals I have set may not be feasible or needed for every cluster. However, automating the test runs on a schedule that meets your needs will help ensure that your cluster performance remains optimal, and it frees up the cluster(s) for jobs during normal business hours.
>> Review Part 1
>> About Intel Cluster Checker
>> Intel Cluster Checker Knowledge Base