Detecting Silent Problems in Clusters

April 27th, 2009 1:42 pm
Posted by Brock Taylor

A cluster node doesn't boot, the network is down, the power supply is smoking - these are actually nice problems for a cluster administrator. They conveniently provide the starting point for the path to resolution and are usually resolved quickly. What's not so nice are issues that don't actually break the system but allow it to limp along or subtly degrade over time. These problems can chip away at cluster performance, usually without presenting an explicit symptom. The system runs, but it just does not seem to run as well as it used to.

My favorite example is the damaged network cable that drops data intermittently but never fails outright. This might result in a support call reporting that "the cluster is broken" because an application that used to run in two hours now takes almost three. Some administrators might intuitively go straight to checking the link speeds between each node pair, but more often homegrown methods are the primary approach to discovering the cause of the slowdown.

Enter Intel® Cluster Checker as a systematic approach to finding these problems. The tool provides a diagnostic report by checking the usual suspects that can cause functional issues in the cluster. I have had a damaged InfiniBand cable continue to operate, but at half the bandwidth. Intel Cluster Checker doesn't tell me I have a bad cable, but it does give me data showing that the bandwidth over the fabric to one particular node is below par. That lets me narrow in quickly on the failing component.
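To make the idea of "below par" concrete, here is a minimal sketch of that kind of comparison. It is not how Intel Cluster Checker works internally. The function name, the node names, the bandwidth figures, and the 80% tolerance are all assumptions made up for illustration. It simply compares each node's measured bandwidth against the cluster median and flags the laggards.

```python
from statistics import median


def flag_slow_nodes(bandwidth_mb_s, tolerance=0.8):
    """Return the nodes whose bandwidth is below tolerance * cluster median.

    bandwidth_mb_s: dict mapping node name -> measured bandwidth (MB/s).
    tolerance: hypothetical threshold; 0.8 means "below 80% of the median".
    """
    baseline = median(bandwidth_mb_s.values())
    return sorted(
        node
        for node, bandwidth in bandwidth_mb_s.items()
        if bandwidth < tolerance * baseline
    )


# Illustrative measurements only (not real data): node03 runs at roughly
# half the bandwidth of its peers, as a damaged cable might cause.
measurements = {
    "node01": 3150.0,
    "node02": 3120.0,
    "node03": 1580.0,
    "node04": 3160.0,
}

print(flag_slow_nodes(measurements))  # prints ['node03']
```

The median is used as the baseline rather than the mean so that one badly degraded node does not drag the reference value down with it. As with the real tool, the result points at a node, not at the cable itself; finding the failed component is still up to the administrator.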

The inherent value is that there is no need to speculate about any particular issue. Step 1 is always to run Intel Cluster Checker and see what it reports. The tool does the investigation work and keeps you from chasing dead ends. It also helps expose those silent issues that are not obvious or routine to check.

Comments

Comment from bhelvey
Time May 9, 2009 at 7:41 pm

How would I go about evaluating this tool? We have both large clusters and a large number of clusters of varying sizes. We currently have an enterprise license agreement. Should I just talk to our sales rep, or is there a location to download it?

Thanks,

bryan.helvey@pgs.com

Comment from Brock Taylor
Time May 12, 2009 at 6:57 am

Unfortunately, we don't have an evaluation program for the Intel Cluster Checker. It is only available on clusters that are sold and deployed as Intel Cluster Ready certified solutions.


Author Info
Brock Taylor


Brock Taylor is an Engineering Manager and Cluster Solutions Architect for volume High Performance Compute clusters in the Software and Services Group at Intel. He has been a part of the Intel® Cluster Ready program from the start, is a co-author of the specification, and launched the first reference implementations of Intel Cluster Ready certified solutions.

Brock and others at Intel are working within the HPC community to enable advances and innovations in scientific computing by lowering the barriers to clustered solutions.

Brock joined Intel in December of 2000, and in addition to HPC clustering, he previously helped launch new processors and chipsets as part of an enterprise validation BIOS team. Brock has a B.S. in Computer Engineering from Rose-Hulman Institute of Technology and an M.Sc. in High Performance Computing from Trinity College Dublin.