April 27th, 2009 1:42 pm
Posted by Brock Taylor
Tags: debugging, InfiniBand, Intel, Intel Cluster Checker
A cluster node doesn't boot, the network is down, the power supply is smoking - these are actually nice problems for a cluster administrator. These problems conveniently provide the starting point for the path to resolution and usually are quickly resolved. What's not so nice are issues that don't actually break the system but allow it to limp along or subtly degrade over time. These problems can chip away at cluster performance usually without presenting an explicit symptom. The system runs, but it just does not seem to run as well as it used to.
My favorite example is a damaged network cable that drops data intermittently, but not to the point of outright failure. The result might be a support call reporting that "the cluster is broken" because an application that used to run in two hours now takes almost three. Some administrators might intuitively go straight to checking the link speed between each pair of nodes, but more often homegrown methods are the primary approach to discovering the cause of the slowdown.
Enter Intel® Cluster Checker as a systematic approach to finding these problems. The tool provides a diagnostic report by checking the usual suspects that can cause functional issues in the cluster. I have had a damaged InfiniBand cable continue to operate, but at half the bandwidth. Intel Cluster Checker doesn't tell me I have a bad cable, but it does give me data showing that the bandwidth over the fabric to one particular node is below par. That lets me zero in quickly on the failing component.
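To make the idea of "below par" concrete, here is a minimal sketch of the kind of comparison involved: flag any node whose measured bandwidth falls well below the cluster median. This is not how Intel Cluster Checker works internally and does not use its interface. The function name, the 0.8 tolerance, and the sample figures are all assumptions made up for illustration.

```python
from statistics import median


def find_slow_nodes(bandwidth_mb_s, tolerance=0.8):
    """Return the nodes whose bandwidth is below tolerance * cluster median.

    bandwidth_mb_s maps a node name to its measured bandwidth in MB/s.
    The 0.8 default is an arbitrary illustrative threshold, not a
    recommended value.
    """
    baseline = median(bandwidth_mb_s.values())
    return sorted(
        node
        for node, bandwidth in bandwidth_mb_s.items()
        if bandwidth < tolerance * baseline
    )


# Made-up measurements, for illustration only: node03 runs at about half
# the bandwidth of its peers, as a damaged cable might cause.
sample = {
    "node01": 1480.0,
    "node02": 1510.0,
    "node03": 745.0,
    "node04": 1495.0,
}

print(find_slow_nodes(sample))
```

Comparing against the median rather than a fixed number means a single degraded link stands out without anyone having to know the fabric's rated bandwidth in advance.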
The inherent value is that there is no need to speculate about any particular issue. Step 1 is always to run Intel Cluster Checker and see what it reports. The tool does the investigative work and keeps you from chasing dead ends. It also helps expose those silent issues that are neither obvious nor routinely checked.