August 26th, 2009 10:18 am
Posted by Brock Taylor
Tags: certified, Clusters, HPC, Intel Cluster Checker, Intel Cluster Ready, Intel Cluster Ready Architecture
My last post concluded that eventually the answer to "is my cluster's software too old?" is yes. Updating software on a cluster is not as simple as updating a single server, but the downside of errors is the same: if updates aren't done properly, clusters, like traditional servers, can mysteriously break or develop problems later. In addition, the chance of making an error during the update is proportional to the number of nodes in the cluster. For Intel® Cluster Ready compliant clusters, Intel has provided a couple of steps that can and should be performed after a software update to help verify that the cluster is still compliant and still functioning properly.
First, always use the required and supplied "provisioning system" tools to update a cluster. It may be relatively easy to update all the nodes in a cluster using a couple of RPMs and something like pdsh to script the installation of the update - but don't be tempted. This manual or brute-force method bypasses the software that manages the image on each server. A provisioning system may reimage a node after a crash, or when a node is replaced or added to the cluster. If software updates were applied manually (outside of the provisioning system), the reimaged node will be inconsistent with the rest of the system. An admin would need to remember all the manual changes and apply them again. It's much better to let the provisioning software worry about that. That is one of the reasons ICR requires a provisioning system!
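To make the reimaging problem concrete, here is a minimal sketch in Python. It is a toy model, not any real provisioning tool: the node names, the package name "mpi", its versions, and the helper functions are all hypothetical and chosen only for illustration.

```python
def manual_update(nodes, package, version):
    """Brute-force update: changes the running nodes but not the image."""
    for installed in nodes.values():
        installed[package] = version


def reimage(nodes, node, image):
    """Restore one node from the image held by the provisioning system."""
    nodes[node] = dict(image)


def inconsistent_nodes(nodes, reference):
    """Return the nodes whose packages differ from the reference node."""
    return sorted(
        name for name, installed in nodes.items()
        if installed != nodes[reference]
    )


# Hypothetical image and cluster: every node starts as a copy of the image.
image = {"mpi": "1.0"}
nodes = {name: dict(image) for name in ("node01", "node02", "node03")}

manual_update(nodes, "mpi", "1.1")   # applied outside the provisioning system
reimage(nodes, "node02", image)      # node02 crashes and is reimaged

drift = inconsistent_nodes(nodes, "node01")
print(drift)  # ['node02']
```

Because the image never learned about the manual update, the reimaged node silently falls back to the old version. Had the update gone through the provisioning system, the image itself would carry the new version and the reimaged node would match its neighbors.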
Once updates are applied, it's a good idea to verify that the cluster works as it did before the update and that it remains compliant with the Intel® Cluster Ready architecture. Many if not most software updates will behave well, but verification helps ensure that an update didn't alter or remove a key system component, which could lead to application failures. Intel® Cluster Checker provides an easy way to check compliance after an update. With the command-line --compliance option, the tool verifies that the interface defined by the architecture still exists as before. It's an easy way to check that the update hasn't had any ill effect on the architecture interface used by ICR applications.
Finally, the Intel Cluster Checker configuration files may need to be updated to reflect the new software. For example, if a newer version of the Intel C compiler is installed, the Intel Cluster Checker configuration file should be updated to use the newer version. Running the tool would then verify that the new installation is functioning on all nodes. It's also valuable to update the list of packages that are expected on each node. The packages test verifies that the RPMs installed on each node match a predetermined list. The --packages command-line option creates new package lists based on the current installation (use it after all updates are complete). Save the original list file and set the configuration file to use the updated list. For more information on using the tool, see the Intel Cluster Checker Users Guide.
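The idea behind a package-list check can be sketched in a few lines of Python. This is only an illustration of the comparison, not how Intel Cluster Checker is implemented; the package names, versions, node names, and function names below are all hypothetical.

```python
def compare_packages(expected, installed):
    """Compare one node's packages (name -> version) against the expected list.

    Returns three sorted lists: missing, unexpected, and version-mismatched names.
    """
    missing = sorted(set(expected) - set(installed))
    unexpected = sorted(set(installed) - set(expected))
    mismatched = sorted(
        name for name in set(expected) & set(installed)
        if expected[name] != installed[name]
    )
    return missing, unexpected, mismatched


def check_cluster(expected, nodes):
    """Return a report containing only the nodes that differ from the list."""
    report = {}
    for node, installed in nodes.items():
        diff = compare_packages(expected, installed)
        if any(diff):
            report[node] = diff
    return report


# Hypothetical expected list and per-node inventories.
expected = {"compiler": "11.1", "mpi": "1.1"}
nodes = {
    "node01": {"compiler": "11.1", "mpi": "1.1"},  # matches the list
    "node02": {"compiler": "11.0", "mpi": "1.1"},  # stale compiler version
    "node03": {"compiler": "11.1"},                # mpi package missing
}

report = check_cluster(expected, nodes)
for node, (missing, unexpected, mismatched) in sorted(report.items()):
    print(node, "missing:", missing, "unexpected:", unexpected,
          "mismatched:", mismatched)
```

This also shows why the expected list must be regenerated after an update: if the list still names the old versions, every correctly updated node would be reported as mismatched.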