May 26th, 2009 9:21 am
Posted by Gary Tyreman
Tags: cluster management, HPC, improvement, innovation, operations, systems management, Univa UD
In my last post, I challenged the concept of Cluster Management by suggesting that we, as an industry, have a long way to go. There is much work left to do, and numerous aspects of managing a cluster have yet to be solved.
Intel Cluster Ready is enabling this innovation
The more that things change, the more they stay the same. The fact is that HPC systems will change over time. A newer compute node is likely not a copy-exact replacement of the original failed node, even when acquired from the same manufacturer. Differences can be magnified in larger systems where “MTBF laws” create incidents on a frequent and unpredictable basis (think of Murphy’s law applied to large high-use compute systems).
Often, multiple admins and even contractors provide the care and feeding of the cluster. Many people, many users, many requests and many changes give rise to complexity and can create confusion. Vast amounts of time can be spent figuring out what happened, who did what and when it was done. This reduces the efficiency of the system by prolonging downtime of valuable resources, delaying projects and pulling expensive staff away from other work to resolve incidents. It is the same problem as it ever was, even as individual component technologies improve and become more efficient themselves.
Sometimes you need more…
Package and configuration management unquestionably improve many aspects of maintaining a cluster; however, gaps remain in the operational aspects of sustaining HPTC environments. This is an area of management in which Univa has been innovating, in partnership with the Texas Advanced Computing Center (TACC). With five clusters, nearly 65,000 cores and limited manpower, TACC had to find better ways of managing large-scale clusters. The largest cluster, Ranger, has 3,936 nodes and is managed by both TACC and contract staff. Problem deduction, tracking and resolution have been greatly simplified with a set of operational tools that capture and codify TACC's collective (and impressive) man-years of experience.
Univa's HPC systems management and product development expertise is being leveraged to inventory, generalize and productize these tools. In short order Univa will release substantial improvements to UniCluster's systems management capabilities based on our work with TACC. We believe these systems, best practices and tools will revolutionize HPC systems management and redefine "cluster management."
It is critical to point out that this innovation and operational efficiency would not have been possible for a company the size of Univa without a program like Intel Cluster Ready. Because our engineers and product folk are freed from re-solving the same basic problems of provisioning and configuration, we were able to allocate some of our resources to advancing the "science" of cluster operations.
Likewise, researchers and admins at end-user sites like TACC will have their time freed up to pursue their real objectives: enabling science and solving grand problems.
Over the next few posts in my personal blog I will describe these operational tools and systems in more detail. Anyone interested in learning more or previewing the tools may contact Univa at any time. (shameless plug!)