The More Things Change…

May 26th, 2009 9:21 am
Posted by Gary Tyreman

In my last post, I challenged the concept of Cluster Management by suggesting that we, as an industry, have a long way to go. There is much work left to do, and numerous aspects of managing the cluster have yet to be solved.

Intel Cluster Ready is enabling this innovation

The more that things change, the more they stay the same. The fact is that HPC systems will change over time. A newer compute node is likely not a copy-exact replacement of the original failed node, even when acquired from the same manufacturer. Differences can be magnified in larger systems where “MTBF laws” create incidents on a frequent and unpredictable basis (think of Murphy’s law applied to large high-use compute systems).

Often, multiple admins, and even contractors, provide the care and feeding of the cluster. Many people, many users, many requests and many changes give rise to complexity and can create confusion. Vast amounts of time can be spent figuring out what happened, who did what and when it was done. This reduces the efficiency of the system by prolonging downtime of valuable resources, delaying projects and tying up the expensive manpower needed to resolve situations. It is the same problem as it ever was, even as individual component technologies improve and become more efficient themselves.

Sometimes you need more…

Package and configuration management unquestionably improve many aspects of maintaining a cluster; however, there continue to be gaps in the operational aspects of sustaining HPTC environments. This is an area of management in which Univa has been innovating in partnership with the Texas Advanced Computing Center (TACC). With five clusters, nearly 65,000 cores and limited manpower, TACC had to find better ways of managing large-scale clusters. The largest cluster, Ranger, has 3,936 nodes and is managed by both TACC and contract staff. Problem deduction, tracking and resolution have been greatly simplified with a set of operational tools that capture and codify TACC's collective (and impressive) man-years of experience.

Univa's HPC systems management and product development expertise are being leveraged to inventory, generalize and productize these systems. In short order Univa will release substantial improvements to UniCluster's systems management capabilities based on our work with TACC. These innovative systems, best practices and tools will revolutionize HPC systems management and redefine "cluster management."

What is critical to point out is that this innovation and operational efficiency would not have been possible for a company the size of Univa without a program like Intel Cluster Ready. Because our engineers and product folk are freed from re-solving the same basic problems of provisioning and configuration, we were able to allocate some of our resources to advancing the "science" of cluster operations.

Likewise, the time of researchers and admins at end-user sites like TACC will be freed up so that they can accomplish their objectives: that is, to enable science or grand problem solving.

Over the next few posts in my personal blog I will describe these operational tools and systems in more detail. Anyone interested in learning more or previewing the tools may contact Univa at any time. (shameless plug!)



Author Info
Gary Tyreman


Gary Tyreman brings more than 20 years of executive software experience to his role as the President and CEO of Univa Corporation. Gary leads corporate development and fundraising activities and is the architect of Univa's data center optimization strategy, which couples the strategic addition of Grid Engine expertise with Univa's innovative and industry-leading integrated cloud computing management products. Gary has established Univa as a top multi-national competitor and has expanded the markets the company serves. Prior to taking the position as CEO, Gary spent three years as Univa's Senior Vice President of Products and Alliances.

At Univa UD, Gary is Vice President and General Manager of the High-Performance Computing Division. In this role he oversees all aspects of the company's HPC business, including strategic planning, engineering, marketing, sales and business development. He also directs the growth of the company's online open source community.

Prior to joining Univa UD in 2008, Gary was Vice President and Business Manager for Platform Computing's HPC division. During nearly five years there, he led the company's business planning, innovation and product management efforts while marshaling a team that developed some of the industry's most popular software.

Tyreman was among the first in the industry to recognize the emerging entry-level user in the HPC space and was responsible for developing a vision for how to simplify running applications off the shelf, a key to unlocking value among organizations new to HPC. He worked with Intel Corp. to develop his innovations, which were taken into account when Intel announced the Intel Cluster Ready program last year, making it easier to design, build, sell, program, acquire and deploy clusters built with Intel components.

Prior to his tenure at Platform Computing, Tyreman held a variety of executive positions in product management and marketing in technology growth companies, including Hummingbird, Delano and Itemus.

Gary is actively involved in the standards community and has held key positions in the X Consortium (X.org) and Open Grid Forum.