Support, Why Do I Need Cluster Support?

October 8th, 2009 3:45 pm
Posted by Douglas Eadline

Supporting a successful HPC cluster takes time or money; take your pick

Many of the HPC people I know are what you would call rugged individualists. They have been around since the beginning and were responsible for moving the market/community along when commodity HPC was less than fashionable. This group consists mostly of developers, implementers, and administrators. Many of these people developed, by way of discussion on the Beowulf Mailing List, the best practices used today. The Beowulf Mailing List is a true resource if there ever was one. A newbie can ask a question and get polite (and lengthy) answers from list members. The list holds a large amount of open community knowledge because all the HPC plumbing is open source. Unhindered discussions can take place at any level between any number of people.

There is also the false notion that open source software is "free as in beer." This idea is not quite true because software, unlike a toaster, has a usage cost. Once you install, configure, and study any software, you have already made an "investment" in the package. Continued use furthers this investment. The size of the investment is up to you. And, because the software is open, in theory you can fix any problem. Thus, you get to decide how much time you can invest in a particular software package before it becomes "expensive" to you. At some point, the cost (or dilution) of your time may come into play. Spending weeks fine-tuning a single application at the expense of other responsibilities is probably not going to play out very well.

In terms of support responsibilities, many clusters have minimal issues once they are configured correctly. There are, of course, hardware failures, but in general, once everything is booted, things often work quite well. There are two areas that need attention, however. The first is software updates. Updates are needed for several reasons, including security patches, bug fixes, and new features. These types of updates are usually easy to manage unless they have dependencies, which means there may be a whole raft of packages that need updating. If you don't get the dependencies right, then there can be problems with the entire cluster.
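To make the dependency point concrete, here is a minimal sketch in Python of how one update can pull in a raft of others. The package names and the dependency map are hypothetical, made up for illustration; a real cluster would get this information from its package manager, not from a hand-written table. The sketch collects everything a package depends on, directly or indirectly, and lists it with dependencies first.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency map (an assumption for illustration only):
# package -> set of packages it depends on.
DEPENDS_ON = {
    "mpi-app": {"openmpi"},
    "openmpi": {"libibverbs", "glibc"},
    "libibverbs": {"glibc"},
    "glibc": set(),
    "scheduler": {"glibc"},
}


def update_closure(package, depends_on):
    """Return the package plus everything it depends on, directly or indirectly."""
    needed, stack = set(), [package]
    while stack:
        current = stack.pop()
        if current not in needed:
            needed.add(current)
            stack.extend(depends_on.get(current, ()))
    return needed


def update_order(package, depends_on):
    """List the packages involved in updating `package`, dependencies first."""
    needed = update_closure(package, depends_on)
    # TopologicalSorter expects node -> predecessors, i.e. node -> its dependencies.
    graph = {name: depends_on.get(name, set()) & needed for name in needed}
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    for name in update_order("mpi-app", DEPENDS_ON):
        print(name)
```

Even in this toy map, updating one application touches four packages, and the same set has to be consistent on every node.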

The other issue is local integration. This is what I consider the "last mile problem" for clusters. Very often local file system issues need to be worked out and managed in addition to creating job submission policies. There is usually some end-user assistance needed as well as questions on how to compile and submit jobs to the queue. Of course, "Why is my job sitting in the queue?" is probably the question that gets asked the most.

If your job includes time for installing, integrating, and updating software and you happen to be one of those rugged individualist HPC people, then you probably have no interest in professional support. If, on the other hand, you are new to clustering (or Linux) and already have many responsibilities, then you may want to consider using professional support services. As with all open source software, the choice is yours. In terms of commercial software, or commercial support of open software, there are many options. In any case, purchasing support for business-critical applications is always a good idea, as is the use of an Intel Cluster Ready (ICR) solution. By adhering to the ICR specification, support is much easier -- for you and/or a vendor. That is, a reference platform allows you and your vendors to work from the same page. Without a common framework, support vendors and others in your organization may have to decipher/debug how you configured the cluster.

In conclusion, cluster support can be commercial or it can be institutional. In either case, there is a cost. If you do it on your own, it will cost time, and if you hire a consultant or company, it will cost money. To supplement either effort, there is a large amount of information on the web that can be useful when identifying and solving problems. Support is an important part of any successful HPC cluster. Just ask the old-timers. They figured it out, so you don't have to.


Author Info

Dr. Douglas Eadline has worked with parallel computers since 1988 (anyone remember the Inmos Transputer?). After co-authoring the original Beowulf How-To, he continued to write extensively about Linux HPC clustering and parallel software issues. Much of Doug's early experience has been in software tools and application performance. He has been building and using Linux clusters since 1995. Doug holds a Ph.D. in Chemistry from Lehigh University.