August 2nd, 2010 10:11 am
Posted by Brock Taylor
Tags: dark cluster, HPC, HPC cluster, turn-key
Breaking down the purchase of an HPC cluster into four stages, I want to look at the expertise required and the amount of work done by the cluster buyer and the cluster seller(s) in different models. To get good contrast, I'm first targeting the two ends of the cluster buying spectrum: dark cluster purchases and pre-defined, turn-key purchases. This post is my attempt to analyze who must know what in dark clusters.
I am somewhat hijacking the definition of a dark cluster. I'm really looking at a cluster where the purchaser buys all the parts and assembles them - a pure do-it-yourself approach. The stages break down with the purchaser doing pretty much all the work.
Stage 1 - Specifying the parts: All purchaser. The purchaser needs the expertise to determine what parts are needed - that includes networking fabric, storage needs, OS, provisioning system, and drivers. The parts vendors are just selling their components, so the purchaser has to do all the research and effort to determine what the end solution needs to be.
Stage 2 - Integration: Pretty much all purchaser again. The bad thing about this is that the integration of components likely occurs at deployment time. If there's an incompatibility between components, it probably isn't discovered until late in the game. Vendors involved may provide some assistance in debugging issues related to their individual components, but that may also require a service contract. So, again, the expertise level required of the purchaser is high and the effort is mostly theirs.
Stage 3 - Manufacturing: This is the one stage that the vendors are actively a part of. Again, the parts have to come from somewhere. Parts roll off the assembly line and are shipped off - beyond that, you're on your own. The purchaser still has to buy all the individual pieces and take care of the software procurements.
Stage 4 - Assembly and final testing: Similar to stage 2 in this case, it's pretty much all purchaser. In fact, in dark clusters stage 2 and stage 4 may blend into one. As the integration of parts commences, so does the actual assembly of the cluster. Once the integration is done, the cluster enters testing before being put into use. There's another problem - you also have to know how to test the cluster in addition to knowing how to put it together. It actually takes a lot of knowledge to properly check if the system is running correctly. Miss a "silent" error, and you wind up with costly downtime in the future.
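To make the "silent" error point a little more concrete, here is a minimal sketch of one kind of check: run the same benchmark on every node and flag any node whose score strays too far from the cluster median. The function name, the 5% tolerance, and the sample scores are all made up for illustration. A real acceptance test covers far more than this.

```python
from statistics import median


def flag_outlier_nodes(results, tolerance=0.05):
    """Return the nodes whose benchmark score differs from the cluster
    median by more than `tolerance` (a fraction of the median).

    `results` maps a node name to one benchmark score. The 5% default
    is an arbitrary illustrative threshold, not a recommendation.
    """
    mid = median(results.values())
    return sorted(
        node
        for node, score in results.items()
        if abs(score - mid) > tolerance * mid
    )


if __name__ == "__main__":
    # Hypothetical scores from running one benchmark on four nodes.
    sample = {
        "node01": 100.0,
        "node02": 101.0,
        "node03": 99.5,
        "node04": 82.0,  # runs, but noticeably slower than its peers
    }
    print(flag_outlier_nodes(sample))
```

A node like node04 in the sample still runs jobs, which is exactly why this sort of problem goes unnoticed unless someone compares the nodes against each other.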
So the expertise barrier for do-it-yourself clusters is pretty high. You have to already be a cluster expert or factor the ramp to becoming a cluster expert into the purchase. Mistakes can be costly in time and money. For example, the cluster works fine, but the application for which the cluster was purchased doesn't run. Oops. The upside is that once you have become an expert, you can reduce the number of missteps with each new purchase. Plus, many people in the community have tried to make it easier (and made great progress in doing so). Still, this is a big hurdle to get over.
The big problem is really who is the cluster expert. In a large company, this is likely dedicated staff in the IT department - a wise investment and required for large systems used by many people. In a small company, however, this might be one or more of the primary users of the system. That means engineers and scientists fill the role of cluster administrator, and that's time taken away from producing results in order to build and maintain the system. That type of investment is much harder for a small company to make. It might also be hard for a small department within a large company to justify the cost of their own IT resources required for a much-needed workgroup cluster.
In academia, this barrier used to be solved by throwing graduate students at the issue. I'd bet there is more than one thesis that morphed from "this is how to do science" into "this is how to build a cluster to do science." It's still time lost, though, that should be directed to doing work other than cluster administration. Plus, there is turnover. At some point, graduate students demand their degrees, leave school, and go off to small companies that ask them to administer their own cluster. The point being that it's a whole lot better for a researcher to be running a CFD simulation rather than figuring out why an RPM that was installed yesterday is now missing on one of the compute nodes of the cluster.
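The missing-RPM hunt is the kind of chore that ends up being scripted. As a rough sketch, assume the installed package names have already been collected from each node (for example from the output of `rpm -qa`) into a set per node. The function name and the sample inventory below are hypothetical, and using the union of all nodes as the reference is just one simple choice.

```python
def missing_packages(inventory):
    """Report, for each node, the packages that are installed somewhere
    else in the cluster but not on that node.

    `inventory` maps a node name to the set of package names installed
    on it. Nodes that are missing nothing are left out of the result.
    """
    # Reference set: every package seen on at least one node.
    all_packages = set().union(*inventory.values())
    report = {}
    for node, packages in inventory.items():
        missing = all_packages - packages
        if missing:
            report[node] = sorted(missing)
    return report


if __name__ == "__main__":
    # Hypothetical inventory; the package names are placeholders.
    inventory = {
        "node01": {"gcc", "openmpi", "solver-libs"},
        "node02": {"gcc", "openmpi", "solver-libs"},
        "node03": {"gcc", "openmpi"},
    }
    print(missing_packages(inventory))
```

Even a small script like this has to be written, run, and maintained by someone, and that is time not spent on the simulation.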
Solutions integrators exist for a reason - building a proper HPC cluster is more than just installing software on hardware. The do-it-yourself approach can help keep financial cost to a minimum with respect to the parts, but that must be weighed against the cost in expertise and time to build and maintain it. Because of the latter, broad adoption of HPC cluster use isn't likely to go down this path.