Cluster buyer vs. seller: do-it-yourself builds

August 2nd, 2010 10:11 am
Posted by Brock Taylor

Breaking down the purchase of an HPC cluster into four stages, I want to look at the expertise required and the amount of work done by the cluster buyer and the cluster seller(s) in different models. To get good contrast, I'm first targeting the two ends of the cluster buying spectrum: dark cluster purchases and pre-defined, turn-key purchases. This post is my attempt to analyze who must know what in dark clusters.

I am somewhat hijacking the definition of a dark cluster. I'm really looking at a cluster where the purchaser buys all the parts and assembles them - a pure do-it-yourself approach. The stages break down with the purchaser doing pretty much all the work.

Stage 1 - Specifying the parts: All purchaser. Need to have the expertise to determine what parts are needed - that includes networking fabric, storage needs, OS, provisioning system, and drivers. The parts vendors are just selling their components, so the purchaser has to do all the research and effort to determine what the end solution needs to be.

Stage 2 - Integration: Pretty much all purchaser again. The bad thing about this is that the integration of components likely occurs at deployment time. If there's an incompatibility between components, it probably isn't discovered until late in the game. The vendors involved may provide some assistance in debugging issues related to their individual components, but that may also require a service contract. So, again, the expertise level required of the purchaser is high and the effort is mostly theirs.

Stage 3 - Manufacturing: This is the one stage that the vendors are actively a part of. Again, the parts have to come from somewhere. Parts roll off the assembly line and are shipped off - beyond that, you're on your own. The purchaser still has to buy all the individual pieces and take care of the software procurements.

Stage 4 - Assembly and final testing: Similar to stage 2 in this case, it's pretty much all purchaser. In fact, in dark clusters stage 2 and stage 4 may blend into one. As the integration of parts commences, so does the actual assembly of the cluster.  Once the integration is done, the cluster enters testing before being put into use. There's another problem - you also have to know how to test the cluster in addition to knowing how to put it together. It actually takes a lot of knowledge to properly check if the system is running correctly. Miss a "silent" error, and you wind up with costly downtime in the future.
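One way to picture the kind of check that catches a "silent" error is to run the same benchmark on every node and compare the results. The following is only a minimal sketch under stated assumptions: the node names, the benchmark numbers, the `flag_outlier_nodes` helper, and the 10% tolerance are all hypothetical, and a real acceptance test would cover far more than one number per node.

```python
from statistics import median


def flag_outlier_nodes(results, tolerance=0.10):
    """Return the nodes whose result differs from the median by more than
    `tolerance` (a fraction of the median). `results` maps node name to a
    single benchmark figure, e.g. memory bandwidth or a short solver run."""
    if not results:
        return []
    typical = median(results.values())
    return sorted(
        node
        for node, value in results.items()
        if abs(value - typical) > tolerance * abs(typical)
    )


# Hypothetical per-node results from running the same benchmark everywhere.
results = {
    "node01": 100.0,
    "node02": 101.5,
    "node03": 99.2,
    "node04": 71.0,  # runs without error, but noticeably slower
}

slow_nodes = flag_outlier_nodes(results)
print(slow_nodes)
```

A node like the hypothetical `node04` never reports a failure; it only shows up when it is compared against its peers, which is why knowing what to measure is part of the expertise.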

So the expertise barrier for do-it-yourself clusters is pretty high. You have to already be a cluster expert or factor the ramp to becoming one into the purchase. Mistakes can be costly in time and money. For example, the cluster works fine, but the application for which it was purchased doesn't run. Oops. The upside is that once you have become an expert, you can reduce the number of missteps with each new purchase. Plus, many people in the community have tried to make it easier (and made great progress in doing so). Still, this is a big hurdle to get over.

The big problem is really who is the cluster expert. In a large company, this is likely dedicated staff in the IT department - a wise investment and required for large systems used by many people. In a small company, however, this might be one or more of the primary users of the system. That means engineers and scientists fill the role of cluster administrator, and that's time taken away from producing results in order to build and maintain the system. That type of investment is much harder for a small company to make. It might also be hard for a small department within a large company to justify the cost of their own IT resources required for a much-needed workgroup cluster.

In academia, this barrier used to be solved by throwing graduate students at the issue. I'd bet there is more than one thesis that morphed from "this is how to do science" into "this is how to build a cluster to do science." It's still time lost, though, that should be directed to doing work other than cluster administration. Plus, there is turnover. At some point, graduate students demand their degrees, leave school, and go off to small companies that ask them to administer their own cluster. The point being that it's a whole lot better for a researcher to be running a CFD simulation rather than figuring out why an RPM that was installed yesterday is now missing on one of the compute nodes of the cluster.
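The missing-RPM scenario comes down to comparing package lists across nodes. Here is a minimal sketch, assuming the per-node lists have already been gathered somehow (for example, from the output of `rpm -qa` on each node); the `find_missing_packages` helper, the node names, and the package names are hypothetical.

```python
def find_missing_packages(inventory):
    """Given {node: set of installed package names}, return
    {node: sorted list of packages found elsewhere but missing here}.
    Nodes that have everything are left out of the result."""
    all_packages = set()
    for packages in inventory.values():
        all_packages |= packages

    missing = {}
    for node, packages in inventory.items():
        absent = all_packages - packages
        if absent:
            missing[node] = sorted(absent)
    return missing


# Hypothetical inventory, one set of package names per compute node.
inventory = {
    "node01": {"mpich", "openssh", "example-lib"},
    "node02": {"mpich", "openssh", "example-lib"},
    "node03": {"mpich", "openssh"},  # example-lib has gone missing
}

report = find_missing_packages(inventory)
print(report)
```

This only tells you which node has drifted, not why, and tracking down the why is exactly the time the researcher would rather spend on the simulation.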

Solutions integrators exist for a reason - building a proper HPC cluster is more than just installing software on hardware. The do-it-yourself approach can help keep financial cost to a minimum with respect to the parts, but that must be weighed against the cost in expertise and time to build and maintain it. Because of the latter, broad adoption of HPC cluster use isn't likely to go down this path.



Pingback from Cluster Connection » Cluster buyer vs. seller: predefined solutions
Time August 21, 2010 at 11:30 am

[...] Now I want to break down a cluster purchase of a completely predefined solution to look at who does what in the four stages similar to how I looked at do-it-yourself purchases. [...]

Comment from abcd1234
Time September 15, 2010 at 2:36 pm

My problem is that I do not know how to select hardware and software. Any suggestions? Thanks.
I would like the cluster to be highly efficient for computation and to have PGI (f90 and cc) and mpich installed.

Comment from Brock Taylor
Time September 23, 2010 at 4:34 pm

A cluster solutions provider can help determine a complete solution for you, but the ideal situation is to start with your application needs. If you use a commercial application, then that vendor may provide good insight into components that work well for the application. If they are Intel Cluster Ready registered, then they may even point you to complete solutions for sale that are known to work with their code.
It's a really good and interesting question because it hits on the role of the application and the workload in the process of buying clusters. Ironically, I plan to write about that in my next entry.




Author Info
Brock Taylor

Brock Taylor is an Engineering Manager and Cluster Solutions Architect for volume High Performance Compute clusters in the Software and Services Group at Intel. He has been a part of the Intel® Cluster Ready program from the start, is a co-author of the specification, and launched the first reference implementations of Intel Cluster Ready certified solutions.

Brock and others at Intel are working within the HPC community to enable advances and innovations in scientific computing by lowering the barriers to clustered solutions.

Brock joined Intel in December of 2000, and in addition to HPC clustering, he previously helped launch new processors and chipsets as part of an enterprise validation BIOS team. Brock has a B.S. in Computer Engineering from Rose-Hulman Institute of Technology and an M.Sc. in High Performance Computing from Trinity College Dublin.