July 10th, 2009 3:37 pm
Posted by Douglas Eadline
Tags: air cooling, cooling, ICR, Intel Cluster Ready, leakage current, liquid cooling, Nehalem, power, watts
Care and feeding of the HPC server
As clusters have become denser, power and cooling design has become more important. Back in the day when servers were 2U or even 4U in size, a rack held 10-20 motherboards. Current clusters can easily have 40 motherboards using faster, hotter processors within a single rack. This high density has moved power and cooling from an afterthought to a critical design issue. Understanding the basics of power and cooling before you specify or buy a cluster can save a few headaches (and embarrassment) down the road.
The Cluster Environment
In the past, I have seen clusters placed in closets, next to desks, in corners of laboratories, even stacked on top of an old table. Many of these environments had issues and the cluster eventually had to be moved. Quite often there was not enough power, the power was of poor quality, the cooling was inadequate, or all three. Clusters need continuous and clean power such as that found in a data center environment. In addition, clusters create heat which must be removed. Most modern office or laboratory work environments are designed for desktop or deskside IT hardware, not for a stack of servers. One variation I have seen is to use a chemistry fume hood to exhaust heat and wall air conditioners to provide chilled air. This system worked, but is certainly non-standard to say the least.
A typical server, on average, may produce 300 watts of heat. To get a feel for this kind of heat, think of a small metal box with two 150 watt incandescent light bulbs in it. Now imagine 42 of these per rack and you can see how the heat adds up. Or better yet, think about standing in front of a 10x10 grid of 150 watt light bulbs.
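To put a number on the rack described above, here is a minimal sketch of the arithmetic. The 300 watt and 42 server figures come from the text. The BTU/hr and "tons of cooling" conversions are standard unit conversions that are not part of the original discussion, and the function name is just for illustration.

```python
# Rough heat load for one rack, using the figures from the text.
WATTS_PER_SERVER = 300          # average heat output per server (from the text)
SERVERS_PER_RACK = 42           # servers per rack (from the text)
BTU_PER_HOUR_PER_WATT = 3.412   # standard conversion, not from the article
BTU_PER_HOUR_PER_TON = 12_000   # one ton of refrigeration, not from the article


def rack_heat_load(servers=SERVERS_PER_RACK, watts_per_server=WATTS_PER_SERVER):
    """Return (watts, BTU per hour, tons of cooling) for one rack."""
    watts = servers * watts_per_server
    btu_per_hour = watts * BTU_PER_HOUR_PER_WATT
    tons = btu_per_hour / BTU_PER_HOUR_PER_TON
    return watts, btu_per_hour, tons


watts, btu_per_hour, tons = rack_heat_load()
print(f"{watts / 1000:.1f} kW, {btu_per_hour:,.0f} BTU/hr, {tons:.1f} tons of cooling")
```

Under these assumptions a single full rack gives off about 12.6 kW, which is the same as 84 of those 150 watt bulbs.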
Not only do servers get hot when they are running codes, but some people are surprised to learn that they have a pretty good power appetite when not doing anything at all. If a server runs at 300 watts, the idle power may be as much as 150 watts. The reason for this power usage is something called leakage current. Intel recently introduced the Nehalem family of processors, which significantly reduces leakage current, and thus idle power draw, by turning off unused circuitry in the processor.
Servers are cooled by pushing cool air through the case. That process transfers heat from the server components to the air, which is then cycled through a chiller unit that removes the heat. Essentially, cooling is a heat transfer process, as the heat has to go somewhere. In most data centers, air serves as the heat transfer medium. It turns out air is a very poor conductor of heat, which is why liquid cooling is now seeing greater use in dense computing environments.
If the heat is not removed adequately, the server gets hot and will eventually fail, or at least fail sooner than it otherwise would. In general, unless more cooling is applied, the failure rate will double for every 10 degree Celsius increase in operating temperature (remember the Arrhenius equation). Modern CPUs often do thermal throttling to prevent overheating as well. This point is important because it is possible to have an improperly cooled server work for a while, but then fail prematurely due to overheated memory, hard drive, power supply, and, rarely, the CPU. Additionally, a "hot" server may throttle the CPU, leak more power, increase fan speeds, and not be able to use the Nehalem turbo mode, which in turn could hold back other servers that are working in parallel with the server.
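The doubling rule above is easy to turn into a number. The following is a minimal sketch of that rule of thumb only, not a full Arrhenius model. The function name and the sample temperature increases are my own, chosen for illustration.

```python
def relative_failure_rate(delta_t_celsius, doubling_interval=10.0):
    """Failure rate relative to baseline, assuming it doubles every
    `doubling_interval` degrees Celsius of added operating temperature."""
    return 2.0 ** (delta_t_celsius / doubling_interval)


# Illustrative temperature increases above the baseline operating point.
for delta in (0, 5, 10, 20):
    rate = relative_failure_rate(delta)
    print(f"+{delta:>2} C -> {rate:.2f}x baseline failure rate")
```

By this rule, a server running 20 degrees Celsius hotter than intended would be expected to fail at roughly four times the baseline rate.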
A Back-of-the-Envelope Cost Estimate
As a simple example, let's calculate the power and cooling costs for a typical (x86 based) cluster. Consider our average dual socket server node that requires around 300 watts of power. The rule of thumb is that cooling (air chillers, fans, etc.) and power delivery inefficiencies can double this power requirement to 600 watts. Therefore, on an annual basis a server can require 5256 kilowatt hours (600 watts for 8,760 hours). At a nominal cost of $0.10 per kilowatt hour, the annual power and cooling cost for the server is approximately $526.
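The per-server estimate above can be reproduced in a few lines. This is a minimal sketch using the figures from the text (300 watts, a 2x overhead factor, $0.10 per kilowatt hour). The function and parameter names are hypothetical, and you would substitute your own power draw and utility rate.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def annual_energy_cost(node_watts=300, overhead_factor=2.0, dollars_per_kwh=0.10):
    """Return (kWh per year, dollars per year) for one node running 24x7.

    overhead_factor is the rule-of-thumb multiplier for cooling and
    power delivery losses (2.0 doubles the node's own draw).
    """
    total_watts = node_watts * overhead_factor
    kwh_per_year = total_watts * HOURS_PER_YEAR / 1000
    return kwh_per_year, kwh_per_year * dollars_per_kwh


kwh, dollars = annual_energy_cost()
print(f"{kwh:.0f} kWh per year, about ${dollars:.0f} per year")
```

Changing `overhead_factor` is a quick way to see how much a more efficient cooling setup would save per node.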
Next, consider the cost of an entire cluster. Assuming the cost of our 8 core server to be $3500 (including racks, switches, etc.), a typical 128 node cluster will provide 256 processors and 1024 cores and cost about $448,000. Based on the above assumptions, the annual power and cooling budget is then about $67,300 (128 nodes at roughly $526 each). Over a three year period this amounts to $202,000, or 45% of the system cost.
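The cluster-level numbers follow the same pattern. Here is a minimal sketch under the assumptions stated in the text: 128 nodes, $3500 per node, and about $525.60 per node per year for power and cooling. The function name and return layout are my own.

```python
def cluster_power_share(nodes=128, node_price=3500, node_annual_cost=525.60, years=3):
    """Return (hardware cost, annual power cost, multi-year power cost,
    power cost as a fraction of hardware cost)."""
    hardware = nodes * node_price
    annual_power = nodes * node_annual_cost
    total_power = annual_power * years
    return hardware, annual_power, total_power, total_power / hardware


hardware, annual, total, share = cluster_power_share()
print(f"hardware:     ${hardware:,.0f}")
print(f"power/year:   ${annual:,.0f}")
print(f"power/3 yrs:  ${total:,.0f} ({share:.0%} of hardware cost)")
```

Plugging in your own node count, node price, and local electricity rate shows quickly whether the three year power bill lands near the range discussed here.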
While costs may vary due to market conditions and location, the above analysis illustrates that for a typical commodity cluster the three year power cost can easily reach 40-50% of the hardware purchase price. Every effort should go into reducing the power requirement and at the same time increasing efficiencies. For instance, highly efficient (>90%) power supplies are a must, as are power-managed processors. There are even resource management packages that power down unused nodes.
There is plenty more to consider, but the above should give you enough background to explore the power and cooling needs of your cluster -- no matter how large or small it is. One of the most important pieces of advice is to work closely with your vendor, as they should have both the experience and the know-how to provide the correct solution for your needs. In particular, work with a vendor who understands HPC, as HPC servers are often used harder than mail or web servers. One way to ensure a vendor understands HPC clusters is to look and see if they offer Intel Cluster Ready (ICR) systems. There is a good chance that if they sell ICR systems they also understand power and cooling issues. You may also wish to consult a professional data center designer. And finally, there is always a chance you could land your cluster on the other "list" -- the Green500 list.
In the meantime, here are some references that you may find helpful: