October 2nd, 2009 1:40 pm
Posted by Thomas Gebert
Tags: Hardware, ICR, Intel Cluster Checker
Well yes, this is another "cheers to the Intel Cluster Checker" post, and there must be reasons why people keep writing about the Intel Cluster Ready program. Yes, there are reasons...
I probably do not have to tell you that setting up an HPC cluster is sometimes an adventure of compiling and installing different program versions and libraries. Once everything is up and running, though, the Intel Cluster Checker is a really nice tool for verifying that all the services your HPC cluster needs are set up correctly and that you haven't forgotten one of those tiny details during the installation.
My latest experience with the Intel Cluster Checker on a newly installed HPC cluster was a bit different from the ones before. The first steps ran smoothly and everything seemed to work fine, since I had learned from my first Intel Cluster Checker experience. I adjusted the benchmark thresholds and all tests ended with a "passed". At the very end, however, the test that checks the dmidecode output came back as "failed".
At first I thought this was due to some BIOS-specific mismatches, which you can exclude. But when I took a closer look at the Intel Cluster Checker output file, I saw that different RAM modules seemed to be installed on the compute nodes. Hmm... the memory sizes looked fine, all modules were 2GB, so I double-checked the product numbers of the DIMMs that the Intel Cluster Checker reported. There really were two different types of memory installed. On the one hand I was astonished, as I would never have checked this without the Intel Cluster Checker; on the other hand I could hardly believe that two different types of DIMMs were installed. Well, the servers I had installed were Intel Nehalem based, and I remembered the days when AMD started out with the CPU built-in memory controller and the memory problems that arose during those times...
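For readers who want to cross-check DIMM types by hand, here is a minimal Python sketch of such a comparison. It is not how the Intel Cluster Checker works internally. It assumes you have already collected the text output of `dmidecode --type memory` from each node by whatever means you prefer. The node names, part numbers and the helper names `dimm_part_numbers` and `group_nodes_by_dimms` are made up for illustration.

```python
import re
from collections import defaultdict

# Made-up excerpt in the layout of `dmidecode --type memory` output.
# The part number is a placeholder filled in per node below.
SAMPLE = """\
Handle 0x0040, DMI type 17, 28 bytes
Memory Device
\tSize: 2048 MB
\tLocator: DIMM_A1
\tPart Number: {part}

Handle 0x0041, DMI type 17, 28 bytes
Memory Device
\tSize: No Module Installed
\tLocator: DIMM_A2
\tPart Number: NO DIMM
"""


def dimm_part_numbers(dmidecode_text):
    """Return the sorted part numbers of all populated DIMM slots."""
    parts = []
    # dmidecode separates its records with blank lines.
    for block in re.split(r"\n\s*\n", dmidecode_text):
        lines = [line.strip() for line in block.splitlines()]
        if "Memory Device" not in lines:
            continue  # not a memory device record
        fields = {}
        for line in lines:
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if fields.get("Size", "").startswith("No Module"):
            continue  # empty slot
        if fields.get("Part Number"):
            parts.append(fields["Part Number"])
    return sorted(parts)


def group_nodes_by_dimms(outputs):
    """Map each distinct set of part numbers to the nodes that have it."""
    groups = defaultdict(list)
    for node, text in sorted(outputs.items()):
        groups[tuple(dimm_part_numbers(text))].append(node)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical cluster: node03 carries a different DIMM type.
    outputs = {
        "node01": SAMPLE.format(part="EXAMPLE-PN-A"),
        "node02": SAMPLE.format(part="EXAMPLE-PN-A"),
        "node03": SAMPLE.format(part="EXAMPLE-PN-B"),
    }
    for parts, nodes in group_nodes_by_dimms(outputs).items():
        print(", ".join(parts), "->", ", ".join(nodes))
```

More than one group in the result means the nodes do not all carry the same DIMMs, which is the situation described above: identical sizes, different part numbers.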
I took a closer look at the compute nodes and opened the chassis of the affected ones. And indeed, there were different types of memory modules installed, but they looked nearly the same and even had the same product number printed on them. After some investigation with our purchasing department I found out that those memory types had been mixed up. Unfortunately it was not possible to pin down the real cause of the confusion.
In the end the DIMMs were swapped for the correct ones and I ran the Intel Cluster Checker again. This time all tests passed and the results were sent to Intel to verify the Cluster Ready Certificate, which now proudly resides beside the other ICRs we have earned.