June 30th, 2009 12:04 pm
Posted by Marcel Van Drunen
Tags: cluster computing, cluster connection, Dell, ICR, linux, ms-dc, Open Source, open source world, rhel
I still remember the first all-staff meeting with a Dell VP. I forgot the good man’s name, but he really tried to have an open discussion with us. His first words, in a very strong accent, were: ‘You can never insult me, I am Scottish.’ One of my colleagues immediately stood up and asked: ‘Is this your first trip to Amsterdam?’
This will be my first blog posting for the Cluster Connection website. The nice thing about blogging is that if I manage to insult somebody, which could happen, the insulted party can always express an opinion on the forum, just as I have here.
I know that quite a few Linux operators will be upset if I compare the Intel Cluster Ready (ICR) program to the Microsoft DataCenter (MS-DC) program as it started late last century. Yes, I have to admit that I am from the Windows side of town, even though I started my career on PDP-11s, VAX, MS-DOS and Novell. At the time of the Microsoft DataCenter project I was working for Unisys. Coming from the mainframe world, we made large machines, designed for high availability. At the time the words Windows and High Availability were never seen in the same sentence, at least not positively correlated. Over the years that the MS-DC project ran, the stability of the software stack increased enormously. Not that many Linux people would like to admit this; see the second paragraph of this post.
So what makes a system stable and why is ICR going to help the Open Source world on its way? In my opinion the primary factor that made Linux systems more stable over the years was the quality of the people working with them. Becoming a Windows administrator takes a few months for somebody with a normal set of brains. Achieving that in the Linux world takes several years. The result is that most of the sysadmins in the Linux world really know their stuff. I can tell you from hard-earned experience (got the scars, etc., etc.) that most sysadmins in the Windows world have a long way to go in comparison.
The second thing that makes a system stable is the hardware. That looks like no differentiator, since both Windows and Linux run on the same hardware. There is a difference though. Even though there is a lot of complaining, all new hardware gadgets are supported in Windows within a very short time from their release. On Linux it generally takes longer, namely until a community adopts the gadget and matures the drivers and such. On servers this is much less of a problem than on personal systems; there is no need for fancy gadgets in the server room. Still, having all these different flavors of Linux around often causes problems. It is the sheer professionalism of most of the Linux admins that prevents disaster and solves these problems.
So why bother? Well, clustering technology is spreading its wings rapidly. It is no longer within reach only of well-funded research institutes and large companies. More and more smaller players are considering clusters to help them innovate their products in time to get a competitive edge. Most of the people working on these systems consider themselves researchers or innovators, not sysadmins or programmers. They don’t have the time or knowledge to spend days on sorting out drivers and re-compiling kernels. And that’s where ICR comes in. Not that installing an ICR stack prevents every problem, but at least all the support teams can work against a reference that is more easily established. As every support person knows, replicating a problem gets you to a solution faster.
When I worked on MS-DC I ran into a lot of discussions with customers. Why couldn’t they just download the latest firmware and install it? Because we checked in the lab, and it is not stable. Follow-up question: why has nobody built a new firmware and tested that? Well, maybe because our visionary management decided to fire half the tech department, being unaware of both the IT business and the consequences of their actions? I could see the same thing happening with the ICR program. I hope we have all learned from the MS-DC and similar programs. And I hope that the Open Source people suspend parts of their customary stubbornness and inward focus.
Before you start flaming away, read this: I sold a double cluster a few months ago. Half the nodes Win2008HPC, half the nodes RHEL. It took the MS guy one day to get his part up and running; the RHEL people are still busy! An early-beta version of the software was posted on the website, allegedly. Now there still is a driver issue, though I suspect it to be minor. Let’s not close our eyes: there is a lot of room for improvement, especially if we want to cater to smaller research groups with little OS knowledge, rather than big organizations with life-long Open Source gurus.