<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Cluster Connection &#187; Intel Cluster Checker</title>
	<atom:link href="http://www.clusterconnection.com/tag/intel-cluster-checker/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.clusterconnection.com</link>
	<description>Simplify HPC. Share the knowledge.</description>
	<lastBuildDate>Fri, 30 Dec 2011 21:23:37 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1</generator>
		<item>
		<title>Automating Cluster Maintenance - Part 2</title>
		<link>http://www.clusterconnection.com/2011/08/automating-cluster-maintenance-part-2/</link>
		<comments>http://www.clusterconnection.com/2011/08/automating-cluster-maintenance-part-2/#comments</comments>
		<pubDate>Thu, 04 Aug 2011 16:26:26 +0000</pubDate>
		<dc:creator>Patrick Ryan</dc:creator>
				<category><![CDATA[Blogs]]></category>
		<category><![CDATA[Hardware]]></category>
		<category><![CDATA[Software]]></category>
		<category><![CDATA[Automating cluster tests]]></category>
		<category><![CDATA[Cluster]]></category>
		<category><![CDATA[cluster management]]></category>
		<category><![CDATA[cluster test runs]]></category>
		<category><![CDATA[high performance computing]]></category>
		<category><![CDATA[HPC]]></category>
		<category><![CDATA[HPCC]]></category>
		<category><![CDATA[Intel Cluster Checker]]></category>
		<category><![CDATA[Intel Cluster Ready]]></category>
		<category><![CDATA[Intel MPI]]></category>
		<category><![CDATA[simplify]]></category>

		<guid isPermaLink="false">http://www.clusterconnection.com/2011/08/automating-cluster-maintenance-part-2/</guid>
		<description><![CDATA[Levels of Wellness Tests and Automation Scheduling Intel Cluster Checker can run a variety of tests depending on what the user is trying to accomplish. For general wellness of a cluster, Intel Cluster Checker offers five levels of thoroughness. For automation, we'll focus on levels one, three, and five. Wellness Level One is a very [...]]]></description>
			<content:encoded><![CDATA[<p><strong>Levels of Wellness Tests and Automation Scheduling</strong></p>
<p>Intel Cluster Checker can run a variety of tests depending on what the user is trying to accomplish. For general wellness of a cluster, Intel Cluster Checker offers five levels of thoroughness. For automation, we'll focus on levels one, three, and five.</p>
<p><strong>Wellness Level One</strong> is a very short run of tests that checks basic connectivity throughout the cluster and basic uniformity among the nodes. It focuses on BIOS settings and on processor, memory, and system configurations. The level one tests are quick and show that the cluster is online and ready for use.</p>
<p><strong>Wellness Level Three</strong> is the default run level for Intel Cluster Checker. It builds on level one and includes more rigorous modules that test parameters like disk and memory bandwidth, MFLOPS, and network performance. It also performs an in-depth hardware uniformity test along with an Intel MPI collectives and message integrity test. This check takes a bit longer to run, and it confirms that hardware performance is up to par.</p>
<p><strong>Wellness Level Five</strong> adds a packages test, which compares the currently installed packages with a list of expected packages generated at a given time. It also runs the HPCC module (a performance benchmark test). The level five tests can take a bit longer, but they assure users that all the pieces of the system are working in harmony. This test is useful for the admin to make sure a user hasn’t installed or uninstalled anything that may affect the cluster.</p>
<p>For my clusters, I have set up the following schedule that runs these wellness levels at various times.</p>
<ul>
<li>Level One - set to run each weekday</li>
<li>Level Three - set to run once a week, on Saturday</li>
<li>Level Five - set to run monthly, on the first Sunday</li>
</ul>
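<p>The schedule above can be sketched as a small date check. This is only an illustration, not the script I actually run: the function name <code>wellness_level_for</code> is hypothetical, and it assumes a single daily cron entry that calls a wrapper which decides what to run. A wrapper like this is handy because most cron implementations run a job when either the day-of-month or the day-of-week field matches if both are restricted, so "first Sunday of the month" cannot be expressed in one crontab line.</p>

```python
from datetime import date


def wellness_level_for(day):
    """Return the wellness level (1, 3, or 5) scheduled for `day`, or None."""
    weekday = day.weekday()  # Monday is 0, Sunday is 6
    if weekday in range(0, 5):
        return 1  # level one: each weekday
    if weekday == 5:
        return 3  # level three: once a week, on Saturday
    if day.day in range(1, 8):
        return 5  # level five: the first Sunday falls on day 1 through 7
    return None  # remaining Sundays: nothing scheduled


if __name__ == "__main__":
    print(wellness_level_for(date.today()))
```

<p>A nightly job could call this function and skip the run when it returns <code>None</code>.</p>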
<p>For our lab with multiple smaller clusters, this automation schedule is perfect: users can log in to one of the clusters, see when each of the wellness levels passed, and know that their jobs will run as expected.</p>
<p>Every cluster has a different purpose and a different load. Running tests at the intervals or frequency I have set may not be feasible or needed for every cluster. However, automating the test runs on a schedule that meets your needs will help assure that your cluster performance remains optimal, and it frees up the cluster(s) for jobs during normal business hours.</p>
<p>References:</p>
<p>&gt;&gt; <a href="http://www.clusterconnection.com/2011/06/automating-cluster-maintenance-part-1/">Review Part 1</a></p>
<p>&gt;&gt; <a href="http://software.intel.com/en-us/articles/intel-cluster-checker/">About Intel Cluster Checker</a></p>
<p>&gt;&gt; <a href="http://software.intel.com/en-us/articles/intel-cluster-checker-kb/all/1/">Intel Cluster Checker Knowledge Base</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.clusterconnection.com/2011/08/automating-cluster-maintenance-part-2/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Automating Cluster Maintenance - Part 1</title>
		<link>http://www.clusterconnection.com/2011/06/automating-cluster-maintenance-part-1/</link>
		<comments>http://www.clusterconnection.com/2011/06/automating-cluster-maintenance-part-1/#comments</comments>
		<pubDate>Wed, 22 Jun 2011 21:31:02 +0000</pubDate>
		<dc:creator>Patrick Ryan</dc:creator>
				<category><![CDATA[Blogs]]></category>
		<category><![CDATA[Automating cluster tests]]></category>
		<category><![CDATA[Cluster]]></category>
		<category><![CDATA[cluster management]]></category>
		<category><![CDATA[cluster test runs]]></category>
		<category><![CDATA[high performance computing]]></category>
		<category><![CDATA[HPC]]></category>
		<category><![CDATA[HPCC]]></category>
		<category><![CDATA[Intel Cluster Checker]]></category>
		<category><![CDATA[Intel Cluster Ready]]></category>
		<category><![CDATA[Intel MPI]]></category>
		<category><![CDATA[simplify]]></category>

		<guid isPermaLink="false">http://www.clusterconnection.com/2011/06/automating-cluster-maintenance-part-1/</guid>
		<description><![CDATA[Easily Keep Your HPC Cluster in Great Shape Whether you're a cluster user that expects optimal performance and functionality every time you run a job, or a system administrator that needs to keep the cluster in perfect working order for your users, running checks on a regular basis is important! A great way to ensure [...]]]></description>
			<content:encoded><![CDATA[<p><strong>Easily Keep Your HPC Cluster in Great Shape</strong></p>
<p>Whether you're a cluster user who expects optimal performance and functionality every time you run a job, or a system administrator who needs to keep the cluster in perfect working order for your users, running checks on a regular basis is important!</p>
<p>A great way to ensure that your <a href="http://software.intel.com/en-us/cluster-ready/">Intel Cluster Ready</a> certified HPC system remains in the same great shape as when it was first built is to run the <a href="http://software.intel.com/en-us/articles/intel-cluster-checker/" target="_blank">Intel® Cluster Checker</a> tool regularly.</p>
<p>While it’s easy enough to run Intel Cluster Checker on your cluster(s) once a week, I've found there are times when I was too busy, or simply forgot to run a check. For this reason, and because I have multiple clusters to check, I developed a method to automate Intel Cluster Checker runs. This automated solution runs Intel Cluster Checker and reports any errors directly to me and/or the system administrator. Not only does automating this process save the administrator a substantial amount of time each week, it also eliminates the chance of missing a run and ensures that the cluster remains in optimal health.</p>
<p><strong>A Cluster That Checks Itself</strong></p>
<p>In order to automate the process, it is important that your Intel Cluster Checker passes a manual run first. To do this, make sure the configuration file is optimized to ensure the cluster is operating at its best. With the perfect configuration on hand, a script is then needed to set up the system, run Intel Cluster Checker, and report the results.</p>
<p>For the first phase of the process, I wrote a script  that sets up the environment and runs the most in-depth wellness check  in Intel Cluster Checker.  I used <a href="http://en.wikipedia.org/wiki/Cron">cron</a> to schedule my script to execute once a week. After the results are  complete, the script updates the message of the day to show when the  last check ran and the results. If the check fails, the log file created  by Intel Cluster Checker will be copied to the specified directory  where it can be accessed and analyzed.</p>
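<p>The message-of-the-day step can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not my actual script: the <code>MARKER</code> prefix, the function names, and the line format are all hypothetical, and the pass/fail result is assumed to come from whatever wrapper runs the check.</p>

```python
import os
from datetime import datetime

# Hypothetical prefix used to recognise the status line written by earlier runs.
MARKER = "Intel Cluster Checker:"


def status_line(passed, when):
    """Build the one-line summary shown to users at login."""
    result = "PASSED" if passed else "FAILED"
    return f"{MARKER} last run {when:%Y-%m-%d %H:%M} - {result}"


def update_motd(motd_path, line):
    """Drop any previous status line from the MOTD file and append the new one."""
    kept = []
    if os.path.exists(motd_path):
        with open(motd_path) as motd:
            kept = [old.rstrip("\n") for old in motd if not old.startswith(MARKER)]
    kept.append(line)
    with open(motd_path, "w") as motd:
        motd.write("\n".join(kept) + "\n")
```

<p>Replacing the old status line, rather than appending a new one on every run, keeps the login message short no matter how often the check runs.</p>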
<p>The cluster is now set up to check itself, in the middle of the night, and report its status upon login!</p>
<p><a href="http://www.clusterconnection.com/2011/08/automating-cluster-maintenance-part-2/"><strong>Read Part 2</strong></a> for more about scheduling test runs, and the different levels of wellness tests in Intel Cluster Checker.</p>
<p>References:</p>
<p>&gt;&gt; <a href="http://software.intel.com/en-us/articles/intel-cluster-checker/">About Intel Cluster Checker</a></p>
<p>&gt;&gt; <a href="http://software.intel.com/en-us/articles/intel-cluster-checker-kb/all/1/">Intel Cluster Checker Knowledge Base</a></p>
<p>&nbsp;</p>
]]></content:encoded>
			<wfw:commentRss>http://www.clusterconnection.com/2011/06/automating-cluster-maintenance-part-1/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>How I&#039;ve experienced changes in HPC…</title>
		<link>http://www.clusterconnection.com/2010/09/how-ive-experienced-changes-in-hpc/</link>
		<comments>http://www.clusterconnection.com/2010/09/how-ive-experienced-changes-in-hpc/#comments</comments>
		<pubDate>Thu, 23 Sep 2010 19:21:43 +0000</pubDate>
		<dc:creator>Christopher Heller</dc:creator>
				<category><![CDATA[Blogs]]></category>
		<category><![CDATA[Changing HPC]]></category>
		<category><![CDATA[Cluster recipe]]></category>
		<category><![CDATA[clusterware solutions]]></category>
		<category><![CDATA[HPC applications]]></category>
		<category><![CDATA[Intel Cluster Checker]]></category>
		<category><![CDATA[Intel Cluster Ready]]></category>
		<category><![CDATA[ISVs]]></category>
		<category><![CDATA[Open Source Cluster Application Resources]]></category>
		<category><![CDATA[OSCAR]]></category>

		<guid isPermaLink="false">http://www.clusterconnection.com/2010/09/how-ive-experienced-changes-in-hpc/</guid>
		<description><![CDATA[Since this is my first article in a series of blogs I intend to write about HPC and Intel® Cluster Ready (ICR), I’ll start with my story to provide you a bit more about my background, and a little history of how Intel Cluster Ready has changed HPC for me. How it started and evolved [...]]]></description>
			<content:encoded><![CDATA[<p>Since this is my first article in a series of blogs I intend to write about HPC and Intel® Cluster Ready (ICR), I’ll start with my story to provide you a bit more about my background, and a little history of how Intel Cluster Ready has changed HPC for me.</p>
<p><strong>How it started and evolved</strong></p>
<p>Coming into HPC in 2005 was a great learning experience for me, as most of my former technical experience had been networking related. This included a position where I utilized WAP related technologies in the early "Wild West" days of using 802.11 wireless networking to span great distances, providing cheap broadband to rural commercial interests to save them the costs of ISDNs and even more cost-prohibitive T1s.</p>
<p>So, I started in 2005 by working for a year with one of the Intel HPC teams as a contractor, before officially joining Intel in 2006. Among my first tasks at Intel was to build a cluster. At that time this was accomplished by going through a how-to "recipe" document which contained a great many steps in order to manually set up the head (master) node, and then clone that node with “dd” (which was  slow) or kick-start (much better) and then continue following the steps to create a cluster "by hand." This method did not scale well beyond ~8 nodes as there was much room for operator error.</p>
<p>The next "evolution" I experienced was utilizing some home-grown setup scripts from a predecessor and writing some of my own. These were primitive setup scripts that left a lot to be desired, but they were easier than the “by hand” method. Among these scripts was a relatively simple cluster health check called "cluster-checker.sh." If given a correct hostfile, this script would go through and check the compute nodes for simple problems such as failed pings, broken rsh/ssh connectivity, and a few others.</p>
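<p>To give a flavor of what such a hostfile-driven check does, here is a minimal sketch. It is not the original "cluster-checker.sh": it covers only the ping test, the function names are hypothetical, the hostfile is assumed to hold one host per line with optional <code>#</code> comments, and the ping options are the Linux ones.</p>

```python
import subprocess


def read_hostfile(text):
    """Return host names from hostfile text, skipping blanks and # comments."""
    hosts = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if line:
            hosts.append(line.split()[0])  # ignore anything after the host name
    return hosts


def ping_ok(host):
    """True if a single echo request to `host` is answered (Linux ping options)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def failing_hosts(hosts, probe=ping_ok):
    """Return the hosts for which `probe` reports a problem."""
    return [host for host in hosts if not probe(host)]
```

<p>Passing the probe as a parameter means an ssh or rsh connectivity test can be plugged in the same way.</p>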
<p>Another useful find was OSCAR - Open Source Cluster Application Resources. What was really neat about OSCAR was that the developers had engineered a setup framework that greatly simplified the setup of a cluster, in most cases. This seemed like absolute luxury at the time. After I joined Intel, I wanted to be a part of this group that had helped make my job easier. (My later experience with NPACI ROCKS was similar as it was also an easier way to set up and configure a cluster.)</p>
<p><strong>Developing Intel Cluster Ready </strong></p>
<p>The research and development work associated with the Intel® Cluster Checker tool was an exciting time! We had to ensure that:</p>
<ul>
<li>The tool enforced the Intel Cluster Ready Specification with the --compliance flag.</li>
<li>The wellness checks covered a wide range of functions and could execute in acceptable times.</li>
</ul>
<p>This required making some tests fairly generic to apply to a large range of hardware/OS/clusterware solutions. The process also included a campaign to encourage ISVs to register their applications as Intel Cluster Ready, to verify that their applications will run successfully on any certified Intel Cluster Ready system. These registrations added critical libraries and binaries to be placed into the software images that allowed the registered applications to “just run.”</p>
<p><strong>Making it easier</strong></p>
<p>With all of these changes, the process of building out and troubleshooting an HPC cluster became significantly easier. These processes encouraged the further enhancement of existing recipes we had developed over the years to become “<a href="http://software.intel.com/en-us/articles/intel-cluster-ready-recipes/">Intel Cluster Ready Reference recipes</a>.” The development of these recipes was key in helping to better understand what it meant for the software stack to truly be Intel Cluster Ready compliant, and search out the best ways to facilitate the automation of these recipes working with the commercial provisioning vendors.</p>
<p>We're proud to say that since we began this endeavor just 3 years ago, a number of <a href="http://software.intel.com/en-us/articles/intel-cluster-ready-participating-vendors-system-software/">provisioning vendors</a> have come to offer ICR compliant clusterware solutions for OEMs to sell to their partners. Equally important, we are very proud of our <a href="http://software.intel.com/en-us/articles/intel-cluster-ready-participating-vendors-oems-systems-integrators/">OEM and system partners</a> who have modified their own clusterware stacks to align with Intel Cluster Ready. They are the ones who primarily benefit from Intel Cluster Checker. For many, it has become a critical tool, from the engineering of a cluster recipe (software stack) to checking that stack in manufacturing. This ensures that the customer gets a certified Intel Cluster Ready system, and the customer can use the Intel Cluster Checker tool to test the cluster's wellness or diagnose issues with the cluster, which simplifies the entire process.</p>
<p>This goes full circle when the customer calls the OEM or provisioning vendor for support and poses the question, “Is your system Intel Cluster Ready?” If so, the process becomes much easier with a tool that can establish the functionality/wellness of a cluster, including diagnosing many common problems we see when dealing with clusters on the system side.</p>
<p>We have seen great success in working with the provisioning vendors and OEMs regarding compliance to Intel Cluster Ready. This is in no small part due to the features and functionality provided by <a href="http://software.intel.com/en-us/articles/intel-cluster-checker/">Intel Cluster Checker</a>. While its primary purpose lies in enforcing the Intel Cluster Ready specification, it has expanded into a tool with over 100 various tests that deal with consistency, performance, and correct configurations. With the auto-configuration tool, it can easily handle heterogeneous clusters, perform single node and network performance tests, and much more.</p>
<p>It has changed HPC for me. Are you ready for it to change HPC for you?</p>
<p>Learn more about <a href="http://software.intel.com/en-us/cluster-ready/">Intel Cluster Ready</a>, and check back for future blogs on related HPC topics!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.clusterconnection.com/2010/09/how-ive-experienced-changes-in-hpc/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Accelerate small to mid-sized ANSYS HPC deployments</title>
		<link>http://www.clusterconnection.com/2010/06/accelerate-small-to-mid-sized-ansys-hpc-deployments/</link>
		<comments>http://www.clusterconnection.com/2010/06/accelerate-small-to-mid-sized-ansys-hpc-deployments/#comments</comments>
		<pubDate>Thu, 24 Jun 2010 20:07:43 +0000</pubDate>
		<dc:creator>Maria McLaughlin</dc:creator>
				<category><![CDATA[Briefs]]></category>
		<category><![CDATA[Products and Promotions]]></category>
		<category><![CDATA[ansys]]></category>
		<category><![CDATA[Appro]]></category>
		<category><![CDATA[Intel Cluster Checker]]></category>
		<category><![CDATA[Solution Guide]]></category>

		<guid isPermaLink="false">http://www.clusterconnection.com/?p=1697</guid>
		<description><![CDATA[Appro Ready-to-Go clusters are Intel® Cluster Ready certified, optimized to run ANSYS applications, and pre-tested with the Intel® Cluster Checker software to help ensure application and component interoperability. Power up with the Appro Ready-to-Go Cluster Series. &#62;&#62; Read the Solution Guide]]></description>
			<content:encoded><![CDATA[<p>Appro Ready-to-Go clusters are Intel® Cluster Ready  certified, optimized to run ANSYS applications, and pre-tested with the  Intel® Cluster Checker software to help ensure application and component  interoperability. Power up with the <a href="http://www.ansys.com/corporate/partners/company/cluster-ready-appro.asp" target="_blank">Appro Ready-to-Go Cluster Series</a>.</p>
<p><span style="color: #ff9900;">&gt;&gt; <span style="color: #000000;">Read the</span> <a href="http://software.intel.com/file/28832" target="_blank">Solution Guide</a></span></p>
]]></content:encoded>
			<wfw:commentRss>http://www.clusterconnection.com/2010/06/accelerate-small-to-mid-sized-ansys-hpc-deployments/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The Intel Cluster Checker Experience</title>
		<link>http://www.clusterconnection.com/2009/10/the-intel-cluster-checker-experience/</link>
		<comments>http://www.clusterconnection.com/2009/10/the-intel-cluster-checker-experience/#comments</comments>
		<pubDate>Fri, 02 Oct 2009 20:40:51 +0000</pubDate>
		<dc:creator>Thomas Gebert</dc:creator>
				<category><![CDATA[Blogs]]></category>
		<category><![CDATA[Featured]]></category>
		<category><![CDATA[Hardware]]></category>
		<category><![CDATA[ICR]]></category>
		<category><![CDATA[Intel Cluster Checker]]></category>

		<guid isPermaLink="false">http://www.clusterconnection.com/?p=1587</guid>
		<description><![CDATA[Well yes, this is another "cheers to the Intel Cluster Checker" blog and there must be reasons out there why people are writing about the Intel Cluster Ready program. Yes, there are reasons... I think I do not have to state that setting up an HPC cluster sometimes is an adventure of compiling and installing [...]]]></description>
			<content:encoded><![CDATA[<p>Well yes, this is another "cheers to the Intel Cluster Checker" blog and there must be reasons out there why people are writing about the Intel Cluster Ready program. Yes, there are reasons...</p>
<p>I think I do not have to state that setting up an HPC cluster sometimes is an adventure of compiling and installing different program versions and libraries. Nevertheless, when having all up and running, the Intel Cluster Checker is a really nice tool to verify that all services needed for your HPC cluster are set up correctly and you haven't forgotten one of those tiny details during the installation of your cluster.</p>
<p>My latest experience with the Intel Cluster Checker, on a newly installed HPC cluster, was a bit different from the ones before. The first steps ran smoothly and everything seemed to work fine, since I had learned from my first Intel Cluster Checker experience. The tests were successful: I adjusted the benchmark thresholds and all tests ended with a "passed". But in the end, the test that checks the dmidecode output came back "failed". At first I thought this was due to some BIOS-specific mismatches, which you can exclude. But when I took a closer look at the Intel Cluster Checker output file, I saw that different RAM modules seemed to be installed on the compute nodes. Hmm... the memory sizes looked fine, all were 2GB, so I double-checked the product numbers of the DIMMs that the Intel Cluster Checker reported. They showed two different types of memory installed. I was astonished, on the one hand because I would not have checked this without the Intel Cluster Checker, and on the other hand because I could hardly believe that two different types of DIMMs had been installed. Well, the servers I had installed were Intel Nehalem based, and I remembered the days when AMD started with the CPU built-in memory controller and the problems with memory that arose during those times...</p>
<p>I took a closer look at the compute nodes and opened the chassis of those that were affected. And indeed, different types of memory modules were installed, but they looked nearly the same and even had the same product number printed on them. After some investigation with our purchasing department, I found out that the memory types had been mixed up. Unfortunately, it was not possible to pin down the real cause of the confusion.</p>
<p>The DIMMs were swapped to the correct ones in the end and I ran the Intel Cluster Checker again. This time all tests were passed and the results were sent to Intel to verify the Cluster Ready Certificate, which now proudly resides beside the other ICRs we have scored.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.clusterconnection.com/2009/10/the-intel-cluster-checker-experience/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Software Updates on an Intel® Cluster Ready System</title>
		<link>http://www.clusterconnection.com/2009/08/software-updates-on-an-intel-cluster-ready-system/</link>
		<comments>http://www.clusterconnection.com/2009/08/software-updates-on-an-intel-cluster-ready-system/#comments</comments>
		<pubDate>Wed, 26 Aug 2009 17:18:04 +0000</pubDate>
		<dc:creator>Brock Taylor</dc:creator>
				<category><![CDATA[Blogs]]></category>
		<category><![CDATA[Featured]]></category>
		<category><![CDATA[certified]]></category>
		<category><![CDATA[Clusters]]></category>
		<category><![CDATA[HPC]]></category>
		<category><![CDATA[Intel Cluster Checker]]></category>
		<category><![CDATA[Intel Cluster Ready]]></category>
		<category><![CDATA[Intel Cluster Ready Architecture]]></category>

		<guid isPermaLink="false">http://www.clusterconnection.com/2009/08/software-updates-on-an-intel-cluster-ready-system/</guid>
		<description><![CDATA[My last post concluded that eventually the answer to, "is my cluster's software too old," is yes. Updating software on a cluster is not as simple as updating a single server, but the down side of errors is the same: if updates aren't done properly, clusters, like traditional servers, can mysteriously break or start to [...]]]></description>
			<content:encoded><![CDATA[<p>My last post concluded that eventually the answer to, "is my cluster's software too old," is yes. Updating software on a cluster is not as simple as updating a single server, but the downside of errors is the same: if updates aren't done properly, clusters, like traditional servers, can mysteriously break or start to have problems in the future. In addition, the chance of making an error during the update is proportional to the number of nodes in the cluster. For Intel® Cluster Ready compliant clusters, Intel has provided a couple of steps that can and should be performed after a software update to help verify the cluster is still compliant and ensure it is still functioning properly.</p>
<p>First, always use the required and supplied "provisioning system" tools to update a cluster. It may be relatively easy to update all the nodes in a cluster using a couple of RPMs and something like pdsh to script the installation of the update, but don't be tempted. This manual or brute-force method bypasses the software that manages the image on each server. Provisioning systems may reimage a node after a crash, or when a node is replaced or added to the cluster. If software updates are applied manually (outside of the provisioning system), the reimaged node will be inconsistent with the rest of the system. An admin would need to remember all the manual changes and apply them again. It's much better to let the provisioning software worry about that. That is one of the reasons ICR requires a provisioning system!</p>
<p>Once updates are applied, it's a good idea to verify the cluster is working as it did before the update was installed and it remains compliant with the Intel® Cluster Ready architecture.  Many if not most software updates will behave well, but verification helps ensure an update didn't alter or remove a key system component that may lead to application failures.  Intel® Cluster Checker provides an easy way to check the compliance after an update.  By using the command-line --compliance option, the tool will verify the interface defined by the architecture still exists as before.  It's an easy way to check that the update hasn't had any ill effect on the architecture interface used by ICR applications.</p>
<p>Finally, the Intel Cluster Checker configuration files may need updating to reflect the updated software. For example, if a newer version of the Intel C compiler is installed, the Intel Cluster Checker configuration file should be updated to utilize the newer version. Running the tool would then verify the new installation is functioning on all nodes. It's also valuable to update the list of packages that are expected on each node. The packages test verifies that the RPMs installed on each node match a predetermined list. Using the --packages command-line option will create new package lists based on the current installation (use it after all updates are complete). Save the original list file and set the configuration file to use the updated list. For more information on using the tool, see the Intel Cluster Checker Users Guide.</p>
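<p>The idea behind a packages check can be illustrated with a simple set comparison. This is only a sketch of the concept, not how Intel Cluster Checker implements its packages test; the function name and the sample package strings in the usage note are hypothetical.</p>

```python
def compare_packages(expected, installed):
    """Return (missing, unexpected) package names as sorted lists."""
    expected_set = set(expected)
    installed_set = set(installed)
    missing = sorted(expected_set - installed_set)      # on the list, not on the node
    unexpected = sorted(installed_set - expected_set)   # on the node, not on the list
    return missing, unexpected
```

<p>For example, comparing an expected list of <code>["gcc-4.3", "ksh-93t"]</code> against an installed list of <code>["ksh-93t", "mesa-7.2"]</code> reports <code>gcc-4.3</code> as missing and <code>mesa-7.2</code> as unexpected. After a deliberate update both lists differ by design, which is why the expected list has to be regenerated.</p>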
]]></content:encoded>
			<wfw:commentRss>http://www.clusterconnection.com/2009/08/software-updates-on-an-intel-cluster-ready-system/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Automatic SLES 11 deployment of an ICR certified HPC cluster</title>
		<link>http://www.clusterconnection.com/2009/07/automatic-sles-11-deployment-of-an-icr-certified-hpc-cluster/</link>
		<comments>http://www.clusterconnection.com/2009/07/automatic-sles-11-deployment-of-an-icr-certified-hpc-cluster/#comments</comments>
		<pubDate>Tue, 14 Jul 2009 17:26:01 +0000</pubDate>
		<dc:creator>Oliver Tennert</dc:creator>
				<category><![CDATA[Blogs]]></category>
		<category><![CDATA[Featured]]></category>
		<category><![CDATA[error messages]]></category>
		<category><![CDATA[Intel Cluster Checker]]></category>
		<category><![CDATA[Intel Cluster Ready]]></category>
		<category><![CDATA[Intel MPI]]></category>
		<category><![CDATA[Mesa]]></category>
		<category><![CDATA[MPI]]></category>
		<category><![CDATA[sles11]]></category>

		<guid isPermaLink="false">http://www.clusterconnection.com/2009/07/automatic-sles-11-deployment-of-an-icr-certified-hpc-cluster/</guid>
		<description><![CDATA[The other day we had to ship an ICR certified HPC cluster based on SLES 11, the latest SUSE Enterprise Distribution. We have used Intel Cluster Runtime version 2.0-1. As SLES 11 has to that date been out only for a couple of weeks, we didn't expect everything to run as smoothly as for SLES [...]]]></description>
			<content:encoded><![CDATA[<p>The other day we had to ship an ICR certified HPC cluster based on SLES 11, the latest SUSE Enterprise distribution. We used Intel Cluster Runtime version 2.0-1. As SLES 11 had at that point been out for only a couple of weeks, we didn't expect everything to run as smoothly as with SLES 10, and indeed the challenge turned out to be getting SLES 11 to behave in a way compatible with Intel Cluster Ready.</p>
<p>As we found out on the web, the Intel MPI implementation has a known bug: the number of available cores is not detected correctly. But it took some time to connect this to one of our problems, because the original error did not immediately point to it:</p>
<p>Intel(R) MPI Library Runtime Environment (Single-node), (intel_mpi_rt).........................................................FAILED<br />
subtest 'MPI Hello World! (I_MPI_DEVICE = sock)' failed<br />
- failing All hosts returned: 'No one returned Hello World!'<br />
subtest 'mpd shutdown' failed<br />
- failing All hosts returned: 'mpdallexit: cannot connect to local mpd (/tmp/mpd2.console_transtec); possible causes:<br />
1. no mpd is running on this host<br />
2. an mpd is running but was started without a "console" (-n option)'</p>
<p>As it turned out, the real problem was Intel MPI's "cpuinfo" program reporting the wrong number of cores in the system. Setting an environment variable,</p>
<p>"export I_MPI_CPUINFO=proc"</p>
<p>fixed this issue, however.</p>
<p>Another issue with Intel MPI is that it seems to be incompatible with Python 2.6, which comes along with SLES 11:</p>
<p>Intel® MPI Library Runtime Environment (Single-node), (intel_mpi_rt).........................................................FAILED<br />
subtest 'mpd startup' failed<br />
- failing All hosts returned:<br />
'/opt/intel/impi/3.2.0.011/bin64/mpdlib.py:27: DeprecationWarning: The popen2 module is deprecated.  Use the subprocess module.<br />
import sys, os, signal, popen2, socket, select, inspect<br />
/opt/intel/impi/3.2.0.011/bin64/mpdlib.py:37: DeprecationWarning: the md5 module is deprecated; use hashlib instead<br />
from  md5       import  new as md5new'</p>
<p>Although this should not constitute not a real problem, cluster checker seems to be sensitive enough to end with an error code even for warning like this. We could fix it by installing Python 2.4, which, unfortunately, does not exist as a SLES 11 package, so we had to take a source tarball and recompile the package. Strange though, why SLES 11 does not include Python 2.4 as a fallback package for compatibility reasons as there are many Python programs out there based on that version.</p>
<p>In my opinion, this just demonstrates the power of Intel Cluster Ready, or rather of cluster checker: Intel MPI causes a problem, but the cluster checker does its job and identifies a compatibility issue, impressively enough! Thus, the ICR program enables us to catch the problem before it reaches the customer.</p>
<p>A strange error that occurred was this one:</p>
<p>X11 runtime libraries are provided, (X11_libs).........................FAILED<br />
subtest 'libGLw.so (x86-64) &gt;= version 1' failed<br />
- failing All hosts returned: 'missing'</p>
<p>Evidently, the missing library is part of the Mesa package, which is a necessary HPC cluster ingredient from Intel Cluster Ready's point of view. After a little research, we found out that nearly all current Linux distributions explicitly remove this library from the standard Mesa package, for whatever reason.</p>
<p>We solved that cleanly by replacing the SLES Mesa package with a recompiled one, adapting the SPEC file so that the libGLw libraries are included.</p>
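<p>Whether the rebuilt package really delivers the library can be verified on a node with a few lines of code. The sketch below is a hypothetical helper, not part of cluster checker. The directories to search are an assumption and have to be adapted to the system at hand.</p>

```python
import os


def matching_libraries(name, filenames):
    """Return file names equal to `name` or versioned as `name.N...`."""
    return sorted(
        f for f in filenames if f == name or f.startswith(name + ".")
    )


def find_library(name, search_dirs):
    """Look for a shared library in the given directories."""
    found = []
    for directory in search_dirs:
        # Silently skip directories that do not exist on this node.
        if os.path.isdir(directory):
            for filename in matching_libraries(name, os.listdir(directory)):
                found.append(os.path.join(directory, filename))
    return found
```

<p>For example, find_library("libGLw.so", ["/usr/lib64", "/usr/lib"]) returns an empty list as long as the library is missing from those two directories.</p>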
<p>A funny thing that happened was that Korn shell was missing its locales:</p>
<p>Korn Shell, (ksh)......................................................FAILED<br />
subtest 'Hello World!' failed<br />
- failing hosts node1 - node3 returned: 'en_US.UTF-8: unknown locale'</p>
<p>But the package glibc-locale seemed to be installed:</p>
<p>root@node1 # rpm -q glibc-locale<br />
gcc-locale-4.3-62.198</p>
<p>What was going on?</p>
<p>It turned out to have nothing to do with ICR incompatibilities of SLES 11, but to be a problem with "xCAT 2", which we use for cluster deployment. Because we were testing diskless installations, xCAT applied a list of files and directories to be deleted before creating the compressed root image. In the configuration file</p>
<p>/opt/xcat/share/xcat/netboot/sles/compute.exlist</p>
<p>the locale files are explicitly listed to be deleted before image generation. Once we had found that out, it was easy to fix.</p>
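<p>Entries of this kind are easy to spot automatically. The following sketch assumes that the exclude list holds one path pattern per line. Both the function name and the sample entries in the usage note are hypothetical and not copied from a real xCAT installation.</p>

```python
def locale_exclusions(exlist_text):
    """Return exclude-list entries whose path mentions 'locale'."""
    hits = []
    for raw_line in exlist_text.splitlines():
        entry = raw_line.strip()
        # Skip blank lines; keep every entry that names a locale path.
        if entry and "locale" in entry.lower():
            hits.append(entry)
    return hits
```

<p>Given the lines "./usr/share/man/*", "./usr/lib/locale*" and "./usr/share/locale/*", the function returns the last two entries, which are the candidates to remove from the list before the image is regenerated.</p>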
<p>After solving all these issues, we finally succeeded in developing a fully automatic deployment procedure for ICR-certified HPC clusters based on SLES 11.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.clusterconnection.com/2009/07/automatic-sles-11-deployment-of-an-icr-certified-hpc-cluster/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Detecting Silent Problems in Clusters</title>
		<link>http://www.clusterconnection.com/2009/04/detecting-silent-problems-in-clusters/</link>
		<comments>http://www.clusterconnection.com/2009/04/detecting-silent-problems-in-clusters/#comments</comments>
		<pubDate>Mon, 27 Apr 2009 20:42:56 +0000</pubDate>
		<dc:creator>Brock Taylor</dc:creator>
				<category><![CDATA[Blogs]]></category>
		<category><![CDATA[Featured]]></category>
		<category><![CDATA[debugging]]></category>
		<category><![CDATA[InfiniBand]]></category>
		<category><![CDATA[Intel]]></category>
		<category><![CDATA[Intel Cluster Checker]]></category>

		<guid isPermaLink="false">http://www.clusterconnection.com/2009/04/detecting-silent-problems-in-clusters/</guid>
		<description><![CDATA[A cluster node doesn't boot, the network is down, the power supply is smoking - these are actually nice problems for a cluster administrator. These problems conveniently provide the starting point for the path to resolution and usually are quickly resolved. What's not so nice are issues that don't actually break the system but allow it [...]]]></description>
			<content:encoded><![CDATA[<p>A cluster node doesn't boot, the network is down, the power supply is smoking - these are actually nice problems for a cluster administrator. These problems conveniently provide the starting point for the path to resolution and usually are quickly resolved. What's not so nice are issues that don't actually break the system but allow it to limp along or subtly degrade over time. These problems can chip away at cluster performance usually without presenting an explicit symptom. The system runs, but it just does not seem to run as well as it used to.</p>
<p>My favorite example is the damaged network cable that is dropping data intermittently but not to the point of outright failure. This might result in a support call reporting, "the cluster is broken," and an application that used to run in two hours now takes almost three. It might be intuitive to some to proceed right to checking the link speeds between each node pair, but it's likely that homegrown methods are the primary approach to discovering the cause of the slowdown.</p>
<p>Enter Intel® Cluster Checker as the systematic approach to finding these problems. The tool provides a diagnostic report by checking the usual suspects that can cause functional issues in the cluster. I have had a damaged InfiniBand cable continue to operate but at half the bandwidth. Intel Cluster Checker doesn't tell me I have a bad cable, but it does provide me data that the bandwidth over the fabric to one particular node is below par. It allows me to narrow in quickly on the failing component.</p>
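<p>The idea of spotting a node that is "below par" can be illustrated with a small sketch. This is not how Intel Cluster Checker works internally. It simply compares each node's measured bandwidth with the median of all nodes, and the function name, the 75% cutoff and the numbers in the usage note are assumptions made for illustration.</p>

```python
from statistics import median


def below_par(bandwidth_by_node, fraction=0.75):
    """Return nodes whose bandwidth is under `fraction` of the median."""
    if not bandwidth_by_node:
        return []
    # The median is robust against a single badly performing node.
    threshold = fraction * median(bandwidth_by_node.values())
    return sorted(
        node
        for node, value in bandwidth_by_node.items()
        if threshold > value
    )
```

<p>With hypothetical measurements of 1500, 1480, 760 and 1510 MB/s for node1 through node4, only node3 is reported. That is the half-bandwidth pattern a damaged cable can produce.</p>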
<p>The inherent value is that there isn't the need to speculate on any particular issue. Step 1 is always to run Intel Cluster Checker and see what it reports. The tool does the investigation work and keeps you from chasing dead ends. It also helps expose those silent issues that are not obvious or routine to check.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.clusterconnection.com/2009/04/detecting-silent-problems-in-clusters/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Intel Cluster Ready, going further with Platform Computing</title>
		<link>http://www.clusterconnection.com/2009/04/intel-cluster-ready-going-further-with-platform-computing/</link>
		<comments>http://www.clusterconnection.com/2009/04/intel-cluster-ready-going-further-with-platform-computing/#comments</comments>
		<pubDate>Thu, 23 Apr 2009 20:26:45 +0000</pubDate>
		<dc:creator>Mehdi Bozzo-Rey</dc:creator>
				<category><![CDATA[Blogs]]></category>
		<category><![CDATA[Middleware]]></category>
		<category><![CDATA[framework]]></category>
		<category><![CDATA[GUI]]></category>
		<category><![CDATA[ICR]]></category>
		<category><![CDATA[ihv]]></category>
		<category><![CDATA[Intel Cluster Checker]]></category>
		<category><![CDATA[Intel Cluster Ready]]></category>
		<category><![CDATA[open cluster stack]]></category>
		<category><![CDATA[OS]]></category>
		<category><![CDATA[Platform Computing]]></category>
		<category><![CDATA[scalability]]></category>

		<guid isPermaLink="false">http://www.clusterconnection.com/?p=850</guid>
		<description><![CDATA[The goal  is pretty simple: making sure that an initial combination of hardware, software and configuration files (the recipe) can be replicated over and over to ensure the same behavior and more important, the same performance for all clusters shipped.  This is good, but not enough because you don’t buy clusters just for those performance [...]]]></description>
			<content:encoded><![CDATA[<p style="background: white;"><span style="font-size: 9pt; font-family: &quot;Arial&quot;,&quot;sans-serif&quot;; mso-ansi-language: EN;" lang="EN">The goal is pretty simple: making sure that an initial combination of hardware, software and configuration files (the recipe) can be replicated over and over to ensure the same behavior and, more importantly, the same performance for all clusters shipped. This is good, but not enough, because you don’t buy clusters just for those performance numbers; you buy a cluster because you want to run applications on it. Here comes the second part of the certification: compliance with the ISV’s prerequisites, to ensure, for example, that your preferred application (part of the ICR framework) will just be waiting for your input file after install. No need to do extra configuration or install additional software pieces: a nice amount of time (and money) saved.</span></p>
<p style="background: white;"><span style="font-size: 9pt; font-family: &quot;Arial&quot;,&quot;sans-serif&quot;; mso-ansi-language: EN;" lang="EN">Now, as the OS is anything but static, you will have to patch your system at some point. Maybe you can afford a “real” test cluster: you apply the patches, run the cluster-checker on these updated nodes, check that everything is fine, schedule a maintenance period, update the whole cluster and put it back in production. This sounds quite expensive in terms of the time and hardware (depreciation and electricity) needed just for testing.</span></p>
<p style="background: white;"><span style="font-size: 9pt; font-family: &quot;Arial&quot;,&quot;sans-serif&quot;; mso-ansi-language: EN;" lang="EN">You can also take advantage of the Open Cluster Stack 5 framework that we developed at Platform Computing: take a snapshot of the repository used to provision the nodes, update this snapshot, decouple a few nodes from the production cluster and let them use the new updated repository (all of this is done live). You can then validate the updated snapshot. Is everything fine? Perfect, you can then apply the patches on your production cluster (no interruption needed, except a reboot that can be scheduled via your scheduler in case of a kernel update) and of course put the few nodes you used back into production. Sounds complicated? Not really when you can use a GUI to perform all those operations. Again, a nice amount of time (money) can be saved and, more importantly, the certification process is no longer tied to a certain point in time (the installation) but is now part of the lifecycle of the cluster.</span></p>
<p style="background: white;"><span style="font-size: 9pt; font-family: &quot;Arial&quot;,&quot;sans-serif&quot;; mso-ansi-language: EN;" lang="EN">Let’s have a look at the problem from an IHV’s point of view. How do you get your hardware / software combination certified? The answer is simple: you run the cluster-checker against a recipe.</span></p>
<p style="background: white;"><span style="font-size: 9pt; font-family: &quot;Arial&quot;,&quot;sans-serif&quot;; mso-ansi-language: EN;" lang="EN">So the first step is to build a recipe. This can be an iterative and time-consuming process: install the stack, run the cluster-checker, update the recipe given the cluster-checker results, and start over again. It can also lead to highly complex recipes, so you will probably end up at the end of the day with a nice set of scripts that you need to execute before you can successfully run the cluster-checker. Of course, your next step is to write a master script that will sequentially run all these small scripts: the last thing you want is a complex procedure on the assembly line.</span></p>
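<p>Such a master script can be very small. The sketch below is a generic illustration, not taken from any Platform Computing product: it runs a list of commands in order and stops at the first one that fails. The function name and the two example steps are hypothetical stand-ins for real preparation scripts.</p>

```python
import subprocess
import sys


def run_steps(commands):
    """Run each command in order; stop at the first non-zero exit code.

    Returns the list of commands that completed successfully.
    """
    completed = []
    for command in commands:
        result = subprocess.run(command)
        if result.returncode != 0:
            # Later steps may depend on this one, so do not continue.
            break
        completed.append(command)
    return completed


# Hypothetical stand-ins for real preparation scripts.
EXAMPLE_STEPS = [
    [sys.executable, "-c", "print('configure locales')"],
    [sys.executable, "-c", "print('install libraries')"],
]
```

<p>Calling run_steps(EXAMPLE_STEPS) executes both steps and returns them. If the returned list is shorter than the input, the first command missing from it is the one that failed.</p>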
<p style="background: white;"><span style="font-size: 9pt; font-family: &quot;Arial&quot;,&quot;sans-serif&quot;; mso-ansi-language: EN;" lang="EN">Imagine now that most of the scripts needed are not only embedded in the software stack but that the certification process is also part of the development process and, more precisely, part of the QA process. For an IHV, this means that the software that will be installed is already “certifiable”, so the cost of writing a recipe becomes a lot more affordable. This is exactly what we’ve done at Platform Computing: using the ICR certification process as part of our QA process. Some steps needed for a successful cluster-checker run are part of the core of our clustering solution; no need to write scripts, just enable a software component in one click. Of course, this can be done through an intuitive GUI. Customization can then be added to the software stack in order to handle, for example, hardware deviations from an original recipe.</span></p>
<p style="background: white;"><span style="font-size: 9pt; font-family: &quot;Arial&quot;,&quot;sans-serif&quot;; mso-ansi-language: EN;" lang="EN">In conclusion, we’ve just demonstrated how easy it is to save time (and money) for both IHVs and users by integrating the ICR certification process not only in the cluster’s lifecycle (Platform Open Cluster Stack 5) but also in the development process (Platform Computing) in order to get a scalable certified solution.</span></p>
]]></content:encoded>
			<wfw:commentRss>http://www.clusterconnection.com/2009/04/intel-cluster-ready-going-further-with-platform-computing/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Accelerated HPC Productivity with Intel® Cluster Ready Solutions</title>
		<link>http://www.clusterconnection.com/2009/04/accelerated-hpc-productivity-with-intel-cluster-ready-solutions/</link>
		<comments>http://www.clusterconnection.com/2009/04/accelerated-hpc-productivity-with-intel-cluster-ready-solutions/#comments</comments>
		<pubDate>Tue, 07 Apr 2009 16:36:53 +0000</pubDate>
		<dc:creator>Brock Taylor</dc:creator>
				<category><![CDATA[Briefs]]></category>
		<category><![CDATA[Hardware]]></category>
		<category><![CDATA[Clusters]]></category>
		<category><![CDATA[Dell]]></category>
		<category><![CDATA[HPC]]></category>
		<category><![CDATA[Intel]]></category>
		<category><![CDATA[Intel Cluster Checker]]></category>
		<category><![CDATA[linux]]></category>
		<category><![CDATA[poweredge]]></category>

		<guid isPermaLink="false">http://www.clusterconnection.com/2009/04/accelerated-hpc-productivity-with-intel-cluster-ready-solutions/</guid>
		<description><![CDATA[[Excerpt] The Intel® Cluster Ready program provides a standardized,  replicable way to build and run high-performance   computing (HPC) clusters, helping simplify cluster deployment and management. By using Intel Cluster Ready-certified Dell™ HPC clusters, organizations can quickly install and configure clusters to begin running registered HPC applications. Click here to download "Accelerated HPC Solutions with Intel® [...]]]></description>
			<content:encoded><![CDATA[<p>[Excerpt] The Intel® Cluster Ready program provides a standardized, replicable way to build and run high-performance computing (HPC) clusters, helping simplify cluster deployment and management. By using Intel Cluster Ready-certified Dell™ HPC clusters, organizations can quickly install and configure clusters to begin running registered HPC applications.</p>
<p><a href="http://www.dell.com/downloads/global/power/ps4q08-20080416-Intel.pdf">Click here to download "Accelerated HPC Solutions with Intel® Cluster Ready"</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.clusterconnection.com/2009/04/accelerated-hpc-productivity-with-intel-cluster-ready-solutions/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>

