August 13th, 2009 2:11 pm
Posted by Lee Porter
Tags: best practices, cluster computing, Clusters, InfiniBand, juropa, partec
Overview of Juropa
The name JuRoPA is itself an acronym which reflects the broader goals of the project of which this machine is just a part. JuRoPA stands for Jülich Research on Petaflop Architectures. The machine will be made available to over 200 research groups across Europe and will be used for data-intensive applications.
Figure 1: Juropa-JSC
JuRoPA's architecture was developed by HPC experts from the Jülich Supercomputing Center. Partner companies Bull, Sun, Intel, Mellanox and ParTec were responsible for the realization of the machine, which consists of 3288 compute nodes with a total peak computing power of 308 Teraflops.
The Juropa cluster is an aggregate of two smaller clusters, known individually as Juropa-JSC and HPC-FF. Both machines share a common interconnect fabric (QDR Infiniband) in addition to a common management network. ParastationV5, ParTec's current release of its cluster operating system, enabled both clusters to be integrated to form a single heterogeneous cluster entity capable of solving the most challenging problems facing researchers today.
Figure 2: HPC-FF
Juropa's Infiniband Network
The interconnect for the Juropa cluster is common to both the HPC-FF and Juropa-JSC machines. It is based on the latest Mellanox QDR silicon but uses switches from two different vendors. The top-level switching (levels 1 and 2) is provided by 6 Sun (M9) 648 switches and 2 virtual M9 switches. The virtual M9 switches are topologically the same as the real M9 switches - but they are built from individual MTS3600 switches interconnected to replicate the internal connectivity of the actual M9s. These "virtual" M9s are shown in purple in Figure 3.
Figure 3 - IB topology of JuRoPa
Figure 4: Sun 648 - M9
JuRoPa's IB components are supplied by two different vendors, Mellanox and Sun. However, the basic building block of both switches is the same 36 port QDR InfiniScale® IV ASIC, the 4th generation of switch silicon from Mellanox.
The M9s, of which there are six, provide most of the level 1 and level 2 switching for both the HPC-FF and Juropa-JSC machines. For the Juropa-JSC machine, the level zero switching is provided by the QNEM modules - or Quad-datarate Network Express Modules. These QNEM modules are integrated directly into the back of the 6048 blade system. Connectivity to the HCAs is achieved via backplane traces internal to the blade shelf. The HCAs themselves are integrated into the server board using a 40 Gb/s InfiniBand ConnectX® mezzanine adapter card. This means there is no cabling to worry about between the HCA and the first level of switching.
Figure 5: QNEM modules - 10 12x ports externally
It was recognized early on in the Juropa project that connectivity to the M9 switches (with 12X connectors) from the HPC-FF machine, which had regular QSFP (4x) HCAs, required special cable considerations. In order to connect the 1U servers of the HPC-FF machine to the M9, a special 12X -> 4X splitter cable was required, shown in Figure 6. These are Sun proprietary cables with a CXP connector on one end and 3 QSFP+ connectors on the other.
Figure 6: Sun splitter cable
Mellanox MTS3600 switches provide the level zero switching for the HPC-FF cluster. These were connected to 4x QDR ConnectX HCAs, housed in standard PCI-E Gen 2.0 x8 slots on the 1U HPC-FF servers. This connectivity was typically achieved with QSFP copper cables of 2m length or less.
Why is the IB debugging strategy so important for Large Clusters?
There are a myriad of reasons why a cluster might not perform to specification. These problems become difficult to manage on very large distributed memory machines. For example, there are many combinations of node BIOS settings which can affect individual node performance. As MPI jobs progress at the speed of the slowest participant, it is essential that BIOS settings are both optimal and consistent across the whole cluster. Tools such as Intel's Cluster Checker are valuable in spotting such inconsistencies and rectifying them. However, that is only part of the story. Node consistency checking and per-node benchmarking are important - but the network must also be considered. In particular, for commodity based clusters using an Infiniband interconnect, the performance of the IB network itself can never be taken for granted. A single link with high error rates - resulting in either lost packets or significant data re-sends - will impact performance across the whole cluster. These problems manifest themselves as either MPI job failures or generally poor performance.
There were two basic objectives which were identified prior to planning the diagnostic activity for the Juropa cluster:
- Realize Per-Node parallel efficiency across the whole cluster of 3288 compute nodes
- Aid isolation and identification of faulty IB components (cables, ports, HCAs), using location strings to identify their precise physical locations.
Goal 1 was quantified by stipulating a parallel efficiency in excess of 90% (Nehalem Turbo Mode off). In particular, the parallel efficiency of the entire system should be about the same as that of a single compute node, provided there is nothing pathologically wrong with the Infiniband network, the cluster nodes are consistently imaged and configured, and the MPI libraries have been suitably optimized. This is the case for HPL benchmark runs on IB networks running at DDR speed (5 Gbps per lane) or faster. For QDR, as used in the Juropa machine, the 10 Gbps per-lane rate is more than adequate.
For an SDR IB network, the reduced bandwidth would typically give a 2-4% downside in HPL benchmark performance. This is because LINPACK computations mostly communicate with process ranks which are immediate neighbors, and they do so with relatively large message sizes, making the application somewhat bandwidth intensive. As such, the HPL benchmark is not particularly latency sensitive.
The normal single node parallel efficiency for the Nehalem based nodes used in the Juropa cluster is approx. 91%. This is with Turbo mode OFF. If the IB network is clean, BIOS settings are consistent and there are no hardware issues on individual nodes, then it should be possible to reproduce this efficiency across the entire machine.
Figure 7: Snippet of output of ibdiagnet - showing number of IB components
The second goal of the JuRoPa IB debugging strategy was to devise a plan and construct a series of tools that enabled the integrator to quickly identify and replace faulty IB components by tying the error reports generated by the IB diagnostic tools to the location strings of components on the machine floor. This was very important in a machine of this size, with 3881 individual IB components - as illustrated in Figure 7. The ibdiags tools are very efficient at identifying where the errors in a fabric are and listing those errors with their associated GUIDs. A GUID is similar to a MAC address in the Ethernet world. Each IB component has a separate GUID. However, in a fabric containing this many components, locating a faulty component when you only have a GUID to go on is a daunting challenge. Integrators could simply record the base GUID of each switch component in a spreadsheet - and search that spreadsheet for the corresponding machine room location of the faulty component. However, this is a time-consuming and inefficient way to proceed. What is needed is a way to link the switch location with the ibdiags tools so that the location strings for components become part of the diagnostic output. This is done using an IB topology file.
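To make the manual approach concrete, the spreadsheet-style lookup described above can be sketched in a few lines of Python. This is only an illustration: the GUID values, the record layout and the `annotate` helper are all hypothetical, and real ibdiagnet reports are not in this form. The location names simply follow the naming style used elsewhere in this article.

```python
# Hypothetical inventory: switch base GUID -> machine-room location string.
# The GUIDs below are invented for illustration only.
LOCATIONS = {
    "0x0002c90200400001": "NEM8_A6_D",
    "0x0002c90200400002": "C9_U",
}


def annotate(errors, locations):
    """Attach a location string to each (guid, port, message) error record."""
    annotated = []
    for guid, port, message in errors:
        # Normalise case so differently capitalised GUIDs still match.
        location = locations.get(guid.lower(), "UNKNOWN")
        annotated.append((location, port, message))
    return annotated


# Hypothetical error records, as if copied by hand from a diagnostics report.
errors = [
    ("0x0002C90200400001", 22, "symbol errors"),
    ("0x0002c902004000ff", 3, "link width below 4x"),
]

for location, port, message in annotate(errors, LOCATIONS):
    print(f"{location} P{port}: {message}")
```

Even in this toy form, the weakness is visible: any GUID missing from the hand-maintained inventory comes back as UNKNOWN, and the lookup is a separate step from running the diagnostics.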
Juropa Floor Plan - understanding the topology files
The IB topology file can be used to reference component locations in terms of their connectivity to other components. Figure 8 below details the floor plan of the Juropa cluster. As you can see, there are 2 distinct systems. The first (in green) is the HPC-FF and the second (in orange) is the JSC cluster.
Constructing Infiniband topology Files
A topology file has 2 core uses:
- Topology files can be used to specify the intended connectivity of an IB network. The topology file can then be used in conjunction with the ibtopodiff utility (part of the ibdiags package in the OFED stack) to determine if there are any discrepancies between the intended topology and the physical one. This helps the network engineer identify potential cabling errors that might cause traffic congestion in the physical network.
- Topology files can be used to append location strings to specific connection sources and destinations. In Figure 9 below, we can see that a Mellanox MTS3600 switch located at location C9 (rack location on floor plan) position U (vertical location in rack) has port 19 (for example) connected to Sun M9 switch #1 located at rack location A7. The connecting port is on line card 7 and is labeled B8/P1 on the silk screen.
Figure 9: Extract from topology file
Topology files such as the one shown in Figure 9 above are typically derived from the netlist developed by the integrator. MTS3600 and SDCIB648 are subsystem components which themselves have their own connectivity (IBNL) files. These files are typically included in the OFED stack.
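The idea behind comparing an intended topology with the discovered one can be sketched as a set difference over links. This is not how ibtopodiff is implemented, and it does not read real topology or IBNL files. The `link` and `diff_topology` helpers are hypothetical, the endpoint names are loosely modelled on the C9/A7 example above, and the second link is invented to show a miscabled port.

```python
def link(end_a, end_b):
    """An undirected link between two (system, port) endpoints."""
    return frozenset((end_a, end_b))


def diff_topology(intended, discovered):
    """Return (missing, unexpected) links as sorted lists of endpoint pairs."""
    def as_pairs(links):
        return sorted(tuple(sorted(item)) for item in links)

    missing = as_pairs(intended - discovered)      # planned but not found
    unexpected = as_pairs(discovered - intended)   # found but not planned
    return missing, unexpected


# Intended cabling, as it might be derived from the integrator's netlist.
intended = {
    link(("C9_U", "P19"), ("M9-1_A7", "L7/B8/P1")),
    link(("C9_U", "P20"), ("M9-1_A7", "L7/B8/P2")),
}
# Discovered cabling: the second cable has been plugged into the wrong port.
discovered = {
    link(("C9_U", "P19"), ("M9-1_A7", "L7/B8/P1")),
    link(("C9_U", "P20"), ("M9-1_A7", "L7/B9/P2")),
}

missing, unexpected = diff_topology(intended, discovered)
for a, b in missing:
    print("missing:   ", a, "<->", b)
for a, b in unexpected:
    print("unexpected:", a, "<->", b)
```

A single swapped cable shows up as one missing and one unexpected link that share an endpoint, which is usually enough to tell the engineer which cable to move.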
Bringing it all together
Once you have your topology file and the IBNL files are in the correct directory of the OFED stack, you are ready to start the process of debugging the fabric. There are two basic tools that were used, as detailed below:
ibdiagnet is a tool that uses IB's in-band diagnostics functionality. It is run from any node connected to the fabric, provided that node has the OFED stack installed and both the IBNL files and the topology files are accessible. ibdiagnet scans the fabric using directed route packets and extracts all the available information regarding its connectivity and devices.
mpilinktest is an MPI application created by the Jülich Supercomputing Center, the host site for the JuRoPa cluster. It is a parallel ping-pong test between all connections of a machine. The output of this program is a full communication matrix which shows the bandwidth between each processor pair, and a report including the minimum bandwidth. The linktest runs for n processors in n steps, where in each step n/2 pairs of processors perform the MPI ping-pong test (3 iterations, 128 kB messages). The selection of the pairs is random, but after running all steps all possible pairs are covered.
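One way to obtain a schedule with these properties (disjoint pairs in every step, every pair covered once) is the classic round-robin "circle" method with a random initial ordering. The sketch below is that textbook construction, not the mpilinktest source. The function name `pairing_schedule` is hypothetical, and the method needs n-1 steps for even n rather than the n steps described above.

```python
import random
from itertools import combinations


def pairing_schedule(n, seed=None):
    """Yield steps of disjoint pairs that cover every pair of n ranks once."""
    ranks = list(range(n))
    random.Random(seed).shuffle(ranks)  # randomise who meets whom in which step
    if n % 2:
        ranks.append(None)              # odd n: one rank sits out each step
    m = len(ranks)
    for _ in range(m - 1):
        pairs = []
        for i in range(m // 2):
            a, b = ranks[i], ranks[m - 1 - i]
            if a is not None and b is not None:
                pairs.append((a, b))
        yield pairs
        # Keep the first entry fixed and rotate the others by one position.
        ranks = [ranks[0], ranks[-1]] + ranks[1:-1]


# Self-check for a small case: all pairs of 8 ranks appear exactly once.
steps = list(pairing_schedule(8, seed=1))
covered = sorted(tuple(sorted(p)) for step in steps for p in step)
print(len(steps), "steps;",
      "all pairs covered:", covered == list(combinations(range(8), 2)))
```

Because the pairs within a step are disjoint, all ping-pong exchanges of that step can run concurrently, which is what lets a test of this kind load many links at once.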
ibdiagnet is used to capture the symbol, receive link, link width (links not running at 4x) and topology errors. However, ibdiagnet is not good at generating the high volumes of traffic typically seen when the machine is in production. mpilinktest is the tool used to generate the traffic while the data is simultaneously collected using ibdiagnet.
Typical diagnostic output
Figure 10 below details the typical output given by ibdiagnet when appropriate IBNL and topology files are used.
Figure 10: ibdiagnet output given when using ibnl and topology files.
Taking the 2nd error as an example, the location string NEM8_A6_D/IS4B/U1/P22 is interpreted as follows:
- NEM8 is the name given to this QNEM unit.
- A6 is the location of the rack containing this module - it is given as Aisle A, location 6. See Figure 8.
- D is the vertical location of the module within the rack - running A through D, with A at the bottom. This unit is at the top of the rack.
- IS4B is the instance of the ASIC within that unit. Each QNEM has two ASICs, IS4A and IS4B.
- U1 should be ignored. P22 is the port instance which has the error on that particular ASIC. Each ASIC has 36 ports.
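The field breakdown above can be expressed as a small parser. This is a minimal sketch that assumes every location string has exactly the four-field QNEM shape of the example. The `PortLocation` type and `parse_location` function are hypothetical names, not part of the OFED tools.

```python
from typing import NamedTuple


class PortLocation(NamedTuple):
    unit: str   # name of the QNEM unit, e.g. NEM8
    rack: str   # rack location on the floor plan, e.g. A6
    slot: str   # vertical position in the rack, A (bottom) to D (top)
    asic: str   # ASIC instance within the unit, IS4A or IS4B
    port: int   # port number on that ASIC


def parse_location(text):
    """Split a string such as 'NEM8_A6_D/IS4B/U1/P22' into its fields."""
    system, asic, _ignored, port = text.split("/")  # the U1 field is ignored
    unit, rack, slot = system.split("_")
    if not port.startswith("P"):
        raise ValueError(f"unexpected port field: {port!r}")
    return PortLocation(unit, rack, slot, asic, int(port[1:]))


loc = parse_location("NEM8_A6_D/IS4B/U1/P22")
print(f"unit {loc.unit}, rack {loc.rack}, slot {loc.slot}, "
      f"ASIC {loc.asic}, port {loc.port}")
```

Strings for other component types, such as the M9 line cards, have a different layout and would need their own handling.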
For the above we note that although we can isolate the location of the QNEM module to a particular location in a particular rack, we cannot yet determine which particular physical port on the QNEM has the issue. To achieve this, we need to understand more about the internal layout of the QNEM module. Figure 11 details the internal connectivity of the Sun QNEM module.
Figure 11: QNEM internal connectivity.
Figure 11 suggests that IS4B Port 20 connects to the blade in Slot 8. As such, this particular link does not represent a cable, but rather an internal backplane link to a Sun Blade server. This is the end of the diagnostic process, as we have now identified a pair of components, one of which is faulty. IB errors are reported at the receiver; as such, for this particular error we know that the problem is either the blade connection to the backplane or the backplane connectivity to the QNEM. Indeed, it could be the backplane itself. Either way, the service engineer's task is now to swap each component individually with a known good component until the problem is resolved.
Summary - Evaluation of Goals
The first goal outlined earlier was to achieve in excess of 90% parallel efficiency across the whole cluster for HPL type benchmarks. The second goal was to be able to quickly identify where the problems are in the JuRoPa IB network.
Today, errors in the JuRoPa cluster are quickly locatable. The preparation detailed herein was invaluable in bringing JuRoPa to production. It's true to say that the identification of errors, whilst not an unskilled job, has been significantly simplified by the methods detailed herein.
During HPL (LinPack) benchmarking tests undertaken in early June '09, the JuRoPa machine achieved an impressive 274.8 trillion floating point operations per second, recorded over a sustained 11 hour period. This was achieved using 3221 compute nodes, each running 8 Nehalem cores. This LinPack performance constitutes 91.6% of peak - making JuRoPA the global leader in parallel efficiency for commodity supercomputer clusters.
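For reference, parallel efficiency here is the measured HPL result divided by the theoretical peak of the nodes used. The short sketch below only rearranges the two figures quoted above. The implied peak it prints is derived from them and is not a number stated by the source, and the function name is a hypothetical helper.

```python
def parallel_efficiency(rmax_tflops, rpeak_tflops):
    """Measured HPL performance as a fraction of theoretical peak."""
    return rmax_tflops / rpeak_tflops


rmax = 274.8        # Teraflops sustained in the HPL run (from the text)
reported = 0.916    # efficiency quoted in the text
nodes = 3221        # compute nodes used in the run (from the text)

# Derived values, not quoted in the article.
implied_peak = rmax / reported
per_node_gflops = implied_peak * 1000 / nodes

print(f"implied peak of the nodes used: {implied_peak:.1f} Teraflops")
print(f"implied peak per node: {per_node_gflops:.1f} Gigaflops")
print(f"meets the 90% goal: {parallel_efficiency(rmax, implied_peak) > 0.90}")
```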