So far, I have only mentioned the compute node components of a clustered HPC system, in order to point out how they differ from SMP systems. Several other components also affect the reliability of the overall system. We will mention them briefly here, with further explanation in subsequent posts.

It is appropriate to classify the components of a clustered HPC system into two categories, according to their bearing on overall system reliability. The first category contains those components whose failure has little effect on overall reliability. The compute nodes, as described in the previous section, fall into this category. In comparison to SMP systems, clustered compute nodes greatly increase reliability; but in comparison to the other clustered components, they need very little built-in fault tolerance for the entire system to be reliable. It is very common in the industry to have HPC systems with several failed compute nodes offline at any one time, waiting for repair. It is typical to see vendor service contracts with response times measured in days or weeks for compute nodes.

The second category includes those components that need greater reliability because, if they fail, they will have a detrimental effect on the entire system. These include the job scheduler, cluster interconnect, login nodes, clustered storage, and network infrastructure. If any of these components were to fail, system availability would be affected to a large degree. For each component in this category, we will spend more time and effort on fault-tolerance design. In fact, the design of these HPC components is much harder to complete than the design of the compute nodes.
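The difference between the two categories can be illustrated with a small sketch. This is only a toy model: the names `ClusterState`, `is_available`, and `usable_capacity`, and the node counts used below, are assumptions made for illustration and do not come from any real scheduler or vendor tool. It also assumes, as a simplification, that losing any one critical component takes the whole system down.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ClusterState:
    """Toy snapshot of a cluster (hypothetical structure, for illustration only)."""
    total_compute_nodes: int
    failed_compute_nodes: int
    # Second-category components: name -> True if currently up.
    critical_up: Dict[str, bool]


def is_available(state: ClusterState) -> bool:
    """The system is usable only if every critical component is up."""
    return all(state.critical_up.values())


def usable_capacity(state: ClusterState) -> float:
    """Fraction of compute capacity users can actually reach."""
    if not is_available(state):
        # A failed scheduler, interconnect, etc. makes healthy nodes unreachable.
        return 0.0
    healthy = state.total_compute_nodes - state.failed_compute_nodes
    return healthy / state.total_compute_nodes


critical = {
    "job_scheduler": True,
    "interconnect": True,
    "login_nodes": True,
    "storage": True,
    "network": True,
}

# Five failed compute nodes out of 100: capacity drops slightly.
degraded = ClusterState(100, 5, dict(critical))

# All compute nodes healthy, but the job scheduler is down: nothing runs.
outage = ClusterState(100, 0, {**critical, "job_scheduler": False})

print(usable_capacity(degraded))
print(usable_capacity(outage))
```

In this model a handful of failed compute nodes costs a few percent of capacity, while a single failed critical component costs all of it, which is why the second category gets most of the fault-tolerance effort.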
It is typical to see vendor service contracts with response times measured in hours for these critical components.

Next Up… POWER DISTRIBUTION

-- Blake Gonzales

See other posts from this blog series: INTRO, SMP, CLUSTERED SYSTEMS, POWER DISTRIBUTION