Architecting HPC Systems for Fault Tolerance and Reliability (Introduction to a multi-part series)

01/11/2010

The complex nature of HPC systems can at times have a detrimental effect on their ability to reliably complete the tasks at hand. At the same time, HPC systems are generally relied upon to perform many hundreds or thousands of independent jobs simultaneously. In many cases, the work to be performed by HPC systems is critical in nature. Because of this, reliability and fault tolerance are of utmost concern in HPC.

Shared-memory multiprocessor (SMP) systems are generally prone to system-wide failures caused by single errors in memory, CPU, or disk. Preventing single errors from causing outages in SMP solutions has always been a struggle. With the ubiquitous use of clustered HPC technology in the last decade, the risk of system-wide failures due to single points of failure can be minimized. To achieve this increased reliability, however, clustered solutions must be designed correctly.

There are many "moving parts," so to speak, in clustered solutions, so it is important to design each subsystem with an eye to how it relates to the other subsystems. Here I would like to explore the key hardware and software components that are likely to cause system-wide failures, and suggest architecture design techniques to prevent such failures.

-- Blake Gonzales

See other posts from this blog series:
INTRO | SMP | CLUSTERED SYSTEMS | CLUSTERED SYSTEM INFRASTRUCTURE | POWER DISTRIBUTION