This blog, written by Blake Gonzales, covers the design, administration, operation, and architecture of clusters from the perspective of someone with a background in commercial HPC.


Architecting HPC Systems for Fault Tolerance and Reliability (Introduction to a multi-part series)

01/11/2010

Blake Gonzales

The complex nature of HPC systems can at times have a detrimental effect on their ability to reliably complete the tasks at hand. At the same time, HPC systems are generally relied upon to run many hundreds or thousands of independent jobs simultaneously. In many cases, the work these systems perform is critical in nature. Because of this, reliability and fault tolerance are of utmost concern in HPC.

Shared-memory multiprocessor (SMP) systems are generally prone to system-wide failures caused by a single error in memory, CPU, or disk. Preventing single errors from causing outages in SMP solutions has always been a struggle. With the ubiquitous use of clustered HPC technology in the last decade, the risk of system-wide failures due to single points of failure can be minimized. To achieve that increased reliability, however, clustered solutions must be designed correctly.
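To put rough numbers on that contrast, here is a minimal back-of-the-envelope sketch in Python. It is only an illustration under stated assumptions: every availability figure is hypothetical, failures are assumed to be independent, and the function names are made up for this example. It compares a system in which every component must be up (the SMP-style case) with a cluster in which work can continue as long as at least one node and the shared infrastructure are up.

```python
# Illustrative reliability arithmetic. All numbers are hypothetical and
# failures are assumed to be independent of one another.

def series_availability(component_availability: float, n_components: int) -> float:
    """Availability of a system that is down if ANY one component is down."""
    return component_availability ** n_components


def all_nodes_down(node_availability: float, n_nodes: int) -> float:
    """Probability that every node in a cluster is down at the same time."""
    return (1.0 - node_availability) ** n_nodes


def cluster_availability(node_availability: float, n_nodes: int,
                         shared_availability: float = 1.0) -> float:
    """Probability that at least one node AND the shared infrastructure are up.

    shared_availability stands in for anything every node depends on
    (for example a single switch or a single power feed).
    """
    return shared_availability * (1.0 - all_nodes_down(node_availability, n_nodes))


if __name__ == "__main__":
    a = 0.999   # hypothetical availability of one component or node
    n = 64      # hypothetical number of components or nodes

    print("every component required:  ", series_availability(a, n))
    print("cluster, no shared SPOF:   ", cluster_availability(a, n))
    print("cluster, shared part 0.99: ", cluster_availability(a, n, 0.99))
```

The last line is the point of the exercise: once every node depends on a single shared component, the cluster as a whole can be no more available than that component, which is why the design of the surrounding infrastructure matters as much as the nodes themselves.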

There are many “moving parts,” so to speak, in clustered solutions, so it is important to design each subsystem with an eye to how it relates to the other subsystems. In this series I would like to explore the key hardware and software components that are likely to cause system-wide failures, and suggest architecture design techniques to prevent such failures.

-- Blake Gonzales

See other posts from this Blog series:
INTRO
SMP
CLUSTERED SYSTEMS
CLUSTERED SYSTEM INFRASTRUCTURE
POWER DISTRIBUTION



