The real work of an HPC cluster happens on the compute nodes. Most of the other subsystems in a cluster are there to support the computational needs of the compute nodes.

It might be a little surprising, then, to find out that the compute nodes generally require very little in the way of fault-tolerant subsystems such as dual power supplies, multiple gateways to the network or storage, or mirrored boot disks. This is because the cluster as a whole can easily tolerate the occasional node failure. In very large HPC clusters it is common to have several compute nodes inoperable at any given time.

When a single compute node fails, typically only one job, or a subset of jobs, on the system is affected. All other jobs continue to run without incident. The affected job can then be restarted on another set of compute nodes while the failed node is repaired. It is rarely worth the expense to build redundancy into the compute node infrastructure. Most vendors that offer compute nodes for the HPC industry sell stripped-down nodes that have little more than CPU, memory, and network interfaces.

It is somewhat ironic that the subsystems that are the workhorses of an HPC cluster are the same subsystems that require the least built-in fault tolerance. The cluster's high tolerance for an occasional failed job makes this possible.

Next up: BOOT NODES

-- Blake Gonzales

See other posts from this blog series:
INTRO
SMP
CLUSTERED SYSTEMS
CLUSTERED SYSTEM INFRASTRUCTURE
POWER DISTRIBUTION
COOLING
MEMORY
LOGIN/HEAD NODES