Blake Gonzales

The real work of an HPC cluster happens on the compute nodes. Most of the other subsystems in a cluster are there to support the computational needs of the compute nodes.

It might be a little surprising, then, to find that compute nodes generally require very little in the way of fault-tolerant subsystems: dual power supplies, multiple gateways to the network or storage, mirrored boot disks, and so on. This is because the cluster as a whole can easily tolerate the occasional node failure. In very large HPC clusters it is common to have several compute nodes inoperable at any given time.
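To see why several down nodes at once is normal at scale, a rough back-of-the-envelope estimate helps. The MTBF and repair-time figures below are illustrative assumptions, not vendor data:

```python
# Sketch: estimate how many compute nodes are down at any given moment,
# assuming independent node failures. Each node is unavailable for
# MTTR hours out of every (MTBF + MTTR) hours of operation.

def expected_down_nodes(node_count: int, mtbf_hours: float, mttr_hours: float) -> float:
    """Expected number of simultaneously inoperable nodes."""
    unavailability = mttr_hours / (mtbf_hours + mttr_hours)
    return node_count * unavailability

# A hypothetical 5,000-node cluster, 40,000-hour per-node MTBF,
# and a 24-hour repair turnaround:
print(expected_down_nodes(5000, 40_000.0, 24.0))  # ~3 nodes down on average
```

Even with fairly reliable hardware, the sheer node count means a handful of failures in progress is the steady state, which is exactly why per-node redundancy buys so little.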

When a single compute node fails, typically only one job, or a small subset of jobs, on the system is affected. All other jobs continue to run without incident. The affected job can then be restarted on another set of compute nodes while the failed node is repaired. It is rarely worth the expense to build redundancy into the compute node infrastructure, and most vendors that offer compute nodes for the HPC market sell stripped-down nodes with little more than CPU, memory, and network interfaces.
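The recovery flow described above can be sketched in a few lines. This is a toy model, not a real scheduler API; the `Job` class and `requeue_failed` helper are hypothetical names chosen for illustration:

```python
# Sketch of failure isolation: when a node dies, only jobs placed on it
# are marked for restart; all other jobs keep running untouched.

from dataclasses import dataclass, field

@dataclass
class Job:
    name: str
    nodes: set = field(default_factory=set)  # compute nodes the job runs on
    state: str = "RUNNING"

def requeue_failed(jobs, failed_node):
    """Requeue jobs touching the failed node; leave the rest alone."""
    affected = []
    for job in jobs:
        if failed_node in job.nodes:
            job.state = "PENDING"   # will be resubmitted on healthy nodes
            job.nodes = set()
            affected.append(job)
    return affected

jobs = [Job("cfd", {"n001", "n002"}), Job("md", {"n003"}), Job("qcd", {"n004"})]
hit = requeue_failed(jobs, "n003")
# Only "md" is requeued; "cfd" and "qcd" continue running.
```

Real resource managers implement essentially this bookkeeping (plus checkpointing, draining, and health checks), which is what lets the cluster shrug off a dead node.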

It is somewhat ironic that the subsystems that are the workhorses of an HPC cluster are the same subsystems that require the least built-in fault tolerance. The cluster's tolerance of an occasional failed job is what makes this possible.

Next UP… BOOT NODES


-- Blake Gonzales

See other posts from this Blog series:
INTRO
SMP
CLUSTERED SYSTEMS
CLUSTERED SYSTEM INFRASTRUCTURE
POWER DISTRIBUTION
COOLING
MEMORY
LOGIN/HEAD NODES