Blake Gonzales

Up until the last decade, the typical infrastructure used in High Performance Computing was the shared-memory multiprocessor (SMP) system. In these systems, memory and processors are shared within a single infrastructure and managed by a single operating system image. This type of design is very convenient from a manageability perspective, as the number of hardware and operating system components to manage is very low compared to clustered systems.

On the other hand, application performance can easily suffer on SMP systems. Once an application is in the run state, it will likely compete with other processes for resources such as memory, processors, or I/O bandwidth. Proper system tuning, along with techniques such as cache coherency, allowed SMP systems to flourish in the HPC industry, and these measures were sufficient for their time.

One of the major issues with SMP systems is their inability to continue operating in the event of a system fault. A single uncorrectable error in one of the key subsystems will generally cause a system crash. When this occurs, all running applications are terminated, the operating system panics, and the entire system must be restarted. This is far from an optimal recovery scenario, as it impacts every job on these typically very large systems. The following are examples of individual events that can cause a panic and a subsequent system crash and restart on SMP systems: an uncorrectable memory bit error, a processor cache bit error, a single processor failure, a bus error, or an operating system disk failure.

As a response to the poor fault tolerance of SMP systems, several technologies (such as error-correcting memory and RAID disk subsystems) were invented to mitigate these single errors, but the fact remains that a single error could still bring down the entire system. One of the most interesting technologies invented, although not very widely used, was kernel-level checkpointing of the entire system state. Kernel-level checkpointing allows a failed SMP system to recover to a known state after a system crash. After a system restart, applications and other system processes continue to run from where they left off before the crash. Even so, users of an SMP system would still have to wait until the system was repaired before they could continue their work.
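To make the checkpoint-and-restart idea concrete, here is a minimal sketch in Python. Note that this is an application-level analogue, not kernel-level checkpointing: a real kernel-level implementation captures the state of the whole system, while this toy example only saves the state of one loop. The function names (`save_checkpoint`, `load_checkpoint`, `run`), the file format, and the `crash_at` parameter used to simulate a fault are all hypothetical and chosen purely for illustration.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file, then atomically rename it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # make sure the bytes reach the disk
    os.replace(tmp, path)     # readers never see a half-written checkpoint


def load_checkpoint(path):
    """Return the last saved state, or a fresh state if none exists."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {"step": 0, "total": 0}


def run(path, n_steps, every=10, crash_at=None):
    """Sum the integers 0..n_steps-1, checkpointing every `every` steps.

    crash_at (hypothetical test hook) simulates a system fault by
    raising an exception when that step is reached.
    """
    state = load_checkpoint(path)  # resume from the known state, if any
    while state["step"] < n_steps:
        if crash_at is not None and state["step"] == crash_at:
            raise RuntimeError("simulated system fault")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % every == 0:
            save_checkpoint(path, state)
    return state["total"]
```

If the first call to `run` fails partway through, a second call with the same checkpoint path picks up from the last saved step rather than from zero. Only the work done since the most recent checkpoint is lost, which is the same trade-off described above: the job survives the crash, but nothing runs until the system is back up.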


-- Blake Gonzales

See other posts from this Blog series: