Within the last decade, clustered architectures have become the predominant design for new HPC systems. This has been in response to several factors, including the ability to utilize commodity hardware. SMP systems were mostly proprietary architectures that changed from vendor to vendor. There are many reasons for the migration to clustered systems, but we will mainly address the fault tolerance aspects here.

Clustered systems utilize system components in a parallel fashion, and these components are generally independent of each other. If a single component fails, it generally will not cause the entire system to crash, but rather will cause a specific subset of processes to abort. This leaves the rest of the clustered system intact to continue operation.

The typical compute component in a clustered system is a single running node that contains its own operating system image, memory, processors, and local storage. There are typically hundreds to thousands of these independent nodes in a clustered system. In essence, we ultimately have a large cluster of small SMP systems.

In this architecture, a single job is run in parallel across a subset of the nodes in a system. Many of these parallel jobs are generally run simultaneously across the entire system. For example, on a small 9-node system we might have the following job mix:

· Job A running on nodes 1, 4, 7
· Job B running on nodes 2, 5, 8
· Job C running on nodes 3, 6, 9

Let’s consider what would happen if node 4 has a hardware fault and crashes. In this example, only job A would terminate, while jobs B and C would continue execution. This is a substantial improvement over SMP systems, where all jobs would terminate in the event of a hardware fault.

The example given above is one in which each job has exclusive access to the nodes allocated to it. This is typically done so that multiple processes will not compete for resources and thereby reduce efficiency.
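The 9-node example above can be sketched in a few lines of Python. This is only an illustration, not how any real scheduler tracks jobs: the `allocation` dictionary and the `jobs_terminated` and `jobs_surviving` helpers are hypothetical names made up for this sketch.

```python
# Hypothetical job-to-node allocation, taken from the 9-node example above.
allocation = {
    "A": {1, 4, 7},
    "B": {2, 5, 8},
    "C": {3, 6, 9},
}


def jobs_terminated(allocation, failed_node):
    """Return the names of jobs that have a process on the failed node."""
    return sorted(job for job, nodes in allocation.items() if failed_node in nodes)


def jobs_surviving(allocation, failed_node):
    """Return the names of jobs that do not touch the failed node."""
    return sorted(job for job, nodes in allocation.items() if failed_node not in nodes)


# Node 4 crashes: only job A is lost, jobs B and C keep running.
print(jobs_terminated(allocation, 4))  # ['A']
print(jobs_surviving(allocation, 4))   # ['B', 'C']
```

Because no node appears in more than one job's set, any single node failure here can match at most one job.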
Exclusive access also has a beneficial effect on fault tolerance for individual jobs. Consider the next example, a 7-node system in which jobs share nodes:

· Job A running on nodes 1, 2, 3
· Job B running on nodes 3, 4, 5
· Job C running on nodes 5, 6, 7

In this example, a failure in either node 3 or node 5 will terminate two jobs. If each job has mutually exclusive access to its own nodes and a single node fails, at most one job will be terminated as a result.

Next Up... CLUSTERED SYSTEM INFRASTRUCTURE

-- Blake Gonzales

See other posts from this Blog series:
INTRO
SMP
CLUSTERED SYSTEMS
CLUSTERED SYSTEM INFRASTRUCTURE
POWER DISTRIBUTION