Written by Blake Gonzales, this blog covers issues around designing, administering, running, and architecting clusters from a user perspective.


Start Blog Entry
Multicore in HPC - Where will we stand in 10 years?

Following an interesting article in IEEE Spectrum, Dell's Blake Gonzales wrote a blog post discussing the future of multicore in HPC and then posed some insightful questions to the High Performance Computing group on LinkedIn. This post shares some of the highlights of that LinkedIn discussion.

Link to blog post


Start Blog Entry

HPC Design - Will You be Needing any Processors with your Cluster Today?


There was a recent article in IT Business Edge that asks, “Do Processors Really Matter Anymore?” The author is commenting on an InformationWeek article which suggests that the relative value we previously placed on processors in the enterprise has started to diminish, because the value proposition for solutions that incorporate virtualization, system management, and power efficiency has changed the paradigm. I tend to agree ...

Link to blog post


Start Blog Entry

Architecting HPC Systems for Fault Tolerance and Reliability: Part 11 - File Systems

The file systems in an HPC cluster provide data storage to individual nodes and to entire subsets of nodes. In the simplest form, a single node has its own local disk with one or more file systems residing on that disk. In the global case, the entire system can have access to a global or clustered file system.
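As an illustrative sketch (not from the original post), the local-versus-global distinction shows up in a node's mount table: local file systems are visible only to that node, while network and clustered file systems are shared. The filesystem-type lists below are assumptions for illustration, not an exhaustive taxonomy:

```python
# Sketch: classify /proc/mounts-style entries by storage scope.
# The filesystem-type sets are illustrative assumptions, not exhaustive.
SHARED_FS = {"nfs", "nfs4", "lustre", "gpfs", "panfs", "glusterfs"}
LOCAL_FS = {"ext3", "ext4", "xfs", "btrfs"}

def classify_mounts(mount_lines):
    """Group mount-table lines into node-local vs. cluster-wide storage."""
    scopes = {"local": [], "shared": [], "other": []}
    for line in mount_lines:
        device, mountpoint, fstype = line.split()[:3]
        if fstype in SHARED_FS:
            scopes["shared"].append(mountpoint)
        elif fstype in LOCAL_FS:
            scopes["local"].append(mountpoint)
        else:
            scopes["other"].append(mountpoint)
    return scopes

# Hypothetical mount entries: one local root disk, two shared file systems.
sample = [
    "/dev/sda1 / ext4 rw 0 0",
    "fs-server:/home /home nfs rw 0 0",
    "mgs@o2ib:/scratch /scratch lustre rw 0 0",
]
print(classify_mounts(sample))
```

On a real node the same classification could be run over the contents of /proc/mounts.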

Link to blog post


Start Blog Entry


Architecting HPC Systems for Fault Tolerance and Reliability: Part 10 - Job Scheduling

Clustered HPC systems generally leverage job scheduling algorithms and routines to maximize the efficiency and utilization of the system. Unlike shared memory systems, where all jobs are generally in memory and running simultaneously, clustered systems will usually allocate one or more nodes for exclusive use by a particular job.
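To make the exclusive-allocation idea concrete, here is a minimal sketch (not the algorithm from the post) of a FIFO scheduler that hands whole nodes to each job. All names are illustrative; production schedulers add priorities, backfill, and fault handling:

```python
# Sketch: FIFO scheduler that allocates whole nodes exclusively to each job,
# as clustered systems commonly do. No backfill: a job that does not fit
# blocks the jobs behind it.
from collections import deque

def schedule(free_nodes, job_queue):
    """Assign whole nodes to queued jobs in FIFO order.

    free_nodes: list of node names.
    job_queue: deque of (job_id, nodes_needed); satisfied jobs are popped.
    Returns {job_id: [allocated nodes]}.
    """
    allocations = {}
    free = list(free_nodes)
    while job_queue and len(free) >= job_queue[0][1]:
        job_id, needed = job_queue.popleft()
        allocations[job_id] = [free.pop() for _ in range(needed)]
    return allocations

queue = deque([("job-a", 2), ("job-b", 3), ("job-c", 1)])
print(schedule(["n1", "n2", "n3", "n4"], queue))
```

With four free nodes, job-a gets its two nodes; job-b needs three but only two remain, so it and job-c stay queued, illustrating why real schedulers add backfill.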

Link to blog post

Start Blog Entry

Architecting HPC Systems for Fault Tolerance and Reliability: Part 9 - Boot Nodes

Cluster implementations vary in the way the individual compute nodes boot. In the simplest case, the compute nodes each have their own local boot disk, with a bootstrap and operating system. In other implementations, the compute nodes have no local storage and they boot over an internal network, pointing to a common boot image stored on a boot node. There is also a hybrid implementation where compute nodes initially boot from a common boot image, and then finish booting on their own locally stored operating system. These latter two cases, where a common boot node is used, require a degree of fault tolerance in their design. Read more ...
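As one hypothetical illustration of the diskless case, a PXE boot menu served from the boot node might point every compute node at a shared network root image. The kernel name, server address, and paths below are purely illustrative, not from the original post:

```
# pxelinux.cfg/default on the boot node (illustrative values only)
DEFAULT compute
LABEL compute
  KERNEL vmlinuz
  APPEND initrd=initrd.img root=/dev/nfs nfsroot=10.0.0.1:/images/compute ip=dhcp ro
```

Because every compute node depends on this single image and server at boot time, the boot node itself becomes the component that needs fault-tolerant design.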

Link to blog post


Start Blog Entry
Architecting HPC Systems for Fault Tolerance and Reliability: Part 8 - Compute Nodes

It might be a little surprising to find out that system compute nodes generally require very little in the way of fault tolerant subsystems. This includes subsystems such as dual power supplies, multiple gateways to the network or storage, mirrored boot disks, etc. This is because the entire cluster can easily tolerate the occasional node failure. Read more ...

Link to blog post

Start Blog Entry
Architecting HPC Systems for Fault Tolerance and Reliability: Part 7 - Login/Head Nodes

While login and head nodes are typically not critical to the continued operation of a cluster, this post reviews these system functions and the techniques used to improve their reliability and availability.

Link to blog post

Start Blog Entry

Architecting HPC Systems for Fault Tolerance and Reliability: Part 6 - Memory

As discussed earlier, a failed memory bit in both SMP and clustered systems can cause data integrity issues or even an entire system crash. Error-correcting code (ECC) memory was introduced to mitigate memory bit failures by actually detecting and correcting them.
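The detect-and-correct principle can be sketched with a Hamming(7,4) code, the textbook ancestor of the wider SECDED codes real ECC DIMMs use (e.g. 72 bits stored per 64 data bits). This is an illustrative sketch, not the post's material:

```python
# Sketch: Hamming(7,4) single-bit error correction. Parity bits sit at
# positions 1, 2, and 4; a nonzero syndrome gives the faulty bit's position.

def encode(data4):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit codeword."""
    d1, d2, d3, d4 = data4
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p3, d2, d3, d4]  # positions 1..7

def correct(code7):
    """Recompute parity; fix at most one flipped bit; return the data bits."""
    c = code7[:]
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]  # checks positions 1, 3, 5, 7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]  # checks positions 2, 3, 6, 7
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]  # checks positions 4, 5, 6, 7
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        c[syndrome - 1] ^= 1  # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]]

word = encode([1, 0, 1, 1])
word[4] ^= 1                 # simulate a single-bit memory fault
print(correct(word))         # → [1, 0, 1, 1]
```

A second simultaneous bit flip would defeat this code, which is why memory-grade SECDED codes add an extra parity bit to at least detect double-bit errors.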

Link to blog post

Start Blog Entry
Architecting HPC Systems for Fault Tolerance and Reliability: Part 5 - Cooling

The major power consumers in HPC environments are your well-known system components such as processors, memory, and to some degree disk storage and interconnect hardware. The ratio of compute nodes to other components is usually high, so you will likely find that most of your power is consumed by your computational resources.

Link to blog post


Start Blog Entry
Architecting HPC Systems for Fault Tolerance and Reliability: Part 4 - Power Distribution

We will now discuss some of the key components that need high reliability in HPC systems. Let’s start with the distribution of power to HPC systems and subsequently to the individual components. HPC systems by nature require large quantities of power. The details of power generation and distribution to a data center are not covered here; we start with a look at power once it arrives at a data center.

Link to blog post


Start Blog Entry
Architecting HPC Systems for Fault Tolerance and Reliability: Part 3 - Clustered System Infrastructure

There are several cluster components of note that impact the reliability of the overall system. This blog will classify the components of a clustered HPC system into two categories, which indicate their bearing on overall system reliability.

Link to blog post


Start Blog Entry
Architecting HPC Systems for Fault Tolerance and Reliability: Part 2 - Clustered Systems

Within the last decade, clustered architectures have become the predominant design for new HPC systems. This has been in response to several factors, including the ability to utilize commodity hardware. SMP systems were mostly proprietary architectures that varied from vendor to vendor. There are many reasons for the migration to clustered systems, but we will address only the fault tolerance aspects here.

Link to blog post


Start Blog Entry

Architecting HPC Systems for Fault Tolerance and Reliability: Part 1 - Symmetric Multiprocessor Systems

A review of SMP systems, some of the common causes of HPC system failures, and some of the tools for dealing with these issues. The first blog post in the series from Blake Gonzales.

Link to blog post


Start Blog Entry

Architecting HPC Systems for Fault Tolerance and Reliability (Introduction to a multi-part series)

This blog post by Dell HPC computer scientist Blake Gonzales is the first in a series introducing some of the main hardware and software causes of system-wide failures. The series will also suggest architectural design techniques to prevent these failures.

Link to blog post


End Blog Entry