The file systems in an HPC cluster provide data storage to individual nodes and to entire subsets of nodes. In the simplest form, a single node has its own local disk with one or more file systems residing on that disk. At the other end of the spectrum, the entire system can share access to a global or clustered file system.

In either case, it is important to choose the correct file system with the appropriate journaling options. Most modern file systems (e.g., UFS, ext3, XFS) provide an option to enable journaling. Journaling is a technique for quickly bringing a file system back to a known good state after a system failure. Data bound for disk is logged to the journal before it is committed, much as a modern database logs transactions before updating its data store. If an outage abruptly halts disk I/O, the file system is left with uncommitted transactions sitting in the journal. The next time the file system is mounted, likely at system boot, those transactions are committed and the file system is back in a good state. This generally takes just a few seconds, unlike a complete file system check (fsck), which can take hours.
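The ordering is the key point: the journal entry must be durable before the data is committed. Below is a minimal Python sketch of that write-ahead idea, assuming a toy single-file data store. The file names and helpers (journal_write, replay_journal) are hypothetical stand-ins for machinery a real file system implements in the kernel.

```python
import json
import os

JOURNAL_PATH = "fs.journal"   # hypothetical journal file
DATA_PATH = "fs.data"         # hypothetical data store

def journal_write(offset, payload):
    """Durably log the intended write before it touches the data store."""
    with open(JOURNAL_PATH, "a") as j:
        j.write(json.dumps({"offset": offset, "payload": payload}) + "\n")
        j.flush()
        os.fsync(j.fileno())  # the journal entry must hit disk first

def apply_write(offset, payload):
    """Commit a logged write to the data store."""
    mode = "r+b" if os.path.exists(DATA_PATH) else "w+b"
    with open(DATA_PATH, mode) as d:
        d.seek(offset)
        d.write(payload.encode())

def replay_journal():
    """At mount time, re-apply any entries still in the journal, then clear it.

    Re-applying an already-committed write is harmless because these writes
    are idempotent; replay takes seconds, unlike a full fsck.
    """
    if not os.path.exists(JOURNAL_PATH):
        return
    with open(JOURNAL_PATH) as j:
        for line in j:
            entry = json.loads(line)
            apply_write(entry["offset"], entry["payload"])
    open(JOURNAL_PATH, "w").close()  # every entry committed; empty the journal

# Normal operation: log first, commit second. After a crash between the
# two steps, replay_journal() finishes the interrupted write at next mount.
journal_write(0, "hello")
apply_write(0, "hello")
replay_journal()
```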

Global file systems, whether clustered or network attached, have become common in HPC systems. Clustered file systems typically deliver very good I/O performance because they spread work across multiple metadata and object storage nodes. The same multi-node configuration also provides good fault tolerance in the event of a node failure. GPFS, for example, is built so that metadata nodes store redundant information and can take on additional processing if a metadata server fails; the same is true of other clustered file systems such as Lustre.

Consider a clustered file system with two metadata nodes, each providing full file system access to clients. If one node fails while the other stays up, two things happen. First, clients connected to the surviving node continue processing as normal. Second, clients connected to the failed node see only a short I/O pause while they are redirected to the good metadata node.
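A client-side view of that failover might look like the following Python sketch. The server names, port, and wire protocol are all assumptions for illustration; real clustered file system clients (GPFS, Lustre) handle redirection internally, but the retry loop conveys why the interruption is only a brief pause.

```python
import socket

# Hypothetical metadata server endpoints for a two-node configuration.
METADATA_SERVERS = [("mds1.cluster", 6880), ("mds2.cluster", 6880)]

def metadata_request(payload: bytes, timeout: float = 5.0) -> bytes:
    """Try each metadata server in turn; fail over on connection errors.

    A client attached to the failed server sees only a short I/O pause
    while its request is redirected to the surviving server.
    """
    last_error = None
    for host, port in METADATA_SERVERS:
        try:
            with socket.create_connection((host, port), timeout=timeout) as conn:
                conn.sendall(payload)
                return conn.recv(4096)
        except OSError as err:   # refused, timed out, host unreachable
            last_error = err
            continue             # fail over to the next metadata server
    raise RuntimeError(f"all metadata servers unreachable: {last_error}")
```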

There should also be redundant paths from the metadata and data storage nodes to the underlying storage, which protects the storage subsystem from an outage caused by a path or channel failure. The final consideration is protecting against the failure of one or more disks in the underlying storage subsystem. RAID implementations tolerate these failures while keeping the storage available for use.
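As a concrete illustration of why RAID survives a disk loss, here is a short Python sketch of RAID-5 style XOR parity. Real arrays add striping, parity rotation, and rebuilds onto hot spares, none of which is modeled here.

```python
def parity(blocks: list[bytes]) -> bytes:
    """XOR equal-sized blocks together to form a parity block."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)

# Three "disks" worth of data plus one parity disk.
disks = [b"AAAA", b"BBBB", b"CCCC"]
p = parity(disks)

# Simulate losing disk 1: XOR the survivors with parity to rebuild it.
survivors = [disks[0], disks[2], p]
rebuilt = parity(survivors)
assert rebuilt == disks[1]   # the lost block is fully recovered
print(rebuilt)               # b'BBBB'
```

Because XOR is its own inverse, any single missing block can be recomputed from the remaining blocks and the parity, which is exactly how the array stays available while a failed disk is replaced.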

NEXT UP… DISK CLONING

-- Blake Gonzales

See other posts from this Blog series:
INTRO
SMP
CLUSTERED SYSTEMS
CLUSTERED SYSTEM INFRASTRUCTURE
POWER DISTRIBUTION
COOLING
MEMORY
LOGIN/HEAD NODES
COMPUTE NODES
BOOT NODES
JOB SCHEDULING