The Dell NFS Storage Solution (NSS) now comes in two flavors – with and without High Availability. This blog discusses the High Availability Solution introduced in April 2011.

The requirement of 24/7 system uptime and continuous availability of user data puts intense pressure on every storage system, because component failures and faults cannot be completely avoided. Dell has developed an NFS Storage Solution with a high availability option (referred to as NSS-HA configurations in the rest of this article) to meet this challenge. The purpose of this solution is to improve the service availability and data integrity of NSS configurations in the presence of possible failures or faults, and to minimize the performance loss in the failure-free case. For example, when a cluster component fails, the impact on cluster clients is minimized and both system service availability and user data integrity are maintained.

This article introduces the concept of high availability in this solution and briefly describes the components and the performance of this solution.

In the NSS and NSS-HA use cases, the clients of an HPC compute cluster are connected to the storage sub-system via a public network. This network can be either InfiniBand (IB) or 10 Gigabit Ethernet (10GbE). The NFS servers are connected to the public network and provide access to the Dell PowerVault storage arrays used as the backend.

HA concept in the NSS-HA solution

The high availability feature of the NSS-HA solution is built on three levels of concepts, as shown in Figure 1.

Figure 1. The three levels of concepts of High Availability in the NSS-HA solution

From top to bottom, each higher-level concept is built upon the one below it:

  • An active/passive mode is adopted to construct an NFS server cluster in the solution. This enables server failover, which is the foundation of the HA feature;
  • Redundant hardware components are required to implement the active/passive mode, as the nodes in the active/passive mode should be identical;
  • Several fault-tolerance mechanisms, such as RAID6, HA-LVM, redundant power supplies, and dual data paths to the storage, are deployed to monitor or recover from faults or failures at both the hardware and software levels.

The general architecture of NSS-HA

The NSS-HA solution is based on the Dell NSS. The difference between them is that the NSS-HA focuses mainly on providing a highly available storage service, while the NSS provides an easy-to-manage, reliable, and cost-effective solution for unstructured data, designed with high performance in mind.

Figure 2 and Figure 3 show the infrastructure for NSS and NSS-HA, respectively. The NSS-HA infrastructure is clearly more complicated than that of NSS, as extra equipment, such as an additional NFS server, one private network switch, and two power distribution units (PDUs), is required to enhance the whole system’s reliability and availability.

Figure 3. NSS-HA infrastructure

There is no free lunch: NSS-HA achieves high availability at the cost of hardware component redundancy. The key software component of NSS-HA is Red Hat Cluster Suite 5.5, which constructs an “HA cluster” from those redundant components. The HA cluster consists of two NFS servers and an HA service that runs on one of the servers at a time. If a failure occurs, the HA service fails over to the other NFS server, keeping the process as transparent as possible to the clients of the cluster. To ensure data integrity, the HA service must run on only one cluster server at any given time.
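The active/passive behavior described above can be illustrated with a small toy model. This is only a sketch of the idea (one HA service, two servers, at most one owner at any time); it does not reflect how Red Hat Cluster Suite is implemented or configured, and the class, method, and server names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """One NFS server in the two-node cluster (names are hypothetical)."""
    name: str
    healthy: bool = True


class ActivePassiveCluster:
    """Toy model: a single HA service owned by at most one node at a time."""

    def __init__(self, primary: Node, secondary: Node) -> None:
        self.nodes: List[Node] = [primary, secondary]
        self.owner: Optional[Node] = primary  # the service starts on the primary

    def report_failure(self, node_name: str) -> None:
        """Mark a node as failed and relocate the service if that node owned it."""
        for node in self.nodes:
            if node.name == node_name:
                node.healthy = False
        if self.owner is not None and not self.owner.healthy:
            self._fail_over()

    def _fail_over(self) -> None:
        # Release the service first, then start it on a healthy node,
        # so the model never has the service on two servers at once.
        self.owner = None
        for node in self.nodes:
            if node.healthy:
                self.owner = node
                return


cluster = ActivePassiveCluster(Node("nfs-server-1"), Node("nfs-server-2"))
cluster.report_failure("nfs-server-1")
print(cluster.owner.name)  # prints "nfs-server-2"
```

If both servers fail, `owner` becomes `None`: the model has no third node to fall back on, which mirrors the two-server design described here.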

For detailed information about the construction and configuration of the NSS-HA, please refer to the NSS-HA white paper.

Potential Failures

The NSS-HA includes hardware and software components to build the HA functionality. The goal is to be resilient against several types of failures and transparently migrate the cluster service from one server to the other.
Table 1 lists the types of failures and indicates whether the NSS-HA can handle each of them.

Table 1. Types of failures

Failure type                                                  Can be handled?
Hardware failures in a single server                          Yes
OS level failures in a single server                          Yes
Application failures in a single server                       Yes
Public switch failure                                         No
Private switch failure                                        Yes
RAID controller failure on the attached PowerVault storage    Yes
SAS link failure from server to the storage                   Yes

Simple performance evaluation: HA vs. HA disabled

With high availability as a goal, it is useful to understand how the HA option affects system I/O performance. Does the HA option incur extra overhead that degrades I/O performance? The HA option is expected to have only a marginal impact, because the HA software components consume little I/O and CPU resource. Figures 4 and 5 compare NSS and NSS-HA over the two network connections, IB and 10GbE, respectively. The results match our expectation: when there are no failures, the I/O performance with the HA option is almost identical to the performance without it. For these tests, the benchmark tool iozone was used. The tool was configured to generate a number of threads from different client nodes to write or read different files. For example, in the 4-client write case, one thread is generated on each of the four clients and each thread writes a separate file.
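To make the test layout concrete, the sketch below builds an iozone client list and command line for the 4-client write case using iozone's cluster (throughput) mode. The article does not state which options were actually used, so the file size, record size, host names, mount point, and iozone path are all assumptions chosen for illustration; only the one-thread-per-client layout comes from the text.

```python
from typing import List


def client_list_lines(clients: List[str], work_dir: str, iozone_path: str) -> List[str]:
    """One line per client: host name, working directory, path to iozone on that client."""
    return [f"{host} {work_dir} {iozone_path}" for host in clients]


def iozone_command(num_threads: int, client_file: str, write_test: bool,
                   file_size: str = "8g", record_size: str = "1024k") -> List[str]:
    """Build an iozone throughput-mode command for a sequential write or read test.

    file_size and record_size defaults are assumed values, not the authors' settings.
    """
    test_id = "0" if write_test else "1"  # -i 0: write/rewrite, -i 1: read/reread
    return [
        "iozone",
        "-i", test_id,
        "-t", str(num_threads),  # number of threads, one per line of the client file
        "-+m", client_file,      # file that lists the client machines
        "-s", file_size,         # size of the file each thread works on
        "-r", record_size,       # record size of each I/O request
        "-c", "-e",              # include close() and flush in the timing
        "-w",                    # keep the files so a later read test can reuse them
    ]


# Hypothetical client names and paths for the 4-client write case.
clients = ["client1", "client2", "client3", "client4"]
for line in client_list_lines(clients, "/mnt/nss", "/usr/bin/iozone"):
    print(line)
print(" ".join(iozone_command(len(clients), "clients.txt", write_test=True)))
```

Listing each client once and setting the thread count to the number of clients gives one thread per client, each writing its own file, which matches the layout described above.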

More comprehensive results can be found in the NSS-HA white paper.

Figure 4.

Figure 5.

By Xin Chen and Garima Kochhar