The HPC Engineering team at Dell has been focused on NFS with High Availability for some time, particularly as file systems have gotten so much larger. But recently we happened to run some tests on plain (non-HA) NFS and were blown away by the possibilities.
Using our standard NFS best practices, our configuration consisted of one NFS server and four direct attached storage arrays. Previous studies have shown that two RAID controller cards give better performance than a single card in such a configuration, so that’s what we used as well. The backend storage was four Dell PowerVault MD1200 arrays with 3TB 7.2K rpm NL-SAS drives, which were formatted as two RAID 60 sets, combined as a Linux logical volume and then formatted as an XFS file system. The figure below shows our test setup as well as the file system layout.
This file system was exported via NFS v3. The compute clients (in our case, a 64-node compute cluster) accessed the file system over InfiniBand using the IPoIB protocol.
We ran two sets of tests: one with the NFS export option ‘sync’ and the second with ‘async’. ‘Async’ implies that the NFS server can acknowledge writes before any changes made by that request have been committed to disk. This option usually improves performance, but at the cost that an unclean server restart (i.e. a crash) can cause data to be lost or corrupted. ‘Sync’ is recommended when reliability is paramount and ‘async’ when pure performance is the goal.
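To make the one-option difference concrete, here is a minimal sketch that renders the two `/etc/exports` entries. The export path and client network are hypothetical placeholders, not the values from our test bed, and `rw` is assumed as the only other option.

```python
# Sketch only: the path and client network below are hypothetical placeholders.
def exports_line(path: str, clients: str, mode: str) -> str:
    """Render one /etc/exports entry using either 'sync' or 'async'."""
    if mode not in ("sync", "async"):
        raise ValueError("mode must be 'sync' or 'async'")
    # /etc/exports format: <path> <client-spec>(<comma-separated options>)
    return f"{path} {clients}({'rw'},{mode})"


sync_line = exports_line("/mnt/nfs_export", "192.168.0.0/24", "sync")
async_line = exports_line("/mnt/nfs_export", "192.168.0.0/24", "async")

print(sync_line)
print(async_line)
```

After editing `/etc/exports`, running `exportfs -ra` on the server re-exports the file systems with the new option.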
The graphs below show sync and async results for sequential and random I/O workloads. We tested multiple concurrent I/O clients to capture the scaling and peak behavior of our 144 TB NFS setup.
A single NFS gateway can achieve up to 2.5 GB/s for sequential writes. ‘Async’ was only slightly faster than ‘sync’, and both peaked at about 2.5 GB/s.
The most compelling result was the random write IOPS, particularly with ‘async’: a total of 111K IOPS on random writes, with plain ol’ NFS and 7.2K rpm drives! Notice that simply changing the export option from ‘sync’ to ‘async’ improved random write IOPS by a massive amount. Only the NFS export option was changed; nothing else.
Using ‘async’ instead of ‘sync’ can be an excellent fit for environments where performance, specifically write IOPS performance, is critical, and where data integrity is managed through other processes such as backups, or where the data is ‘scratch data’, i.e. temporary files that can be easily regenerated. Moreover, the sequential write tests showed a peak sequential throughput of ~2600 MiB/s. Pretty good for a simple NFS configuration built from standard components!
Sync and async reads performed similarly, which is to be expected: the export option only governs when the server acknowledges writes, so it has no bearing on how fast data can be read. We measured a peak of ~13,000 IOPS for random reads and a peak of ~3450 MiB/s for sequential reads (see figures below).
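The throughput figures in this post are quoted in both GB/s and MiB/s, so a quick conversion helps when comparing them. The sketch below converts the two MiB/s peaks into decimal GB/s and binary GiB/s. Treating the "2.5 GB/s" write figure as a loosely labelled binary value is our assumption, since ~2600 MiB/s is about 2.54 GiB/s.

```python
MIB = 1024 ** 2  # bytes in one mebibyte


def mib_s_to_gb_s(mib_per_s: float) -> float:
    """Convert MiB/s to decimal gigabytes per second (1 GB = 10**9 bytes)."""
    return mib_per_s * MIB / 1e9


def mib_s_to_gib_s(mib_per_s: float) -> float:
    """Convert MiB/s to binary gibibytes per second (1 GiB = 1024 MiB)."""
    return mib_per_s / 1024


for label, peak in (("sequential write", 2600), ("sequential read", 3450)):
    print(f"{label}: {peak} MiB/s = "
          f"{mib_s_to_gb_s(peak):.2f} GB/s = {mib_s_to_gib_s(peak):.2f} GiB/s")
```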
We also ran some metadata tests to see if ‘sync’ or ‘async’ had any impact. For tests from 1 to 64 clients, we ran 1 thread per client. For the 128, 256 and 512 data points, we ran 2, 4 and 8 threads per client respectively. The performance is shown in the figures below. File create and file remove performance is 1.5x to 2x better with async than with sync. File stat performance, which is primarily read dependent, is similar with async and sync, as expected.
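The mapping from a data point on the metadata graphs to clients and threads can be sketched as follows. The function name is hypothetical; the only inputs taken from the tests are the 64-client ceiling and the 2, 4 and 8 threads per client at the 128, 256 and 512 points.

```python
MAX_CLIENTS = 64  # size of the compute cluster used in these tests


def mdtest_layout(total_threads: int) -> tuple[int, int]:
    """Return (clients, threads_per_client) for a given concurrency level."""
    if total_threads < 1:
        raise ValueError("total_threads must be at least 1")
    if total_threads <= MAX_CLIENTS:
        # Up to 64: one thread on each participating client.
        return total_threads, 1
    if total_threads % MAX_CLIENTS != 0:
        raise ValueError("above 64, concurrency must be a multiple of 64")
    # Beyond 64: all clients participate, each running several threads.
    return MAX_CLIENTS, total_threads // MAX_CLIENTS


for point in (1, 64, 128, 256, 512):
    clients, threads = mdtest_layout(point)
    print(f"{point:>3} -> {clients} clients x {threads} thread(s)")
```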
The configuration we tested is very similar to Dell’s NFS Storage Solutions (NSS). The configuration used in these tests corresponds to NSS version 4 (NSS4), which is the current generation of NSS products. Details of the test bed and benchmarks are provided in the tables below.
NFS server: Dell PowerEdge R720 with two PERC H810 cards. Dual Intel Xeon E5-2680 processors @ 2.70 GHz. 128 GB memory. NFS v3 used for these tests.
Storage: Four PowerVault MD1200 storage arrays. SAS-based JBODs direct attached to the NFS server via two PERC H810 RAID adapters. 12 x 3 TB NL-SAS disks per array. Total 48 disks, 144 TB.
Backend file system: Red Hat Scalable File System (XFS) on the NFS storage.
Clients: 64-server compute cluster comprised of Dell PowerEdge M420 servers.
Interconnect for I/O: Mellanox InfiniBand FDR and FDR10. All I/O traffic is over the InfiniBand links using the IPoIB protocol.
Operating system on NFS server: Red Hat Enterprise Linux 6.3, kernel 2.6.32-279.14.1.el6.x86_64.
Sequential tests: IOzone benchmark v3.408. 1024k record size. File size varied with the number of concurrent clients to keep total I/O at 256 GB. For example, 1 client operated on a 256 GB file, 2 clients operated on 128 GB files each, … 64 clients operated on a 4 GB file each.
Random tests: IOzone benchmark v3.408. 4k record size. Each client operated on a 4 GB file for all cases.
Metadata tests: mdtest benchmark v1.8.3. Each client was configured to create, stat and remove 10 million files.
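The sizing arithmetic behind the benchmark settings can be sketched in a few lines. The function names are hypothetical, and the capacity figure is raw disk capacity (48 x 3 TB), before any RAID 60 overhead.

```python
TOTAL_SEQ_IO_GB = 256  # total I/O held constant across sequential runs
RANDOM_FILE_GB = 4     # per-client file size in the random tests


def seq_file_size_gb(clients: int) -> float:
    """Per-client file size that keeps sequential total I/O at 256 GB."""
    if clients < 1:
        raise ValueError("clients must be at least 1")
    return TOTAL_SEQ_IO_GB / clients


def random_total_io_gb(clients: int) -> int:
    """Total data touched in a random test with a 4 GB file per client."""
    return RANDOM_FILE_GB * clients


def raw_capacity_tb(arrays: int = 4, disks_per_array: int = 12,
                    disk_tb: int = 3) -> int:
    """Raw capacity of the backend storage, ignoring RAID overhead."""
    return arrays * disks_per_array * disk_tb


for n in (1, 2, 64):
    print(f"{n} client(s): {seq_file_size_gb(n):g} GB file each")
print(f"raw capacity: {raw_capacity_tb()} TB")
```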
Who would have thought that changing from ‘sync’ to ‘async’ could have such a drastic impact on random write IOPS and that simple RAID cards in the servers could provide such sequential write performance?
Using async does change the perspective on the storage, however. If you use async, you should think of the storage as fast scratch space rather than a home for permanent data. But in exchange for that you can get a great deal of performance.