High Performance Computing
A discussion venue for all things high performance computing (HPC), supercomputing, and the technologies that enable scientific research and discovery.
  • Dell HPC NFS Storage Solution - High Availability Solution NSS5-HA configurations

The latest Dell NSS-HA solution was published in September 2013; its version is NSS5-HA. This release leverages Intel Ivy Bridge processors and RHEL 6.4 to offer higher overall system performance than previous NSS-HA solutions (NSS2-HA, NSS3-HA, NSS4-HA, and NSS4.5-HA).

    Figure 1 shows the design of NSS5-HA configurations. The major differences between NSS4.5-HA and NSS5-HA configurations are:

    • Processors:
  • NSS4.5-HA: E5-2680 @ 2.7GHz, 8 cores per processor (Sandy Bridge processors)
  • NSS5-HA: E5-2695 v2 @ 2.4GHz, 12 cores per processor (Ivy Bridge processors)

    • Memory:
  • NSS4.5-HA: 8 x 8GiB, 1600MHz, RDIMMs
  • NSS5-HA: 8 x 8GiB, 1866MHz, RDIMMs
    • OS:
      • NSS4.5-HA: RHEL6.3
      • NSS5-HA: RHEL6.4

Except for those items and the necessary software and firmware updates, NSS4.5-HA and NSS5-HA share the same HA cluster configuration and storage configuration. (Refer to the NSS4.5-HA white paper for detailed information about the two configurations.)

    Figure 1. NSS5-HA 360TB architecture

     

Although Dell NSS-HA solutions have received many hardware and software upgrades to support higher availability, higher performance, and larger storage capacity since the first NSS-HA release, the architectural design and deployment guidelines of the NSS-HA solution family remain unchanged. The rest of this blog presents only the I/O performance of NSS5-HA; to show the performance difference between NSS5-HA and NSS4.5-HA, the corresponding NSS4.5-HA numbers are also included.

    For detailed information about NSS-HA solutions, please refer to our published white papers:

     

    Note: for any customized configuration/deployment, please contact your Dell representative for specific guidelines.

    NSS5-HA I/O performance summary

    Presented here are the results of the I/O performance tests for the current NSS-HA solution. All performance tests were conducted in a failure-free scenario to measure the maximum capability of the solution. The tests focused on three types of I/O patterns: large sequential reads and writes, small random reads and writes, and three metadata operations (file create, stat, and remove).

    A 360TB configuration was benchmarked with IPoIB network connectivity. A 64-node compute cluster was used to generate workload for the benchmarking tests. Each test was run over a range of clients to test the scalability of the solution.

    The IOzone and mdtest utilities were used in this study. IOzone was used for the sequential and random tests. For sequential tests, a request size of 1024KiB was used. The total amount of data transferred was 256GiB to ensure that the NFS server cache was saturated. Random tests used a 4KiB request size and each client read and wrote a 4GiB file. Metadata tests were performed using the mdtest benchmark and included file create, stat, and remove operations. (Refer to Appendix A of the NSS4.5-HA white paper for the complete commands used in the tests.)
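For reference, the sketch below shows how IOzone and mdtest runs with these parameters are typically assembled. The mount point, host file, and mdtest file counts are placeholder assumptions; the exact commands used in this study are the ones listed in Appendix A of the NSS4.5-HA white paper.

```python
# Illustrative only: typical IOzone/mdtest invocations matching the parameters
# described above (1024KiB sequential records with a 256GiB aggregate, 4KiB
# random records on 4GiB per-client files). Paths and host files are assumed.
NUM_CLIENTS = 64
CLIENT_FILE = "clients.ioz"        # IOzone -+m file listing client hostnames and paths
MOUNT = "/mnt/nss"                 # NFS mount point on the clients (assumed)

# Sequential write (-i 0) and read (-i 1); fsync/close included in timing (-e -c).
seq_gib_per_client = 256 // NUM_CLIENTS
iozone_seq = ["iozone", "-i", "0", "-i", "1", "-c", "-e",
              "-r", "1024k", "-s", f"{seq_gib_per_client}g",
              "-t", str(NUM_CLIENTS), "-+m", CLIENT_FILE]

# Random read/write (-i 2), results reported in operations per second (-O).
iozone_rand = ["iozone", "-i", "2", "-O", "-c", "-e",
               "-r", "4k", "-s", "4g",
               "-t", str(NUM_CLIENTS), "-+m", CLIENT_FILE]

# Metadata (create/stat/remove) with one mdtest process per client under MPI.
mdtest = ["mpirun", "-np", str(NUM_CLIENTS), "--hostfile", "hosts",
          "mdtest", "-d", f"{MOUNT}/mdtest", "-n", "1000", "-i", "3"]

for cmd in (iozone_seq, iozone_rand, mdtest):
    print(" ".join(cmd))           # dry run; pass the list to subprocess.run() to execute
```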

    IPoIB sequential writes and reads

Figures 2 and 3 show the sequential write and read performance. For NSS5-HA, the peak read performance is 4379MB/sec and the peak write performance is 1327MB/sec. The two figures show clearly that the current NSS-HA solution delivers higher sequential performance than the previous one.

    Figure 2. IPoIB large sequential write performance

     Figure 3. IPoIB large sequential read performance

    IPoIB random writes and reads

Figure 4 and Figure 5 show the random write and read performance. The random write performance peaks at the 32-client test case and then holds steady. In contrast, the random read performance keeps increasing going from 32 to 48 to 64 clients, indicating that the peak random read performance is likely to be greater than 10244 IOPS (the result for the 64-client random read test case).

    Figure 4. IPoIB random write performance

    Figure 5. IPoIB random read performance

    IPoIB metadata operations

Figure 6, Figure 7, and Figure 8 show the results of the file create, stat, and remove operations, respectively. As the HPC compute cluster has 64 compute nodes, in the graphs below each client executed a maximum of one thread for client counts up to 64; for client counts of 128, 256, and 512, each client executed 2, 4, or 8 simultaneous threads, respectively.

The three figures show that NSS5-HA and NSS4.5-HA have very similar performance behaviors; the two lines in each figure are almost identical, indicating that the changes introduced with NSS5-HA have no obvious impact on the performance of metadata operations.

     Figure 6. IPoIB file create performance

    Figure 7. IPoIB file stat performance

    Figure 8. IPoIB file remove performance

     

     

  • Over 100,000 IOPs with plain ol’ NFS and 7,200 rpm drives

The HPC Engineering team at Dell has been focused on NFS with High Availability for some time, particularly as file systems have gotten so much larger. But recently we happened to run some tests on plain (non-HA) NFS and were blown away with the possibilities.

    Using our standard NFS best practices, our configuration consisted of one NFS server and four direct attached storage arrays. Previous studies have shown that two RAID controller cards give better performance than a single card in such a configuration, so that’s what we used as well. The backend storage was four Dell PowerVault MD1200 arrays with 3TB 7.2K rpm NL-SAS drives, which were formatted as two RAID 60 sets, combined as a Linux logical volume and then formatted as an XFS file system. The figure below shows our test setup as well as the file system layout.
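As a rough sketch of that layout (device names, volume names, and the stripe size here are assumptions for illustration, not the exact values from the test bed), the two RAID 60 virtual disks are combined into one striped logical volume and formatted with XFS:

```python
# Illustrative layout: one RAID 60 virtual disk per PERC H810, combined into a
# single striped LVM logical volume and formatted as XFS. Names are hypothetical.
RAID60_VDS = ["/dev/sdb", "/dev/sdc"]     # one virtual disk per RAID controller (assumed)
VG, LV = "vg_nfs", "lv_nfs"

cmds = [
    ["pvcreate"] + RAID60_VDS,                              # LVM physical volumes
    ["vgcreate", VG] + RAID60_VDS,                          # one volume group over both
    ["lvcreate", "-i", str(len(RAID60_VDS)), "-I", "1024",  # stripe across both PVs
     "-l", "100%FREE", "-n", LV, VG],
    ["mkfs.xfs", f"/dev/{VG}/{LV}"],                        # XFS on the logical volume
]
for cmd in cmds:
    print(" ".join(cmd))    # dry run; these commands are destructive if executed
```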

    This file system was exported via NFS v3. The compute clients (in our case, a 64-node compute cluster) accessed the file system over InfiniBand using the IPoIB protocol.

    We ran two sets of tests: one with the NFS export option ‘sync’ and the second with ‘async’. ‘Async’ implies that the NFS server can acknowledge writes before any changes made by that request have been committed to disk.  This option usually improves performance, but at the cost that an unclean server restart (i.e. a crash) can cause data to be lost or corrupted. ‘Sync’ is recommended when reliability is paramount and ‘async’ when pure performance is the goal.
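For illustration, the only difference between the two test runs is a single word in the export definition. A minimal sketch, with a placeholder export path and client subnet:

```python
# The sync/async toggle lives in /etc/exports; after editing, `exportfs -ra`
# re-exports the file system. Path and client subnet below are placeholders.
export_path = "/mnt/nfs"
clients = "192.168.0.0/24"

for mode in ("sync", "async"):
    print(f"{export_path}  {clients}(rw,{mode},no_root_squash)")
```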

    The graphs below show sync and async results for sequential and random I/O workloads. We tested multiple concurrent I/O clients to capture the scaling and peak behavior of our 144 TB NFS setup.

A single NFS gateway can achieve up to 2.5 GB/s for sequential writes. The performance for ‘async’ and ‘sync’ was about the same, with ‘async’ slightly ahead, but both peaked at about 2.5 GB/s.

The most compelling result was found in the random write IOPS, particularly for the ‘async’ performance: a total of 111K IOPS on random writes with plain ol’ NFS and 7.2K drives! Notice that simply by changing the export option from ‘sync’ to ‘async’, the random write IOPS performance improved by a massive amount. Only the NFS export option was changed – nothing else.

Using ‘async’ instead of ‘sync’ can be an excellent fit for environments where performance, specifically write IOPS performance, is critical and data integrity is managed through other processes like a back-up, or where the data being manipulated is considered ‘scratch data’, i.e. temporary files that can be easily regenerated. Moreover, the sequential write tests showed a peak sequential throughput of ~2600 MiB/s. Pretty good for a simple NFS configuration comprised of standard components!

Sync and async reads performed similarly, which is to be expected since the export options only change how writes are committed. We measured a peak of ~13,000 IOPS for random reads and a peak of ~3450MiB/s for sequential reads (see figures below).

We also did some metadata tests to see if ‘sync’ or ‘async’ had any impact. For tests from 1 to 64 clients, we ran 1 thread per client. For the 128, 256 and 512 data points, we ran 2, 4 and 8 threads per client respectively. The performance is shown in the figures below. File create and file remove performance is 1.5x to 2x better with async when compared to sync. File stat performance, which consists of primarily read-dependent operations, is similar with async and sync, as expected.

    The configuration we tested is very similar to Dell’s NFS Storage Solutions (NSS). The configuration used in these tests corresponds to NSS version 4 (NSS4), which is the current generation of NSS products. Details of the test bed and benchmarks are provided in the tables below.

    NFS server

    Dell PowerEdge R720 with 2 PERC H810 cards.

    Dual Intel Xeon E5-2680 @ 2.70 GHz processors. 128GB memory

    NFS v3 used for these tests

    NFS storage

    Four PowerVault MD1200 storage arrays.

SAS-based JBODs direct attached to the NFS server via two PERC H810 RAID adapters.

    12 * 3TB NL SAS disks per array. Total 48 disks, 144 TB.

    Backend File system

    Red Hat Scalable File system (XFS) on the NFS storage

    I/O clients

    64-server compute cluster comprised of Dell PowerEdge M420 servers

    Interconnect for I/O

    Mellanox InfiniBand FDR and FDR10.

    All I/O traffic is over the InfiniBand links using the IPoIB protocol

    Operating System on NFS server

    Red Hat Enterprise Linux 6.3, kernel 2.6.32-279.14.1.el6.x86_64

    Sequential Tests

    IOzone benchmark. v 3.408

    1024k record size

    File size varied depending on number of concurrent clients to keep total I/O at 256GB.

For example, 1 client operated on a 256GB file, 2 clients operated on a 128GB file each, … and 64 clients operated on a 4GB file each (see the sketch after this table).

    Random Tests

    IOzone benchmark. v3.408

    4k record size

    Each client operated on a 4GB file for all cases.

    Metadata tests

    mdtest benchmark. v1.8.3

    Each client was configured to create, stat and remove 10 million files.
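To make the scaling scheme in the table above concrete, a small sketch of the per-client parameters (client counts above 64 map to extra threads on the 64 physical nodes):

```python
# Per-client parameters implied by the table above.
TOTAL_SEQ_GB = 256      # fixed aggregate for the sequential tests
PHYSICAL_NODES = 64     # compute nodes available to generate load

for clients in (1, 2, 4, 16, 64):
    print(f"sequential, {clients:2d} clients: {TOTAL_SEQ_GB // clients} GB file per client")

for clients in (128, 256, 512):
    print(f"metadata, {clients:3d} clients: {clients // PHYSICAL_NODES} threads per node")
```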

    Who would have thought that changing from ‘sync’ to ‘async’ could have such a drastic impact on random write IOPS and that simple RAID cards in the servers could provide such sequential write performance?

Using async does change the perspective on the storage, however. If you use async, you should think of the storage as fast scratch space rather than a home for permanent data. But in exchange for that you can get a great deal of performance.

  • Dell HPC NFS Storage Solution - High Availability Solution: NSS4.5-HA and NSS4.5-HA XL configurations

The latest Dell NSS-HA solution was published in October 2012; its version is NSS4.5-HA. This release leverages the latest Dell PowerVault storage stack (MD3260 and MD3060e) to offer denser storage solutions than previous NSS-HA solutions (NSS2-HA, NSS3-HA, and NSS4-HA).

Furthermore, as an extension of the NSS4.5-HA configuration, the NSS4.5-HA XL configuration was developed. Figure 1 and Figure 2 show the designs of the NSS4.5-HA and NSS4.5-HA XL configurations, respectively. The major difference between them is that NSS4.5-HA supports only one MD3260 + MD3060e storage stack, while NSS4.5-HA XL supports two MD3260 + MD3060e storage stacks concurrently. Thus, for the sake of simplicity, the NSS4.5-HA XL configuration can be considered as two NSS4.5-HA configurations that share the same pair of PowerEdge R620 servers.

    Figure 1. NSS4.5-HA 360TB architecture

    Figure 2. NSS4.5-HA XL 2x360TB architecture

It is also worth mentioning that in the XL configuration the two PowerEdge R620 servers form two active-passive pairs, one per storage stack, so each server actively hosts the I/O requests for only one storage stack. For example, with the two PowerEdge R620 servers labeled “active” and “passive”:

• Pair 1: server “active” handles I/O requests to/from storage stack 1, and server “passive” is the standby for the storage service running on “active”.
• Pair 2: server “passive” hosts storage stack 2, and server “active” is the standby for the storage service running on “passive”.

Thus, if server “active” suffers a catastrophic failure, the storage service hosted by “active” automatically fails over to server “passive”; similarly, if “passive” fails, the storage service running on it fails over to “active”.

    For detailed information about the XL configuration, please refer to Dell NFS Storage Solution with High Availability – XL configuration.  

     

Although Dell NSS-HA solutions have received many hardware and software upgrades to support higher availability, higher performance, and larger storage capacity since the first NSS-HA release, the architectural design and deployment guidelines of the NSS-HA solution family remain unchanged. Thus, the rest of this blog presents only the deployment and sequential I/O performance information for NSS4.5-HA and NSS4.5-HA XL.

    For detailed information about NSS-HA solutions, please refer to our published white papers:

     

    NSS4.5-HA and NSS4.5-HA XL deployment summary

Table 1 lists the six items needed to successfully deploy the NSS4.5-HA and NSS4.5-HA XL configurations:

• Usage: specifies the hardware setup options for an NSS-HA configuration.
• Configuration guide: specifies the name of the document for deploying an NSS-HA configuration.
• HA cluster configuration utility: specifies the name of the utility for configuring the HA cluster. It is used to generate the HA cluster configuration file.
• File system configuration utility: specifies the name of the utility for file system configuration. It is used to configure the storage devices, including creating/removing physical volumes, volume groups, logical volumes, and the XFS file system.
• PowerVault storage configuration utility: specifies the name of the utility for PowerVault storage configuration. It is used to configure the PowerVault MD3260 and MD3060e storage stacks.
• HA cluster monitor utilities: specifies the names of the utilities for monitoring HA cluster components.
  • sas_path_check.sh: used by the Red Hat HA cluster management tool to monitor the status of the SAS paths.
  • ibstat_script.sh: used by the Red Hat HA cluster management tool to monitor the status of the IB links, if IPoIB is deployed.

     

Table 1. NSS4.5-HA vs. NSS4.5-HA XL

• Usage
  • NSS4.5-HA configuration: two options, 180TB raw storage capacity or 360TB raw storage capacity.
  • NSS4.5-HA XL configuration: one option, 2 x 360TB raw storage capacity (two independent file systems).
• Configuration guide
  • NSS4.5-HA configuration: NSS4.5_HA_recipe_v1.0.pdf
  • NSS4.5-HA XL configuration: NSS4.5_HA_XL_recipe_v1.0.pdf
• HA cluster configuration utility
  • NSS4.5-HA configuration: cluster_config.sh
  • NSS4.5-HA XL configuration: cluster_config_xl.sh
• File system configuration utility
  • NSS4.5-HA configuration: nssha63_single.py
  • NSS4.5-HA XL configuration: nssha63_xl.py
• PowerVault storage configuration utility
  • NSS4.5-HA configuration: MD3260_180TB.scr (180TB) or MD3260_360TB.scr (360TB)
  • NSS4.5-HA XL configuration: two utilities, MD3260_360TB_1.scr and MD3260_360TB_2.scr
• HA cluster monitor utilities (both configurations): sas_path_check.sh and ibstat_script.sh

    Please refer to the attachments of the blog for all configuration guides and utilities mentioned above.

    Note: for any customized configuration/deployment, please contact your Dell representative for specific guidelines.

    NSS4.5-HA and NSS4.5-HA XL sequential I/O performance summary

For the NSS4.5-HA XL configuration, there are two scenarios in which clients access the two storage stacks concurrently:

• Failure-free: there is no failure. Each PowerEdge R620 server handles I/O requests for only one storage stack (MD3260 + MD3060e).
• Failure: a failure occurs on one PowerEdge R620 server. The other server then handles I/O requests for both storage stacks concurrently instead of one, as the storage service from the failed server fails over to the healthy one.

It is worth discussing the difference in I/O performance behavior between the two cases. Figure 3 and Figure 4 present the sequential write and read performance numbers collected from the NSS4.5-HA configuration and from the two scenarios of the NSS4.5-HA XL configuration.

    • “NSS45-HA” denotes the performance numbers for NSS4.5-HA standard configuration.
    • “NSS45-HA-XL-single-server” denotes the performance numbers collected in the failure scenario for NSS4.5-HA XL configuration.
    • “NSS45-HA-XL-two-servers” denotes the performance numbers collected in the failure-free scenario for NSS4.5-HA XL configuration.

Note: the performance benchmarking methodology is the same for all three sets of performance numbers, except that for the XL configuration half of the clients access one storage stack and the other half access the other storage stack; the performance of the XL configuration is therefore the overall performance of the two storage stacks. For detailed benchmarking information, please refer to the NSS4.5-HA white paper, section 5, page 17.

    Write performance

As mentioned above, the write performance of the XL configuration is the aggregate write performance of the two storage stacks accessed concurrently. For example, in a 32-client test case, all 32 client nodes access a single storage stack in the NSS4.5-HA standard configuration, while in the XL configuration (in either scenario) each storage stack handles concurrent I/O requests from only 16 client nodes. By processing the I/O requests on two storage stacks concurrently instead of one, the overall write performance of the XL configuration is expected to be roughly twice that of the NSS4.5-HA standard configuration. The results presented in Figure 3 confirm this expectation.

It is also worth pointing out that there is little write performance difference between the two scenarios for the NSS4.5-HA XL configuration, as shown in Figure 3. This is reasonable: for sequential write workloads the storage stack itself is the bottleneck of the entire solution, so adding computation power and network bandwidth (one more R620) does little to help write performance.

    Figure 3. Sequential write performance

    Read performance

As shown in Figure 4, it is interesting that the read performance of the XL configuration in the failure scenario is almost half of that in the failure-free scenario. Should the XL configuration be expected to show similar read performance in either scenario, as it does for writes? The answer is no. As discussed above, write performance in the two scenarios is similar because the storage array's slow write processing speed is the bottleneck, independent of computation power and network bandwidth. The PowerVault MD3260 and MD3060e storage arrays, however, process read requests much faster than write requests; 4 GB/sec can be achieved according to our server-to-storage test records. Thus, in a failure scenario, the computation power and network bandwidth of a single R620 become insufficient to handle the high volume of read requests for the two storage stacks concurrently. As shown in Figure 4, this resource insufficiency not only cuts the read performance of the XL configuration to about half of its failure-free value, but also makes it worse than the NSS4.5-HA standard configuration as the number of concurrent read requests increases. It is also worth mentioning that once the system recovers from the failure scenario, the I/O workload is balanced again and a large overall performance improvement is observed.

    Figure 4. Sequential read performance

  • HPC I/O performance over NFS – now 20x faster. How!?

    In the last few months, we have been evaluating host-based caching software solutions in the Dell HPC engineering lab.  Take your familiar NFS environment – an NFS server with direct attached storage, serving data to an HPC compute cluster. Now add some host-based caching software to the NFS server and a couple of PCIe SSDs to be the cache layer between memory and the backend disks. Don’t change anything in the storage configuration, the file system or how the clients access the NFS server.  Does this improve the performance of standard HPC I/O workloads? If so, by how much? That was the goal of our study.

    Host-based caching software is not a recent development. What’s unique about the version we tested, Dell Fluid Cache for DAS, is that it provides a write-back feature, not just write-through. This enables write I/O patterns to be accelerated in addition to read I/O patterns.

The figure below shows the configuration that was used for this study. Four storage arrays were attached to an NFS server. To enable host-based caching, an SSD controller, two Express Flash PCIe SSDs, and the caching software were added to the NFS server.

The graphs below chart the aggregate IOPS of a random I/O pattern for a regular NFS setup as well as for an NFS+caching configuration. A 64-node HPC compute cluster was used to drive I/O and test the two storage configurations. The HPC cluster and NFS server were both connected to a shared InfiniBand fabric. As shown below, we measured up to a 6.4x improvement in random write IOPS and up to a 20x improvement in random read IOPS with the caching software when compared to the baseline plain NFS configuration!

    The random I/O workload consisted of multiple clients concurrently writing or reading a 4GB file each. The iozone benchmark tool was used to measure IOPs and record size was set to 4k. The caching software was configured in write-back mode.
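As a reminder of how the IOPS figures relate to raw throughput at this record size (a simple identity, with illustrative numbers rather than measured ones):

```python
# At a fixed 4KiB record size, IOPS is just the small-block throughput divided
# by the record size; iozone reports ops/sec directly when run with -O.
RECORD_KIB = 4

def iops(throughput_kib_per_s: float) -> float:
    """Convert aggregate throughput (KiB/s) at a fixed record size to IOPS."""
    return throughput_kib_per_s / RECORD_KIB

print(f"{iops(400_000):,.0f} IOPS from 400,000 KiB/s of 4KiB random I/O")  # 100,000 IOPS
```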

    The configuration used for these tests is summarized in the table below. 

    Caching software

    Dell Fluid Cache for DAS 1.0

    Cache pool

    Two 350GB Dell PowerEdge Express Flash PCIe SSDs

    This provides ~350 GB for the write cache as writes are mirrored, and ~700GB for the read cache.

    NFS server

    Dell PowerEdge R720

    Dual Intel Xeon E5-2680 @ 2.70 GHz processors. 128GB memory

    NFS storage

    Four PowerVault MD1200 storage arrays.

SAS-based JBODs direct attached to the NFS server.

    12 * 3TB NL SAS disks per array. Total 48 disks, 144 TB.

    Backend File system

    Red Hat Scalable File system (XFS) on the NFS storage

    I/O clients

    64-server compute cluster comprised of Dell PowerEdge M420 servers

    Interconnect for I/O

    Mellanox InfiniBand FDR and FDR10.

    All I/O traffic is over the IB links using the IPoIB protocol

    *The baseline NFS configuration is similar to the Dell NSS family of storage solutions.

    *The caching software and cache pool of SSDs were used for the NFS+caching tests only.

    For details on the configuration, performance of write-back vs. write-through modes, and to see the impact on different I/O patterns, please consult the complete study available as a technical white paper. The white paper also contains a step-by-step appendix on configuring and tuning such a solution.

  • HPC Storage: Understanding how your applications perform I/O

    HPC Storage is arguably one of the most pressing issues in HPC. Selecting various HPC Storage solutions is a problem that requires some research, study, and planning to be effective – particularly cost-effective. Getting this process started usually means understanding how your applications perform I/O.

I recently wrote a detailed article that presents some techniques for examining I/O patterns in HPC applications. The article, called HPC Storage - Getting Started with I/O Profiling, reviews different ways you can measure the performance of the storage systems used in HPC, as well as of the applications that use them. My goal is to share some ideas about how to analyze your I/O needs from the perspective of an application.

The article is posted in HPC Magazine; you can read it here.

  • Unbalanced Memory Performance

    by Joseph Stanfield

It is well understood that a server with a “balanced” memory configuration yields the best performance (see Memory Selection Guidelines for HPC and 11G PowerEdge Servers). Balanced implies that all memory channels of the server are populated equally and with identical memory modules (DIMMs). But there are certain situations where an unbalanced configuration might be needed: cost limitations, capacity requirements, and application needs are all possible factors. This blog provides a brief overview of how to gain the best performance from an unbalanced memory configuration.

To better understand the drawbacks of unbalanced configurations and to determine which unbalanced configuration is best, several tests were conducted in our lab. We have seen many requests for servers configured with 48GB and 96GB of memory. Satisfying these capacity requirements on the latest generation of servers, which have four memory channels per socket, is only possible with unbalanced configurations. Using the available 2GB, 4GB, 8GB and 16GB DIMMs, we tested the configurations described below.

    For the purpose of this study, a Dell PowerEdge M620 was used with the following configuration:

Dual CPU: Intel Xeon E5-2680 @ 2.70GHz
BIOS: 1.1.2
CPLD: 1.0.2
iDRAC Version: 1.06.06
Node Interleaving: Disabled
Memory Mode: Optimized

Memory used for testing:
• 2GB 1Rx8 @ 1600 MT/s
• 4GB 2Rx8 @ 1600 MT/s
• 8GB 2Rx4 @ 1600 MT/s
• 16GB 2Rx4 @ 1600 MT/s

Figure 1: PowerEdge M620 configuration and memory used for testing.

Two capacity tests (48GB and 96GB) were performed with eight different memory organization options using the STREAM memory bandwidth benchmark. Due to the similarities in memory channel population and benchmark results, this blog will focus on the 96GB options. For a comparison of the capacities tested, see Figure 6 at the end of the blog. All results report the total measured system memory bandwidth.

The first test used fully populated memory banks across all four channels (see Figure 2). Each CPU in this case supports up to 3 DIMMs per channel, but a maximum-capacity configuration significantly reduces the speed at which the memory operates, impacting the overall performance, as is evident from the result.

• CPU1: 12 x 4GB, CPU2: 12 x 4GB, Triad: 56GB/s

Figure 2

    For the remaining seven unbalanced tests, three of the options were unbalanced across processors and four were unbalanced across memory channels.

    Balanced Configuration Reference
    Before we began the unbalanced testing, we needed a reference point and some actual STREAM results from a balanced configuration.  Six valid configurations were benchmarked which gave us an average benchmark outcome to baseline against when testing the unbalanced options.

Figure 3 shows an example of a balanced configuration that has an equal amount of memory per channel for each CPU. All of the results are in the same ballpark, and the maximum performance was achieved when the system board was identically populated with 1600 MT/s DIMMs across four channels per CPU. From the results, it’s clear that DIMM capacity is not a factor for memory bandwidth. The balanced configurations consisted of 1 or 2 DIMMs per channel, with either 8 or 16 identical DIMMs in the server.

• Opt 1 (32GB): CPU1: 8 x 2GB, CPU2: 8 x 2GB, Triad: 74GB/s
• Opt 2 (32GB): CPU1: 4 x 4GB, CPU2: 4 x 4GB, Triad: 77GB/s
• Opt 3 (64GB): CPU1: 8 x 4GB, CPU2: 8 x 4GB, Triad: 78GB/s
• Opt 4 (64GB): CPU1: 4 x 8GB, CPU2: 4 x 8GB, Triad: 77GB/s
• Opt 5 (128GB): CPU1: 8 x 8GB, CPU2: 8 x 8GB, Triad: 71GB/s
• Opt 6 (128GB): CPU1: 4 x 16GB, CPU2: 4 x 16GB, Triad: 76GB/s

Figure 3: An example of two DIMMs per channel populated in a balanced configuration.
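For context (a back-of-the-envelope estimate, not a figure from the original post), the ~74-78 GB/s Triad results above sit at roughly three quarters of the platform's theoretical peak memory bandwidth:

```python
# Theoretical peak memory bandwidth for this platform: DDR3-1600 moves 8 bytes
# per channel per transfer, with 4 channels per socket and 2 sockets.
SOCKETS = 2
CHANNELS_PER_SOCKET = 4
TRANSFERS_MT_S = 1600          # DIMMs running at 1600 MT/s
BYTES_PER_TRANSFER = 8         # 64-bit memory channel

peak_gb_s = SOCKETS * CHANNELS_PER_SOCKET * TRANSFERS_MT_S * BYTES_PER_TRANSFER / 1000
print(f"theoretical peak: {peak_gb_s:.1f} GB/s")                 # 102.4 GB/s
print(f"77 GB/s Triad is {77 / peak_gb_s:.0%} of theoretical peak")
```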

    Unbalanced Across Processors
The next three options had balanced memory across channels but an unbalanced configuration between the processors. The example in Figure 4 has all four memory channels of CPU 1 populated with two DIMMs each; this CPU operates with a larger capacity and will generally see lower latency. CPU 2 also has all four memory channels populated, but with one DIMM each. Depending on which CPU executes the process, there may be a performance reduction due to the higher latency caused by remote memory requests; this happens if the memory required is more than what is attached to that CPU. Interestingly, the benchmark results were comparable with the balanced configuration tests. This is likely due to the memory benchmark itself: the limits of the memory capacity per CPU are not exercised, and the symmetric population across memory channels keeps the memory bandwidth high.

Figure 4: An example of a configuration that is unbalanced across CPUs.
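One quick way to spot this kind of per-socket imbalance on a running Linux system is to compare the memory size reported for each NUMA node; a minimal sketch:

```python
# Print memory per NUMA node (Linux sysfs) to reveal configurations that are
# unbalanced across CPUs, e.g. 64GB behind one socket and 32GB behind the other.
import glob
import re

for path in sorted(glob.glob("/sys/devices/system/node/node*/meminfo")):
    node = re.search(r"node(\d+)", path).group(1)
    with open(path) as f:
        for line in f:
            if "MemTotal" in line:
                kib = int(line.split()[-2])       # e.g. "Node 0 MemTotal: 67108864 kB"
                print(f"node {node}: {kib / (1024 * 1024):.1f} GiB")
```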

     

    The table below shows the exact configurations used for each option and the corresponding STREAM Triad results.

• CPU1: 8 x 8GB, CPU2: 8 x 4GB, Triad: 75GB/s
• CPU1: 8 x 8GB, CPU2: 4 x 8GB, Triad: 77GB/s
• CPU1: 4 x 16GB, CPU2: 4 x 8GB, Triad: 76GB/s

    Unbalanced Across Channels
The unbalanced channel configurations that were tested had partially populated memory channels, resulting in bandwidth bottlenecks. Figure 5 shows an example of unequally populated memory channels.

With the exception of option 5, the tests executed with unbalanced channel configurations completed with significantly lower results than the tests with configurations that were unbalanced across CPUs. The actual configurations that were tested are listed in the table below.

     

Figure 5: An example of a configuration that is unbalanced across memory channels.


• CPU1: 4 x 4GB + 4 x 8GB, CPU2: 4 x 4GB + 4 x 8GB, Triad: 75GB/s
• CPU1: 6 x 8GB, CPU2: 6 x 8GB, Triad: 44GB/s
• CPU1: 6 x 8GB, CPU2: 6 x 8GB, Triad: 63GB/s
• CPU1: 3 x 16GB, CPU2: 3 x 16GB, Triad: 62GB/s

As mentioned earlier, 48GB capacity configurations were also tested with the same eight organization options. The DIMM channels were populated similarly between both sets of tests, with smaller capacity DIMMs used for the 48GB tests. The results followed similar trends, with the exception of Option 5 on the 48GB test: that configuration mixed 2GB and 4GB DIMMs of mixed ranks, resulting in lower performance. Details of the DIMMs used can be reviewed in Figure 1.

    Figure 6: Unbalanced 48GB and 96GB STREAM Triad Comparison Results.

Our conclusion from this exercise is that it is possible to achieve desirable results with an unbalanced memory configuration, provided that the memory attached to each CPU is identical and does not exceed two DIMMs per channel.

    References

1. Memory Performance Guidelines for Dell PowerEdge 12th Generation Servers. http://en.community.dell.com/techcenter/b/techcenter/archive/2012/07/26/memory-performance-guidelines-for-dell-poweredge-12th-generation-servers.aspx

2. Nehalem and Memory Configurations. http://en.community.dell.com/techcenter/b/techcenter/archive/2009/04/08/nehalem-and-memory-configurations.aspx

3. Memory Selection Guidelines for HPC and 11G PowerEdge Servers. http://content.dell.com/us/en/enterprise/d/business~solutions~whitepapers~en/Documents~11g-memory-selection-guidelines.pdf.aspx

• NFS Storage Solution – delivering greater than 4000MB/s throughput!

    By Xin Chen, Garima Kochhar and Mario Gallegos. August 2012.

Almost all clusters, no matter the size, need an NFS based solution. In some clusters this storage is used only for applications and home directories. In others, depending on the size of the cluster and the I/O requirements, it can be used for processing temporary files as well. Every system admin knows NFS, but in our experience, tuning NFS is non-trivial. You already have the servers, the storage and the software. How do you get the best performance and reliability out of your configuration? And how do you get up to 4000MB/s throughput from NFS? Keep reading!

    NSS-HA is a line of optimized NFS based storage solutions for HPC configurations that also provide High Availability. Including the latest solution described here, three versions of NSS-HA solutions have been released since 2011. With the introduction of the latest version of NSS-HA (called NSS4-HA and described in this article), the NFS servers have been upgraded to take advantage of several new technologies that promise to improve IO performance.

    We’re going to throw some model numbers at you now. This is to easily and simply explain what new technologies we’re talking about. If you’re familiar with this, skip ahead to the next paragraph. These technologies have been released with the Dell PowerEdge R620 server. It features the Intel Xeon E5-2600 series processors (based on the Intel micro-architecture codenamed Sandy Bridge-EP) and provides enhanced systems management features and lower power consumption when compared to the previous 11th generation Dell PowerEdge servers. The integrated PCIe Gen-3 I/O capabilities of the latest Intel Xeon processors allow for a faster interconnect using the 56 Gb/sec fourteen data rate (FDR) InfiniBand adapters. For 10 Gb Ethernet solutions, an onboard Network Daughter Card that does not consume a PCIe slot is now an option. The 1U PowerEdge R620 has enough PCIe slots to satisfy the requirements of the NSS-HA solution and allows for a denser solution. Additionally, the PowerEdge R620 provides increased memory capacity and bandwidth.  All these factors combine to provide better IO performance as described below. The storage subsystem of this release of the NSS-HA remains unchanged.

    For readers familiar with the NSS-HA solutions, Table 1 gives an easy to read way to see what’s new. It also allows you to decide if you need an upgrade at all. Note that there are significant changes in configuration steps between the NSS2-HA and NSS3-HA releases, while there are few configuration changes between the NSS3-HA and NSS4-HA releases. With the help of the PowerEdge R620 and FDR InfiniBand network connection, the NSS4-HA solution now achieves sequential read peak performance up to 4058 MB/sec! The Performance section at the bottom of this article gives more detail on the IO performance of this configuration compared to the previous generation.

Table 1. Comparison of the NSS-HA solution releases

• NSS2-HA release (April 2011)
  • Release purpose: initial release.
  • Storage capacity: the maximum supported size is 96 TB in a standard configuration; the XL configuration supports 2 x 96 TB (two file systems).
  • Sequential performance (standard configuration): peak write 1275 MB/sec; peak read 2430 MB/sec.
  • Configuration details: the complete configuration steps can be found in Dell HPC NFS Storage Solution High Availability Configurations, Version 1.1.
• NSS3-HA release (February 2012), “Large capacity configuration”
  • Release purpose: add the ability to support greater than 100TB storage capacity.
  • Storage capacity: the maximum supported size is 288 TB in a standard configuration; the XL configuration supports 2 x 288 TB (two file systems).
  • Sequential performance (standard configuration): peak write 1495 MB/sec; peak read 2127 MB/sec.
  • Configuration details: compared to the NSS2-HA release, there are significant changes in configuration steps. The complete configuration steps can be found in Dell HPC NFS Storage Solution – High availability with large capacities, Version 2.1.
• NSS4-HA release (July 2012), “PowerEdge R620 based solution”
  • Release purpose: move to the latest server technology and take advantage of the performance improvement with Dell PowerEdge 12th generation servers.
  • Storage capacity: same as the NSS3-HA release (the storage subsystem is unchanged in this release).
  • Sequential performance (standard configuration): peak write 1535 MB/sec; peak read 4058 MB/sec.
  • Configuration details: compared to the NSS3-HA release, there are only a few changes in configuration steps. The complete configuration steps will be published in July 2012.
• HA functionalities: all three releases use the same mechanisms to tolerate or recover from the following failures:
  • Single local disk failure on a server
  • Single server failure
  • Power supply or power bus failure
  • Fence device failure
  • SAS cable/port failure
  • Dual SAS cable/card failure
  • InfiniBand/10GbE link failure
  • Private switch failure
  • Heartbeat network interface failure
  • RAID controller failure on the Dell PowerVault MD3200 storage array

    Why did we use Dell PowerEdge R620 servers?

Compared to the NSS3-HA release, the biggest change in the NSS4-HA release is that the new Dell 12th generation PowerEdge R620 server is deployed as the NFS server, while the Dell 11th generation PowerEdge R710 server was used as the NFS server in the two previous releases.

The Dell NSS-HA solution is designed to provide storage service to HPC clusters. Besides providing high availability and reliability, it is also essential for a storage solution to deliver excellent I/O performance for HPC clusters. The PowerEdge R620 leverages current state-of-the-art technologies to enhance network and disk I/O processing capabilities compared to the PowerEdge R710. The key features of the PowerEdge R620 are listed below; these features position the PowerEdge R620 to be a better performing platform and a better performing NFS server in NSS-HA than the PowerEdge R710:

    • Faster processor: The PowerEdge R620 is equipped with the new Intel Xeon E5-2680 processor, which provides faster processing speed and more cores than the Xeon E5630 used in the PowerEdge R710.
    • Larger capacity and faster memory: With this release of the NSS-HA solution, the NFS server is equipped with 128 GB of memory running at 1600 MT/s versus 96 GB of 1333 MT/s memory in the previous solution. Larger memory size and higher frequency are critical to server performance.  
• Faster internal connections: faster connections are provided throughout the system with 8.0 GT/s Intel QuickPath Interconnect (QPI) links, compared to the 5.86 GT/s supported by the Intel Xeon E5630 in the PowerEdge R710.
• Faster InfiniBand link: the PowerEdge R620 supports a PCIe Gen 3 based fourteen data rate (FDR) card, which can provide a bandwidth of up to 56 Gb/sec, while the PowerEdge R710 supports only PCIe Gen 2 speeds and uses quad data rate (QDR) links with a maximum bandwidth of 40 Gb/sec.
    • Smaller form factor: The PowerEdge R620 is a 1U rack server, while the PowerEdge R710 is a 2U rack server. That translates into a denser solution with this release of the NSS-HA solution.
• Onboard 10GbE option: the PowerEdge R620 can support an onboard 10 Gb Ethernet network daughter card for clusters that require 10 GbE connectivity, which frees a PCIe slot in the NFS server.

    Performance Improvement

    Due to the many powerful features of the PowerEdge R620, the current NSS-HA release provides significant I/O performance improvement:

• Sequential read/write performance: about 75 percent improvement on average; most of the improvement is in sequential reads. The write performance does not change much between the current and previous release, as RAID 6 write performance is largely determined by the storage subsystem itself (the disk drives in the storage subsystem are configured as RAID 6).
• Random read/write performance: about 17 percent improvement for random writes and 23 percent improvement for random reads, on average.
• Metadata operation performance: the average improvement is more than 20 percent for file create, stat, and remove operations.

    The following figures show the comparisons between NSS3-HA and NSS4-HA. Note: NSS3-HA and NSS4-HA have the exact same storage subsystem.

    Figure 1. IPoIB large sequential write performance: NSS4-HA vs. NSS3-HA

    Figure 2. IPoIB large sequential read performance: NSS4-HA vs. NSS3-HA

    Figure 3. IPoIB random write performance: NSS4-HA vs. NSS3-HA

    Figure 4. IPoIB random read performance: NSS4-HA vs. NSS3-HA

    Figure 5. IPoIB file create performance: NSS4-HA vs. NSS3-HA

    Figure 6. IPoIB file stat performance: NSS4-HA vs. NSS3-HA

    Figure 7. IPoIB file remove performance: NSS4-HA vs. NSS3-HA

    For detailed information about Dell NSS4-HA solution, please refer to “Dell HPC NFS Storage Solution High Availability (NSS-HA) Configurations with Dell PowerEdge 12th Generation Servers.” For the detailed Dell NSS4-HA configuration guide, please refer to the attachment of the blog.

  • A Complete Tier HPC Storage Solution?

New technologies appear almost daily in today’s world. Most don’t make it long term. The technologies that survive solve problems for the user. Some technologies are sold off, absorbed, and never see a user again; this can leave the user bewildered and angry, looking for a replacement. Other technologies are used for a while and accepted, but are then placed on a shelf, forgotten, waiting for revitalization. It is this type that I wish to discuss. But first let’s describe the problem as seen by a special user.

    Image from ATLAS Experiment. Caption: Candidate Higgs decay to four electrons recorded by ATLAS in 2012.

The ATLAS experiment, which is part of the Large Hadron Collider (LHC), is collecting data at an enormous rate: about 3 Petabytes (PB) a year, and it will be collecting for 20 years plus. They also have a requirement that the data be available for years to come. This directive is meant for new research ideas and techniques, so the data should be kept near line to be re-analyzed if needed in the future. The collected data is distributed around the globe using a worldwide grid, a construct implemented by the LHC community members. However, each member is responsible for their own implementation. There are exceptions, but most members have a specific task in the search for new physics. Breaking this into components and techniques of how this is done and who has responsibility is beyond the scope of this blog. For storage, which is the scope of this blog, it can be summarized as high speed data for simulation and reconstruction, high throughput write-once read-many data for analysis, and long term near line data. This of course is a classic definition of tiered storage.

Now imagine that you are a large user of this data. Let’s say that you have 15PB on hand and you are responsible for analysis, simulation, and reconstruction using ATLAS data. You have 2 versions of Scientific Linux, 4 types of storage software (dCache, ext3, NFS, and HPS), and 6 different CERN Openlab software versions to make your storage tier work. It really doesn’t matter what hardware you have on the floor, but it is tons of disks and tapes. I can see why meetings are held at the pub. This is a severe problem in my head, and wouldn’t it be great if the LHC community had a better technology? Something that is easy to manage, scales up, scales out, is fast, loves any version of Linux, is capable of high throughput, and can manage tape drives. Well, maybe soon, there will be.

Last August, I was invited to work on a whitepaper that may provide this type of solution using Dell servers and storage. Some very smart guys and girls at Clemson University decided to pull the Parallel Virtual File System (PVFS) off the shelf. They have been busy revitalizing it for the 21st century; they call it OrangeFS. OrangeFS, currently at version 2.0, does most of the items described above. However, many improvements are planned for 3.0, including tape drive drivers for the management of tape systems. This addition will make OrangeFS a complete tiered system. Is it the long-awaited storage management solution that the LHC community and/or HPC in general have been waiting for? Let’s step back for a very quick look at the technical claims.

Metadata is distributed across the storage system, so when a new user does an “ls” on the whole system, OrangeFS should not crash like some HPC file systems do. It is supposed to handle small files better than other HPC file systems. It is open source and free as a bird, but what about support? Clemson has a support service in place called Omnibond. I understand that there is a cost for this, and it needs further understanding from yours truly. The user can define any type of tier in the upcoming version 3. The most interesting claim to me, after getting burned more than once by the sold code mentioned above, is that this is Clemson’s code. They are not a company that can be bought or sold. This screams stability to me, which is a rare thing indeed.

With the claims aside, I see OrangeFS as a large box as opposed to the triangle description used so often: high speed storage on top, NFS in the middle, tape on the bottom, or something similar. I use the box description because the same file system can be used in the entire environment. If you need more high speed storage for a week, just define it. Need more capacity in the middle? Just add it. I don’t know if or when OrangeFS will move to the mainstream, but I would like to look back and know that I gave it a good push. Please follow the link to the Clemson whitepaper below.

    http://content.dell.com/us/en/enterprise/d/business~solutions~engineering-docs~en/Documents~orange-fs-reference-architecture.pdf.aspx

     

  • Dell HPC NFS Storage Solution with High Availability -- Large Capacity Configuration

We have posted two blogs (1) (2) discussing the Dell NFS Storage Solution with High Availability (NSS-HA) in the past. This article introduces a new configuration of the Dell NSS-HA solution that is able to support larger storage capacities (> 100 TB) compared to the previous NSS-HA configurations.

Dell has obtained support from Red Hat for XFS capacities greater than 100 Terabytes. Details on the work that was done to get this exception for Dell are in the blog post: Dell support for XFS greater than 100 TB.

     

    As the design principles and goals for this configuration remain the same as previous Dell NSS-HA configurations, we will only describe the difference between this configuration and the previous configurations. For complete details, please refer to our white papers titled “Dell HPC NFS Storage Solution High Availability Configurations, Version 1.1.” and “Dell HPC NFS Storage Solution High Availability Configurations with Large Capacities, Version 2.1.”

    Storage density

In previous configurations of NSS-HA (1), each storage enclosure was equipped with twelve 3.5” 2TB NL-SAS disk drives. The larger capacity 3TB disk drives are a new component in the current configuration. The storage arrays in the solution, the Dell PowerVault MD3200 and PowerVault MD1200 expansion arrays, are the same as in the previous version of the solution but with updated firmware. The higher capacity 3TB disks now allow higher storage densities in the same rack space. Table 1 provides information on the new capacity configurations possible with the 3TB drives. This table is not a complete list of options; intermediate capacities are available as well.
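As a quick illustration of how raw capacity scales with the 3TB drives (the enclosure counts here are examples, not the full list of supported options):

```python
# Raw capacity with twelve 3TB NL-SAS drives per enclosure.
DRIVES_PER_ENCLOSURE = 12
DRIVE_TB = 3

for enclosures in (4, 6, 8):
    raw_tb = enclosures * DRIVES_PER_ENCLOSURE * DRIVE_TB
    print(f"{enclosures} enclosures: {raw_tb} TB raw")   # 4 -> 144 TB, 8 -> 288 TB
```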

    Storage configuration

    In previous configurations of NSS-HA, the file system had a maximum of four virtual disks. A Linux physical volume was created on each virtual disk. The physical volumes (PV) were grouped together into a Linux volume group and a Linux logical volume was created on the volume group. The XFS file system was created on this logical volume.

With this configuration, if more than four virtual disks are deployed, the Linux logical volume (LV) is extended, in groups of four, to include the additional PVs. In other words, groups of four virtual disks are concatenated together to create the file system, and data is striped across each set of four virtual disks. However, it is possible to create users and directories such that different data streams go to different parts of the array, ensuring that the entire storage array is utilized at the same time. The configuration is shown in Figure 1 for a 144TB configuration and a 288TB configuration.
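A minimal sketch of that volume layout for a configuration with eight virtual disks, where the second group of four is added as an additional 4-wide striped segment of the same logical volume (device and volume names are placeholders, not the values used in the actual recipe):

```python
# Illustrative layout: the XFS file system sits on one logical volume striped
# across groups of four virtual disks; each extra group of four is concatenated
# as a new striped segment. Device and volume names are hypothetical.
GROUP1 = [f"/dev/mapper/vd{i}" for i in range(1, 5)]
GROUP2 = [f"/dev/mapper/vd{i}" for i in range(5, 9)]
VG, LV = "vg_nss", "lv_nss"

cmds = [
    ["pvcreate"] + GROUP1 + GROUP2,
    ["vgcreate", VG] + GROUP1 + GROUP2,
    # First segment: stripe across the first four virtual disks only.
    ["lvcreate", "-i", "4", "-n", LV, "-l", "50%VG", VG] + GROUP1,
    # Second segment: extend with another 4-wide stripe over the next four.
    ["lvextend", "-i", "4", "-l", "+100%FREE", f"/dev/{VG}/{LV}"] + GROUP2,
    ["mkfs.xfs", f"/dev/{VG}/{LV}"],
]
for cmd in cmds:
    print(" ".join(cmd))   # dry run; these commands are destructive if executed
```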

    Red Hat High Availability Add-On

Red Hat High Availability Add-On is a key component for constructing an HA cluster. In previous configurations of NSS-HA, the add-on used was the version distributed with RHEL 5.5; with this release, the version distributed with RHEL 6.1 is adopted. There are significant changes in the HA design between the previous RHEL 5.5 release and the new RHEL 6.1 release. New and updated instructions to configure the HA cluster with RHEL 6.1 are listed in Appendix A of our white paper “Dell HPC NFS Storage Solution High Availability Configurations with Large Capacities, Version 2.1.”

    Red Hat Scalable File System package

In previous configurations of NSS-HA, the version of XFS was 2.10.2-7, which is distributed with RHEL 5.5. In the current version of NSS-HA, the version of XFS used is 3.1.1-4, distributed with RHEL 6.1. The most important feature of the current XFS for users is that it is able to support greater than 100 Terabytes of storage capacity.

    Summary of changes

Table 2 lists the similarities and differences in the storage components. Table 3 lists the similarities and differences in the NFS servers.

The Dell NSS-HA solution provides a highly available, high performance storage service to high performance computing clusters via an InfiniBand or 10 Gigabit Ethernet network. Performance characterization of this version of the solution is described in “Dell HPC NFS Storage Solution High Availability Configurations with Large Capacities, Version 2.1.” Additionally, in our next few blogs, we will discuss the performance of the random and metadata tests on 10GbE and other performance related topics.

    By Xin Chen and Garima Kochhar 

     References

    1.  Dell NFS Storage Solution with High Availability - an overview

    http://en.community.dell.com/techcenter/high-performance-computing/b/hpc_storage_and_file_systems/archive/2011/07/24/dell-nfs-storage-solution-with-high-availability-an-overview.aspx

    2.  Dell NFS Storage Solution with High Availability – XL configuration

    http://en.community.dell.com/techcenter/high-performance-computing/b/hpc_storage_and_file_systems/archive/2011/08/04/dell-nfs-storage-solution-with-high-availability-xl-configuration.aspx

    3.  Red Hat Enterprise Linux 6 Cluster Administration -- Configuring and Managing the High Availability Add-On.

    http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/6/pdf/Cluster_Administration/Red_Hat_Enterprise_Linux-6-Cluster_Administration-en-US.pdf

    4.  Dell HPC NFS Storage Solution High Availability Configurations, Version 1.1

    http://i.dell.com/sites/content/business/solutions/whitepapers/en/Documents/dell-hpc-nssha-sg.pdf

  • Dell support for XFS greater than 100 TB

Enterprise storage needs demand solutions that can scale up and scale out in terms of capacity and performance. This is especially true in HPC environments, where the additional constraint of cost is paramount. Dell has responded with cost-effective solutions for HPC storage needs in three different spaces (as illustrated in the figure below):

In particular, NSS is a very cost-efficient solution that can deliver high performance at moderate capacities. However, previous versions of the NSS were restricted to a ceiling of 100 Terabytes (100 * 2^40 bytes, or 100 TiB) due to the Red Hat support limit for XFS.

Given the industry's current demand for larger-capacity storage solutions in this space, Dell has been working extensively with Red Hat on XFS testing and validation to expand the support beyond the 100 TiB barrier and meet the current business needs of our customers.

    As a result of this effort, Red Hat has granted Dell support for XFS up to 288 TB (raw disk space) on NFS Storage Solutions with a single namespace, and even bigger capacities on custom design solutions. This is a very important milestone for Dell’s quest towards providing Petabyte storage solutions.

    For details about the performance characteristics and different capacities of our new version of the NSS, please take a look at the NSS white paper.

    Written by Garima Kochhar, Jose Mario Gallegos, and Xin Chen