High Performance Computing Blogs

High Performance Computing
A discussion venue for all things high performance computing (HPC), supercomputing, and the technologies that enable scientific research and discovery.
  • HPC Performance Comparison of Intel Xeon E5-2600 and E5-2600 v2 Series Processors

    by Ranga Balimidi, Ishan Singh, and Rafi Ikbal 

    Dell recently updated the 12th generation PowerEdge server line with the Intel Xeon E5-2600 v2 series processors. In this blog we compare the performance of the Intel Xeon E5-2600 v2 processors against the previous E5-2600 series processors across a variety of HPC benchmarks and applications. We also compare the performance of 1600MT/s DIMMs with 1866MT/s DIMMs; 1866MT/s is only supported with Intel Xeon E5-2600 v2 series processors. Intel Xeon E5-2600 v2 series processors are supported on Dell PowerEdge R620, R720, M620, C6220 II, C8220 and C8220x platforms with the latest firmware and BIOS updates. 

    Intel Xeon E5-2600 series processors are built on a 32 nanometer manufacturing process using planar transistors. They represent the "tock" step of Intel's tick-tock development model, introducing a new microarchitecture (codenamed Sandy Bridge) to replace the Intel Xeon 5500 series processors, which were built on the architecture codenamed Nehalem.

    Intel Xeon E5-2600 v2 series processors (codenamed Ivy Bridge) are built on a 22 nm manufacturing process using 3D tri-gate transistors. This die shrink corresponds to the "tick" step of Intel's tick-tock model.

    To keep the comparison between the Intel Xeon E5-2695 v2 and Intel Xeon E5-2665 server configurations consistent, we used processors of the same frequency and wattage from both processor families.

    Cluster Configuration

    Hardware

    Server Model: PowerEdge R620

    Processors: Dual Intel Xeon E5-2665 2.4GHz (8 cores, 115W), 16 cores total per server, or Dual Intel Xeon E5-2695 v2 2.4GHz (12 cores, 115W), 24 cores total per server

    Memory: 128GB total per server, configured as 1 x 16GB dual-rank DDR3 RDIMM per channel (8 x 1600MT/s 16GB DIMMs or 8 x 1866MT/s 16GB DIMMs)

    Interconnect: Mellanox ConnectX-3 FDR InfiniBand, connected back-to-back

    Software

    Operating System: Red Hat Enterprise Linux 6.4 (kernel version 2.6.32-358.el6 x86_64)

    Cluster Software: Bright Cluster Manager 6.1

    OFED: Mellanox OFED 2.0.3

    Intel Compiler: Version 14.0.0

    Intel MPI: Version 4.1

    BIOS: Version 2.0.3

    BIOS Settings: System Profile set to Max Performance (Logical Processor disabled, Turbo enabled, C states disabled, Node Interleave disabled)

    Applications

    HPL: v2.1, from Intel MKL v11.1

    STREAM: v5.10, array size 160000000, 100 iterations

    NAS Parallel Benchmarks: v3.2, Class D problem size

    WRF: v2.2, input data CONUS 12km

    LINPACK or HPL

    The LINPACK benchmark measures a system's floating-point computing power by timing how fast it solves a dense system of linear equations. HPL (High Performance Linpack) is the distributed-memory implementation used here; it relies on an optimized numerical linear algebra library, Intel MKL in this study.
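    To put HPL numbers in context, efficiency is usually reported as the sustained result (Rmax) divided by the theoretical peak (Rpeak = sockets x cores x clock x floating-point operations per cycle). The sketch below is illustrative only and is not part of the original study; the sustained value is a hypothetical placeholder, while the core count, clock and AVX rate match the E5-2695 v2 described above.

    #include <stdio.h>

    int main(void) {
        double ghz = 2.4;           /* base clock, GHz */
        int sockets = 2;
        int cores_per_socket = 12;  /* Intel Xeon E5-2695 v2 */
        int flops_per_cycle = 8;    /* double precision with AVX */

        double rpeak = sockets * cores_per_socket * ghz * flops_per_cycle; /* GFLOPS */
        double rmax  = 400.0;       /* hypothetical sustained HPL result, GFLOPS */

        printf("Rpeak = %.1f GFLOPS\n", rpeak);
        printf("HPL efficiency = %.1f%%\n", 100.0 * rmax / rpeak);
        return 0;
    }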

    STREAM

    The STREAM benchmark is a simple synthetic program that measures sustained memory bandwidth in MB/s. It uses the COPY, SCALE, SUM and TRIAD kernels to evaluate memory bandwidth. The operation performed by each kernel is shown below:

    COPY:             a(i) = b(i)
    SCALE:           a(i) = q*b(i)
    SUM:              a(i) = b(i) + c(i)
    TRIAD:            a(i) = b(i) + q*c(i)
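
    For reference, here is a minimal C sketch of the TRIAD kernel; the actual STREAM benchmark adds the other kernels, timing, repeated iterations and result verification.

    #include <stdio.h>
    #include <stdlib.h>

    #define N 160000000L   /* array size used in this study */

    int main(void) {
        double *a = malloc(N * sizeof *a);
        double *b = malloc(N * sizeof *b);
        double *c = malloc(N * sizeof *c);
        if (!a || !b || !c) return 1;

        for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

        double q = 3.0;
        for (long i = 0; i < N; i++)   /* TRIAD: a(i) = b(i) + q*c(i) */
            a[i] = b[i] + q * c[i];

        printf("a[0] = %f\n", a[0]);   /* prevent the loop from being optimized away */
        free(a); free(b); free(c);
        return 0;
    }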

    NPB (LU, EP and FT)

    The NAS Parallel Benchmarks (NPB) is a set of benchmarks targeting performance evaluation of highly parallel supercomputers.

    LU: Solves a synthetic system of nonlinear PDEs using a symmetric successive over-relaxation (SSOR) solver kernel (the related BT and SP benchmarks use block tri-diagonal and scalar penta-diagonal solvers, respectively).

    EP: Generates independent Gaussian random deviates using the Marsaglia polar method.

    FT: Solves a three-dimensional partial differential equation using fast Fourier transforms (FFT).

    WRF

    The Weather Research and Forecasting (WRF) Model is a numerical weather prediction system designed to serve atmospheric research and weather forecasting needs. It features two dynamical cores, a data assimilation system, and a software architecture allowing for parallel computation and system extensibility.

    Results and Analysis

    The tests conducted with the Intel Xeon E5-2665 configuration are labeled SB-16c. Tests conducted with 16 cores on the Intel Xeon E5-2695 v2 are labeled IVB-16c. Finally, tests with all 24 cores on the Intel Xeon E5-2695 v2 are referenced as IVB-24c.

    Tests that used 1600MT/s DIMMs are designated with the suffix 1600, while tests that used 1866MT/s DIMMs are designated with the suffix 1866.

    Single Node Performance:

    For single node runs, we have compared the performance obtained with the server’s default configurations with both SB and IVB processors, using all the cores available in the system. In addition, for WRF and NPB-EP, we have also compared the performance of the server after turning off 4 cores per processor for Intel Xeon E5-2600 v2 configurations.

    The following graph shows the single node performance gain with the Intel Xeon E5-2600 v2 series when compared to the E5-2600 series. For HPL, only the out-of-box sustained performance is compared when utilizing all the cores in the server. Since NPB-LU and NPB-FT require the number of processes to be a power of 2, their runs are not shown for IVB-24c.

    Relative performance is plotted using the SB-16c-1600 configuration as the baseline.

    HPL yielded 1.53x sustained performance on IVB-24c as compared to SB-16c. This is primarily due to the increase in the number of cores. NPB and WRF yielded up to ~7-10% improvement when executed on 16 cores of the Intel Xeon E5-2695 v2 when compared to SB-16c. WRF performs 22% better with IVB-24c than with SB-16c, and NPB-EP shows ~38% improvement with IVB-24c compared to SB-16c. NPB-EP improves more than WRF because of its embarrassingly parallel nature, which requires little communication among MPI processes and therefore benefits greatly from the increase in core count.

    The performance increase of WRF, NPB-EP and NPB-FT on 1866MT/s DIMMs over 1600MT/s DIMMs is 2.35%, 0.26% and 2.73% respectively, while NPB-LU shows a 10% increase in performance. NPB-LU's large problem size makes it considerably more sensitive to the faster memory than NPB-EP, NPB-FT or WRF.

    Dual Node performance:

    For dual node tests we have compared the performance obtained with the server's default configurations with both SB and IVB processors. In addition, for WRF and NPB-EP we have also compared the performance of the server after turning off 4 cores per processor for the Intel Xeon E5-2600 v2 configurations.

    The two-node cluster in these tests is connected back-to-back via InfiniBand FDR. All dual node tests were conducted with 1600MT/s memory DIMMs. 

    The following graph shows dual node performance gains with the Intel Xeon E5-2600 v2 series when compared to the E5-2600 series processors, plotted as IVB-48c and SB-32c respectively. Since the E5-2665 has 8 cores per socket compared to 12 cores in the Intel Xeon E5-2695 v2, one set of results was taken with four cores per Intel Xeon E5-2600 v2 processor shut down through the BIOS; it is plotted as IVB-32c.

    HPL was executed on a two node cluster with the E5-2665 (32 cores total, 16 cores per server) and the E5-2695 v2 (48 cores total, 24 cores per server). HPL yielded 1.52x sustained performance on IVB-48c as compared to SB-32c, whereas WRF, NPB-EP and NPB-LU showed a performance improvement of ~2.5%. There is a ~7-8% increase in performance with WRF and NPB-EP on 32 cores, and a ~22-32% difference when the 48-core E5-2695 v2 runs are compared to the 32-core E5-2665 runs.

    The following graph shows the STREAM results:


    The graph compares the memory bandwidth of the E5-2600 v2 processor to its predecessor, the E5-2600. With E5-2600, the maximum supported memory speed is 1600MT/s. With E5-2600 v2, that maximum is 1866MT/s. We’ve compared 1600MT/s DIMMs for SB-16c and IVB-24c, and also plotted the improved memory bandwidth with 1866MT/s on IVB-24c.

    The IVB-24c test shows a ~15% increase in memory bandwidth with 1866MT/s DIMMs when compared to IVB-24c with 1600MT/s DIMMs, due to the higher frequency of the 1866MT/s DIMMs. It shows a ~27% increase when compared to SB-16c with 1600MT/s DIMMs. This increase is due to the dual memory controllers on the E5-2695 v2 processor, which support two memory channels each, compared to the single four-channel memory controller on the E5-2665 processor.
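
    As a back-of-the-envelope reference (not part of the original measurements), the theoretical peak memory bandwidth per socket is the number of channels times the transfer rate times 8 bytes per 64-bit transfer; sustained STREAM numbers are a fraction of this peak.

    #include <stdio.h>

    int main(void) {
        int channels = 4;                        /* DDR3 channels per socket */
        double rates[] = {1600.0, 1866.0};       /* MT/s */
        for (int i = 0; i < 2; i++) {
            double gbs = channels * rates[i] * 8.0 / 1000.0;   /* GB/s per socket */
            printf("%4.0f MT/s: %.1f GB/s theoretical peak per socket\n", rates[i], gbs);
        }
        return 0;
    }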

    Conclusion

    In this study, we found that E5-2600 v2 processors deliver a significant performance improvement over E5-2600 processors. The increased core count, larger L3 cache and dual memory controllers all contribute to this improvement. The largest gains were seen with embarrassingly parallel applications like NPB-EP. We also see an increase in performance with 1866MT/s DIMMs over 1600MT/s DIMMs.

    References

    1. www.intel.com
    2. http://www.netlib.org/linpack/
    3. http://www.nas.nasa.gov/publications/npb.html 
    4. http://www.cs.virginia.edu/stream/ref.html
  • How to Enable "HPC-mode" to Achieve up to 6% Improvement in HPL Efficiency

    By Nishanth Dandapanthula and Garima Kochhar

    HPC (High Performance Computing) mode is a new feature introduced in the BIOS which improves the performance of certain workloads on Dell servers based on AMD Interlagos processors. This blog describes how to enable and take advantage of the HPC mode and includes some performance results of the impact of HPC mode on a PowerEdge R815 server.

    Enabling HPC mode through BIOS

    The BIOS version which introduces HPC mode on the R815 is 2.8.2. To enable HPC mode through the BIOS, the BIOS must be set as shown in Table 1. Figures 1 and 2 show screenshots of the steps involved in enabling HPC mode through the BIOS. Note that setting HPC mode to "enabled" within the Processor Settings tab alone will not fully enable HPC mode.

    Table 1: Enabling HPC mode through BIOS

    Figure 1: Enable HPC mode in the Processor Settings tab

    Figure 2: Change options in the Power Management Tab

    Enabling HPC mode through DTK

    To enable HPC mode in a cluster environment, Dell's OpenManage Deployment Toolkit (DTK) can be used. The power management settings listed above can be set using the existing syscfg command line. The new parameter for HPC mode is "--hpcmode"; this parameter is introduced with DTK v4.1. Details are listed in Table 2 below.

    Table 2: HPC mode through DTK

    Other dependencies

    The Red Hat Enterprise Linux kernel 2.6.32-220.17.1.el6 or later is needed for HPC mode to function. Without the support included in this kernel, the server will kernel panic on boot when HPC mode is enabled in the BIOS.

    Impact of HPC mode

    To measure the impact of HPC mode on the performance of the server we used the High Performance Linpack (HPL) benchmark. The prebuilt HPL binaries were obtained from http://developer.amd.com/libraries/acml/downloads/pages/default.aspx. These binaries were built using Open64 compilers. Table 3 shows the test server configuration and Table 4 details the performance results. This evaluation was done on a single server.

    Table 3: Test Server Configuration


    Table 4: Impact of HPC mode

     

    From Table 4, it can be seen that HPC mode provides up to 6% improvement in HPL efficiency. This increased performance is at the expense of higher power consumption and is recommended only for those environments where the power available can support this mode of operation. Another caveat to be noted is that the performance improvement provided by HPC mode for workloads other than HPL is minimal.

    The table compares the results of the new "HPC mode" BIOS option to the previous "Max Performance" Power Management option. For the "HPC mode" BIOS option to take effect, the "Power Management" option must be set to "Custom", the "CPU Power and Performance Management" option must be set to "OS DBPM", and the "Fan Power and Performance Management" option must be set to "Maximum Performance".



  • HPC performance on the 12th Generation (12G) PowerEdge Servers

    By Nishanth Dandapanthula

    What can you expect from the latest servers from Dell? What kind of performance and energy efficiency do all those speeds and feeds translate to? We spent the last several weeks in the HPC lab at Dell putting the 12G servers through some tests and this blog captures some of those results.

    Dell's all new 12th generation (12G) dual socket PowerEdge® servers feature the Intel® Xeon® E5-2600 series processors. These processors are based on the latest Intel micro-architecture, codenamed Sandy Bridge. 12G servers include many features beyond Sandy Bridge: there have been enhancements in systems management, power efficiency, network adapters (including options to mix 1GbE and 10GbE), SSD drives and so on.

    In this blog, we focus on compute performance and energy efficiency. We quantify the performance improvement provided by the 12G servers when compared to the previous 11th generation (11G) servers. The 11G servers, released in 2009, were based on the Intel Xeon 5500 and 5600 series processors (Nehalem-EP and Westmere-EP). We use a variety of applications and micro benchmarks for our comparison. This article gives a detailed account of single server performance evaluation comparing 11G and 12G servers.

    Sandy Bridge vs. Westmere

    The 11G servers included the Xeon X5600 series processors (Westmere). Table 1 describes the basic differences between Sandy Bridge and Westmere. With the increased number of cores, memory channels, QPI links and so on, it is clear that Sandy Bridge will have a profound impact on performance when compared to Westmere. Intel also introduced Advanced Vector Extensions (AVX) [1] in Sandy Bridge. Among its other advantages, AVX doubles the number of floating-point operations per cycle when compared to Westmere or Nehalem, which provides a large boost in performance. A complete description of AVX is provided in [2].

    Table 1: E5-2600 Vs. X5600
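
    To make the FLOPS-per-cycle difference concrete, here is a rough sketch (not from the original post) of the theoretical double-precision peak for the two dual-socket configurations compared in this study; Westmere (SSE) retires 4 DP FLOPS per cycle per core, while Sandy Bridge (AVX) retires 8.

    #include <stdio.h>

    int main(void) {
        /* name, GHz, cores per socket, DP FLOPS per cycle */
        struct { const char *name; double ghz; int cores; int fpc; } cpu[] = {
            {"X5660 (Westmere, SSE)",       2.8, 6, 4},
            {"E5-2680 (Sandy Bridge, AVX)", 2.7, 8, 8},
        };
        for (int i = 0; i < 2; i++) {
            double peak = 2 * cpu[i].cores * cpu[i].ghz * cpu[i].fpc;  /* 2 sockets, GFLOPS */
            printf("%-28s %.1f GFLOPS theoretical peak per node\n", cpu[i].name, peak);
        }
        return 0;
    }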

    Machine Configuration & Experimental Setup

    To quantify the performance advantage provided by 12G servers over 11G servers for HPC compute workloads, we compare the PowerEdge R620 (12G) server to the PowerEdge R610 (11G) server. Both these servers have a 1 U form factor and are 2 socket systems. Table 2 and Table 3 provide the server configurations of both the machines. The versions of BIOS and iDRAC (integrated Dell Remote Access Controller) used were the latest revisions at the time of the experiments. Table 4 provides the BIOS settings used for the experiments. Note that the evaluation R620 used for the tests was an engineering prototype machine, with the latest test firmware and BIOS at the time.

    Table 2: R620 Configuration

    Table 3: R610 Configuration

    Our base processor on the R620 is the Intel Xeon E5-2680, which is a 2.7GHz (C1 stepping, prototype), 130W processor. To match the core speed of the base processor on the R610, we picked the X5660, which is a 2.8GHz, 95W Westmere processor. We also used the X5690, a 3.46GHz, 130W Westmere processor, to match the wattage of the base processor.

    Table 4: BIOS Settings

    Memory Bandwidth

    Figure 1: Stream Triad Memory Bandwidth

    The Stream [4] benchmark is used to measure the memory bandwidth of the system. Figure 1 shows that, relative to the previous generation PowerEdge R610 server:

    -       There is an ~85 % increase in memory bandwidth when 1600 MHz DIMMs are used on the R620

    -       A ~61 % increase in memory bandwidth is measured when 1333 MHz DIMMs are used on the R620.

    -       Taking into account the additional cores on the R620, the memory bandwidth per core on the R620 is still better by 21-39% when compared to the R610.

    HPL Performance and Energy Efficiency


    For the HPL runs on both server configurations, we used Intel MPI 4.0.3.008 and Intel MKL 10.3. The problem size for each HPL run is kept constant at 90% of the total server memory. HPL efficiency, measured as the ratio of sustained performance to theoretical peak performance, shows a 6% improvement on the R620 server when compared to the R610.
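
    The 90%-of-memory sizing rule translates into an HPL problem size N such that the N x N matrix of doubles fills roughly that much memory. The sketch below is illustrative only; the memory size and block size are hypothetical placeholders rather than the values used in this study (compile with -lm).

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double total_mem_gib = 64.0;    /* hypothetical node memory */
        int nb = 192;                   /* hypothetical HPL block size */

        double bytes = 0.90 * total_mem_gib * 1024.0 * 1024.0 * 1024.0;
        long n = (long)sqrt(bytes / 8.0);    /* 8 bytes per double */
        n -= n % nb;                         /* round down to a multiple of NB */

        printf("N = %ld\n", n);
        return 0;
    }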

    Figure 2 represents the results pertaining to a single server. Results are presented relative to the PowerEdge R610 configured with 2.8GHz processors.

    -       There is a 175% increase in absolute performance when similar core speed processors are used (bars for R610, 2.8GHz, 95W and R620, 2.7GHz, 130W, 1333MHz), with the 12G server performing significantly better.

    -       There is not much of a difference in HPL performance on the 12G servers when 1333MHz DIMMs are swapped for 1600MHz DIMMs. This validates the study in [3], which shows that HPL is not sensitive to memory speed.

    Figure 2: HPL Performance

    Figure 3 shows the energy efficiency of a server while HPL is being run. Energy efficiency is measured as the performance delivered for each watt of power consumed (Performance/W or GFLOPS/W). Results are presented relative to the PowerEdge R610 configured with 3.46GHz processors.

    -       There is a 100% increase in GFLOPS/Watt when compared to the R610. That is, a 12G server provides double the performance while consuming the same amount of power as an 11G server. This can be attributed to the 33% more cores on the Sandy Bridge processors compared to Westmere, and to the increase in the number of floating-point operations executed per clock cycle on Sandy Bridge. It is also indicative of the overall energy efficient design of the Dell PowerEdge R620.

    Figure 3: Power Consumption when running HPL

    Idle Power

    Figure 4: Idle Power Consumption

    Idle power is measured as the power consumed by a server after it has reached a stable state (the boot process is complete) but is idle, with no jobs running on the system. Most data centers tend to have some downtime during off-peak hours, so idle power is an important metric for the energy efficiency of the data center when no jobs are running. Figure 4 depicts the relative idle power usage for different configurations of the servers. Results are presented relative to the PowerEdge R610 configured with 3.46GHz processors.

    -       The 12G servers consume 21% less power compared to an 11G server when processors with similar wattage are used (bars corresponding to the R610 with 3.46GHz, 130W processors and 1333MHz DIMMs and the R620 with 2.7GHz, 130W processors and 1333MHz DIMMs).

    The idle power consumed by the 12G machines is strikingly low. This is due to the energy efficiency improvements made not just in the Intel Sandy Bridge processors but also in the overall Dell platform. As mentioned before, 12G servers from Dell have several new features and enhancements beyond Intel's processors.

    Summary & Conclusion

    Studies comparing a 12G server and an 11G server indicate that:

    • HPL performance on the new 12G servers is better by 175% when machines with similar core speed processors are compared
    • HPL energy efficiency is better by 100% when GFLOPS/Watt are compared
    • Memory bandwidth is better by 85% on the 12G servers

    Subsequent blogs will give a detailed account of the communication performance of the 12G servers and their advantages when compared to 11G servers. In the future, we also plan to follow up with a blog providing insight into application-level performance.

    References

    1. http://software.intel.com/en-us/avx/
    2. http://software.intel.com/en-us/articles/intel-avx-new-frontiers-in-performance-improvements-and-energy-efficiency/.
    3. http://i.dell.com/sites/content/business/solutions/whitepapers/ja/Documents/HPC_Dell_11g_BIOS_Options_jp.pdf
    4. http://www.cs.virginia.edu/stream/

     

  • Designing Scalable 10Gb Ethernet Networks (Part 1)

    Many HPC and high throughput computing (HTC) application environments are well served by gigabit Ethernet as the primary cluster interconnect. However, increasing processor core counts and the availability of cost effective quad socket systems are growing the IO demands of compute nodes. When the available bandwidth of a gigabit connection is exceeded, IO wait cycles are introduced, the overall throughput of a compute node is constrained and CPU utilization drops. Transitioning to 10Gb Ethernet is one way to address the increasing IO demand.

    A challenge with 10Gb Ethernet networks for clusters is deploying a network that scales as you grow while minimizing cost. Using a conventional multi-tier design is one possible solution for building a scalable 10Gb network. However, the bandwidth available at the top tier will limit the size of the network that can be built, and often the first tier switches introduce over-subscription into the network because the amount of uplink bandwidth is less than the bandwidth required by the systems connected to the switches. The solution to these limitations is to build the network using a switch like the Dell Force10 Z9000 in a "fat tree" topology, taking advantage of the Z9000's distributed core capabilities. The Distributed Core Architecture Using the Z9000 Core Switching System White Paper discusses the advantages of this approach and the various communications protocols used in implementation, but does not cover how you actually design a fat tree network. The Dell Force10 Z9000 is a line-speed, 32-port 40GbE, two rack unit, top of rack (TOR) switch. Each 40Gb QSFP port can be split into four 10Gb SFP+ ports using a simple splitter cable.

    A two tier fat tree network uses what are commonly called "leaf" and "spine" switches. Leaf switches are switches at the edge of the network that connect to the compute or storage node elements in the cluster. Spine switches make up the second tier of switches that connect the leaf switches together. I will cover non-blocking solution design in this posting and follow up with some possibilities for oversubscribed designs at a later date.

    A non-blocking leaf switch configuration using the Z9000 is made by splitting half of the Z9000's 32 40Gb ports into 10Gb ports, enabling the connection of up to 64 compute nodes (each 40Gb port is split into four 10Gb ports). The remaining 16 40Gb ports are used as uplink ports to connect into spine switches. An equal number of 40Gb ports is used for connections to compute nodes and to spine switches, so this is a non-blocking configuration.

    To complete the fat tree network, leaf switches are connected to spine switches. The number of spine switches needed is determined by counting the number of 40Gb uplinks from all leaf switches and dividing by 32 (since there are 32 40Gb ports in the Z9000). To connect the network, leaf switch uplinks are evenly divided among the spine switches.

    The following example has four leaf switches in a non-blocking configuration, which can support a 256-node cluster. There are a total of 64 40Gb uplinks spread across two Z9000 spine switches.

    The maximum number of non-blocking 10Gb connected systems that can be configured into a single network fabric using the leaf switch configuration described here is 2048; this comes from multiplying the number of 40Gb ports in a spine switch (32) by the number of nodes that can be connected to each leaf switch (64). If we change from 40Gb to 10Gb connections between the spine and leaf switches, by splitting each QSFP port into four SFP+ ports, the Z9000 becomes a 128-port 10Gb switch and the maximum number of non-blocking 10Gb ports in a single network fabric grows to 8192 (128 spine switch ports * 64 leaf switch node connections).
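
    The sizing arithmetic can be restated in a few lines of code; the sketch below simply reproduces the port math described above for the Z9000 (it is not a Dell tool).

    #include <stdio.h>

    int main(void) {
        int z9000_40g_ports = 32;

        /* Non-blocking leaf: half the 40Gb ports split into 10Gb server ports,
           the other half used as 40Gb uplinks. */
        int leaf_uplinks   = z9000_40g_ports / 2;            /* 16 */
        int leaf_10g_ports = (z9000_40g_ports / 2) * 4;      /* 64 servers per leaf */

        /* Example from the text: four leaf switches. */
        int leaves        = 4;
        int total_uplinks = leaves * leaf_uplinks;            /* 64 */
        int spines        = total_uplinks / z9000_40g_ports;  /* 2  */
        int servers       = leaves * leaf_10g_ports;          /* 256 */

        /* Maximum fabric size with 40Gb spine ports. */
        int max_servers = z9000_40g_ports * leaf_10g_ports;   /* 2048 */

        printf("%d leaves -> %d spine switches, %d servers (fabric maximum %d)\n",
               leaves, spines, servers, max_servers);
        return 0;
    }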

    Designing a conventional 10Gb multi-tier topology either limits the size of your network to hundreds of ports or requires the introduction of oversubscription. Also, purchasing a core switch of sufficient size could cost hundreds of thousands of dollars. Using a fat tree topology based on the Z9000 enables a massive number of nodes to be connected into a scalable, high performing network. You can start small, grow as your cluster grows, and dramatically reduce the cost of the solution.

    In part 2 I will cover how to design oversubscribed fat tree networks.

  • HPC I/O performance using PCI-E Gen3 slots on the 12th Generation (12G) PowerEdge Servers

    By: Nishanth Dandapanthula, Munira Hussain

    Overview

    The new generation of Intel Sandy Bridge servers has PCIe Generation 3 slots available, which offer benefits in bandwidth and latency that are useful for inter-node communication in a High Performance Computing cluster. In terms of transfer rate, PCIe Gen3 offers up to 8 GT/s versus the 5 GT/s provided by older PCIe Gen2 slots. Additionally, PCIe Gen3 uses a more efficient encoding scheme (128b/130b instead of 8b/10b), which results in lower overhead and delivers greater bandwidth and lower latency.

    In this blog we focus on the performance comparison between PCIe Gen2 and PCIe Gen3 and the impact this has on bandwidth and latency. The performance improvement is measured for both Quad Data Rate (QDR) and Fourteen Data Rate (FDR) InfiniBand adapters on the 12G servers.

    The Fourteen Data Rate (FDR) InfiniBand adapters from Mellanox [3] are PCIe Gen3 based cards delivering up to 54 Gbit/s of theoretical bandwidth on a 4X link. The Quad Data Rate (QDR) InfiniBand adapters are available as both PCIe Gen2 and PCIe Gen3 cards; the theoretical bandwidth for the QDR adapters is 32 Gbit/s on a 4X link.
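
    To make the encoding overhead concrete, here is a rough sketch (not from the original post) of the usable-bandwidth arithmetic, lanes times signaling rate times encoding efficiency, for the links involved; an x8 slot is assumed for the host adapters.

    #include <stdio.h>

    int main(void) {
        struct { const char *name; int lanes; double gtps; double enc; } links[] = {
            {"PCIe Gen2 x8 (8b/10b)",        8,  5.0,      8.0 / 10.0},
            {"PCIe Gen3 x8 (128b/130b)",     8,  8.0,    128.0 / 130.0},
            {"InfiniBand QDR 4X (8b/10b)",   4, 10.0,      8.0 / 10.0},
            {"InfiniBand FDR 4X (64b/66b)",  4, 14.0625,  64.0 / 66.0},
        };
        for (int i = 0; i < 4; i++) {
            double gbps = links[i].lanes * links[i].gtps * links[i].enc;
            printf("%-30s %.1f Gbit/s usable\n", links[i].name, gbps);
        }
        return 0;
    }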

    Experimental Setup

    With the introduction of FDR InfiniBand adapters and PCIe Gen3 slots, the 12G servers provide an enormous improvement in bandwidth and latency from the interconnect perspective. We used the experimental setup as shown in Table 1 (12G) and Table 2 (11G) to quantify the advantage provided by 12G servers at a micro benchmark level, when compared to the previous 11G servers. To obtain the best possible latency, the BIOS options have been set as mentioned in Table 3. The servers were connected back to back without a switch in order to demonstrate the absolute performance improvement without considering the overhead introduced by the switch.

    Table 1: R620 Configuration

    Table 2: R610 Configuration

    Table 3: BIOS Settings

    Results

    The following results were obtained using MVAPICH 1.2 [1] and OSU Micro-Benchmarks 3.1.1 [2]. We compared the performance of three different interconnect configurations, FDR PCIe Gen3, QDR PCIe Gen3 and QDR PCIe Gen2, using the latency, bandwidth and bi-directional bandwidth benchmarks from the OSU benchmark suite.

    From Figure 1, we can infer that:

    -        An 87 % improvement in bandwidth was obtained when QDR PCIe Gen2 (11G) and FDR PCIe Gen3 (12G) are compared.

    -        A 16 % improvement when QDR PCIe Gen2 (11G) and QDR PCIe Gen3 (12G) are compared. This can be attributed to the benefits provided by the PCIe Gen3 Slot.

    Figure 1: OSU Bandwidth

    Figure 2 represents the performance comparison using the OSU Bidirectional Bandwidth benchmark:

    -        A 69% improvement is seen when QDR PCIe Gen2 (11G) and FDR PCIe Gen3 (12G) are compared.

    -        A 20 % improvement when QDR PCIe Gen2 (11G) and QDR PCIe Gen3 (12G) are compared.

    Figure 2: OSU Bidirectional Bandwidth

    Figure 3 and Figure 4 depict the OSU Latency benchmark comparison over different interconnect speeds. With the new 12G servers, we hit the lowest micro benchmark level latency, when the FDR PCIe Gen3 adapters are used.

    -        For small message sizes, the latency is better by 40 % when QDR PCIe Gen2 (11G) and FDR PCIe Gen3 (12G) are compared.

    -        The latency numbers for small message sizes show minimal difference between QDR PCIe Gen3 and FDR PCIe Gen3 performance.  For large message sizes, the difference in performance is significant when QDR PCIe Gen3 and FDR PCIe Gen3 are compared.

    Figure 3: OSU Latency (Small Message Size)

     

    Figure 4: OSU Latency (Large Message Size)

    Summary & Conclusion

    Studies comparing an FDR PCIe Gen3 adapter and a QDR PCIe Gen2 adapter using the OSU benchmark suite indicate that:

    -        Bandwidth is better by 87%

    -        Bi-directional Bandwidth is better by 69%

    -        Latency is lower by 40%

    In subsequent blogs, we plan to present application-level studies on a larger cluster to understand the performance at scale.

    References

    [1] http://mvapich.cse.ohio-state.edu/

    [2] http://mvapich.cse.ohio-state.edu/benchmarks/

    [3] http://www.mellanox.com/content/pages.php?pg=infiniband_cards_overview&menu_section=41

  • 2GB/core is the HPC Gold Standard … But I Know I Need 48GB/node

    I got some e-mail after the previous blog (http://dell.to/144sqai) on 2GB/core recommendations for HPC compute nodes. It turns out that some of you know the memory capacity requirements of your workloads and it is currently 48GB per (2-socket) compute node. Kudos for determining the minimum amount of memory required! 

    But configuring to the minimum required memory assumes that less memory is "better": it costs less money and has fewer potential negatives. More on that later.

    Continuing, the logic goes that 48GB/node is 24GB/socket on a 2-socket node. And since there are four (4) memory channels per socket on an Intel SandyBridge-EP processor (E5-2600) and one would like to maximize the memory bandwidth, one needs 4 x 6 GB DIMMs to achieve the required 24GB per socket.  But, alas, there is no such thing as a 6 GB DIMM.

    Hence, a 4 GB DIMM and a 2 GB DIMM are used on each memory channel. Several of you shared this configuration data with me. This does many things correctly:

    1. Complies with my previous Rule #1: Always populate all memory channels with the same number of DIMMs. (That is, on all processors use the same DIMMs Per Channel or DPC). Check.
    2. Complies with my previous Rule #2: Always use identical DIMMs across a memory bank. Check.
    3. Does not use 3 DPC, which would negatively affect memory performance. Check.
    4. Meets the known memory capacity requirements. 4 GB plus 2 GB is 6 GB. 6GB per memory channel is 24GB/socket and the required 48GB/node. Check.

    Therefore, the memory configuration is balanced and a good one, technically speaking.

    However, let’s dig deeper and take into account a few other things. One is my previous Rule #3: Always use 1 DPC (if possible to meet the required memory capacity). The others are to consider today’s price and tomorrow’s requirements.

    As stated in the previous blog, I like to create the “best” memory configuration for a given compute node and then see if the memory/core capacity is sufficient. In other words, in high performance computing take memory performance into account (first) in addition to the age-old capacity requirements. And as usual, price comes into play. In this 48GB/node case, the price is indeed a driving factor.

    To be consistent with the previous blog, we’ll use the same memory sizes and prices, based upon the Dell R620, a general purpose, workhorse, 1U, rack-mounted, 2-socket, Intel SandyBridge-EP (E5-2600) compute node platform. Below is that same snapshot of the memory options and their prices taken on 12-July-201




    Here’s the layout of a 48GB/node configuration using 4 GB DIMMs and 2GB DIMMs.  Also, in the figure is the total memory price for that configuration.

    Here’s an alternate layout using 8 GB DIMMs.  Also, in the figure is the total memory price for this configuration.


    Here are the key features of the second configuration: 

    • More than the 48GB capacity required
    • Less $$$ (per node; consider this ~$300 savings times the total number of nodes)
    • Fewer parts to potentially fail (in fact, half as many parts)
    • Fewer types of spare DIMM parts to stock
    • Easier correct replacement of failed DIMMs
    • More available memory slots for future expansion
    • “Future proof”

    “Future proof?  What does he mean by that?”   Did you notice the memory per core in the figures above?  The 48GB/node configuration using 4 GB DIMMs and 2GB DIMMs is 3GB/core for today’s mainstream 8-core processor.  The 48GB/node specification may in fact be tied to the GB/core and the core count per processor.  Today’s node may need 48GBs, but a node with more cores may need more memory. 

    We know from several public places (e.g., http://www.sqlskills.com/blogs/glenn/intel-xeon-e5-2600-v2-series-processors-ivy-bridge-ep-in-q3-2013/ ) that the follow-on to the Intel SandyBridge-EP processor (E5-2600), codenamed Ivy Bridge-EP, will officially be called the Intel Xeon E5-2600 v2.  The mainstream v2 processor will feature ten (10) cores, compared to today’s 8 cores.  With this future processor, the alternate memory configuration above using 8 x 8GB provides a total of 64GB/node.  This 64GB/node on a 2-socket node with 20 cores is 3.2GB/core, still exceeding the 3GB/core of the 48GB node today.
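
    The GB-per-core arithmetic discussed above can be checked with a few lines of code; this sketch just restates the math for the two configurations, it is not a sizing tool.

    #include <stdio.h>

    int main(void) {
        int sockets = 2, channels_per_socket = 4;

        /* Config A: one 4GB + one 2GB DIMM per channel (2 DPC). */
        int gb_a = sockets * channels_per_socket * (4 + 2);   /* 48 GB/node */
        /* Config B: one 8GB DIMM per channel (1 DPC). */
        int gb_b = sockets * channels_per_socket * 8;         /* 64 GB/node */

        printf("Config A: %d GB/node, %.1f GB/core on 16 cores\n", gb_a, gb_a / 16.0);
        printf("Config B: %d GB/node, %.1f GB/core on 16 cores, %.1f GB/core on 20 cores\n",
               gb_b, gb_b / 16.0, gb_b / 20.0);
        return 0;
    }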

    If you have comments or can contribute additional information, please feel free to do so.  Thanks.  --Mark R. Fernandez, Ph.D.

    @MarkFatDell

    #Iwork4Dell

  • 12G HPC Solution with ROCKS+ from StackIQ

    By Munira Hussain and Anil Maurya

    Rocks+ 6.0.1 is based on the open source Rocks project and is supported by StackIQ. The solution stack is tested, verified and validated on the latest Dell hardware based on Intel Sandy Bridge and AMD Interlagos, and is designed to automate the deployment and management of High Performance Computing clusters. Additionally, it incorporates Dell-recommended environment settings, parameters and scripts that are set and configured automatically during installation.

    The main highlight of the Rocks+ 6.0.1 release is the addition of support for the following components:

    -        Support for Intel Sandy Bridge-EP Servers: R620, R720, M620 and C6220.

    -        Red Hat Enterprise Linux 6.2 (kernel -2.6.32-220.el6.x86_64).

    -        nVidia CUDA 4.1 roll for GPGPU for R720 and C6220/C6145 servers.

    -        Mellanox OFED 1.5.3-3 to support PCI-E Gen 3 Fourteen Data Rate (FDR) technology.

    -        Additional scripts and tools in the Dell roll to configure optimal BIOS settings and iDRAC/BMC settings specific for Dell hardware. (see below for more details)

    -        Introduction of a GUI based implementation to manage, monitor and run commands from the web console.

     

    Rocks+ 6.0.1 has many features as an HPC software solution stack, including:

    -        Physical or virtual infrastructures can be quickly provisioned, deployed, monitored, and managed.

    -        Pre-packaged, automatically configured software stacks called “Rolls” are available to simplify Big Infrastructure deployments on physical or virtual servers. These rolls have both data center and cloud versions.

    -        syscfg commands are used to enforce HPCC recommended BIOS settings and console redirection on all the nodes.

    -        HPC roll has MPI libraries, so OpenMPI and mpich2 are natively supported.

     

     

    Additionally, StackIQ has incorporated a new and enhanced web-based interface which:

     

    -        Allows the administrator to monitor the status of nodes and gather CPU and network statistics.

    -        Provides a view of the network interface settings of all the nodes.

    -        Allows viewing and changing attribute settings, such as hostname, for all the nodes.

    -        If the Ganglia roll is installed, the administrator can use the Ganglia web interface to monitor CPU, I/O and network statistics for the cluster.

    -        Rocks+ 6.0.1 uses the Avalanche installer to deploy nodes. When Avalanche is deploying multiple systems, a visualization of the deployment is displayed in the GUI, showing nodes pulling packages from each other rather than from the front end, which removes the I/O bottleneck on the front-end installer (head node).

     

     

    ROCKS+ 6.0.1 Monitoring with Ganglia Roll

     

    An HPC cluster is a hierarchical architecture with various components. The jumbo package from StackIQ is a bundled solution that incorporates various software components that tie together to deploy and configure a cluster. The package contains the following rolls; additional rolls can be added or removed per configuration.

    • Base Roll and Core Roll – the main Rocks open source components
    • Kernel Roll – operating system level support
    • Web Roll – adds the GUI-based web console using the Apache web server
    • HPC Roll – MPI middleware libraries and tools
    • Ganglia Roll – open source cluster monitoring tool used to capture health status
    • Sun Grid Engine Roll – open source job scheduler
    • OFED Roll – a choice of Mellanox OFED or QLogic OFED depending on the cluster hardware configuration
    • NVIDIA CUDA Roll – official support for GPGPU drivers and compilers
    • Dell Roll – Dell scripts and tools to configure optimal BIOS, BMC and iDRAC settings and to provide updated drivers and firmware for current and new Dell hardware

     

    References:

    www.stackiq.com.

     

  • TACC's Stampede Gallops to #7 Fastest Computer in the World

    As happens every six months, the Top500 list of fastest supercomputers was released again during SC12 in Salt Lake City. The list represents a great way to track and measure trends, and celebrate the advancements in technological achievement at the very high-end of the high performance computing (HPC) industry.


    One of the newest systems to be recognized by Top500 is the Stampede system at the University of Texas (UT) and the Texas Advanced Computing Center (TACC). We at Dell are excited because we view our partnership with TACC as very valuable in helping to advance not only the capabilities that researchers and students have access to at TACC, but also push the envelope of what is possible.
     
    Personally, what I enjoy most is learning and understanding what HPC is doing to help advance science and research worldwide. To learn more, be sure to watch the Dell Tech Center and www.HPCatDell.com to view TACC's Jay Boisseau deliver a great presentation from SC12 titled, Transforming Science with Stampede.

    So congratulations to our friends at TACC, and the other 499 systems that made this November's list. To learn more about the Top500, or TACC, please see some links below to some recent news stories.
     
    Top500 Summary, Nov. 2012

    Stampede supercomputer gives scientists a powerful new tool

    TACC Presents Petascale Systems and More at SC12
     
    TACC Stampede Overview

  • Dell / Terascala HPC Storage Solution - Lustre Based Storage, HSS5.0

    By Mario Gallegos

     Introduction

    In this blog, we present the latest version of our Lustre based, parallel file system HPC Storage Solution, HSS5.0.

    This release includes a combination of updated hardware and software components, as well as an improved Web UI. The hardware updates include the Intel E5-2600 v2 series processors (code named Ivy Bridge), 1866 MT/s DDR3 memory DIMMs and support for 4 TB Near-Line SAS disks. The updated software components are Lustre version 2.1.5, Lustre client support for RHEL 6.4 and improvements made by our partner Terascala to their TeraOS Web UI, which is used to administer and monitor the solution. These updates allow up to a 33% increase in storage capacity and density, while maintaining or improving overall system performance.

    Figure 1: 960TB Dell | Terascala HPC Storage Solution - typical HSS5.0 “eXtra Large” configuration.


    As can be seen from Figure 1, the new version of the solution keeps the same basic server and storage configuration as its predecessor (refer to the Dell | Terascala HPC Storage Solution 4.5 white paper for more details about the HSS4.5 configuration), while introducing newer, faster processors and memory and larger capacity via 4 TB Near-Line SAS disks. One of the major improvements comes from the use of Lustre version 2.1.5 (replacing Lustre version 1.8.8 in HSS4.5), which allows larger OSTs (from a maximum of 24 TiB to 128 TiB), objects (files) larger than 2 TiB, and a number of stability, security and performance enhancements. In addition, Lustre 2.1.5 offers interoperability with recent 1.8.X clients.


    The improvements made by our partner Terascala to their TeraOS Web UI feature an improved method to discover, identify and resolve hardware problems as well as possible issues caused by applications or users, such as resource contention, suboptimal transfers, and file system-tuning opportunities.


    As an example, consider a situation where the file system access was imbalanced.

    Figure 2: Poor performance due to unbalanced workflow.

    The TeraOS Web UI can provide insight into the situation, allowing administrators to correct the problem, enabling optimal storage system performance:

    Figure 3: Normal performance under a balanced workflow.

    Notice that just by correcting the issue of imbalanced accesses, the file system was able to deliver a throughput improvement of about 4x. Similar situations can arise from seemingly mundane causes: users issuing an "ls -l" on the Lustre file system, applications that use mostly small files or that are programmed to use small I/O accesses, and a number of other scenarios can result in similar issues.

    In addition, the Web UI can provide trend data that can be used to react appropriately to, or even prevent, similar situations, such as a full file system (react to a transient peak in load) or a system that needs to increase its capacity (prevent recurring problems due to a system consistently close to maximum capacity).

    HSS5.0 Offerings

    With the hard disk refresh, the capacities offered on HSS5.0 have increased, as presented in Figure 4.

    Figure 4: Available HSS5.0 Configurations.

    Total U[1] | Drive Size | Raw Capacity | Peak Read[2] Performance | Peak Write[2] Performance
    16U | 1 TB | 120 TB | 6.7 GB/s | 3.5 GB/s
    16U | 2 TB | 240 TB | 6.7 GB/s | 3.5 GB/s
    16U | 3 TB | 360 TB | 6.7 GB/s | 3.5 GB/s
    16U | 4 TB | 480 TB | 6.7 GB/s | 3.5 GB/s
    24U | 4 TB | 960 TB | 6.7 GB/s | 3.5 GB/s
    42U | 4 TB | 1920 TB | 13.4 GB/s | 7 GB/s
    36U | 4 TB | 1440 TB | 20.1 GB/s | 10.5 GB/s
    Custom | Custom | Custom (PB+) | Custom, 10s of GB/s | Custom, 10s of GB/s

    Note: For any custom configuration, please contact your Dell HPC sales representative. 



    [1] Management Server not shown, it can be 1U (R320) for small clusters or 2U (R720XD) for large clusters. 2U size was assumed for all “Total U” calculations.

    [2] The performance values listed are based on the HSS4.5 performance studies performed on an XL configuration with 3 TB disks, which are expected to yield the same sequential throughput for the HSS5.0 configurations listed in the table in Figure 4.

    Summary

    The Dell-Terascala HPC Storage Solution version 5.0 is now available.
    HSS5.0 brings to our customers:

    • Increased Capacity: A larger storage capacity provided by the use of 4TB NLS disk drives and Lustre version 2.1.5.
    • Updated Hardware: Intel Xeon E5-2600v2 series processors (code named Ivy Bridge) and faster 1866 MT/s DDR3 memory DIMMs.
    • New Software:  Lustre version 2.1.5 and Lustre client support for RHEL6.4.
    • Improved GUI:  An improved Web UI that allows better monitoring and analytics used to discover and resolve hardware failures, application issues such as resource contention, and file-system-tuning opportunities.

    Contact your Dell HPC sales representative to get more details about this new version of the Lustre based HPC storage solution.

  • Designing Scalable 10Gb Ethernet Networks (Part 2)

    In my first post on designing scalable 10Gb Ethernet networks I discussed some of the motivations for migrating HPC and HTC computing environments from 1Gb Ethernet to 10Gb Ethernet, and how to design scalable, non-blocking 10Gb networks using the Dell Force10 Z9000. It may be the case that your current computing environment needs more bandwidth than 1Gb Ethernet offers, but a non-blocking 10Gb network is overkill and would be underutilized. It is possible to design scalable 10Gb networks that meet lower IO throughput requirements by introducing oversubscription into the architecture.

    Oversubscription exists when the theoretical peak bandwidth needs of the systems connected to a switch exceed the theoretical peak bandwidth of the uplinks out of the switch. Oversubscription is expressed as the ratio of inputs to outputs (e.g. 3:1), or as a percentage calculated as 1 - (# outputs / # inputs); for example, 1 - (1 output / 3 inputs) = 67% oversubscribed. It is important to remember that oversubscription in a network is not inherently bad. It is a feature that must be designed as part of an overall computing solution. Note that the oversubscription of a network describes the worst case bandwidth for an environment. If all the servers connected to a leaf switch are not saturating their individual 10Gb links at the same time, the actual bandwidth delivered to a server will be higher than the oversubscribed value and may even be the full 10Gb of bandwidth. The actual available bandwidth will depend on the IO access demands across all servers connected to a switch.

    Using the Dell Force10 Z9000TM to design oversubscribed distributed core 10Gb networks is very similar to designing non-blocking networks.  Simply use more of the 40Gb ports of the switch to connect to servers than to connect to spine switches.  Suppose it was determined that a 3:1 oversubscribed 10Gb network was the ideal configuration for an environment.  How do you figure out how many downlinks you need to servers and uplinks to spine switches to achieve 3:1 oversubscription?  I know this sounds weird and too easy to be true, but the port counts are determined by dividing the total number of ports in the switch by the sum of the number of inputs and number of outputs in the oversubscription ratio.  The result of this formula is the number of uplink ports.

    Applying the formula to the Z9000, the number of uplink ports in a 3:1 oversubscribed leaf switch configuration is eight.  Splitting the remaining 24 40Gb ports into 4 10Gb ports each results in 96 10Gb SFP+ links for connection to servers.
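
    The port-count formula can be written down directly; the sketch below just restates the arithmetic from the text for the 3:1 example on a Z9000.

    #include <stdio.h>

    int main(void) {
        int total_40g = 32;                 /* Z9000 40Gb ports */
        int ratio_in = 3, ratio_out = 1;    /* 3:1 oversubscription */

        int uplinks_40g   = total_40g / (ratio_in + ratio_out);   /* 8  */
        int downlinks_40g = total_40g - uplinks_40g;              /* 24 */
        int server_ports  = downlinks_40g * 4;                    /* 96 x 10GbE */
        double oversub_pct = 100.0 * (1.0 - (double)ratio_out / ratio_in);

        printf("%d x 40Gb uplinks, %d x 10GbE server ports, %.0f%% oversubscribed\n",
               uplinks_40g, server_ports, oversub_pct);
        return 0;
    }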

    As in a non-blocking configuration, leaf switches are connected to spine switches to complete the fat tree network. The number of spine switches needed is determined by the same method used for a non-blocking network: count the number of 40Gb uplinks from all leaf switches and divide by 32 (since there are 32 40Gb ports in the Z9000). Network leaf switch uplinks are evenly divided between the spine switches. To achieve balanced performance, you should have the same number of uplinks from a leaf switch connected to each of the spine switches; mathematically, the number of uplinks should be evenly divisible by the number of spine switches.

    It is also possible to use the Dell Force10 S4810TM when building distributed core networks.  The S4810 has 48 10Gb SFP+ ports plus four 40Gb uplink ports and may be used as both leaf and spine switches or can be used as a leaf switch in combination with Z9000 switches for the spine.

    As I stated in my previous post, using a fat tree topology and the distributed core capabilities of the Dell Force10 Z9000TM and Dell Force10 S4810TM switches enables a massive number of nodes to be connected into a scalable, high performing 10Gb network.  Start small and grow as your cluster grows.