High Performance Computing Blogs

A discussion venue for all things high performance computing (HPC), supercomputing, and the technologies that enable scientific research and discovery.
  • SC14 Schedule of Presentations for Thursday, November 20

    Here’s what’s on tap at the two theaters in the Dell Booth (# 1739) on Thursday, November 20:

    Beignet:

    • 10:15 a.m., William Edsall, HPC Lead Analyst, Dow Chemical
    • 1:15 p.m., Rhian Resnick, Asst. Director Middleware and HPC, Florida Atlantic University – “Lessons Learned Building a Research Computing Environment”
    • 2:00 p.m., Padma Raghavan, Ph.D., Penn State

    Chicory:

    • 10:30 a.m., Skip Garner, Ph.D., Virginia Bioinformatics Institute, Founder and Chief Scientist at Genomeon and Helitext – “Mining 8,000+ Genomes for Cancer Diagnostics”
    • 11:00 a.m., Kenneth Buetow, Ph.D., Arizona State University
    • 11:30 a.m., Muhammad Atif, Ph.D., NCI and ANU – “Perspectives on Implementations of an HPC Cloud Backed by Dell Hardware”
    • 1:00 p.m., Jimmy Pike, Senior Fellow and the Chief Architect and Technologist, Dell – “The New Dell and Scalable Solutions:  HPC, Big Data, and Cloud Computing”
    • 1:30 p.m., Paul Muzio, Director CUNY HPC Center, City University of New York – “The City University, The City and the Environment”
  • SC14 Schedule of Presentations for Wednesday, November 19

    Here’s what’s on tap at the two theaters in the Dell Booth (# 1739) on Wednesday, November 19:

    Beignet Theater:

    • 10:15 a.m., Erik Deumens, Ph.D., Director of UF Research Computing, University of Florida – “Secure and High Performance Work on Restricted Data”
    • 11:00 a.m., Dan Majchrzak, Director Research Computing, University of South Florida – “SSERCA: A State-Wide HPC Collaboration”
    • 11:30 a.m., Curt Hillegas, Ph.D., Director of Research Computing, Princeton University – “Computational Infrastructure to Support a Diverse Community of Researchers”
    • 1:30 p.m., Panel Presentation – “HPC in the Cloud: The Overcast has Cleared”
      • Muhammad Atif, Ph.D., National Computational Infrastructure, Australian National University
      • Larry Smarr, Ph.D., University of California, San Diego
      • Roger Rintala, Intelligent Light
      • Boyd Wilson,  Clemson University and Omnibond
    • 3:00 p.m., Panel Presentation – “Data Intensive Computing: the Gorilla Behind the Computation"
      • Ken Buetow, Ph.D., Arizona State University
      • Erik Deumens, Ph.D., University of Florida
      • Niall Gaffney, TACC
      •  William Law, Stanford
    • 4:15 p.m., Jimmy Pike, Senior Fellow and the Chief Architect and Technologist, Dell – “The New Dell and Scalable Solutions:  HPC, Big Data, and Cloud Computing”

    Chicory Theater:

    • 10:30 a.m., Dan Stanzione, Ph.D., Texas Advanced Computing Center
    • 11:15 a.m., Merle Giles, NCSA – “HPC as an Innovation Engine”

     

  • Join Us at SC14! Schedule for Nov. 18 for Presentations in Our Two Theaters

    Welcome to SC14! Join us in Booth #1739 on Tuesday, November 18  for these exciting presentations in our two theaters.

    Chicory Theater:

    • 10:30, Niall Gaffney, Director of Data Intensive Computing, Texas Advanced Computing Center – “Wrangler: A New Generation of Data Intensive Computing”
    • 11:15, Honggao Liu, Ph.D., Deputy Director of the Center for Computation & Technology, Louisiana State University – “Accelerating Computational Science and Engineering with Dell Supercomputers in Louisiana”
    • 1:45, Chris Lynberg, R&D Computer Scientist, Centers for Disease Control – “Advancing CDC Technologies”
    • 2:30, Charlie McMahon, Ph.D., Vice President of Information Technology and CTO, Tulane University – “HPC Cluster Cypress”
    • 3:15, Kevin Hildebrand, HPC Architect for the Division of Information Technology, University of Maryland
    • 4:00, Michael Norman, Ph.D., Director, San Diego Supercomputer Center - “Comet: HPC for the 99 Percent”

    Beignet Theater:

    • 10:15, Henry Neeman, Ph.D., Asst. VP, Information Technology, University of Oklahoma – “The OneOklahoma Friction Free Network”
    • 11:00, John D’Ambrosia, Dell – “Ethernet – the Open Standards Approach to Supercomputing”
    • 11:30, Roger Bielfed, Ph.D., Senior Director Information Technology Services, Case Western Reserve University – “Research Computing and Big Data at Case Western Reserve University"
    • 12:00, Jimmy Pike, Senior Fellow and Chief Architect and Technologist, Dell – “The New Dell and Scalable Solutions: HPC , Big Data, and Cloud Computing”
    • 1:30, Larry Smarr, Ph.D., University of California, San Diego – “Using Dell’s HPC Cloud & Advanced Analytic Software to Discover Radical Changes in the Microbiome in Health and Disease”
    • 2:15, Walt Ligon, Ph.D., Clemson University
    • 3:00, James Lowey, Vice President of Technology, TGEN – “HPC for Genomics, the Next Generation”
    • 3:45, Eldon Walker, Ph.D., Cleveland Clinic

  • Jimmy Pike Leads Strategy Discussions and Breakout Sessions at SC14

    There's always so much to see and learn at the Supercomputing Conference, and SC14 is no exception. Along with an impressive list of guest speakers in our two theaters (Booth #1739), and our panels featuring some of the top experts in HPC, this year we're also featuring a schedule of additional, informative sessions.

    Jimmy Pike, Dell Senior Fellow and the Chief Architect and Technologist will lead three strategy discussions in our booth, as well as host three additional breakout sessions.

    The strategy discussion will focus on how Dell is building on its long-time leadership in providing HPC clusters for modeling and simulation by embarking on a path of greater leadership in the innovation and development of tightly integrated, highly scalable solutions that help achieve better insights and faster answers to comprehensive computational and data challenges.

    The breakout session topics will include:

    • Improving HPC Acceleration with Configurable System Design
    • HPC Storage Options from Integrated Solutions to Roll Your Own Configurations
    • Analyzing Streaming Data with In-Memory Technologies

    More information, including a full schedule, is attached.

     

  • Jimmy Pike Discusses the Upcoming Panels at SC14

    Dell Fellow Jimmy Pike recently spoke with Rich Brueckner of insideHPC about the exciting panel discussions planned for SC14 in New Orleans.  You can hear the podcast in its entirety here. Be sure to visit us at booth #1739 in the Big Easy!

  • Containers, Docker, Virtual Machines and HPC

    by Nishanth Dandapanthula and Joseph Stanfield

    Introduction

    In our previous blog post, we presented a comparative analysis of a bare metal system versus a virtual machine in an OpenStack environment using HPC applications. In this blog we add containers [2] to this mix. The concept behind containers is not new. A variety of systems utilize the core concept of an isolated application, such as BSD Jails [3], Solaris Zones [4], AIX Workload Partitions [5], and LXC containers. In this blog, we attempt to explain what comprises a container, and how it differs from a virtual machine. We will then quantify the difference in performance when comparing a container, bare metal system (BM) and a virtual machine (VM) with the exact same resources in terms of CPU and memory. 

    What are Docker containers?

    Every application has its own dependencies, which include both software (services, libraries) and hardware (CPU, memory) resources. Docker [6] is a lightweight virtualization mechanism that isolates these dependencies for each application by packaging them into virtual containers. These containers can run on any Linux server, making the application portable and giving users the flexibility to run it on any framework: private or public cloud, virtual machines, bare metal machines, and so on. Containers are scalable and are launched in an environment similar to chroot [7].
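    As a purely illustrative sketch (not part of the original study), the snippet below uses the docker-py Python SDK to launch a containerized application while pinning it to specific host cores and capping its memory; the image name, command, and resource limits are hypothetical.

```python
# Hypothetical example using the docker-py SDK (pip install docker).
# The image name, command, and resource limits are illustrative only and
# are not the configuration used in this study.
import docker

client = docker.from_env()

logs = client.containers.run(
    image="rhel6-hpc-app:latest",   # assumed custom image holding the app and its libraries
    command="./run_benchmark.sh",   # assumed entry point inside the image
    cpuset_cpus="0-27",             # pin the container to the host's physical cores
    mem_limit="120g",               # cap memory to mirror the bare metal allocation
    volumes={"/scratch": {"bind": "/scratch", "mode": "rw"}},
    remove=True,                    # clean up the container when it exits
)
print(logs.decode())
```

    Because the container shares the host kernel, the "allocation" above is just a cgroup limit; no guest OS or virtual hardware is involved.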

    How are containers different from a virtual machine?

    With a virtual machine, a good chunk of resources must be allocated and then emulated for the guest OS and hypervisor. Each guest OS runs as an independent entity from the host system, using virtual hardware that requires physical CPU cores, memory, and enough hard drive space for the OS (see figure 1). A Docker container, on the other hand, is executed by the Docker engine instead of a hypervisor and uses the resources available on the host system (see figure 2). Without the overhead of a hypervisor and VM, containers can theoretically achieve the same performance as a bare metal system. Does the Docker engine have non-trivial overhead? The short answer is no; the explanation is in the results section below.

    Pros and Cons

    At an application level, containers would be ideal, considering the loading time and system resources that would need to be committed to launching a virtual machine. But the application must match the ecosystem of the host OS: containers are limited to the host environment on which they are built (kernel, binaries, libraries, etc.). This can be problematic when working in a heterogeneous environment (Windows, multiple Linux distributions, etc.).

    VMs are very helpful in a heterogeneous environment since they have no such dependencies. An administrator can easily deploy multiple operating systems from prepackaged images onto bare metal servers running any OS. Users can also choose from a wide range of management software to fit their specific needs, such as VMware and OpenStack.

    Performance and analysis

     For the scope of this blog, the terminology used is as follows.

    • The bare metal machine (BM) refers to the physical server running RHEL 6.5.
    • The virtual machine (VM) refers to the VM running on a hypervisor on this bare metal machine, using all the cores and memory of the bare metal system.
      • OpenStack Icehouse-3 RDO PackStack (running on a separate server)
      • QEMU KVM and RHEL 6.5
    • The Docker container refers to the containerized applications, which run on the BM using all the available cores and memory.
      • Docker version 1.0 and a custom-built RHEL 6.5 image

    Table 1 describes the test setup for these single node tests. The BIOS options chosen are the HPC defaults; a few relevant ones are listed below. We used a sample set of applications from the HPC domain, both proprietary and open source, for this comparison. The details of the applications are given in Table 2.

      Table 1 Test Bed Configuration


    Table 2 Applications (Application – Application characteristics – Benchmark dataset used – Metric)

    • HPL 2.0 – Compute intensive benchmark measuring the floating point rate of execution – N = 110000, NB = 168 – GFLOPS
    • ANSYS Fluent V15 – Proprietary computational fluid dynamics application – Truck_poly_14m – Rating (Jobs/Day)
    • Stream Triad – Measures the sustained memory bandwidth – N = 160000000 – MB/s
    • NAS Parallel Benchmarks (NPB) – Kernels and benchmarks derived from computational fluid dynamics (CFD) applications – Class D – Rating (Jobs/Day)
    • LS-DYNA 6.1.0 – Proprietary structural and fluid analysis simulation software used in manufacturing, crash testing, and the aerospace and automobile industries – TopCrunch 3 Vehicle Collision – Rating (Jobs/Day)
    • WRF 3.3 – Open source application for atmospheric research and forecasting – Conus 12KM – Rating (Jobs/Day)
    • MILC 7.6.1 – MIMD Lattice Computation, an open source quantum chromodynamics code that performs large scale numerical simulations to study the strong interactions of subatomic physics – Input file from Intel Corp. – Rating (Jobs/Day)

    The performance difference between BMs and VMs was explained previously in [1]. Figures 3 and 4 add Docker containers to this mix and show the performance of containers and VMs relative to BMs for the applications listed above. Since containers are very lightweight compared to VMs, and because they are NUMA aware, they perform on par with BM in all cases, whereas the VM takes a hit in almost all cases. The results also make it evident that the Docker engine has only trivial overhead.

     

    Figure 3 Application performance of Containers and VMs relative to BMs


    Figure 4 Performance of NPB on containers and VMs relative to BMs

    Conclusion and Future Work

    The results above show that containerized solutions perform competitively with bare metal servers, whereas VMs can take a hit of up to 25% depending on the application characteristics. As mentioned previously, VMs provide a great deal of flexibility when working in a heterogeneous environment, whereas Docker containers are primarily focused on applications and their dependencies.

    The studies so far are based on a single node and work for some scenarios. We are working on scaling these studies across a cluster and introducing high speed interconnects (InfiniBand, RoCE). The goal is to compare the performance of a cluster of VMs and Containers to bare metal servers interconnected with InfiniBand.

    References

    1. http://en.community.dell.com/techcenter/high-performance-computing/b/general_hpc/archive/2014/07/15/hpc-in-an-openstack-environment
    2. https://linuxcontainers.org/
    3. https://www.freebsd.org/doc/handbook/jails.html
    4. http://docs.oracle.com/cd/E18440_01/doc.111/e18415/chapter_zones.htm
    5. http://www.ibm.com/developerworks/aix/library/au-workload/
    6. https://www.docker.com/whatisdocker/
    7. https://help.ubuntu.com/community/BasicChroot

     

     

  • Justifying the Expenditure on HPC

    by Jimmy Pike

    High performance computing can offer an organization immeasurable value by cutting costs, reducing the time to market, enabling life-changing discovery, or any number of other quantifiable variables. However, there still exists a very real disconnect between the value realized through HPC, and an ability to justify the expenditure.

    According to a recent survey conducted by Intersect360 for the Council on Competitiveness, roughly three-quarters of the companies asked said that HPC was critical to their organization's competitiveness. Yet approximately two-thirds also indicated they face challenges in justifying the expenditure to some degree, with one in ten calling it a significant challenge.

    From the survey we can identify two primary challenges companies face when considering HPC: price and return on investment (ROI). Although the price point can be a hindrance - especially for smaller organizations - it is the difficulty in clearly showing ROI that makes the cost such a difficult obstacle to overcome. After all, HPC doesn't always allow for a prediction that $X spent will yield $Y returned.

    So, what can be done? Well, in the long term solutions like increased scalability and greater government investment will help bridge the divide between need and expenditure. However, we also have a duty right now to help educate the key decision makers about the ROI available through HPC.

    For example, recently I spoke with some industry leaders, who admitted that without supercomputing it can take 3-5 years to get their product safely completed. However, the decision makers are unwilling to make the needed HPC investment to safely and successfully reduce that time frame. My response to that statement was simple: What do your companies have to lose? You can build, test, build and repeat, or simulate and test. The latter provides significant ROI. 

    Additionally, there is a growing demand for iterative research. Rather than running a batch, stopping it, making the change and running it again, there is now an emerging ability in some environments to change variables along the way without running a new batch. That ability can prove to be invaluable.

    Finally, for smaller companies, there are some options to "test run" software to see how costly any required licenses will be. The National Center for Supercomputing Applications (NCSA) at the University of Illinois, Urbana-Champaign, for example, allows companies to use its iForge cluster to better understand what performance and scalability gains will be realized under various conditions. (You can read more about the program in an earlier blog.)

    Ultimately, it's up to us to help industries better understand and explain the myriad benefits that come with high performance computing.  Because when organizations recognize that the ROI is well worth the expenditure, everyone benefits.

    If you're interested in the full methodology, comments and results from the Intersect360 / Council on Competitiveness survey, you can access it here.

  • Testing Altair's Drop Test Simulation Software

    Recently, Dell, Intel and Altair collaborated to test Altair's drop test simulation software. The simulation ran on a Dell cluster powered by Intel processors. Drop test simulation, or impact analysis, is one of the most important stages of product design and development because it helps manufacturers speed up time-to-market by allowing for higher levels of design exploration while reducing the need for physical testing.

    During the testing, engineers focused on a specific use case: testing whether the addition of a damper gasket would reduce stress in a phone design. The goal was to discover the optimized gasket design that minimized filtered stress in the edge elements of the LCD. The existing gap between the phone shield and carrier plate can bend and cause high stress levels in the LCD module in back drop tests.

    The research consisted of three steps:

    • Design - Modeling the concept in HyperMesh, and generating design variables with morphing technology and input file parameterization.
    • Optimize - Performing a design-of-experiment to generate a response surface, followed by optimization performed on that surface rather than on the finite element model (a toy sketch of this surrogate-based step follows this list).
    • Verify - Evaluating / simulating the optimized design with finite element analysis and then verifying the performance results.
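    The "Optimize" step is the key idea: rather than optimizing directly against the expensive finite element model, a cheap surrogate (response surface) is fitted to a handful of DOE runs, and the optimization is performed on that surrogate. Below is a toy Python sketch with made-up numbers; it is not Altair's actual workflow or data.

```python
# Toy response-surface optimization: fit a quadratic surrogate to a few
# hypothetical DOE samples, then optimize on the surrogate instead of the
# expensive finite element model. All numbers are invented for illustration.
import numpy as np

# Hypothetical DOE: gasket thickness (mm) vs. peak filtered LCD stress (MPa)
thickness = np.array([0.2, 0.4, 0.6, 0.8, 1.0])
stress = np.array([310.0, 262.0, 241.0, 248.0, 279.0])

# Quadratic response surface fitted to the DOE points
coeffs = np.polyfit(thickness, stress, deg=2)   # [a, b, c] for a*t**2 + b*t + c
surface = np.poly1d(coeffs)

# Optimizing on the cheap surrogate: the vertex of the fitted parabola
t_opt = -coeffs[1] / (2.0 * coeffs[0])
print(f"Predicted optimum thickness: {t_opt:.2f} mm, "
      f"predicted stress: {surface(t_opt):.1f} MPa")
```

    In the real study, the candidate optimum would then be re-run through the finite element model, which is the "Verify" step above.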

    Engineers ran the 21 drop test simulations required for this optimization study and benchmarked three different Intel processors in 2-node configurations.

    Results Overview

    The results revealed that design engineers using Altair's drop test solution on Dell / Intel systems can optimize phone impact performance, helping ensure that warranty and customer satisfaction standards are met. Design quality can also be improved: exploring how product components perform as changes are made provides insight into the dynamic behavior of real-world drop testing. The result is reduced product development timelines and costs, giving manufacturers the freedom to focus on improved designs for a better final product.

    To learn more about this study and see more detailed results, you can read the attached published white paper. You can also hear Stephan Gillich from Intel, Eric Lequiniou from Altair, and Martin Hilgeman discuss the study and its findings on the This Week in HPC podcast.

  • Performance Improvement with Intel Xeon® Phi

    by Ashish Kumar Singh and Calvin Jacob

    This blog details the performance improvement achieved with Intel Xeon® Phi on current generation Dell servers, compared to the same server in a CPU-only configuration. Although the performance runs consisted of multiple iterations with multiple BIOS options, only the best results are recorded in this blog. For each application, the settings that yielded the best performance are noted in the results description. All performance runs were carried out with hyper-threading (logical processor) disabled. The configuration of the system used is as follows:

     

    Dell PowerEdge® R730

    • Operating System: Red Hat Enterprise Linux 6.5
    • CPU: 2x Intel Xeon® E5-2695v3
    • Memory: 16x 16GB DDR4 DIMMs, 2133MHz
    • Co-processor: 2x Intel Xeon® Phi 7120P
    • Power Supply: 2x 1100W (non-redundant)
    • Intel Compiler: Intel Parallel Studio XE 2015
    • Driver for co-processor: MPSS 3.3

    The applications chosen, along with their versions and domains, are as follows:

    • High Performance Linpack, v2.1 – System Benchmark
    • STREAM, 5.10 – System Benchmark
    • SHOC, v1.1.4a-mic – System Benchmark
    • NAMD, v2.10 – Molecular Dynamics
    • ANSYS Mechanical, v15.0 – Finite Element Analysis

    The BIOS options selected for this blog are as follows:

    • Memory Settings > Snoop Mode: Cluster on Die
    • Processor Settings > Logical Processor: Disabled
    • Processor Settings > QPI Speed: Maximum Data Rate
    • Processor Settings > Configurable TDP: Nominal
    • System Profile Settings > System Profile: Performance

    Specification of Intel Xeon® E5-2695v3:

    • Cores: 14
    • Clock speed: 2.3GHz
    • Intel QPI Speed: 9.6GT/s
    • Maximum TDP: 120W
    • Cache: 35MB
    • Memory Channels: 4

    Specification of Intel Xeon® Phi 7120P:

    • Cores: 61
    • Clock Speed: 1.238GHz
    • Memory: 16GB GDDR5
    • Memory Channels: 16
    • Maximum Memory Bandwidth: 352GB/s
    • L2 Cache: 30.5MB
    • Maximum TDP: 300W

    The comparison was made between three configurations:

    1. R730 with two Intel Xeon® E5-2695v3 only
    2. R730 with two Intel Xeon® E5-2695v3 and one Intel Xeon® Phi 7120P
    3. R730 with two Intel Xeon® E5-2695v3 and two Intel Xeon® Phi 7120P 

    Throughout this blog, performance is compared across configurations 1 through 3. Considerable performance gains were observed when scaling from the CPU-only configuration to servers with one and two Intel Xeon® Phi 7120P coprocessors.

    Below are the details of the applications used for the study and the results obtained.

    High Performance Linpack is a benchmark that calculates the total (sustained) throughput obtained from a system; this data can be used to calculate overall efficiency by comparing it against the maximum theoretical throughput. For the runs recorded below, the CPU was set to Performance mode and the Intel Xeon® Phi was set to ECC ON and Turbo OFF. The achieved throughput on the CPU-only configuration was taken as the baseline. Below is a comparison of the scale-up performance runs carried out on configurations with processors and Intel Xeon® Phi.

    High Performance Linpack (GFLOPS) on Dell PowerEdge® R730

    • 2x Intel Xeon® E5-2695v3: 839.3
    • 2x Intel Xeon® E5-2695v3, 1x Intel Xeon® Phi 7120P: 1720.5
    • 2x Intel Xeon® E5-2695v3, 2x Intel Xeon® Phi 7120P: 2634.5


    We observe 100% and 200% performance improvements on the configurations with one and two Intel Xeon® Phi 7120P respectively. The runs on Intel Xeon® Phi were done in offload mode; in offload mode, the program is launched on the host CPU but the offloaded work is executed on the Intel Xeon® Phi.
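    As a back-of-the-envelope check of the CPU-only number above (my own calculation, not from the original post), theoretical peak and HPL efficiency can be estimated as follows, assuming Haswell retires 16 double-precision FLOPs per core per cycle with AVX2/FMA:

```python
# Rough HPL efficiency estimate for the CPU-only configuration.
# Assumption: an E5-2695 v3 core retires 16 double-precision FLOPs per
# cycle (AVX2 + FMA) at its 2.3 GHz base clock.
sockets = 2
cores_per_socket = 14
base_clock_ghz = 2.3
flops_per_cycle = 16

peak_gflops = sockets * cores_per_socket * base_clock_ghz * flops_per_cycle
measured_gflops = 839.3     # CPU-only HPL result reported above

print(f"Theoretical peak: {peak_gflops:.1f} GFLOPS")            # ~1030 GFLOPS
print(f"HPL efficiency:  {measured_gflops / peak_gflops:.1%}")  # ~81%
```

    Using the lower AVX base frequency instead of the rated base clock would make the apparent efficiency somewhat higher.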

    The STREAM benchmark is used to measure the memory bandwidth of a system; it measures the rate at which data transfers happen within the system or within the Intel Xeon® Phi. The performance numbers were recorded with the CPU set to Performance mode and the Intel Xeon® Phi set to ECC OFF and Turbo ON. Below is the data from the STREAM performance runs:

    The achieved STREAM bandwidth on the CPU-only R730 configuration was taken as the baseline. We observe performance improvements of 50% to 70% on the configurations with one and two Intel Xeon® Phi 7120P compared to the CPU-only configuration. The best STREAM bandwidth was observed on the Intel Xeon® Phi 7120P with ECC set to OFF and Turbo set to ON.
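    For readers unfamiliar with what STREAM's triad number represents, here is a rough NumPy sketch of the measurement. The study itself uses the standard compiled STREAM benchmark, not this code; the 24 bytes/element accounting follows STREAM's convention, and the two-pass NumPy evaluation makes the timing only approximate.

```python
# Illustrative triad-style bandwidth estimate (a[i] = b[i] + q*c[i]).
# Reduce N on machines with little memory: three float64 arrays of this
# size need roughly 3.8 GB.
import time
import numpy as np

N = 160_000_000            # matches the array size quoted for Stream Triad above
q = 3.0
a = np.zeros(N)
b = np.full(N, 1.0)
c = np.full(N, 2.0)

start = time.perf_counter()
np.multiply(c, q, out=a)   # a = q * c
np.add(a, b, out=a)        # a = b + q * c  (triad result)
elapsed = time.perf_counter() - start

bytes_counted = 3 * 8 * N  # STREAM convention: read b, read c, write a
print(f"Approximate triad bandwidth: {bytes_counted / elapsed / 1e6:.0f} MB/s")
```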

    SHOC measures the maximum device memory bandwidth for different levels of the memory hierarchy and different access patterns, with results reported in GB/s. Host-to-device bandwidth is measured by SHOCDownload, and device-to-host bandwidth is measured by SHOCReadback. An overall bandwidth of approximately 6.9GB/s is seen across SHOCDownload and SHOCReadback. The details are provided below.

    • Processor 1 to Intel Xeon® Phi-1: SHOCDownload 6.91 GB/s, SHOCReadback 6.92 GB/s
    • Processor 1 to Intel Xeon® Phi-2: SHOCDownload 6.84 GB/s, SHOCReadback 6.89 GB/s
    • Processor 2 to Intel Xeon® Phi-1: SHOCDownload 6.86 GB/s, SHOCReadback 6.91 GB/s
    • Processor 2 to Intel Xeon® Phi-2: SHOCDownload 6.87 GB/s, SHOCReadback 6.90 GB/s

    NAMD (NAnoscale Molecular Dynamics) is a molecular dynamics simulation package that uses the Charm++ parallel programming model. It has good parallel efficiency and is used to simulate large systems. The metric used here is nanoseconds/day. The benchmark was run with three datasets: ApoA1, ATPase and STMV. The performance numbers were recorded with the CPU set to Performance mode and the Intel Xeon® Phi set to ECC OFF and Turbo ON.

    NAMD (nanoseconds/day) on Dell PowerEdge® R730 – results for 2x Intel Xeon® E5-2695v3 alone / with 1x Intel Xeon® Phi 7120P / with 2x Intel Xeon® Phi 7120P:

    • ApoA1: 2.65 / 4.29 / 6.02
    • ATPase: 0.9 / 1.39 / 2.06
    • STMV: 0.25 / 0.4 / 0.58

    The ApoA1 gene provides instructions for making a protein called apolipoprotein A-I. With the ApoA1 dataset, we observe 60% and 130% performance improvements on the configurations with one and two Intel Xeon® Phi 7120P respectively, as compared to the CPU-only configuration.

    ATPases are a class of enzymes that catalyze the decomposition of ATP into ADP and a free phosphate ion. With the ATPase dataset, we observe 55% and 130% performance improvements on the configurations with one and two Intel Xeon® Phi 7120P respectively, as compared to the CPU-only configuration.

    STMV (Satellite Tobacco Mosaic Virus) is a small, icosahedral plant virus. With the STMV dataset, we observe 60% and 130% performance improvements on configuration with one and two Intel Xeon® Phi 7120P respectively as compared to CPU only configurations.

    ANSYS Mechanical is a comprehensive finite element analysis (FEA) tool for structural analysis, including linear, nonlinear and dynamic studies. The performance numbers were recorded with the CPU set to Performance mode and the Intel Xeon® Phi set to ECC ON and Turbo OFF. The best performance was seen with 16 cores. This behavior is expected, and the details can be found here.

    Concluding remarks:

    The results recorded above show a considerable performance gain for most of the applications when using Intel Xeon® Phi, with improvements of up to 300% across the chosen spectrum of applications and add-on Intel Xeon® Phi configurations. Overall, users can take advantage of the superior performance of the latest Dell PowerEdge® servers powered by Intel Xeon® E5-26xx v3 processors, with support for new instruction set extensions and the move from DDR3 to faster DDR4 memory.

  • Comparing Haswell Processor Models for HPC Applications

    by Garima Kochhar

    This blog evaluates four Haswell processor models (Intel® Xeon® E5-2600 v3 Product Family) comparing them for performance and energy efficiency on HPC applications. This is part three in a three part series. Blog one provided HPC results and performance comparisons across server generations, comparing Ivy Bridge (E5-2600 v2), Sandy Bridge (E5-2600) and Westmere (X5600) to Haswell. The second blog discussed the performance and energy efficiency implications of BIOS tuning options available on the new Dell Haswell servers.

    In this study we evaluate processor models with different core counts, CPU frequencies and Thermal Design Power (TDP) ratings and analyze the differences in performance and power. Focusing on HPC applications, we ran two benchmarks and four applications on our server. The server in question is part of Dell’s PowerEdge 13th generation (13G) server line-up. These servers support DDR4 memory at up to 2133 MT/s and Intel’s latest E5-2600 v3 series processors (architecture code-named Haswell). Haswell is a net new micro-architecture when compared to the previous generation Sandy Bridge/Ivy Bridge. Haswell based processors use a 22nm process technology, so there’s no process-shrink this time around. Note the “v3” in the Intel product name – that is what distinguishes a processor as one based on Haswell micro-architecture. You’ll recall that “E5-2600 v2” processors are based on the Ivy Bridge micro-architecture and plain E5-2600 series with no explicit version are Sandy Bridge based processors. Haswell processors require a new server/new motherboard and DDR4 memory. The platform we used is a standard dual-socket rack server with two Haswell-EP based processors. Each socket has four memory channels and can support up to 3 DIMMs per channel (DPC). 

    Configuration

    Table 1 below details the applications we used and Table 2 describes the test configuration on the new 13G server.

    Table 1 - Applications and Benchmarks

    Table 2 - Server configuration

    • Server: PowerEdge R730xd prototype
    • Processor (one pair per test):
      • 2 x Intel® Xeon® E5-2697 v3 – 2.6/2.2 GHz, 14c, 145W
      • 2 x Intel® Xeon® E5-2680 v3 – 2.5/2.1 GHz, 12c, 120W
      • 2 x Intel® Xeon® E5-2660 v3 – 2.6/2.2 GHz, 10c, 105W
      • 2 x Intel® Xeon® E5-2640 v3 – 2.6/2.2 GHz, 8c, 90W
      (Frequency noted as “rated base / AVX base” GHz)
    • Memory: 128GB - 8 x 16GB 2133 MHz DDR4 RDIMMs
    • Hard drive: 1 x 300GB SAS 6Gbps 10K rpm
    • RAID controller: PERC H330 mini
    • Operating System: Red Hat Enterprise Linux 6.5 x86_64
    • Kernel: 2.6.32-431.el6.x86_64
    • BIOS settings: As noted per test
    • MPI: Intel® MPI 4.1.3.049
    • Math Library: Intel® MKL 11.1.3.174
    • Compilers: Intel® 2013_sp1.3.174 - v14.0.3.174 Build 20140422

    All the results shown here are based on single-server performance.  The following metrics were used to compare performance.

    • Stream – Triad score as reported by the stream benchmark.
    • HPL – GFLOP/second as reported by the benchmark.
    • Fluent – Solver rating as reported by the application.
    • LS DYNA – Elapsed Time as reported by the application.
    • WRF – Average time step computed over the last 719 intervals for Conus 2.5km.
    • MILC – Time as reported by the application.

    Power was recorded during the tests on a power meter attached to the server. The average steady state power is used as the power metric for each benchmark.

    Energy efficiency (EE) is computed as performance per watt (performance/power).
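    To make the relative comparisons in the figures below concrete, the calculation works like this (illustrative numbers only, not the measured data from this study):

```python
# Hypothetical numbers purely to illustrate the relative-EE calculation;
# they are not the measured results from this study.
baseline = {"perf": 100.0, "watts": 400.0}    # e.g. the 10c baseline SKU
candidate = {"perf": 117.0, "watts": 450.0}   # e.g. a higher-bin SKU

ee_baseline = baseline["perf"] / baseline["watts"]
ee_candidate = candidate["perf"] / candidate["watts"]

print(f"Relative performance:       {candidate['perf'] / baseline['perf']:.2f}x")
print(f"Relative energy efficiency: {ee_candidate / ee_baseline:.2f}x")
```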

    Results

    Figure 1 plots the performance of the four processor models (SKUs) for the benchmarks and applications used in this study. The BIOS was set to Early Snoop memory mode (ES), DAPC system profile (Turbo enabled, C-states enabled), and Logical Processor (Hyper-Threading) was turned off. All other BIOS options were at Dell defaults. The baseline used for comparison is the E5-2660 v3 10c 2.6 GHz processor.

    From the graph, it can be seen that the memory bandwidth for the first three processor models (E5-2697 v3, E5-2680 v3, E5-2660 v3) was about the same; these SKUs support memory at 2133 MT/s. The E5-2640 v3 has lower memory bandwidth since the maximum memory speed it supports is 1866 MT/s.

    All the other applications show a steady performance improvement with higher bin processor models. For codes that have per-core licenses, the improvement with higher bin processors that have more cores is not commensurate with the increase in the number of cores. For example, when comparing the 10c SKU to the 12c SKU, adding 20% more cores (20 cores vs. 24 cores) gives Fluent running truck_poly_14m a 17% performance improvement.

    Figure 1 - Performance - ES.DAPC

    Figure 2 plots the relative energy efficiency of the test cases in Figure 1 using the 10c E5-2660 v3 SKU as the baseline.

    Since the 14c, 12c and 10c SKUs have very similar memory bandwidth at 2133 MT/s, and the higher end processors have higher TDP and consume more power, the Stream EE follows the inverse of the TDP (EE is performance/power).

    HPL shows an improvement in energy efficiency with higher bin processors; the improvements in performance outweigh the additional power consumed by the higher wattage processors.

    For all the other applications, the energy efficiency is within 5% for each SKU and varies per application. Fluent and LS-DYNA share similar characteristics, with the 12c and 8c SKUs measuring slightly better EE than the 14c and 10c SKUs. WRF and MILC show similar trends, with the lower end SKUs delivering better EE than the higher end SKUs.

    Figure 2 - Energy Efficiency - ES.DAPC

    Figure 3 plots similar results for performance and energy efficiency with a different BIOS configuration. In these tests the BIOS was set to Cluster On Die snoop mode, Performance profile (Turbo enabled, C-states disabled), and Logical Processor disabled. Recall that Cluster On Die (COD) mode is only supported on SKUs that have two memory controllers per processor, i.e. 10 or more cores. The 8c E5-2640 v3 does not support COD mode.

    The relative performance and energy efficiency patterns shown in Figure 3 for COD.Perf match those of the ES.DAPC mode (Figures 1 and 2). We know from Blog 2 that COD.Perf performs 1-3% better than ES.DAPC for the applications and data sets used in this study. This improvement is seen across the different processor models given that the relative performance between SKUs stays similar for ES.DAPC and COD.Perf BIOS settings.

     

    Figure 3 - Performance, Energy Efficiency - COD.Perf

    Figure 4 plots the idle and peak power consumption across the different BIOS settings. (Note this data was gathered on an early prototype unit running beta firmware.) The power measurements shown here are for comparative purposes across SKUs and not an absolute indicator of the server’s power requirements. The text values within the bar graph show relative values using the 10c E5-2660 v3 as a baseline.

    The idle power of the system is similar irrespective of the processor model used. This is good and demonstrates the energy efficiency of the base system. As desired, a higher wattage processor does not consume additional power when the system is idle.

    As expected, the peak power draw measured during HPL initialization is greater for the higher bin CPUs that have higher TDP. 

    Figure 4 - Power idle and peak - ES.DAPC

    Conclusion

    There is a clear performance up-side for all the applications and datasets studied here when using higher bin/higher core count processors. The goal of this study was to quantify these performance improvements as a guide to choosing the best processor model for a workload or cluster. This blog concludes our three part series on the impact of the new Dell 13G servers with Intel Haswell processors on HPC applications.