Here’s what’s on tap at the two theaters in the Dell Booth (#1739) on Thursday, November 20:
Here’s what’s on tap at the two theaters in the Dell Booth (#1739) on Wednesday, November 19:
Welcome to SC14! Join us in Booth #1739 on Tuesday, November 18 for these exciting presentations in our two theaters.
There's always so much to see and learn at the Supercomputing Conference, and SC14 is no exception. Along with an impressive list of guest speakers in our two theaters (Booth #1739) and panels featuring some of the top experts in HPC, this year we're also featuring a schedule of additional informative sessions.
Jimmy Pike, Dell Senior Fellow and Chief Architect and Technologist, will lead three strategy discussions in our booth and host three additional breakout sessions.
The strategy discussions will focus on how Dell is building on its long-time leadership in providing HPC clusters for modeling and simulation, and how it is embarking on a path of greater leadership in the innovation and development of tightly integrated, highly scalable solutions that deliver better insights and faster answers to comprehensive computational and data challenges.
The breakout session topics will include:
More information, including a full schedule, is attached.
Dell Fellow Jimmy Pike recently spoke with Rich Brueckner of insideHPC about the exciting panel discussions planned for SC14 in New Orleans. You can hear the podcast in its entirety here. Be sure to visit us at booth #1739 in the Big Easy!
by Nishanth Dandapanthula and Joseph Stanfield
In our previous blog post, we presented a comparative analysis of a bare metal system versus a virtual machine in an OpenStack environment using HPC applications. In this blog we add containers to the mix. The concept behind containers is not new; a variety of systems utilize the core concept of an isolated application, such as BSD Jails, Solaris Zones, AIX Workload Partitions, and LXC containers. In this blog, we explain what comprises a container and how it differs from a virtual machine. We then quantify the difference in performance between a container, a bare metal system (BM), and a virtual machine (VM) with exactly the same CPU and memory resources.
What are Docker containers?
Every application has its own dependencies, which include both software (services, libraries) and hardware (CPU, memory) resources. Docker [6] is a lightweight virtualization mechanism that isolates these dependencies for each application by packaging them into virtual containers. These containers can run on any Linux server, making the application portable and giving users the flexibility to run it on any framework: a private or public cloud, virtual machines, bare metal machines, and so on. Containers are scalable and are launched in an environment similar to chroot.
How are containers different from a virtual machine?
With a virtual machine, a good chunk of resources must be allocated and then emulated for the guest OS and hypervisor. Each guest OS runs as an entity independent of the host system, using virtual hardware that requires physical CPU cores, memory, and enough hard drive space for the OS (see Figure 1). A Docker container, on the other hand, is executed by the Docker engine instead of a hypervisor and utilizes the resources available on the host system (see Figure 2). Without the overhead of a hypervisor and guest OS, containers can theoretically achieve the same performance as a bare metal system. Does the Docker engine add non-trivial overhead? The short answer is no; the explanation is in the results section below.
Pros and Cons
At the application level, containers are ideal considering the load time and system resources that must be committed to launching a virtual machine. But your application's requirements must match the architecture of the host OS: containers are limited to the host ecosystem on which they are built (kernel, binaries, libraries, etc.). This can be problematic in a heterogeneous environment (Windows, multiple Linux distributions, etc.).
VMs are very helpful in a heterogeneous environment since they have no such dependencies. An administrator can easily deploy multiple operating systems from prepackaged images on bare metal servers running any OS. Consumers also have a wide range of management software to choose from to fit their specific needs, such as VMware and OpenStack.
Performance and analysis
For the scope of this blog, the terminology used is as follows.
Table 1 describes the test setup for these single-node tests. The BIOS options chosen are the HPC defaults, and a few relevant ones are listed below. For this comparison we used a sample set of applications from the HPC domain, both proprietary and open source. The details of the applications are given in Table 2.
Table 1 Test Bed Configuration
Table 2 Applications (application, description, benchmark dataset used)

- HPL: compute-intensive benchmark measuring the floating point rate of execution. Dataset: N = 110000, NB = 168
- ANSYS Fluent V15: proprietary computational fluid dynamics application.
- STREAM: measures the sustained memory bandwidth. Dataset: N = 160000000
- NAS Parallel Benchmarks (NPB): kernels and benchmarks derived from computational fluid dynamics (CFD) applications.
- LS-DYNA: proprietary structural and fluid analysis simulation software used in manufacturing, crash testing, and the aerospace and automotive industries. Dataset: TopCrunch 3 vehicle collision
- WRF: open source application for atmospheric research and forecasting.
- MILC: MIMD Lattice Computation, an open source quantum chromodynamics code that performs large-scale numerical simulations to study the strong interactions of subatomic physics. Dataset: input file from Intel Corp.
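As a quick sanity check on the datasets in Table 2, the HPL and STREAM problem sizes map directly to memory footprints. A back-of-the-envelope sketch (the helper functions are ours, not part of either benchmark):

```python
def hpl_matrix_bytes(n: int) -> int:
    """HPL factors a dense n x n double-precision matrix: n^2 * 8 bytes."""
    return n * n * 8

def stream_array_bytes(n: int) -> int:
    """STREAM keeps three double-precision arrays of n elements each."""
    return 3 * n * 8

hpl_gib = hpl_matrix_bytes(110_000) / 2**30
stream_gib = stream_array_bytes(160_000_000) / 2**30
print(f"HPL matrix: {hpl_gib:.1f} GiB")        # roughly 90 GiB
print(f"STREAM arrays: {stream_gib:.1f} GiB")  # roughly 3.6 GiB
```

The HPL problem size is chosen to fill most of the node's memory, while the STREAM arrays only need to be large enough to defeat the CPU caches.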
The performance difference between BMs and VMs was explained previously. Figures 3 and 4 add Docker containers to the mix and show the relative performance of the above-mentioned applications on containers and VMs compared to BMs. Since containers are very lightweight compared to VMs, and since containers are NUMA aware, they perform on par with BM in all cases, whereas the VM takes a hit in almost all cases. From the results it is also evident that the Docker engine has trivial overhead.
Figure 3 Application performance of Containers and VMs relative to BMs
Figure 4 Performance of NPB on containers and VMs relative to BMs
The results above show that containerized solutions perform competitively with bare metal servers, whereas VMs can take a hit of up to 25% depending on the application characteristics. As mentioned previously, VMs provide a great deal of flexibility in a heterogeneous environment, whereas Docker containers are primarily focused on applications and their dependencies.
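The relative performance plotted in Figures 3 and 4 is a simple ratio against the bare metal baseline. A minimal sketch (the runtimes below are illustrative, not our measured data):

```python
def relative_perf(value: float, baseline: float, higher_is_better: bool = True) -> float:
    """Normalize a measurement to a bare metal baseline.
    1.0 is parity; below 1.0 is a performance hit."""
    return value / baseline if higher_is_better else baseline / value

# Illustrative runtimes in seconds (lower is better), not measured data:
bm, container, vm = 100.0, 101.0, 133.0
print(round(relative_perf(container, bm, higher_is_better=False), 2))  # 0.99
print(round(relative_perf(vm, bm, higher_is_better=False), 2))         # 0.75, a 25% hit
```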
The studies so far are based on a single node and apply to some scenarios. We are working on scaling these studies across a cluster and introducing high-speed interconnects (InfiniBand, RoCE). The goal is to compare the performance of a cluster of VMs and containers against bare metal servers interconnected with InfiniBand.
by Jimmy Pike
High performance computing can offer an organization immeasurable value by cutting costs, reducing the time to market, enabling life-changing discovery, or any number of other quantifiable variables. However, there still exists a very real disconnect between the value realized through HPC, and an ability to justify the expenditure.
According to a recent survey conducted by Intersect360 for the Council on Competitiveness, roughly three-quarters of the companies asked admitted that HPC was critical to their organization's competitiveness. Yet approximately two-thirds also indicated they face challenges in justifying the expenditure to some degree, with one in ten declaring it a significant challenge.
From the survey we can ascertain two primary challenges companies face when considering HPC: price and return on investment (ROI). Although the price point can be a hindrance, especially for smaller organizations, it is the difficulty in clearly showing ROI that can make the cost such a difficult obstacle to overcome. After all, HPC doesn't always allow for a prediction that $X spent will yield $Y returned.
So, what can be done? Well, in the long term solutions like increased scalability and greater government investment will help bridge the divide between need and expenditure. However, we also have a duty right now to help educate the key decision makers about the ROI available through HPC.
For example, I recently spoke with some industry leaders who admitted that without supercomputing it can take 3-5 years to get their product safely completed. Yet their decision makers are unwilling to make the HPC investment needed to safely and successfully reduce that time frame. My response was simple: what do your companies have to lose? You can build, test, rebuild, and repeat, or you can simulate and then test. The latter provides significant ROI.
Additionally, there is a growing demand for iterative research. Rather than running a batch, stopping it, making the change and running it again, there is now an emerging ability in some environments to change variables along the way without running a new batch. That ability can prove to be invaluable.
Finally, for smaller companies, there are some options to "test run" software to see how costly any required licenses will be. The National Center for Supercomputing Applications (NCSA) at the University of Illinois, Urbana-Champaign, for example, allows companies to use its iForge cluster to better understand what performance and scalability gains will be realized under various conditions. (You can read more about the program in an earlier blog.)
Ultimately, it's up to us to help industries better understand and explain the myriad benefits that come with high performance computing. Because when organizations recognize that the ROI is well worth the expenditure, everyone benefits.
If you're interested in the full methodology, comments and results from the Intersect360 / Council on Competitiveness survey, you can access it here.
Recently, Dell, Intel and Altair collaborated to test Altair's drop test simulation software. The simulation ran on a Dell cluster powered by Intel processors. Drop test simulation, or impact analysis, is one of the most important stages of product design and development because it helps manufacturers speed up time-to-market by allowing for higher levels of design while reducing the need for physical testing.
During the testing, engineers focused on a specific use case: testing whether the addition of a damper gasket would reduce stress on phone design. The goal was to discover the optimized gasket design that minimized filtered stress in the edge elements of the LCD. The existing gap between the phone shield and carrier plate can bend and cause high stress levels in the LCD module in back drop tests.
The research consisted of three steps:
Engineers ran the 21 drop test simulations required for this optimization study and benchmarked three different Intel processors in two-node configurations.
The results revealed that design engineers using Altair's drop test solution on Dell/Intel systems can optimize phone impact performance, ensuring that warranty and customer satisfaction standards are met. Design quality can also be improved: exploring how product components perform as changes are made provides insight into the dynamic behavior of real-world drop testing. This means reduced product development timelines and costs, giving manufacturers the freedom to focus on improved designs for a better final product.
To learn more about this study and see more detailed results, you can read the attached published white paper. You can also hear Stephan Gillich from Intel, Eric Lequiniou from Altair, and Martin Hilgeman discuss the study and its findings on the This Week in HPC podcast.
by Ashish Kumar Singh and Calvin Jacob
This blog details the performance improvement achieved using Intel Xeon® Phi coprocessors on current-generation Dell servers as compared to the same server in a CPU-only configuration. Although the performance runs consisted of multiple iterations with multiple BIOS options, only the best results are recorded in this blog. For each application, the settings that yielded the best performance are noted in that application's results description. All performance runs were carried out with hyper-threading (logical processor) disabled. The configuration of the systems used is as follows:
Dell PowerEdge® R730
Red Hat Enterprise Linux 6.5
2x Intel Xeon® E5-2695v3
16x 16GB DDR4 DIMMs, 2133MHz
2x Intel Xeon® Phi 7120P
2x 1100W (non-redundant)
Intel Parallel Studio XE 2015
Driver for co-processor
The applications chosen, along with their respective versions and domains, are as follows:
High Performance Linpack
Finite Element Analysis
The BIOS options selected for this blog are as follows:
System BIOS Options
Memory Settings > Snoop Mode
Cluster on Die
Processor Settings > Logical Processor
Processor Settings > QPI Speed
Maximum Data Rate
Processor Settings > Configurable TDP
System Profile Settings > System Profile
Specification of Intel Xeon® E5-2695v3:
Intel QPI Speed
Specification of Intel Xeon® Phi 7120P:
Maximum Memory Bandwidth
The comparison has been made between 3 configurations:
Throughout this blog, performance is expressed by comparing configurations 1 through 3. A considerable performance gain was seen when scaling from the CPU-only configuration to servers with one and two Intel Xeon® Phi 7120P coprocessors.
Below are the details of the applications used for the study and the results obtained.
High Performance Linpack is a benchmark that calculates the total (sustained) throughput obtained from a system; this data can be used to calculate the overall efficiency by comparing it against the maximum theoretical throughput. For the runs recorded below, the CPU was set to Performance mode and the Intel Xeon® Phi was set to ECC ON and Turbo OFF. The throughput achieved on the CPU-only configuration was taken as the baseline. Below is a comparison of the scale-up performance runs carried out on configurations with processors and Intel Xeon® Phi coprocessors.
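The efficiency calculation mentioned above divides sustained throughput by the theoretical peak. A sketch, assuming Haswell's 16 double-precision flops per cycle per core (AVX2 with dual FMA units); the 2.3 GHz base clock for the E5-2695 v3 is taken from Intel's public specifications, not from this blog:

```python
def peak_gflops(sockets: int, cores: int, ghz: float, flops_per_cycle: int = 16) -> float:
    """Theoretical DP peak: sockets x cores x GHz x flops/cycle.
    16 flops/cycle assumes Haswell AVX2 + dual FMA."""
    return sockets * cores * ghz * flops_per_cycle

def hpl_efficiency(sustained_gflops: float, peak: float) -> float:
    """Fraction of theoretical peak sustained by an HPL run."""
    return sustained_gflops / peak

# E5-2695 v3: 14 cores per socket, 2.3 GHz base (assumed from Intel's spec sheet).
peak = peak_gflops(sockets=2, cores=14, ghz=2.3)
print(f"peak: {peak:.1f} GFLOPS")  # 1030.4
# A run sustaining 875 GFLOPS (illustrative) would be about 85% efficient:
print(f"efficiency: {hpl_efficiency(875.0, peak):.2f}")
```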
High Performance Linpack (GFLOPS) on Dell PowerEdge® R730
2x Intel Xeon E5-2695v3
2x Intel Xeon® E5-2695v3, 1x Intel Xeon® Phi 7120P
2x Intel Xeon® E5-2695v3, 2x Intel Xeon® Phi 7120P
We observe 100% and 200% performance improvements on the configurations with one and two Intel Xeon® Phi 7120P coprocessors, respectively. The runs on the Intel Xeon® Phi were done in offload mode, in which the program is launched on the host CPU but executed on the Intel Xeon® Phi.
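The improvement percentages quoted here are relative to the CPU-only baseline; a one-liner makes the arithmetic explicit (the GFLOPS figures are illustrative, not the measured data):

```python
def improvement_pct(value: float, baseline: float) -> float:
    """Percent improvement of value over baseline."""
    return (value / baseline - 1.0) * 100.0

cpu_only = 500.0  # illustrative GFLOPS, not a measured result
print(improvement_pct(2 * cpu_only, cpu_only))  # 100.0 -> one Phi doubles throughput
print(improvement_pct(3 * cpu_only, cpu_only))  # 200.0 -> two Phis triple it
```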
The STREAM benchmark measures the memory bandwidth of a system: the rate at which data transfers happen within the system or within the Intel Xeon® Phi. The performance numbers were recorded with the CPU set to Performance mode and the Intel Xeon® Phi set to ECC OFF and Turbo ON. Below is the data from the STREAM performance runs:
The STREAM bandwidth achieved on the CPU-only R730 configuration was taken as the baseline. We observe performance improvements of 50% to 70% on the configurations with one and two Intel Xeon® Phi 7120P coprocessors compared to the CPU-only configuration. The best STREAM bandwidth was observed on the Intel Xeon® Phi 7120P with ECC set to OFF and Turbo set to ON.
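For reference, the STREAM triad kernel behind these bandwidth numbers is a[i] = b[i] + k*c[i], which by STREAM's convention moves 24 bytes per iteration (two 8-byte reads, one 8-byte write). A toy Python version to illustrate the accounting; the real benchmark is C/Fortran with OpenMP, so pure Python will report far lower bandwidth:

```python
import array
import time

def stream_triad(n: int, scalar: float = 3.0):
    """STREAM triad kernel: a[i] = b[i] + scalar * c[i].
    Counts 24 bytes moved per iteration, per STREAM convention."""
    b = array.array("d", [1.0]) * n
    c = array.array("d", [2.0]) * n
    a = array.array("d", bytes(8 * n))  # n zero-valued doubles
    t0 = time.perf_counter()
    for i in range(n):
        a[i] = b[i] + scalar * c[i]
    elapsed = time.perf_counter() - t0
    return a, 24 * n / elapsed / 1e6  # MB/s

a, mb_per_s = stream_triad(100_000)
print(f"triad bandwidth: {mb_per_s:.0f} MB/s")
```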
SHOC measures the maximum device memory bandwidth for different levels of the memory hierarchy and different access patterns. Results are reported in GB/s. Host-to-device bandwidth is measured by SHOCDownload and device-to-host bandwidth by SHOCReadback. An overall bandwidth of approximately 6.9 GB/s was seen across SHOCDownload and SHOCReadback. The details are provided below.
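This host-device bandwidth matters in offload mode because input and output data must cross the link to the coprocessor. A rough cost model (the 2 GB buffer size is illustrative):

```python
def transfer_seconds(n_bytes: float, gb_per_s: float = 6.9) -> float:
    """Time to move a buffer across a link sustaining gb_per_s GB/s."""
    return n_bytes / (gb_per_s * 1e9)

# Moving an illustrative 2 GB working set at ~6.9 GB/s:
print(round(transfer_seconds(2e9), 2))  # 0.29 seconds each way
```

If the offloaded kernel's compute time is not much larger than this, transfer overhead can erase the coprocessor's advantage.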
Intel Xeon® Phi-1
Intel Xeon® Phi-2
NAMD (NAnoscale Molecular Dynamics program) is a molecular dynamics simulation package that uses the Charm++ parallel programming model. It has good parallel efficiency and is used to simulate large systems. The metric used here is nanoseconds/day. The benchmark was run with three different datasets: ApoA1, ATPase and STMV. The performance numbers were recorded with the CPU set to Performance mode and the Intel Xeon® Phi set to ECC OFF and Turbo ON.
NAMD (nanoseconds/day) on Dell PowerEdge® R730
2x Intel Xeon E5-2695v3, 1x Intel Xeon® Phi 7120P
2x Intel Xeon E5-2695v3, 2x Intel Xeon® Phi 7120P
The ApoA1 gene provides instructions for making a protein called apolipoprotein A-I. With the ApoA1 dataset, we observe 60% and 130% performance improvements on the configurations with one and two Intel Xeon® Phi 7120P, respectively, compared to the CPU-only configuration.
ATPases are a class of enzymes that catalyze the decomposition of ATP into ADP and a free phosphate ion. With the ATPase dataset, we observe 55% and 130% performance improvements on the configurations with one and two Intel Xeon® Phi 7120P, respectively, compared to the CPU-only configuration.
STMV (Satellite Tobacco Mosaic Virus) is a small, icosahedral plant virus. With the STMV dataset, we observe 60% and 130% performance improvements on the configurations with one and two Intel Xeon® Phi 7120P, respectively, compared to the CPU-only configuration.
ANSYS Mechanical is a comprehensive finite element analysis (FEA) tool for structural analysis, including linear, nonlinear and dynamic studies. The performance numbers were recorded with the CPU set to Performance mode and the Intel Xeon® Phi set to ECC ON and Turbo OFF. The best performance was seen with 16 cores. This behavior is as expected, and the details can be found here.
The results recorded above show a considerable performance gain for most of the applications when using the Intel Xeon® Phi, with improvements of up to 300% across the chosen spectrum of applications and add-on Intel Xeon® Phi configurations. Overall, users can take advantage of the superior performance of the latest Dell PowerEdge® servers powered by the Intel Xeon® E5-26xx v3, with support for new extensions coupled with the increased memory speed of DDR4 over DDR3.
by Garima Kochhar
This blog evaluates four Haswell processor models (Intel® Xeon® E5-2600 v3 Product Family), comparing them for performance and energy efficiency on HPC applications. This is part three of a three-part series. Blog one provided HPC results and performance comparisons across server generations, comparing Ivy Bridge (E5-2600 v2), Sandy Bridge (E5-2600) and Westmere (X5600) to Haswell. The second blog discussed the performance and energy efficiency implications of the BIOS tuning options available on the new Dell Haswell servers.
In this study we evaluate processor models with different core counts, CPU frequencies and Thermal Design Power (TDP) ratings, and analyze the differences in performance and power. Focusing on HPC applications, we ran two benchmarks and four applications on our server, which is part of Dell’s PowerEdge 13th generation (13G) server line-up. These servers support DDR4 memory at up to 2133 MT/s and Intel’s latest E5-2600 v3 series processors (architecture code-named Haswell). Haswell is a new micro-architecture compared to the previous-generation Sandy Bridge/Ivy Bridge; Haswell-based processors use a 22nm process technology, so there’s no process shrink this time around. Note the “v3” in the Intel product name: that is what distinguishes a processor as one based on the Haswell micro-architecture. You’ll recall that “E5-2600 v2” processors are based on the Ivy Bridge micro-architecture, and plain E5-2600 series processors with no explicit version are Sandy Bridge based. Haswell processors require a new server/motherboard and DDR4 memory. The platform we used is a standard dual-socket rack server with two Haswell-EP based processors. Each socket has four memory channels and can support up to 3 DIMMs per channel (DPC).
Table 1 below details the applications we used and Table 2 describes the test configuration on the new 13G server.
Table 1 - Applications and Benchmarks
Table 2 - Server configuration
PowerEdge R730xd prototype
2 x Intel® Xeon® E5-2697 v3 – 2.6/2.2 GHz, 14c, 145W
2 x Intel® Xeon® E5-2680 v3 – 2.5/2.1 GHz, 12c, 120W
2 x Intel® Xeon® E5-2660 v3 – 2.6/2.2 GHz, 10c, 105W
2 x Intel® Xeon® E5-2640 v3 – 2.6/2.2 GHz, 8c, 90W
* Frequency noted as “Rated base/AVX base GHz”
128GB - 8 x 16GB 2133 MHz DDR4 RDIMMs
1 x 300GB SAS 6Gbps 10K rpm
PERC H330 mini
Red Hat Enterprise Linux 6.5 x86_64
As noted per test
Intel® MPI 4.1.3.049
Intel® MKL 126.96.36.199
Intel® 2013_sp1.3.174 - v188.8.131.52 Build 20140422
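Given the AVX base frequencies noted in Table 2, the theoretical double-precision peak per dual-socket node can be estimated as sockets × cores × AVX GHz × 16 flops/cycle. This sketch is ours; the 16 flops/cycle figure assumes Haswell's AVX2 with dual FMA units:

```python
FLOPS_PER_CYCLE = 16  # assumed: Haswell AVX2 with two FMA units, DP

skus = {  # cores per socket, AVX base frequency in GHz (from Table 2)
    "14c": (14, 2.2),
    "12c": (12, 2.1),
    "10c": (10, 2.2),
    "8c":  (8,  2.2),
}

for name, (cores, ghz) in skus.items():
    peak = 2 * cores * ghz * FLOPS_PER_CYCLE  # GFLOPS for two sockets
    print(f"{name}: {peak:.1f} GFLOPS")
```

This is why the 12c SKU, despite its extra cores, sits closer to the 10c SKU than a pure core-count comparison would suggest: its AVX base clock is lower.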
All the results shown here are based on single-server performance. The following metrics were used to compare performance.
Power was recorded during the tests on a power meter attached to the server. The average steady state power is used as the power metric for each benchmark.
Energy efficiency (EE) was computed as performance per watt (performance/power).
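Putting the two metrics together, relative EE between SKUs reduces to a ratio of performance ratios to power ratios. A sketch with illustrative numbers:

```python
def energy_efficiency(performance: float, watts: float) -> float:
    """EE = performance per watt."""
    return performance / watts

def relative_ee(perf: float, watts: float, base_perf: float, base_watts: float) -> float:
    """EE of a SKU normalized to a baseline SKU."""
    return energy_efficiency(perf, watts) / energy_efficiency(base_perf, base_watts)

# Illustrative: a SKU with 17% more performance for 25% more power has lower EE.
print(round(relative_ee(1.17, 1.25, 1.0, 1.0), 3))  # 0.936
```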
Figure 1 plots the performance of the four processor models (SKUs) for the benchmarks and applications used in this study. The BIOS was set to Early Snoop memory mode (ES), DAPC system profile (Turbo enabled, C-states enabled), and Logical Processor (Hyper-Threading) was turned off. All other BIOS options were at Dell defaults. The baseline used for comparison is the E5-2660 v3 10c 2.6 GHz processor.
From the graph, it can be seen that the memory bandwidth for the first three processor models (E5-2697 v3, E5-2680 v3, E5-2660 v3) was about the same. These SKUs can support memory at 2133 MT/s. The E5-2640 v3 has lower memory bandwidth since the maximum memory speed it supports is 1866 MT/s.
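The theoretical ceiling behind this observation is channels × transfer rate × 8 bytes per transfer. A quick sketch, using the four memory channels per socket described for this platform:

```python
def peak_mem_bw_gbs(channels: int, mts: int, bytes_per_transfer: int = 8) -> float:
    """Theoretical per-socket memory bandwidth: channels x MT/s x bytes/transfer."""
    return channels * mts * bytes_per_transfer / 1e3  # GB/s

print(peak_mem_bw_gbs(4, 2133))  # 68.256 GB/s per socket at 2133 MT/s
print(peak_mem_bw_gbs(4, 1866))  # 59.712 GB/s per socket at 1866 MT/s (E5-2640 v3 ceiling)
```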
All the other applications show a steady performance improvement with higher-bin processor models. For codes that have per-core licenses, the improvement with higher-bin processors that have more cores is not commensurate with the increase in the number of cores. For example, moving from the 10c SKU to the 12c SKU adds 20% more cores (20 vs. 24 cores per server) but gives Fluent running truck_poly_14m only a 17% performance improvement.
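For per-core-licensed codes, the relevant metric is performance per licensed core. Using the Fluent example above:

```python
def perf_per_core(relative_perf: float, cores: int) -> float:
    """Throughput delivered per (licensed) core."""
    return relative_perf / cores

# Fluent truck_poly_14m: the 12c SKU gives 1.17x the 10c SKU's performance,
# but uses 24 cores per server instead of 20.
ratio = perf_per_core(1.17, 24) / perf_per_core(1.0, 20)
print(round(ratio, 3))  # 0.975: about 2.5% less throughput per licensed core
```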
Figure 1 - Performance - ES.DAPC
Figure 2 plots the relative energy efficiency of the test cases in Figure 1 using the 10c E5-2660 v3 SKU as the baseline.
Since the 14c, 12c and 10c SKUs have very similar memory bandwidth at 2133 MT/s, and the higher-end processors have higher TDP and consume more power, the STREAM EE follows the inverse of the TDP (EE is performance/power).
HPL shows an improvement in energy efficiency with higher-bin processors: the improvements in performance outweigh the additional power consumed by the higher-wattage processors.
For all the other applications, the energy efficiency is within 5% for each SKU and varies per application. Fluent and LS-DYNA share similar characteristics, with the 12c and 8c SKUs measuring slightly better EE than the 14c and 10c SKUs. WRF and MILC show similar trends, with the lower-end SKUs showing better EE than the higher-end SKUs.
Figure 2 - Energy Efficiency - ES.DAPC
Figure 3 plots similar results for performance and energy efficiency with a different BIOS configuration. In these tests the BIOS was set to Cluster On Die snoop mode, the Performance profile (Turbo enabled, C-states disabled), and Logical Processor disabled. Recall that Cluster On Die (COD) mode is only supported on SKUs that have two memory controllers per processor, i.e., 10 or more cores. The 8c E5-2640 v3 does not support COD mode.
The relative performance and energy efficiency patterns shown in Figure 3 for COD.Perf match those of the ES.DAPC mode (Figures 1 and 2). We know from Blog 2 that COD.Perf performs 1-3% better than ES.DAPC for the applications and data sets used in this study. This improvement is seen across the different processor models given that the relative performance between SKUs stays similar for ES.DAPC and COD.Perf BIOS settings.
Figure 3 - Performance, Energy Efficiency - COD.Perf
Figure 4 plots the idle and peak power consumption across the different BIOS settings. (Note this data was gathered on an early prototype unit running beta firmware.) The power measurements shown here are for comparative purposes across SKUs and not an absolute indicator of the server’s power requirements. The text values within the bar graph show relative values using the 10c E5-2660 v3 as a baseline.
The idle power of the system is similar irrespective of the processor model used. This is good and demonstrates the energy efficiency of the base system. As desired, a higher wattage processor does not consume additional power when the system is idle.
As expected, the peak power draw measured during HPL initialization is greater for the higher bin CPUs that have higher TDP.
Figure 4 - Power idle and peak - ES.DAPC
There is a clear performance up-side for all the applications and datasets studied here when using higher bin/higher core count processors. The goal of this study was to quantify these performance improvements as a guide to choosing the best processor model for a workload or cluster. This blog concludes our three part series on the impact of the new Dell 13G servers with Intel Haswell processors on HPC applications.