Author: Yogendra Sharma, Ashish Singh, September 2016 (HPC Innovation Lab)
This blog describes a performance analysis of a PowerEdge R930 server powered by four Intel Xeon E7-8890 v4 @2.2GHz processors (code-named Broadwell-EX). The primary objective is to compare the performance of HPL, STREAM and two scientific applications, ANSYS Fluent and WRF, against the previous generation of Intel processors, the Intel Xeon E7-8890 v3 @2.5GHz (code-named Haswell-EX). The configurations used for this study are listed below.
4 x Intel Xeon E7-8890 v3 @2.5GHz (18 cores) 45MB L3 cache 165W
4 x Intel Xeon E7-8890 v4 @2.2GHz (24 cores) 60MB L3 cache 165W
1024 GB = 64 x 16GB DDR4 @2400MHz RDIMMS
1024 GB = 32 x 32GB DDR4 @2400MHz RDIMMS
Processor Settings > Logical Processors
Processor Settings > QPI Speed
Maximum Data Rate
Processor Settings > System Profile
Software and Firmware
RHEL 6.6 x86_64
RHEL 7.2 x86_64
Benchmark and Applications
V2.1 from MKL 11.2
V2.1 from MKL 11.3
v5.10, Array Size 1800000000, Iterations 100
v3.5.1, Input Data Conus12KM, Netcdf-126.96.36.199
V3.8 Input Data Conus12KM, Netcdf-4.4.0
Table 1: Details of Server and HPC Applications used with Broadwell-EX processors
In this section, we compare benchmark results for two generations of processors on the same server platform, the PowerEdge R930, and also compare the performance of Broadwell-EX processors across different system profiles and memory snoop modes, namely Home Snoop (HS) and Cluster-on-Die (COD).
The High Performance Linpack (HPL) benchmark is a measure of a system's floating point computing power. It measures how fast a computer solves a dense n by n system of linear equations Ax = b, which is a common task in engineering. The HPL benchmark was run on both PowerEdge R930 servers (with Broadwell-EX and with Haswell-EX) with a block size of NB=192 and a problem size of N=340992.
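To put the chosen problem size in context, the following minimal sketch estimates how much memory the dense HPL matrix occupies and how many NB-sized blocks it spans. It assumes double precision (8 bytes per element) and treats the 1024 GB of installed memory as binary gigabytes; the function and constant names are illustrative only and are not part of HPL.

```python
# Sketch: memory footprint of the dense HPL matrix for the run described above.
N = 340_992            # problem size quoted in the text
NB = 192               # block size quoted in the text
BYTES_PER_DOUBLE = 8   # HPL works in double precision
INSTALLED_GIB = 1024   # installed memory from Table 1 (assumed to be binary GB)


def hpl_matrix_bytes(n):
    """Bytes needed to hold the n x n coefficient matrix."""
    return n * n * BYTES_PER_DOUBLE


matrix_bytes = hpl_matrix_bytes(N)
matrix_gib = matrix_bytes / 2**30
memory_fraction = matrix_gib / INSTALLED_GIB

print(f"Matrix size: {matrix_gib:.1f} GiB")
print(f"Fraction of installed memory: {memory_fraction:.1%}")
print(f"Blocks per dimension: {N // NB} (remainder {N % NB})")
```

Under these assumptions the matrix fills a large share of memory while leaving headroom for the operating system, and N is an exact multiple of NB, so no partial blocks are needed.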
Figure 1: Comparing HPL Performance across BIOS profiles Figure 2: Comparing HPL Performance over two generations of processors
Figure 1 depicts the performance of the PowerEdge R930 server with Broadwell-EX processors under different BIOS options. HS (Home Snoop mode) performs better than COD (Cluster-on-Die) with both system profiles, Performance and DAPC. Figure 2 compares the performance of four-socket servers with Intel Xeon E7-8890 v3 and Intel Xeon E7-8890 v4 processors. HPL showed a 47% performance improvement with four Intel Xeon E7-8890 v4 processors on the R930 server in comparison to four Intel Xeon E7-8890 v3 processors. This was due to the ~33% increase in the number of cores, plus a further ~13% from the newer versions of the Intel compiler and Intel MKL.
STREAM is a synthetic benchmark program that measures sustainable memory bandwidth and the corresponding computation rate for simple vector kernels.
Figure 3: Comparing STREAM Performance across BIOS profiles Figure 4: Comparing STREAM Performance over two generations of processors
As per Figure 3, the memory bandwidth of the PowerEdge R930 server with Intel Broadwell-EX processors is the same across BIOS profiles. Figure 4 shows the memory bandwidth of the PowerEdge R930 server with both Intel Xeon Broadwell-EX and Intel Xeon Haswell-EX processors. Both Haswell-EX and Broadwell-EX support DDR3 and DDR4 memory, and the platform in this configuration runs memory at 1600MT/s for both generations of processors. Because the memory frequency is the same, both Intel Xeon processors deliver the same memory bandwidth of 260GB/s on the PowerEdge R930 server.
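As a rough sanity check on the STREAM setup, the sketch below works out the memory footprint implied by the array size in Table 1 and how long one Triad pass would take at the measured 260GB/s. It assumes the standard STREAM layout of three double-precision arrays and the benchmark's own convention of counting 24 bytes per Triad iteration; the helper names are illustrative only.

```python
# Sketch: footprint and per-pass timing implied by the STREAM parameters.
ARRAY_ELEMENTS = 1_800_000_000   # array size from Table 1
BYTES_PER_ELEMENT = 8            # double precision (assumed STREAM default)
NUM_ARRAYS = 3                   # STREAM allocates arrays a, b and c
SOCKETS = 4
L3_BYTES_PER_SOCKET = 60 * 10**6 # 60MB L3 per E7-8890 v4
MEASURED_BANDWIDTH = 260e9       # bytes/s, as reported in the text


def array_bytes(n):
    """Size of one STREAM array."""
    return n * BYTES_PER_ELEMENT


def footprint_bytes(n):
    """Total size of the three STREAM arrays."""
    return NUM_ARRAYS * array_bytes(n)


def triad_seconds(n, bandwidth):
    """Time for one Triad pass (two reads and one write per element)."""
    return 3 * BYTES_PER_ELEMENT * n / bandwidth


total_cache = SOCKETS * L3_BYTES_PER_SOCKET
print(f"Total footprint: {footprint_bytes(ARRAY_ELEMENTS) / 1e9:.1f} GB")
print(f"One array is {array_bytes(ARRAY_ELEMENTS) / total_cache:.0f}x the combined L3 cache")
print(f"One Triad pass at 260GB/s: {triad_seconds(ARRAY_ELEMENTS, MEASURED_BANDWIDTH):.3f} s")
```

Each array is far larger than the combined last-level cache, which is what the STREAM run rules require so that the result reflects memory bandwidth rather than cache bandwidth.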
The Weather Research and Forecasting (WRF) Model is a mesoscale numerical weather prediction system designed for both atmospheric research and operational forecasting needs. It features two dynamical cores, a data assimilation system, and a software architecture facilitating parallel computation and system extensibility. The model serves a wide range of meteorological applications across scales from tens of meters to thousands of kilometers. WRF can generate atmospheric simulations using real data or idealized conditions. We used the CONUS12km and CONUS2.5km benchmark datasets for this study.
CONUS12km is a small, single-domain benchmark (a 48-hour, 12km resolution case over the Continental U.S. (CONUS) domain from October 24, 2001) with a 72-second time step. CONUS2.5km is a large, single-domain benchmark (the latter 3 hours of a 9-hour, 2.5km resolution case over the Continental U.S. (CONUS) domain from June 4, 2005) with a 15-second time step.
WRF decomposes the domain into tasks or patches. Each patch can be further decomposed into tiles that are processed separately, but by default there is only one tile per run. If that single tile is too large to fit into the cache of the CPU and/or core, computation slows down because WRF is sensitive to memory bandwidth. To reduce the tile size, the number of tiles can be increased by defining “numtiles = x” in the input file or by setting the environment variable “WRF_NUM_TILES = x”. For both CONUS 12km and CONUS 2.5km, the number of tiles was chosen for best performance, which was 56.
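The effect of the tile count on the working set can be sketched as follows. This is an illustration under stated assumptions, not WRF's actual tiling code: it assumes a patch is cut into strips along one dimension, and the patch dimensions are hypothetical values chosen only for the example. The environment variable name is the one given in the text.

```python
import os


def split_rows(patch_rows, num_tiles):
    """Split patch_rows grid rows into num_tiles nearly equal strips."""
    if num_tiles < 1:
        raise ValueError("num_tiles must be at least 1")
    base, extra = divmod(patch_rows, num_tiles)
    # The first `extra` tiles receive one additional row each.
    return [base + (1 if i < extra else 0) for i in range(num_tiles)]


def tile_points(patch_cols, rows_in_tile, levels):
    """Grid points touched while one tile is processed."""
    return patch_cols * rows_in_tile * levels


# Hypothetical patch owned by one task (not taken from the study).
PATCH_COLS, PATCH_ROWS, LEVELS = 425, 300, 35

for num_tiles in (1, 56):
    rows = split_rows(PATCH_ROWS, num_tiles)
    largest = tile_points(PATCH_COLS, max(rows), LEVELS)
    print(f"tiles={num_tiles:3d}  largest tile = {largest:,} grid points")

# Environment a job launcher could pass on to the WRF executable.
wrf_env = dict(os.environ, WRF_NUM_TILES="56")
```

With more tiles, each one covers fewer grid points and so has a better chance of staying resident in cache, which is the motivation given above for tuning the tile count.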
Figure 5: Comparing WRF Performance across BIOS profiles
Figure 5 compares the WRF datasets across BIOS profiles. With the CONUS 12km data, all BIOS profiles perform equally well because of the smaller data size, while for CONUS 2.5km, Perf.COD (Performance system profile with Cluster-on-Die snoop mode) gives the best performance. As per Figure 5, Cluster-on-Die snoop mode performs 2% better than Home Snoop mode, while the Performance system profile gives 1% better performance than DAPC.
Figure 6: Comparing WRF Performance over two generations of processors
Figure 6 shows the performance comparison between Intel Xeon Haswell-EX and Intel Xeon Broadwell-EX processors on the PowerEdge R930 server. As shown in the graph, Broadwell-EX performs 24% better than Haswell-EX for the CONUS 12km dataset and 6% better for CONUS 2.5km.
ANSYS Fluent is a computational fluid dynamics (CFD) software tool. Fluent includes well-validated physical modeling capabilities to deliver fast and accurate results across the widest range of CFD and multi physics applications.
Figure 7: Comparing Fluent Performance across BIOS profiles
We used three different datasets for Fluent, with ‘Solver Rating’ (higher is better) as the performance metric. Figure 7 shows that all three datasets performed 4% better with the Perf.COD (Performance system profile with Cluster-on-Die snoop mode) BIOS profile than with the others, while the DAPC.HS (DAPC system profile with Home Snoop mode) BIOS profile shows the lowest performance. For all three datasets, COD snoop mode performs 2% to 3% better than Home Snoop mode, and the Performance system profile performs 2% to 4% better than DAPC. The behaviour of Fluent is consistent across all three datasets.
Figure 8: Comparing Fluent Performance over two generations of processors
As shown in Figure 8, for all the test cases, Fluent showed a 13% to 27% performance improvement on the PowerEdge R930 with Broadwell-EX in comparison to the PowerEdge R930 with Haswell-EX.
Overall, the Broadwell-EX processor makes the PowerEdge R930 server more powerful and more efficient. With Broadwell-EX, HPL performance increases in line with the increase in the number of cores over Haswell-EX. There is also an increase in performance for real-world applications, depending on the nature of their computation. It can therefore be a good upgrade choice for those running compute-hungry applications.
By Munira Hussain, Deepthi Cherlopalle
This blog introduces the Omni-Path Fabric from Intel® as a cluster network fabric used for inter-node communication for application, management and storage traffic in High Performance Computing (HPC). It is part of the Intel® Scalable System Framework and is based on IP from QLogic TrueScale and Cray Aries. The goal of Omni-Path is eventually to meet the performance and scalability demands of exascale data centers.
Dell provides a complete, validated and supported solution offering that includes the Networking H-series Fabric switches and Host Fabric Interface (HFI) adapters. The Omni-Path HFI is a PCIe Gen3 x16 adapter capable of 100 Gbps unidirectional per port. The card has 4 lanes, each supporting 25 Gbps.
HPC Program Overview with Omni-Path:
The current solution program is based on Red Hat Enterprise Linux 7.2 (kernel version 3.10.0-327.el7.x86_64). The Intel Fabric Suite (IFS) drivers are integrated into the current software solution stack, Bright Cluster Manager 7.2, which helps to deploy, provision, install and configure an Omni-Path cluster seamlessly.
The following Dell servers support Intel® Omni-Path Host Fabric Interface (HFI) cards:
PowerEdge R430, PowerEdge R630, PowerEdge R730, PowerEdge R730XD, PowerEdge R930, PowerEdge C4130, PowerEdge C6320
The management and monitoring of the fabric is done using the Fabric Manager (FM) GUI available from Intel®. The FMGUI provides in-depth analysis and a graphical overview of fabric health, including a detailed breakdown of port status and mapping, as well as an investigative report on errors.
Figure 1: Fabric Manager GUI
The IFS tools include various debugging and management tools such as opareports, opainfo, opaconfig, opacaptureall, opafabricinfoall, opapingall, opafastfabric, etc. These help to capture a snapshot of the fabric and to troubleshoot it. The host-based subnet manager service, known as opafm, is also available with IFS and is able to scale to thousands of nodes.
The fabric relies on the PSM2 libraries to provide optimal performance. The IFS package provides precompiled versions of the open source OpenMPI and MVAPICH2 MPI libraries, along with micro-benchmarks such as OSU and IMB that are used to measure the bandwidth and latency of the cluster.
Basic Performance Benchmarking Results:
The performance numbers below were taken on a Dell PowerEdge R630 server. The server configuration consisted of dual-socket Intel® Xeon® E5-2697 v4 CPUs @ 2.3GHz with 18 cores each, and 8 x 16GB of memory @ 2400MHz. The BIOS version was 2.0.2, and the system profile was set to Performance.
HPC applications need low latency and high throughput. OSU Micro-benchmarks were used to determine latency, with the tests run in ping-pong fashion. As shown in Figure 2, the back-to-back latency is 0.77µs and the latency through a switch is 0.9µs, which is on par with industry standards.
Figure 2: OSU Latency - E5-2697 v4
Figure 3 below shows the OSU uni-directional and bi-directional bandwidth results with the OpenMPI-1.10-hfi version. At a 4MB message size, uni-directional bandwidth is around 12.3 GB/s and bi-directional bandwidth is around 24.3 GB/s, which is close to the theoretical peak values.
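The comparison with the theoretical peak can be made explicit with a short calculation. This sketch uses only the figures quoted in this blog (4 lanes at 25 Gbps, and the measured 12.3 GB/s and 24.3 GB/s) and assumes the raw link rate with no allowance for protocol overhead; the function names are illustrative.

```python
# Sketch: measured OSU bandwidth as a fraction of the raw Omni-Path link rate.
LANES = 4
GBPS_PER_LANE = 25
BITS_PER_BYTE = 8


def peak_gbytes_per_second(lanes, gbps_per_lane):
    """Raw one-direction link rate in GB/s (protocol overhead ignored)."""
    return lanes * gbps_per_lane / BITS_PER_BYTE


def efficiency(measured, peak):
    """Measured bandwidth as a fraction of the peak."""
    return measured / peak


uni_peak = peak_gbytes_per_second(LANES, GBPS_PER_LANE)
bi_peak = 2 * uni_peak  # both directions active at once

print(f"Uni-directional: 12.3 of {uni_peak} GB/s = {efficiency(12.3, uni_peak):.1%}")
print(f"Bi-directional:  24.3 of {bi_peak} GB/s = {efficiency(24.3, bi_peak):.1%}")
```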
Figure 3: OSU Bandwidth – E5-2697 v4
Omni-Path Fabric adds value to the HPC solution. It integrates well as the high speed fabric needed to design flexible reference architectures that keep up with the growing need for computation. Users can benefit from open source fabric tools such as FMGUI, Chassis Viewer and FastFabric, which are packaged with IFS. The solution is automated and validated with Bright Cluster Manager 7.2 on Dell servers.
More details on how Omni-Path performs in other domains are available here. This document describes the key features of Intel® Omni-Path Fabric technology and provides a reference to performance data gathered on various commercial and open source applications.
Authors: Mayura Deshmukh, Ashish K Singh, Neha Kashyap
With the refresh of Dell’s 13th generation servers with the recently released Broadwell (BDW) processors, some obvious questions come to mind such as how the new processors compare with the older generation processors. This blog, fourth in the series of “Broadwell Performance for HPC,” focuses on answering this question. It compares the performance of various CAE applications for five Broadwell Intel Xeon E5-2600 v4 series processor models with previous generation Intel processors.
Last week’s blog talked about the impact of BIOS options for each of the CAE applications. Here we focus on how much better the Broadwell processors perform than the previous generation Haswell (HSW) and Ivy Bridge (IVB) processors for these CAE applications. Table 1 shows the applications that we are comparing and Table 2 describes the server configuration used for the study. For LS-DYNA, the benchmarks run on IVB and HSW (sse binary), and for ANSYS Fluent, the benchmarks run on Westmere (WSM), Sandy Bridge (SB), Ivy Bridge (IVB) and HSW, used different software versions (the latest version available at the time) from those listed in Table 1. The STAR-CCM+ and OpenFOAM versions were the same for the benchmarks run on HSW and BDW.
Table 1 - Applications and benchmarks
Platform MPI 9.1.0
Average Elapsed time
Platform MPI 9.1.3
Platform MPI 188.8.131.52
Open MPI 1.10.0
Table 2 - Server configuration
256GB - 16 x 16GB 2400 MHz DDR4 RDIMMs
6 x 300GB SAS 6Gbps 10K rpm
PERC H330 mini
Red Hat Enterprise Linux 7.2 (3.10.0-327.el7.x86_64)
System profile - Performance
Logical Processor - Disabled
Power Supply Redundant Policy - Not Redundant
Power Supply Hot Spare Policy - Disabled
I/O Non-Posted Prefetch - Disabled
Snoop Mode - Opportunistic Snoop Broadcast (OSB) for OpenFOAM and Cluster on Die (COD) for all the other applications
Node interleaving - Disabled
Figure 1 compares the performance of the five BDW Intel Xeon E5-2600 v4 series processors models with HSW Intel Xeon E5-2600 v3 series processors and IVB E5-2680 v2 for LS-DYNA car2car benchmark (with end time set to 0.02).
Figure 1: IVB vs. HSW vs BDW for LS-DYNA
The performance for all the processors is compared to the E5-2680 v2, which is shown as the red baseline set at 1. The green bars show the performance of the HSW processors with the LS-DYNA single precision sse binary, the grey bar represents data for the HSW E5-2697 v3 with the LS-DYNA single precision avx2 binary, the blue bars show the data for the BDW processors with the single precision sse binary, and the orange bars represent the BDW data with the single precision avx2 binary. For BDW, the avx2 binaries perform 12-19% better than the sse binaries across all the processor models. The purple diamonds describe the performance per core compared to the E5-2680 v2. The percentages at the top of the orange BDW avx2 bars describe the improvement of the BDW processors over the HSW E5-2697 v3 with avx2 (the grey bar in the graph).
The 12 core BDW E5-2650 v4, which has fewer cores and a lower frequency, understandably performs 11% lower than the Haswell E5-2697 v3 processor. The 14 core E5-2690 v4, which has the same number of cores and similar avx2 frequencies, performs 7% better than the E5-2697 v3; this can be attributed to the increase in memory bandwidth with Broadwell. BDW processors also measure better power efficiency than Haswell processors. The 16 core, 20 core and 22 core processors perform 16 to 30% better than the HSW E5-2697 v3 (avx2). Considering the performance, the performance per core and the higher memory bandwidth per core, the E5-2690 v4 14c and E5-2697A v4 16c look like attractive options for CAE/CFD codes, particularly when considering per-core licensing costs.
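The two normalized metrics used in these graphs, relative performance and relative performance per core, can be sketched as below. The elapsed times are hypothetical placeholders, not measurements from this study, and the total core counts assume dual-socket servers; only the way the ratios are formed is the point here.

```python
# Sketch: normalizing elapsed times against a baseline processor.
def relative_performance(baseline_seconds, seconds):
    """Speedup over the baseline; lower elapsed time gives a higher value."""
    return baseline_seconds / seconds


def relative_performance_per_core(baseline_seconds, baseline_cores, seconds, cores):
    """Speedup over the baseline, normalized by the number of cores used."""
    return relative_performance(baseline_seconds, seconds) * baseline_cores / cores


BASELINE_CORES = 20        # assumed dual-socket E5-2680 v2, 10 cores per socket
BASELINE_SECONDS = 1000.0  # hypothetical elapsed time

# Hypothetical elapsed times for two dual-socket BDW configurations.
candidates = {
    "E5-2690 v4 (2 x 14c)": (28, 500.0),
    "E5-2699 v4 (2 x 22c)": (44, 400.0),
}

for name, (cores, seconds) in candidates.items():
    rel = relative_performance(BASELINE_SECONDS, seconds)
    per_core = relative_performance_per_core(
        BASELINE_SECONDS, BASELINE_CORES, seconds, cores
    )
    print(f"{name}: relative = {rel:.2f}, per core = {per_core:.2f}")
```

A part with a higher core count can post the best overall number yet a lower per-core value, which matters when software is licensed per core.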
CD-adapco’s STAR-CCM+ is another CFD application widely-used by industry for solving problems involving fluid flows, heat transfer, and other phenomena. STAR-CCM+ shows similar performance patterns to LS-DYNA.
Figure 2: HSW vs BDW for STAR-CCM+
Figure 2 compares the performance of the five BDW Intel Xeon E5-2600 v4 series processor models (shown as the five bars in the graph) with the HSW E5-2697 v3, shown as the red line set at one. The numbers at the top of the bars show the per-core performance relative to the E5-2697 v3. As seen from the bars, the relative performance of the 14 core, 16 core, 20 core and 22 core processors is higher by 8% to 40% across all the benchmarks. The 12 core E5-2650 v4, with fewer cores and a lower frequency, performs 11-20% lower than the E5-2697 v3. Similar to LS-DYNA, the per-core performance of the 14 core and 16 core processors is 2% to 11% better than the HSW E5-2697 v3, making them good options for STAR-CCM+ as well.
ANSYS Fluent is a computational fluid dynamics application. The graph in Figure 3 shows the performance of truck_poly_14m for Sandy Bridge (SB), Ivy Bridge (IVB), HSW and BDW processors compared to the Westmere (WSM) processor, shown as the red line set at one.
Figure 3: WSM vs. SB vs. IVY vs. HSW vs. BDW for ANSYS Fluent
The Fluent benchmark exhibits a similar pattern to the LS-DYNA and STAR-CCM+ benchmarks. The purple diamonds in Figure 3 describe the performance per core compared to the WSM 2.93GHz processor. The percentages at the top of the blue BDW bars describe the improvement of the BDW processors over the HSW E5-2697 v3 (the green bar in the graph). The 12 core BDW E5-2650 v4, which has fewer cores and a lower frequency, performs 14% lower than the Haswell E5-2697 v3 processor. With higher performance per core and higher memory bandwidth per core, the E5-2690 v4 14c and E5-2697A v4 16c are good options, particularly when considering per-core software licensing costs, and perform 11% and 21% better than the E5-2697 v3 processor respectively. The 20 and 22 core BDW processors perform 32%-39% better than the HSW E5-2697 v3.
OpenFOAM (Open source Field Operation And Manipulation) is a free, open source software for computational fluid dynamics (CFD).
Figure 4: HSW vs. BDW for OpenFOAM Motorbike 11M benchmark
As shown in Figure 4 for the OpenFOAM Motorbike 11M benchmark, all the Broadwell processors perform 12% to 21% better than the Haswell E5-2697 v3 processor, shown as the red line set at one. Per-core performance for the 16 core, 14 core and 12 core processors is 4% to 30% better than the E5-2697 v3. The 20 core and 22 core BDW processors perform the same on the Motorbike 11M benchmark. The increase in the number of cores does not provide a significant performance boost for the 20 and 22 core parts, likely due to lower memory bandwidth per core, as explained in the first blog’s STREAM results.
Along with more cores than HSW, BDW measures better power efficiency than HSW. Looking at the absolute performance, the performance per core and the higher memory bandwidth per core, the E5-2690 v4 14c and E5-2697A v4 16c are attractive options for CAE/CFD codes, particularly if per-core licensing costs are involved. For applications like OpenFOAM (motorbike case), all the BDW processors performed better than the Haswell E5-2697 v3, but the increase in the number of cores does not provide a significant performance boost for the 20 and 22 core parts due to lower memory bandwidth per core.
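The memory bandwidth per core argument can be illustrated with theoretical peak numbers. This sketch assumes four DDR4 channels per E5-2600 v4 socket, 8 bytes per transfer, and the 2400 MT/s DIMMs from Table 2; sustained STREAM bandwidth will be lower than these peaks, and the per-model core counts are the ones listed elsewhere in this series.

```python
# Sketch: theoretical peak memory bandwidth per core for each BDW model.
CHANNELS_PER_SOCKET = 4      # DDR4 channels per E5-2600 v4 socket
TRANSFERS_PER_SECOND = 2400e6
BYTES_PER_TRANSFER = 8       # 64-bit wide memory channel

CORES_PER_SOCKET = {
    "E5-2650 v4": 12,
    "E5-2690 v4": 14,
    "E5-2697A v4": 16,
    "E5-2698 v4": 20,
    "E5-2699 v4": 22,
}


def peak_socket_bandwidth_gbs():
    """Theoretical peak bandwidth of one socket in GB/s."""
    return CHANNELS_PER_SOCKET * TRANSFERS_PER_SECOND * BYTES_PER_TRANSFER / 1e9


def peak_bandwidth_per_core_gbs(cores):
    """Peak socket bandwidth shared evenly among the cores."""
    return peak_socket_bandwidth_gbs() / cores


for model, cores in CORES_PER_SOCKET.items():
    print(f"{model}: {peak_bandwidth_per_core_gbs(cores):.2f} GB/s per core")
```

The socket-level peak is identical for all five models, so every additional core reduces the share available to each one, which is consistent with the flattening seen at 20 and 22 cores.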
Last week’s blog in the “Broadwell Performance for HPC” series described the BIOS options and compared performance across generations of processors for molecular dynamics (NAMD) and Weather Research and Forecasting (WRF) applications. This blog, third in the series, focuses on BIOS options for some HPC CAE applications on five different Broadwell Intel Xeon E5-2600 v4 series processor models. It aims to answer questions such as: which snoop mode works best for my application and processor? Which BIOS System Profile gives the best performance?
There have been a few changes in the BIOS options for Broadwell as compared with the previous generation (Haswell). One of the major additions in the Broadwell BIOS is the “Opportunistic Snoop Broadcast” snoop mode in the Memory settings. This blog discusses performance of the applications for all four snoop modes: Opportunistic snoop broadcast (OSB), Early snoop (ES), Home snoop (HS) and Cluster on die (COD). For more information on the new BIOS options and snoop modes check blog one of this series.
The Dell BIOS “System Profile” setting can be set to one of four pre-configured profiles, Performance Per Watt (DAPC), Performance Per Watt (OS), Performance (Perf.) and Dense Configuration, or it can be set to Custom. In the pre-configured profiles, Turbo Boost, C States, C1E, CPU Power Management, Memory Frequency, Memory Patrol Scrub, Memory Refresh Rate and Uncore Frequency are preset, whereas with Custom the user can choose values for these options. For more information on System Profiles check the link. DAPC and OS have been shown to perform similarly in past studies, and Dense Configuration performs lower for HPC workloads, so we focus on the DAPC and Performance profiles in this study. The DAPC (Dell Active Power Control) profile relies on a BIOS-centric power control mechanism. Energy efficient turbo, C States and C1E are enabled with the DAPC profile. The Performance profile disables power saving features such as C States, energy efficient turbo and C1E. Turbo Boost is enabled in both System Profiles.
This blog discusses the performance of CAE applications with DAPC and Performance profile for each of the four snoop modes for five different Intel Xeon E5-2600 v4 series Broadwell processors. Table 1 shows the application and benchmark details and Table 2 describes the server configuration used for the study.
System Profile - Performance and Performance Per Watt (DAPC)
Snoop Mode - Opportunistic Snoop Broadcast (OSB), Early Snoop (ES), Home Snoop (HS), Cluster on Die (COD)
LS-DYNA is a general-purpose finite element program from LSTC capable of simulating complex real-world structural mechanics problems. We ran the car2car benchmark with endtime set to 0.02 with both the single precision avx2 and the single precision sse LS-DYNA binaries.
Figure 1: Comparing snoop modes and BIOS Profiles for LS-DYNA
The left graph in Figure 1 shows how much better or worse the different snoop modes perform compared to the default setting of snoop mode = OSB and BIOS profile = DAPC (which is set at 1, the red line on the graph). Just changing the snoop mode to COD increases performance by 1-3% with either BIOS profile across all the processor models. The performance with COD is closely followed by OSB, followed by ES for lower core counts and HS for the 16, 20 and 22 core processors. With ES mode, the system starts paying the penalty of having fewer request tokens per core at higher core counts compared to the other snoop modes (e.g. 128/14 = 9 per core for 14 cores vs. 128/22 = 5 per core for 22 cores). All the snoop modes with the System Profile set to Performance follow a similar pattern to DAPC. As shown in the graph on the right in Figure 1, changing the System Profile from DAPC to Performance can provide up to a 2% performance benefit. COD.Perf is the best option, about 2-4% better than OSB.DAPC across all processor models. This total 2-4% improvement with COD.Perf is partly due to the change in snoop mode and partly due to the change in the BIOS System Profile to Performance. We ran the car2car benchmark for all the combinations above with the sse LS-DYNA binary as well and noted similar behavior, with the Performance System Profile and COD snoop mode being 2-6% better than the default OSB.DAPC. The avx2 binaries performed 12-19% better than the sse binaries across all the processor models.
CD-adapco® STAR-CCM+ is another CFD application widely used by industry for solving problems involving fluid flows, heat transfer, and other phenomena. The STAR-CCM+ benchmark results show a pattern similar to LS-DYNA in terms of snoop mode and System Profile.
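The request-token arithmetic quoted above extends to all five processor models as in this sketch. It takes the pool of 128 tokens from the text and uses whole-number division, matching the 9 and 5 tokens per core given for the 14 and 22 core parts; the names are illustrative.

```python
# Sketch: Early Snoop request tokens available per core on each model.
REQUEST_TOKENS = 128  # token pool quoted in the text


def tokens_per_core(cores):
    """Whole request tokens available to each core."""
    return REQUEST_TOKENS // cores


for cores in (12, 14, 16, 20, 22):
    print(f"{cores} cores: {tokens_per_core(cores)} tokens per core")
```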
Figure 2: Comparing snoop modes for STAR-CCM+
Figure 2 compares the snoop modes for the Civil_20m and Lemans_17m benchmarks. For simplicity, only data for these two benchmarks is shown; the other benchmark datasets show results similar to the patterns in Figure 2. The BIOS profile in the graphs is set to DAPC and the snoop modes are compared against the default OSB snoop mode (which is set at 1, the red line on the graph). COD is the best option for the Civil_20m benchmark, where it is about 2-3% better with DAPC. With the Performance System Profile, COD is 4-6% better for the Civil_20m benchmark (not shown in the graph). COD is followed by OSB and then ES for smaller core counts. Performance with ES, though, starts dropping as the core count increases, similar to what was observed with the LS-DYNA car2car benchmark. The HlMach10 benchmark shows a similar pattern to the Civil_20m benchmark. For the HlMach10 benchmark, the COD.Perf option is 2-7% better than the default OSB.DAPC.
All the other benchmarks (EglinStoreSeparation, Kcs, Lemans_100m, Reactor9m, TurboCharger, Vtm) show similar pattern to Lemans_17m. The COD and OSB perform similarly, there is only ~1% difference between OSB and COD across the benchmark cases across all processor models. After COD and OSB, ES option is better for lower core counts and HS for 16, 20 and 22 core processors. As mentioned previously, the system in ES mode starts paying the penalty of having lower request tokens per core for higher core counts compared to the other snoop modes.
Figure 3: DAPC vs. Performance with COD snoop mode for STARCCM+
The graph in Figure 3 compares the System Profile BIOS options DAPC and Performance. We compare the performance of COD.Perf with respect to COD.DAPC, which is the red baseline set at 1 in the graph. The Performance profile provides a 2-4% benefit over DAPC for the Civil_20m benchmark on all the processor models. For the high core count E5-2699 v4, the Performance profile also performs 2-5% better across all the benchmarks. For all the other processor models, the Performance profile does not provide a significant gain (only about 1%) on any benchmark except Civil_20m.
ANSYS Fluent is a computational fluid dynamics application. Fluent provides multiple benchmark cases. We picked four representative cases from the v16 benchmark suite: combustor_12m, combustor_71m, exhaust_system_33m and ice_2m and one from the older v15 benchmark suite: truck_poly_14m, to allow us to compare our data with previous generation processor models. The Fluent benchmarks exhibit a similar pattern as LS-DYNA and STAR-CCM+ benchmarks.
Figure 4: Comparing snoop modes for ANSYS Fluent
The graph in Figure 4 shows the performance of truck_poly_14m for all the snoop modes compared to the default OSB.DAPC which is shown as the red baseline in the graphs. All the other benchmarks show a similar pattern. COD performs up to 2% better than OSB for truck_poly_14m, combustor_12m and ice_2m. COD is about 5% better for combustor_71m and 6% better for exhaust_33m. COD is followed by OSB, followed by ES for lower core counts and HS for higher core count processors for all the benchmarks.
Figure 5: DAPC vs. Performance with COD snoop mode for ANSYS Fluent
Figure 5 shows the performance for Performance profile with respect to DAPC with COD set as the snoop mode for both options. DAPC is shown as the red baseline in the graph. The Performance BIOS profile option is about 4% better for all the processor models for the larger combustor_71m and exhaust_33m benchmark cases. The Performance profile is 1-3% better for the other benchmark cases.
OpenFOAM (Open source Field Operation And Manipulation) is a free, open source software for computational fluid dynamics (CFD). OpenFOAM was compiled with -march=native / Broadwell option. We used the cavity-1M and motorBike-11M datasets which are modifications of the OpenFOAM tutorials/incompressible/icoFoam/cavity and tutorials/incompressible/simpleFoam/motorBike models respectively.
Figure 6: Comparing snoop modes and BIOS Profiles for OpenFOAM Cavity 1M benchmark
As shown in left graph of figure 6 for DAPC System Profile, the benchmark performance increases by 3-6% when in COD snoop mode when compared to OSB. ES and HS options perform up to 3% lower than OSB across all the processor models. The pattern is similar for the Performance System Profile, where COD is better by 3-7% followed by OSB. HS is lower than OSB but better than ES for all the processors models except for the 20core E5-2698 v4 where ES is 1% better than HS for DAPC profile and 7% better than HS for Performance System Profile. There is not a lot of difference in performance for DAPC Vs Performance profile especially for the higher frequency processors 14core E5-2690v4 and the 16core E5-2697A v4. For the other models the Performance profile shows up to 4% benefit as shown in the right graph of figure 6.
Figure 7: Comparing snoop modes and BIOS Profiles for OpenFOAM Motorbike 11M benchmark
For the OpenFOAM Motorbike 11M benchmark, the OSB, COD and HS snoop modes perform similarly, with about 1% variation. The performance for ES is low across all the processor models and keeps dropping as the number of cores increases, as shown in the left graph of Figure 7. The snoop modes with the BIOS System Profile set to Performance follow the same trend. As shown in the right graph of Figure 7, the DAPC and Performance profiles show similar performance, with Performance about 1% better in most cases except for the E5-2697A v4, where DAPC.COD was 2% better.
Most of the datasets used in this study show an advantage for COD mode, but COD benefits codes that are highly NUMA optimized and whose dataset fits into the NUMA node memory (that is, half of each socket's memory capacity). OSB is a close second and a good option for codes with varying levels of NUMA optimization; OSB is also the default memory snoop BIOS option. HS and ES perform slightly lower than COD and OSB. ES is better than HS for lower core counts, but as the core count increases ES starts paying the penalty of having fewer request tokens per core compared to the other snoop modes. In terms of System Profile, the Performance profile performs slightly better than DAPC in most cases.
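The memory constraint for COD can be made concrete with a small sketch. It assumes a dual-socket server with the 256GB configuration from Table 2, memory populated evenly across sockets, and COD exposing two NUMA nodes per socket; the function names and the example dataset size are hypothetical.

```python
# Sketch: memory available to one NUMA node with and without Cluster on Die.
def numa_node_capacity_gb(total_memory_gb, sockets, cluster_on_die):
    """Memory local to one NUMA node, assuming evenly populated sockets."""
    nodes_per_socket = 2 if cluster_on_die else 1
    return total_memory_gb / (sockets * nodes_per_socket)


def fits_in_numa_node(dataset_gb, total_memory_gb, sockets, cluster_on_die):
    """True if a per-node dataset stays within local NUMA memory."""
    return dataset_gb <= numa_node_capacity_gb(total_memory_gb, sockets, cluster_on_die)


TOTAL_MEMORY_GB = 256  # from Table 2
SOCKETS = 2            # assumed dual-socket server

print("COD node capacity:", numa_node_capacity_gb(TOTAL_MEMORY_GB, SOCKETS, True), "GB")
print("Non-COD node capacity:", numa_node_capacity_gb(TOTAL_MEMORY_GB, SOCKETS, False), "GB")
# Hypothetical 80GB working set per NUMA node.
print("80GB fits with COD:", fits_in_numa_node(80, TOTAL_MEMORY_GB, SOCKETS, True))
```

A working set that exceeds the local node capacity spills into remote memory, which is where codes that are not NUMA optimized lose the benefit of COD.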
Be sure to check back next week for the last blog in the series which will compare the performance of HPC CAE applications across generations (Ivy-bridge vs. Haswell vs. Broadwell)
Authors: Ashish K Singh, Mayura Deshmukh, Neha Kashyap
This blog describes the performance of Intel Broadwell processors with two HPC applications, Weather Research and Forecasting (WRF) and NAnoscale Molecular Dynamics (NAMD). This is the second blog in the series of four blogs on “Performance study of Intel Broadwell”. The first blog characterizes the Broadwell-EP processors with HPC benchmarks like HPL and STREAM. This study compares five different Broadwell processors, E5-2699 v4 @2.2GHz (22 cores), E5-2698 v4 @2.2GHz (20 cores), E5-2697A v4 @2.6GHz (16 cores), E5-2690 v4 @2.6GHz (14 cores) and E5-2650 v4 @2.2GHz (12 cores), in PowerEdge 13th generation servers. It characterizes the performance of the system for WRF and NAMD by comparing the five Broadwell processor models with previous generations of Intel processors. For the generation over generation comparison, previous results from Intel Xeon X5600 series Westmere (WSM), Intel Xeon E5-2600 series Sandy Bridge (SB), Intel Xeon E5-2600 v2 series Ivy Bridge (IVY), Intel Xeon E5-2600 v3 series Haswell (HSW) and Intel Xeon E5-2600 v4 series Broadwell (BDW) processors were used. This blog also describes the impact of BIOS tuning options on WRF and NAMD performance with Broadwell. Table 1 below lists the server configuration and the application details for the Broadwell processor based tests. The software versions were different for the older generation processors, and results are compared against what was the best configuration at that time. Given the large architectural changes in servers and processors from generation to generation, the changes in software versions are not a significant factor.
Table 1: Details of Server and HPC Applications used with Intel Broadwell processors
Dell PowerEdge R730
E5-2699 v4 @2.2GHz, 22 core, 145W
E5-2698 v4 @2.2GHz, 20 core, 135W
E5-2697A v4 @2.6GHz, 16 core, 145W
E5-2690 v4 @2.6GHz, 14 core, 135W
E5-2650 v4 @2.2GHz, 12 core, 105W
16 x 16GB DDR4 @ 2400MHz (Total=256GB)
2 x 1100W
System profile – Performance and Performance Per Watt (DAPC)
Logical Processor – Disabled
Power Supply Redundant Policy – Not Redundant
Power Supply Hot Spare Policy – Disabled
Snoop modes – COD, ES, HS and OSB
Node Interleaving - Disabled
From Intel Parallel studio 2016 update1
Intel MPI – 5.1.2
Weather Research and Forecasting (WRF) is an HPC application used for atmospheric research. The WRF model is a next-generation mesoscale numerical weather prediction system designed for both atmospheric research and operational forecasting needs. This serves a wide range of meteorological applications across scales from tens of meters to thousands of kilometers. WRF can generate atmospheric simulations using real data or idealized conditions. We used the CONUS12km and CONUS2.5km benchmarks for this study.
CONUS12km is a small, single-domain benchmark (a 48-hour, 12km resolution case over the Continental U.S. (CONUS) domain from October 24, 2001) with a time step of 72 seconds. CONUS2.5km is a large, single-domain benchmark (the latter 3 hours of a 9-hour, 2.5km resolution case over the Continental U.S. (CONUS) domain from June 4, 2005) with a time step of 15 seconds.
WRF decomposes the domain into tasks or patches. Each patch can be further decomposed into tiles that are processed separately, but by default there is only one tile for every run. If the single tile is too large to fit into the cache of the CPU and/or core, computation slows down because WRF is sensitive to memory bandwidth. To reduce the size of each tile, the number of tiles can be increased by defining “numtiles = x” in the input file or by defining the environment variable “WRF_NUM_TILES = x”. For both CONUS 12km and CONUS 2.5km, the number of tiles was chosen based on best performance. The best tile count depends on the workload and the hardware configuration. Table 2 shows more detail on the number of tiles used in this study for best performance.
Table 2: Parameters used in WRF for best performance
Total no. of cores
NAMD is one of the HPC applications used in molecular dynamics research. It is a portable, parallel, object-oriented molecular dynamics code designed for high-performance simulations of large biomolecular systems. NAMD is developed using Charm++. Molecular dynamics simulations are an important technique for understanding biological systems. This study was performed with three NAMD benchmarks: ApoA1 (92,224 Atoms), F1ATPase (327,506 Atoms) and STMV (virus, 1,066,628 Atoms). In terms of the number of atoms, these benchmarks represent small, medium and large datasets respectively.
Intel Broadwell processors
Figure 1: Performance for Intel Broadwell processors with WRF
Figure 1 compares performance among the five Broadwell processors using the small and large WRF benchmarks. WRF was compiled in the “sm + dm” mode. The combinations of MPI and OpenMP processes that were used are listed in Table 2.
The “X” value on top of each bar shows the performance relative to the 12 core Broadwell processor (which is set as the baseline, 1.0). For the small dataset, CONUS12km, the top bin processor performs 26% better than the 12 core processor. For CONUS2.5km, performance increases by up to 30% because the larger dataset can use a larger number of cores more efficiently. The performance increase from 20 to 22 cores is not as significant because of the lower memory bandwidth per core, as explained in the first blog’s STREAM results.
Figure 2: Performance of Intel Broadwell processors with NAMD
Figure 2 plots the simulation speed of the NAMD benchmarks with Broadwell processors. The “X” value on top of each bar shows the performance relative to the 12 core Broadwell processor (which is set as the baseline, 1.0). As seen from the graph, the relative performance of the different processor models is nearly the same irrespective of the NAMD benchmark dataset (small, medium or large). The top bin processor is 81 to 84% faster than the 12 core processor. NAMD benchmarks show significant performance improvement with additional cores for Broadwell processors.
BIOS Profiles comparison
This study was performed with all snoop modes: Home Snoop (HS), Early Snoop (ES), Cluster-on-Die (COD) and Opportunistic Snoop Broadcast (OSB), with the System Profiles “Performance” and “Performance Per-Watt (DAPC)”. More details on these BIOS profiles are in the first blog of this series.
Figure 3: BIOS profile comparison with WRF for CONUS 2.5 km
Figure 3 compares the available snoop modes and two BIOS profiles for the large WRF dataset “CONUS2.5km”. The left graph compares snoop modes against the default BIOS setting, the “OSB” snoop mode with the “DAPC” System Profile, which is shown as the red line set at 1. As per the graph, the COD snoop mode performs 2 to 3% better than the default OSB snoop mode. As WRF is a memory sensitive application, the ES snoop mode performs worse than the other snoop modes, up to 8% lower at 22 cores, because it has fewer request tokens per core than the other snoop modes (for example, 128/14 = 9 per core for the 14 core processor vs. 128/22 = 5 per core for the 22 core processor). The right graph compares the “Performance” system profile with the “DAPC” system profile for the better performing “COD” snoop mode, with COD.DAPC as the baseline. The difference is not significant: “Performance” is only up to 1% better than “DAPC”.
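The token arithmetic in the paragraph above can be reproduced with a short sketch. The pool size of 128 is taken from the text; the function name and the use of integer division (to match the rounded figures quoted) are assumptions made for illustration.

```python
# Sketch: per-core share of a fixed pool of request tokens in Early Snoop mode.
# The pool size (128) comes from the text; integer division mirrors its rounding.
TOKEN_POOL = 128


def tokens_per_core(cores: int, pool: int = TOKEN_POOL) -> int:
    """Return the whole number of request tokens available to each core."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return pool // cores


# Core counts of the five Broadwell processors used in this study.
for cores in (12, 14, 16, 20, 22):
    print(f"{cores} cores: {tokens_per_core(cores)} tokens per core")
```

The share shrinks steadily as the core count grows, which is consistent with the ES penalty being largest on the 22 core part.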
Figure 4: BIOS profile comparison with ApoA1 (92,224 Atoms)
Figure 5: BIOS profile comparison with F1ATPase (327,506 Atoms)
Figure 6: BIOS profile comparison with STMV (1,066,628 Atoms)
Figures 4, 5 and 6 show the performance characteristics of the snoop modes available for Broadwell processors with the three (small, medium and large) NAMD benchmarks. The left graphs compare snoop modes with the default BIOS profile (OSB snoop mode with DAPC system profile, which is shown as the red line set at 1). NAMD performance is almost the same across all snoop modes for all the datasets. COD is about 1% better for some of the processors on all the datasets, but this is not a significant difference compared to the other snoop modes. The right graphs compare the “Performance” system profile with the default “DAPC” system profile, using COD snoop mode with DAPC as the baseline (red line set at 1 in the graph). NAMD performed up to 3% better with the “Performance” profile and COD snoop mode than with the “DAPC” system profile. As seen from these three graphs, the benefit of COD snoop mode with the “Performance” system profile grows with the larger datasets, specifically for the 22 core part.
Generation over Generation comparison
Figure 7: Generation over generation comparison of Intel processors for CONUS12km WRF benchmark
Figure 7 plots the performance characteristics of the CONUS 12km WRF benchmark over multiple generations of Intel processors. The bars show the average time step of the CONUS12km benchmark in seconds, and the purple dots show the performance relative to the WSM processor. The graph shows that the 14 core Broadwell processor performs 20% better than the 14 core HSW processor. All the Broadwell processors perform better than the Haswell E5-2697 v3, with the top bin processor improving on it by up to 33%. The 20 and 22 core Broadwell processors perform the same, likely because of the lower memory bandwidth per core.
Figure 8: Comparing two generations of Intel processors with WRF Conus 2.5
Figure 8 compares two generations of Intel processors, Haswell and Broadwell. In this graph, the bars show the average time step value and the purple dots show the performance improvement relative to the 12 core Haswell processor. The 12 core Broadwell processor has 13% higher memory frequency than the 12 core Haswell processor (2400 MT/s vs. 2133 MT/s in Haswell), but it also has 17% lower AVX base frequency. As a result, the 12 core Broadwell processor performs 6% worse than the 12 core Haswell processor. As per the graph, the top bin 22 core Broadwell processor performs 14% better than the 14 core Haswell processor. Similar to what we saw earlier, there is no significant performance improvement from the 20 core to the 22 core Broadwell processor because of the lower memory bandwidth per core.
Figure 9: Performance comparison of multiple generations of Intel processors with ApoA1 benchmark
Figure 9 compares multiple generations of Intel processors (IVB, HSW and BDW) with the small (92,224 Atoms) NAMD benchmark, ApoA1. The bars show the NAMD performance of the processors in “days/ns”. As seen from the graph, HSW performs 40% better than IVB, while BDW’s performance improvement over HSW varies from 23 to 52%, except for the 12 core BDW. The 12 core BDW processor performs 18% slower than the 14 core HSW processor because of its 22% lower base frequency. The dots in the graph show the performance improvement over the IVB processor. The graph shows that performance increases with the number of cores for the BDW processors. The top bin 22 core BDW performs 112% better than the 12 core IVB.
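Because the bars are reported in days/ns, a lower value means a faster simulation. The sketch below shows one common way to turn two days/ns values into a "percent better" figure. The blog does not state the exact formula it used, so this definition is an assumption, and the input values are made-up placeholders rather than measured results.

```python
# Sketch: relating NAMD "days/ns" values (lower is better) to a percentage gain.
# The formula is an assumed, commonly used definition; the blog does not spell
# out how its percentages were derived.


def ns_per_day(days_per_ns: float) -> float:
    """Convert days/ns into simulation speed in ns/day."""
    return 1.0 / days_per_ns


def percent_better(baseline_days_per_ns: float, new_days_per_ns: float) -> float:
    """Percentage gain in simulation speed of the new result over the baseline."""
    return (baseline_days_per_ns / new_days_per_ns - 1.0) * 100.0


# Placeholder values for illustration only (not taken from the figures).
baseline, candidate = 0.7, 0.5
print(f"speed: {ns_per_day(baseline):.2f} -> {ns_per_day(candidate):.2f} ns/day")
print(f"gain: {percent_better(baseline, candidate):.1f}%")
```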
Figure 10: Performance comparison of multiple generations of Intel processors with F1ATPase benchmark
Figure 10 compares the performance of multiple generations of Intel processors with the medium (327,506 Atoms) NAMD benchmark, F1ATPase. The graph shows that HSW performs 33% better than IVB and that BDW performs up to 62% better than HSW.
Figure 11: Performance comparison of multiple generations of Intel processors with STMV benchmark
Figure 11 compares multiple generations of Intel processors for the large (1,066,628 Atoms) NAMD benchmark, STMV. As per this graph, the BDW processors perform up to 63% better than the HSW processor.
It can be seen from figures 9, 10 and 11 that the larger datasets make better use of the computation power and the relative performance with additional cores is better as compared with the smaller dataset.
This blog characterizes five Intel Broadwell processors and shows the performance improvement for real-world HPC applications. Additional cores, along with the higher memory frequency supported by Intel Broadwell processors, improve the performance of HPC workloads, specifically compute sensitive workloads like NAMD. The performance of memory bandwidth sensitive workloads like WRF increases up to the 16 core processor, but the improvement for the 20 and 22 core processors is not as significant because of the lower memory bandwidth per core.
Authors: Ashish Kumar Singh, Mayura Deshmukh and Neha Kashyap
The increasing demand for more compute power pushes servers to be upgraded with higher and more powerful hardware. With the release of the new Intel® Xeon® processor E5-2600 v4 family of processors (architecture codenamed “Broadwell”), Dell has refreshed the 13th generation servers to benefit from the increased number of cores and higher memory speeds thus benefiting a wide variety of HPC applications.
This blog is part one of the “Broadwell performance for HPC” blog series and discusses the performance characterization of Intel Broadwell processors with the High Performance LINPACK (HPL) and STREAM benchmarks. The next three blogs in the series will discuss the BIOS tuning options and the impact of Broadwell processors on the Weather Research and Forecasting (WRF), NAMD, ANSYS® Fluent®, CD-adapco® STAR-CCM+®, OpenFOAM and LSTC LS-DYNA® HPC applications as compared to the previous generation processor models.
In this study, performance was measured across the five different Broadwell processor models listed in Table 2, along with 2400 MT/s DDR4 memory. This study focuses on HPL and STREAM performance for different BIOS profiles across all five Broadwell processor models and compares the results to previous generations of Intel Xeon processors. The platform we used is a PowerEdge R730, which is a 2U dual socket rack server with two processors. Each socket has four memory channels and can support up to 3 DIMMs per channel (DPC). For our study, we used 2 DPC for a total of 16 DDR4 DIMMs in the server.
Broadwell (BDW) is a tick in Intel’s tick-tock principle as the next step in semiconductor fabrication. It is a 14nm processor with the same microarchitecture as the Haswell-based (HSW, Xeon E5-2600 v3 series) processors with the same TDP range. Broadwell E5-2600 v4 series processors support up to 22 cores per socket with up to 55MB of LLC, which is 22% more cores and LLC than Haswell. Broadwell supports DDR4 memory with max memory speed of up to 2400 MT/s, 12.5% higher than the 2133 MT/s that is supported with Haswell.
Broadwell introduces a new snoop mode option in the BIOS memory setting, Directory with Opportunistic Snoop Broadcast (DIR+OSB), which is the default snoop mode for Broadwell. In this mode, the memory snoop is spawned by the Home Agent and a directory is maintained in the DRAM ECC bits. DIR+OSB mode allows for low local memory latency, high local memory bandwidth and I/O directory cache to reduce directory update overheads for I/O accesses. The other three snoop modes: Home Snoop (HS), Early Snoop (ES), and Cluster-on-Die (COD) are similar to what was available with Haswell. The Cluster-on-die (COD) is only supported on processors that have two memory controllers per processor. The Dell BIOS on systems that support both Haswell and Broadwell will display the supported snoop modes based on the processor model populated in the system.
Table 1 describes the other new features available in the Dell BIOS on systems that support Broadwell processors.
Table 1: New BIOS features with Intel Xeon E5 v4 processor family (Broadwell)
Snoop Mode > Directory with Opportunistic Snoop Broadcast (DIR+OSB)
Directory with Opportunistic Snoop Broadcast, available on select processor models, works well for workloads of mixed NUMA optimization. It offers a good balance of latency and bandwidth.
System Profile Settings > Write Data CRC
When set to enabled, DDR4 data bus issues are detected and corrected during ‘write’ operations. Two extra cycles are required for CRC bit generation, which impacts performance. Read-only unless System Profile is set to Custom.
System Profile Settings > CPU Power Management > Hardware P States
If supported by the CPU, Hardware P States is another performance-per-watt option that relies on the CPU to dynamically control individual core frequency. Read-only unless System Profile is set to Custom.
System Profile Settings > C States > Autonomous
Autonomous is a new BIOS option for C States in addition to the previous options, Enable and Disable. With Autonomous (if hardware control is supported), the processor can operate in all available Power States to save power, but this may increase memory latency and frequency jitter.
Intel Broadwell supports Intel® Advanced Vector Extensions 2 (Intel AVX2) vector technology, which allows a processor core to execute 16 FLOPs per cycle. HPL is a benchmark that solves a dense linear system. The HPL problem size (N) was chosen to be 177408 along with a block size (NB) of 192. The theoretical peak value of HPL was calculated using the AVX base frequency, which is lower than the rated base frequency of the processor model. Broadwell processors consume more power when running Intel® AVX2 workloads than non-AVX workloads. Starting with the Haswell product family, Intel provides two frequencies for each SKU. Table 2 lists the rated base and AVX base frequencies of each Broadwell processor used for this study. Since HPL is an AVX-enabled workload, we calculate the HPL theoretical maximum performance with the AVX base frequency as (AVX base frequency of processor * number of cores * 16 FLOP/cycle).
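The peak formula above can be written as a small helper. This is a minimal sketch: it assumes "number of cores" means all cores in the dual socket server, so it takes cores per socket and a socket count, and the 2.0 GHz frequency in the example is a placeholder rather than a value from Table 2.

```python
# Sketch of the theoretical peak formula quoted above:
#   peak = AVX base frequency * number of cores * 16 FLOP/cycle
# Assumption: "number of cores" counts every core in the dual socket server.


def hpl_peak_gflops(avx_base_ghz: float, cores_per_socket: int,
                    sockets: int = 2, flops_per_cycle: int = 16) -> float:
    """Theoretical peak in GFLOPS for a server at its AVX base frequency."""
    return avx_base_ghz * cores_per_socket * sockets * flops_per_cycle


def hpl_efficiency(measured_gflops: float, peak_gflops: float) -> float:
    """Fraction of the theoretical peak achieved by a measured HPL result."""
    return measured_gflops / peak_gflops


# Placeholder frequency for illustration; substitute the AVX base frequency
# of the processor under test.
peak = hpl_peak_gflops(avx_base_ghz=2.0, cores_per_socket=12)
print(f"theoretical peak: {peak:.1f} GFLOPS")
```

Dividing a measured HPL result by this peak gives an efficiency figure of the kind shown inside the bars in Figure 1.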
Table 2: Base frequencies of Intel Broadwell Processors
Base Frequencies of Intel Broadwell processors
Rated base frequency (GHz)
AVX base frequency (GHz)
Theoretical Maximum Performance (GFLOPS)
E5-2699 v4, 22 core, 145W
E5-2698 v4, 20 core, 135W
E5-2697A v4, 16 core, 145W
E5-2690 v4, 14 core, 135W
E5-2650 v4, 12 core, 105W
Table 3 gives more information about the hardware configuration and the benchmarks used for this study.
Table 3: Server and Benchmark details for Intel Xeon E5 v4 processors
As described in table 2
16 x 16GB DDR4 @ 2400 MT/s (Total=256GB)
RHEL 7.2 (3.10.0-327.el7.x86_64)
Snoop modes – OSB, ES, HS and COD
From Intel Parallel Studio 2016 update1
Intel MPI - 5.1.2
Intel Broadwell Processors
Figure 1: HPL performance characterization
Figure 1 shows the HPL characterization of all five Intel Broadwell processors used for this study on the PowerEdge R730 platform. Table 2 shows the TDP values for each of the Broadwell processors. The text value in each bar shows the efficiency of that processor. The “X” value on top of each bar shows the performance gain over the 12 core Broadwell processor. HPL performance does not increase in proportion to the number of cores. For example, the top bin 22 core processor has 83% more cores than the 12 core Broadwell processor but delivers only a 57% HPL performance improvement. The line on the graph shows the HPL performance per core. Because HPL performance does not scale linearly with the number of cores, the performance per core decreases by 8% and 15% for the 20 and 22 core processors respectively.
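The per-core figure for the 22 core processor follows from the two ratios quoted above. The sketch below recomputes it from the rounded numbers in the text (22 vs. 12 cores and a 57% gain). The result is therefore approximate and may differ slightly from the value read off the graph.

```python
# Sketch: change in per-core HPL performance implied by the rounded figures
# in the text (22 vs. 12 cores, 57% higher HPL performance).


def per_core_change(core_ratio: float, perf_ratio: float) -> float:
    """Relative change in performance per core (negative means a drop)."""
    return perf_ratio / core_ratio - 1.0


core_ratio = 22 / 12   # about 1.83, i.e. 83% more cores
perf_ratio = 1.57      # 57% more HPL performance
change = per_core_change(core_ratio, perf_ratio)
print(f"per-core performance change: {change * 100:.1f}%")
```

With these rounded inputs the drop comes out at roughly 14%, in line with the approximately 15% quoted for the 22 core part.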
Figure 2: STREAM (Triad) Performance characterization
The STREAM benchmark calculates the memory bandwidth by counting only the bytes that the user program requested to be loaded or stored. This study uses the results reported by the TRIAD function of the stream bandwidth test.
Figure 2 plots the STREAM (TRIAD) performance for all Broadwell processors used for this study. The bars show the memory bandwidth in GB/s for each of the processors. As per the graph, memory bandwidth is approximately the same across all Broadwell processors. Since the total memory bandwidth is the same, the memory bandwidth per core decreases as the number of cores increases.
BIOS Profiles comparison
Figure 3: Comparing BIOS profiles with HPL
Figure 3 plots HPL performance with two BIOS system profile options for all four snoop modes across all five Broadwell processors. Because the Directory + Opportunistic Snoop Broadcast (DIR+OSB) snoop mode performs well for all workloads and the DAPC system profile balances performance and energy efficiency, these options are set as the defaults in the BIOS and so have been chosen as the baseline.
From this graph, it can be seen that the Cluster-on-Die (COD) memory mode with the “Performance” System Profile setting performs 2 to 4% better than other BIOS profile combinations across all Broadwell processors. Cluster-on-Die (COD) is only supported on processors that have two memory controllers per processor, i.e. 12 or more cores.
Figure 4: Comparing BIOS profiles with STREAM (TRIAD)
Figure 4 shows the STREAM performance characteristics with two BIOS system profile options for all the snoop modes. The Opportunistic Snoop Broadcast (OSB) snoop mode along with the DAPC system profile is chosen as the baseline for this study. Memory bandwidth is almost the same for every BIOS profile combination except those using Early Snoop (ES) mode. With ES mode, memory bandwidth for both system profiles is lower by 8 to 20%, and the difference is most apparent for the 22 core processor, at up to 25%. ES mode has fewer Requester Transaction IDs (RTIDs) distributed across all the cores, while the other snoop modes get more RTIDs, that is, a higher number of credits for local and remote traffic at the home agent.
Figure 5: Comparing HPL Performance across multiple generations of Intel processors
Figure 5 plots a generation over generation performance comparison for HPL with Intel Westmere (WSM), Sandy Bridge (SB), Ivy-Bridge (IVB), Haswell (HSW) and Broadwell (BDW) processors. The percentages on the bars show the HPL performance improvement over the previous generation processor. The graph shows that the 14 core Broadwell processor performs 16% better than the 14 core Haswell processor with similar frequencies for the HPL benchmark. Broadwell processors measure better power efficiencies than the Haswell processors. The top bin 22 core Broadwell processor performs 49% better than the 14 core Haswell processor. The purple diamonds in the graph show the performance per core. The “X” value on top of every bar shows the acceleration over the 6 core WSM processor.
Figure 6: Generation over generation comparison with STREAM
Figure 6 plots a performance comparison of STREAM (TRIAD) for multiple generations of Intel processors. From the graph, it can be seen that the memory bandwidth of the system has increased over generations. The theoretical maximum memory frequency increased by 12.5% in Broadwell over Haswell (2133 MT/s to 2400 MT/s), and this translates into 10 to 12% better measured memory bandwidth as well. However, the maximum core count per socket has increased by up to 22% in Broadwell over Haswell, so the memory bandwidth per core depends on the specific Broadwell SKU. The 20 core and 22 core BDW processors support only ~3 GB/s per core, which is likely to be very low for most HPC applications, while the 16 core BDW is on par with the 14 core HSW at ~4 GB/s per core.
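The frequency and per-core figures above can be put in context with a back-of-the-envelope calculation of theoretical peak bandwidth. This sketch assumes a 64-bit (8 byte) data bus per DDR4 channel, with four channels per socket and two sockets as on the R730. It gives theoretical ceilings only; the per-core values quoted in the text are measured STREAM results and are lower.

```python
# Sketch: theoretical peak memory bandwidth for a dual socket server.
# Assumption: each DDR4 channel moves 8 bytes (64 bits) per transfer.
BYTES_PER_TRANSFER = 8


def peak_bandwidth_gbs(mega_transfers: int, channels_per_socket: int = 4,
                       sockets: int = 2) -> float:
    """Theoretical peak bandwidth in GB/s (1 GB = 1000 MB)."""
    return mega_transfers * BYTES_PER_TRANSFER * channels_per_socket * sockets / 1000.0


bdw = peak_bandwidth_gbs(2400)  # Broadwell with DDR4 at 2400 MT/s
hsw = peak_bandwidth_gbs(2133)  # Haswell with DDR4 at 2133 MT/s
print(f"BDW {bdw:.1f} GB/s, HSW {hsw:.1f} GB/s, ratio {bdw / hsw:.3f}")

# Theoretical ceiling per core for the Broadwell parts in this study.
for cores_per_socket in (12, 14, 16, 20, 22):
    per_core = bdw / (2 * cores_per_socket)
    print(f"{cores_per_socket} cores per socket: {per_core:.2f} GB/s per core")
```

The ratio of the two peaks reproduces the 12.5% frequency increase, and the per-core ceiling falls as cores are added because the channel count stays fixed.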
The performance of all Broadwell processors used for this study is higher for both the HPL and STREAM benchmarks. There is a ~12% increase in measured memory bandwidth for Broadwell processors compared to Haswell processors. Broadwell processors measure better power efficiencies than the Haswell processors. In conclusion, Broadwell processors may fulfill the demand for more compute power for HPC applications.
The Ninth Annual National Meeting for the South African Center for High Performance Computing was held in early December 2015 in Pretoria, SA. South Africa has become the focus of regional and international interest in the tech and science communities due to the Square Kilometer Array (SKA) being built in the Karoo region. When completed, it will be the world’s biggest radio telescope with an expected 50-year lifespan. The investment in the SKA will benefit the area and 15 member states as a whole as a result of improvements to the power grid, high-speed networks, and workforce development. Phase One of the construction project is scheduled to begin in 2018, with early science and data generation following by 2020.
Many well-known experts were on hand at the symposium. CHPC’s Director Happy Sithole talked about the growth of the Cape Town center since launching in 2007. It was the only center of its kind on the African continent at the time, and supported 15 researchers with 2.5 teraflops. Now it supports 700 with 64 teraflops of power with expansion driven by demand. Sithole also announced the addition of a new Dell system to be added in two phases, which will increase capacity to 1,000 teraflops, operational in early 2016.
Merle Giles (National Center for Supercomputing Applications) gave the opening address, titled “HPC-Enabled Innovation and Transformational Science & Engineering: The Role of CI.” Of note, he spoke about the funding gap between the foundational research usually led by universities (or start-ups) and the commercialization phase where industry picks up. Furthermore, data supports the ROI of HPC investments. Giles also highlighted the importance of President Obama’s state of the union address, which translated HPC’s role in enabling medical advances into benefits for the average citizen. This past November he spoke to Dell at SC15 about the impact of HPC on third world countries. You can watch his observations here.
Additionally, a talk by Simon Hodson of CODATA highlighted the critical importance of allowing open access to the data behind research findings. Rudolph Pienaar of Boston Children’s Hospital discussed data challenges within the healthcare field. Specifically, hospital systems are antiquated and siloed, designed to facilitate billing and protect privacy, which obstructs research and collaboration. Children’s has designed an innovative system that overcomes these challenges, known as the Boston Children’s Hospital Research Integration System (ChRIS). ChRIS is a web-based research integration system that can manage any datatype and is uniquely suited to medical image data, providing the ability to seamlessly collect data from typical sources found in hospitals.
An important aspect of the forum was a discussion regarding best practices on how to manage data sharing across national borders, which has been a point of concern for the Southern African Development Community (SADC). The organization held a meeting, which included first time delegates from Mauritius, Namibia and Seychelles, to review the collaborative framework document that last year’s forum attendees had begun to draft. The SADC delegates were counseled by the international advisers to focus on collaboration. They were also warned that finding the best way to transfer data reliably, securely and seamlessly among SADC sites was integral to the success of the project. It was agreed that cybersecurity should be a first priority.
The meeting concluded with a plan for another conference. This one will be held in Botswana in April 2016. For more information, you can read the recent article on HPCWire.
By David Griggs
Ohio’s academic, science and technology communities will be getting quite a lift this year as the Ohio Supercomputer Center (OSC), an OH-TECH Consortium member, adds a powerful new supercomputer from Dell. The enhancement is part of a $9.7 million investment recently approved by Ohio’s State Controlling Board and stems from a $12 million appropriation included in the 2014-15 Ohio biennial capital budget.
OSC is a regional center, founded in 1987, that provides supercomputing services and expertise to local industries and university researchers. Currently, it offers computational services via three supercomputer clusters: the IBM/AMD Glenn Cluster, the HP/Intel Ruby Cluster, and the HP/Intel Oakley Cluster. The Dell supercomputer will be replacing the Glenn Cluster and part of the Oakley Cluster, adding a much-needed increase in computing power and storage, as the center is running near peak capacity. The center’s interim director, Dr. David Hudak, Ph.D., expects that this new addition will greatly help industrial and academic clients alike, fostering new research and innovation.
For more information, please see the recent articles on insideHPC and HPCWire.
The 2016 International Supercomputing Conference is being held in Frankfurt, Germany this coming June, and with it the fifth annual Student Cluster Competition. Once again Team South Africa, co-sponsored by the Centre for High Performance Computing (CHPC) and Dell, will be competing for a winning title.
The team is led by CHPC’s David MacLeod, who is responsible for introducing the cluster competition to South African students, putting together the first official team in 2011 and leading all subsequent teams. David has an impressive record, with two first place teams and one second place team. His eyes are on clinching the 2016 title at ISC this year, but in addition he aims to raise awareness of HPC as a transformative technology in South Africa, and attract more students to the field.
This year’s team consists of six bright young students from the University of the Witwatersrand and two reserves from Stellenbosch University who will face off against 11 other teams from around the world. These student squads will compete over a three-day period to build a small cluster computer of their own design and run a series of HPC benchmarks and applications. In preparation for the competition, Team South Africa spent a week at Dell’s Round Rock campus to meet with HPC experts, check out our next-generation HPC and thermal labs, become familiar with the cluster systems and receive hands-on tutorials and feedback sessions. A special treat for the South African students was a sit down with Jim Ganthier, Dell’s Head of HPC, and Ed Turkel, Dell’s HPC Strategist, to learn more about pursuing a career in HPC.
Both Jim and Ed discussed the recent progress in the HPC industry, and just how far it has come from the days they worked on monolithic systems, before the advent of x86 servers and clusters. They also talked about how HPC is going beyond the world of academia and scientific research, thanks to the explosive growth in big data, and how companies like Dell are leading the charge to bring HPC to mainstream audiences, with the hopes that the students of today will help make that vision a reality. The students had many questions, from the medical applications of HPC, to how the democratization of HPC will affect the way business leaders look at technology, to what a career in electrical engineering would look like in relation to HPC.
For more on the South African students’ visit to Dell, please visit Perrin Cox’s post.
By Olumide Olusanya and Munira Hussain
This is the second part of this blog series. In the first post, we shared OSU Micro-Benchmarks (latency and bandwidth) and HPL performance between FDR and EDR Infiniband. In this part, we will further compare performance using additional real-world applications such as ANSYS Fluent, WRF, and NAS Parallel Benchmarks. For my cluster configuration, please refer to part 1.
Fluent is a Computational Fluid Dynamics (CFD) application used for engineering design and analysis. It can be used to simulate the flow of fluids, with heat transfer, turbulence and other phenomena, involved in various transportation, industrial and manufacturing processes.
For this test we ran Eddy_417k, which is one of the problem sets from the ANSYS Fluent Benchmark suite. It is a reaction flow case based on the eddy dissipation model. It has around 417,000 hexahedral cells and is a small dataset with a high communication overhead.
Figure 1 - ANSYS Fluent 16.0 (Eddy_417k)
From Figure 1 above, EDR shows a wide performance advantage over FDR as the number of cores increases to 80. We continue to see an even wider difference as the cluster scales. While FDR’s performance seems to gradually taper off after 80 cores, EDR’s performance continues to scale as the number of cores increases, and EDR performs 85% better than FDR on 320 cores (16 nodes).
WRF (Weather Research and Forecasting)
WRF is a modelling system for weather prediction. It is widely used in atmospheric and operational forecasting research. It contains two dynamic cores, a data assimilation system, and a software architecture that allows for parallel computation and system extensibility. For this test, we are going to study the performance of a medium size case, Conus 12km.
Conus 12km is a 12 km resolution case over the Continental US domain. The benchmark is run for 3 hours, after which we take the average of the time per time step.
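Averaging the time per time step can be scripted against the timing lines that WRF writes to its log. The sketch below is illustrative only: the log line layout is an assumption about what WRF typically prints to its rsl output files, the sample values are made up, and the function name is hypothetical.

```python
import re

# Assumed layout of a WRF timing line (not taken from this blog):
#   Timing for main: time <timestamp> on domain   1:    <seconds> elapsed seconds
TIMING_RE = re.compile(
    r"Timing for main: time \S+ on domain\s+\d+:\s+([0-9.]+)\s+elapsed seconds"
)


def mean_step_time(lines):
    """Average the elapsed seconds over all main time step lines."""
    times = []
    for line in lines:
        match = TIMING_RE.search(line)
        if match:
            times.append(float(match.group(1)))
    if not times:
        raise ValueError("no timing lines found")
    return sum(times) / len(times)


# Made-up sample lines for illustration; a real run would read the log file.
sample = [
    "Timing for main: time 2001-10-25_00:01:12 on domain   1:    2.50000 elapsed seconds",
    "Timing for Writing wrfout file for domain        1:    0.75000 elapsed seconds",
    "Timing for main: time 2001-10-25_00:02:24 on domain   1:    1.50000 elapsed seconds",
]
print(f"average time per time step: {mean_step_time(sample):.3f} s")
```

Lines that report I/O time are skipped here so that only compute time steps contribute to the average; whether to exclude them is a choice to make per study.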
Figure 2 - WRF (Conus12km)
Figure 2 shows both EDR and FDR scaling almost linearly and performing almost equally until the cluster scales to 320 cores, where EDR performs better than FDR by 2.8%. This performance difference, which may seem small, is significantly higher than my highest run-to-run variation of 0.005% between three successive EDR and FDR 320-core tests.
HPC Advisory Council’s result here shows a similar trend with the same benchmark. From their result, we can see that the performances are neck and neck until the 8 and 16-node run where we see a small performance gap. Then the gap widens even more in the 32-node run and EDR posts a 28% better performance than FDR. Both results show that we could see an even higher performance advantage with EDR as we scale beyond 320 cores.
NAS Parallel Benchmarks
NPB is a suite of benchmarks developed by the NASA Advanced Supercomputing Division. The benchmarks are designed to test the performance of highly parallel supercomputers, and all of them mimic large-scale, commonly used computational fluid dynamics applications in their computation and data movement. For my test, we ran four of these benchmarks: CG, MG, FT, and IS. In the figures below, the performance difference is shown in an oval right above the corresponding run.
Figure 3 - CG
Figure 4 - MG
Figure 5 - FT
Figure 6 - IS
CG is a benchmark that computes an approximation of the smallest eigenvalue of a large, sparse, symmetric positive-definite matrix using a conjugate gradient method. It also tests irregular long-distance communication between cores. From Figure 3 above, EDR shows a 7.5% performance advantage with 256 cores.
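For readers unfamiliar with the method, the following is a minimal serial sketch of a conjugate gradient solver for Ax = b in plain Python. It is not the NPB implementation: the benchmark works on a large sparse matrix distributed across processes, while this toy version uses a small dense matrix, and the helper names dot and matvec are made up for illustration.

```python
def dot(u, v):
    """Inner product of two vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))

def matvec(A, v):
    """Dense matrix-vector product; the distributed analogue is where communication occurs."""
    return [dot(row, v) for row in A]

def conjugate_gradient(A, b, tol=1e-10, max_iter=100):
    """Solve A x = b for a symmetric positive-definite A, starting from x = 0."""
    x = [0.0] * len(b)
    r = list(b)            # residual b - A x, which equals b when x = 0
    p = list(r)            # initial search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:   # stop once the residual norm is small enough
            break
        Ap = matvec(A, p)
        alpha = rs_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x

# Small symmetric positive-definite example; the exact solution is [1/11, 7/11].
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
print(conjugate_gradient(A, b))
```

In a parallel run, each dot product needs a global reduction and each sparse matrix-vector product needs data from other processes, which is one reason this kind of kernel is sensitive to the interconnect.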
MG solves a 3-D Poisson partial differential equation. The problem in this benchmark is simplified in that it has constant rather than variable coefficients, which a more realistic application would have. It tests both short- and long-distance communication between cores. Unlike CG, its communication patterns are highly structured. From Figure 4, EDR performs better than FDR by 1.5% on our 256-core cluster.
FT solves a 3-D partial differential equation using FFTs. It also tests long-distance communication performance and shows a 7.5% performance gain with EDR on 256 cores, as seen in Figure 5 above.
IS, a large integer sort application, tests both integer computation speed and communication performance between cores. From Figure 6, we can see a 12% EDR advantage with 128 cores, which increases to 16% on 256 cores.
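As a rough illustration of the integer-sort kernel, here is a serial counting sort in Python, one common way to sort integer keys that fall in a known range. It is a sketch under that assumption and not the benchmark's parallel code; the function name counting_sort and the sample keys are hypothetical.

```python
def counting_sort(keys, max_key):
    """Sort non-negative integer keys in the range 0..max_key by counting occurrences."""
    counts = [0] * (max_key + 1)
    for key in keys:
        counts[key] += 1          # histogram of key values
    ordered = []
    for value, count in enumerate(counts):
        ordered.extend([value] * count)   # emit each value as often as it occurred
    return ordered

print(counting_sort([3, 1, 4, 1, 5], 5))  # [1, 1, 3, 4, 5]
```

The serial version does very little arithmetic per key. In a parallel integer sort, keys generally have to be redistributed among processes, so the time spent moving data over the network can dominate, which is consistent with the larger EDR advantage seen for IS.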
In both blogs, we have shown several micro-benchmark and real-world application results to compare FDR with EDR InfiniBand. In these results, EDR has shown higher performance and better scaling than FDR on our 16-node Dell PowerEdge C6320 cluster. Some applications have also shown a wider performance margin between the two interconnects than others. This reflects the nature of the applications being tested: communication-intensive applications tend to gain more from a faster network than compute-intensive applications do. Furthermore, because of our cluster size, we were only able to test the scalability of the applications on 16 servers (320 cores). In the future, we plan to run these tests again on a larger cluster to further examine the performance difference between EDR and FDR.