Authors: Mayura Deshmukh, Ashish K Singh, Neha Kashyap
With the refresh of Dell’s 13th generation servers with the recently released Broadwell (BDW) processors, some obvious questions come to mind such as how the new processors compare with the older generation processors. This blog, fourth in the series of “Broadwell Performance for HPC,” focuses on answering this question. It compares the performance of various CAE applications for five Broadwell Intel Xeon E5-2600 v4 series processor models with previous generation Intel processors.
Last week’s blog talked about the impact of BIOS options for each of the CAE applications. Here we focus on how much better the Broadwell processors perform than the previous generation Haswell (HSW) and Ivy Bridge (IVB) processors for these CAE applications. Table 1 shows the applications that we are comparing and Table 2 describes the server configuration used for the study. The older-generation results used different software versions (the latest available at the time) from those listed in Table 1: this applies to the LS-DYNA benchmarks run on IVB and HSW (sse binary) and to the ANSYS Fluent benchmarks run on Westmere (WSM), Sandy Bridge (SB), Ivy Bridge (IVB) and HSW. For STAR-CCM+ and OpenFOAM, the same versions were used for the HSW and BDW runs.
Table 1 - Applications and benchmarks
Platform MPI 9.1.0
Average Elapsed time
Platform MPI 9.1.3
Platform MPI 22.214.171.124
Open MPI 1.10.0
Table 2 - Server configuration
256GB - 16 x 16GB 2400 MHz DDR4 RDIMMs
6 x 300GB SAS 6Gbps 10K rpm
PERC H330 mini
Red Hat Enterprise Linux 7.2 (3.10.0-327.el7.x86_64)
System profile - Performance
Logical Processor - Disabled
Power Supply Redundant Policy - Not Redundant
Power Supply Hot Spare Policy - Disabled
I/O Non-Posted Prefetch - Disabled
Snoop Mode - Opportunistic Snoop Broadcast (OSB) for OpenFOAM and Cluster on Die (COD) for all the other applications
Node interleaving - Disabled
Figure 1 compares the performance of the five BDW Intel Xeon E5-2600 v4 series processors models with HSW Intel Xeon E5-2600 v3 series processors and IVB E5-2680 v2 for LS-DYNA car2car benchmark (with end time set to 0.02).
Figure 1: IVB vs. HSW vs BDW for LS-DYNA
The performance for all the processors is compared to the E5-2680 v2, which is shown as the red baseline set at 1. The green bars show the performance of the HSW processors with the LS-DYNA single precision sse binary, the grey bar represents the HSW E5-2697 v3 with the single precision avx2 binary, the blue bars show the BDW processors with the single precision sse binary and the orange bars represent the BDW processors with the single precision avx2 binary. For BDW, the avx2 binaries perform 12-19% better than the sse binaries across all the processor models. The purple diamonds describe the performance per core compared to the E5-2680 v2. The percentages at the top of the orange BDW avx2 bars describe the improvement of the BDW processors over the HSW E5-2697 v3 with avx2 (the grey bar in the graph). The 12 core BDW E5-2650 v4, which has fewer cores and a lower frequency, understandably performs 11% lower than the Haswell E5-2697 v3. The 14 core E5-2690 v4, which has the same number of cores and similar avx2 frequencies, performs 7% better than the E5-2697 v3; this can be attributed to the increase in memory bandwidth with Broadwell. BDW processors also measure better power efficiency than Haswell processors. The performance of the 16 core, 20 core and 22 core processors is 16 to 30% higher than the HSW E5-2697 v3 (avx2). Considering the performance, the performance per core and the higher memory bandwidth per core, the E5-2690 v4 (14 cores) and E5-2697A v4 (16 cores) look like attractive options for CAE/CFD codes, particularly when considering per core licensing costs.
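The relative and per core numbers plotted in graphs like Figure 1 can be reproduced from raw elapsed times with a small helper. The blog does not spell out the formula, so the sketch below assumes that performance is the inverse of elapsed time and that per core performance is normalized by the total core count of each system. The timings and core counts in the example are hypothetical, not measured values from this study.

```python
def relative_performance(baseline_time: float, time: float) -> float:
    """Speedup over the baseline for a lower-is-better metric such as elapsed time."""
    return baseline_time / time


def relative_per_core(baseline_time: float, baseline_cores: int,
                      time: float, cores: int) -> float:
    """Relative performance scaled by core count (assumed normalization)."""
    return relative_performance(baseline_time, time) * baseline_cores / cores


# Hypothetical numbers for illustration only:
# baseline finishes in 1000 s on 20 cores, the newer system in 500 s on 28 cores.
print(relative_performance(1000.0, 500.0))
print(round(relative_per_core(1000.0, 20, 500.0, 28), 3))
```

A value above 1 for the per core figure means each core delivers more work than a baseline core, which is the quantity that matters when software is licensed per core.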
CD-adapco’s STAR-CCM+ is another CFD application widely-used by industry for solving problems involving fluid flows, heat transfer, and other phenomena. STAR-CCM+ shows similar performance patterns to LS-DYNA.
Figure 2: HSW vs BDW for STAR-CCM+
Figure 2 compares the performance of the five BDW Intel Xeon E5-2600 v4 series processor models (shown as the five bars in the graph) with the HSW E5-2697 v3, shown as the red line set at one. The numbers at the top of the bars show the per core performance relative to the E5-2697 v3. As seen from the bars, the relative performance of the 14 core, 16 core, 20 core and 22 core processors is higher by 8% to 40% across all the benchmarks. The 12 core E5-2650 v4, with fewer cores and a lower frequency, performs 11-20% lower than the E5-2697 v3. Similar to LS-DYNA, the per core performance of the 14 core and 16 core processors is 2% to 11% better than the HSW E5-2697 v3, making them good options for STAR-CCM+ as well.
ANSYS Fluent is a computational fluid dynamics application. The graph in Figure 3 shows the performance of truck_poly_14m for Sandy Bridge (SB), Ivy Bridge (IVB), HSW and BDW processors compared to the Westmere (WSM) processor, shown as the red line set at one.
Figure 3: WSM vs. SB vs. IVY vs. HSW vs. BDW for ANSYS Fluent
The Fluent benchmark exhibits a pattern similar to the LS-DYNA and STAR-CCM+ benchmarks. The purple diamonds in Figure 3 describe the performance per core compared to the WSM 2.93GHz processor. The percentages at the top of the blue BDW bars describe the improvement of the BDW processors over the HSW E5-2697 v3 (the green bar in the graph). The 12 core BDW E5-2650 v4, which has fewer cores and a lower frequency, performs 14% lower than the Haswell E5-2697 v3. With higher performance per core and higher memory bandwidth per core, the E5-2690 v4 (14 cores) and E5-2697A v4 (16 cores) are good options, particularly when considering per core software licensing costs; they perform 11% and 21% better than the E5-2697 v3, respectively. The 20 and 22 core BDW processors perform 32-39% better than the HSW E5-2697 v3.
OpenFOAM (Open source Field Operation And Manipulation) is a free, open source software for computational fluid dynamics (CFD).
Figure 4: HSW vs. BDW for OpenFOAM Motorbike 11M benchmark
As shown in Figure 4 for the OpenFOAM Motorbike 11M benchmark, all the Broadwell processors perform 12% to 21% better than the Haswell E5-2697 v3 processor, shown as the red line set at one. Per core performance for the 16 core, 14 core and 12 core processors is 4% to 30% better than the E5-2697 v3. The performance of the 20 core and 22 core BDW processors is the same for the Motorbike 11M benchmark. The additional cores in the 20 and 22 core parts do not provide a significant performance boost, likely due to the lower memory bandwidth per core, as explained in the first blog’s STREAM results.
Along with more cores than HSW, BDW measures better power efficiency than HSW. Looking at the absolute performance, performance per core and the higher memory bandwidth per core, the E5-2690 v4 (14 cores) and E5-2697A v4 (16 cores) are attractive options for CAE/CFD codes, particularly if per-core licensing costs are involved. For applications like OpenFOAM (motorbike case), all the BDW processors performed better than the Haswell E5-2697 v3, but the additional cores in the 20 and 22 core parts do not provide a significant performance boost due to the lower memory bandwidth per core.
Last week’s blog on the “Broadwell Performance for HPC” series described the BIOS options and compared performance across generations of processors for molecular dynamic applications (NAMD) and Weather Research and Forecasting (WRF). This blog, third in the series, focuses on BIOS options for some HPC CAE applications for five different Broadwell Intel Xeon E5-2600 v4 series processor models. It aims to answer questions like, which snoop mode works best for my application and processor? Which BIOS System Profile would give the best performance?
There have been a few changes in the BIOS options for Broadwell as compared with the previous generation (Haswell). One of the major additions in the Broadwell BIOS is the “Opportunistic Snoop Broadcast” snoop mode in the Memory settings. This blog discusses performance of the applications for all four snoop modes: Opportunistic snoop broadcast (OSB), Early snoop (ES), Home snoop (HS) and Cluster on die (COD). For more information on the new BIOS options and snoop modes check blog one of this series.
The Dell BIOS “System Profile” setting can be set to any one of four pre-configured profiles, Performance Per Watt (DAPC), Performance Per Watt (OS), Performance (Perf.) and Dense Configuration, or set to Custom. In the pre-configured profiles, Turbo Boost, C States, C1E, CPU Power Management, Memory Frequency, Memory Patrol Scrub, Memory Refresh Rate and Uncore Frequency are preset, whereas with Custom the user can choose values for these options. For more information on System Profiles check the link. DAPC and OS have been shown to perform similarly in past studies, and Dense Configuration performs lower for HPC workloads, so we focus on the DAPC and Performance profiles in this study. The DAPC (Dell Active Power Control) profile relies on a BIOS-centric power control mechanism. Energy efficient turbo, C States and C1E are enabled with the DAPC profile. The Performance profile disables power saving features such as C-states, energy efficient turbo and C1E. Turbo Boost is enabled in both System Profiles.
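The difference between the two profiles studied here can be summarized as a small lookup table. This is only a sketch of the settings named in the paragraph above; the dictionary keys are hypothetical labels chosen for illustration and are not the attribute names used by the Dell BIOS or any Dell management tool.

```python
# Settings as described in the text; key names are illustrative, not real BIOS attributes.
PROFILES = {
    "DAPC": {
        "turbo_boost": True,
        "c_states": True,
        "c1e": True,
        "energy_efficient_turbo": True,
    },
    "Performance": {
        "turbo_boost": True,
        "c_states": False,
        "c1e": False,
        "energy_efficient_turbo": False,
    },
}


def profile_differences(a: str, b: str) -> list[str]:
    """Return the settings whose values differ between two profiles."""
    return sorted(k for k in PROFILES[a] if PROFILES[a][k] != PROFILES[b][k])


print(profile_differences("DAPC", "Performance"))
```

The output lists only the power saving features, which reflects the point made above: Turbo Boost is on in both profiles, so any gain from the Performance profile comes from disabling the power saving states.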
This blog discusses the performance of CAE applications with DAPC and Performance profile for each of the four snoop modes for five different Intel Xeon E5-2600 v4 series Broadwell processors. Table 1 shows the application and benchmark details and Table 2 describes the server configuration used for the study.
System Profile - Performance and Performance Per Watt (DAPC)
Snoop Mode - Opportunistic Snoop Broadcast (OSB), Early Snoop (ES), Home Snoop (HS), Cluster on Die (COD)
LS-DYNA is a general-purpose finite element program from LSTC capable of simulating complex real-world structural mechanics problems. We ran the car2car benchmark with endtime set to 0.02 with both the single precision avx2 and the single precision sse LS-DYNA binaries.
Figure 1: Comparing snoop modes and BIOS Profiles for LS-DYNA
The left graph in Figure 1 shows how much better or worse the different snoop modes perform compared to the default setting of snoop mode = OSB and BIOS profile = DAPC (which is set at 1, the red line on the graph). Just changing the snoop mode to COD increases performance by 1-3% with either BIOS profile across all the processor models. COD is closely followed by OSB, followed by ES for lower core counts and HS for the 16, 20 and 22 core processors. With ES mode, the system starts paying the penalty of having fewer request tokens per core at higher core counts compared to the other snoop modes (e.g. 128/14 = 9 per core for 14 cores vs. 128/22 = 5 per core for 22 cores). All the snoop modes with the System Profile set to Performance follow a pattern similar to DAPC. As shown in the graph on the right in Figure 1, changing the System Profile from DAPC to Performance can provide up to 2% performance benefit. COD.Perf is the best option, about 2-4% better than OSB.DAPC across all processor models. This total 2-4% improvement with COD.Perf is due partly to the change in snoop mode and partly to the change in the BIOS System Profile to Performance. We ran the car2car benchmark for all the combinations above with the sse LS-DYNA binary as well and noted similar behavior, with the Performance System Profile and COD snoop mode being 2-6% better than the default OSB.DAPC. The avx2 binaries performed 12-19% better than the sse binaries across all the processor models.
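The request token arithmetic quoted for ES mode can be extended to all five processor models. The sketch below simply applies the same integer division used in the text (128 tokens shared among the cores of one processor); the function name is ours, and the real hardware allocation may be more involved than an even split.

```python
TOTAL_REQUEST_TOKENS = 128  # figure quoted in the text for Early Snoop mode


def tokens_per_core(cores: int, total: int = TOTAL_REQUEST_TOKENS) -> int:
    """Whole request tokens available per core, assuming an even split."""
    return total // cores


for cores in (12, 14, 16, 20, 22):
    print(f"{cores} cores: {tokens_per_core(cores)} tokens per core")
```

Going from 12 to 22 cores halves the tokens available to each core (10 down to 5), which is consistent with ES falling behind the other snoop modes on the higher core count parts.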
CD-adapco® STAR-CCM+ is another CFD application widely-used by industry for solving problems involving fluid flows, heat transfer, and other phenomena. The STAR-CCM+ benchmarks results show a pattern similar to LS-DYNA in terms of snoop mode and System Profile.
Figure 2: Comparing snoop modes for STAR-CCM+
Figure 2 compares the snoop modes for the Civil_20m and Lemans_17m benchmarks. For simplicity, only data for these two benchmarks are shown; the other benchmark datasets show results similar to the patterns in Figure 2. The BIOS profile in the graphs is set to DAPC and the snoop modes are compared against the default OSB snoop mode (which is set at 1, the red line on the graph). COD is the best option for the Civil_20m benchmark: it is about 2-3% better with DAPC, and 4-6% better with the Performance System Profile (not shown in the graph). COD is followed by OSB and then ES for smaller core counts. Performance with ES, though, starts dropping as the core count increases, similar to what was observed with the LS-DYNA car2car benchmark. The HlMach10 benchmark shows a pattern similar to the Civil_20m benchmark. For the HlMach10 benchmark the COD.Perf option is 2-7% better than the default OSB.DAPC.
All the other benchmarks (EglinStoreSeparation, Kcs, Lemans_100m, Reactor9m, TurboCharger, Vtm) show a pattern similar to Lemans_17m. COD and OSB perform similarly, with only ~1% difference between them across the benchmark cases and processor models. After COD and OSB, ES is the better option for lower core counts and HS for the 16, 20 and 22 core processors. As mentioned previously, the system in ES mode starts paying the penalty of having fewer request tokens per core at higher core counts compared to the other snoop modes.
Figure 3: DAPC vs. Performance with COD snoop mode for STARCCM+
The graph in figure 3 compares the System Profile BIOS options DAPC and Performance. We are comparing the performance of COD.Perf with respect to COD.DAPC, which is the red baseline set at 1 in the graph. The Performance profile provides 2-4% benefit over the DAPC for the Civil_20m benchmark for all the processor models. Also for the high core count, E5-2699 v4 the Performance profile performs 2-5% better across all the benchmarks. For all the other processor models there is not a significant gain (only about 1%) with the Performance profile for all the benchmarks (except Civil_20m).
ANSYS Fluent is a computational fluid dynamics application. Fluent provides multiple benchmark cases. We picked four representative cases from the v16 benchmark suite: combustor_12m, combustor_71m, exhaust_system_33m and ice_2m and one from the older v15 benchmark suite: truck_poly_14m, to allow us to compare our data with previous generation processor models. The Fluent benchmarks exhibit a similar pattern as LS-DYNA and STAR-CCM+ benchmarks.
Figure 4: Comparing snoop modes for ANSYS Fluent
The graph in Figure 4 shows the performance of truck_poly_14m for all the snoop modes compared to the default OSB.DAPC which is shown as the red baseline in the graphs. All the other benchmarks show a similar pattern. COD performs up to 2% better than OSB for truck_poly_14m, combustor_12m and ice_2m. COD is about 5% better for combustor_71m and 6% better for exhaust_33m. COD is followed by OSB, followed by ES for lower core counts and HS for higher core count processors for all the benchmarks.
Figure 5: DAPC vs. Performance with COD snoop mode for ANSYS Fluent
Figure 5 shows the performance for Performance profile with respect to DAPC with COD set as the snoop mode for both options. DAPC is shown as the red baseline in the graph. The Performance BIOS profile option is about 4% better for all the processor models for the larger combustor_71m and exhaust_33m benchmark cases. The Performance profile is 1-3% better for the other benchmark cases.
OpenFOAM (Open source Field Operation And Manipulation) is a free, open source software for computational fluid dynamics (CFD). OpenFOAM was compiled with -march=native / Broadwell option. We used the cavity-1M and motorBike-11M datasets which are modifications of the OpenFOAM tutorials/incompressible/icoFoam/cavity and tutorials/incompressible/simpleFoam/motorBike models respectively.
Figure 6: Comparing snoop modes and BIOS Profiles for OpenFOAM Cavity 1M benchmark
As shown in left graph of figure 6 for DAPC System Profile, the benchmark performance increases by 3-6% when in COD snoop mode when compared to OSB. ES and HS options perform up to 3% lower than OSB across all the processor models. The pattern is similar for the Performance System Profile, where COD is better by 3-7% followed by OSB. HS is lower than OSB but better than ES for all the processors models except for the 20core E5-2698 v4 where ES is 1% better than HS for DAPC profile and 7% better than HS for Performance System Profile. There is not a lot of difference in performance for DAPC Vs Performance profile especially for the higher frequency processors 14core E5-2690v4 and the 16core E5-2697A v4. For the other models the Performance profile shows up to 4% benefit as shown in the right graph of figure 6.
Figure 7: Comparing snoop modes and BIOS Profiles for OpenFOAM Motorbike 11M benchmark
For the OpenFOAM Motorbike 11M benchmark, the OSB, COD and HS snoop modes perform similarly, with about 1% variation. The performance with ES is low across all the processor models and keeps dropping as the number of cores increases, as shown in the left graph of Figure 7. The snoop modes with the BIOS System Profile set to Performance follow the same trend. As shown in the right graph of Figure 7, the DAPC and Performance profiles show similar performance, with Performance about 1% better in most cases except for the E5-2697A v4, where DAPC.COD was 2% better.
Most of the datasets used in this study show an advantage for COD mode, but COD benefits codes that are highly NUMA optimized and whose dataset fits into the NUMA memory (that is, half of each socket’s memory capacity). OSB is a close second and a good option for codes with varying levels of NUMA optimization; OSB is also the default memory snoop BIOS option. HS and ES perform slightly lower than COD and OSB. ES is better than HS for lower core counts, but as the core count increases ES starts paying the penalty of having fewer request tokens per core compared to the other snoop modes. In terms of System Profile, the Performance profile performs slightly better than DAPC in most cases.
Be sure to check back next week for the last blog in the series, which will compare the performance of HPC CAE applications across generations (Ivy Bridge vs. Haswell vs. Broadwell).
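To make the "half of each socket's memory capacity" remark concrete, the sketch below computes the size of one COD NUMA domain and checks whether a working set would fit in it. The 256 GB and two socket values are the configuration used in this blog series; the function names are ours, and treating "the dataset fits" as a simple size comparison is a simplifying assumption.

```python
def cod_numa_domain_gb(total_memory_gb: float, sockets: int = 2) -> float:
    """Memory in one NUMA domain when COD splits each socket in two."""
    per_socket = total_memory_gb / sockets
    return per_socket / 2


def fits_in_cod_domain(working_set_gb: float, total_memory_gb: float,
                       sockets: int = 2) -> bool:
    """True if the working set is no larger than one COD NUMA domain."""
    return working_set_gb <= cod_numa_domain_gb(total_memory_gb, sockets)


print(cod_numa_domain_gb(256))        # 64.0 GB per NUMA domain
print(fits_in_cod_domain(48, 256))    # a hypothetical 48 GB working set
```

With the 256 GB, dual socket configuration, each of the four COD NUMA domains holds 64 GB.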
Authors: Ashish K Singh, Mayura Deshmukh, Neha Kashyap
This blog describes the performance of Intel Broadwell processors with the HPC applications Weather Research and Forecasting (WRF) and NAnoscale Molecular Dynamics (NAMD). This is the second blog in the series of four blogs on “Performance study of Intel Broadwell”. The first blog characterizes the Broadwell-EP processors with HPC benchmarks like HPL and STREAM. This study compares five different Broadwell processors, E5-2699 v4 @2.2GHz (22 cores), E5-2698 v4 @2.2GHz (20 cores), E5-2697A v4 @2.6GHz (16 cores), E5-2690 v4 @2.6GHz (14 cores) and E5-2650 v4 @2.2GHz (12 cores), in PowerEdge 13th generation servers. It characterizes the performance of the system for WRF and NAMD by comparing the five Broadwell processor models with previous generations of Intel processors. For the generation over generation comparison, previous results from Intel Xeon X5600 series Westmere (WSM), Intel Xeon E5-2600 series Sandy-Bridge (SB), Intel Xeon E5-2600 v2 series Ivy-Bridge (IVY), Intel Xeon E5-2600 v3 series Haswell (HSW) and Intel Xeon E5-2600 v4 series Broadwell (BDW) processors were used. This blog also describes the impact of BIOS tuning options on WRF and NAMD performance with Broadwell. Table 1 below lists the server configuration and the application details for the Broadwell processor based tests. The software versions were different for the older generation processors, and results are compared against what was the best configuration at that time. Because of the big architectural changes in servers and processors from generation to generation, the changes in software versions are not a significant factor.
Table 1: Details of Server and HPC Applications used with Intel Broadwell processors
Dell PowerEdge R730
E5-2699 v4 @2.2GHz, 22 core, 145W
E5-2698 v4 @2.2GHz, 20 core, 135W
E5-2697A v4 @2.6GHz, 16 core, 145W
E5-2690 v4 @2.6GHz, 14 core, 135W
E5-2650 v4 @2.2GHz, 12 core, 105W
16 x 16GB DDR4 @ 2400MHz (Total=256GB)
2 x 1100W
System profile – Performance and Performance Per Watt (DAPC)
Logical Processor – Disabled
Power Supply Redundant Policy – Not Redundant
Power Supply Hot Spare Policy – Disabled
Snoop modes – COD, ES, HS and OSB
Node Interleaving - Disabled
From Intel Parallel studio 2016 update1
Intel MPI – 5.1.2
Weather Research and Forecasting (WRF) is an HPC application used for atmospheric research. The WRF model is a next-generation mesoscale numerical weather prediction system designed for both atmospheric research and operational forecasting needs. This serves a wide range of meteorological applications across scales from tens of meters to thousands of kilometers. WRF can generate atmospheric simulations using real data or idealized conditions. We used the CONUS12km and CONUS2.5km benchmarks for this study.
CONUS12km is a small, single domain benchmark (48 hours, 12km resolution case over the Continental U.S. (CONUS) domain from October 24, 2001) with a time step of 72 seconds. CONUS2.5km is a large, single domain benchmark (the latter 3 hours of a 9 hour, 2.5km resolution case over the Continental U.S. (CONUS) domain from June 4, 2005) with a time step of 15 seconds.
WRF decomposes the domain into tasks or patches. Each patch can be further decomposed into tiles that are processed separately, but by default there is only one tile for every run. If the single tile is too large to fit into the cache of the CPU and/or core, computation slows down because WRF is sensitive to memory bandwidth. To reduce the size of the tile, the number of tiles can be increased by defining “numtiles = x” in the input file or by defining the environment variable “WRF_NUM_TILES = x”. For both CONUS 12km and CONUS 2.5km, the number of tiles was chosen based on best performance. The best tile count depends on the workload and the hardware configuration. Table 2 shows more detail on the number of tiles used in this study for best performance.
Table2: Parameters used in WRF for best performance
Total no. of cores
NAMD is one of the HPC applications used in molecular dynamics research. It is a portable, parallel and object oriented molecular dynamics code designed for high-performance simulations of large bio molecular systems. NAMD is developed using charm++. Molecular Dynamics simulations of bio molecular systems are an important technique for our understanding of biological systems. This study has been performed with three NAMD benchmarks ApoA1 (92,224 Atoms), F1ATPase (327,506 Atoms) and STMV (virus, 1,066,628 Atoms). In the context of number of atoms, these benchmarks lie in the category of small, medium and large size datasets.
Intel Broadwell processors
Figure 1: Performance for Intel Broadwell processors with WRF
Figure 1 compares performance among five Broadwell processors by using small and large size of WRF benchmarks. WRF was compiled with the “sm + dm” mode. The combinations of MPI and OpenMP processes that were used are mentioned in Table2.
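In “sm + dm” mode, WRF combines shared memory (OpenMP) threads with distributed memory (MPI) ranks, so the product of ranks and threads per rank normally equals the number of cores in use. The sketch below only enumerates the candidate layouts for a given core count; it does not reproduce the combinations that were actually chosen for this study, and the function name is ours.

```python
def hybrid_layouts(total_cores: int) -> list[tuple[int, int]]:
    """All (mpi_ranks, omp_threads_per_rank) pairs that use every core exactly once."""
    return [
        (ranks, total_cores // ranks)
        for ranks in range(1, total_cores + 1)
        if total_cores % ranks == 0
    ]


# A dual socket server with two 22 core processors has 44 cores.
for ranks, threads in hybrid_layouts(44):
    print(f"{ranks:2d} MPI ranks x {threads:2d} OpenMP threads")
```

Which layout performs best has to be found by measurement, together with the tile count, since both change how much data each thread touches.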
The “X” value on top of each bar in the graph shows the performance relative to the 12 core Broadwell processor (which is set as the baseline, 1.0). For the small dataset CONUS12km, the top bin processor performs 26% better than the 12 core processor. For CONUS2.5km, the improvement is up to 30% because the larger dataset can use a larger number of cores more efficiently. The performance increase from 20 to 22 cores is not as significant due to the lower memory bandwidth per core, as explained in the first blog’s STREAM results.
Figure 2: Performance of Intel Broadwell processors with NAMD
Figure 2 plots the simulation speed of the NAMD benchmarks with Broadwell processors. The “X” value on top of each bar in the graph shows the performance relative to the 12 core Broadwell processor (which is set as the baseline, 1.0). As seen from the graph, the relative performance of the different processor models is nearly the same irrespective of the NAMD benchmark dataset (small, medium or large). The top bin processor is 81 to 84% faster than the 12 core processor. The NAMD benchmarks show significant performance improvement with additional cores for Broadwell processors.
BIOS Profiles comparison
This study was performed with all snoop modes: Home Snoop (HS), Early Snoop (ES), Cluster-on-Die (COD) and Opportunistic Snoop Broadcast (OSB) with System Profiles “Performance” and “Performance Per-Watt (DAPC)”. More details on these BIOS profiles are in first blog of this series.
Figure 3: BIOS profile comparison with WRF for CONUS 2.5 km
Figure 3 compares the available snoop modes and two BIOS profiles for the large WRF dataset “CONUS2.5km”. The left graph compares snoop modes with the default BIOS setting “OSB” snoop mode with “DAPC” System Profile, which is shown as the red line set at 1. As per the graph, the COD snoop mode performs 2 to 3% better than default OSB snoop mode. As WRF is a memory sensitive application, the ES snoop mode performance is less than the other snoop modes, up to 8% lower at 22 cores, due to having less request tokens per core compared to other snoop modes (e.g. for 14 core 128/14 = 9 per core vs. 128/22 = 5 per core for 22 core). The right graph compares “Performance” with the “DAPC” system profile for the better performing “COD” snoop mode with COD.DAPC as the baseline. There is not a significant performance difference with “Performance,” only up to 1% better than “DAPC”.
Figure 4: BIOS profile comparison with ApoA1 (92,224 Atoms)
Figure 5: BIOS profile comparison with ATPase (327,506 Atoms)
Figure 6: BIOS profile comparison with STMV (1,066,628 Atoms)
Figures 4, 5 and 6 show the performance characteristics of the snoop modes available for Broadwell processors with the three (small, medium and large) NAMD benchmarks. The left graphs compare snoop modes with the default BIOS setting (OSB snoop mode with the DAPC system profile, which is shown as the red line set at 1). The performance of the NAMD benchmarks is almost the same across all snoop modes for all the datasets. COD is about 1% better for some of the processors for all the datasets, but it is not significantly different from the other snoop modes. The right graphs compare the “Performance” system profile with the default “DAPC” system profile, both with COD snoop mode, with COD.DAPC as the baseline (red line set at 1 in the graph). As the graphs show, NAMD performed up to 3% better with the “Performance” profile and COD snoop mode compared to the “DAPC” system profile. As seen from these three figures, the gain with COD snoop mode and the “Performance” system profile is larger for the larger datasets, specifically for the 22 core part.
Generation over Generation comparison
Figure 7: Generation over generation comparison of Intel processors for CONUS12km WRF benchmark
Figure 7 plots the performance characteristics of the CONUS 12km WRF benchmark over multiple generations of Intel processors. The bars in the graph show the average time step of the CONUS12km benchmark in seconds, and the purple dots show the performance relative to the WSM processor. As the graph shows, the 14 core Broadwell processor performs 20% better than the 14 core HSW processor. All the Broadwell processors perform better than the Haswell E5-2697 v3, with up to 33% improvement for the top bin processor. The performance of the 20 and 22 core Broadwell processors is the same, likely because of the lower memory bandwidth per core.
Figure 8: Comparing two generations of Intel processors with WRF Conus 2.5
Figure 8 shows the performance comparison among two generations of Intel processors: Haswell and Broadwell. In this graph, the bar shows the average time step value and the purple dots show the performance improvement relative to the 12 core Haswell processor. The 12 core Broadwell processor has 13% higher memory frequency than the 12 core Haswell processor (2400 MT/s vs. 2133 MT/s in Haswell), but it also has 17% lower AVX base frequency. Due to these performance parameters, the 12 core Broadwell processor performs 6% lower than the 12 core Haswell processor. As per the graph, the top bin 22 core Broadwell processor performs 14% better than the 14 core Haswell processor. Similar to what we saw earlier, there is not a significant performance improvement from the 20 core to 22 core Broadwell processors due to lower memory bandwidth per core.
Figure 9: Performance comparison of multiple generations of Intel processors with ApoA1 benchmark
Figure 9 shows the comparison of multiple generations of Intel processors (IVB, HSW and BDW) with the small (92,224 atoms) NAMD benchmark, ApoA1. The bars show the NAMD performance of the processors in “days/ns”. As seen from the graph, HSW performs 40% better than IVB. BDW’s performance improvement varies from 23 to 52%, except for the 12 core BDW: the 12 core BDW processor performs 18% slower than the 14 core HSW processor due to its 22% lower base frequency. The dots in the graph show the performance improvement over the IVB processor. The graph shows that the performance increases with the number of cores for the BDW processors. The top bin 22 core BDW performs 112% better than the 12 core IVB.
Figure10: Performance comparison of multiple generations of Intel processor with F1ATPase benchmark
Figure 10 compares the performance of multiple generations of Intel processors with the medium sized (327,506 atoms) NAMD benchmark, F1ATPase. It can be seen from the graph that HSW performs 33% better than IVB and that BDW performs up to 62% better than HSW.
Figure 11: Performance comparison of multiple generations of Intel processors with STMV benchmark
Figure 11 plots the performance comparison graph among multiple generations of Intel processors for the large size (1,066,628 Atoms) NAMD benchmark. As per this graph, the BDW processors are performing up to 63% better than HSW processor.
It can be seen from figures 9, 10 and 11 that the larger datasets make better use of the computation power and the relative performance with additional cores is better as compared with the smaller dataset.
This blog characterizes five Intel Broadwell processors and shows performance improvement for real time HPC applications. Additional cores, along with the higher memory frequency supported by Intel Broadwell processors, improve the performance of HPC workloads, specifically compute sensitive workloads like NAMD. The performance of memory bandwidth sensitive workloads like WRF increases up to the 16 core processors, but the improvement for the 20 and 22 core processors is not as significant due to the lower memory bandwidth per core.
Authors: Ashish Kumar Singh, Mayura Deshmukh and Neha Kashyap
The increasing demand for more compute power pushes servers to be upgraded with higher and more powerful hardware. With the release of the new Intel® Xeon® processor E5-2600 v4 family of processors (architecture codenamed “Broadwell”), Dell has refreshed the 13th generation servers to benefit from the increased number of cores and higher memory speeds thus benefiting a wide variety of HPC applications.
This blog is part one of the “Broadwell performance for HPC” blog series and discusses the performance characterization of Intel Broadwell processors with the High Performance LINPACK (HPL) and STREAM benchmarks. The next three blogs in the series will discuss the BIOS tuning options and the impact of Broadwell processors on Weather Research and Forecasting (WRF), NAMD, ANSYS® Fluent®, CD-adapco® STAR-CCM+®, OpenFOAM and LSTC LS-DYNA® HPC applications as compared to the previous generation processor models.
In this study, performance was measured across the five different Broadwell processor models listed in Table 2, along with 2400 MT/s DDR4 memory. This study focuses on HPL and STREAM performance for different BIOS profiles across all five Broadwell processor models and compares the results to previous generations of Intel Xeon processors. The platform we used is a PowerEdge R730, which is a 2U dual socket rack server with two processors. Each socket has four memory channels and can support up to 3 DIMMs per channel (DPC). For our study, we used 2 DPC for a total of 16 DDR4 DIMMs in the server.
Broadwell (BDW) is a tick in Intel’s tick-tock principle as the next step in semiconductor fabrication. It is a 14nm processor with the same microarchitecture as the Haswell-based (HSW, Xeon E5-2600 v3 series) processors with the same TDP range. Broadwell E5-2600 v4 series processors support up to 22 cores per socket with up to 55MB of LLC, which is 22% more cores and LLC than Haswell. Broadwell supports DDR4 memory with max memory speed of up to 2400 MT/s, 12.5% higher than the 2133 MT/s that is supported with Haswell.
Broadwell introduces a new snoop mode option in the BIOS memory setting, Directory with Opportunistic Snoop Broadcast (DIR+OSB), which is the default snoop mode for Broadwell. In this mode, the memory snoop is spawned by the Home Agent and a directory is maintained in the DRAM ECC bits. DIR+OSB mode allows for low local memory latency, high local memory bandwidth and I/O directory cache to reduce directory update overheads for I/O accesses. The other three snoop modes: Home Snoop (HS), Early Snoop (ES), and Cluster-on-Die (COD) are similar to what was available with Haswell. The Cluster-on-die (COD) is only supported on processors that have two memory controllers per processor. The Dell BIOS on systems that support both Haswell and Broadwell will display the supported snoop modes based on the processor model populated in the system.
Table 1 describes the other new features available in the Dell BIOS on systems that support Broadwell processors.
Table 1: New BIOS features with Intel Xeon E5 v4 processor family (Broadwell)
Snoop Mode > Directory with Opportunistic Snoop Broadcast (DIR+OSB)
Directory with Opportunistic Snoop Broadcast, available on select processor models, works well for workloads of mixed NUMA optimization. It offers a good balance of latency and bandwidth.
System Profile Settings > Write Data CRC
When set to enabled, the DDR4 data bus issues are detected and corrected during ‘write’ operations. Two extra cycles are required for CRC bit generation which impacts the performance. Read-only unless System Profile is set to Custom.
System Profile Settings > CPU Power Management > Hardware P States
If supported by the CPU, Hardware P States is another performance-per-watt option that relies on the CPU to dynamically control individual core frequency. Read-only unless System Profile is set to Custom.
System Profile Settings > C States > Autonomous
Autonomous is a new BIOS option for C States in addition to the previous options, Enable and Disable. With Autonomous (if hardware control is supported), the processor can operate in all available Power States to save power, but this may increase memory latency and frequency jitter.
Intel Broadwell supports Intel® Advanced Vector Extensions 2 (Intel AVX2) vector technology, which allows a processor core to execute 16 FLOPs per cycle. HPL is a benchmark that solves a dense linear system. The HPL problem size (N) was chosen to use 92% of the system memory, with a block size (NB) of 192. Broadwell processors consume more power when running Intel® AVX2 workloads than non-AVX workloads, so starting with the Haswell product family Intel provides two frequencies for each SKU: the rated base frequency and a lower AVX base frequency. Table 2 lists the rated base and AVX base frequencies of each Broadwell processor used for this study. Since HPL is an AVX-enabled workload, we calculate the HPL theoretical maximum performance using the AVX base frequency: (AVX base frequency of processor * number of cores * 16 FLOP/cycle).
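The peak formula and the problem-size rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 2.0 GHz frequency and 20 total cores are hypothetical values, not taken from Table 2; "256GB" is interpreted as 256 x 2^30 bytes; and each matrix element is assumed to be an 8-byte double. The function names are made up for this example.

```python
import math

# Double-precision FLOPs per core per cycle with AVX2, as stated in the text.
FLOPS_PER_CYCLE_AVX2 = 16


def hpl_peak_gflops(avx_base_ghz: float, total_cores: int) -> float:
    """Theoretical peak = AVX base frequency (GHz) * cores * 16 FLOP/cycle."""
    return avx_base_ghz * total_cores * FLOPS_PER_CYCLE_AVX2


def hpl_problem_size(mem_bytes: int, fraction: float, nb: int) -> int:
    """Largest N (a multiple of NB) whose N x N matrix of doubles
    fits in `fraction` of the system memory."""
    max_elements = int(fraction * mem_bytes) // 8  # 8 bytes per double (assumption)
    n_max = math.isqrt(max_elements)
    return (n_max // nb) * nb


# Hypothetical inputs for illustration only (not values from Table 2).
peak = hpl_peak_gflops(avx_base_ghz=2.0, total_cores=20)
n = hpl_problem_size(mem_bytes=256 * 2**30, fraction=0.92, nb=192)
print(f"peak = {peak:.1f} GFLOPS, N = {n}")
```

For a dual socket server, `total_cores` would be the core count across both sockets. Rounding N down to a multiple of NB is a common convention rather than something the study states explicitly.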
Table 2: Base frequencies of Intel Broadwell Processors
Base Frequencies of Intel Broadwell processors
Rated base frequency (GHz)
AVX base frequency (GHz)
Theoretical Maximum Performance (GFLOPS)
E5-2699 v4, 22 core, 145W
E5-2698 v4, 20 core, 135W
E5-2697A v4, 16 core, 145W
E5-2690 v4, 14 core, 135W
E5-2650 v4, 12 core, 105W
Table 3 gives more information about the hardware configuration and the benchmarks used for this study.
Table 3: Server and Benchmark details for Intel Xeon E5 v4 processors
As described in table 2
16 x 16GB DDR4 @ 2400 MT/s (Total=256GB)
RHEL 7.2 (3.10.0-327.el7.x86_64)
Snoop modes – OSB, ES, HS and COD
From Intel Parallel Studio 2016 update1
Intel MPI - 5.1.2
Intel Broadwell Processors
Figure 1: HPL performance characterization
Figure 1 shows the HPL characterization of all five Intel Broadwell processors used for this study on the PowerEdge R730 platform. Table 2 shows the TDP values for each of the Broadwell processors. The text value in each bar shows the efficiency of that processor. The “X” value on top of each bar shows the performance gain over the 12 core Broadwell processor. HPL performance does not increase in proportion to the number of cores. For example, the top bin 22 core processor has 83% more cores than the 12 core Broadwell processor but delivers only a 57% HPL performance improvement. The line on the graph shows the HPL performance per core. Because HPL performance does not scale with the number of cores, the performance per core decreases by 8% and 15% for the 20 and 22 core processors respectively.
Figure 2: STREAM (Triad) Performance characterization
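The relationship between core count, speedup and per-core performance can be checked with a short calculation. The sketch below uses only the 12 core, 22 core and 57% figures quoted above. The function names are illustrative, and the efficiency helper assumes efficiency means measured performance divided by theoretical peak.

```python
def per_core_change(speedup: float, base_cores: int, cores: int) -> float:
    """Fractional change in per-core performance relative to the baseline part.

    speedup is the measured performance ratio (e.g. 1.57 for a 57% gain).
    """
    core_ratio = cores / base_cores
    return speedup / core_ratio - 1.0


def hpl_efficiency(measured_gflops: float, peak_gflops: float) -> float:
    """Efficiency as measured / theoretical peak (assumed definition)."""
    return measured_gflops / peak_gflops


# 22 core vs 12 core Broadwell, 57% HPL improvement (figures from the text).
change = per_core_change(speedup=1.57, base_cores=12, cores=22)
print(f"core increase: {22 / 12 - 1:.0%}, per-core change: {change:.1%}")
```

With these inputs the per-core change comes out at roughly -14%, in line with the approximately 15% drop reported for the 22 core processor.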
The STREAM benchmark calculates the memory bandwidth by counting only the bytes that the user program requested to be loaded or stored. This study uses the results reported by the TRIAD function of the stream bandwidth test.
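To make the byte-counting convention concrete, here is a minimal sketch of the Triad kernel and the bytes STREAM attributes to it. It is pure Python for readability, so the bandwidth it prints reflects interpreter overhead and is not representative of the hardware. The 8-byte word size assumes double-precision elements, and the helper names are illustrative.

```python
import time


def triad(a, b, c, q):
    """STREAM Triad kernel: a[i] = b[i] + q * c[i]."""
    for i in range(len(a)):
        a[i] = b[i] + q * c[i]


def triad_bytes(n: int, word_bytes: int = 8) -> int:
    """Bytes counted for one Triad pass: two loads (b, c) and one store (a)
    per element. Extra traffic such as write-allocate is not counted."""
    return 3 * word_bytes * n


n = 100_000
a, b, c = [0.0] * n, [1.0] * n, [2.0] * n

start = time.perf_counter()
triad(a, b, c, 3.0)
elapsed = time.perf_counter() - start

print(f"{triad_bytes(n) / elapsed / 1e9:.3f} GB/s (pure Python, not representative)")
```

The real benchmark is compiled code that uses arrays much larger than the last-level cache and reports the best of several timed passes.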
Figure 2 plots the STREAM (TRIAD) performance for all Broadwell processors used for this study. The bars show the memory bandwidth in GB/s for each of the processors. According to the graph, memory bandwidth is approximately the same across all Broadwell processors. Since the total memory bandwidth is the same, the memory bandwidth per core decreases as the number of cores increases.
BIOS Profiles comparison
Figure 3: Comparing BIOS profiles with HPL
Figure 3 plots HPL performance with two BIOS system profile options for all four snoop modes across all five Broadwell processors. As the Directory + Opportunistic Snoop Broadcast (DIR+OSB) snoop mode performs well for all workloads and the DAPC system profile balances performance and energy efficiency, these options are set as default in the BIOS and so have been chosen as the baseline.
From this graph, it can be seen that Cluster-on-Die (COD) memory mode with the “Performance” System Profile setting performs 2 to 4 % better than other BIOS profile combinations across all Broadwell processors. The Cluster-on-die (COD) is only supported on processors that have two memory controllers per processor, i.e. 12 or more cores.
Figure 4: Comparing BIOS profiles with STREAM (TRIAD)
Figure 4 shows the STREAM performance characteristics with two BIOS system profile options for all the snoop modes. Opportunistic Snoop Broadcast (OSB) snoop mode along with the DAPC system profile is chosen as the baseline for this study. Memory bandwidth is almost the same for every BIOS profile combination except those using Early Snoop (ES) mode. With Early Snoop (ES) mode, memory bandwidth is 8 to 20% lower for both system profiles, and the difference is most apparent for the 22 core processor, where it reaches up to 25%. Early Snoop (ES) mode has fewer Requester Transaction IDs (RTIDs) distributed across all the cores, while the other snoop modes get more RTIDs, that is, a higher number of credits for local and remote traffic at the home agent.
Figure 5: Comparing HPL Performance across multiple generations of Intel processors
Figure 5 plots the generation-over-generation performance comparison for HPL with Intel Westmere (WSM), Sandy Bridge (SB), Ivy-Bridge (IVB), Haswell (HSW) and Broadwell (BDW) processors. The percentages on the bars show the HPL performance improvement over the previous generation processor. The graph shows that the 14 core Broadwell processor performs 16% better than the 14 core Haswell processor with similar frequencies for the HPL benchmark. Broadwell processors measure better power efficiencies than the Haswell processors. The top bin 22 core Broadwell processor performs 49% better than the 14 core Haswell processor. The purple diamonds in the graph show the performance per core. The “X” value on top of every bar shows the speedup over the 6 core WSM processor.
Figure 6: Generation over generation comparison with STREAM
Figure 6 compares STREAM (TRIAD) performance across multiple generations of Intel processors. The graph shows that system memory bandwidth has increased over generations. The maximum supported memory frequency increased by 12.5% in Broadwell over Haswell (2133 MT/s to 2400 MT/s), and this translates into 10 to 12% better measured memory bandwidth. However, the maximum core count per socket has increased by up to 22% in Broadwell over Haswell, so the memory bandwidth per core depends on the specific Broadwell SKU. The 20 core and 22 core BDW processors provide only ~3 GB/s per core, which is likely to be very low for most HPC applications, while the 16 core BDW is on par with the 14 core HSW at ~4 GB/s per core.
All of the Broadwell processors used for this study deliver higher performance than the previous generation on both the HPL and STREAM benchmarks. There is a ~12% increase in measured memory bandwidth for Broadwell processors compared to Haswell processors. Broadwell processors measure better power efficiencies than the Haswell processors. In conclusion, Broadwell processors may fulfill the demand for more compute power for HPC applications.
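The frequency and per-core figures above can be related to the theoretical peak memory bandwidth of the platform with a short calculation. The sketch assumes a 64-bit (8-byte) data bus per DDR4 channel, plus the four channels per socket and two sockets described earlier for the R730. The per-core numbers it prints are theoretical upper bounds, not the measured STREAM values from Figure 6, and the function names are illustrative.

```python
def peak_mem_bw_gbs(mt_per_s: int, channels: int, sockets: int,
                    bus_bytes: int = 8) -> float:
    """Theoretical peak bandwidth in GB/s:
    transfers/s * bytes per transfer * channels per socket * sockets."""
    return mt_per_s * 1e6 * bus_bytes * channels * sockets / 1e9


def bw_per_core(total_gbs: float, total_cores: int) -> float:
    """Bandwidth available per core if shared evenly across all cores."""
    return total_gbs / total_cores


hsw = peak_mem_bw_gbs(2133, channels=4, sockets=2)
bdw = peak_mem_bw_gbs(2400, channels=4, sockets=2)
print(f"HSW peak {hsw:.1f} GB/s, BDW peak {bdw:.1f} GB/s, gain {bdw / hsw - 1:.1%}")

# Theoretical per-core bandwidth for the dual socket Broadwell configurations.
for cores_per_socket in (12, 14, 16, 20, 22):
    per_core = bw_per_core(bdw, 2 * cores_per_socket)
    print(f"{cores_per_socket} cores/socket: {per_core:.2f} GB/s per core (theoretical)")
```

Measured STREAM bandwidth is always somewhat below this peak, which is why the measured per-core values quoted above are lower than the theoretical ones.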
The Ninth Annual National Meeting for the South African Center for High Performance Computing was held in early December 2015 in Pretoria, SA. South Africa has become the focus of regional and international interest in the tech and science communities due to the Square Kilometer Array (SKA) being built in the Karoo region. When completed, it will be the world’s biggest radio telescope with an expected 50-year lifespan. The investment in the SKA will benefit the area and 15 member states as a whole as a result of improvements to the power grid, high-speed networks, and workforce development. Phase One of the construction project is scheduled to begin in 2018, with early science and data generation following by 2020.
Many well-known experts were on hand at the symposium. CHPC’s Director Happy Sithole talked about the growth of the Cape Town center since launching in 2007. It was the only center of its kind on the African continent at the time, and supported 15 researchers with 2.5 teraflops. Now it supports 700 with 64 teraflops of power with expansion driven by demand. Sithole also announced the addition of a new Dell system to be added in two phases, which will increase capacity to 1,000 teraflops, operational in early 2016.
Merle Giles (National Center for Supercomputing Applications) gave the opening address, titled “HPC-Enabled Innovation and Transformational Science & Engineering: The Role of CI.” Of note, he spoke about the funding gap between the foundational research usually led by universities (or start-ups) and the commercialization phase where industry picks up. Furthermore, data supports the ROI of HPC investments. Giles also highlighted the importance of President Obama’s state of the union address, which translated HPC’s role in enabling medical advances into benefits for the average citizen. This past November he spoke to Dell at SC15 about the impact of HPC on third world countries. You can watch his observations here.
Additionally, a talk by Simon Hodson of CODATA highlighted the critical importance of allowing open access to the data behind research findings. Rudolph Pienaar of Boston Children’s Hospital discussed data challenges within the healthcare field. Specifically, hospital systems are antiquated and siloed, designed to facilitate billing and protect privacy, which obstructs research and collaboration. Children’s has designed an innovative system that overcomes these challenges, known as the Boston Children’s Hospital Research Integration System (ChRIS). It is a web-based research integration system that can manage any datatype and is uniquely suited to medical image data, providing the ability to seamlessly collect data from typical sources found in hospitals.
An important aspect of the forum was a discussion of best practices for managing data sharing across national borders, which has been a point of concern for the Southern African Development Community (SADC). The organization held a meeting, which included first time delegates from Mauritius, Namibia and Seychelles, to review the collaborative framework document that last year’s forum attendees had begun to draft. The SADC delegates were counseled by the international advisers to focus on collaboration. They were also warned that finding the best way to transfer data reliably, securely and seamlessly among SADC sites was integral to the success of the project. It was agreed that cybersecurity should be a first priority.
The meeting concluded with a plan for another conference. This one will be held in Botswana in April 2016. For more information, you can read the recent article on HPCWire.
By David Griggs
Ohio’s academic, science and technology communities will be getting quite a lift this year as the Ohio Supercomputer Center (OSC), an OH-TECH Consortium member, adds a powerful new supercomputer from Dell. The enhancement is part of a $9.7 million investment recently approved by Ohio’s State Controlling Board and stems from a $12 million appropriation included in the 2014-15 Ohio biennial capital budget.
OSC is a regional center, founded in 1987, that provides supercomputing services and expertise to local industries and university researchers. Currently, it offers computational services via three supercomputer clusters: the IBM/AMD Glenn Cluster, the HP/Intel Ruby Cluster, and the HP/Intel Oakley Cluster. The Dell supercomputer will be replacing the Glenn Cluster and part of the Oakley Cluster, adding a much-needed increase in computing power and storage, as the center is running near peak capacity. The center’s interim director, Dr. David Hudak, Ph.D., expects that this new addition will greatly help industrial and academic clients alike, fostering new research and innovation.
For more information, please see the recent articles on insideHPC and HPCWire.
The 2016 International Supercomputing Conference is being held in Frankfurt, Germany this coming June, and with it the fifth annual Student Cluster Competition. Once again Team South Africa, co-sponsored by the Centre for High Performance Computing (CHPC) and Dell, will be competing for a winning title.
The team is led by CHPC’s David MacLeod, who introduced the cluster competition to South African students, put together the first official team in 2011 and has led all subsequent teams. David has an impressive record, with two first place teams and one second place team. His eyes are on clinching the 2016 title at ISC this year, but in addition he aims to raise awareness of HPC as a transformative technology in South Africa, and attract more students to the field.
This year’s team consists of six bright young students from the University of the Witwatersrand and two reserves from Stellenbosch University who will face off against 11 other teams from around the world. These student squads will compete over a three-day period to build a small cluster computer of their own design and run a series of HPC benchmarks and applications. In preparation for the competition, Team South Africa spent a week at Dell’s Round Rock campus to meet with HPC experts, check out our next-generation HPC and thermal labs, become familiar with the cluster systems and receive hands-on tutorials and feedback sessions. A special treat for the South African students was a sit down with Jim Ganthier, Dell’s Head of HPC, and Ed Turkel, Dell’s HPC Strategist, to learn more about pursuing a career in HPC.
Both Jim and Ed discussed the recent progress in the HPC industry, and just how far it has come from the days they worked on monolithic systems, before the advent of x86 servers and clusters. They also talked about how HPC is going beyond the world of academia and scientific research, thanks to the explosive growth in big data, and how companies like Dell are leading the charge to bring HPC to mainstream audiences, with the hopes that the students of today will help make that vision a reality. The students had many questions, ranging from the medical applications of HPC, to how the democratization of HPC will affect the way business leaders look at technology, to what a career in electrical engineering would look like in relation to HPC.
For more on the South African students’ visit to Dell, please visit Perrin Cox’s post.
By Olumide Olusanya and Munira Hussain
This is the second part of this blog series. In the first post, we shared OSU Micro-Benchmarks (latency and bandwidth) and HPL performance between FDR and EDR Infiniband. In this part, we will further compare performance using additional real-world applications such as ANSYS Fluent, WRF, and NAS Parallel Benchmarks. For our cluster configuration, please refer to part 1.
Fluent is a Computational Fluid Dynamics (CFD) application used for engineering design and analysis. It can be used to simulate the flow of fluids, with heat transfer, turbulence and other phenomena, involved in various transportation, industrial and manufacturing processes.
For this test we ran Eddy_417k, which is one of the problem sets from the ANSYS Fluent Benchmark suite. It is a reaction flow case based on the eddy dissipation model. It has around 417,000 hexahedral cells and is a small dataset with a high communication overhead.
Figure 1 - ANSYS Fluent 16.0 (Eddy_417k)
From Figure 1 above, EDR shows a wide performance advantage over FDR as the number of cores increases to 80. We continue to see an even wider difference as the cluster scales. While FDR’s performance seems to gradually taper off after 80 cores, EDR’s performance continues to scale as the number of cores increases and is 85% better than FDR on 320 cores (16 nodes).
WRF (Weather Research and Forecasting)
WRF is a modelling system for weather prediction. It is widely used in atmospheric and operational forecasting research. It contains two dynamic cores, a data assimilation system, and a software architecture that allows for parallel computation and system extensibility. For this test, we are going to study the performance of a medium size case, Conus 12km.
Conus 12km is a resolution case over the Continental US domain. The benchmark is run for 3 hours after which we take the average of the time per time step.
Figure 2 - WRF (Conus12km)
Figure 2 shows both EDR and FDR scaling almost linearly and performing almost equally until the cluster scales to 320 cores, where EDR performs better than FDR by 2.8%. This performance difference, which may seem small, is significantly higher than our highest run-to-run variation of 0.005% between three successive EDR and FDR 320-core tests.
HPC Advisory Council’s result here shows a similar trend with the same benchmark. From their result, we can see that the performances are neck and neck until the 8 and 16-node run where we see a small performance gap. Then the gap widens even more in the 32-node run and EDR posts a 28% better performance than FDR. Both results show that we could see an even higher performance advantage with EDR as we scale beyond 320 cores.
NAS Parallel Benchmarks
NPB is a suite of benchmarks developed by the NASA Advanced Supercomputing Division. The benchmarks are designed to test the performance of highly parallel supercomputers, and all of them mimic large-scale, commonly used computational fluid dynamics applications in their computation and data movement. For our tests, we ran four of these benchmarks: CG, MG, FT, and IS. In the figures below, the performance difference is shown in an oval right above the corresponding run.
Figure 3 - CG
Figure 4 - MG
Figure 5 - FT
Figure 6 - IS
CG is a benchmark which computes an approximation of the smallest eigenvalue of a large, sparse, symmetric positive-definite matrix using a conjugate gradient method. It also tests irregular long distance communication between cores. From Figure 3 above, EDR shows a 7.5% performance advantage with 256 cores.
MG solves a 3-D Poisson Partial Differential Equation. The problem in this benchmark is simplified in that it has constant rather than variable coefficients, unlike a more realistic application. In addition to this, it tests short and long distance communication between cores. Unlike CG, the communication patterns are highly structured. From Figure 4, EDR performs better than FDR by 1.5% on our 256-core cluster.
FT is a 3-D partial differential equation solution using FFTs. It tests the long-distance communication performance as well and shows a 7.5% performance gain using EDR on 256 cores as seen in Figure 5 above.
IS, a large integer sort application, shows a high 16% performance difference between EDR and FDR on 256 cores. This application not only tests the integer computation speed, but also the communication performance between cores. From Figure 6, we can see a 12% EDR advantage with 128 cores which increases to 16% on 256 cores.
In both blogs, we have shown several micro-benchmark and real-world application results to compare FDR with EDR Infiniband. From these results, EDR has shown a higher performance and better scaling than FDR on our 16-node Dell PowerEdge C6320 cluster. Also, some applications have shown a wider performance margin between these interconnects than other applications. This is because of the nature of the applications being tested; communication intensive applications will definitely perform and scale better with a faster network when compared with compute-intensive applications. Furthermore, because of our cluster size, we were only able to test the scalability of the applications on 16 servers (320 cores). In the future, we plan on running these tests again on a larger cluster to further test the performance difference between EDR and FDR.
Congratulations to Jim Ganthier, Dell’s vice president and general manager of Cloud, HPC and Engineered Solutions, who was recently selected by HPCWire as a “2016 Person to Watch.” In an interview as part of this recognition, Jim offered his insights, perspective and vision on the role of HPC, seeing it as a critical segment of focus driving Dell’s business. He also discussed initiatives Dell is employing to inspire greater adoption through innovation, as HPC becomes more mainstream.
There has been a shift in the industry, with newfound appreciation of advanced-scale computing as a strategic business advantage. As it expands, organizations and enterprises of all sizes are becoming more aware of HPC’s value to increase economic competitiveness and drive market growth. However, Jim believes greater availability of HPC is still needed for the full benefits to be realized across all industries and verticals.
As such, one of Dell’s goals for 2016 is to help more people in more industries to use HPC by offering more innovative products and discoveries than any other vendor. This includes developing domain-specific HPC solutions, extending HPC-optimized and enabled platforms, and enabling a broader base of HPC customers to deploy, manage and support HPC solutions. Further, Dell is investing in vertical expertise by bringing on HPC experts in specific areas including life sciences, manufacturing and oil and gas.
Dell is also offering its own brand muscle to draw more attention to HPC at the C-suite level, and will thus accelerate mainstream adoption - this includes leveraging the company’s leading IT portfolio, services and expertise. Most importantly, the company is championing the democratization of HPC, meaning minimizing complexities and mitigating risk associated with traditional HPC while making data more accessible to an organization’s users.
Here are a few of the trends Jim sees powering adoption for the year ahead:
A great example of HPC outside the world of government and academic research is aircraft and automotive design. HPC has long been used for structural mechanics and aerodynamics of vehicles, but now that the electronics content of aircraft and automobiles is increasing dramatically, HPC techniques are also being used to prevent electromagnetic interference from impacting the performance of those electronics. At the same time, HPC has enabled vehicles to be lighter, safer and more fuel efficient than ever before. Other examples of HPC applications include everything from oil exploration to personalized medicine, from weather forecasting to the creation of animated movies, and from predicting the stock market to assuring homeland security. HPC is also being used by the likes of FINRA to help detect and deter fraud, as well as helping stimulate emerging markets by enabling growth of analytics applied to big data.
Again, our sincerest congratulations to Jim Ganthier! To read the full Q&A, visit http://bit.ly/1PYFSv2.
The goal of this blog is to evaluate the performance of Mellanox Technologies’ FDR (Fourteen Data Rate) Infiniband and their latest EDR (Enhanced Data Rate) Infiniband, with speeds of 56Gb/s and 100Gb/s respectively. This is the first post of our two-part blog series, and we will show how these interconnects perform on an HPC cluster using industry-standard micro-benchmarks and applications. In this part, we will show latency, bandwidth and HPL results for FDR vs EDR, and in part 2 we will share more results with other applications, which include ANSYS Fluent, WRF, and NAS Parallel Benchmarks. You should also keep in mind that while some applications would benefit from the higher bandwidth in EDR, other applications which have low communication overhead would show little performance improvement in comparison.
Mellanox EDR adapters are based on a new generation ASIC also known as ConnectX-4 while the FDR adapters are based on ConnectX-3. The theoretical uni-directional bandwidth for EDR is 100 Gb/s versus FDR which is 56Gb/s. Another difference is that EDR adapters are x16 adapters while FDR adapters are available in x8 and x16. Both of these adapters operate at a bus width of 4X link. The messaging rate for EDR can reach up to 150 million messages per second compared with FDR ConnectX-3 adapters which deliver more than 90 million messages per second.
Table 1 below shows the difference between EDR and FDR and Table 2 describes the configuration of the cluster used in the test while Table 3 lists the applications and benchmarks used for this test.
Table 1 - Difference between EDR and FDR
x8 and x16 Gen3
Table 2 - Cluster configuration
16 nodes x PowerEdge C6320 [ 4 chassis ]
Intel® Xeon® E5-2660 v3 @2.6/2.2 GHz, 10 cores, 105W
128 GB – 8 x16 GB @ 2133MHz
Red Hat Enterprise Linux Server release 6.6.z (Santiago)
Intel® MPI 5.0.3.048
Table 3 - Applications and Benchmarks
Efficiency of MPI implementation
From Mellanox OFED 3.1
Random dense linear
From Intel MKL
Problem size 90% of total memory
Weather Research and
CG, MG, IS, FT
To find the latency and bandwidth, we used the tests from the OSU Micro-Benchmark suite. These tests use the MPI message passing performance to check the quality of a network fabric. Using the same system configuration for EDR and FDR fabrics, we got latency results as shown in Figure 1 below.
Figure 1 - OSU Latency (using MPI from Mellanox HPC-X Toolkit)
Figure 1 shows a simple OSU node-to-node latency result for EDR vs FDR. Latency numbers are typically taken from the lowest data points (usually the point with the smallest message size). Hence, the lower the data points, the better. In the above OSU latency graph, EDR shows a latency of 0.80us while FDR shows 0.81us. As the message size increases past 512 bytes, EDR’s latency advantage grows: 2.75us compared with FDR’s 2.84us for a 4KB message size. When we did a further latency study using RDMA, EDR measured 0.61us and FDR measured 0.65us.
Figure 2 below plots the OSU unidirectional and bidirectional bandwidth achieved by both EDR and FDR at different message sizes from 1 byte to 4MB.
OSU unidirectional bandwidth is a ping-pong type of communication test where the sender sends a fixed size of messages back-to-back to a receiver and then the receiver responds only after receiving all the messages. This test measures the maximum data rate of the network one–way or the unidirectional bandwidth. The result is taken from the achieved bandwidth of the maximum message size which is 4MB. In the above test, EDR achieves a maximum unidirectional data rate of 12.4GB/s (99.2Gb/s) and FDR achieves 6.3GB/s (50.4Gb/s). This is a 97% performance improvement in EDR over FDR.
OSU bidirectional bandwidth is very similar to the unidirectional test, but in this case, both nodes send messages to each other and await a reply. From the above graph, EDR achieves a bidirectional data rate of 24.2GB/s (193.6Gb/s) compared with FDR’s 10.8GB/s (86.4Gb/s) which gives us a 124% improvement with EDR over FDR.
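The unit conversions and improvement percentages quoted above can be reproduced with a few lines of Python. The bandwidth values are the ones reported in the text; the helper names are illustrative.

```python
def gbytes_to_gbits(gb_per_s: float) -> float:
    """Convert GB/s to Gb/s (8 bits per byte)."""
    return gb_per_s * 8


def improvement_pct(new: float, old: float) -> float:
    """Percentage improvement of `new` over `old`."""
    return (new / old - 1) * 100


# (EDR, FDR) bandwidth in GB/s at the 4MB message size, from the text.
results = {"unidirectional": (12.4, 6.3), "bidirectional": (24.2, 10.8)}

for name, (edr, fdr) in results.items():
    print(
        f"{name}: EDR {gbytes_to_gbits(edr):.1f} Gb/s, "
        f"FDR {gbytes_to_gbits(fdr):.1f} Gb/s, "
        f"improvement {improvement_pct(edr, fdr):.0f}%"
    )
```

Comparing the converted figures with the 100 Gb/s and 56 Gb/s theoretical rates also shows how close each adapter gets to its line rate in the unidirectional test.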
Figure 3 below shows the HPL performance between EDR and FDR using COD (Cluster on Die) snoop mode. Previous studies have shown that COD gives the best performance over Home and Early snoop.
Figure 3 - HPL Performance
The HPL benchmark is a compute-intensive application. It can spend more than 80% of its runtime on computation, depending on how you tune it. During the bulk of its communication time, it sends small messages across the cluster, which may not benefit from a higher speed network. Hence, you should not expect a huge performance difference between EDR and FDR. Even though EDR seems to perform slightly better than FDR by 0.33% in the 80-core run, this difference is within our run-to-run variation for successive tests with either EDR or FDR. As a result, this performance gain cannot be attributed to an EDR advantage. This also makes it difficult to accurately test the effect of one interconnect over the other with HPL.
From our tests so far, EDR has shown a clear bandwidth advantage when compared with FDR – 97% in unidirectional and 124% in bidirectional bandwidth. In the second part of this blog, we will share more results from other applications (ANSYS Fluent, WRF, and NAS Parallel Benchmarks) to compare performance between EDR and FDR.