Authors: Mayura Deshmukh, Ashish K Singh, Neha Kashyap
Dell has refreshed its 13th generation servers with the recently released Intel Broadwell (BDW) processors, which raises an obvious question: how do the new processors compare with earlier generations? This blog, the fourth in the “Broadwell Performance for HPC” series, answers that question by comparing the performance of several CAE applications on five Broadwell Intel Xeon E5-2600 v4 series processor models against previous generation Intel processors.
Last week’s blog covered the impact of BIOS options on each of the CAE applications. Here we focus on how much the Broadwell processors improve on the previous generation Haswell (HSW) and Ivy Bridge (IVB) processors for these applications. Table 1 lists the applications and benchmarks compared, and Table 2 describes the server configuration used for the study. Some of the older results were collected with different software versions (the latest available at the time) than those listed in Table 1: for LS-DYNA, the IVB and HSW (sse binary) runs, and for ANSYS Fluent, the Westmere (WSM), Sandy Bridge (SB), IVB and HSW runs. For STAR-CCM+ and OpenFOAM, the HSW and BDW benchmarks used the same software version.
Table 1 - Applications and benchmarks
Platform MPI 9.1.0
Average Elapsed time
Platform MPI 9.1.3
Platform MPI 126.96.36.199
Open MPI 1.10.0
Table 2 - Server configuration
Memory: 256GB - 16 x 16GB 2400 MHz DDR4 RDIMMs
Hard drives: 6 x 300GB SAS 6Gbps 10K rpm
RAID controller: PERC H330 mini
Operating system: Red Hat Enterprise Linux 7.2 (3.10.0-327.el7.x86_64)
BIOS options:
System profile - Performance
Logical Processor - Disabled
Power Supply Redundant Policy - Not Redundant
Power Supply Hot Spare Policy - Disabled
I/O Non-Posted Prefetch - Disabled
Snoop Mode - Opportunistic Snoop Broadcast (OSB) for OpenFOAM and Cluster on Die (COD) for all the other applications
Node interleaving - Disabled
Figure 1 compares the performance of the five BDW Intel Xeon E5-2600 v4 series processor models with the HSW Intel Xeon E5-2600 v3 series processors and the IVB E5-2680 v2 for the LS-DYNA car2car benchmark (with end time set to 0.02).
Figure 1: IVB vs. HSW vs. BDW for LS-DYNA
Performance for all processors is shown relative to the E5-2680 v2, the red baseline set at 1. The green bars show the HSW processors with the LS-DYNA single precision sse binary, the grey bar shows the HSW E5-2697 v3 with the single precision avx2 binary, the blue bars show the BDW processors with the single precision sse binary, and the orange bars show the BDW processors with the single precision avx2 binary. The purple diamonds show performance per core relative to the E5-2680 v2. The percentages above the orange BDW avx2 bars give the improvement of each BDW processor over the HSW E5-2697 v3 with avx2 (the grey bar).
For BDW, the avx2 binaries perform 12-19% better than the sse binaries across all processor models. The 12 core BDW E5-2650 v4, which has fewer cores and a lower frequency, understandably performs 11% lower than the Haswell E5-2697 v3. The 14 core E5-2690 v4, which has the same number of cores and similar avx2 frequencies, performs 7% better than the E5-2697 v3; this can be attributed to the higher memory bandwidth on Broadwell. BDW processors also measure better power efficiency than Haswell processors. The 16 core, 20 core and 22 core processors perform 16% to 30% better than the HSW E5-2697 v3 (avx2). Considering performance, performance per core and the higher memory bandwidth per core, the E5-2690 v4 (14 core) and E5-2697A v4 (16 core) look like attractive options for CAE/CFD codes, particularly when per core licensing costs are a factor.
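For readers who want to build the same kind of comparison from their own runs, the relative and per core figures can be derived from elapsed times roughly as follows. This is a minimal sketch, not the scripts used for this study: the function names are hypothetical, the timings and core counts are placeholders rather than measured results, and it assumes per core performance is simply relative performance scaled by the ratio of core counts.

```python
def relative_performance(baseline_seconds, candidate_seconds):
    """Ratio > 1 means the candidate finished faster than the baseline."""
    return baseline_seconds / candidate_seconds


def percent_improvement(baseline_seconds, candidate_seconds):
    """Relative performance expressed as a percentage gain over the baseline."""
    return (relative_performance(baseline_seconds, candidate_seconds) - 1.0) * 100.0


def per_core_relative(baseline_seconds, baseline_cores,
                      candidate_seconds, candidate_cores):
    """Relative performance normalized by core count (assumed definition)."""
    rel = relative_performance(baseline_seconds, candidate_seconds)
    return rel * baseline_cores / candidate_cores


# Placeholder elapsed times in seconds and core counts, NOT measured data.
baseline_time, baseline_cores = 1200.0, 14
candidate_time, candidate_cores = 1000.0, 16

print(relative_performance(baseline_time, candidate_time))
print(percent_improvement(baseline_time, candidate_time))
print(per_core_relative(baseline_time, baseline_cores,
                        candidate_time, candidate_cores))
```

A per core value above 1 means each core of the candidate does more work than each core of the baseline, which is the number that matters when licenses are priced per core.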
CD-adapco’s STAR-CCM+ is a CFD application widely used in industry for solving problems involving fluid flow, heat transfer and other phenomena. STAR-CCM+ shows performance patterns similar to LS-DYNA.
Figure 2: HSW vs. BDW for STAR-CCM+
Figure 2 compares the performance of the five BDW Intel Xeon E5-2600 v4 series processor models (the five bars in the graph) with the HSW E5-2697 v3, shown as the red line set at one. The numbers above the bars show the per core performance relative to the E5-2697 v3. The 14 core, 16 core, 20 core and 22 core processors perform 8% to 40% better than the E5-2697 v3 across all the benchmarks. The 12 core E5-2650 v4, with fewer cores and a lower frequency, performs 11-20% lower than the E5-2697 v3. As with LS-DYNA, the per core performance of the 14 core and 16 core processors is 2% to 11% better than the HSW E5-2697 v3, making them good options for STAR-CCM+ as well.
ANSYS Fluent is a computational fluid dynamics application. The graph in Figure 3 shows the performance of the truck_poly_14m benchmark for Sandy Bridge (SB), Ivy Bridge (IVB), HSW and BDW processors relative to the Westmere (WSM) processor, shown as the red line set at one.
Figure 3: WSM vs. SB vs. IVB vs. HSW vs. BDW for ANSYS Fluent
The Fluent benchmark exhibits a pattern similar to the LS-DYNA and STAR-CCM+ benchmarks. The purple diamonds in Figure 3 show the performance per core relative to the WSM 2.93GHz processor. The percentages above the blue BDW bars give the improvement of each BDW processor over the HSW E5-2697 v3 (the green bar). The 12 core BDW E5-2650 v4, which has fewer cores and a lower frequency, performs 14% lower than the Haswell E5-2697 v3. The E5-2690 v4 (14 core) and E5-2697A v4 (16 core) perform 11% and 21% better than the E5-2697 v3 respectively; with their higher performance per core and higher memory bandwidth per core, they are good options, particularly when per core software licensing costs are a factor. The 20 core and 22 core BDW processors perform 32% to 39% better than the HSW E5-2697 v3.
OpenFOAM (Open source Field Operation And Manipulation) is free, open source software for computational fluid dynamics (CFD).
Figure 4: HSW vs. BDW for OpenFOAM Motorbike 11M benchmark
As shown in Figure 4 for the OpenFOAM Motorbike 11M benchmark, all the Broadwell processors perform 12% to 21% better than the Haswell E5-2697 v3 processor, shown as the red line set at one. Per core performance for the 12 core, 14 core and 16 core processors is 4% to 30% better than the E5-2697 v3. The 20 core and 22 core BDW processors deliver the same performance on the Motorbike 11M benchmark. The additional cores on the 20 and 22 core parts do not provide a significant performance boost, likely because of their lower memory bandwidth per core, as explained with the STREAM results in the first blog of this series.
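The memory bandwidth per core argument can be illustrated with a back-of-the-envelope calculation. The sketch below uses theoretical peak bandwidth, not the measured STREAM results from the first blog, and rests on stated assumptions: four DDR4 channels per socket, 8 bytes per transfer, and memory running at 2400 MT/s (the DIMM speed listed in Table 2) on every processor model. The function names are hypothetical.

```python
CHANNELS_PER_SOCKET = 4        # assumption: four DDR4 channels per socket
BYTES_PER_TRANSFER = 8         # 64-bit wide memory channel
TRANSFERS_PER_SECOND = 2400e6  # assumption: DIMMs run at 2400 MT/s


def peak_socket_bandwidth_gbs():
    """Theoretical peak memory bandwidth of one socket in GB/s."""
    return CHANNELS_PER_SOCKET * BYTES_PER_TRANSFER * TRANSFERS_PER_SECOND / 1e9


def peak_bandwidth_per_core_gbs(cores_per_socket):
    """Theoretical peak bandwidth available to each core in GB/s."""
    return peak_socket_bandwidth_gbs() / cores_per_socket


# Core counts per socket of the five BDW models discussed in this blog.
for cores in (12, 14, 16, 20, 22):
    print(f"{cores} cores: {peak_bandwidth_per_core_gbs(cores):.2f} GB/s per core")
```

Under these assumptions the socket bandwidth is fixed while the core count nearly doubles from 12 to 22, so the share available to each core drops by almost half. Measured STREAM bandwidth is lower than the theoretical peak, but the per core trend is the same.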
Along with offering more cores than HSW, BDW measures better power efficiency. Considering absolute performance, performance per core and the higher memory bandwidth per core, the E5-2690 v4 (14 core) and E5-2697A v4 (16 core) are attractive options for CAE/CFD codes, particularly if per core licensing costs are involved. For applications like OpenFOAM (motorbike case), all the BDW processors performed better than the Haswell E5-2697 v3, but the additional cores on the 20 and 22 core parts do not provide a significant performance boost because of their lower memory bandwidth per core.