Dell recently updated the 12th generation PowerEdge server line with the Intel Xeon E5-2600 v2 series processors. In this blog we compare the performance of the Intel Xeon E5-2600 v2 processors against the previous E5-2600 series processors across a variety of HPC benchmarks and applications. We also compare the performance of 1600MT/s DIMMs with 1866MT/s DIMMs; 1866MT/s is only supported with Intel Xeon E5-2600 v2 series processors. Intel Xeon E5-2600 v2 series processors are supported on Dell PowerEdge R620, R720, M620, C6220 II, C8220 and C8220x platforms with the latest firmware and BIOS updates.
Intel Xeon E5-2600 series processors are built on a 32 nm manufacturing process using planar double-gate transistors. They were the "tock" step of Intel’s tick-tock model of development and introduced a new microarchitecture (codenamed Sandy Bridge) to replace the Intel Xeon 5500 series processors, which were built on the microarchitecture codenamed Nehalem.
Intel Xeon E5-2600 v2 series processors (codenamed Ivy Bridge) are built on a 22 nm manufacturing process using 3D tri-gate transistors. They are a die shrink of the previous generation, known as the "tick" step of Intel’s tick-tock model.
To maintain consistency across the server configurations having Intel Xeon E5-2695 v2 and Intel Xeon E5-2665 processors, we have used processors of the same frequency and wattage across both processor families.
Processors (E5-2600): Dual Intel Xeon E5-2665 2.4GHz (8 cores) 115W, total 16 cores per server
Processors (E5-2600 v2): Dual Intel Xeon E5-2695 v2 2.4GHz (12 cores) 115W, total 24 cores per server
Memory: 128GB total per server; one 16GB dual-rank DDR3 RDIMM per channel (8 x 1600MT/s 16GB DIMMs or 8 x 1866MT/s 16GB DIMMs)
Interconnect: Mellanox InfiniBand ConnectX3 FDR, connected back-to-back
Operating system: Red Hat Enterprise Linux 6.4 (kernel version 2.6.32-358.el6 x86_64)
Cluster manager: Bright Cluster Manager 6.1
OFED: Mellanox OFED 2.0.3
BIOS settings: System Profile set to Max Performance (Logical Processor disabled, Turbo enabled, C states disabled, Node Interleave disabled)
HPL: v2.1, from Intel MKL v11.1
STREAM: v5.10, Array Size 160000000, Iterations 100
NAS Parallel Benchmarks: v3.2, Problem Size = Class D
WRF: v2.2, Input Data Conus 12K
The High Performance Linpack (HPL) benchmark measures a system's floating-point computing power by timing how fast it solves a dense system of linear equations. It requires a software library for performing numerical linear algebra.
The STREAM benchmark is a simple synthetic program that measures sustained memory bandwidth in MB/s. It uses four kernels, COPY, SCALE, SUM and TRIAD, to evaluate memory bandwidth. The operations performed by these kernels are shown below:
COPY: a(i) = b(i)
SCALE: a(i) = q*b(i)
SUM: a(i) = b(i) + c(i)
TRIAD: a(i) = b(i) + q*c(i)
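The four kernels above can be sketched in a few lines of Python to show how a bandwidth figure is derived from them. This is only an illustrative sketch, not the STREAM source: the function names, the default array size and the use of plain Python lists are assumptions made for the example, and an interpreted loop will report figures far below what the real benchmark measures. The byte counts follow the STREAM convention of counting every array read or written per iteration (two arrays for COPY and SCALE, three for SUM and TRIAD).

```python
import time

WORD_BYTES = 8  # one double-precision element


def copy(a, b, c, q):
    for i in range(len(a)):
        a[i] = b[i]


def scale(a, b, c, q):
    for i in range(len(a)):
        a[i] = q * b[i]


def add(a, b, c, q):  # called SUM in the text, ADD in the STREAM source
    for i in range(len(a)):
        a[i] = b[i] + c[i]


def triad(a, b, c, q):
    for i in range(len(a)):
        a[i] = b[i] + q * c[i]


# kernel name -> (function, number of arrays read or written per iteration)
KERNELS = {
    "COPY": (copy, 2),
    "SCALE": (scale, 2),
    "SUM": (add, 3),
    "TRIAD": (triad, 3),
}


def run_stream(n=200_000, q=3.0):
    """Run each kernel once; return (MB/s per kernel, final contents of a)."""
    a, b, c = [0.0] * n, [2.0] * n, [0.5] * n
    results = {}
    for name, (kernel, arrays) in KERNELS.items():
        start = time.perf_counter()
        kernel(a, b, c, q)
        elapsed = max(time.perf_counter() - start, 1e-9)  # guard against a zero timer
        results[name] = arrays * WORD_BYTES * n / elapsed / 1e6  # MB/s
    return results, a


if __name__ == "__main__":
    rates, _ = run_stream()
    for name, mb_per_s in rates.items():
        print(f"{name:6s} {mb_per_s:10.1f} MB/s")
```

The real benchmark repeats each kernel many times (100 iterations in this study) and reports the best time, which this sketch omits for brevity.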
The NAS Parallel Benchmarks (NPB) is a set of benchmarks targeting performance evaluation of highly parallel supercomputers.
LU: Solves a synthetic system of nonlinear PDEs using a symmetric successive over-relaxation (SSOR) solver kernel.
EP: Generates independent Gaussian random deviates using the Marsaglia polar method.
FT: Solves a three-dimensional partial differential equation using the fast Fourier transform (FFT).
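To make the EP kernel more concrete, the Marsaglia polar method can be sketched as follows. This is a minimal illustration, not the NPB implementation: it assumes Python's standard `random` module as the uniform source (NPB uses its own generator), and the function names and seed are hypothetical. Each sample is generated without reference to any other, which is why the kernel needs almost no communication between processes.

```python
import math
import random


def gaussian_pair(rng):
    """Return two independent standard normal deviates (Marsaglia polar method)."""
    while True:
        # Draw a point uniformly in the square [-1, 1) x [-1, 1).
        x = 2.0 * rng.random() - 1.0
        y = 2.0 * rng.random() - 1.0
        s = x * x + y * y
        # Accept only points strictly inside the unit circle (and not the origin).
        if 0.0 < s < 1.0:
            factor = math.sqrt(-2.0 * math.log(s) / s)
            return x * factor, y * factor


def sample(count, seed=12345):
    """Generate `count` Gaussian deviates from a seeded generator."""
    rng = random.Random(seed)
    out = []
    while len(out) < count:
        out.extend(gaussian_pair(rng))
    return out[:count]


if __name__ == "__main__":
    values = sample(100_000)
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    print(f"mean={mean:.4f} variance={var:.4f}")
```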
The Weather Research and Forecasting (WRF) Model is a numerical weather prediction system designed to serve atmospheric research and weather forecasting needs. It features two dynamical cores, a data assimilation system, and a software architecture allowing for parallel computation and system extensibility.
The tests conducted with the Intel Xeon E5-2665 configuration are labeled SB-16c. Tests conducted with 16 cores on the Intel Xeon E5-2695 v2 are labeled IVB-16c. Finally, tests with all 24 cores on the Intel Xeon E5-2695 v2 are referenced as IVB-24c.
Tests that used 1600MT/s DIMMs are designated with the suffix 1600, while tests that used 1866MT/s DIMMs are designated with the suffix 1866.
Single Node Performance:
For single node runs, we have compared the performance obtained with the server’s default configurations with both SB and IVB processors, using all the cores available in the system. In addition, for WRF and NPB-EP, we have also compared the performance of the server after turning off 4 cores per processor for Intel Xeon E5-2600 v2 configurations.
The following graph shows the single node performance gain with the Intel Xeon E5-2600 v2 series when compared to the E5-2600 series. For HPL, only the out-of-box sustained performance is compared, utilizing all the cores in the server. Since NPB-LU and NPB-FT require the number of processes to be a power of 2, they cannot use all 24 cores, so no results are shown for them on IVB-24c.
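The power-of-2 restriction is easy to check for each configuration. The short sketch below uses the single node labels from this study; the helper name and the dictionary are illustrative only.

```python
def is_power_of_two(n):
    """True when n is a positive power of 2 (1, 2, 4, 8, ...)."""
    return n > 0 and (n & (n - 1)) == 0


# Core counts for the single node configurations described in the text.
CONFIGS = {"SB-16c": 16, "IVB-16c": 16, "IVB-24c": 24}

# Which configurations can run NPB-LU and NPB-FT on all their cores.
runnable = {label: is_power_of_two(cores) for label, cores in CONFIGS.items()}

if __name__ == "__main__":
    for label, ok in runnable.items():
        print(label, "can run LU/FT on all cores" if ok else "cannot run LU/FT on all cores")
```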
Relative performance is plotted using the SB-16c-1600 configuration as the baseline.
HPL yielded 1.53x the sustained performance on IVB-24c as compared to SB-16c. This is primarily due to the increase in the number of cores. NPB and WRF yielded ~7 – 10% improvement when executed on 16 cores of the Intel Xeon E5-2695 v2 (IVB-16c) as compared to SB-16c. With IVB-24c, WRF performs 22% better and NPB-EP ~38% better than with SB-16c. NPB-EP gains more than WRF because it is embarrassingly parallel: it requires very little communication among MPI processes and therefore benefits greatly from the increase in the number of cores.
The performance increase of WRF, NPB-EP and NPB-FT with 1866MT/s DIMMs over 1600MT/s DIMMs is 2.35%, 0.26% and 2.73% respectively. NPB-LU shows a 10% increase in performance. This is due to the large problem size required for NPB-LU, which lets it benefit considerably more from the faster memory than NPB-EP, NPB-FT or WRF.
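As a rough sanity check on the HPL result, the theoretical peak of both configurations can be estimated from the core count and clock speed. The sketch below assumes 8 double-precision floating-point operations per cycle per core (256-bit AVX add plus multiply, which applies to both processor generations) and uses the 2.4GHz base frequency, ignoring Turbo; the function name is illustrative. Under those assumptions the peak ratio is 1.5x, which is close to the 1.53x sustained ratio measured.

```python
# Assumption: 8 double-precision FLOPs per cycle per core with AVX on both generations.
FLOPS_PER_CYCLE = 8


def peak_gflops(sockets, cores_per_socket, ghz, flops_per_cycle=FLOPS_PER_CYCLE):
    """Theoretical peak in GFLOPS at the given (base) clock frequency."""
    return sockets * cores_per_socket * ghz * flops_per_cycle


sb_peak = peak_gflops(sockets=2, cores_per_socket=8, ghz=2.4)    # E5-2665
ivb_peak = peak_gflops(sockets=2, cores_per_socket=12, ghz=2.4)  # E5-2695 v2

if __name__ == "__main__":
    print(f"SB-16c peak:  {sb_peak:.1f} GFLOPS")
    print(f"IVB-24c peak: {ivb_peak:.1f} GFLOPS")
    print(f"Peak ratio:   {ivb_peak / sb_peak:.2f}x")
```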
For dual node tests, we have compared the performance obtained with the server’s default configurations with both SB and IVB processors. In addition, for WRF and NPB-EP, we have also compared the performance of the server after turning off 4 cores per processor for Intel Xeon E5-2600 v2 configurations.
The two-node cluster in these tests is connected back-to-back via InfiniBand FDR. All dual node tests were conducted with 1600MT/s memory DIMMs.
The following graph shows dual node performance gains with the Intel Xeon E5-2600 v2 series when compared to the E5-2600 series processors, plotted as IVB-48c and SB-32c respectively. Since the E5-2665 has 8 cores per socket compared to 12 cores on the E5-2695 v2, one set of results was taken with four cores per processor of the E5-2695 v2 disabled through the BIOS; it is plotted as IVB-32c.
HPL was executed on a two node cluster with E5-2665 (total 32 cores, 16 cores per server) and E5-2695 v2 (total 48 cores, 24 cores per server) processors. HPL yielded 1.52x the sustained performance on IVB-48c as compared to SB-32c, whereas WRF, NPB-EP and NPB-LU showed a performance improvement of ~2.5%. On 32 cores, WRF and NPB-EP show a ~7 – 8% increase in performance. The difference is ~22 – 32% when the 48-core E5-2695 v2 runs are compared to the 32-core E5-2665 runs.
The graph compares the memory bandwidth of the E5-2600 v2 processor to its predecessor, the E5-2600. With E5-2600, the maximum supported memory speed is 1600MT/s. With E5-2600 v2, that maximum is 1866MT/s. We’ve compared 1600MT/s DIMMs for SB-16c and IVB-24c, and also plotted the improved memory bandwidth with 1866MT/s on IVB-24c.
The IVB-24c test shows up to a ~15% increase in memory bandwidth with 1866MT/s DIMMs when compared to IVB-24c with 1600MT/s DIMMs, due to the higher frequency of the 1866MT/s DIMMs. It shows a ~27% increase when compared to SB-16c with 1600MT/s DIMMs. This increase is attributed to the dual memory controllers on the E5-2695 v2 processor, each supporting 2 memory channels, as compared to the single memory controller with 4 channels on the E5-2665 processor.
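The gain from faster DIMMs can be compared against the theoretical peak bandwidth of the memory channels. The sketch below assumes 4 DDR3 channels per socket, each 64 bits (8 bytes) wide, with one DIMM per channel as configured in this study; the function name is illustrative. Under those assumptions the theoretical gain from 1600MT/s to 1866MT/s is about 16.6%, so the measured STREAM improvement of ~15% is close to the best that could be expected.

```python
BYTES_PER_TRANSFER = 8   # one 64-bit DDR3 channel moves 8 bytes per transfer
CHANNELS_PER_SOCKET = 4  # assumption: all four channels per socket are populated


def peak_bandwidth_gbs(mt_per_s, sockets=2, channels=CHANNELS_PER_SOCKET):
    """Theoretical peak memory bandwidth in GB/s for the whole server."""
    return sockets * channels * BYTES_PER_TRANSFER * mt_per_s / 1000.0


bw_1600 = peak_bandwidth_gbs(1600)
bw_1866 = peak_bandwidth_gbs(1866)
gain_percent = (bw_1866 / bw_1600 - 1.0) * 100.0

if __name__ == "__main__":
    print(f"1600MT/s peak: {bw_1600:.1f} GB/s per server")
    print(f"1866MT/s peak: {bw_1866:.1f} GB/s per server")
    print(f"Theoretical gain: {gain_percent:.1f}%")
```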
In this study, we found that E5-2600 v2 processors deliver a significant performance improvement over E5-2600 processors. The increase in the number of cores, the larger L3 cache and the dual memory controllers all contribute to this improvement. The largest gains are seen with embarrassingly parallel applications like NPB-EP (~38% on a single node). We also see an increase in performance with 1866MT/s DIMMs over 1600MT/s DIMMs.