This blog describes in detail a performance study of the Intel Xeon E7-8800 v3 family of processors (architecture codenamed Haswell-EX). Performance on the Intel Xeon E7-8800 v3 is compared with the Intel Xeon E7-4800 v2 to quantify the generation-over-generation improvement. The applications used for this study are HPL, STREAM, WRF and ANSYS Fluent. The Intel Xeon E7-8890 v3 processors have 18 cores/36 threads and 45MB of L3 cache (2.5MB per slice). With AVX workloads, the clock speed of the Intel Xeon E7-8890 v3 drops from 2.5GHz to 2.1GHz. These processors support a QPI speed of 9.6 GT/s.
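Processor (PowerEdge R920): 4 x Intel Xeon E7-4870 v2 @ 2.3GHz (15 cores), 30MB L3 cache, 130W
Processor (PowerEdge R930): 4 x Intel Xeon E7-8890 v3 @ 2.5GHz (18 cores), 45MB L3 cache, 165W
Memory (PowerEdge R920): 512GB = 32 x 16GB DDR3 @ 1333MHz RDIMMs
Memory (PowerEdge R930): 1024GB = 64 x 16GB DDR4 @ 1600MHz RDIMMs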
Processor Settings > Logical Processors
Processor Settings > QPI Speed: Maximum Data Rate
Processor Settings > System Profile
Software and Firmware
RHEL 6.6 x86_64
Benchmark and Applications
HPL (PowerEdge R920): v2.1 from MKL 11.1
HPL (PowerEdge R930): v2.1 from MKL 11.2
STREAM: v5.10, Array Size 1800000000, Iterations 100
WRF (PowerEdge R920): v3.5.1, Input Data Conus12KM, Netcdf-126.96.36.199
WRF (PowerEdge R930): v3.6.1, Input Data Conus12KM, Netcdf-4.3.2
ANSYS Fluent: v15, Input Data: eddy_417k, truck_poly_14m, sedan_4m, aircraft_2m
The objective of this comparison was to show the generation-over-generation performance improvement in enterprise 4S (four-socket) platforms. The performance differences between the two server generations are due to the improved system architecture, the greater number of cores and the higher-frequency memory. The software versions were not a significant factor.
High Performance LINPACK (HPL) is a benchmark that solves a (random) dense linear system in double-precision (64-bit) arithmetic on distributed-memory systems. HPL was run on both the PowerEdge R930 and the PowerEdge R920 with a block size of NB=192 and a problem size N chosen so that the matrix occupies about 90% of total memory.
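To make the problem-size rule concrete, the sketch below shows one common way to derive N from the installed memory. It assumes the N x N matrix of 8-byte doubles dominates the memory footprint and that N is rounded down to a multiple of NB. The function name and the use of GiB are illustrative assumptions, not the exact procedure used in this study.

```python
import math

def hpl_problem_size(total_mem_gib, nb=192, mem_fraction=0.90):
    """Largest N (a multiple of NB) whose N x N matrix of doubles
    fits in mem_fraction of the installed memory (hypothetical helper)."""
    bytes_available = total_mem_gib * 1024**3 * mem_fraction
    doubles_available = int(bytes_available // 8)   # 8 bytes per double
    n = math.isqrt(doubles_available)               # N*N doubles must fit
    return (n // nb) * nb                           # round down to a multiple of NB

r920_n = hpl_problem_size(512)    # 512GB configuration
r930_n = hpl_problem_size(1024)   # 1024GB configuration
print(r920_n, r930_n)
```

Doubling the memory increases N by a factor of about 1.41 (the square root of 2), because the matrix grows with the square of N.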
As shown in the graph above, LINPACK showed a 1.95X performance improvement with four Intel Xeon E7-8890 v3 processors on the R930 server compared with four Intel Xeon E7-4870 v2 processors on the R920 server. This is due to the substantial increase in the number of cores, the higher memory speed, the higher floating-point throughput (FLOP/s) of the processor and the improved processor architecture.
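For context, the theoretical peak of each system can be estimated as sockets x cores x clock x FLOPs per cycle. The sketch below uses figures that are assumptions, not values stated in this post: 8 double-precision FLOPs per cycle per core for the AVX-based E7 v2, 16 for the AVX2/FMA-based E7 v3, the 2.3GHz nominal clock for the E7-4870 v2, and the 2.1GHz AVX clock for the E7-8890 v3.

```python
def peak_gflops(sockets, cores_per_socket, ghz, flops_per_cycle):
    """Theoretical double-precision peak in GFLOP/s (hypothetical helper)."""
    return sockets * cores_per_socket * ghz * flops_per_cycle

# Assumed: 8 DP FLOPs/cycle/core (AVX) and the nominal 2.3GHz clock
r920_peak = peak_gflops(4, 15, 2.3, 8)
# Assumed: 16 DP FLOPs/cycle/core (AVX2 + FMA) at the 2.1GHz AVX clock
r930_peak = peak_gflops(4, 18, 2.1, 16)

ratio = r930_peak / r920_peak
print(f"R920 peak: {r920_peak:.1f} GFLOP/s")
print(f"R930 peak: {r930_peak:.1f} GFLOP/s")
print(f"Peak ratio: {ratio:.2f}x")
```

Under these assumptions the ratio of theoretical peaks is somewhat higher than the measured 1.95X. That is plausible, since measured HPL results typically fall short of peak by a system-dependent margin.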
STREAM is a simple synthetic benchmark that measures sustained memory bandwidth using four kernels: COPY, SCALE, SUM and TRIAD.
The operations performed by these kernels are shown below:
COPY: a(i) = b(i)
SCALE: a(i) = q*b(i)
SUM: a(i) = b(i) + c(i)
TRIAD: a(i) = b(i) + q*c(i)
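The four kernels above can be sketched in a few lines of Python. This is only an illustration of the operations and of STREAM's byte accounting (8 bytes per double for each array read or written), not the actual STREAM code. Bandwidth figures from interpreted Python are not comparable to the compiled benchmark, and the function name and default sizes are hypothetical.

```python
import time

BYTES_PER_DOUBLE = 8

def run_stream(n=1_000_000, q=3.0):
    """Time each kernel once; return {name: (first element, GB/s)}."""
    b = [1.0] * n
    c = [2.0] * n
    # Each entry: (kernel, number of arrays read or written per iteration)
    kernels = {
        "COPY":  (lambda: [b[i] for i in range(n)], 2),
        "SCALE": (lambda: [q * b[i] for i in range(n)], 2),
        "SUM":   (lambda: [b[i] + c[i] for i in range(n)], 3),
        "TRIAD": (lambda: [b[i] + q * c[i] for i in range(n)], 3),
    }
    results = {}
    for name, (kernel, arrays_touched) in kernels.items():
        start = time.perf_counter()
        a = kernel()
        elapsed = max(time.perf_counter() - start, 1e-9)  # avoid divide-by-zero
        gb_per_s = arrays_touched * BYTES_PER_DOUBLE * n / elapsed / 1e9
        results[name] = (a[0], gb_per_s)
    return results

for name, (value, gbps) in run_stream().items():
    print(f"{name:5s} a[0]={value}  {gbps:.3f} GB/s")
```

COPY and SCALE are counted as moving 16 bytes per iteration and SUM and TRIAD as 24 bytes, which is why the byte count differs between kernels.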
This chart compares the sustained memory bandwidth of the PowerEdge R920 and PowerEdge R930 servers. STREAM measured 231GB/s on the PowerEdge R920 and 260GB/s on the PowerEdge R930, an improvement of roughly 12% in memory bandwidth. This increase is due to the higher DIMM speed available on the PowerEdge R930.
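As a quick sanity check on these figures, the sketch below computes the memory footprint of the STREAM run from the array size listed in the configuration (1800000000 elements) and the relative bandwidth gain. It assumes three arrays of 8-byte doubles and compares one array against the combined L3 cache of four 45MB processors. The variable names are illustrative.

```python
ARRAY_ELEMENTS = 1_800_000_000      # array size from the configuration
BYTES_PER_DOUBLE = 8

array_bytes = ARRAY_ELEMENTS * BYTES_PER_DOUBLE   # one array
total_bytes = 3 * array_bytes                     # arrays a, b and c

# Combined L3 cache of the R930: 4 sockets x 45MB
l3_total_bytes = 4 * 45 * 1024**2
cache_multiple = array_bytes / l3_total_bytes

# Relative gain in sustained bandwidth, R930 over R920
gain_percent = (260 / 231 - 1) * 100

print(f"Per-array size: {array_bytes / 1e9:.1f} GB")
print(f"Total footprint: {total_bytes / 1e9:.1f} GB")
print(f"Array size / total L3: {cache_multiple:.0f}x")
print(f"Bandwidth gain: {gain_percent:.1f}%")
```

Each array is far larger than the combined L3 cache, so the measurement reflects main-memory bandwidth and not cache bandwidth.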
The WRF (Weather Research and Forecasting) model is a next-generation mesoscale numerical weather prediction system designed to serve both atmospheric research and operational forecasting needs. The model serves a wide range of meteorological applications across scales from tens of meters to thousands of kilometers. WRF can generate atmospheric simulations based on real data (observations, analyses) or idealized conditions.
WRF performance analysis was run with the Conus12KM dataset. Conus12KM is a single-domain, medium-size, 48-hour case at 12km resolution over the continental US (CONUS) domain, with a time step of 72 seconds.
With the Conus12KM dataset, WRF showed an average time of 0.22 seconds per time step on the PowerEdge R930 server, compared with 0.26 seconds on the PowerEdge R920 server, which is an 18% improvement.
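Because this metric is "lower is better", the improvement can be expressed in two ways. The sketch below computes both from the reported 0.26-second (PowerEdge R920) and 0.22-second (PowerEdge R930) averages. The function names are illustrative.

```python
def speedup_percent(old_time, new_time):
    """Throughput gain: how much more work per unit time the new system does."""
    return (old_time / new_time - 1.0) * 100.0

def time_reduction_percent(old_time, new_time):
    """Fraction of the old run time that was saved."""
    return (old_time - new_time) / old_time * 100.0

wrf_speedup = speedup_percent(0.26, 0.22)
wrf_time_saved = time_reduction_percent(0.26, 0.22)

print(f"Speedup: {wrf_speedup:.1f}%")
print(f"Time reduction: {wrf_time_saved:.1f}%")
```

The 18% figure matches the speedup form (0.26 / 0.22), not the time-reduction form.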
ANSYS Fluent contains the broad physical modeling capabilities needed to model flow, turbulence, heat transfer, and reactions for industrial applications ranging from air flow over an aircraft wing to combustion in a furnace, from bubble columns to oil platforms, from blood flow to semiconductor manufacturing, and from clean room design to wastewater treatment plants.
We used four different datasets for Fluent and considered ‘Solver rating’ (higher is better) as the performance metric. For all test cases, Fluent on the PowerEdge R930 showed a 24% to 29% performance improvement in comparison to the PowerEdge R920.
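Solver rating is commonly defined as the number of benchmark runs that could be completed in 24 hours, that is, 86400 seconds divided by the wall-clock time of one run. The sketch below applies that definition to hypothetical timings of 100 and 80 seconds. These timings are made up for illustration and are not measurements from this study.

```python
SECONDS_PER_DAY = 86_400

def solver_rating(wall_clock_seconds):
    """Benchmark runs per day (higher is better), per the common definition."""
    return SECONDS_PER_DAY / wall_clock_seconds

def rating_gain_percent(old_seconds, new_seconds):
    """Percentage improvement in solver rating between two run times."""
    old_rating = solver_rating(old_seconds)
    new_rating = solver_rating(new_seconds)
    return (new_rating / old_rating - 1.0) * 100.0

# Hypothetical wall-clock times, for illustration only
old_rating = solver_rating(100.0)
new_rating = solver_rating(80.0)
gain = rating_gain_percent(100.0, 80.0)

print(f"Old rating: {old_rating:.0f}, new rating: {new_rating:.0f}, gain: {gain:.0f}%")
```

Because the rating is inversely proportional to run time, a 24% to 29% rating improvement corresponds to a run-time reduction of roughly 19% to 22%.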
The PowerEdge R930 server outperforms its previous-generation PowerEdge R920 server in both the benchmark and application comparisons. Thanks to its newer processors with a higher core count, higher-frequency memory and CPU architecture improvements, the PowerEdge R930 delivered better performance than the PowerEdge R920. The PowerEdge R930 platform with four Intel Xeon EX processors is a very good choice for HPC applications that can scale up to a large number of cores and a large amount of memory.