Authors: Somanath Moharana and Ashish Kumar Singh, Dell EMC HPC Innovation Lab, September 2017
This blog presents an analysis of the High Performance Conjugate Gradient (HPCG) benchmark on the Intel(R) Xeon(R) Gold 6150 CPU (codenamed “Skylake”). It also compares the performance of the Intel(R) Xeon(R) Gold 6150 processor with that of its previous-generation counterpart, the Intel(R) Xeon(R) CPU E5-2697 v4 (codenamed “Broadwell-EP”).
Introduction to HPCG
The High Performance Conjugate Gradients (HPCG) benchmark is a metric for ranking HPC systems and can be considered a complement to the High Performance LINPACK (HPL) benchmark. HPCG is designed to exercise computational and data access patterns that more closely match a broad set of important applications, and to give system designers an incentive to invest in capabilities that improve the collective performance of these applications.
The HPCG benchmark is based on a 3D regular 27-point discretization of an elliptic partial differential equation. The 3D domain is scaled to fill a 3D virtual process grid for all of the available MPI ranks. The preconditioned conjugate gradient (CG) algorithm is used to solve the intermediate systems of equations and incorporates a local and symmetric Gauss-Seidel pre-conditioning step that requires a triangular forward solve and a backward solve. The benchmark exhibits irregular accesses to memory and fine-grain recursive computations.
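To make the 27-point discretization concrete, the following is a minimal sketch that builds the sparse rows for a small local grid in pure Python. It is an illustration, not the benchmark code: the function name build_27pt_rows is hypothetical, and the coefficient values (26 on the diagonal, -1 for each neighbor) are an assumption modeled on the convention of the reference HPCG implementation.

```python
def build_27pt_rows(nx, ny, nz):
    """Build sparse rows of a 27-point stencil on an nx x ny x nz grid.

    Each row is a list of (column, value) pairs. Coefficients are assumed:
    26.0 on the diagonal and -1.0 for every existing neighbor.
    """
    rows = []
    for iz in range(nz):
        for iy in range(ny):
            for ix in range(nx):
                row = []
                # Visit the point itself and its (up to) 26 neighbors.
                for dz in (-1, 0, 1):
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            jx, jy, jz = ix + dx, iy + dy, iz + dz
                            inside = 0 <= jx < nx and 0 <= jy < ny and 0 <= jz < nz
                            if inside:
                                col = jx + nx * (jy + ny * jz)
                                is_diag = (dx, dy, dz) == (0, 0, 0)
                                row.append((col, 26.0 if is_diag else -1.0))
                rows.append(row)
    return rows


rows = build_27pt_rows(4, 4, 4)
interior = 1 + 4 * (1 + 4 * 1)  # index of grid point (1, 1, 1)
print("rows:", len(rows))
print("nonzeros at an interior point:", len(rows[interior]))
print("nonzeros at a corner point:", len(rows[0]))
```

Interior points couple to all 27 grid positions, while boundary points have fewer neighbors, which is one source of the irregular memory access the benchmark is known for.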
HPCG has four computational blocks: sparse matrix-vector multiplication (SPMV), symmetric Gauss-Seidel (SymGS), vector update (WAXPBY) and dot product (DDOT). It also has two communication blocks: MPI_Allreduce and halo exchange.
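The way these four computational blocks fit together in a preconditioned CG iteration can be sketched in a few lines of serial Python. This is a teaching sketch under stated assumptions, not the HPCG source: the function names are illustrative, a small 1D Laplacian stands in for the 27-point operator, and the two communication blocks are omitted because everything runs in a single process.

```python
def spmv(rows, x):
    """SPMV: y = A * x, with A stored as rows of (column, value) pairs."""
    return [sum(v * x[j] for j, v in row) for row in rows]


def ddot(x, y):
    """DDOT: dot product (a global MPI_Allreduce in a distributed run)."""
    return sum(a * b for a, b in zip(x, y))


def waxpby(alpha, x, beta, y):
    """WAXPBY: w = alpha * x + beta * y."""
    return [alpha * a + beta * b for a, b in zip(x, y)]


def symgs(rows, r):
    """SymGS: one forward and one backward Gauss-Seidel sweep from z = 0."""
    n = len(rows)
    z = [0.0] * n
    for order in (range(n), range(n - 1, -1, -1)):
        for i in order:
            diag, acc = 0.0, r[i]
            for j, v in rows[i]:
                if j == i:
                    diag = v
                else:
                    acc -= v * z[j]
            z[i] = acc / diag
    return z


def pcg(rows, b, tol=1e-10, max_iter=50):
    """Preconditioned CG using SymGS as the preconditioner."""
    x = [0.0] * len(b)
    r = list(b)
    z = symgs(rows, r)
    p = list(z)
    rz = ddot(r, z)
    for iteration in range(1, max_iter + 1):
        Ap = spmv(rows, p)
        alpha = rz / ddot(p, Ap)
        x = waxpby(1.0, x, alpha, p)
        r = waxpby(1.0, r, -alpha, Ap)
        if ddot(r, r) ** 0.5 < tol:
            return x, iteration
        z = symgs(rows, r)
        rz_new = ddot(r, z)
        p = waxpby(1.0, z, rz_new / rz, p)
        rz = rz_new
    return x, max_iter


# Stand-in test matrix (assumption): 1D Laplacian with 2 on the diagonal.
n = 8
A = [[(j, 2.0 if j == i else -1.0) for j in (i - 1, i, i + 1) if 0 <= j < n]
     for i in range(n)]
b = spmv(A, [1.0] * n)  # right-hand side whose exact solution is all ones
x, iterations = pcg(A, b)
print("iterations:", iterations)
print("max error:", max(abs(xi - 1.0) for xi in x))
```

The forward and backward sweeps in symgs update each entry using values computed earlier in the same sweep, which is the fine-grain recursive dependency that makes this kernel hard to parallelize.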
Introduction to Intel Skylake processor
Intel Skylake is a microarchitecture redesign built on the same 14 nm manufacturing process technology as its predecessor, serving as a "tock" in Intel's "tick-tock" manufacturing and design model, with support for up to 28 cores per socket. It supports 6 DDR4 memory channels per socket with up to 2 DIMMs per channel (DPC), at memory speeds of up to 2666 MT/s.
Please see the "BIOS characteristics of Skylake processor" blog for a better understanding of Skylake processors and their BIOS features on Dell EMC platforms.
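As a rough guide to what this memory configuration means in practice, the theoretical peak memory bandwidth can be estimated from the channel count and transfer rate. The sketch below assumes the standard 64-bit (8-byte) DDR4 data path per channel and a two-socket node as used in this study; the function name is illustrative, and sustained bandwidth on real hardware is lower than this theoretical figure.

```python
def peak_memory_bandwidth_gbs(channels, transfer_rate_mts, bytes_per_transfer=8):
    """Theoretical peak bandwidth in GB/s for one socket.

    channels           -- DDR4 channels per socket
    transfer_rate_mts  -- memory speed in mega-transfers per second
    bytes_per_transfer -- assumed 8 bytes (64-bit DDR4 data path)
    """
    return channels * transfer_rate_mts * 1e6 * bytes_per_transfer / 1e9


per_socket = peak_memory_bandwidth_gbs(channels=6, transfer_rate_mts=2666)
per_node = 2 * per_socket  # two sockets per node

print(f"peak per socket: {per_socket:.1f} GB/s")
print(f"peak per node:   {per_node:.1f} GB/s")
```

Because HPCG is dominated by sparse kernels with irregular memory access, this bandwidth figure is usually a more relevant indicator than peak floating-point throughput.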
Table 1: Details of Servers used for HPCG analysis
Component | Skylake server | Broadwell-EP server
Processor | 2 x Intel(R) Xeon(R) Gold 6150 @2.7GHz, 18c | 2 x Intel(R) Xeon(R) CPU E5-2697 v4 @2.3GHz, 18c
Memory | 192GB (12 x 16GB) DDR4 | 128GB (8 x 16GB) DDR4
Interconnect | Intel Omni-Path | Intel Omni-Path
Operating system | Red Hat Enterprise Linux Server release 7.3 | Red Hat Enterprise Linux Server release 7.2
Math library | Intel® MKL 2017.0.3 | Intel® MKL 2017.0.0
Processor Settings > Logical Processors
Processor Settings > Sub NUMA cluster
HPCG Performance analysis with Intel Skylake
HPCG needs a suitably chosen problem size to deliver its best results. For a valid run, the problem must be large enough that the arrays accessed in the CG iteration loop do not fit in the processor cache. It should also occupy a significant fraction of main memory, at least one quarter of the total.
Adjusting the local domain dimensions changes the global problem size. For HPCG performance characterization, we chose local domain dimensions of 160^3, 192^3 and 224^3 with an execution time of t=30 seconds. The local domain dimensions define the global domain dimensions as (NPx*Nx) x (NPy*Ny) x (NPz*Nz), where Nx=Ny=Nz=160, 192 or 224, and NPx x NPy x NPz is the 3D process grid into which the NR MPI processes used for the benchmark are factored (NPx*NPy*NPz = NR).
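The relationship between local and global dimensions can be sketched as follows. HPCG arranges the MPI ranks in a 3D process grid, and each global dimension is the local dimension multiplied by the number of ranks along that axis. The 2 x 2 x 1 factorization of 4 ranks used here is an assumption for illustration, since the benchmark chooses its own factorization. The sketch also writes an input file in the four-line hpcg.dat layout used by the reference implementation (two header lines, the local dimensions, then the run time in seconds); check the documentation of the binary you run, as optimized builds may take these values differently.

```python
def global_dims(local, proc_grid):
    """Global domain size: local size times ranks along each axis."""
    return tuple(n * p for n, p in zip(local, proc_grid))


def hpcg_dat(local, seconds):
    """Render input text in the reference hpcg.dat layout (assumed)."""
    nx, ny, nz = local
    lines = [
        "HPCG benchmark input file",
        "Sandia National Laboratories; University of Tennessee, Knoxville",
        f"{nx} {ny} {nz}",
        str(seconds),
    ]
    return "\n".join(lines) + "\n"


local = (192, 192, 192)
proc_grid = (2, 2, 1)  # hypothetical factorization of 4 MPI ranks

gx, gy, gz = global_dims(local, proc_grid)
equations = gx * gy * gz

print(f"global domain: {gx} x {gy} x {gz}")
print(f"equations:     {equations}")
print(hpcg_dat(local, 30), end="")
```

With these assumptions, the total number of equations is simply the local grid volume multiplied by the number of ranks, so the global problem grows linearly with the rank count.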
Figure 1: HPCG Performance on multiple grid sizes with Intel Xeon Gold 6150 processors
As shown in Figure 1, the local grid size of 192^3 gives the best performance of the three local grid sizes tested (160^3, 192^3 and 224^3). It delivers 36.14 GFLOP/s on a single node, and performance increases linearly with the number of nodes. All of these tests were run with 4 MPI processes and 9 OpenMP threads per MPI process.
Figure 2: Time consumed by HPCG computational routines on Intel Xeon Gold 6150 processors
The HPCG output file reports the time spent in each routine, and Figure 2 plots these times. HPCG spends most of its time in the compute-intensive SymGS preconditioning step and in the sparse matrix-vector multiplication (SPMV). The vector update phase (WAXPBY) takes far less time than SymGS, and the dot product used in the residual calculation (DDOT) takes the least time of all four computational routines. Because the local grid size is the same across all multi-node runs, the time spent in each of the four compute kernels is approximately the same for every run.
Figure 3: HPCG performance over multiple generations of Intel processors
Figure 3 compares HPCG performance on Intel Broadwell-EP and Intel Skylake processors. The dots in the figure show the performance improvement of Skylake over Broadwell-EP. On a single node, Skylake performs ~65% better than Broadwell-EP, and on both two nodes and four nodes it performs ~67% better.
HPCG on the Intel(R) Xeon(R) Gold 6150 processor shows ~65% higher performance than on the Intel(R) Xeon(R) CPU E5-2697 v4 processor. HPCG also scales out well, with performance increasing linearly with the number of nodes.
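For readers who want to reproduce the percentage arithmetic with their own measurements, here is a small sketch. The function names are illustrative. The single-node Broadwell-EP figure it prints is back-calculated from the 36.14 GFLOP/s Skylake result and the approximate 65% improvement, so it is a derived estimate and not a measurement reported in this blog.

```python
def percent_improvement(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return (new / old - 1.0) * 100.0


def implied_baseline(new, improvement_pct):
    """Baseline value implied by a result and its stated improvement."""
    return new / (1.0 + improvement_pct / 100.0)


skylake_gflops = 36.14     # single-node result reported for 192^3
stated_improvement = 65.0  # approximate single-node improvement

# Derived estimate only; the blog does not state this number.
broadwell_estimate = implied_baseline(skylake_gflops, stated_improvement)

print(f"implied Broadwell-EP single node: ~{broadwell_estimate:.1f} GFLOP/s")
print(f"check: {percent_improvement(skylake_gflops, broadwell_estimate):.1f}%")
```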