Authors: Rengan Xu, Frank Han and Nishanth Dandapanthu. Dell EMC HPC Innovation Lab. Mar. 2017
Introduction to P40 GPU and TensorRT
Deep Learning (DL) has two major phases: training and inference (also called testing or scoring). The training phase builds a deep neural network (DNN) model from a large amount of existing data, and the inference phase uses the trained model to make predictions from new data. Inference can run in the data center, in embedded systems, and in automotive and mobile devices. Inference usually must respond to user requests as quickly as possible, often in real time. To meet the low-latency requirement of inference, NVIDIA® launched the Tesla® P4 and P40 GPUs. Aside from high floating point throughput and efficiency, both GPUs introduce two new instructions designed specifically for inference computations: an 8-bit integer (INT8) 4-element vector dot product (DP4A) and a 16-bit 2-element vector dot product (DP2A). Deep learning researchers have found that FP16 can achieve the same inference accuracy as FP32, and that many applications need only INT8 or lower precision to keep an acceptable inference accuracy. The Tesla P4 delivers a peak of 21.8 INT8 TIOP/s (Tera Integer Operations per Second), while the P40 delivers a peak of 47.0 INT8 TIOP/s. This blog focuses only on the P40 GPU.
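To make the DP4A description above concrete, here is a minimal Python sketch of what such an instruction computes: a dot product of two 4-element INT8 vectors added to an integer accumulator. The function name dp4a and its range checks are illustrative assumptions, not NVIDIA's API, and the sketch models only the arithmetic, not the hardware behavior or performance.

```python
def dp4a(a, b, acc=0):
    """Illustrative model of an INT8 4-element dot product with accumulate.

    a, b: sequences of four integers, each within the signed 8-bit range.
    acc:  integer accumulator (assumed wider than 8 bits; not range-checked here).
    """
    if len(a) != 4 or len(b) != 4:
        raise ValueError("DP4A operates on 4-element vectors")
    for value in list(a) + list(b):
        if not -128 <= value <= 127:
            raise ValueError("operands must fit in a signed 8-bit integer")
    # Four multiplies and four adds are expressed as one operation.
    return acc + sum(x * y for x, y in zip(a, b))
```

For example, dp4a([1, 2, 3, 4], [5, 6, 7, 8]) evaluates 1*5 + 2*6 + 3*7 + 4*8. Packing four multiply-adds into one instruction is where the theoretical 4x advantage of INT8 over FP32 comes from.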
TensorRT™, previously called GIE (GPU Inference Engine), is a high performance deep learning inference engine for production deployment of deep learning applications that maximizes inference throughput and efficiency. TensorRT lets users take advantage of the fast reduced precision instructions provided in the Pascal GPUs. TensorRT v2 supports the INT8 reduced precision operations that are available on the P40.
This blog quantifies the performance of deep learning inference using TensorRT on Dell’s PowerEdge C4130 server, which is equipped with 4 Tesla P40 GPUs. Since TensorRT is only available for Ubuntu OS, all the experiments were done on Ubuntu. Table 1 shows the hardware and software details. The inference benchmark we used was giexec from the TensorRT sample codes. This sample uses synthetic images filled with random non-zero numbers to simulate real images. Two classic neural networks were tested: AlexNet (2012 ImageNet winner) and GoogLeNet (2014 ImageNet winner), which is much deeper and more complicated than AlexNet.
We measured the inference performance in images/sec, i.e. the number of images that can be processed per second. To measure the performance improvement of the current generation P40 GPU, we also compared its performance with the previous generation M40 GPU. The most important goal of this testing is to measure the inference performance in INT8 mode compared to FP32 mode. The P40 uses the new Pascal architecture and supports the new INT8 instructions. The previous generation M40 uses the Maxwell architecture and does not support INT8 instructions. The theoretical INT8 and FP32 performance of both the M40 and the P40 is shown in Table 2. We measured FP32 performance on both devices, and both FP32 and INT8 performance on the P40.
Table 1: Hardware configuration and software details
PowerEdge C4130 (configuration G)
2 x Intel Xeon CPU E5-2690 v4 @2.6GHz (Broadwell)
256GB DDR4 @ 2400MHz
4x Tesla P40 with 24GB GPU memory
Software and Firmware
CUDA and driver version
Table 2: Comparison between Tesla M40 and P40
In this section, we will present the inference performance with TensorRT on GoogLeNet and AlexNet. We also implemented the benchmark with MPI so that it can be run on multiple P40 GPUs within a node. We will also compare the performance of P40 with M40. Lastly we will show the performance impact when using different batch sizes.
Figure 1 shows the inference performance with the TensorRT library for both GoogLeNet and AlexNet. We can see that INT8 mode is ~3x faster than FP32 for both neural networks. This is expected, since the theoretical speedup of INT8 over FP32 is 4x only if nothing but multiplications is performed and no other overhead is incurred. In practice there are kernel launches, occupancy limits, data movement and math other than multiplications, so the measured speedup is reduced to about 3x.
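The gap between the theoretical 4x and the measured ~3x can be illustrated with a simple Amdahl-style model. The sketch below is our own back-of-the-envelope reading, not a measurement from the benchmark: it assumes that a fraction of the FP32 run time is spent in work that INT8 accelerates by exactly 4x and that the rest is unaffected. The function names are hypothetical.

```python
def overall_speedup(accelerated_fraction, kernel_speedup):
    """Overall speedup when only part of the run time is accelerated."""
    remaining = 1.0 - accelerated_fraction
    return 1.0 / (remaining + accelerated_fraction / kernel_speedup)


def fraction_for_speedup(observed_speedup, kernel_speedup):
    """Invert the model: accelerated fraction implied by an observed speedup."""
    return (1.0 - 1.0 / observed_speedup) / (1.0 - 1.0 / kernel_speedup)


# Under this simplified model, a 3x overall speedup with a 4x kernel speedup
# implies that roughly 8/9 of the FP32 time was in the accelerated part.
implied_fraction = fraction_for_speedup(3.0, 4.0)
```

This is only a rough way to reason about the result; the real overheads listed above do not split cleanly into an accelerated and a non-accelerated part.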
Figure 1: Inference performance with TensorRT library
Dell’s PowerEdge C4130 supports up to 4 GPUs in a server. To make use of all GPUs, we implemented the inference benchmark using MPI so that each MPI process runs on its own GPU. Figure 2 and Figure 3 show the multi-GPU inference performance on GoogLeNet and AlexNet, respectively. With multiple GPUs, linear speedup was achieved for both neural networks. This is because each GPU processes its own images and there is no communication or synchronization among the GPUs.
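The one-process-per-GPU scheme can be sketched without any MPI library. The helper names below (gpu_for_rank, aggregate_throughput) and the round-robin mapping are assumptions for illustration; the benchmark's actual MPI code is not shown in this blog.

```python
def gpu_for_rank(rank, gpus_per_node=4):
    """Map a process rank to a local GPU index (round-robin, assumed scheme)."""
    if rank < 0 or gpus_per_node <= 0:
        raise ValueError("rank must be >= 0 and gpus_per_node must be > 0")
    return rank % gpus_per_node


def aggregate_throughput(per_rank_images_per_sec):
    """Total images/sec when ranks run independently with no communication."""
    return sum(per_rank_images_per_sec)


# Four ranks on one node each get their own GPU.
assignment = [gpu_for_rank(rank) for rank in range(4)]
```

Because the ranks never exchange data, the total throughput is simply the sum of the per-GPU throughputs, which is why the scaling is linear.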
Figure 2: Multi-GPU inference performance with TensorRT GoogLeNet
Figure 3: Multi-GPU inference performance with TensorRT AlexNet
To highlight the performance advantage of the P40 GPU and its native support for INT8, we compared the inference performance of the P40 with that of the previous generation M40 GPU. The results are shown in Figure 4 and Figure 5 for GoogLeNet and AlexNet, respectively. In FP32 mode, the P40 is 1.7x faster than the M40, and INT8 mode on the P40 is 4.4x faster than FP32 mode on the M40.
Figure 4: Inference performance comparison between P40 and M40 (GoogLeNet)
Figure 5: Inference performance comparison between P40 and M40 (AlexNet)
Deep learning inference can be applied in different scenarios. Some scenarios require a large batch size and some require no batching at all (i.e. a batch size of 1). Therefore we also measured the performance with different batch sizes, and the result is shown in Figure 6. Note that the purpose here is not to compare the performance of GoogLeNet and AlexNet, but to check how the performance changes with batch size for each neural network. It can be seen that without batch processing the inference performance is very low, because the GPU is not given enough work to keep it busy. The larger the batch size, the higher the inference performance, although the rate of improvement slows down. At a batch size of 4096, GoogLeNet stopped running because the GPU memory required for this neural network exceeded the GPU memory limit. AlexNet was still able to run because it is a less complicated neural network than GoogLeNet and therefore requires less GPU memory. So in these tests the largest usable batch size is limited only by GPU memory.
Figure 6: Inference performance with different batch sizes
Conclusions and Future Work
In this blog, we presented the deep learning inference performance of the NVIDIA® TensorRT library on P40 and M40 GPUs. INT8 mode on the P40 is about 3x faster than FP32 mode on the P40 and 4.4x faster than FP32 mode on the previous generation M40. Using multiple GPUs increases inference performance linearly because no communication or synchronization is needed among them. We also noticed that a larger batch size leads to higher inference performance and that the largest batch size is limited only by GPU memory size. In future work, we will evaluate the inference performance with real world deep learning applications.
Authors: Rengan Xu, Frank Han and Nishanth Dandapanthu. Dell EMC HPC Innovation Lab. Feb 2017
Introduction to P100-PCIe GPU
This blog describes the performance analysis of NVIDIA® Tesla® P100™ GPUs on a cluster of Dell PowerEdge C4130 servers. There are two types of P100 GPUs: PCIe-based and SXM2-based. In PCIe-based servers, GPUs are connected by PCIe buses and one P100 delivers around 4.7 and 9.3 TeraFLOPS of double and single precision performance, respectively. In P100-SXM2 servers, GPUs are connected by NVLink and one P100 delivers around 5.3 and 10.6 TeraFLOPS of double and single precision performance, respectively. This blog focuses on P100 for PCIe-based servers, i.e. P100-PCIe. We have already analyzed the P100 performance for several deep learning frameworks in this blog. The objective of this blog is to compare the performance of HPL, LAMMPS, NAMD, GROMACS, HOOMD-BLUE, Amber, ANSYS Mechanical and RELION. The hardware configuration of the cluster is the same as in the deep learning blog. Briefly, we used a cluster of four C4130 nodes. Each node has dual Intel Xeon E5-2690 v4 CPUs and four NVIDIA P100-PCIe GPUs, and all nodes are connected with EDR InfiniBand. Table 1 shows the detailed information about the hardware and software used in every compute node.
Table 1: Experiment Platform and Software Details
P100-PCIe with 16GB GPU memory
Mellanox ConnectX-4 VPI (EDR 100Gb/s Infiniband)
RHEL 7.2 x86_64
Linux Kernel Version
CUDA version and driver
CUDA 8.0.44 (375.20)
High Performance Linpack (HPL)
HPL is a multicomputer parallel application that measures how fast computers solve a dense n by n system of linear equations using LU decomposition with partial row pivoting, and it is designed to be run at very large scale. HPL on the tested cluster uses double precision floating point operations. Figure 1 shows the HPL performance on the tested P100-PCIe cluster. It can be seen that 1 P100 is 3.6x faster than 2 x E5-2690 v4 CPUs. HPL also scales very well with more GPUs within nodes or across nodes. Recall that each server has 4 P100 GPUs, so 8, 12 and 16 P100 GPUs correspond to 2, 3 and 4 servers. 16 P100 GPUs achieve a speedup of 14.9x compared to 1 P100. Note that the overall efficiency is calculated as: HPL Efficiency = rMax / (CPUs rPeak + GPUs rPeak), where rPeak is the highest theoretical FLOPS that could be achieved at the base clock, and rMax is the number reported by HPL, i.e. the real performance that can be achieved. HPL cannot be run at the max boost clock. It typically runs at some frequency in between, but the average is closer to the base clock than to the max boost clock, which is why we used the base clock for the rPeak calculation. Although we also included the CPU rPeak in the efficiency calculation, when running HPL on P100 we set DGEMM_SPLIT=1.0, which means the CPU is not really contributing to the DGEMM and therefore does not contribute many FLOPS. We observed that the CPUs stayed fully utilized, but they were just handling the overhead and data movement needed to keep the GPUs fed. What matters most for the P100 is that rMax is very high.
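The efficiency formula above, and the scaling efficiency implied by the 14.9x speedup on 16 GPUs, can be written as a short Python sketch. The function names are our own, and the rMax and rPeak values in the example call are placeholders chosen for illustration, not results from this study.

```python
def hpl_efficiency(rmax, cpu_rpeak, gpu_rpeak):
    """HPL Efficiency = rMax / (CPUs rPeak + GPUs rPeak), all in the same unit."""
    return rmax / (cpu_rpeak + gpu_rpeak)


def scaling_efficiency(speedup, n_devices):
    """Fraction of ideal linear scaling achieved on n_devices."""
    return speedup / n_devices


# From the text: 16 P100 GPUs are 14.9x faster than 1 P100.
efficiency_16_gpus = scaling_efficiency(14.9, 16)

# Placeholder numbers (TFLOPS), not measured values.
example_efficiency = hpl_efficiency(rmax=8.0, cpu_rpeak=1.0, gpu_rpeak=9.0)
```

With the numbers from the text, 14.9 / 16 corresponds to roughly 93% of ideal linear scaling.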
Figure 1: HPL performance on P100-PCIe
NAMD (for NAnoscale Molecular Dynamics) is a molecular dynamics application designed for high-performance simulation of large biomolecular systems. The dataset we used is Satellite Tobacco Mosaic Virus (STMV), a small, icosahedral plant virus that worsens the symptoms of infection by Tobacco Mosaic Virus (TMV). This dataset has 1,066,628 atoms and is the largest dataset on the NAMD utilities website. The performance metric in the output log of this application is “days/ns” (the lower the better), but its inverse, “ns/day”, is used in our plots since that is what most molecular dynamics users focus on. The average of all occurrences of this value in the output log was used. Figure 2 shows the performance within 1 node. It can be seen that the performance with 2 P100 GPUs is better than with 4 P100 GPUs. This is probably because of the communication among different CPU threads. The application launches a set of worker threads that handle the computation and communication threads that handle the data communication. As more GPUs are used, more communication threads are used and more synchronization is needed. In addition, based on profiling results from nvprof, NVIDIA’s CUDA profiler, with 1 P100 the GPU computation takes less than 50% of the whole application time. According to Amdahl’s law, the speedup with more GPUs is limited by the other 50% of the work, which is not parallelized on the GPU. Based on this observation, we further ran this application on multiple nodes with two different settings (2 GPUs/node and 4 GPUs/node), and the result is shown in Figure 3. The result shows that no matter how many nodes are used, the performance with 2 GPUs/node is always better than with 4 GPUs/node. Within a node, 2 P100 GPUs are 9.5x faster than dual CPUs.
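Two pieces of arithmetic in the paragraph above can be sketched in Python: converting the logged “days/ns” values into “ns/day”, and the Amdahl’s law bound when about half of the run time is not accelerated. The sketch assumes the days/ns samples are averaged first and then inverted, which is one reading of the text, and it treats the GPU portion as exactly 50%, whereas the text says less than 50%. The function names are hypothetical.

```python
def ns_per_day(days_per_ns_samples):
    """Average the logged days/ns values, then invert to get ns/day."""
    if not days_per_ns_samples:
        raise ValueError("need at least one sample")
    mean_days_per_ns = sum(days_per_ns_samples) / len(days_per_ns_samples)
    return 1.0 / mean_days_per_ns


def amdahl_speedup(parallel_fraction, n_units):
    """Amdahl's law: speedup when only parallel_fraction scales with n_units."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_units)


# With a 50% accelerated portion, going from 1 to 2 or 4 GPUs helps little.
speedup_2_gpus = amdahl_speedup(0.5, 2)
speedup_4_gpus = amdahl_speedup(0.5, 4)
```

Under this model the speedup can never exceed 2x regardless of GPU count, so the added communication and synchronization cost of 4 GPUs can easily outweigh the remaining gain.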
Figure 2: NAMD Performance within 1 P100-PCIe node
Figure 3: NAMD Performance across Nodes
GROMACS (for GROningen MAchine for Chemical Simulations) primarily does simulations for biochemical molecules (bonded interactions). But because of its efficiency in calculating non-bonded interactions (atoms not linked by covalent bonds), the user base is expanding to non-biological systems. Figure 4 shows the performance of GROMACS on CPU, K80 GPUs and P100-PCIe GPUs. Since one K80 has two internal GPUs, from now on one K80 always refers to both internal GPUs rather than one of the two. For the K80 tests, the same servers used for the P100-PCIe tests were used. Therefore, the CPUs and memory were kept the same and the only difference is that the P100-PCIe GPUs were replaced with K80 GPUs. In all tests, there were four GPUs per server and all GPUs were utilized. For example, the 3 node data point is with 3 servers and 12 total GPUs. P100-PCIe is 4.2x faster than CPU on 1 node, decreasing to 2.8x on 4 nodes, and it is 1.5x faster than the K80 on 1 node, decreasing to 1.1x on 4 nodes.
Figure 4: GROMACS Performance on P100-PCIe
LAMMPS (for Large Scale Atomic/Molecular Massively Parallel Simulator) is a classic molecular dynamics code, capable of simulations for solid-state materials (metals, semi-conductors), soft matter (biomolecules, polymers) and coarse-grained or mesoscopic systems. It can be used to model atoms or more generically as a parallel particle simulator at the atomic, meso or continuum scale. The dataset we used was LJ (Lennard-Jones liquid benchmark) which contains 512000 atoms. There are two GPU implementations in LAMMPS: GPU library version and kokkos version. In the experiment, we used kokkos version since it was much faster than the GPU library version.
Figure 5 shows LAMMPS performance on CPU and P100-PCIe GPUs. Using 16 P100 GPUs is 5.8x faster than using 1 P100. The application did not scale linearly because the data transfer time (CPU->GPU, GPU->CPU and GPU->GPU) increases as more GPUs are used, even though the computation time decreases linearly. The data transfer time increases because this application requires data communication among all the GPUs used. However, the configuration G we used only allows Peer-to-Peer (P2P) access within two pairs of GPUs: GPU 1 - GPU 2 and GPU 3 - GPU 4. GPU 1/2 cannot communicate with GPU 3/4 directly. If such communication is needed, the data must go through the CPU, which slows the communication. Configuration B eases this issue because it allows P2P access among all four GPUs within a node. The comparison between configuration G and configuration B is shown in Figure 6. Running LAMMPS on a configuration B server with 4 P100 GPUs improved the performance metric “timesteps/s” to 510, compared to 505 in configuration G, a 1% improvement. The improvement is small because data communication takes less than 8% of the whole application time when running on configuration G with 4 P100 GPUs. Figure 7 compares the performance of P100-PCIe with that of CPU and K80 GPUs for this application. Within 1 node, 4 P100-PCIe GPUs are 6.6x faster than 2 E5-2690 v4 CPUs and 1.4x faster than 4 K80 GPUs.
Figure 5: LAMMPS Performance on P100-PCIe
Figure 6 : Comparison between Configuration G and Configuration B
Figure 7: LAMMPS Performance Comparison
HOOMD-blue (for Highly Optimized Object-oriented Many-particle Dynamics - blue) is a general purpose molecular dynamics simulator. Figure 8 shows the HOOMD-blue performance. Note that the y-axis is in logarithmic scale. It is observed that 1 P100 is 13.4x faster than dual CPUs. Using 2 P100 GPUs gives a 1.5x speedup over 1 P100, which is reasonable. However, with 4 to 16 P100 GPUs, the speedup over 1 P100 ranges only from 2.1x to 3.9x. The reason is that, like LAMMPS, this application involves a lot of communication among all the GPUs used. Based on the LAMMPS analysis, using configuration B should reduce this communication bottleneck significantly. To verify this, we ran the same application again on a configuration B server. With 4 P100 GPUs, the performance metric “hours for 10e6 steps” was reduced to 10.2 from 11.73 in configuration G, a 13% performance improvement, and the speedup compared to 1 P100 improved from 2.1x to 2.4x.
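The percentage improvements quoted for configuration B depend on whether the metric is higher-is-better (LAMMPS “timesteps/s”) or lower-is-better (HOOMD-blue “hours for 10e6 steps”). The sketch below reproduces both figures from the numbers in the text. The function names are our own, and the choice of denominator (the configuration G value in both cases) is an assumption that happens to match the reported 1% and 13%.

```python
def rate_improvement(old_rate, new_rate):
    """Relative gain for a higher-is-better metric such as timesteps/s."""
    return (new_rate - old_rate) / old_rate


def time_improvement(old_time, new_time):
    """Relative gain for a lower-is-better metric such as hours per run."""
    return (old_time - new_time) / old_time


# LAMMPS, 4 P100: 505 timesteps/s (config G) vs 510 timesteps/s (config B).
lammps_gain = rate_improvement(505.0, 510.0)

# HOOMD-blue, 4 P100: 11.73 hours (config G) vs 10.2 hours (config B).
hoomd_gain = time_improvement(11.73, 10.2)

# The HOOMD-blue speedup over 1 P100 scales with the run-time ratio.
hoomd_speedup_config_b = 2.1 * 11.73 / 10.2
```

The last line is a consistency check: scaling the 2.1x speedup by the run-time ratio gives about 2.4x, in line with the value reported for configuration B.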
Figure 8: HOOMD-blue Performance on CPU and P100-PCIe
Amber is the collective name for a suite of programs that allow users to carry out molecular dynamics simulations, particularly on biomolecules. The term Amber is also used to refer to the empirical force fields that are implemented in this suite. Figure 9 shows the performance of Amber on CPU and P100-PCIe. It can be seen that 1 P100 is 6.3x faster than dual CPUs. Using 2 P100 GPUs is 1.2x faster than using 1 P100. However, the performance drops significantly when 4 or more GPUs are used. The reason is that, like LAMMPS and HOOMD-blue, this application relies heavily on P2P access, but configuration G only supports it within two pairs of GPUs. We verified this by testing the application again on a configuration B node. The performance with 4 P100 GPUs improved to 791 ns/day, compared to 315 ns/day in configuration G, a 151% performance improvement and a speedup of 2.5x. But even in configuration B, the multi-GPU scaling is still not good enough. This is because when the Amber multi-GPU support was originally designed, the PCI-E bus speed was gen 2 x 16 and the GPUs were C1060 or C2050s. The current Pascal generation GPUs are > 16x faster than the C1060s, while the PCI-E bus speed has only increased by 2x (PCI Gen2 x 16 to PCI Gen3 x 16) and Infiniband interconnects by about the same amount. The Amber website explicitly states that “It should be noted that while the legacy MPI and GPU-Direct methods of multi-GPU communication are still supported, and will be used by the code automatically if peer to peer communication is not available, you are very unlikely to see any speedup by using multiple GPUs for a single job if the GPUs are newer than C2050s. Multi-node runs are almost impossible to get to scale.” This is consistent with our multi-node results: Figure 9 clearly shows that the more nodes are used, the worse the performance is.
Figure 9: Amber Performance on CPU and P100-PCIe
ANSYS® Mechanical™ software is a comprehensive finite element analysis (FEA) tool for structural analysis, including linear, nonlinear dynamic, hydrodynamic and explicit studies. It provides a complete set of element behavior, material models and equation solvers for a wide range of mechanical design problems. The finite element method is used to solve the partial differential equations which is a compute and memory intensive task. Our testing focused on the Power Supply Module (V17cg-1) benchmark. This is a medium sized job for iterative solvers and a good test for memory bandwidth. Figure 10 shows the performance of ANSYS Mechanical on CPU and P100-PCIe. It is shown that within a node, 4 P100 is 3.8x faster than dual CPUs. And with 4 nodes, 16 P100 is 2.3x faster than 8 CPUs. The figure also shows that the performance scales well with more nodes. The speedup with 4 nodes is 2.8x compared to 1 node.
Figure 10: ANSYS Mechanical Performance on CPU and P100-PCIe
RELION (for REgularised LIkelihood OptimisatioN) is a program that employs an empirical Bayesian approach to refinement of (multiple) 3D reconstructions or 2D class averages in electron cryo-microscopy (cryo-EM). Figure 11 shows the performance of RELION on CPU and P100-PCIe. Note that the y-axis is in logarithmic scale. It demonstrates that 1 P100 is 8.8x faster than dual CPUs. From the figure we also notice that it does not scale well starting from 4 P100 GPUs. Because of the long execution time, we did not profile this application, but it is possible that the reason for the weak scaling is similar to that for LAMMPS, HOOMD-blue and Amber.
Figure 11: RELION Performance on CPU and P100-PCIe
In this blog, we presented and analyzed the performance of different applications on Dell PowerEdge C4130 servers with P100-PCIe GPUs. Among the tested applications, HPL, GROMACS and ANSYS Mechanical benefit from the balanced CPU-GPU layout of configuration G, because they do not require P2P access among GPUs. However, LAMMPS, HOOMD-blue, Amber (and possibly RELION) rely on P2P access. Therefore, with configuration G, they scale well up to 2 P100 GPUs, then scale weakly with 4 or more P100 GPUs. With configuration B, they scale better than with configuration G on 4 GPUs, so configuration B is more suitable and is recommended for applications implemented with P2P access.
In the future work, we will run these applications on P100-SXM2 and compare the performance difference between P100-PCIe and P100-SXM2.
By Ashish Kumar Singh. January 2017 (HPC Innovation Lab)
This blog presents an in-depth analysis of the High Performance Conjugate Gradient (HPCG) benchmark on the Intel Xeon Phi processor, which is based on Intel Xeon Phi architecture codenamed “Knights Landing”. The analysis has been performed on PowerEdge C6320p platform with the new Intel Xeon Phi 7230 processor.
Introduction to HPCG and Intel Xeon Phi 7230 processor
The HPCG benchmark constructs a logically global, physically distributed sparse linear system using a 27-point stencil at each grid point in a 3D domain, such that the equation at point (i, j, k) depends on its own value and the values of its 26 surrounding neighbors. The global domain computed by the benchmark is (NRx*Nx) x (NRy*Ny) x (NRz*Nz), where Nx, Ny and Nz are the dimensions of the local subgrid assigned to each MPI process, and the number of MPI ranks is NR = NRx x NRy x NRz. These values can be defined in the hpcg.dat file or passed as command line arguments.
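The relationship between the local subgrid, the MPI rank grid and the global domain can be sketched in a few lines of Python. The function names are illustrative, and the 2 x 2 x 1 rank grid in the example is a hypothetical decomposition, not one reported by the benchmark.

```python
def global_domain(local_dims, rank_grid):
    """Global grid = (NRx*Nx, NRy*Ny, NRz*Nz) for local dims and rank grid."""
    nx, ny, nz = local_dims
    nrx, nry, nrz = rank_grid
    return (nrx * nx, nry * ny, nrz * nz)


def rank_count(rank_grid):
    """Total number of MPI ranks, NR = NRx * NRy * NRz."""
    nrx, nry, nrz = rank_grid
    return nrx * nry * nrz


def stencil_neighbors(width=3):
    """Neighbors of an interior point in a width^3 stencil (26 for width 3)."""
    return width ** 3 - 1


# Hypothetical example: a 160^3 local grid on a 2 x 2 x 1 rank grid.
example_global = global_domain((160, 160, 160), (2, 2, 1))
```

Each rank keeps the same local problem size, so adding ranks grows the global problem rather than shrinking the work per rank.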
The HPCG benchmark is based on a conjugate gradient solver, where the pre-conditioner is a three-level hierarchical multi-grid (MG) method with Gauss-Seidel. The algorithm starts with MG and contains Symmetric Gauss-Seidel (SymGS) and Sparse Matrix-vector multiplication (SPMV) routines for each level. Because the data is distributed across nodes, both SymGS and SPMV require data from their neighbors, which is provided by the preceding Exchange Halos routine. The residual, which should be lower than 10^-6, is computed locally by Dot Product (DDOT), and an MPI_Allreduce follows the DDOT to complete the global operation. WAXPBY simply updates a vector with the sum of two scaled vectors: it scales each input vector by a constant and adds the values at the same index. So HPCG has four computational blocks (SPMV, SymGS, WAXPBY and DDOT) and two communication blocks (MPI_Allreduce and Halos Exchange).
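The two simplest kernels described above can be written out directly. This is a plain Python sketch of the arithmetic only; it is not the HPCG reference implementation, which is optimized and distributed, and the function signatures are our own.

```python
def waxpby(alpha, x, beta, y):
    """Scaled vector addition: w[i] = alpha * x[i] + beta * y[i]."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return [alpha * xi + beta * yi for xi, yi in zip(x, y)]


def ddot(x, y):
    """Local dot product; a distributed run would sum these across ranks."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(xi * yi for xi, yi in zip(x, y))


w = waxpby(2.0, [1.0, 2.0], 3.0, [4.0, 5.0])
local_dot = ddot([1.0, 2.0], [3.0, 4.0])
```

In a multi-node run each rank computes only its local dot product, and the global value is obtained by the MPI_Allreduce that follows, as described in the paragraph above.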
The Intel Xeon Phi processor is a new generation of the Intel Xeon Phi family. Previous generations of Intel Xeon Phi were available as coprocessors in a PCI card form factor and required an Intel Xeon processor. The Intel Xeon Phi 7230 contains 64 cores at a core frequency of 1.3GHz with a turbo speed of 1.5GHz, and 32MB of L2 cache. It supports up to 384GB of DDR4-2400MHz memory and the AVX512 instruction set. The processor also includes 16GB of on-package MCDRAM memory with a sustained memory bandwidth of up to ~480GB/s as measured by the STREAM benchmark. The Intel Xeon Phi 7230 delivers up to ~1.8 TFLOPS of double precision HPL performance.
This blog showcases the performance of the HPCG benchmark on the Intel KNL processor and compares it to that on the Intel Broadwell E5-2697 v4 processor. The Intel Xeon Phi cluster comprises one PowerEdge R630 head node and 12 PowerEdge C6320p compute nodes, while the Intel Xeon processor cluster includes one PowerEdge R720 head node and 12 PowerEdge R630 compute nodes. All compute nodes are connected by 100Gb/s Intel Omni-Path. Each cluster shares the storage of its head node over NFS. Detailed information about the clusters is given below in Table 1. All HPCG tests on Intel Xeon Phi were performed with the BIOS set to “quadrant” cluster mode and “Memory” memory mode.
Table1: Cluster Hardware and software details
HPCG Performance analysis with Intel KNL
Choosing the right problem size for HPCG should follow two rules. The problem size should be large enough not to fit in the cache of the device, and it should occupy a significant fraction of main memory, at least 1/4 of the total. For HPCG performance characterization, we chose local domain dimensions of 128^3, 160^3, and 192^3 with an execution time of t=30 seconds. The local domain dimension defines the global domain dimension as (NRx*Nx) x (NRy*Ny) x (NRz*Nz), where, for example, Nx=Ny=Nz=160 for the 160^3 case and NRx*NRy*NRz is the number of MPI processes.
Figure 1: HPCG Performance comparison with multiple local dimension grid size
As shown in figure 1, the local dimension grid size of 160^3 gives the best performance of 48.83 GFLOPS. A problem size larger than 128^3 allows for more parallelism, and 160^3 still fits inside the MCDRAM while 192^3 does not. All these tests were carried out with 4 MPI processes and 32 OpenMP threads per MPI process on a single Intel KNL server.
Figure 2: HPCG performance comparison with multiple execution time.
Figure 2 shows HPCG performance with multiple execution times for a grid size of 160^3 on a single Intel KNL server. As per the graph, HPCG performance doesn’t change when the execution time changes, so execution time does not appear to be a factor in HPCG performance. This means we may not need to spend hours or days benchmarking large clusters, which saves both time and power. Note, however, that the official execution time should be >=1800 seconds, as reported in the output file. If you decide to submit your results to the TOP500 ranking list, the execution time must not be less than 1800 seconds.
Figure 3: Time consumed by HPCG computational routines.
Figure 3 shows the time consumed by each computational routine from 1 to 12 KNL nodes. The time spent in each routine is reported in the HPCG output file, as shown in figure 4. As per the graph, HPCG spends most of its time in the compute-intensive SymGS pre-conditioning and in sparse matrix-vector multiplication (SPMV). The vector update phase (WAXPBY) consumes far less time than SymGS, and the residual calculation (DDOT) takes the least time of all four computational routines. Because the local grid size is the same across all multi-node runs, the time spent in the four compute kernels is approximately the same for each multi-node run. The output file shown in figure 4 reports the performance of all four computational routines; in it, MG includes both SymGS and SPMV.
Figure 4: A slice of HPCG output file
Here is the HPCG multi-node performance comparison between the Intel Xeon E5-2697 v4 @2.3GHz (Broadwell processor) and the Intel KNL processor 7230, both with the Intel Omni-Path interconnect.
Figure 5: HPCG performance comparison between Intel Xeon Broadwell processor and Intel Xeon Phi processor
Figure 5 compares HPCG performance between servers with dual 18-core Intel Broadwell processors and servers with one 64-core Intel Xeon Phi processor. The dots in figure 5 show the performance gain of KNL servers over dual-socket Broadwell servers. On a single node, HPCG performs 2.23x better on KNL than on Broadwell. In multi-node runs, HPCG also shows more than a 100% performance increase on KNL over Broadwell nodes. With 12 Intel KNL nodes, HPCG scales out well and reaches ~520 GFLOPS.
Overall, HPCG shows ~2x higher performance with the Intel KNL processor on the PowerEdge C6320p than on the Intel Broadwell processor server. HPCG performance also scales out well with more nodes. So the PowerEdge C6320p platform is a strong choice for HPC applications like HPCG.
By Garima Kochhar. HPC Innovation Lab. January 2016.
The Intel Xeon Phi bootable processor (architecture codenamed “Knights Landing” – KNL) is ready for prime time. The HPC Innovation Lab has had access to a few engineering test units, and this blog presents the results of our initial benchmarking study. [We also published our results with Cryo-EM workloads on these systems, and that study is available here.]
The KNL processor is from the Intel Xeon Phi product line but is a bootable processor, i.e., the system does not need another processor in it to power on, just the KNL. Unlike the Xeon Phi coprocessors or the NVIDIA K80 and P100 GPU cards that are housed in a system that has a Xeon processor as well, the KNL is the only processor in the server. This necessitates a new server board design and the PowerEdge C6320p is the Dell EMC platform that supports the KNL line of processors. A C6320p server includes support for one KNL processor and six DDR4 memory DIMMs. The network choices include Mellanox InfiniBand EDR, Intel Omni-Path, or choices of add-in 10GbE Ethernet adapters. The platform has the other standard components you’d expect from the PowerEdge line including a 1GbE LOM, iDRAC and systems management capabilities. Further information on C6320p is available here.
The KNL processor models include 16GB of on-package memory called MCDRAM. The MCDRAM can be used in three modes: memory mode, cache mode or hybrid mode. In memory mode, the 16GB of MCDRAM is visible to the OS as addressable memory and must be addressed explicitly by the application. In cache mode, the MCDRAM is used as the last level cache of the processor. In hybrid mode, a portion of the MCDRAM is available as memory and the other portion is used as cache. The default setting is cache mode, as this is expected to benefit most applications. This setting is configurable in the server BIOS.
The architecture of the KNL processor allows the processor cores + cache and home agent directory + memory to be organized into different clustering modes. These modes are called all2all, quadrant, hemisphere, Sub-NUMA Clustering-2 and Sub-NUMA Clustering-4. They are described in this Intel article. The default setting in the Dell EMC BIOS is quadrant mode, and it can be changed in the BIOS. All tests below are with the quadrant mode.
The configuration of the systems used in this study is described in Table 1.
Table 1 - Test configuration
12 * Dell EMC PowerEdge C6320p
Intel Xeon Phi 7230. 64 cores @ 1.3 GHz, AVX base 1.1 GHz.
96 GB at 2400 MT/s [16 GB * 6 DIMMS]
Intel Omni-Path and Mellanox EDR
Red Hat Enterprise Linux 7.2
Intel 2017, 17.0.0.098 Build 20160721
Intel MPI 5.1.3
Intel XPPSL 1.4.1
The first check was to measure the memory bandwidth on the KNL system. To measure memory bandwidth to the MCDRAM, the system must be in “memory” mode. A snippet of the OS view when the system is in quadrant + memory mode is in Figure 1.
Note that the system presents two NUMA nodes. One NUMA node contains all the cores (64 cores * 4 logical siblings per physical core) and the 96 GB of DDR4 memory. The second NUMA node, node1, contains the 16GB of MCDRAM.
Figure 1 – NUMA layout in quadrant+memory mode
On this system, the dmidecode command shows six DDR4 memory DIMMs, and eight 2GB MCDRAM memory chips that make up the 16GB MCDRAM.
STREAM Triad results to the MCDRAM on the Intel Xeon Phi 7230 measured between 474-487 GB/s across 16 servers. The memory bandwidth to the DDR4 memory is between 83-85 GB/s. This is expected performance for this processor model. This link includes information on running stream on KNL.
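For readers unfamiliar with how a STREAM Triad figure is derived, the kernel and its bandwidth accounting can be sketched as follows. This is a plain Python illustration of the arithmetic, not the STREAM benchmark itself, and it will not approach the bandwidths reported above. It assumes 8-byte double precision elements and counts three arrays of traffic per element (two reads and one write), which is the usual Triad convention. The element count and time in the example are made up to show the calculation.

```python
def triad(b, c, scalar):
    """STREAM Triad kernel: a[i] = b[i] + scalar * c[i]."""
    return [bi + scalar * ci for bi, ci in zip(b, c)]


def triad_bandwidth_gbs(n_elements, seconds, bytes_per_element=8):
    """Bandwidth in GB/s, counting two reads and one write per element."""
    bytes_moved = 3 * n_elements * bytes_per_element
    return bytes_moved / seconds / 1e9


a = triad([1.0, 2.0], [3.0, 4.0], 2.0)

# Illustrative numbers only: 1e9 elements processed in 0.05 s.
example_bandwidth = triad_bandwidth_gbs(10**9, 0.05)
```

The arrays must be much larger than the caches for the result to reflect memory bandwidth rather than cache bandwidth.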
When the system has MCDRAM in cache mode, the STREAM binary used for DDR4 performance above reports memory bandwidth of 330-345 GB/s.
XPPSL includes a micprun utility that makes it easy to run this micro-benchmark on the MCDRAM. “micprun -k stream -p <num cores>” is the command to run a quick STREAM test, and this will pick the MCDRAM (NUMA node1) automatically if available.
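The bandwidth that STREAM reports for Triad follows from simple accounting: each iteration of a[i] = b[i] + q*c[i] reads two 8-byte doubles and writes one. The sketch below shows that accounting in Python. The function names, the array length and the timing value are illustrative assumptions, not measurements from this study, and a pure-Python loop is far too slow to stress MCDRAM.

```python
def triad(a, b, c, q):
    """STREAM Triad kernel: a[i] = b[i] + q * c[i], updated in place."""
    for i in range(len(a)):
        a[i] = b[i] + q * c[i]


def triad_bandwidth_gbs(n_elements, seconds, bytes_per_element=8):
    """Bandwidth in GB/s using the STREAM convention for Triad:
    two arrays read and one array written per iteration."""
    bytes_moved = 3 * bytes_per_element * n_elements
    return bytes_moved / seconds / 1e9


# Tiny correctness check of the kernel itself.
a, b, c = [0.0] * 4, [1.0] * 4, [2.0] * 4
triad(a, b, c, 3.0)
print(a)  # every element is 1.0 + 3.0 * 2.0

# Hypothetical timing (an assumption): one pass over 1e9-element arrays in 0.05 s.
print(round(triad_bandwidth_gbs(10**9, 0.05), 1), "GB/s")
```

A real measurement would use the compiled STREAM benchmark, with arrays sized well beyond the caches and bound to the NUMA node under test.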
The KNL processor architecture supports AVX512 instructions. With two vector units per core, each core can execute 32 DP floating point operations per cycle. For the same core count and processor speed, this doubles the floating point capability of KNL when compared to Xeon v4 or v3 processors (Broadwell or Haswell), which can execute only 16 DP FLOPs per cycle per core.
HPL performance on KNL is slightly better (up to 5%) with the MCDRAM in memory mode than in cache mode when using the HPL binary packaged with Intel MKL. Therefore, the tests below were run with the system in quadrant+memory mode.
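The 32 and 16 FLOPs/cycle figures follow from the vector width, the number of vector units per core, and a fused multiply-add (FMA) counting as two operations. A minimal sketch of that arithmetic, assuming FMA on both architectures and using the Xeon Phi 7230 core count and AVX base frequency from Table 1; the function names are illustrative:

```python
def dp_flops_per_cycle(vector_bits, vector_units, fma=True):
    """Double-precision FLOPs one core can retire per cycle."""
    lanes = vector_bits // 64          # 64-bit doubles per vector register
    ops_per_lane = 2 if fma else 1     # an FMA counts as multiply + add
    return lanes * vector_units * ops_per_lane


def peak_gflops(cores, ghz, flops_per_cycle):
    """Theoretical peak in GFLOP/s for one processor."""
    return cores * ghz * flops_per_cycle


knl = dp_flops_per_cycle(512, 2)    # AVX-512, two vector units per core
xeon = dp_flops_per_cycle(256, 2)   # AVX2, two FMA units per core
print(knl, xeon)

# Xeon Phi 7230: 64 cores at the 1.1 GHz AVX base frequency.
print(round(peak_gflops(64, 1.1, knl), 1), "GFLOP/s")
```

These are theoretical peaks; sustained rates depend on the frequency the cores actually hold under AVX-512 load.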
On our test systems, we measured 1.7 – 1.9 TFLOP/s HPL performance per server across 16 test servers. The micprun utility mentioned above is an easy way to run single server HPL tests. “micprun -k hplinpack -p <problem size>” is the command to run a quick HPL test. For cluster-level tests, however, the Intel MKL HPL binary is the better choice.
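One way to put the 1.7 – 1.9 TFLOP/s per-server figure in context is to compare it with the theoretical peak. The sketch below assumes the peak is computed at the 1.1 GHz AVX base frequency of the Xeon Phi 7230 with 32 DP FLOPs per cycle per core. That choice of reference frequency is an assumption, and a different clock would change the percentages.

```python
def hpl_efficiency(measured_tflops, cores=64, ghz=1.1, flops_per_cycle=32):
    """Fraction of theoretical peak reached by a measured HPL result."""
    peak_tflops = cores * ghz * flops_per_cycle / 1000.0
    return measured_tflops / peak_tflops


# Lower and upper ends of the per-server range reported in the text.
for measured in (1.7, 1.9):
    print(f"{measured} TFLOP/s -> {hpl_efficiency(measured):.1%} of peak")
```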
HPL cluster level performance over the Intel Omni-Path interconnect is plotted in Figure 2. These tests were run using the HPL binary that is packaged with Intel MKL. The results with InfiniBand EDR are similar.
Figure 2 - HPL performance over Intel Omni-Path
The KNL-based system is a good platform for highly parallel vector applications. The on-package MCDRAM helps balance the enhanced processing capability with additional memory bandwidth to the applications. KNL introduces the AVX512 instruction set which further improves the performance of vector operations. The PowerEdge C6320p provides a complete HPC server with multiple network choices, disk configurations and systems management.
This blog presents initial system benchmark results. Look for upcoming studies with HPCG and applications like NAMD and Quantum Espresso. We have already published our results with Cryo-EM workloads on KNL and that study is available here.
December 2016 – HPC Innovation Lab
Building a balanced cluster ecosystem and eliminating bottlenecks requires powerful, dense server node configurations that support parallel computing. The challenge is to provide maximum compute power with efficient I/O subsystem performance, including memory and networking. Advanced parallel algorithms in research, life sciences and financial applications are among the workloads, emerging and traditional, that demand this level of compute power.
Dell PowerEdge C6320p
The introduction of the Dell EMC C6320p platform, one of the densest platforms with one of the highest maximum core counts among HPC solution offerings, is a leap in this direction.
The PowerEdge C6320p platform is Dell EMC’s first self-bootable Intel Xeon Phi platform. Previous versions of Intel Xeon Phi were PCIe adapters that had to be plugged into a host system. The processor supports up to 72 cores, and each core has two vector processing units capable of AVX-512 instructions. This raises floating point throughput for code that can use the wider vectors, compared with Intel Xeon® v4 processors, which support up to AVX2 instructions. The Intel Xeon Phi in the Dell EMC C6320p also features 16GB of fast on-package MCDRAM. The MCDRAM benefits applications that are sensitive to memory bandwidth. This is in addition to the six channels of DDR4 memory hosted on the server. Being a single socket server, the C6320p provides a low power consumption compute node compared to traditional two socket nodes in HPC.
The following table shows platform differences as we compare the current Dell EMC PowerEdge C6320 and Dell EMC PowerEdge C6320p server offerings in HPC.
Server Form Factor
2U Chassis with four sleds
Intel ® Xeon Phi
Max cores in a sled
Up to 44 physical cores, 88 logical cores
(with two * Intel® Xeon E5-2699 v4: 2.2 GHz, 55MB, 22 cores, 145W)
Up to 72 physical cores, 288 logical cores
(with the Intel® Xeon Phi Processor 7290: 16GB, 1.5GHz, 72 cores, 245W)
Theoretical DP Flops per sled
16 DDR4 DIMM slots
6 DDR4 DIMM slots + on-package 16GB MCDRAM
MCDRAM BW (Memory mode)
~ 475-490 GB/s
~ 135 GB/s
Dual port 1Gb/10GbE
Single port 1GbE
Intel Omni-Path Fabric (100Gbps)
Mellanox Infiniband (100Gbps)
Intel Omni-Path Fabric (100Gbps)
On-board Mellanox Infiniband (100Gbps)
Up to 24 x 2.5” or 12 x 3.5” HD
6 x 2.5” HD per node +
Internal 1.8” SSD option for boot
Integrated Dell EMC Remote Access Controller
Dedicated and shared iDRAC8
Table 1: Comparing the C6320 and C6320p offering in HPC
Dell EMC Supported HPC Solution:
Dell EMC offers a complete, tested, verified and validated solution on the C6320p servers. It is based on Bright Cluster Manager 7.3 with RHEL 7.2 and includes specific, highly recommended kernel and security updates. It will also support the upcoming RHEL 7.3 operating system. The solution provides automated deployment, configuration, management and monitoring of the cluster. It also integrates recommended Intel performance tweaks, as well as the required software drivers and other development toolkits to support the Intel Xeon Phi programming model.
The solution provides the latest networking support for both InfiniBand and Intel Omni-Path Fabric. It also includes Dell EMC-supported System Management tools that are bundled to provide customers with the ease of cluster management on Dell EMC hardware.
*Note: As a continuation of this blog, follow-on micro-level benchmarking and application studies on the C6320p will be published.
Just before the kick-off of the opening gala for the SC16 international supercomputing conference, HPCwire unveiled the winners of the 2016 HPCwire Editors’ Choice Awards. Each year, this awards program recognizes the best and the brightest developments that have happened in high performance computing over the past 12 months. Selected by a panel of HPCwire editors and thought leaders in HPC, these awards are highly coveted as prestigious recognition of achievements by the HPC community.
Traditionally revealed and presented each year to kick off the Supercomputing Conference (SC16), which showcases high performance computing, networking, storage, and data analysis, the awards are an annual feature of the publication and spotlight outstanding breakthroughs and achievements in HPC.
Tom Tabor, CEO of Tabor Communications, the publisher of HPCwire, announced the list of winners in Salt Lake City, UT.
“From thought leaders to end users, the HPCwire readership reaches and engages every corner of the high performance computing community,” said Tabor. “Receiving their recognition signifies community support across the entire HPC space, as well as the breadth of industries it serves.”
Dell EMC was honored to be presented with two 2016 HPCwire Editors’ Choice Awards:
Best Use of High Performance Data Analytics: The Best Use of High Performance Data Analytics award was presented to UNC-Chapel Hill Institute of Marine Sciences (IMS) and Coastal Resilience Center of Excellence (CRC), Renaissance Computing Institute (RENCI), and Dell EMC. UNC-Chapel Hill IMS and CRC work with the Dell EMC-powered RENCI Hatteras Supercomputer to predict dangerous coastal storm surges, including Hurricane Matthew, a long-lived, powerful and deadly tropical cyclone which became the first Category 5 Atlantic hurricane since 2007.
Top 5 Vendors to Watch: Dell EMC was recognized by the 2016 HPCwire Editors’ Choice Awards panel, along with Fujitsu, HPE, IBM and NVIDIA, as one of the Top 5 Vendors to Watch in high performance computing. As the only true end-to-end solutions provider in the HPC market, Dell EMC is committed to serving customer needs. And with the combination of Dell, EMC and VMware, we are a leader in the technology of today, with the world’s greatest franchises in servers, storage, virtualization, cloud software and PCs. Looking forward, we will occupy a very strong position in the most strategic areas of technology of tomorrow: digital transformation, software defined data center, hybrid cloud, converged infrastructure, mobile and security.
To learn more about HPC at Dell EMC, join the Dell EMC HPC Community at www.Dellhpc.org, or visit us online at www.Dell.com/hpc and www.HPCatDell.com.
Twice each year, the TOP500 list ranks the 500 most powerful general-purpose computer systems known. In the present list, released at the SC16 conference in Salt Lake City, UT, computers in common use for high-end applications are ranked by their performance on the LINPACK Benchmark. Sixteen of these world-class systems are powered by Dell EMC. Collectively, these customers are accomplishing amazing results, continually innovating and breaking new ground to solve the biggest, most important challenges of today and tomorrow while also making major contributions to the advancement of HPC.
Here are just a few examples:
Dell EMC has partnered with Scientific Computing, publisher of HPC Source, and NVIDIA to produce an exclusive high performance computing supplement that takes a look at some of today’s cool new HPC technologies, as well as some of the work being done to extend HPC capabilities and opportunities.
This special publication, “New Technologies in HPC,” highlights topics such as innovative technologies in HPC and the impact they are having on the industry, HPC trends to watch, and advancing science with AI. It also looks at how organizations are extending supercomputing with cloud, machine learning technologies for the modern data center, and getting started with deep learning.
This digital supplement can be viewed on-screen or downloaded as a PDF.
Taking our dive into new HPC technologies a bit deeper — we also brought together technology experts Paul Teich, Principal Analyst at TIRIAS Research, and Will Ramey, Senior Product Manager for GPU Computing at NVIDIA, for a live, interactive discussion with contributing editor Tim Studt: “Accelerate Your Big Data Strategy with Deep Learning.”
Paul and Will share their unique perspectives on where artificial intelligence is leading the next wave of industry transformation, helping companies go from data deluge to data-hungry. They provide insights on how organizations can accelerate their big data strategies with deep learning, the fastest growing field in AI. They also discuss how data-driven algorithms powered by GPU accelerators can help companies get faster insights, see dynamic correlations, and achieve actionable knowledge about their business.
For those who couldn't make the live broadcast, it is available for on-demand viewing.
Dell China has been honored with an “Innovation Award of Artificial Intelligence in Technology & Practice” in recognition of Dell’s collaboration with the Institute of Automation, Chinese Academy of Sciences (CASIA) in establishing the Artificial Intelligence and Advanced Computing Joint-Lab. The advanced computing platform was jointly unveiled by Dell China and CASIA in November 2015, and the AI award was presented by the Technical Committee of High Performance Computing (TCHPC), China Computer Federation (CCF), at the China HPC 2016 conference in Xi’an, Shaanxi Province, China, on October 27, 2016. About a half dozen additional awards were presented at HPC China, an annual national conference on high performance computing organized by TCHPC. However, Dell China was the only vendor to receive an award in the emerging field of artificial intelligence in HPC.
The Artificial Intelligence and Advanced Computing Joint-Lab’s focus is on research and applications of new computing architectures in brain information processing and artificial intelligence, including cognitive function simulation, deep learning, brain computer simulation, and related new computing systems. The lab also supports innovation and development of brain science and intellect technology research, promoting Chinese innovation and breakthroughs at the forefront of science, and working to produce and industrialize these core technologies in accordance with market and industry development needs.
CASIA, a leading AI research organization in China, has huge requirements for computing and storage, and the new advanced computing platform — designed and set up by engineers and professors from Dell and CASIA — is just the tip of the iceberg with respect to CASIA’s research requirements. It features leading Dell HPC systems components designed by the Dell USA team, including servers, storage, networking and software, as well as leading global HPC partner products, including Intel CPU, NVIDIA GPU, Mellanox IB Network and Bright Computing software. The Dell China Services team implemented installation and deployment of the system, which was completed in February 2016.
The November 3, 2015, unveiling ceremony for the Artificial Intelligence and Advanced Computing Joint-Lab was held in Beijing. Marius Haas, Chief Commercial Officer and President, Enterprise Solutions of Dell; Dr. Chenhong Huang, President of Dell Greater China; and Xu Bo, Director of CASIA attended the ceremony and addressed the audience.
“As a provider of end-to-end solutions and services, Dell has been focusing on and promoting the development of frontier science and technologies, and applying the latest technologies to its solutions and services to help customers achieve business transformation and meet their ever-changing demands,” Haas said at the unveiling. “We’re glad to cooperate with CASIA in artificial intelligence, which once again shows Dell’s commitment to China’s market and will drive innovation in China’s scientific research.”
“Dell is well-positioned to provide innovative end-to-end solutions. Under the new 4.0 strategy of ‘In China, For China’, we will strengthen the cooperation with Chinese research institutes and advance the development of frontier technologies,” Huang explained. “Dell’s cooperation with CASIA represents a combination of computing and scientific research resources, which demonstrates a major trend in artificial intelligence and industrial development.”
China is a role model for emerging market development and practice sharing for other emerging countries. Partnering with CASIA and other strategic partners is Dell’s way of embracing the “Internet+” national strategy, promoting Chinese innovation and breakthroughs at the forefront of science.
“China’s strategy in innovation-driven development raises the bar for scientific research institutes. The fast development of information technologies in recent years also brings unprecedented challenges to CASIA,” added Bo. “CASIA always has intelligence technologies in mind as their main focus of strategic development. The cooperation with Dell China on the lab will further the computing advantages of the Institute of Automation, strengthen the integration between scientific research and industries, and advance artificial intelligence innovation.”
Dell China is looking forward to continued cooperation with CASIA in driving artificial intelligence across many more fields, such as meteorology, biology and medical research, transportation, and manufacturing.
By Garima Kochhar and Kihoon Yoon. Dell EMC HPC Innovation Lab. October 2016
This blog presents performance results for the 2D alignment and 2D classification phases of the Cryo-electron microscopy (Cryo-EM) data processing workflow using the new Intel Knights Landing architecture, and compares these results to the performance of the Intel Xeon E5-2600 v4 family. A quick description of Cryo-EM and the different phases in the process of reconstructing 3D molecular structures with electron microscopy is provided below, followed by the specific tests conducted in this study and the performance results.
Cryo-EM allows molecular samples to be studied in near-native states and down to nearly atomic resolutions. Studying the 3D structure of these biological specimens can lead to new insights into their functioning and interactions, especially with proteins and nucleic acids, and allows structural biologists to examine how alterations in their structures affect their functions. This information can be used in system biology research to understand the cell signaling network which is part of a complex communication system. This communication system controls fundamental cell activities and actions to maintain normal cell homeostasis. Errors in the cellular signaling process can lead to diseases such as cancer, autoimmune disorders, and diabetes. Studying the functioning of the proteins responsible for an illness enables a biologist to develop specific drugs that can interact with the protein effectively, thus improving the efficacy of treatment.
The workflow from the time a molecular sample is created to the creation of a 3D model of its molecular structure involves multiple steps. These steps are briefly (and simplistically!) described below.
As is now clear, the Cryo-EM processing workflow must comprehend a lot of data, requires rich compute algorithms and considerable compute power for the 2D and 3D phases, and must move data efficiently across the multiple phases in the workflow. Our goal is to design a complete HPC system that can support the Cryo-EM workflow from start to finish and is optimized for performance, energy efficiency and data efficiency.
Performance Tests and Configuration
Focusing for now on the 2D phases of the workflow, this blog presents results for steps #7 and #8 listed above: the 2D alignment and 2D classification phases. Two software packages in this domain, ROME and RELION, were benchmarked on the Knights Landing (KNL, code name for the Intel Xeon Phi 7200 family) and Broadwell (BDW, code name for the Intel Xeon E5-2600 v4 family) processors.
The tests were run on systems with the following configuration.
12 * Dell PowerEdge C6320
Intel Xeon E5-2697 v4. 18 cores per socket, 2.3 GHz
128 GB at 2400 MT/s
Intel Omni-Path fabric
12 * Dell PowerEdge C6320p
Intel Xeon Phi 7230. 64 cores, 1.3 GHz
96 GB at 2400 MT/s
Set1. Inflammasome data: 16306 images of NLRC4/NAIP2 inflammasome with a size of 250 x 250 pixels
Set4. RP-a: 57001 images of proteasome regulatory particles (RP) with a size of 160 x 160 pixels
Set2. RP-b: 35407 images of proteasome regulatory particles (RP) with a size of 160 x 160 pixels
ROME performs the 2D alignment (step #7 above) and the 2D classification (step #8 above) in two separate phases called the MAP phase and the SML phase respectively. For our tests we used “-k” for MAP equal to 50 (i.e. 50 initial classes) and “-k” for SML equal to 1000 (i.e. 1000 final 2D classes).
The first set of graphs below, Figure 1 and Figure 2, shows the performance of the SML phase on KNL. The compute portion of the SML phase scales linearly as more KNL systems are added to the test bed, from 1 to 12 servers, as shown in Figure 1. The total time to run, shown in Figure 2, scales slightly below linear because it includes an I/O component as well as the compute component. The test bed used in this study did not have a parallel file system and used just local disks on the KNL servers. Future work for this project includes evaluating the impact of adding a Lustre parallel file system to this test bed and its effect on total time for SML.
Figure 1 - ROME SML scaling on KNL, compute time
Figure 2 - ROME SML scaling on KNL, total time
The next set of graphs compares the ROME SML performance on KNL and Broadwell. Figure 3, Figure 4 and Figure 5 plot the compute time for SML on 1 to 12 servers. The black circle on each graph shows the improvement in KNL runtime when compared to BDW. For all three datasets that were benchmarked, KNL is about 3x faster than BDW. Note that we are comparing one single-socket KNL server to one dual-socket Broadwell server, so this is a server to server comparison (not socket to socket). The 3x advantage holds irrespective of the number of servers in the test, which shows that ROME SML scales well over Omni-Path on both KNL and BDW.
Considering total time to run on KNL versus BDW, we measured KNL to be 2.4x to 3.4x faster than BDW at all node counts. Specifically, DATA6 is ~2.4x faster on KNL, DATA8 is 3x faster on KNL and RING11_ALL is 3.4x faster on KNL when considering total time to run. As mentioned before, the total time includes an I/O component, and one of the next steps in this study is to evaluate the performance improvement from adding a parallel file system to the test bed.
Figure 3 - DATA8 ROME SML on KNL and BDW
Figure 4 - DATA6 ROME SML on KNL and BDW.
Figure 5 - RING11_ALL ROME SML on KNL and BDW
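The two ratios discussed above, KNL speedup over BDW at a fixed server count and scaling efficiency as servers are added, can be computed as follows. The timings in this sketch are hypothetical placeholders chosen only to exercise the functions; they are not results from this study.

```python
def speedup(t_baseline, t_new):
    """How many times faster the new run is than the baseline."""
    return t_baseline / t_new


def parallel_efficiency(t_one_server, t_n_servers, n_servers):
    """1.0 means perfectly linear scaling from 1 to n servers."""
    return t_one_server / (n_servers * t_n_servers)


# Hypothetical compute times in seconds, keyed by server count.
bdw_seconds = {1: 3600.0, 12: 300.0}
knl_seconds = {1: 1200.0, 12: 100.0}

for n in sorted(knl_seconds):
    ratio = speedup(bdw_seconds[n], knl_seconds[n])
    print(f"{n} server(s): KNL is {ratio:.1f}x faster than BDW")

eff = parallel_efficiency(knl_seconds[1], knl_seconds[12], 12)
print(f"KNL scaling efficiency at 12 servers: {eff:.0%}")
```

A constant speedup across server counts, together with efficiency near 1.0 on both platforms, is the pattern the compute-time graphs describe.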
RELION accomplishes the 2D alignment and classification steps mentioned above in one phase. Figure 6 shows our preliminary results for RELION on KNL across 12 servers and on two of the test datasets. The “--K” parameter for RELION was set to 300, i.e., 300 classes for 2D classification. Several things remain to be tried here: the impact of a parallel file system on RELION (as discussed for ROME earlier) and dataset sensitivity to the parallel file system. Additionally, we plan to benchmark RELION on Broadwell, across different node counts and with different input parameters.
Figure 6 - RELION 2D alignment and classification on KNL
The next steps in this project include adding a parallel file system to measure the impact on the workflow, tuning the test parameters for ROME MAP, SML and RELION, and testing on more datasets. We also plan to measure the power consumption of the cluster when running Cryo-EM workloads to analyze performance per watt and performance per dollar metrics for KNL.