High Performance Computing

A discussion venue for all things high performance computing (HPC), supercomputing, and the technologies that enable scientific research and discovery.
  • HPCG Performance study with Intel Skylake processors

Authors: Somanath Moharana and Ashish Kumar Singh, Dell EMC HPC Innovation Lab, September 2017

This blog presents an analysis of the High Performance Conjugate Gradient (HPCG) benchmark on the Intel(R) Xeon(R) Gold 6150 CPU, codename “Skylake”. It also compares the performance of the Intel(R) Xeon(R) Gold 6150 processor with its previous-generation counterpart, the Intel(R) Xeon(R) CPU E5-2697 v4, codename “Broadwell-EP”.

    Introduction to HPCG

The High Performance Conjugate Gradients (HPCG) benchmark is a metric for ranking HPC systems and can be considered a complement to the High Performance LINPACK (HPL) benchmark. HPCG is designed to exercise computational and data access patterns that more closely match a broad set of important applications, so that improvements on the benchmark translate into better collective performance for those applications.

    The HPCG benchmark is based on a 3D regular 27-point discretization of an elliptic partial differential equation. The 3D domain is scaled to fill a 3D virtual process grid for all of the available MPI ranks. The preconditioned conjugate gradient (CG) algorithm is used to solve the intermediate systems of equations and incorporates a local and symmetric Gauss-Seidel pre-conditioning step that requires a triangular forward solve and a backward solve. The benchmark exhibits irregular accesses to memory and fine-grain recursive computations.


HPCG has four computational blocks: sparse matrix-vector multiplication (SPMV), symmetric Gauss-Seidel (SymGS), the vector update phase (WAXPBY) and the dot product (DDOT), plus two communication blocks: MPI_Allreduce and halo exchange.


    Introduction to Intel Skylake processor


Intel Skylake is a microarchitecture redesign on the same 14 nm manufacturing process, supporting up to 28 cores per socket and serving as a "tock" in Intel's "tick-tock" manufacturing and design model. It supports six DDR4 memory channels per socket with 2 DIMMs per channel (DPC), with memory speeds of up to 2666 MT/s.

Please visit the BIOS characteristics of Skylake processors blog for a better understanding of Skylake processors and their BIOS features on Dell EMC platforms.

Table 1: Details of servers used for HPCG analysis

| | PowerEdge C6420 | PowerEdge R630 |
| --- | --- | --- |
| Processors | 2 x Intel(R) Xeon(R) Gold 6150 @ 2.7 GHz, 18 cores | 2 x Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.3 GHz, 18 cores |
| Memory | 192 GB (12 x 16 GB) DDR4 | 128 GB (8 x 16 GB) DDR4 |
| Interconnect | Intel Omni-Path | Intel Omni-Path |
| Operating System | Red Hat Enterprise Linux Server release 7.3 | Red Hat Enterprise Linux Server release 7.2 |
| Intel Compiler | version 2017.0.4.196 | version 2017.0.098 |
| Intel MKL | 2017.0.3 | 2017.0.0 |
| Processor Settings > Logical Processors | | |
| Processor Settings > Sub NUMA Cluster | | |
| System Profiles | | |
| HPCG | Version 3.0 | Version 3.0 |

    HPCG Performance analysis with Intel Skylake

In HPCG, the problem size must be set appropriately to get the best results. For a valid run, the problem size should be large enough that the arrays accessed in the CG iteration loop do not fit in the cache of the device. The problem size should occupy a significant fraction of main memory, at least 1/4 of the total.

Adjusting the local domain dimensions changes the global problem size. For HPCG performance characterization, we chose local domain dimensions of 160^3, 192^3 and 224^3 with an execution time of t=30 seconds. The local domain dimension defines the global domain dimension as (NR*Nx) x (NR*Ny) x (NR*Nz), where Nx=Ny=Nz=160, 192 or 224 and NR is the number of MPI processes used for the benchmark.
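As a rough sketch of this sizing (a simplification that, like the blog's formula, treats the rank count NR directly rather than HPCG's internal 3D factoring of the ranks), the global problem grows with both the local grid and the number of ranks:

```python
# Illustrative sketch (not from the blog): total grid points in the global
# HPCG problem when each MPI rank owns an nx^3 local subdomain.
def hpcg_global_points(nx, n_ranks):
    return n_ranks * nx ** 3

# The three local grids from this study, with the 4 MPI ranks used per node:
for nx in (160, 192, 224):
    print(nx, hpcg_global_points(nx, n_ranks=4))
```

A larger local grid or more ranks both enlarge the global sparse system, which is what pushes the working set out of cache as the validity rule requires.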

    Figure 1: HPCG Performance on multiple grid sizes with Intel Xeon Gold 6150 processors

As shown in Figure 1, the local grid size of 192^3 gives the best performance of the three sizes tested (160^3, 192^3 and 224^3). We measured 36.14 GFLOP/s on a single node, with a linear increase in performance as nodes are added. All of these tests were carried out with 4 MPI processes and 9 OpenMP threads per MPI process.

    Figure 2: Time consumed by HPCG computational routines Intel Xeon Gold 6150 processors

The time spent in each routine is reported in the HPCG output file, as shown in Figure 2. As the graph shows, HPCG spends most of its time in the compute-intensive SymGS preconditioning and the sparse matrix-vector multiplication (SPMV). The vector update phase (WAXPBY) consumes far less time than SymGS, and the residual calculation (DDOT) takes the least time of the four computational routines. Because the local grid size is the same across all multi-node runs, the time spent in each of the four compute kernels is approximately the same for every multi-node run.

    Figure 3: HPCG performance over multiple generation of Intel processors

Figure 3 compares HPCG performance between Intel Broadwell-EP and Intel Skylake processors. The dots in the figure show the performance improvement of Intel Skylake over Broadwell-EP. We observe ~65% better performance with Skylake on a single node, and ~67% better performance on both two and four nodes.


HPCG with the Intel(R) Xeon(R) Gold 6150 processor shows ~65% higher performance than with Intel(R) Xeon(R) CPU E5-2697 v4 processors. HPCG also scales out well, showing a linear increase in performance as the number of nodes increases.




  • Performance study of four Socket PowerEdge R940 Server with Intel Skylake processors

Author: Somanath Moharana, Dell EMC HPC Innovation Lab, August 2017

This blog explores the performance of the four-socket Dell EMC PowerEdge R940 server with Intel Skylake processors. The latest Dell EMC 14th-generation servers support the new Intel® Xeon® Processor Scalable Family (processor architecture codenamed “Skylake”), whose increased core count and higher memory speed benefit a wide variety of HPC applications.

The PowerEdge R940 is Dell EMC’s latest 4-socket, 3U rack server designed to run complex workloads, supporting up to 6TB of DDR4 memory and up to 122TB of storage. The system features the Intel® Xeon® Scalable Processor Family, 48 DDR4 DIMMs, up to 13 PCI Express® (PCIe) 3.0 enabled expansion slots and a choice of embedded NIC technologies. It is a general-purpose platform capable of handling demanding workloads and applications, such as data warehouses, ecommerce, databases, and high performance computing (HPC). Its increased storage capacity also makes the PowerEdge R940 well-suited for data-intensive applications that require greater storage.

This blog also describes the impact of BIOS tuning options on HPL, STREAM and the scientific applications ANSYS Fluent and WRF, and compares the performance of the new PowerEdge R940 to the previous-generation PowerEdge R930 platform. It also analyses performance with the Sub NUMA Cluster (SNC) modes (SNC=Enabled and SNC=Disabled). Enabling SNC exposes eight NUMA nodes to the OS on a four-socket PowerEdge R940. Each NUMA node can communicate with seven other remote NUMA nodes: six in the other three sockets and one within the same socket. NUMA domains on different sockets communicate over the UPI interconnect. Please visit the BIOS characteristics of Skylake processors blog for more details on BIOS options. Table 1 lists the server configuration and the application details used for this study.

Table 1: Details of servers and HPC applications used for R940 analysis

| | PowerEdge R930 | PowerEdge R930 | PowerEdge R940 |
| --- | --- | --- | --- |
| Processors | 4 x Intel Xeon E7-8890 v3 @ 2.5 GHz (18 cores), 45 MB L3 cache, 165 W | 4 x Intel Xeon E7-8890 v4 @ 2.2 GHz (24 cores), 60 MB L3 cache, 165 W | 4 x Intel Xeon Platinum 8180 @ 2.5 GHz (28 cores), 10.4 GT/s (cross-bar connection) |
| Memory | 1024 GB = 64 x 16 GB DDR4 @ 1866 MHz | 1024 GB = 32 x 32 GB DDR4 @ 1866 MHz | 384 GB = 24 x 16 GB DDR4 @ 2666 MT/s |
| CPU Interconnect | Intel QuickPath Interconnect (QPI) 8 GT/s | Intel QuickPath Interconnect (QPI) 8 GT/s | Intel Ultra Path Interconnect (UPI) 10.4 GT/s |
| BIOS Version | 1.0.9 | 2.0.1 | 1.0.7 |
| Processor Settings > Logical Processors | | | |
| Processor Settings > UPI Speed | Maximum Data Rate | Maximum Data Rate | Maximum Data Rate |
| Processor Settings > Sub NUMA Cluster | | | Enabled, Disabled |
| System Profiles | PerfOptimized (Performance), PerfPerWattOptimizedDapc (DAPC) | PerfOptimized (Performance), PerfPerWattOptimizedDapc (DAPC) | PerfOptimized (Performance), PerfPerWattOptimizedDapc (DAPC) |
| Operating System | Red Hat Enterprise Linux Server release 6.6 | Red Hat Enterprise Linux Server release 7.2 | Red Hat Enterprise Linux Server release 7.3 (3.10.0-514.el7.x86_64) |
| Intel Compiler | Version 15.0.2 | Version 16.0.3 | Version 17.0.4 |
| Intel MKL | Version 11.2 | Version 11.3 | 2017 Update 3 |
| HPL | V2.1 from MKL 11.3 | V2.1 from MKL 11.3 | V2.1 from MKL 2017 Update 3 |
| STREAM | v5.10, Array Size 1800000000, Iterations 100 | v5.10, Array Size 1800000000, Iterations 100 | v5.4, Array Size 1800000000, Iterations 100 |
| WRF | V3.5.1, Input Data Conus12KM, Netcdf-4.3.1 | V3.8, Input Data Conus12KM, Netcdf-4.4.0 | v3.8.1, Input Data Conus12KM, Conus2.5KM, Netcdf-4.4.2 |
| ANSYS Fluent | v15, Input Data: truck_poly_14m | v16, Input Data: truck_poly_14m | v17.2, Input Data: truck_poly_14m, aircraft_wing_14m, ice_2m, combustor_12m, exhaust_system_33m |


Note: The software versions were different for the older-generation processors, and results are compared against the best configuration at that time. Given the large architectural changes in servers and processors generation over generation, the differences in software versions are not a significant factor.

The High Performance Linpack (HPL) benchmark is a measure of a system's floating-point computing power. It measures how fast a computer solves a dense n-by-n system of linear equations Ax = b, which is a common task in engineering. HPL was run with a block size of NB=384 and a problem size of N=217754. Since HPL is an AVX-512-enabled workload, the theoretical maximum performance is calculated as (rated base frequency of the processor * number of cores * 32 FLOPs per cycle).
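The formula above can be sketched as follows for the R940's four 28-core Xeon Platinum 8180 sockets (note that sustained AVX-512 clocks typically run below the rated base frequency, so the measured rMax lands below this figure):

```python
def hpl_rpeak_tflops(base_ghz, cores_per_socket, sockets, flops_per_cycle=32):
    # AVX-512: two FMA units x 8 DP lanes x 2 ops per FMA = 32 DP FLOP/cycle/core
    return base_ghz * cores_per_socket * sockets * flops_per_cycle / 1000.0

# 4 x Xeon Platinum 8180: 28 cores per socket at a 2.5 GHz rated base frequency
rpeak = hpl_rpeak_tflops(2.5, 28, 4)
print(f"{rpeak:.2f} TFLOPS")  # 8.96 TFLOPS
```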


                     Figure 1: Comparing HPL Performance across BIOS profiles                                        

Figure 1 depicts the performance of the PowerEdge R940 server described in Table 1 with different BIOS options. Here “Performance, SNC=Disabled” gives the best performance of the BIOS profiles tested. With “SNC=Disabled” we observe 1-2% better performance than with “SNC=Enabled” across all BIOS profiles.

    Figure 2: HPL performance with AVX2 and AVX512 instructions sets          Figure 3: HPL Performance over multiple generations of processors

Figure 2 compares the performance of HPL run with the AVX2 and AVX-512 instruction sets on the PowerEdge R940 (AVX = Advanced Vector Extensions). AVX-512 is a set of 512-bit extensions to the 256-bit AVX SIMD instructions for the x86 instruction set architecture. The instruction set used by Intel MKL can be selected by setting the “MKL_ENABLE_INSTRUCTIONS=AVX2/AVX512” environment variable. We observe that running HPL with the AVX-512 instruction set gives around 75% better performance than with the AVX2 instruction set.
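A minimal sketch of steering MKL's dispatch from Python (MKL_ENABLE_INSTRUCTIONS is a documented MKL environment variable; the `xhpl` binary name and the `mpirun` launch line are placeholders for this cluster's actual run command):

```python
import os

# Cap the highest instruction set MKL kernels may dispatch to before
# launching the MKL-linked HPL binary.
env_avx2 = dict(os.environ, MKL_ENABLE_INSTRUCTIONS="AVX2")
env_avx512 = dict(os.environ, MKL_ENABLE_INSTRUCTIONS="AVX512")

# e.g. subprocess.run(["mpirun", "-np", "112", "./xhpl"], env=env_avx512)
print(env_avx2["MKL_ENABLE_INSTRUCTIONS"], env_avx512["MKL_ENABLE_INSTRUCTIONS"])
```

Running the same binary under each environment isolates the instruction-set effect, which is how the ~75% AVX-512 gain above was measured.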

Figure 3 compares the results of the four-socket R930 with Haswell-EX and Broadwell-EX processors against the R940 with Skylake processors. For HPL, the R940 server performed ~192% better than the R930 with four Haswell-EX processors and ~99% better than the R930 with Broadwell-EX processors. The improvement of Skylake over Broadwell-EX comes from a 27% increase in the number of cores and the 75% performance gain from AVX-512 vector instructions.

The STREAM benchmark is a synthetic benchmark program that measures sustainable memory bandwidth and the corresponding computation rate for simple vector kernels. STREAM calculates memory bandwidth by counting only the bytes that the user program requested to be loaded or stored. This study uses the results reported by the TRIAD function of the STREAM bandwidth test.
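As a toy illustration of the TRIAD kernel and STREAM's byte-counting convention (real STREAM is compiled C/Fortran with carefully sized, cache-exceeding arrays, so the number this prints is not comparable to the blog's results):

```python
import time

import numpy as np

n = 10_000_000
b = np.full(n, 2.0)
c = np.full(n, 1.0)
scalar = 3.0

t0 = time.perf_counter()
a = b + scalar * c          # TRIAD: a[i] = b[i] + scalar * c[i]
elapsed = time.perf_counter() - t0

# STREAM counts only the bytes the program asked to move:
# 3 arrays x 8 bytes per double per element.
gbs = 3 * 8 * n / elapsed / 1e9
print(f"approximate triad bandwidth: {gbs:.1f} GB/s")
```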

    Figure 4: STREAM Performance across BIOS profiles                                Figure 5: STREAM Performance over multiple generations of processors

As shown in Figure 4, with “SNC=Enabled” we obtain up to 3% better bandwidth than with “SNC=Disabled” across all BIOS profiles. Figure 5 compares the memory bandwidth of the PowerEdge R930 server with Haswell-EX and Broadwell-EX processors against the PowerEdge R940 server with Skylake processors. Haswell-EX and Broadwell-EX support DDR3 and DDR4 memory respectively, but in this configuration the platform runs memory at 1600 MT/s for both generations of processors. Because DIMMs of the same memory frequency were used for both generations on the PowerEdge R930, Broadwell-EX and Haswell-EX show the same memory bandwidth, while Skylake delivers a ~51% increase in memory bandwidth over Broadwell-EX. This is due, first, to the use of 2666 MT/s RDIMMs, which provide roughly a 66% increase in maximum memory bandwidth over Broadwell-EX, and second, to a 50% increase in the number of memory channels per socket: six channels per socket for Skylake versus four for Broadwell-EX.

Figure 6: Comparing STREAM Performance with “SNC = Enabled”                    Figure 7: Comparing STREAM Performance with “SNC = Disabled”

Figure 6 and Figure 7 describe the impact on memory bandwidth of traversing the UPI link across sockets on the PowerEdge R940. With “SNC=Enabled”, local memory bandwidth and bandwidth to the other NUMA node on the same socket are nearly the same (0-1% variation), but bandwidth to a remote socket is ~57% lower than local memory bandwidth. With “SNC=Disabled”, remote memory bandwidth is 77% lower than local memory bandwidth.

    The Weather Research and Forecasting (WRF) Model is a mesoscale numerical weather prediction system designed for both atmospheric research and operational forecasting needs. It features two dynamical cores, a data assimilation system, and a software architecture facilitating parallel computation and system extensibility. The model serves a wide range of meteorological applications across scales from tens of meters to thousands of kilometers. WRF can generate atmospheric simulations using real data or idealized conditions. We used the CONUS12km and CONUS2.5km benchmark datasets for this study. 

CONUS12km is a single-domain, small-size benchmark (a 48-hour, 12 km resolution case over the Continental U.S. (CONUS) domain from October 24, 2001) with a 72-second time step. CONUS2.5km is a single-domain, large-size benchmark (the final 3 hours of a 9-hour, 2.5 km resolution case over the CONUS domain from June 4, 2005) with a 15-second time step.

WRF decomposes the domain into tasks or patches. Each patch can be further decomposed into tiles that are processed separately, but by default there is only one tile per run. If the single tile is too large to fit into the cache of the CPU and/or core, computation slows down because of WRF’s memory bandwidth sensitivity. To reduce the tile size, the number of tiles can be increased by setting “numtiles = x” in the input file or by defining the environment variable “WRF_NUM_TILES = x”. For both CONUS 12km and CONUS 2.5km, the number of tiles was chosen for best performance.
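For example, a tile count can be requested in the `&domains` section of WRF's `namelist.input` (the value 8 here is purely illustrative; this study picked tile counts per dataset for best performance):

```
&domains
 ...
 numtiles = 8,
/
```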

    Figure 8: WRF Performance across BIOS profiles (Conus12KM)              Figure 9: WRF Performance across BIOS profiles (Conus2.5KM)                                                 

Figure 8 and Figure 9 compare the WRF datasets across BIOS profiles. For both CONUS 12km and CONUS 2.5km, “Performance, SNC=Enabled” gives the best performance, with “SNC=Enabled” performing ~1-2% better than “SNC=Disabled”. The performance difference across BIOS profiles is nearly equal for CONUS 12km because of its smaller dataset size, while for the larger CONUS 2.5km dataset we observe 1-2% performance variation across the system profiles, as it utilizes the larger number of processors more efficiently.

    Figure 10: Comparison over multiple generations of processors                     Figure 11: Comparison over multiple generations of processors

Figure 10 and Figure 11 show the performance comparison between the PowerEdge R940 with Skylake processors and the PowerEdge R930 with Broadwell-EX and Haswell-EX processors. For CONUS 12km, the PowerEdge R940 with Skylake performs ~18% better than the PowerEdge R930 with Broadwell-EX and ~45% better than with Haswell-EX. For CONUS 2.5km, Skylake performs ~29% better than Broadwell-EX and ~38% better than Haswell-EX.

ANSYS Fluent is a computational fluid dynamics (CFD) software tool. Fluent includes well-validated physical modeling capabilities to deliver fast and accurate results across a wide range of CFD and multiphysics applications.

    Figure 12: Ansys Fluent Performance across BIOS profiles

We used five datasets for our analysis: truck_poly_14m, combustor_12m, exhaust_system_33m, ice_2m and aircraft_wing_14m, with Fluent’s “Solver Rating” (higher is better) as the performance metric. Figure 12 shows that all datasets performed better with the “Performance, SNC=Enabled” BIOS option than with the others. For all datasets, “SNC=Enabled” performs 2% to 4% better than “SNC=Disabled”.


    Figure 13: Ansys Fluent (truck_poly_14m) performance over multiple generations of Intel processors

Figure 13 shows the performance comparison of truck_poly_14m on the PowerEdge R940 with Skylake processors and the PowerEdge R930 with Broadwell-EX and Haswell-EX processors. On the PowerEdge R940, Fluent showed 46% better performance than the PowerEdge R930 with Broadwell-EX and 87% better performance than with Haswell-EX.


The PowerEdge R940 is a highly efficient four-socket next-generation platform that provides up to 122TB of storage capacity with 6.3TF of computing power, making it well-suited for data-intensive applications without sacrificing performance. The Skylake processors give the PowerEdge R940 a performance boost over its previous-generation server (the PowerEdge R930); we observe more than 45% performance improvement across all the applications tested.

Comparing system profiles in the analysis above, the “Performance” profile gives better performance than the other system profiles.

In conclusion, the PowerEdge R940 with Skylake processors is a good platform for a wide variety of applications and can meet the demand for more compute power in HPC applications.

  • Application Performance on P100-PCIe GPUs

    Authors: Rengan Xu, Frank Han and Nishanth Dandapanthu. Dell EMC HPC Innovation Lab. Feb 2017

    Introduction to P100-PCIe GPU

This blog describes a performance analysis of NVIDIA® Tesla® P100™ GPUs on a cluster of Dell PowerEdge C4130 servers. There are two types of P100 GPUs: PCIe-based and SXM2-based. In PCIe-based servers, GPUs are connected by PCIe buses and one P100 delivers around 4.7 and 9.3 TeraFLOPS of double- and single-precision performance, respectively. In P100-SXM2 servers, GPUs are connected by NVLink and one P100 delivers around 5.3 and 10.6 TeraFLOPS of double- and single-precision performance, respectively. This blog focuses on the P100 for PCIe-based servers, i.e. the P100-PCIe. We have already analyzed P100 performance for several deep learning frameworks in this blog. The objective here is to compare the performance of HPL, LAMMPS, NAMD, GROMACS, HOOMD-blue, Amber, ANSYS Mechanical and RELION. The hardware configuration of the cluster is the same as in the deep learning blog: briefly, a cluster of four C4130 nodes, each with dual Intel Xeon E5-2690 v4 CPUs and four NVIDIA P100-PCIe GPUs, all connected with EDR InfiniBand. Table 1 shows the detailed hardware and software information for every compute node.


Table 1: Experiment Platform and Software Details

| | |
| --- | --- |
| Server | PowerEdge C4130 (configuration G) |
| Processor | 2 x Intel Xeon CPU E5-2690 v4 @ 2.6 GHz (Broadwell) |
| Memory | 256 GB DDR4 @ 2400 MHz |
| Local Disk | 9 TB HDD |
| GPU | P100-PCIe with 16 GB GPU memory |
| Nodes Interconnect | Mellanox ConnectX-4 VPI (EDR 100 Gb/s InfiniBand) |
| InfiniBand Switch | Mellanox SB7890 |
| Operating System | RHEL 7.2 x86_64 |
| Linux Kernel Version | |
| | Version 2.3.3 |
| CUDA version and driver | CUDA 8.0.44 (375.20) |
| OpenMPI compiler | Version 2.0.1 |
| GCC compiler | |
| Intel Compiler | Version 2017.0.098 |
| HPL | Version hpl_cuda_8_ompi165_gcc_485_pascal_v1 |
| LAMMPS | Version Lammps-30Sep16 |
| NAMD | Version NAMD_2.12_Source |
| GROMACS | Version 2016.1 |
| HOOMD-blue | Version 2.1.2 |
| Amber | Version 16update7 |
| ANSYS Mechanical | Version 17.0 |
| RELION | Version 2.0.3 |

    High Performance Linpack (HPL)

HPL is a parallel application that measures how fast computers solve a dense n-by-n system of linear equations using LU decomposition with partial row pivoting, and it is designed to run at very large scale. The HPL runs on this cluster used double-precision floating-point operations. Figure 1 shows HPL performance on the tested P100-PCIe cluster. One P100 is 3.6x faster than 2 x E5-2690 v4 CPUs, and HPL scales very well with more GPUs within a node or across nodes. Recall that four P100s fit in one server, so 8, 12 and 16 P100s correspond to 2, 3 and 4 servers; 16 P100 GPUs deliver a speedup of 14.9x over 1 P100. The overall efficiency is calculated as HPL Efficiency = rMax / (CPUs rPeak + GPUs rPeak), where rPeak is the highest theoretical FLOPS result and rMax is the real, achieved performance reported by HPL. HPL cannot run at the maximum boost clock; it typically runs somewhere in between, with the average closer to the base clock than to the boost clock, which is why the efficiency is not very high. Although we included the CPU rPeak in the efficiency calculation, we set DGEMM_SPLIT=1.0 when running HPL on the P100s, which means the CPU does not really contribute to DGEMM. The CPUs stayed fully utilized, but they were only handling overhead and data movement to keep the GPUs fed rather than contributing many FLOPS. What matters most for the P100 result is that rMax is very large.
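The efficiency formula quoted above, as a small sketch (the rMax and CPU rPeak values below are illustrative placeholders, not the blog's measured numbers; the 4.7 TF per-GPU double-precision peak is from the figures earlier in this post):

```python
def hpl_efficiency(rmax_tflops, cpu_rpeak_tflops, gpu_rpeak_tflops):
    # rMax: measured HPL result; rPeak: theoretical peaks of CPUs and GPUs
    return rmax_tflops / (cpu_rpeak_tflops + gpu_rpeak_tflops)

# Hypothetical single C4130 node: dual E5-2690 v4 (~1.2 TF peak, illustrative)
# plus 4 x P100-PCIe at 4.7 TF double precision each.
eff = hpl_efficiency(rmax_tflops=12.0, cpu_rpeak_tflops=1.2, gpu_rpeak_tflops=4 * 4.7)
print(f"{eff:.0%}")
```

Because DGEMM_SPLIT=1.0 leaves the CPUs out of the DGEMM work, including the CPU rPeak in the denominator depresses the reported efficiency, as the text notes.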

    Figure 1: HPL performance on P100-PCIe


NAMD (for NAnoscale Molecular Dynamics) is a molecular dynamics application designed for high-performance simulation of large biomolecular systems. The dataset we used is Satellite Tobacco Mosaic Virus (STMV), a small icosahedral plant virus that worsens the symptoms of infection by Tobacco Mosaic Virus (TMV). This dataset has 1,066,628 atoms and is the largest dataset on the NAMD utilities website. The performance metric in the application's output log is “days/ns” (lower is better), but we plot its inverse, “ns/day”, since that is what most molecular dynamics users focus on; the average of all occurrences of this value in the output log was used. Figure 2 shows the performance within one node. The performance with 2 P100s is better than with 4 P100s, probably because of the communication among CPU threads: the application launches a set of worker threads that handle computation and communication threads that handle data communication, and as more GPUs are used, more communication threads are needed and more synchronization is required. In addition, profiling with NVIDIA’s CUDA profiler nvprof shows that with 1 P100 the GPU computation takes less than 50% of the whole application time. According to Amdahl’s law, the speedup with more GPUs is limited by the other 50% of the work that is not parallelized on the GPU. Based on this observation, we further ran the application on multiple nodes with two different settings (2 GPUs/node and 4 GPUs/node); the result is shown in Figure 3. No matter how many nodes are used, the performance with 2 GPUs/node is always better than with 4 GPUs/node. Within a node, 2 P100 GPUs are 9.5x faster than dual CPUs.
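The Amdahl's-law argument above can be made concrete (the ~50% GPU fraction comes from the nvprof observation; the bound itself is the standard formula):

```python
def amdahl_speedup(parallel_fraction, s):
    # Overall speedup when only `parallel_fraction` of the runtime is
    # accelerated by a factor of s; the rest runs at its original speed.
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / s)

# With ~50% of NAMD's time on the GPU, even infinitely fast GPU work
# caps the overall speedup at 2x:
print(round(amdahl_speedup(0.5, 1e9), 3))  # 2.0
```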

     Figure 2: NAMD Performance within 1 P100-PCIe node

     Figure 3: NAMD Performance across Nodes


GROMACS (for GROningen MAchine for Chemical Simulations) primarily performs simulations for biochemical molecules (bonded interactions), but because of its efficiency in calculating non-bonded interactions (atoms not linked by covalent bonds), its user base is expanding to non-biological systems. Figure 4 shows the performance of GROMACS on CPUs, K80 GPUs and P100-PCIe GPUs. Since one K80 contains two internal GPUs, whenever we mention one K80 it refers to both internal GPUs, not one of the two. The K80 tests used the same servers as the P100-PCIe tests, so the CPUs and memory were identical and the only difference was that the P100-PCIe GPUs were replaced with K80 GPUs. In all tests there were four GPUs per server and all GPUs were utilized; for example, the 3-node data point uses 3 servers and 12 total GPUs. The P100-PCIe is 4.2x to 2.8x faster than the CPU from 1 node to 4 nodes, and 1.5x to 1.1x faster than the K80 GPU from 1 node to 4 nodes.

      Figure 4: GROMACS Performance on P100-PCIe


LAMMPS (for Large-scale Atomic/Molecular Massively Parallel Simulator) is a classical molecular dynamics code capable of simulating solid-state materials (metals, semiconductors), soft matter (biomolecules, polymers) and coarse-grained or mesoscopic systems. It can be used to model atoms or, more generically, as a parallel particle simulator at the atomic, meso or continuum scale. The dataset we used was LJ (the Lennard-Jones liquid benchmark), which contains 512,000 atoms. There are two GPU implementations in LAMMPS: the GPU library version and the Kokkos version. In our experiments we used the Kokkos version, since it was much faster than the GPU library version.

Figure 5 shows LAMMPS performance on CPUs and P100-PCIe GPUs. Using 16 P100 GPUs is 5.8x faster than using 1 P100. The application does not scale linearly because the data transfer time (CPU->GPU, GPU->CPU and GPU->GPU) increases as more GPUs are used, even though the computation time decreases linearly. The data transfer time increases because this application requires data communication among all GPUs used, but the configuration G we used only allows peer-to-peer (P2P) access for two pairs of GPUs: GPU 1 - GPU 2 and GPU 3 - GPU 4. GPU 1/2 cannot communicate with GPU 3/4 directly; when that communication is needed, the data must go through the CPU, which slows it down. Configuration B eases this issue, as it allows P2P access among all four GPUs within a node. The comparison between configuration G and configuration B is shown in Figure 6. Running LAMMPS on a configuration B server with 4 P100s improved the performance metric “timesteps/s” to 510 from 505 in configuration G, a 1% improvement. The improvement is small because data communication takes less than 8% of the whole application time when running on configuration G with 4 P100s. Figure 7 also compares the performance of the P100-PCIe with that of CPUs and K80 GPUs for this application: within one node, 4 P100-PCIe GPUs are 6.6x faster than 2 E5-2690 v4 CPUs and 1.4x faster than 4 K80 GPUs.
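Configuration G's peer-to-peer topology, as described above, can be captured in a tiny sketch (GPU numbering follows the text):

```python
# In configuration G only two GPU pairs have direct P2P access; all other
# GPU-to-GPU traffic must be staged through the CPU.
P2P_PAIRS = {frozenset({1, 2}), frozenset({3, 4})}

def has_p2p(gpu_a, gpu_b):
    return frozenset({gpu_a, gpu_b}) in P2P_PAIRS

print(has_p2p(1, 2), has_p2p(2, 3))  # True False
```

Configuration B would make `has_p2p` true for every pair within the node, which is why it relieves the all-to-all communication pattern LAMMPS uses.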

     Figure 5: LAMMPS Performance on P100-PCIe



    Figure 6 : Comparison between Configuration G and Configuration B

    Figure 7: LAMMPS Performance Comparison


HOOMD-blue (for Highly Optimized Object-oriented Many-particle Dynamics - blue) is a general-purpose molecular dynamics simulator. Figure 8 shows HOOMD-blue performance; note that the y-axis is in logarithmic scale. One P100 is 13.4x faster than dual CPUs, and the speedup of 2 P100s over 1 P100 is a reasonable 1.5x. However, from 4 P100s to 16 P100s the speedup is only 2.1x to 3.9x. As with LAMMPS, the reason is that this application involves a lot of communication among all the GPUs used. Based on the LAMMPS analysis, using configuration B should reduce this communication bottleneck significantly. To verify this, we ran the same application on a configuration B server: with 4 P100s, the performance metric “hours for 10e6 steps” was reduced to 10.2 from 11.73 in configuration G, a 13% performance improvement, and the speedup over 1 P100 improved from 2.1x to 2.4x.

     Figure 8: HOOMD-blue Performance on CPU and P100-PCIe


Amber is the collective name for a suite of programs that allow users to carry out molecular dynamics simulations, particularly on biomolecules; the term Amber also refers to the empirical force fields implemented in this suite. Figure 9 shows the performance of Amber on CPUs and the P100-PCIe. One P100 is 6.3x faster than dual CPUs, and 2 P100 GPUs are 1.2x faster than 1 P100. However, performance drops significantly when 4 or more GPUs are used. As with LAMMPS and HOOMD-blue, this application relies heavily on P2P access, but configuration G only supports it between two pairs of GPUs. We verified this by testing the application on a configuration B node: the performance with 4 P100s improved to 791 ns/day from 315 ns/day in configuration G, a 151% performance improvement and a speedup of 2.5x. Even in configuration B, however, multi-GPU scaling is still not good. When Amber's multi-GPU support was originally designed, the PCIe bus speed was Gen2 x16 and the GPUs were C1060s or C2050s; current Pascal-generation GPUs are more than 16x faster than the C1060s, while the PCIe bus speed has only increased by 2x (PCIe Gen2 x16 to Gen3 x16) and InfiniBand interconnects by about the same amount. The Amber website explicitly states: “It should be noted that while the legacy MPI and GPU-Direct methods of multi-GPU communication are still supported, and will be used by the code automatically if peer to peer communication is not available, you are very unlikely to see any speedup by using multiple GPUs for a single job if the GPUs are newer than C2050s. Multi-node runs are almost impossible to get to scale.” This is consistent with our multi-node results: as Figure 9 shows, the more nodes are used, the worse the performance.
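The configuration-B gain quoted above works out as follows:

```python
ns_per_day_config_g = 315.0   # 4 x P100, configuration G (from the text)
ns_per_day_config_b = 791.0   # 4 x P100, configuration B (from the text)

improvement = (ns_per_day_config_b - ns_per_day_config_g) / ns_per_day_config_g
print(f"{improvement:.0%}")  # 151%
```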

      Figure 9: Amber Performance on CPU and P100-PCIe

    ANSYS Mechanical

     ANSYS® Mechanical software is a comprehensive finite element analysis (FEA) tool for structural analysis, including linear, nonlinear dynamic, hydrodynamic and explicit studies. It provides a complete set of element behaviors, material models and equation solvers for a wide range of mechanical design problems. The finite element method is used to solve the partial differential equations, which is a compute- and memory-intensive task. Our testing focused on the Power Supply Module (V17cg-1) benchmark. This is a medium-sized job for iterative solvers and a good test for memory bandwidth. Figure 10 shows the performance of ANSYS Mechanical on CPU and P100-PCIe. Within a node, 4 P100s are 3.8x faster than dual CPUs, and with 4 nodes, 16 P100s are 2.3x faster than 8 CPUs. The figure also shows that the performance scales well with more nodes: the speedup with 4 nodes is 2.8x compared to 1 node.

     Figure 10: ANSYS Mechanical Performance on CPU and P100-PCIe


    RELION

    RELION (for REgularised LIkelihood OptimisatioN) is a program that employs an empirical Bayesian approach to refinement of (multiple) 3D reconstructions or 2D class averages in electron cryo-microscopy (cryo-EM). Figure 11 shows the performance of RELION on CPU and P100-PCIe. Note that the y-axis is in logarithmic scale. It demonstrates that 1 P100 is 8.8x faster than dual CPUs. From the figure we also notice that it does not scale well starting from 4 P100 GPUs. Because of the long execution time, we did not profile this application, but it is possible that the reason for the weak scaling is similar to LAMMPS, HOOMD-blue and Amber.

     Figure 11: RELION Performance on CPU and P100-PCIe

    Conclusions and Future Work

    In this blog, we presented and analyzed the performance of different applications on Dell PowerEdge C4130 servers with P100-PCIe GPUs. Among the tested applications, HPL, GROMACS and ANSYS Mechanical benefit from the balanced CPU-GPU layout of configuration G, because they do not require P2P access among GPUs. However, LAMMPS, HOOMD-blue, Amber (and possibly RELION) rely on P2P access. With configuration G, they scale well up to 2 P100 GPUs, then scale weakly with 4 or more P100 GPUs. With configuration B, they scale better than configuration G at 4 GPUs, so configuration B is more suitable and recommended for applications implemented with P2P access.

    In future work, we will run these applications on the P100-SXM2 and compare the performance of P100-PCIe and P100-SXM2.

  • NAMD Performance Analysis on Skylake Architecture

    Author: Joseph Stanfield

    The purpose of this blog is to provide a comparative performance analysis of the Intel® Xeon® Gold 6150 processor and the previous generation Xeon® E5-2697 v4 processors using the NAMD benchmark. The Xeon® Gold 6150 CPU features 18 physical cores or 36 logical cores when utilizing hyper threading. This processor is based on Intel’s new micro-architecture codenamed “Skylake”. Intel significantly increased the L2 cache per core from 256 KB on Broadwell to 1 MB on Skylake. The 6150 also touts 24.75 MB of L3 cache and a six channel DDR4 memory interface.


    Nanoscale Molecular Dynamics (NAMD) is an application developed using the Charm++ parallel programming model for molecular dynamics simulation. It is popular due to its parallel efficiency, scalability, and the ability to simulate millions of atoms.

    Test Cluster Configurations:

    Server                       Dell EMC PowerEdge C6420            Dell EMC PowerEdge C6320
    Processor                    2x Xeon® Gold 6150 18c 2.7 GHz      2x Xeon® E5-2697 v4 16c 2.3 GHz
    Memory                       12x 16GB @2666 MT/s                 8x 16GB @2400 MT/s
    Local disk                   1TB SATA                            1TB SATA
    Operating System             RHEL 7.3                            RHEL 7.3
    Interconnect                 EDR ConnectX-4                      EDR ConnectX-4

    BIOS Options

    System Profile               Performance Optimized
    Logical Processor
    Virtualization Technology

    The benchmark dataset selected for this series of tests was the Satellite Tobacco Mosaic Virus, or STMV. STMV contains 1,066,628 atoms, which makes it ideal for demonstrating scaling to large clustered environments. Performance is measured in nanoseconds per day (ns/day): the amount of simulated time the run advances per day of wall-clock computation. A larger value indicates faster performance.


    The first series of benchmark tests measured CPU performance. The test environment consisted of a single node, two nodes, four nodes, and eight nodes, with the NAMD STMV dataset run three times for each configuration. The interconnect between nodes was EDR InfiniBand, as noted in the table above. Average results from a single node showed 0.70 ns/day, while a two-node run increased performance by 80% to 1.25 ns/day. This trend of roughly an 80% increase for each doubling of node count remained relatively consistent as the environment was scaled to eight nodes, as seen in Figure 1.

    Figure 1.
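The ~80% per-doubling trend can be restated as parallel efficiency; a small sketch based on the measured single-node and two-node ns/day values above:

```python
one_node = 0.70  # ns/day, single node (measured)
two_node = 1.25  # ns/day, two nodes (measured)

gain = two_node / one_node - 1          # increase per doubling of nodes
efficiency = two_node / (2 * one_node)  # fraction of ideal 2x scaling

print(f"gain per doubling: {gain:.0%}")          # ~79%
print(f"parallel efficiency: {efficiency:.0%}")  # ~89%
```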


    The second series of benchmarks compared the Xeon® Gold 6150 against the previous generation Xeon® E5-2697 v4. The same STMV dataset was used for both benchmark environments. As shown in Figure 2, the Xeon® Gold results surpass the Xeon® E5 v4 by 111% on a single node, with the relative performance advantage decreasing to 63% at eight nodes.


    Figure 2.




    In this blog, we analyzed and presented the performance of a Dell EMC PowerEdge C6420 cluster scaling from a single node to eight nodes running NAMD with the STMV dataset. Results show that NAMD performance scales nearly linearly with increasing node count.

    At the time of publishing this blog, there is an issue with Intel Parallel Studio 2017.x and NAMD compilation. Intel recommends using Parallel Studio 2016.4 or 2018 (still in beta) with -xCORE-AVX512 under the FLOATOPS variable for best performance.

    A comparative analysis was also conducted with the previous generation Dell EMC PowerEdge C6320 server and Xeon® E5 v4 (Broadwell) processor. The Xeon® Gold outperformed the E5 v4 by 111% on a single node and maintained a near-linear performance increase as the cluster was scaled.


    Intel NAMD Recipe:

    Intel Fabric Tuning and Application Performance:

  • LAMMPS Four Node Comparative Performance Analysis on Skylake Processors

    Author: Joseph Stanfield

    The purpose of this blog is to provide a comparative performance analysis of the Intel® Xeon® Gold 6150 processor (architecture code named “Skylake”) and the previous generation Xeon® E5-2697 v4 processor using the LAMMPS benchmark. The Xeon® Gold 6150 CPU features 18 cores or 36 when utilizing hyper threading. Intel significantly increased the L2 cache per core from 256 KB on previous generations of Xeon to 1 MB. The new processor also touts 24.75 MB of L3 cache and a six channel DDR4 memory interface.

    LAMMPS, or Large-scale Atomic/Molecular Massively Parallel Simulator, is an open-source molecular dynamics program originally developed by Sandia National Laboratories, Temple University, and the United States Department of Energy. The main function of LAMMPS is to model particles in a gaseous, liquid, or solid state.


    Test cluster configuration

    Server                       Dell EMC PowerEdge C6420            Dell EMC PowerEdge C6320
    Processor                    2x Xeon® Gold 6150 18c 2.7 GHz      2x Xeon® E5-2697 v4 16c 2.3 GHz
    Memory                       12x 16GB @2666 MT/s                 8x 16GB @2400 MT/s
    Local disk                   1TB SATA                            1TB SATA
    Operating System             RHEL 7.3                            RHEL 7.3
    Interconnect                 EDR ConnectX-4                      EDR ConnectX-4

    BIOS Options

    System Profile               Performance Optimized
    Logical Processor
    Virtualization Technology

    The LAMMPS release used for testing was lammps-6June-17. The in.eam dataset was used for the analysis on both configurations. In.eam simulates a metallic solid: Cu EAM potential with a 4.95 Angstrom cutoff (45 neighbors per atom) and NVE integration. The simulation was executed for 100 steps with 32,000 atoms. The first series of benchmarks measured performance in units of timesteps/s. The test environment consisted of four servers interconnected with InfiniBand EDR, and tests were run on a single node, two nodes, and four nodes, three times for each configuration. Average results from a single node showed 106 timesteps per second, while a two-node run nearly doubled performance at 216 timesteps per second. This trend remained consistent as the environment was scaled to four nodes, as seen in Figure 1.


    Figure 1.

    The second series of benchmarks were run to compare the Xeon® Gold against the previous generation, Xeon® E5 v4. The same dataset, in.eam, was used with 32,000 atoms and 100 steps per run. As you can see below in Figure 2, the Xeon® Gold CPU outperforms the Xeon® E5 v4 by about 120% with each test, but the performance increase drops slightly as the cluster is scaled.

    Figure 2.


    In this blog, we analyzed and presented the performance of a Dell EMC PowerEdge C6420 cluster scaling from a single node to four nodes running the LAMMPS benchmark. Results show that performance of LAMMPS scales linearly with the increased number of nodes.


    A comparative analysis was also conducted with the previous generation Dell EMC PowerEdge C6320 server with the Xeon® E5 v4 (Broadwell) processor. As with the first test, near-linear scaling with node count was also observed on the Xeon® E5 v4, with results tracking the Xeon® Gold. However, the Xeon® Gold processor outperformed the previous generation CPU by about 120% in each run.




  • BIOS characterization for HPC with Intel Skylake processor

    Ashish Kumar Singh. Dell EMC HPC Innovation Lab. Aug 2017

    This blog discusses the impact of the different BIOS tuning options available on Dell EMC 14th generation PowerEdge servers with the Intel Xeon® Processor Scalable Family (architecture codenamed “Skylake”) for some HPC benchmarks and applications. A brief description of the Skylake processor, BIOS options and HPC applications is provided below.  

    Skylake is a new 14nm “tock” processor in the Intel “tick-tock” series, which has the same process technology as the previous generation but a new microarchitecture. Skylake requires a new CPU socket that is available with the Dell EMC 14th Generation PowerEdge servers. Skylake processors are available in two configurations: with and without an integrated Omni-Path fabric. The Omni-Path fabric supports network bandwidth up to 100Gb/s. The Skylake processor supports up to 28 cores, six DDR4 memory channels with speeds up to 2666MT/s, and additional vectorization power with the AVX512 instruction set. Intel also introduces a new cache-coherent interconnect named “Ultra Path Interconnect” (UPI), replacing Intel® QPI, to connect multiple CPU sockets.

    Skylake offers a new, more powerful AVX512 vectorization technology that provides 512-bit vectors. The Skylake CPUs include models that support two 512-bit Fuse-Multiply-Add (FMA) units to deliver 32 Double Precision (DP) FLOPS/cycle and models with a single 512-bit FMA unit that is capable of 16 DP FLOPS/cycle. More details on AVX512 are described in the Intel programming reference. With 32 FLOPS/cycle, Skylake doubles the compute capability of the previous generation, Intel Xeon E5-2600 v4 processors (“Broadwell”).
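From the 32 DP FLOPS/cycle figure, a rough peak for a dual-FMA Skylake part can be derived; a sketch for the 18-core 2.7 GHz Xeon Gold 6150 used later in this study (base clock is assumed here, so this is an upper bound, since AVX512 code typically runs at lower frequencies):

```python
flops_per_cycle = 32  # 2x 512-bit FMA x 8 DP lanes x 2 ops (multiply + add)
cores = 18
base_clock_ghz = 2.7  # AVX512 frequencies are lower in practice

peak_tflops = flops_per_cycle * cores * base_clock_ghz / 1000
print(f"peak DP: ~{peak_tflops:.2f} TFLOPS per socket")  # ~1.56
```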

    Skylake processors are supported in the Dell EMC PowerEdge 14th Generation servers. The new processor architecture allows different tuning knobs, which are exposed in the server BIOS menu. In addition to existing options for performance and power management, the new servers also introduce a clustering mode called Sub NUMA clustering (SNC). On CPU models that support SNC, enabling SNC is akin to splitting the single socket into two NUMA domains, each with half the physical cores and half the memory of the socket. If this sounds familiar, it is similar in utility to the Cluster-on-Die option that was available in E5-2600 v3 and v4 processors as described here. SNC is implemented differently from COD, and these changes improve remote socket access in Skylake when compared to the previous generation. At the Operating System level, a dual socket server with SNC enabled will display four NUMA domains. Two of the domains will be closer to each other (on the same socket), and the other two will be a larger distance away, across the UPI to the remote socket. This can be seen using OS tools like numactl -H.

    In this study, we have used the Performance and PerformancePerWattDAPC system profiles based on our earlier experiences with other system profiles for HPC workloads. The Performance Profile aims to optimize for pure performance. The DAPC profile aims to balance performance with energy efficiency concerns. Both of these system profiles are meta options that, in turn, set multiple performance and power management focused BIOS options like Turbo mode, Cstates, C1E, Pstate management, Uncore frequency, etc.

    We have used two HPC benchmarks and two HPC applications to understand the behavior of SNC and System Profile BIOS options with Dell EMC PowerEdge 14th generation servers. This study was performed with a single server only; cluster level performance deltas will be bounded by these single server results. The server configuration used for this study is described below.    

    Testbed configuration:

    Table 1: Test configuration of new 14G server

    Components                                          Details

    Server                                                     PowerEdge C6420 

    Processor                                               2 x Intel Xeon Gold 6150 – 2.7GHz, 18c, 165W

    Memory                                                  192GB (12 x 16GB) DDR4 @2666MT/s

    Hard drive                                              1 x 1TB SATA HDD, 7.2k rpm

    Operating System                                   Red Hat Enterprise Linux-7.3 (kernel - 3.10.0-514.el7.x86_64)

    MPI                                                         Intel® MPI 2017 update4

    MKL                                                        Intel® MKL 2017.0.3

    Compiler                                                 Intel® compiler 17.0.4

    Table 2: HPC benchmarks and applications

    Application                                Version                                               Benchmark

    HPL                                             From Intel® MKL                                Problem size - 92% of total memory

    STREAM                                      v5.04                                                  Triad

    WRF                                            3.8.1                                                  conus2.5km

    ANSYS Fluent                              v17.2                                                  truck_poly_14m, Ice_2m



    Sub-NUMA cluster

    As described above, a system with SNC enabled will expose four NUMA nodes to the OS on a two-socket PowerEdge server. Each NUMA node can communicate with three remote NUMA nodes: two on the other socket and one within the same socket. NUMA domains on different sockets communicate over the UPI interconnect. With the 18-core Intel® Xeon® Gold 6150 processor, each NUMA node has nine cores. Since both sockets are equally populated with memory, each NUMA domain has one fourth of the total system memory.


                                                 Figure 1: Memory bandwidth with SNC enabled

    Figure 1 plots the memory bandwidth with SNC enabled. Except for SNC and Logical Processor, all other options are set to BIOS defaults. Full system memory bandwidth is ~195 GB/s on the two-socket server. This test uses all 36 available cores for memory access and calculates aggregate memory bandwidth. The “Local socket – 18 threads” data point measures the memory bandwidth of a single socket with 18 threads. As per the graph, local socket memory bandwidth is ~101 GB/s, about half of the full system bandwidth. By enabling SNC, a single socket is divided into two NUMA nodes. The memory bandwidth of a single SNC-enabled NUMA node is shown as “Local NUMA node – 9 threads”. In this test, the nine local cores access the memory attached to their NUMA domain. The memory bandwidth here is ~50 GB/s, half of the local socket bandwidth.

    The data point “Remote to same socket” measures the memory bandwidth between two NUMA nodes on the same socket, with cores in one NUMA domain accessing the memory of the other NUMA domain. As per the graph, the server measures ~50 GB/s for this case, the same as the “Local NUMA node – 9 threads” case. That is, with SNC enabled, memory access within the socket delivers similar bandwidth even across NUMA domains. This is a big difference from the previous generation, where there was a penalty when accessing memory on the same socket with COD enabled. See Figure 1 in the previous blog, where a 47% drop in bandwidth was observed, and compare that to the 0% drop here. The “Remote to other socket” test involves cores in one NUMA domain accessing the memory of a remote NUMA node on the other socket. This bandwidth is 54% lower due to non-local memory access over the UPI interconnect.
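The relationships among the measured bandwidths can be checked with simple ratios; a sketch using the Figure 1 numbers (the cross-socket value is derived here by applying the quoted 54% drop to the single-NUMA-node bandwidth, which is an assumption about the baseline):

```python
full_system = 195   # GB/s, both sockets, 36 threads
local_socket = 101  # GB/s, one socket, 18 threads
local_numa = 50     # GB/s, one SNC NUMA node, 9 threads
remote_drop = 0.54  # quoted penalty for cross-socket access over UPI

print(f"socket share of system bandwidth: {local_socket / full_system:.0%}")     # ~52%
print(f"cross-socket estimate: ~{local_numa * (1 - remote_drop):.0f} GB/s")      # ~23
```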

    These memory bandwidth tests are interesting, but what do they mean? Like in previous generations, SNC is a good option for codes that have high NUMA locality. Reducing the size of the NUMA domain can help some codes run faster due to less snoops and cache coherence checks within the domain. Additionally, the penalty for remote accesses on Skylake is not as bad as it was for Broadwell.



     Figure 2: Comparing Sub-NUMA clustering with DAPC

    Figure 2 shows the effect of SNC on multiple HPC workloads; note that all of these have good memory locality. All options except SNC and Hyper Threading are set to BIOS default. SNC disabled is considered as the baseline for each workload. As per Figure 2, all tests measure no more than 2% higher performance with SNC enabled. Although this is well within the run-to-run variation for these applications, SNC enabled consistently shows marginally higher performance for STREAM, WRF and Fluent for these datasets. The performance delta will vary for larger and different datasets. For many HPC clusters, this level of tuning for a few percentage points might not be worth it, especially if applications with sub-optimal memory locality will be penalized.


    The Dell EMC default setting for this option is “disabled”, i.e. two sockets show up as just two NUMA domains. The HPC recommendation is to leave this at disabled to accommodate multiple types of codes, including those with inefficient memory locality, and to test this on a case-by-case basis for the applications running on your cluster.


    System Profiles

    Figure 3 plots the impact of different system profiles on the tests in this study. For these tests, all BIOS options are default except System Profile and Logical Processor. The DAPC profile with SNC disabled is used as the baseline. Most of these workloads show similar performance on both the Performance and DAPC system profiles; only HPL performs a few percent better with the Performance profile. As per our earlier studies, the DAPC profile always consumes less power than the Performance profile, which makes it suitable for HPC workloads without compromising much on performance.


     Figure 3: Comparing System Profiles

    Power Consumption

    Figure 4 shows the power consumption of different system profiles with SNC enabled and disabled. The HPL benchmark is well suited to stress the system and utilize its maximum compute capability. We measured idle and peak power consumption with Logical Processor set to disabled.


                                                 Figure 4: Idle and peak power consumption

    As per Figure 4, the DAPC profile with SNC disabled shows the lowest idle power consumption relative to the other profiles. Both the Performance and DAPC system profiles consume up to ~5% less power when idle with SNC disabled. At idle, the Performance profile consumes ~28% more power than DAPC.

    The peak power consumption is similar with SNC enabled and with SNC disabled. Peak power consumption in DAPC Profile is ~16% less than in Performance Profile. 


    The Performance system profile remains the best profile for achieving maximum performance for HPC workloads. However, the power savings of DAPC outweigh the small performance gain of the Performance profile, which makes DAPC the most suitable system profile overall.


  • Deep Learning Inference on P40 vs P4 with Skylake

    Authors: Rengan Xu, Frank Han and Nishanth Dandapanthula. Dell EMC HPC Innovation Lab. July. 2017

    This blog evaluates the performance, scalability and efficiency of deep learning inference on P40 and P4 GPUs on Dell EMC’s PowerEdge R740 server. The purpose is to compare P40 versus P4 in terms of performance and efficiency. It also measures the accuracy differences between high precision and reduced precision floating point in deep learning inference.

    Introduction to R740 Server

    The PowerEdge™ R740 is Dell EMC’s latest generation 2-socket, 2U rack server designed to run complex workloads using highly scalable memory, I/O, and network options. The system features the Intel Xeon Processor Scalable Family (architecture codenamed Skylake-SP), up to 24 DIMMs, PCI Express (PCIe) 3.0 enabled expansion slots, and a choice of network interface technologies to cover NIC and rNDC. The PowerEdge R740 is a general-purpose platform capable of handling demanding workloads and applications, such as data warehouses, ecommerce, databases, and high performance computing (HPC). It supports up to 3 Tesla P40 GPUs or 4 Tesla P4 GPUs.

    Introduction to P40 and P4 GPUs

    NVIDIA® launched Tesla® P40 and P4 GPUs for the inference phase of deep learning. Both GPU models are powered by the NVIDIA Pascal™ architecture and designed for deep learning deployment, but they serve different purposes: the P40 is designed to deliver maximum throughput, while the P4 aims to provide better energy efficiency. Aside from high floating point throughput and efficiency, both GPU models introduce two new optimized instructions designed specifically for inference computations: the 8-bit integer (INT8) 4-element vector dot product (DP4A) and the 16-bit 2-element vector dot product (DP2A). Although many HPC applications require high precision computation with FP32 (32-bit floating point) or FP64 (64-bit floating point), deep learning researchers have found that FP16 (16-bit floating point) can achieve the same inference accuracy as FP32, and many applications only require INT8 (8-bit integer) or lower precision to keep an acceptable inference accuracy. The Tesla P4 delivers a peak of 21.8 INT8 TIOP/s (Tera Integer Operations per Second), while the P40 delivers a peak of 47.0 INT8 TIOP/s. Other differences between these two GPU models are shown in Table 1. This blog uses both types of GPUs in the benchmarking.

    Table 1: Comparison between Tesla P40 and P4


                             Tesla P40         Tesla P4
    CUDA Cores
    Core Clock               1531 MHz          1063 MHz
    Memory Bandwidth         346 GB/s          192 GB/s
    Memory Size              24 GB GDDR5       8 GB GDDR5
    FP32 Compute             12.0 TFLOPS       5.5 TFLOPS
    INT8 Compute             47 TIOPS          22 TIOPS
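The DP4A instruction mentioned above multiplies two 4-element INT8 vectors and accumulates the result into a 32-bit integer; a minimal functional model in Python (the real instruction operates on packed GPU registers in a single cycle):

```python
def dp4a(a, b, c):
    """Functional model of DP4A: 4-way INT8 dot product with INT32 accumulate."""
    assert len(a) == len(b) == 4
    assert all(-128 <= x <= 127 for x in a + b)  # inputs must fit in INT8
    return sum(x * y for x, y in zip(a, b)) + c  # accumulated at 32-bit width

print(dp4a([1, 2, 3, 4], [5, 6, 7, 8], 10))  # 5 + 12 + 21 + 32 + 10 = 80
```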




    Introduction to NVIDIA TensorRT

    NVIDIA TensorRT™, previously called GIE (GPU Inference Engine), is a high performance deep learning inference engine for production deployment of deep learning applications that maximizes inference throughput and efficiency. TensorRT provides users the ability to take advantage of the fast reduced-precision instructions in Pascal GPUs. TensorRT v2 supports the new INT8 operations available on both P40 and P4 GPUs, and to the best of our knowledge it is the only library that supports INT8 to date.

    Testing Methodology

    This blog quantifies the performance of deep learning inference using NVIDIA TensorRT on one PowerEdge R740 server which supports up to 3 Tesla P40 GPUs or 4 Tesla P4 GPUs. Table 2 shows the hardware and software details. The inference benchmark we used was giexec in TensorRT sample codes. The synthetic images, which were filled with random non-zero numbers to simulate real images, were used in this sample code. Two classic neural networks were tested: AlexNet (2012 ImageNet winner) and GoogLeNet (2014 ImageNet winner) which is much deeper and more complicated than AlexNet.

    We measured the inference performance in images/sec which means the number of images that can be processed per second.

    Table 2: Hardware configuration and software details


    Server                       PowerEdge R740
    Processor                    2x Intel Xeon Gold 6150
    Memory                       192GB DDR4 @ 2667MHz
    Local storage                400GB SSD
    Shared storage               9TB NFS through IPoIB on EDR InfiniBand
    GPU                          3x Tesla P40 with 24GB GPU memory, or
                                 4x Tesla P4 with 8 GB GPU memory

    Software and Firmware

    Operating System             RHEL 7.2
    BIOS                         0.58 (beta version)
    CUDA and driver version      8.0.44 (375.20)
    NVIDIA TensorRT Version      2.0 EA and 2.1 GA

    Performance Evaluation


    In this section, we present the inference performance with NVIDIA TensorRT on GoogLeNet and AlexNet. We also implemented the benchmark with MPI so that it can be run on multiple GPUs within a server. Figure 1 and Figure 2 show the inference performance with AlexNet and GoogLeNet on up to three P40s and four P4s in one R740 server; in these two figures, batch size 128 was used. The power consumption of each configuration was also measured, and the energy efficiency of the configurations is plotted as a “performance per watt” metric. The power consumption was measured by subtracting the power when the system was idle from the power when running the inference. Both the images/sec and images/sec/watt metrics are normalized to one P40. Figure 3 shows the performance with different batch sizes on 1 GPU, with both metrics normalized to a P40 with batch size 1. In all figures, INT8 operations were used. The following conclusions can be drawn:

    • Performance: with the same number of GPUs, the inference performance on P4 is around half of that on P40. This is consistent with the theoretical INT8 performance of the two GPUs: 22 TIOPS on P4 vs 47 TIOPS on P40 for a single GPU. Also, since larger batch sizes give higher overall throughput but consume more memory, and the P4 has only 8GB of memory compared to the P40’s 24GB, the P4 could not complete the inference with batch size 2048 or larger.
    • Scalability: the performance scales linearly on both P40s and P4s when multiple GPUs are used, because no communication occurs between the GPUs in this test.
    • Efficiency (performance/watt): the performance/watt on P4 is ~1.5x that of the P40. This is also consistent with the theoretical efficiency difference: the theoretical performance of the P4 is about half that of the P40, while its TDP is around one third of the P40’s (75W vs 250W), so its performance/watt is ~1.5x that of the P40.

    Figure 1: The inference performance with AlexNet on P40 and P4

    Figure 2: The performance of inference with GoogLeNet on P40 and P4

    Figure 3: P40 vs P4 for AlexNet with different batch sizes
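The ~1.5x efficiency ratio in the observations above follows directly from the peak INT8 throughput and TDP figures; a quick check:

```python
p40 = {"tiops": 47, "tdp_w": 250}  # peak INT8 throughput and TDP from the text
p4 = {"tiops": 22, "tdp_w": 75}

eff_ratio = (p4["tiops"] / p4["tdp_w"]) / (p40["tiops"] / p40["tdp_w"])
print(f"P4 vs P40 theoretical perf/watt: {eff_ratio:.2f}x")  # ~1.56x
```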

    In our previous blog, we compared the inference performance using both FP32 and INT8, and the conclusion was that INT8 is ~3x faster than FP32. In this study, we also compare the accuracy of both modes to verify that INT8 delivers accuracy comparable to FP32. We used the latest TensorRT 2.1 GA version for this benchmarking. To make INT8 data encode the same information as FP32 data, a calibration method is applied in TensorRT to convert FP32 to INT8 in a way that minimizes the loss of information. More details of this calibration method can be found in the presentation “8-bit Inference with TensorRT” from GTC 2017. We used the ILSVRC2012 validation dataset for both calibration and benchmarking. The validation dataset has 50,000 images and was divided into batches of 25 images each. The first 50 batches were used for calibration and the remaining images for accuracy measurement. Several pre-trained neural network models were used in our experiments, including ResNet-50, ResNet-101, ResNet-152, VGG-16, VGG-19, GoogLeNet and AlexNet. Both top-1 and top-5 accuracies were recorded using FP32 and INT8, and the accuracy difference between them was calculated. The result is shown in Table 3. From this table, we can see the accuracy difference between FP32 and INT8 is between 0.02% and 0.18%, meaning the accuracy loss is minimal while the ~3x speedup is retained.

    Table 3: The accuracy comparison between FP32 and INT8

    In this blog, we compared inference performance on P40 and P4 GPUs in the latest Dell EMC PowerEdge R740 server and concluded that the P40 has ~2x higher inference performance than the P4. But the P4 is more power efficient, with ~1.5x the performance/watt of the P40. Also, with the NVIDIA TensorRT library, INT8 achieves accuracy comparable to FP32 while delivering ~3x higher performance.

  • Dell EMC HPC Systems - SKY is the limit

    Munira Hussain, HPC Innovation Lab, July 2017

    This is an announcement about the Dell EMC HPC refresh that introduces support for 14th Generation servers based on the new Intel® Xeon® Processor Scalable Family (micro-architecture also known as “Skylake”). This includes the addition of the PowerEdge R740, R740xd, R640, R940 and C6420 servers to the portfolio. The portfolio consists of fully tested, validated, and integrated solution offerings, providing high-speed interconnects, storage, and options for both hardware-level and cluster-level system management and monitoring software.


    At a high level, the new generation Dell EMC Skylake servers for HPC provide greater computational power, with support for up to 28 cores per socket and memory speeds up to 2667 MT/s; the architecture also extends AVX to AVX512. The AVX512 instructions can execute up to 32 DP FLOPS per cycle, twice the capability of the previous 13th generation servers that used Intel Xeon E5-2600 v4 processors (“Broadwell”). The maximum core count per socket also increases over the previous generation’s 22 cores. Each socket has six memory channels; therefore, a minimum of 12 DIMMs is needed in a dual socket server to realize full memory bandwidth. The chipset also has 48 PCIe lanes per socket, up from 40 lanes in the previous generation.
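The six-channel, 12-DIMM guidance follows from standard DDR4 arithmetic; a sketch of the theoretical per-socket peak (measured STREAM numbers will be lower):

```python
channels = 6            # memory channels per Skylake socket
rate_mts = 2666         # DDR4 transfer rate (MT/s)
bytes_per_transfer = 8  # 64-bit channel width

peak_gbs = channels * rate_mts * bytes_per_transfer / 1000
print(f"theoretical peak: ~{peak_gbs:.0f} GB/s per socket")  # ~128
```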


    The table below notes the enhancements in the latest PowerEdge servers over the previous generations:


    High Level Comparison of the Dell EMC Server Generations for HPC Offering:



    The HPC release supporting Dell EMC 14G servers is based on the Red Hat Enterprise Linux 7.3 operating system. It is based on the 3.10.0-514.el7.x86_64 kernel. The release also supports the new version of Bright Cluster Manager 8.0. Bright Cluster Manager (BCM) is integrated with Dell EMC supported tools, drivers, and third-party software components for the ease of deployment, configuration, and management of the cluster. It includes Dell EMC System Management tools based on OpenManage 9.0.1 and Dell EMC Deployment ToolKit 6.0.1 that help manage, monitor, and administer Dell EMC hardware. Additionally, updated third party drivers and development tools from Mellanox OFED for InfiniBand, Intel IFS for Omni-Path, NVIDIA CUDA for latest Accelerators, and other packages for Machine Learning are also included. Details of the components are as below:

    • Based on Red Hat Enterprise Linux 7.3 (Kernel 3.10.0-514.el7.x86_64)
    • Dell EMC System Management tools from Open Manage 9.0.1 and DTK 6.0.1 for 14G and Open Manage 8.5 and DTK 5.5 for up to 13G Dell EMC servers
    • Updated Dell EMC supported drivers for network and storage deployed during install

      • megaraid_sas = 7.700.50
      • igb =
      • ixgbe = 4.6.3
      • i40e = 1.6.44
      • tg3 = 1.137q
      • bnx2 = 2.2.5r
      • bnx2x = 1.714.2
    • Mellanox OFED 3.4 and 4.0 for InfiniBand
    • Intel IFS 10.3.1 drivers for Omni-Path
    • CUDA 8.0 drivers for NVidia accelerators
    • Intel XPPSL 1.5.1 for Intel Xeon Phi processors
    • Additional machine learning packages such as TensorFlow, Caffe, cuDNN, DIGITS, and their required dependencies are also supported and available for download


    Below are some images of the Bright Cluster Manager 8.0 BrightView:

    Figure 1: Overview of the cluster, displaying total capacity, usage, and job status.


    Figure 2: Cascading view of the cluster configuration and the settings within a group. The settings can be modified and applied from the console.


    Figure 3: The Dell EMC Settings tab shows parsed information on the hardware configuration and the required BIOS-level settings.


    Dell EMC HPC Systems based on the 14th Generation servers expand HPC computation capacity to meet growing demands. They are fully balanced, architected solutions that are validated and verified for customers, and the configurations are scalable. Please stay tuned: follow-on blogs covering performance and application studies will be posted here.


  • Dell EMC HPC System for Research - Keeping it fresh

    Dell EMC has announced an update to the PowerEdge C6320p modular server, introducing support for the Intel® Xeon Phi x200 processor with Intel Omni-Path™ fabric integration (KNL-F).  This update is a processor-only change, which means that changes to the PowerEdge C6320p motherboard were not required.  New purchases of the PowerEdge C6320p server can be configured with KNL or KNL-F processors.  For customers utilizing Omni-Path as a fabric, the KNL-F processor will improve cost and power efficiencies, as it eliminates the need to purchase and power discrete Omni-Path adapters.  Figure 1, below, illustrates the conceptual design differences between the KNL and KNL-F solutions.

    Late last year, we introduced the Dell EMC PowerEdge C6320p server, which delivers a high-performance compute node based on the Intel Xeon Phi processor (KNL). This exciting server is optimized for HPC workloads, supporting highly parallelized processes with up to 72 out-of-order cores in a compact half-width 1U package. High-speed fabric options include InfiniBand or Omni-Path, ideal for data-intensive computational applications such as life sciences and weather simulations.

    Figure 1: Functional design view of KNL and KNL-F Omni-Path support.

    As seen in the figure, the integrated fabric option eliminates the dependency on dual x16 PCIe lanes on the motherboard and allows a denser configuration, with two QSFP connectors on a single carrier circuit board. For continued support of both processors, the PowerEdge C6320p server retains the PCIe signals to the PCIe slots. Inserting the KNL-F processor disables these signals and exposes a connector supporting two QSFP ports carried on an optional adapter that uses the same PCIe x16 slot for power.

    Additional improvements to the PowerEdge C6320p server include support for 64GB LRDIMMs, bumping memory capacity to 384GB, and support for the LSI 2008 RAID controller via the PCIe x4 mezzanine slot.

    Current HPC solution offerings from Dell EMC

    Dell EMC offers several HPC solutions optimized for customer usage and priorities.  Domain-specific HPC compute solutions from Dell EMC include the following scalable options:

    • HPC System for Life Sciences – A customizable and scalable system optimized for the needs of researchers in the biological sciences.
    • HPC System for Manufacturing – A customizable and scalable system designed and configured specifically for engineering and manufacturing solutions including design simulation, fluid dynamics, or structural analysis.
    • HPC System for Research – A highly configurable and scalable platform for supporting a broad set of HPC-related workloads and research users.

    For HPC storage needs, Dell EMC offers two high performance, scalable, and robust options:

    • Dell EMC HPC Lustre Storage - This enterprise solution handles big data and high-performance computing demands with a balanced configuration — designed for parallel input/output — and no single point of failure.
    • Dell EMC HPC NFS Storage Solution – Provides flexible, reliable, and hassle-free storage with high data throughput.


    The Dell EMC HPC System for Research, an ideal HPC platform for IT administrators serving diverse and expanding user demands, now supports KNL-F, with its improved cost and power efficiencies, eliminating the need to purchase and power discrete Omni-Path adapters. 

    Dell EMC is the industry leader in HPC computing, and we are committed to delivering increased capabilities and performance in partnership with Intel and other technology leaders in the HPC community.   To learn more about Dell EMC HPC solutions and services, visit us online.

  • Virtualized HPC Performance with VMware vSphere 6.5 on a Dell PowerEdge C6320 Cluster

    This article presents performance comparisons of several typical MPI applications — LAMMPS, WRF, OpenFOAM, and STAR-CCM+ — running on a traditional, bare-metal HPC cluster versus a virtualized cluster running VMware’s vSphere virtualization platform. The tests were performed on a 32-node, EDR-connected Dell PowerEdge C6320 cluster, located in the Dell EMC HPC Innovation Lab in Austin, Texas. In addition to performance results, virtual cluster architecture and configuration recommendations for optimal performance are described.

    Why HPC virtualization

    Interest in HPC virtualization and cloud has grown rapidly. While much of the interest stems from the general value of cloud technologies, there are specific benefits to virtualizing HPC and supporting it in a cloud environment, such as centralized operation, cluster resource sharing, research environment reproducibility, multi-tenant data security, fault isolation and resiliency, dynamic load balancing, and efficient power management. Figure 1 illustrates several HPC virtualization benefits.

    Despite the potential benefits of moving HPC workloads to a private, public, or hybrid cloud, performance concerns have been a barrier to adoption. We focus here on the use of on-premises, private clouds for HPC — environments in which appropriate tuning can be applied to deliver maximum application performance. HPC virtualization performance is primarily determined by two factors: hardware virtualization support and virtual infrastructure capability. With advances in both VMware vSphere and x86 microprocessor architecture, throughput applications can generally run at close to full speed in the VMware virtualized environment — with less than 5% performance degradation compared to native, and often just 1–2% [1]. MPI applications are by nature more challenging, requiring sustained and intensive communication between nodes, making them sensitive to interconnect performance. With our continued performance optimization efforts, we see decreasing overheads running these challenging HPC workloads [2], and this blog post presents some MPI results as examples.

    Figure 1: Illustration of several HPC virtualization benefits

    Testbed Configuration

    As illustrated in Figure 2, the testbed consists of 32 Dell PowerEdge C6320 compute nodes and one management node. vCenter [3], the vSphere management component, as well as NFS and DNS are running in virtual machines (VMs) on the management node. VMware DirectPath I/O technology [4] (i.e., passthrough mode) is used to allow the guest OS (the operating system running within a VM) to directly access the EDR InfiniBand device, which shortens the message delivery path by bypassing the network virtualization layer to deliver best performance. Native tests were run using CentOS on each host, while virtual tests were run with the VMware ESXi hypervisor running on each host along with a single virtual machine running the same CentOS version.

    Figure 2: Testbed Virtual Cluster Architecture

    Table 1 shows all cluster hardware and software details, and Table 2 shows a summary of BIOS and vSphere settings.

    Table 1: Cluster Hardware and Software Details

    Server: Dell PowerEdge C6320
    Processors: Dual 10-core Intel Xeon E5-2660 v3 @ 2.6 GHz (Haswell)
    Memory: 128GB DDR4
    Interconnect: Mellanox ConnectX-4 VPI adapter card; EDR InfiniBand (100Gb/s)

    VMware vSphere: ESXi hypervisor; vCenter management server

    BIOS, Firmware and OS:
    OS Distribution (virtual and native): CentOS 7.2

    OFED and MPI: Open MPI (LAMMPS, WRF and OpenFOAM); Intel MPI (STAR-CCM+)

    Table 2: BIOS and vSphere Settings

    BIOS Settings
    • Hardware-assisted virtualization: Enabled
    • Power profile: Performance Per Watt (OS)
    • Logical processor: Enabled
    • Node interleaving: Disabled (default)

    vSphere Settings
    • ESXi power policy: Balanced (default)
    • DirectPath I/O: Enabled for EDR InfiniBand
    • VM size: 20 virtual CPUs, 100GB memory
    • Virtual NUMA topology (vNUMA): Auto detected (default)
    • Memory reservation: Fully reserved
    • CPU Scheduler affinity: None (default)

    Figures 3-6 show native versus virtual performance ratios with the settings in Table 2 applied. A value of 1.0 means that virtual performance is identical to native. Applications were benchmarked using a strong scaling methodology — problem sizes remained constant as job sizes were scaled. In the Figure legends, ‘nXnpY’ indicates a test run on X nodes using a total of Y MPI ranks. Benchmark problems were selected to achieve reasonable parallel efficiency at the largest scale tested. All MPI processes were consecutively mapped from node 1 to node 32.

    As can be seen from the results, the majority of tests show degradations under 5%, though overheads increase as we scale. At the highest scale tested (n32np640), performance degradation varies by application and benchmark problem, with the largest degradation seen with LAMMPS atomic fluid (25%) and the smallest with STAR-CCM+ EmpHydroCyclone_30M (6%). Single-node STAR-CCM+ results are anomalous and currently under study. As we continue our performance optimization work, we expect to report better and more scalable results in the future.
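    The ratios plotted in Figures 3-6 come from a simple calculation, sketched below. The formula reflects the methodology described above (1.0 means virtual matches native); the sample timings are made up purely for illustration.

    ```python
    # Native-vs-virtual performance ratio as plotted in Figures 3-6.
    # For a runtime metric (lower is better), the ratio is native_time / virtual_time,
    # so 1.0 means virtual performance is identical to native.

    def perf_ratio(native_time, virtual_time):
        """Virtual-to-native performance ratio (1.0 = parity)."""
        return native_time / virtual_time

    def degradation_pct(native_time, virtual_time):
        """Percent slowdown of virtual relative to native."""
        return (virtual_time - native_time) / native_time * 100

    # Hypothetical timings (seconds) for one benchmark at one scale:
    native, virtual = 100.0, 106.0
    print(f"ratio = {perf_ratio(native, virtual):.3f}")              # 0.943
    print(f"degradation = {degradation_pct(native, virtual):.1f}%")  # 6.0%
    ```

    Under strong scaling, this ratio is computed independently at each node count, which is why virtualization overhead can grow with scale even though the problem size stays fixed.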

    Figure 3: LAMMPS native vs. virtual performance. Higher is better.

    Figure 4: WRF native vs. virtual performance. Higher is better.


    Figure 5: OpenFOAM native vs. virtual performance. Higher is better.

    Figure 6: STAR-CCM+ native vs. virtual performance. Higher is better.

    Best Practices

    The following configurations are suggested to achieve optimal virtual performance for HPC. For more comprehensive vSphere performance guidance, please see [5] and [6].


    • Enable hardware-assisted virtualization features, e.g., Intel VT.
    • Enable logical processors. Though logical processors (hyper-threading) usually do not help HPC performance, enable them but configure each VM's virtual CPUs (vCPUs) to use a physical core, leaving the extra threads/logical cores for ESXi hypervisor helper threads to run.
    • Configure BIOS settings to allow ESXi the most flexibility in using power management features: set the power profile to the “OS Controlled” option so that ESXi controls the power-saving features.
    • Leave node interleaving disabled so the ESXi hypervisor can detect NUMA and apply NUMA optimizations.


    • Configure EDR InfiniBand in DirectPath I/O mode for each VM
    • Properly size VMs:

    MPI workloads are CPU-heavy and can make use of all cores, thus requiring a large VM. However, CPU or memory overcommitment would greatly impact performance. In our tests, each VM is configured with 20 vCPUs, using all physical cores, and 100 GB of fully reserved memory, leaving some free memory to accommodate the ESXi hypervisor's memory overhead.

    • ESXi power management policy:

    There are four ESXi power management policies: “High Performance”, “Balanced” (default), “Low Power”, and “Custom”. Though the “High Performance” policy can slightly improve latency-sensitive workloads, it prevents the system from entering C/C1E states; in situations where system load is low enough for Turbo to operate, this reduces the Turbo Boost benefit. The “Balanced” policy reduces host power consumption while having little or no impact on performance, and it is recommended to keep this default.

    • Virtual NUMA

    Virtual NUMA (vNUMA) exposes NUMA topology to the guest OS, allowing NUMA-aware OSes and applications to make efficient use of the underlying hardware. This is an out-of-the-box feature in vSphere.
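    The sizing rules above — no CPU or memory overcommitment, plus headroom for hypervisor overhead — can be captured in a quick sanity check. The host and VM figures below come from Tables 1 and 2; the 8 GB minimum headroom is our illustrative assumption, not a VMware-specified value.

    ```python
    # Sanity-check a VM sizing against host resources, following the best
    # practices above: vCPUs must not exceed physical cores, and reserved
    # memory must leave headroom for ESXi hypervisor overhead.

    def vm_fits(host_cores, host_mem_gb, vcpus, vm_mem_gb, min_headroom_gb=8):
        """Return True if the VM avoids CPU/memory overcommitment on this host.

        min_headroom_gb is an assumed minimum; actual ESXi overhead depends
        on VM configuration and ESXi version.
        """
        no_cpu_overcommit = vcpus <= host_cores
        mem_headroom_ok = host_mem_gb - vm_mem_gb >= min_headroom_gb
        return no_cpu_overcommit and mem_headroom_ok

    # Testbed node: dual 10-core Haswell (20 physical cores), 128 GB RAM.
    # VM from Table 2: 20 vCPUs, 100 GB fully reserved memory.
    print(vm_fits(host_cores=20, host_mem_gb=128, vcpus=20, vm_mem_gb=100))  # True
    print(vm_fits(host_cores=20, host_mem_gb=128, vcpus=24, vm_mem_gb=100))  # False: CPU overcommit
    ```

    The testbed configuration passes the check: all 20 physical cores are used with no overcommitment, and 28 GB remains free for hypervisor overhead.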

    Conclusion and Future Work

    Virtualization holds promise for HPC, offering new capabilities and increased flexibility beyond what is available in traditional, unvirtualized environments. These values are only useful, however, if high performance can be maintained. In this short post, we have shown that performance degradations for a range of common MPI applications can be kept under 10%, with our highest scale testing showing larger slowdowns in some cases. With throughput applications running at very close to native speeds, and with the results shown here, it is clear that virtualization can be a viable and useful approach for a variety of HPC use-cases. As we continue to analyze and address remaining sources of performance overhead, the value of the approach will only continue to expand.

    If you have any technical questions regarding VMware HPC virtualization, please feel free to contact us!


    These results were produced in collaboration with our colleagues in the Dell EMC HPC Innovation Lab, who gave us access to the compute cluster used to produce these results and to continue our analysis of remaining performance overheads.


    1. J. Simons, E. DeMattia, and C. Chaubal, “Virtualizing HPC and Technical Computing with VMware vSphere,” VMware Technical White Paper.
    2. N. Zhang and J. Simons, “Performance of RDMA and HPC Applications in Virtual Machines using FDR InfiniBand on VMware vSphere,” VMware Technical White Paper.
    3. vCenter Server for vSphere Management, VMware Documentation.
    4. DirectPath I/O, VMware Documentation.
    5. VMware Performance Team, “Performance Best Practices for VMware vSphere 6.0,” VMware Technical White Paper, ***/digitalmarketing/vmware/en/pdf/techpaper/vmware-perfbest-practices-vsphere6-0-white-paper.pdf.
    6. Bhavesh Davda, “Best Practices for Performance Tuning of Latency-Sensitive Workloads in vSphere VMs,” VMware Technical White Paper.

    Na Zhang is a member of the technical staff working on HPC within VMware's Office of the CTO. Her current focus is on the performance and solutions of HPC virtualization. Na holds a Ph.D. in Applied Mathematics from Stony Brook University; her research focused primarily on the design and analysis of parallel algorithms for large- and multi-scale simulations running on supercomputers.