Authors: Frank Han, Rengan Xu, Nishanth Dandapanthula.
HPC Innovation Lab. August 2017
This is one of two articles in our Tesla V100 blog series. In this blog, we present the initial benchmark results of NVIDIA® Tesla® Volta-based V100™ GPUs on four HPC benchmarks, along with a comparative analysis against the previous-generation Tesla P100 GPUs. The other blog in the series discusses V100 and deep learning applications. If you haven’t read it yet, we highly recommend taking a look here.
PowerEdge C4130 with V100 GPU support
The NVIDIA® Tesla® V100 accelerator is one of the most advanced accelerators available in the market right now and was launched within one year of the P100 release. In fact, Dell EMC is the first in the industry to integrate Tesla V100 and bring it to market. As was the case with the P100, V100 supports two form factors: V100-PCIe and the mezzanine version V100-SXM2. The Dell EMC PowerEdge C4130 server supports both types of V100 and P100 GPU cards. Table 1 below notes the major enhancements in V100 over P100:
Table 1: The comparison between V100 and P100
GPU Max Clock rate (MHz)
Memory Clock rate (MHz)
Memory Bandwidth (GB/s)
Interconnect Bandwidth Bi-Directional (GB/s)
Deep Learning (TFlops)
Single Precision (TFlops)
Double Precision (TFlops)
V100 not only significantly improves performance and scalability as will be shown below, but also comes with new features. Below are some highlighted features important for HPC Applications:
Second-Generation NVIDIA NVLink™
All four V100-SXM2 GPUs in the C4130 are connected by NVLink™ and each GPU has six links. The bi-directional bandwidth of each link is 50 GB/s, so the bi-directional bandwidth between different GPUs is 300 GB/s. This is useful for applications requiring a lot of peer-to-peer data transfers between GPUs.
New Streaming Multiprocessor (SM)
The single precision and double precision capability of the new SM is 50% higher than that of the previous P100 for both PCIe and SXM2 form factors. The TDP (Thermal Design Power) of both cards is the same, which means V100 is ~1.5 times more energy efficient than the previous P100.
HBM2 Memory: Faster, Higher Efficiency
The 900 GB/s peak memory bandwidth delivered by V100 is 23% higher than that of P100. In addition, DRAM utilization has improved from 76% to 95%, which allows for a 1.5x improvement in delivered memory bandwidth.
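The 1.5x figure can be checked from the numbers quoted above. The following is a minimal sketch, not a measurement: the P100 peak bandwidth is not stated directly here, so it is derived from the "23% higher" statement, and "delivered" bandwidth is assumed to be simply peak bandwidth times DRAM utilization.

```python
# Figures quoted in the text above.
V100_PEAK_GBS = 900.0          # V100 peak HBM2 bandwidth (GB/s)
PEAK_GAIN_OVER_P100 = 1.23     # "23% higher than P100"
V100_DRAM_UTILIZATION = 0.95
P100_DRAM_UTILIZATION = 0.76

# Assumption: the P100 peak is implied by the 23% figure, not quoted directly.
p100_peak_gbs = V100_PEAK_GBS / PEAK_GAIN_OVER_P100

# Assumption: delivered bandwidth = peak bandwidth * DRAM utilization.
v100_delivered = V100_PEAK_GBS * V100_DRAM_UTILIZATION
p100_delivered = p100_peak_gbs * P100_DRAM_UTILIZATION
delivered_gain = v100_delivered / p100_delivered

print(f"V100 delivered: {v100_delivered:.0f} GB/s")
print(f"P100 delivered: {p100_delivered:.0f} GB/s")
print(f"Delivered gain: {delivered_gain:.2f}x")
```

Under these assumptions the gain works out to a little over 1.5x, in line with the figure quoted for delivered memory bandwidth.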
More in-depth details of all the new features of the V100 GPU card can be found at this NVIDIA website.
Hardware and software specification update
All the performance results in this blog were measured on a PowerEdge C4130 server using Configuration G (4x PCIe V100) and Configuration K (4x V100-SXM2). Both configurations were used previously in the P100 testing. Except for the GPUs, the hardware components are identical to those used in the P100 tests: dual Intel Xeon E5-2690 v4 processors, 256GB (16 x 16GB, 2400 MHz) of memory, and an NFS file system mounted via IPoIB on InfiniBand EDR. Complete specification details are included in our previous blog. If you are interested in C4130 configurations other than G and K, you can find them in our K80 blog.
There are some changes on the software front. To unleash the power of the V100, it was necessary to use the latest version of all software components. Table 2 lists the versions used for this set of performance tests. To keep the comparison fair, the P100 tests were rerun with the new software stack to normalize for the upgraded software.
Table 2: The changes in software versions
Previous version in P100 blog
1.10.7 & 2.1.2
1.10.1 & 2.0.1
Compiled with sm7.0
Compiled with sm6.0
16, AmberTools17 update 20
16 AmberTools16 update3
p2pBandwidthLatencyTest is a micro-benchmark included in the CUDA SDK. It tests the card-to-card bandwidth and latency with and without GPUDirect™ Peer-to-Peer enabled. Since the full output matrix is fairly long, only the unidirectional P2P result is listed below, as an example of how to verify the NVLink speed on both V100 and P100.
In theory, V100 has 6x 25GB/s uni-directional links, giving 150GB/s throughput. The previous P100-SXM2 has only 4x 20GB/s links, delivering 80GB/s. The results of p2pBandwidthLatencyTest on both cards are in Table 3. “D/D” stands for “device-to-device”, that is, the bandwidth available between two devices (GPUs). The achievable bandwidth of GPU0 was calculated by summing the second, third and fourth values in the first row, which represent the throughput from GPU0 to GPU1, GPU2 and GPU3 respectively.
Table 3: Unidirectional peer-to-peer bandwidth
Unidirectional P2P=Enabled Bandwidth Matrix (GB/s). Four GPU cards in the server.
V100-SXM2 on C4130 configuration K is clearly and significantly faster than P100-SXM2 in two respects:
Achievable throughput. V100-SXM2 has 47.88+47.9+47.93 = 143.71 GB/s of aggregate achievable throughput, which is 95.8% of the theoretical 150GB/s and significantly higher than the 73.06GB/s (91.3% of theoretical) on P100-SXM2. The bandwidth for bidirectional traffic is twice that of unidirectional traffic and is also very close to the theoretical 300 GB/s throughput.
Real world application. Symmetric access is key for real world applications. On each chipset, P100 has 4 links, three of which are connected to each of the other three GPUs. The remaining fourth link is connected to one of the other three GPUs. So there are two links between GPU0 and GPU3, but only one link between GPU0 and GPU1 and between GPU0 and GPU2. This is not symmetrical. The p2pBandwidthLatencyTest numbers above, shown in blue, reveal this imbalance: the value between GPU0 and GPU3 reaches 36.39 GB/s, which is double the bandwidth between GPU0 and GPU1 or GPU0 and GPU2. In most real world applications, developers commonly treat all cards equally and do not take such architectural differences into account. The faster pair of GPUs will therefore likely have to wait for the slowest transfers, which means that 18.31 GB/s is the effective speed between all pairs of GPUs.
On the other hand, V100 has a symmetrical design with 6 links, as seen in Figure 1. GPU0 has 2 links to each of GPU1, GPU2 and GPU3. So 47.88 GB/s is the achievable link bandwidth for each pair, which is 2.6 times faster than the P100.
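The percentages and the 2.6x ratio quoted above follow from simple arithmetic on the measured values. The sketch below only reproduces that arithmetic; the helper name `efficiency` is made up for illustration, and all inputs are the numbers quoted in the text.

```python
def efficiency(achieved_gbs, links, per_link_gbs):
    """Fraction of the theoretical unidirectional NVLink bandwidth achieved."""
    theoretical_gbs = links * per_link_gbs
    return achieved_gbs / theoretical_gbs

# V100-SXM2: GPU0 -> GPU1, GPU2, GPU3 (GB/s), as quoted in the text.
v100_gpu0_to_peers = [47.88, 47.90, 47.93]
v100_aggregate = sum(v100_gpu0_to_peers)
v100_eff = efficiency(v100_aggregate, links=6, per_link_gbs=25.0)

# P100-SXM2: aggregate achievable throughput, as quoted in the text.
p100_aggregate = 73.06
p100_eff = efficiency(p100_aggregate, links=4, per_link_gbs=20.0)

# Effective per-pair speed when every transfer waits for the slowest pair.
p100_slowest_pair = 18.31
v100_slowest_pair = min(v100_gpu0_to_peers)
pair_speedup = v100_slowest_pair / p100_slowest_pair

print(f"V100 aggregate: {v100_aggregate:.2f} GB/s ({v100_eff:.1%} of 150 GB/s)")
print(f"P100 aggregate: {p100_aggregate:.2f} GB/s ({p100_eff:.1%} of 80 GB/s)")
print(f"Per-pair speedup: {pair_speedup:.1f}x")
```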
Figure 1: V100 and P100 Topologies on C4130 configuration K
High Performance Linpack (HPL)
Figure 2: HPL Multi-GPU results with V100 and P100 on C4130 configuration G and K
Figure 2 shows the HPL performance on the C4130 platform with 1, 2 and 4 V100-PCIe and V100-SXM2 GPUs installed. P100 performance numbers are also listed for comparison. The following can be observed:
Both P100 and V100 scale well; performance increases as more GPUs are added.
V100 is ~30% faster than P100 on both PCIe (Config G) and SXM2 (Config K).
A single C4130 server with 4x V100 reaches over 20 TFlops on PCIe (Config G).
HPL is a system-level benchmark, and its performance is limited by other components such as CPU, memory and PCIe bandwidth. Configuration G is a balanced design with 2 PCIe links between CPU and GPU, which is why it outperforms configuration K with 4x GPUs in the HPL benchmark. Some other applications do perform better on Configuration K, since SXM2 (Config K) supports NVLink, a higher core clock speed and peer-to-peer data transfer; these are described below.
Figure 3: HPCG Performance results with 4x V100 and P100 on C4130 configuration G and K
HPCG, the High Performance Conjugate Gradients benchmark, is another well-known metric for HPC system ranking. Unlike HPL, its performance is strongly influenced by memory bandwidth. Thanks to the faster and more efficient HBM2 memory of V100, the observed performance improvement over P100 is 44% on both Configuration G and K.
Figure 4: AMBER Multi-GPU results with V100 and P100 on C4130 configuration G and K
Figure 4 illustrates AMBER’s results with the Satellite Tobacco Mosaic Virus (STMV) dataset. On the SXM2 system (Config K), AMBER scales weakly with 2 and 4 GPUs. Even though the scaling is not strong, V100 shows a noticeable improvement over P100, giving a ~78% increase in single card runs, and 1x V100 is actually 23% faster than 4x P100. On the PCIe (Config G) side, 1 and 2 cards perform similarly to SXM2, but the 4-card results drop sharply. This is because PCIe (Config G) only supports Peer-to-Peer access between GPU0/1 and GPU2/3, not among all four GPUs. Since AMBER has redesigned the way data is transferred among GPUs to address the PCIe bottleneck, it relies heavily on Peer-to-Peer access for performance with multiple GPU cards. Hence a fast, direct interconnect like NVLink between all GPUs in SXM2 (Config K) is vital for AMBER multi-GPU performance.
Figure 5: AMBER Multi-GPU Aggregate results with V100 and P100 on C4130 configuration G and K
To compensate for a single job’s weak scaling on multiple GPUs, AMBER developers promote another use case: running multiple jobs concurrently on the same node, with each job using only 1 or 2 GPUs. Figure 5 shows the results of 1-4 individual jobs on one C4130 with V100s, and the numbers indicate that those individual jobs have little impact on each other. This is because AMBER is designed to run almost entirely on the GPUs and has very low dependency on the CPU. The aggregate throughput of multiple individual jobs scales linearly in this case. Without any card-to-card communication, the 5% better performance on SXM2 is attributable to its higher clock speed.
Figure 6: LAMMPS 4-GPU results with V100 and P100 on C4130 configuration G and K
Figure 6 shows LAMMPS performance on both configurations G and K. The test dataset is the Lennard-Jones liquid dataset, which contains 512000 atoms, and LAMMPS was compiled with the kokkos package. V100 is 71% and 81% faster than P100 on Config G and Config K respectively. Comparing V100-SXM2 (Config K) with V100-PCIe (Config G), the former is 5% faster due to NVLink and a higher CUDA core frequency.
Figure 7: V100 Speedups on C4130 configuration G and K
The C4130 server with NVIDIA® Tesla® V100™ GPUs demonstrates exceptional performance for HPC applications that require faster computational speed and higher data throughput. Applications like HPL and HPCG benefit from the additional PCIe links between CPU and GPU offered by Dell PowerEdge C4130 configuration G. On the other hand, applications like AMBER and LAMMPS are boosted by C4130 configuration K, owing to P2P access, the higher bandwidth of NVLink and a higher CUDA core clock speed. Overall, a PowerEdge C4130 with Tesla V100 GPUs performs 1.24x to 1.8x faster than a C4130 with P100 for HPL, HPCG, AMBER and LAMMPS.
HPC Innovation Lab. September 2017
In this blog, we introduce the NVIDIA Tesla Volta-based V100 GPU and evaluate it with different deep learning frameworks. We compare the performance of the V100 and P100 GPUs, and we also evaluate two types of V100: V100-PCIe and V100-SXM2. The results indicate that in training V100 is ~40% faster than P100 with FP32 and >100% faster than P100 with FP16, and in inference V100 is 3.7x faster than P100. This blog is part of our Tesla V100 blog series. Another blog in the series covers general HPC application performance on V100, and you can read it here.
Introduction to V100 GPU
At the 2017 GPU Technology Conference (GTC), NVIDIA announced the Volta-based V100 GPU. As with P100, there are two types of V100: V100-PCIe and V100-SXM2. V100-PCIe GPUs are interconnected by PCIe buses, with a bi-directional bandwidth of up to 32 GB/s. V100-SXM2 GPUs are interconnected by NVLink; each GPU has six links and the bi-directional bandwidth of each link is 50 GB/s, so the bi-directional bandwidth between different GPUs is up to 300 GB/s. V100 adds a new type of core, the tensor core, designed specifically for deep learning. These cores are essentially a collection of ALUs for performing 4x4 matrix operations, specifically a fused multiply-add (A*B+C): two 4x4 FP16 matrices are multiplied together and the product is added to a 4x4 FP16/FP32 matrix to generate a final 4x4 FP16/FP32 matrix. By fusing the matrix multiplication and the addition in one unit, the GPU can achieve high FLOPS for this operation. A single Tensor Core performs the equivalent of 64 FMA operations per clock (128 FLOPS in total), and with 8 such cores per Streaming Multiprocessor (SM), this amounts to 1024 FLOPS per clock per SM. By comparison, even with pure FP16 operations, the standard CUDA cores in an SM only generate 256 FLOPS per clock. So in scenarios where these cores can be used, V100 is able to deliver 4x the performance versus P100. The detailed comparison between V100 and P100 is in Table 1.
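The FLOPS arithmetic above can be reproduced by counting the multiply-adds in one 4x4 matrix operation. This is only an illustrative sketch: it uses ordinary Python floats rather than FP16/FP32, it says nothing about how the hardware schedules the work, and the function name `matmul_add_4x4` is made up for this example.

```python
def matmul_add_4x4(a, b, c):
    """Return (A*B + C, number of fused multiply-adds) for 4x4 matrices."""
    n = 4
    d = [row[:] for row in c]          # start from C, accumulate A*B into it
    fma_count = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                d[i][j] += a[i][k] * b[k][j]   # one fused multiply-add
                fma_count += 1
    return d, fma_count

identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
ones = [[1.0] * 4 for _ in range(4)]
d, fmas_per_op = matmul_add_4x4(identity, ones, ones)

flops_per_tensor_core = 2 * fmas_per_op       # each FMA = 1 multiply + 1 add
TENSOR_CORES_PER_SM = 8                       # from the text
CUDA_CORE_FP16_FLOPS_PER_SM = 256             # from the text
tensor_flops_per_sm = flops_per_tensor_core * TENSOR_CORES_PER_SM
ratio = tensor_flops_per_sm / CUDA_CORE_FP16_FLOPS_PER_SM

print(fmas_per_op, flops_per_tensor_core, tensor_flops_per_sm, ratio)
```

The three nested loops of length 4 give 64 multiply-adds, which is where the 128 FLOPS per tensor core, 1024 FLOPS per SM and the 4x ratio come from.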
As in our previous deep learning blog, we use the three most popular deep learning frameworks: NVIDIA’s fork of Caffe (NV-Caffe), MXNet and TensorFlow. Both NV-Caffe and MXNet have been optimized for V100. TensorFlow does not yet have an official release that supports V100, but we applied some patches obtained from TensorFlow developers so that it is also optimized for V100 in these tests. For the dataset, we again use the ILSVRC 2012 dataset, which contains 1281167 training images and 50000 validation images. For the test neural network, we chose Resnet50 as it is a computationally intensive network. To get the best performance, we used the CUDA 9-rc compiler and the cuDNN library in all three frameworks, since they are optimized for V100. The testing platform is Dell EMC’s PowerEdge C4130 server. The C4130 server has multiple configurations, and we evaluated both P100-PCIe in configuration G and P100-SXM2 in configuration K. The difference between configuration G and configuration K is shown in Figure 1. There are two main differences: first, configuration G has two x16 PCIe links connecting the dual CPUs to the four GPUs, while configuration K has only one x16 PCIe bus connecting one CPU to the four GPUs; second, the GPUs are connected by PCIe buses in configuration G but by NVLink in configuration K. The other hardware and software details are shown in Table 2.
Figure 1: Comparison between configuration G and configuration K
Table 2: The hardware configuration and software details
In this experiment, we trained with each deep learning framework for one pass over the whole dataset, since we were comparing only the training speed, not the training accuracy. Other important input parameters for the different deep learning frameworks are listed in Table 3. For NV-Caffe and MXNet, we doubled the batch size for the FP16 tests, since FP16 values consume half the memory of FP32 values. As TensorFlow does not support FP16 yet, we did not evaluate its FP16 performance in this blog. Because of implementation differences, NV-Caffe consumes more memory than MXNet and TensorFlow for the same neural network, so its batch size in FP32 mode is only half of that used in MXNet and TensorFlow. In NV-Caffe, if FP16 is used, the data type of several parameters needs to be changed. These parameters are as follows: solver_data_type controls the data type for master weights; default_forward_type and default_backward_type control the data type for training values; default_forward_math and default_backward_math control the data type for the matrix-multiply accumulator. In this blog we used FP16 for training values, FP32 for the matrix-multiply accumulator and FP32 for master weights. We will explore other combinations in future blogs. In MXNet, we tried different values for the parameter “--data-nthreads”, which controls the number of threads for data decoding.
Table 3: Input parameters used in different deep learning frameworks
Figure 2, Figure 3, and Figure 4 show the performance of V100 versus P100 with NV-Caffe, MXNet and TensorFlow, respectively. Table 4 shows the performance improvement of V100 compared to P100. From these results, we can draw the following conclusions:
In both PCIe and SXM2 versions, V100 is >40% faster than P100 in FP32 for both NV-Caffe and MXNet. This matches the theoretical speedup, because FP32 is single precision floating point and V100 is 1.5x faster than P100 in single precision. With TensorFlow, V100 is more than 30% faster than P100. This improvement is lower than for the other two frameworks, which we think is due to different algorithm implementations in these frameworks.
In both PCIe and SXM2 versions, V100 is >2x faster than P100 in FP16. Based on the specification, V100 tensor performance is ~6x that of P100 FP16. The actual speedup does not match the theoretical speedup because not all data are stored in FP16, and so not all operations are tensor operations (the FMA matrix multiply and add operation).
On V100, the performance of FP16 is close to 2x that of FP32. This is because FP16 requires only half the storage of FP32, so we could double the batch size in FP16 to improve the computation speed.
In MXNet, we set “--data-nthreads” to 16 instead of the default value of 4. The default value is often sufficient to decode more than 1K images per second, but that is still not fast enough for the V100 GPU. In our testing, we found that the default value of 4 is enough for P100, but for V100 we need to set it to at least 12 to achieve good performance, with a value of 16 being ideal.
Figure 2: Performance of V100 vs P100 with NV-Caffe
Figure 3: Performance of V100 vs P100 with MXNet
Figure 4: Performance of V100 vs P100 with TensorFlow
Table 4: Improvement of V100 compared to P100
Since V100 supports both deep learning training and inference, we also tested inference performance on V100 using the latest TensorRT 3.0.0. The testing was done in FP16 mode on both V100-SXM2 and P100-PCIe, and the result is shown in Figure 5. We used batch size 39 for V100 and 10 for P100. The different batch sizes were chosen so that the inference latencies are close to each other (~7ms in the figure). The result shows that, at similar latency, the inference throughput of V100 is 3.7x faster than P100.
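The relationship between batch size, latency and throughput used in this comparison can be sketched as follows. This is a simplified model that assumes throughput is batch size divided by per-batch latency and that both GPUs run at exactly 7 ms, which the text only states approximately; the `throughput` helper is made up for illustration.

```python
def throughput(batch_size, latency_ms):
    """Images per second, assuming one batch completes every latency_ms."""
    return batch_size / (latency_ms / 1000.0)

# Assumption: both GPUs at exactly 7 ms (the text says "~7ms").
ASSUMED_LATENCY_MS = 7.0

v100_images_per_s = throughput(39, ASSUMED_LATENCY_MS)
p100_images_per_s = throughput(10, ASSUMED_LATENCY_MS)
ratio_at_equal_latency = v100_images_per_s / p100_images_per_s

print(f"V100: {v100_images_per_s:.0f} images/s")
print(f"P100: {p100_images_per_s:.0f} images/s")
print(f"Ratio at identical latency: {ratio_at_equal_latency:.1f}x")
```

With identical latencies the ratio would simply equal the batch-size ratio, 3.9x. The measured 3.7x is slightly lower, which would be consistent with the two latencies being close but not identical.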
Figure 5: Resnet50 inference performance on V100 vs P100
Conclusions and Future Work
After evaluating the performance of V100 with three popular deep learning frameworks, we conclude that in training V100 is more than 40% faster than P100 in FP32 and more than 100% faster in FP16, and in inference V100 is 3.7x faster than P100. This demonstrates the performance benefits when the V100 tensor cores are used. In future work, we will evaluate different data type combinations in FP16 and study the accuracy impact of FP16 in deep learning training. We will also evaluate TensorFlow with FP16 once support is added to the software. Finally, we plan to scale the training to multiple nodes with these frameworks.
Garima Kochhar, Kihoon Yoon, Joshua Weage. HPC Innovation Lab. 25 Sep 2017.
The document below presents performance results on Skylake based Dell EMC 14th generation systems for a variety of HPC benchmarks and applications (Stream, HPL, WRF, BWA, Ansys Fluent, STAR-CCM+ and LS-DYNA). It compares the performance of these new systems to previous generations, going as far back as Westmere (Intel Xeon 5500 series) in Dell EMC's 11th generation servers, showing the potential improvements when moving to this latest instance of server technology.
Authors: Somanath Moharana and Ashish Kumar Singh, Dell EMC HPC Innovation Lab, September 2017
This blog presents an analysis of the High Performance Conjugate Gradient (HPCG) benchmark on the Intel(R) Xeon(R) Gold 6150 CPU, codenamed “Skylake”. It also compares the performance of Intel(R) Xeon(R) Gold 6150 processors with that of the previous generation Intel(R) Xeon(R) CPU E5-2697 v4 processors, codenamed “Broadwell-EP”.
Introduction to HPCG
The High Performance Conjugate Gradients (HPCG) Benchmark is a metric for ranking HPC systems. HPCG can be considered a complement to the High Performance LINPACK (HPL) benchmark. HPCG is designed to exercise computational and data access patterns that more closely match a different and broad set of important applications, and to encourage system designers to invest in capabilities that affect the collective performance of these applications.
The HPCG benchmark is based on a 3D regular 27-point discretization of an elliptic partial differential equation. The 3D domain is scaled to fill a 3D virtual process grid for all of the available MPI ranks. The preconditioned conjugate gradient (CG) algorithm is used to solve the intermediate systems of equations and incorporates a local and symmetric Gauss-Seidel pre-conditioning step that requires a triangular forward solve and a backward solve. The benchmark exhibits irregular accesses to memory and fine-grain recursive computations.
HPCG has four computational blocks, Sparse Matrix-vector multiplication (SPMV), Symmetric Gauss-Seidel (SymGS), vector update phase (WAXPBY) and Dot Product (DDOT), and two communication blocks, MPI_Allreduce and Halos Exchange.
Introduction to Intel Skylake processor
Intel Skylake is a microarchitecture redesign using the same 14 nm manufacturing process technology with support for up to 28 cores per socket, serving as a "tock" in Intel's "tick-tock" manufacturing and design model. It supports 6 DDR4 memory channels per socket with 2 DPC (DIMMs per channel), at supported memory speeds of up to 2666 MT/s.
Please visit the BIOS characteristics of Skylake processor blog for a better understanding of Skylake processors and their BIOS features on Dell EMC platforms.
Table 1: Details of Servers used for HPCG analysis
2 x Intel(R) Xeon(R) Gold 6150 @2.7GHz, 18c
2 x Intel(R) Xeon(R) CPU E5-2697 v4 @2.3GHz, 18c
192GB (12 x 16GB) DDR4
128GB( 8 x 16GB ) DDR4
Intel Omni Path
Intel Omni path
Red Hat Enterprise Linux Server release 7.3
Red Hat Enterprise Linux Server release 7.2
Intel® MKL 2017.0.3
Intel® MKL 2017.0.0
Processor Settings > Logical Processors
Processor Settings > Sub NUMA cluster
HPCG Performance analysis with Intel Skylake
In HPCG, the problem size has to be chosen carefully to get the best results. For a valid run, the problem size should be large enough that the arrays accessed in the CG iteration loop do not fit in the cache of the device. The problem size should also be large enough to occupy a significant fraction of “main memory”, at least 1/4th of the total.
Adjusting the local domain dimensions changes the global problem size. For HPCG performance characterization, we chose local domain dimensions of 160^3, 192^3 and 224^3 with an execution time of t=30 seconds. The local domain dimension defines the global domain dimension as (npx*Nx) x (npy*Ny) x (npz*Nz), where Nx=Ny=Nz=160, 192 or 224 and npx x npy x npz is the 3D process grid whose product is NR, the number of MPI processes used for the benchmark.
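As a rough illustration of how the local dimensions determine the global problem, the sketch below assumes the MPI ranks are arranged in a 3D process grid and that each local dimension is multiplied by the number of ranks along that axis. The 2 x 2 x 1 grid is a hypothetical arrangement chosen for the example, not a value taken from these runs, and `global_domain` is an illustrative helper.

```python
from math import prod

def global_domain(local_dims, process_grid):
    """Global grid size given per-rank local dims and a 3D process grid."""
    return tuple(n * p for n, p in zip(local_dims, process_grid))

local_dims = (192, 192, 192)     # best-performing local size in this study
process_grid = (2, 2, 1)         # hypothetical arrangement of 4 MPI ranks

ranks = prod(process_grid)
global_dims = global_domain(local_dims, process_grid)
global_points = prod(global_dims)

print(f"{ranks} ranks -> global domain {global_dims}, {global_points} grid points")
```

Because every rank holds the same local grid, the total number of grid points is always the number of ranks times the local grid size, regardless of how the ranks are arranged.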
Figure 1: HPCG Performance on multiple grid sizes with Intel Xeon Gold 6150 processors
As shown in Figure 1, the local grid size of 192^3 gives the best performance compared to the other local grid sizes, 160^3 and 224^3. With it we get 36.14 GFLOP/s on a single node, and performance increases linearly with the number of nodes. All these tests were carried out with 4 MPI processes and 9 OpenMP threads per MPI process.
Figure 2: Time consumed by HPCG computational routines on Intel Xeon Gold 6150 processors
The time spent in each routine is reported in the HPCG output file and is shown in Figure 2. As the graph shows, HPCG spends most of its time in the compute-intensive SymGS pre-conditioning and in the sparse matrix-vector multiplication (SPMV). The vector update phase (WAXPBY) takes far less time than SymGS, and the residual calculation (DDOT) takes the least time of all four computation routines. As the local grid size is the same across all multi-node runs, the time spent in each of the four compute kernels is approximately the same for every multi-node run.
Figure 3: HPCG performance over multiple generations of Intel processors
Figure 3 compares HPCG performance between Intel Broadwell-EP processors and Intel Skylake processors. The dots in the figure show the performance improvement of Intel Skylake over Broadwell-EP processors. For a single node, we observe ~65% better performance with Skylake processors than with Broadwell-EP processors, and ~67% better performance for both two nodes and four nodes.
HPCG with Intel(R) Xeon(R) Gold 6150 processors shows ~65% higher performance than with Intel(R) Xeon(R) CPU E5-2697 v4 processors. HPCG scales out well, showing a linear increase in performance as the number of nodes increases.
Author: Somanath Moharana, Dell EMC HPC Innovation Lab, August 2017
This blog explores the performance of the four socket Dell EMC PowerEdge R940 server with Intel Skylake processors. The latest Dell EMC 14th generation servers support the new Intel® Xeon® Processor Scalable Family (processor architecture codenamed “Skylake”), and the increased number of cores and higher memory speed benefit a wide variety of HPC applications.
The PowerEdge R940 is Dell EMC’s latest 4-socket, 3U rack server designed to run complex workloads, supporting up to 6TB of DDR4 memory and up to 122 TB of storage. The system features the Intel® Xeon® Scalable Processor Family, 48 DDR4 DIMMs, up to 13 PCI Express® (PCIe) 3.0 enabled expansion slots and a choice of embedded NIC technologies. It is a general-purpose platform capable of handling demanding workloads and applications, such as data warehouses, ecommerce, databases, and high-performance computing (HPC). Its increased storage capacity makes the PowerEdge R940 well-suited for data intensive applications that require greater storage.
This blog also describes the impact of BIOS tuning options on HPL, STREAM and the scientific applications ANSYS Fluent and WRF, and compares the performance of the new PowerEdge R940 to the previous generation PowerEdge R930 platform. It also analyses the performance with Sub NUMA Cluster (SNC) modes (SNC=Enabled and SNC=Disabled). With SNC enabled, eight NUMA nodes are exposed to the OS on a four socket PowerEdge R940. Each NUMA node can communicate with seven other remote NUMA nodes: six in the other three sockets and one within the same socket. NUMA domains on different sockets communicate over the UPI interconnect. Please visit the BIOS characteristics of Skylake processor blog for more details on BIOS options. Table 1 lists the server configuration and the application details used for this study.
Table 1: Details of Server and HPC Applications used for R940 analysis
4 x Intel Xeon E7-8890 firstname.lastname@example.orgGHz (18 cores) 45MB L3 cache 165W
4 x Intel Xeon E7-8890 email@example.comGHz (24 cores) 60MB L3 cache 165W
4 x Intel Xeon Platinum firstname.lastname@example.orgGHz, 10.4GT/s (Cross-bar connection)
1024 GB = 64 x 16GB DDR4 @1866MHz
1024 GB = 32 x 32GB DDR4 @1866 MHz
384GB = (24 x 16GB) DDR4@2666MT/s
Intel QuickPath Interconnect (QPI) 8GT/s
Intel Ultra Path Interconnect (UPI) 10.4GT/s
Processor Settings > UPI Speed
Maximum Data Rate
Software and Firmware
Red Hat Enterprise Linux Server release 6.6
Red Hat Enterprise Linux Server release 7.3 (3.10.0-514.el7.x86_64)
2017 Update 3
Benchmark and Applications
V2.1 from MKL 11.3
V2.1 from MKL update 3
v5.10, Array Size 1800000000, Iterations 100
v5.4, Array Size 1800000000, Iterations 100
V3.5.1 Input Data Conus12KM, Netcdf-4.3.1
V3.8 Input Data Conus12KM, Netcdf-4.4.0
v3.8.1, Input Data Conus12KM, Conus2.5KM, Netcdf-4.4.2
v15, Input Data: truck_poly_14m
v16, Input Data: truck_poly_14m
v17.2, Input Data: truck_poly_14m, aircraft_wing_14m, ice_2m, combustor_12m, exhaust_system_33m
Note: The software versions were different for the older generation processors, and results are compared against the best configuration at that time. Because of the large architectural changes in servers and processors from generation to generation, the changes in software versions are not a significant factor.
The High Performance Linpack (HPL) Benchmark is a measure of a system's floating point computing power. It measures how fast a computer solves a dense n by n system of linear equations Ax = b, which is a common task in engineering. HPL was run with a block size of NB=384 and a problem size of N=217754. Since HPL is an AVX-512-enabled workload, we calculate the theoretical maximum HPL performance as (rated base frequency of processor * number of cores * 32 FLOP/cycle).
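The peak formula and the memory footprint of the chosen problem size can be sketched as below. The processor frequency and core count in the example call are hypothetical placeholders, not the values of the system under test, and the function names are made up for illustration. The matrix footprint assumes HPL stores an N x N matrix of 8-byte double precision values.

```python
def hpl_peak_gflops(base_ghz, total_cores, flop_per_cycle=32):
    """Theoretical peak in GFLOP/s: base frequency * cores * FLOP per cycle."""
    return base_ghz * total_cores * flop_per_cycle

def hpl_matrix_bytes(n, bytes_per_element=8):
    """Memory needed for the dense N x N double precision matrix."""
    return n * n * bytes_per_element

# Problem size used in this study.
N = 217754
matrix_bytes = hpl_matrix_bytes(N)
matrix_gib = matrix_bytes / 2**30

# Hypothetical example only: 2.0 GHz base frequency, 80 cores in total.
example_peak = hpl_peak_gflops(base_ghz=2.0, total_cores=80)

print(f"N={N} needs {matrix_gib:.1f} GiB for the matrix")
print(f"Example peak: {example_peak:.0f} GFLOP/s")
```

The matrix for N=217754 occupies roughly 353 GiB, so the problem size fills most of the memory of a 384GB system.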
Figure 1: Comparing HPL Performance across BIOS profiles
Figure 1 depicts the performance of the PowerEdge R940 server described in Table 1 with different BIOS options. Here “Performance SNC = Disabled” gives better performance than the other BIOS profiles. With “SNC = Disabled” we observe 1-2% better performance than with “SNC = Enabled” for all the BIOS profiles.
Figure 2: HPL performance with AVX2 and AVX512 instruction sets
Figure 3: HPL Performance over multiple generations of processors
Figure 2 compares the performance of HPL run with the AVX2 and AVX512 instruction sets on the PowerEdge R940 (where AVX = Advanced Vector Extensions). AVX-512 is a set of 512-bit extensions to the 256-bit AVX SIMD instructions for the x86 instruction set architecture. The instruction set can be selected by setting the “MKL_ENABLE_INSTRUCTIONS=AVX2/AVX512” environment variable. Running HPL with the AVX512 instruction set gives around a 75% improvement in performance compared to the AVX2 instruction set.
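One way to switch between the two instruction sets from a launcher script is to build a separate environment for each run. This is only a sketch: the `mkl_env` helper is made up for illustration, and the benchmark binary in the commented-out line is a hypothetical path, not the command used in this study.

```python
import os

def mkl_env(instruction_set):
    """Copy the current environment and select the MKL instruction set."""
    allowed = {"AVX2", "AVX512"}
    if instruction_set not in allowed:
        raise ValueError(f"expected one of {sorted(allowed)}, got {instruction_set!r}")
    env = dict(os.environ)
    env["MKL_ENABLE_INSTRUCTIONS"] = instruction_set
    return env

avx2_env = mkl_env("AVX2")
avx512_env = mkl_env("AVX512")

# Hypothetical launch, left commented out because the binary path is an assumption:
# import subprocess
# subprocess.run(["./xhpl"], env=avx512_env, check=True)

print(avx2_env["MKL_ENABLE_INSTRUCTIONS"], avx512_env["MKL_ENABLE_INSTRUCTIONS"])
```

Keeping the two environments as separate dictionaries avoids changing the variable for the calling process, so back-to-back AVX2 and AVX512 runs do not interfere with each other.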
Figure 3 compares the results of the four socket R930 powered by Haswell-EX and Broadwell-EX processors with the R940 powered by Skylake processors. For HPL, the R940 server performed ~192% better than the R930 server with four Haswell-EX processors and ~99% better than with Broadwell-EX processors. The performance improvement of Skylake over Broadwell-EX is due to a 27% increase in the number of cores and a 75% increase in performance from AVX-512 vector instructions.
The Stream benchmark is a synthetic benchmark program that measures sustainable memory bandwidth and the corresponding computation rate for simple vector kernels. The STREAM benchmark calculates the memory bandwidth by counting only the bytes that the user program requested to be loaded or stored. This study uses the results reported by the TRIAD function of the stream bandwidth test.
Figure 4: STREAM Performance across BIOS profiles
Figure 5: STREAM Performance over multiple generations of processors
As shown in Figure 4, with “SNC = Enabled” we get up to 3% better bandwidth than with “SNC = Disabled” across all BIOS profiles. Figure 5 compares the memory bandwidth of the PowerEdge R930 server with Haswell-EX and Broadwell-EX processors and the PowerEdge R940 server with Skylake processors. Both Haswell-EX and Broadwell-EX support DDR3 and DDR4 memory, and the platform in this configuration supports a memory frequency of 1600MT/s for both generations of processors. Because DIMMs of the same memory frequency were used for both generations on the PowerEdge R930, Broadwell-EX and Haswell-EX processors have the same memory bandwidth, but there is a ~51% increase in memory bandwidth with Skylake compared to Broadwell-EX processors. The first factor is the use of 2666MT/s RDIMMs, which gives a ~66% increase in maximum memory bandwidth compared to Broadwell-EX; the second is a ~50% increase in the number of memory channels per socket, from 4 channels per socket on Broadwell-EX processors to 6 on Skylake processors.
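The two contributing factors can be reproduced with a simple channel-count times transfer-rate model. This sketch assumes 8 bytes per transfer on each DDR channel and ignores everything else about the memory subsystem, so it illustrates where the ~66% and ~50% figures come from rather than predicting the measured STREAM result. The helper name is made up for illustration.

```python
def peak_bandwidth_gbs(channels, mega_transfers_per_s, bytes_per_transfer=8):
    """Per-socket theoretical DDR bandwidth in GB/s (simple model)."""
    return channels * mega_transfers_per_s * bytes_per_transfer / 1000.0

# Values quoted in the text.
skylake_peak = peak_bandwidth_gbs(channels=6, mega_transfers_per_s=2666)
broadwell_ex_peak = peak_bandwidth_gbs(channels=4, mega_transfers_per_s=1600)

speed_gain = 2666 / 1600 - 1      # higher memory transfer rate
channel_gain = 6 / 4 - 1          # more channels per socket

print(f"Skylake per-socket peak:      {skylake_peak:.1f} GB/s")
print(f"Broadwell-EX per-socket peak: {broadwell_ex_peak:.1f} GB/s")
print(f"Transfer-rate gain: {speed_gain:.0%}, channel gain: {channel_gain:.0%}")
```

The measured STREAM gain reported above (~51%) is lower than the product of these two factors, so the model should be read as an upper bound on each factor and not as an estimate of delivered bandwidth.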
Figure 6: Comparing STREAM Performance with “SNC = Enabled”
Figure 7: Comparing STREAM Performance with “SNC = Disabled”
Figure 6 and Figure 7 describe the impact on memory bandwidth of traversing the UPI link to go across sockets on the PowerEdge R940 server. With “SNC = Enabled”, the local memory bandwidth and the remote memory bandwidth within the same socket are nearly the same (0-1% variation), but for memory on another socket the bandwidth is ~57% lower than the local memory bandwidth. With “SNC = Disabled”, remote memory bandwidth is 77% lower than local memory bandwidth.
The Weather Research and Forecasting (WRF) Model is a mesoscale numerical weather prediction system designed for both atmospheric research and operational forecasting needs. It features two dynamical cores, a data assimilation system, and a software architecture facilitating parallel computation and system extensibility. The model serves a wide range of meteorological applications across scales from tens of meters to thousands of kilometers. WRF can generate atmospheric simulations using real data or idealized conditions. We used the CONUS12km and CONUS2.5km benchmark datasets for this study.
CONUS12km is a small, single-domain benchmark (a 48-hour, 12km resolution case over the Continental U.S. (CONUS) domain from October 24, 2001) with a 72-second time step. CONUS2.5km is a large, single-domain benchmark (the latter 3 hours of a 9-hour, 2.5km resolution case over the CONUS domain from June 4, 2005) with a 15-second time step.
WRF decomposes the domain into tasks or patches. Each patch can be further decomposed into tiles that are processed separately, but by default there is only one tile per run. If that single tile is too large to fit into the cache of the CPU and/or core, computation slows down because WRF is sensitive to memory bandwidth. To reduce the tile size, increase the number of tiles by defining “numtiles = x” in the input file or by setting the environment variable “WRF_NUM_TILES = x”. For both CONUS12km and CONUS2.5km, the number of tiles was chosen based on best performance.
Figure 8: WRF Performance across BIOS profiles (CONUS12km)
Figure 9: WRF Performance across BIOS profiles (CONUS2.5km)
Figure 8 and Figure 9 compare the WRF datasets across BIOS profiles. For both CONUS12km and CONUS2.5km, the “Performance” profile with “SNC = Enabled” gives the best performance, and “SNC = Enabled” performs ~1-2% better than “SNC = Disabled”. Performance is nearly equal across BIOS profiles for CONUS12km because it uses a smaller dataset, while CONUS2.5km, with its larger dataset, shows 1-2% performance variation across the system profiles because it utilizes a larger number of processors more efficiently.
Figure 10: Comparison over multiple generations of processors
Figure 11: Comparison over multiple generations of processors
Figure 10 and Figure 11 show the performance comparison between the PowerEdge R940 with Skylake processors and the PowerEdge R930 with Broadwell-EX and Haswell-EX processors. From Figure 11 we can observe that for CONUS12km the PowerEdge R940 with Skylake is ~18% better than the PowerEdge R930 with Broadwell-EX and ~45% better than with Haswell-EX processors. For CONUS2.5km, Skylake performs ~29% better than Broadwell-EX and ~38% better than Haswell-EX processors.
ANSYS Fluent is a computational fluid dynamics (CFD) software tool. Fluent includes well-validated physical modeling capabilities to deliver fast and accurate results across the widest range of CFD and multiphysics applications.
Figure 12: Ansys Fluent Performance across BIOS profiles
We used five datasets for our analysis: truck_poly_14m, combuster_12m, exhaust_system_33m, ice_2m and aircraft_wing_14m. Fluent reports ‘Solver Rating’ (higher is better) as the performance metric. Figure 12 shows that all datasets performed better with the “Performance” profile and “SNC = Enabled” than with the other BIOS options. For all datasets, “SNC = Enabled” performs 2% to 4% better than “SNC = Disabled”.
Figure 13: Ansys Fluent (truck_poly_14m) performance over multiple generations of Intel processors
Figure 13 shows the performance comparison of truck_poly_14m on the PowerEdge R940 with Skylake processors and the PowerEdge R930 with Broadwell-EX and Haswell-EX processors. On the PowerEdge R940, Fluent showed 46% better performance than on the PowerEdge R930 with Broadwell-EX and 87% better performance than with Haswell-EX processors.
The PowerEdge R940 is a highly efficient, next-generation 4-socket platform that provides up to 122TB of storage capacity with 6.3TF of computing power options, making it well suited for data-intensive applications without sacrificing performance. The Skylake processors give the PowerEdge R940 a performance boost over the previous-generation server (PowerEdge R930); we observed more than 45% performance improvement across all the applications.
Based on the above analysis, the “Performance” system profile gives better performance than the other system profiles.
In conclusion, the PowerEdge R940 with Skylake processors is a good platform for a wide variety of applications and may fulfill the demand for more compute power in HPC applications.
Authors: Rengan Xu, Frank Han and Nishanth Dandapanthula. Dell EMC HPC Innovation Lab. Feb 2017
Introduction to P100-PCIe GPU
This blog describes the performance analysis of NVIDIA® Tesla® P100™ GPUs on a cluster of Dell PowerEdge C4130 servers. There are two types of P100 GPUs: PCIe-based and SXM2-based. In PCIe-based servers, GPUs are connected by PCIe buses and one P100 delivers around 4.7 and 9.3 TeraFLOPS of double and single precision performance, respectively. In P100-SXM2, GPUs are connected by NVLink and one P100 delivers around 5.3 and 10.6 TeraFLOPS of double and single precision performance, respectively. This blog focuses on P100 for PCIe-based servers, i.e. P100-PCIe. We have already analyzed the P100 performance for several deep learning frameworks in this blog. The objective of this blog is to compare the performance of HPL, LAMMPS, NAMD, GROMACS, HOOMD-blue, Amber, ANSYS Mechanical and RELION. The hardware configuration of the cluster is the same as in the deep learning blog. Briefly, we used a cluster of four C4130 nodes; each node has dual Intel Xeon E5-2690 v4 CPUs and four NVIDIA P100-PCIe GPUs, and all nodes are connected with EDR InfiniBand. Table 1 shows the detailed information about the hardware and software used in every compute node.
Table 1: Experiment Platform and Software Details
PowerEdge C4130 (configuration G)
2 x Intel Xeon CPU E5-2690 v4 @2.6GHz (Broadwell)
256GB DDR4 @ 2400MHz
P100-PCIe with 16GB GPU memory
Mellanox ConnectX-4 VPI (EDR 100Gb/s Infiniband)
RHEL 7.2 x86_64
Linux Kernel Version
CUDA version and driver: CUDA 8.0.44 (375.20)
HPL is a multicomputer parallel application that measures how fast computers solve a dense n by n system of linear equations using LU decomposition with partial row pivoting, and it is designed to be run at very large scale. HPL on the tested cluster uses double precision floating point operations. Figure 1 shows the HPL performance on the tested P100-PCIe cluster. 1 P100 is 3.6x faster than 2 x E5-2690 v4 CPUs. HPL also scales very well with more GPUs within nodes or across nodes. Recall that there are 4 P100 GPUs per server, so 8, 12 and 16 P100 GPUs correspond to 2, 3 and 4 servers. 16 P100 GPUs achieve a speedup of 14.9x compared to 1 P100. Note that the overall efficiency is calculated as: HPL Efficiency = rMax / (CPUs rPeak + GPUs rPeak), where rPeak is the highest theoretical FLOPS result and rMax is the number reported by HPL, i.e. the real performance that can be achieved. HPL cannot be run at the max boost clock. It typically runs at some clock in between, but the average is closer to the base clock than to the max boost clock, which is why the efficiency is not very high. Although we included CPU rPeak in the efficiency calculation, when running HPL on P100 we set DGEMM_SPLIT=1.0, which means the CPU is not really contributing to the DGEMM. We observed that the CPUs stayed fully utilized, but they were only handling the overhead and data movement to keep the GPUs fed, so they did not contribute many FLOPS. What matters most for P100 is that rMax is really big.
Figure 1: HPL performance on P100-PCIe
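The efficiency formula above can be sketched as a small helper. The GPU rPeak of 4.7 TFLOPS comes from the introduction; the CPU rPeak is an assumption derived from 14 cores x 2.6 GHz x 16 double precision FLOPs per cycle for the E5-2690 v4 at base clock, and the rMax value is a made-up placeholder, not a measured result from Figure 1.

```python
def hpl_efficiency(rmax_tflops, n_cpus, cpu_rpeak_tflops, n_gpus, gpu_rpeak_tflops):
    """HPL Efficiency = rMax / (CPUs rPeak + GPUs rPeak)."""
    rpeak = n_cpus * cpu_rpeak_tflops + n_gpus * gpu_rpeak_tflops
    return rmax_tflops / rpeak


# Assumption: E5-2690 v4 rPeak at base clock, 14 cores x 2.6 GHz x 16 FLOPs/cycle.
CPU_RPEAK_TFLOPS = 14 * 2.6 * 16 / 1000
# From the text: one P100-PCIe delivers around 4.7 TFLOPS in double precision.
GPU_RPEAK_TFLOPS = 4.7

# Hypothetical rMax for one node (2 CPUs + 4 GPUs); replace with a measured value.
hypothetical_rmax = 14.0
eff = hpl_efficiency(hypothetical_rmax, 2, CPU_RPEAK_TFLOPS, 4, GPU_RPEAK_TFLOPS)
print(f"efficiency for the hypothetical rMax: {eff:.1%}")
```

Because the CPU term stays in the denominator even when DGEMM_SPLIT=1.0 keeps the CPUs out of the DGEMM, this definition reports a lower efficiency than a GPU-only rPeak would.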
NAMD (for NAnoscale Molecular Dynamics) is a molecular dynamics application designed for high-performance simulation of large biomolecular systems. The dataset we used is Satellite Tobacco Mosaic Virus (STMV), a small, icosahedral plant virus that worsens the symptoms of infection by Tobacco Mosaic Virus (TMV). This dataset has 1,066,628 atoms and is the largest dataset on the NAMD utilities website. The performance metric in the output log of this application is “days/ns” (lower is better), but our plot uses the inverted metric “ns/day” since that is what most molecular dynamics users focus on. We used the average of all occurrences of this value in the output log. Figure 2 shows the performance within 1 node. The performance with 2 P100 is better than with 4 P100. This is probably because of the communication among different CPU threads. The application launches a set of worker threads that handle the computation and communication threads that handle the data communication. As more GPUs are used, more communication threads are used and more synchronization is needed. In addition, based on profiling with nvprof, NVIDIA’s CUDA profiler, with 1 P100 the GPU computation takes less than 50% of the whole application time. According to Amdahl’s law, the speedup with more GPUs is limited by the remaining 50% or more of the work that is not parallelized on the GPU. Based on this observation, we further ran this application on multiple nodes with two different settings (2 GPUs/node and 4 GPUs/node); the result is shown in Figure 3. No matter how many nodes are used, the performance of 2 GPUs/node is always better than 4 GPUs/node. Within a node, 2 P100 GPUs are 9.5x faster than dual CPUs.
Figure 2: NAMD Performance within 1 P100-PCIe node
Figure 3: NAMD Performance across Nodes
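The Amdahl's law argument in the NAMD discussion can be made concrete. The sketch below treats the GPU portion as exactly 50% of the single-GPU runtime, which is the upper bound stated in the text, and assumes that portion divides perfectly across GPUs; both are simplifying assumptions.

```python
def amdahl_speedup(parallel_fraction, n):
    """Speedup when only parallel_fraction of the runtime scales with n units."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n)


for gpus in (1, 2, 4, 16):
    print(f"{gpus:2d} GPUs -> at most {amdahl_speedup(0.5, gpus):.2f}x over 1 GPU")
```

Under these assumptions the speedup can never reach 2x regardless of GPU count. Amdahl's law alone only predicts diminishing returns; it does not predict 4 GPUs being slower than 2, which is why the added communication and synchronization cost is the more likely explanation for that result.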
GROMACS (for GROningen MAchine for Chemical Simulations) primarily simulates biochemical molecules (bonded interactions). But because of its efficiency in calculating non-bonded interactions (atoms not linked by covalent bonds), its user base is expanding to non-biological systems. Figure 4 shows the performance of GROMACS on CPU, K80 GPUs and P100-PCIe GPUs. Since one K80 has two internal GPUs, from now on one K80 always refers to both internal GPUs rather than one of the two. The K80 tests used the same servers as the P100-PCIe tests, so the CPUs and memory were kept the same and the only difference is that the P100-PCIe GPUs were replaced with K80 GPUs. In all tests, there were four GPUs per server and all GPUs were utilized. For example, the 3 node data point uses 3 servers and 12 GPUs in total. P100-PCIe is 4.2x (1 node) to 2.8x (4 nodes) faster than CPU, and 1.5x (1 node) to 1.1x (4 nodes) faster than K80 GPUs.
Figure 4: GROMACS Performance on P100-PCIe
LAMMPS (for Large Scale Atomic/Molecular Massively Parallel Simulator) is a classic molecular dynamics code, capable of simulations for solid-state materials (metals, semiconductors), soft matter (biomolecules, polymers) and coarse-grained or mesoscopic systems. It can be used to model atoms or, more generically, as a parallel particle simulator at the atomic, meso or continuum scale. The dataset we used was LJ (Lennard-Jones liquid benchmark), which contains 512,000 atoms. There are two GPU implementations in LAMMPS: the GPU library version and the Kokkos version. In the experiment, we used the Kokkos version since it was much faster than the GPU library version.
Figure 5 shows LAMMPS performance on CPU and P100-PCIe GPUs. Using 16 P100 GPUs is 5.8x faster than using 1 P100. This application did not scale linearly because the data transfer time (CPU->GPU, GPU->CPU and GPU->GPU) increases as more GPUs are used, although the computation time decreases linearly. The data transfer time increases because this application requires data communication among all GPUs used. However, the configuration G we used only allows Peer-to-Peer (P2P) access within two pairs of GPUs: GPU 1 - GPU 2 and GPU 3 - GPU 4. GPU 1/2 cannot communicate with GPU 3/4 directly. If such communication is needed, the data must go through the CPU, which slows the communication. Configuration B eases this issue because it allows P2P access among all four GPUs within a node. The comparison between configuration G and configuration B is shown in Figure 6. Running LAMMPS on a configuration B server with 4 P100 improved the performance metric “timesteps/s” to 510 from 505 in configuration G, a 1% improvement. The improvement is small because data communication takes less than 8% of the whole application time when running on configuration G with 4 P100. Figure 7 compares the performance of P100-PCIe with that of CPU and K80 GPUs for this application. Within 1 node, 4 P100-PCIe GPUs are 6.6x faster than 2 E5-2690 v4 CPUs and 1.4x faster than 4 K80 GPUs.
Figure 5: LAMMPS Performance on P100-PCIe
Figure 6: Comparison between Configuration G and Configuration B
Figure 7: LAMMPS Performance Comparison
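To put the scaling numbers in perspective, the sketch below converts the reported speedups into parallel efficiency and bounds the gain that removing communication could deliver. It uses only figures quoted in the text (14.9x for HPL and 5.8x for LAMMPS on 16 GPUs, and communication at under 8% of runtime); the helper names are illustrative, and treating communication as exactly 8% is an assumption that yields an upper bound.

```python
def parallel_efficiency(speedup, n_units):
    """Fraction of ideal linear scaling actually achieved."""
    return speedup / n_units


def max_gain_if_removed(fraction_of_runtime):
    """Upper bound on speedup if a runtime component were eliminated entirely."""
    return 1.0 / (1.0 - fraction_of_runtime)


print(f"HPL, 16 GPUs:    {parallel_efficiency(14.9, 16):.1%} efficiency")
print(f"LAMMPS, 16 GPUs: {parallel_efficiency(5.8, 16):.1%} efficiency")
print(f"LAMMPS, 4 GPUs:  at most {max_gain_if_removed(0.08):.3f}x from faster P2P")
```

The bound of roughly 1.09x is in line with the small improvement observed when moving from configuration G to configuration B.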
HOOMD-blue (for Highly Optimized Object-oriented Many-particle Dynamics - blue) is a general purpose molecular dynamics simulator. Figure 8 shows the HOOMD-blue performance. Note that the y-axis is in logarithmic scale. 1 P100 is 13.4x faster than dual CPUs. The speedup of using 2 P100 is 1.5x compared to using only 1 P100, which is reasonable. However, from 4 P100 to 16 P100 the speedup only ranges from 2.1x to 3.9x, which is not high. The reason is that, similar to LAMMPS, this application involves a lot of communication among all GPUs used. Based on the LAMMPS analysis, using configuration B should reduce this communication bottleneck significantly. To verify this, we ran the same application again on a configuration B server. With 4 P100, the performance metric “hours for 10e6 steps” was reduced to 10.2 from 11.73 in configuration G, a 13% performance improvement, and the speedup compared to 1 P100 improved from 2.1x to 2.4x.
Figure 8: HOOMD-blue Performance on CPU and P100-PCIe
Amber is the collective name for a suite of programs that allow users to carry out molecular dynamics simulations, particularly on biomolecules. The term Amber is also used to refer to the empirical force fields that are implemented in this suite. Figure 9 shows the performance of Amber on CPU and P100-PCIe. 1 P100 is 6.3x faster than dual CPUs, and using 2 P100 GPUs is 1.2x faster than using 1 P100. However, the performance drops significantly when 4 or more GPUs are used. The reason is that, similar to LAMMPS and HOOMD-blue, this application relies heavily on P2P access, but configuration G only supports it within two pairs of GPUs. We verified this by testing the application again on a configuration B node. As a result, the performance with 4 P100 improved to 791 ns/day from 315 ns/day in configuration G, a 151% performance improvement and a speedup of 2.5x. But even in configuration B, the multi-GPU scaling is still not good enough. This is because when the Amber multi-GPU support was originally designed, the PCI-E bus speed was Gen2 x16 and the GPUs were C1060s or C2050s. The current Pascal generation GPUs are more than 16x faster than the C1060s, while the PCI-E bus speed has only increased by 2x (PCI-E Gen2 x16 to Gen3 x16) and InfiniBand interconnects by about the same amount. The Amber website explicitly states that “It should be noted that while the legacy MPI and GPU-Direct methods of multi-GPU communication are still supported, and will be used by the code automatically if peer to peer communication is not available, you are very unlikely to see any speedup by using multiple GPUs for a single job if the GPUs are newer than C2050s. Multi-node runs are almost impossible to get to scale.” This is consistent with our multi-node results: Figure 9 shows that the more nodes are used, the worse the performance is.
Figure 9: Amber Performance on CPU and P100-PCIe
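The configuration G versus configuration B comparisons for LAMMPS, HOOMD-blue and Amber mix higher-is-better metrics (timesteps/s, ns/day) with a lower-is-better metric (hours for 10e6 steps). The sketch below shows one way to reproduce the quoted percentages from the numbers in the text; the function names are illustrative.

```python
def gain_higher_is_better(old, new):
    """Relative improvement for throughput-style metrics."""
    return (new - old) / old


def gain_lower_is_better(old, new):
    """Relative reduction for time-style metrics."""
    return (old - new) / old


lammps = gain_higher_is_better(505, 510)    # timesteps/s, G -> B
hoomd = gain_lower_is_better(11.73, 10.2)   # hours for 10e6 steps, G -> B
amber = gain_higher_is_better(315, 791)     # ns/day, G -> B

print(f"LAMMPS: {lammps:.1%}, HOOMD-blue: {hoomd:.1%}, Amber: {amber:.1%}")
```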
ANSYS® Mechanical™ software is a comprehensive finite element analysis (FEA) tool for structural analysis, including linear, nonlinear dynamic, hydrodynamic and explicit studies. It provides a complete set of element behavior, material models and equation solvers for a wide range of mechanical design problems. The finite element method is used to solve the partial differential equations, which is a compute and memory intensive task. Our testing focused on the Power Supply Module (V17cg-1) benchmark. This is a medium sized job for iterative solvers and a good test for memory bandwidth. Figure 10 shows the performance of ANSYS Mechanical on CPU and P100-PCIe. Within a node, 4 P100 are 3.8x faster than dual CPUs, and with 4 nodes, 16 P100 are 2.3x faster than 8 CPUs. The figure also shows that the performance scales well with more nodes: the speedup with 4 nodes is 2.8x compared to 1 node.
Figure 10: ANSYS Mechanical Performance on CPU and P100-PCIe
RELION (for REgularised LIkelihood OptimisatioN) is a program that employs an empirical Bayesian approach to the refinement of (multiple) 3D reconstructions or 2D class averages in electron cryo-microscopy (cryo-EM). Figure 11 shows the performance of RELION on CPU and P100-PCIe. Note that the y-axis is in logarithmic scale. 1 P100 is 8.8x faster than dual CPUs. The figure also shows that RELION does not scale well starting from 4 P100 GPUs. Because of the long execution time, we did not profile this application, but it is possible that the reason for the weak scaling is similar to that for LAMMPS, HOOMD-blue and Amber.
Figure 11: RELION Performance on CPU and P100-PCIe
In this blog, we presented and analyzed the performance of different applications on Dell PowerEdge C4130 servers with P100-PCIe GPUs. Among the tested applications, HPL, GROMACS and ANSYS Mechanical benefit from the balanced CPU-GPU configuration in configuration G, because they do not require P2P access among GPUs. However, LAMMPS, HOOMD-blue, Amber (and possibly RELION) rely on P2P access. Therefore, with configuration G, they scale well up to 2 P100 GPUs, then scale weakly with 4 or more P100 GPUs. With configuration B, they scale better than with configuration G at 4 GPUs, so configuration B is more suitable and is recommended for applications implemented with P2P access.
In future work, we will run these applications on P100-SXM2 and compare the performance difference between P100-PCIe and P100-SXM2.
Author: Joseph Stanfield
The purpose of this blog is to provide a comparative performance analysis of the Intel® Xeon® Gold 6150 processor and the previous generation Xeon® E5-2697 v4 processors using the NAMD benchmark. The Xeon® Gold 6150 CPU features 18 physical cores or 36 logical cores when utilizing hyper threading. This processor is based on Intel’s new micro-architecture codenamed “Skylake”. Intel significantly increased the L2 cache per core from 256 KB on Broadwell to 1 MB on Skylake. The 6150 also touts 24.75 MB of L3 cache and a six channel DDR4 memory interface.
Nanoscale Molecular Dynamics (NAMD) is an application developed using the Charm++ parallel programming model for molecular dynamics simulation. It is popular due to its parallel efficiency, scalability, and the ability to simulate millions of atoms.
Test Cluster Configurations:
Server Dell EMC PowerEdge C6420 / Dell EMC PowerEdge C6320
Processor 2x Xeon® Gold 6150 18c 2.7 GHz (Skylake) / 2x Xeon® E5-2697 v4 16c 2.3 GHz (Broadwell)
Memory 12x 16GB @2666 MHz / 8x 16GB @2400 MHz
Hard drive 1 TB SATA
The benchmark dataset selected for this series of tests was the Satellite Tobacco Mosaic Virus, or STMV. STMV contains 1,066,628 atoms, which makes it ideal for demonstrating scaling to large clustered environments. The performance is measured in nanoseconds per day (ns/day), which is the number of nanoseconds of simulated time that can be computed in one day of wall-clock time. A larger value indicates faster performance.
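Since ns/day and days/ns are reciprocals, converting between them and estimating wall-clock time for a target simulation length is straightforward. The sketch below is illustrative only: the function names are made up, the 100 ns target is an arbitrary example, and 0.70 ns/day is the single-node result reported in this blog.

```python
def ns_per_day_from_days_per_ns(days_per_ns):
    """Invert the days/ns value that NAMD writes to its log."""
    return 1.0 / days_per_ns


def days_to_simulate(target_ns, ns_per_day):
    """Wall-clock days needed to simulate target_ns nanoseconds."""
    return target_ns / ns_per_day


single_node = 0.70  # ns/day, single-node average from this blog
print(f"{1.0 / single_node:.2f} days per simulated nanosecond")
print(f"{days_to_simulate(100, single_node):.1f} days for a 100 ns trajectory")
```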
The first series of benchmark tests measured CPU performance. The test environment consisted of a single node, two nodes, four nodes, and eight nodes, with the NAMD STMV dataset run three times for each configuration. The network interconnect between the nodes was EDR InfiniBand as noted in the table above. Average results from a single node showed 0.70 ns/day, while a two-node run increased performance by 80% to 1.25 ns/day. This trend of an average 80% increase in performance for each doubling of node count remained relatively consistent as the environment was scaled to eight nodes, as seen in Figure 1.
The second series of benchmarks compared the Xeon® Gold 6150 against the previous generation Xeon® E5-2697 v4. The same dataset, STMV, was used for both benchmark environments. As you can see below in Figure 2, the Xeon® Gold CPU results surpass the Xeon® E5 v4 by 111% on a single node, and the relative performance advantage decreases to 63% at eight nodes.
In this blog, we analyzed and presented the performance of a Dell EMC PowerEdge C6420 cluster scaling from a single node to eight nodes running NAMD with the STMV dataset. Results show that NAMD performance scales close to linearly, gaining about 80% with each doubling of the node count.
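To see what an 80% gain per doubling implies at larger node counts, the sketch below projects throughput and parallel efficiency from the single-node result. It assumes the 80% gain holds exactly at every doubling, which the text only describes as an approximate average, so the projected values are estimates rather than measurements.

```python
import math


def projected_ns_per_day(single_node, nodes, gain_per_doubling=0.80):
    """Project throughput assuming a fixed gain for every doubling of nodes."""
    doublings = math.log2(nodes)
    return single_node * (1.0 + gain_per_doubling) ** doublings


def scaling_efficiency(single_node, nodes, gain_per_doubling=0.80):
    """Projected throughput relative to ideal linear scaling."""
    ideal = single_node * nodes
    return projected_ns_per_day(single_node, nodes, gain_per_doubling) / ideal


for n in (1, 2, 4, 8):
    print(f"{n} nodes: ~{projected_ns_per_day(0.70, n):.2f} ns/day, "
          f"{scaling_efficiency(0.70, n):.0%} of linear")
```

Under this assumption, efficiency relative to ideal linear scaling drops by 10% at each doubling, which is why the scaling is better described as close to linear than strictly linear.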
At the time of publishing this blog, there is an issue with Intel Parallel Studio 2017.x and NAMD compilation. Intel recommends using Parallel Studio 2016.4 or 2018 (which is still in beta) with -xCORE-AVX512 under the FLOATOPS variable for best performance.
A comparative analysis was also conducted with the previous generation Dell EMC PowerEdge C6320 server and Xeon® E5 v4 (Broadwell) processor. The Xeon® Gold outperformed the E5 v4 by 111% on a single node and maintained a near-linear performance increase as the cluster was scaled and the number of nodes multiplied.
Intel NAMD Recipe: https://software.intel.com/en-us/articles/building-namd-on-intel-xeon-and-intel-xeon-phi-processor
Intel Fabric Tuning and Application Performance: https://www.intel.com/content/www/us/en/high-performance-computing-fabrics/omni-path-architecture-application-performance-mpi.html
Author: Joseph Stanfield
The purpose of this blog is to provide a comparative performance analysis of the Intel® Xeon® Gold 6150 processor (architecture code named “Skylake”) and the previous generation Xeon® E5-2697 v4 processor using the LAMMPS benchmark. The Xeon® Gold 6150 CPU features 18 cores or 36 when utilizing hyper threading. Intel significantly increased the L2 cache per core from 256 KB on previous generations of Xeon to 1 MB. The new processor also touts 24.75 MB of L3 cache and a six channel DDR4 memory interface.
LAMMPS, or Large-scale Atomic/Molecular Massively Parallel Simulator, is an open-source molecular dynamics program originally developed by Sandia National Laboratories, Temple University, and the United States Department of Energy. The main function of LAMMPS is to model particles in a gaseous, liquid, or solid state.
Test cluster configuration
12x 16GB @2666 MT/s
8x 16GB @2400 MT/s
The LAMMPS release used for testing was lammps-6June-17. The in.eam dataset was used for the analysis on both configurations. in.eam simulates a metallic solid: Cu EAM potential with 4.95 Angstrom cutoff (45 neighbors per atom), NVE integration. The simulation was executed for 100 steps with 32,000 atoms. The first series of benchmarks measured performance in units of timesteps/s. The test environment consisted of four servers interconnected with InfiniBand EDR, and LAMMPS was run on a single node, two nodes, and four nodes, three times for each configuration. Average results from a single node showed 106 time steps per second, while the two-node result roughly doubled performance to 216 time steps per second. This trend remained consistent as the environment was scaled to four nodes, as seen in Figure 1.
The second series of benchmarks were run to compare the Xeon® Gold against the previous generation, Xeon® E5 v4. The same dataset, in.eam, was used with 32,000 atoms and 100 steps per run. As you can see below in Figure 2, the Xeon® Gold CPU outperforms the Xeon® E5 v4 by about 120% with each test, but the performance increase drops slightly as the cluster is scaled.
In this blog, we analyzed and presented the performance of a Dell EMC PowerEdge C6420 cluster scaling from a single node to four nodes running the LAMMPS benchmark. Results show that performance of LAMMPS scales linearly with the increased number of nodes.
A comparative analysis was also conducted with the previous generation Dell EMC PowerEdge C6320 server with a Xeon® E5 v4 (Broadwell) processor. As with the first test, the application scaled linearly on the Xeon® E5 v4 as the node count increased, similar to the Xeon® Gold results. However, the Xeon® Gold processor outperformed the previous generation CPU by about 120% in each run.
Ashish Kumar Singh. Dell EMC HPC Innovation Lab. Aug 2017
This blog discusses the impact of the different BIOS tuning options available on Dell EMC 14th generation PowerEdge servers with the Intel Xeon® Processor Scalable Family (architecture codenamed “Skylake”) for some HPC benchmarks and applications. A brief description of the Skylake processor, BIOS options and HPC applications is provided below.
Skylake is a new 14nm “tock” processor in the Intel “tick-tock” series, which has the same process technology as the previous generation but with a new microarchitecture. Skylake requires a new CPU socket that is available with the Dell EMC 14th Generation PowerEdge servers. Skylake processors are available in two different configurations, with an integrated Omni-Path fabric and without fabric. The Omni-Path fabric supports network bandwidth up to 100Gb/s. The Skylake processor supports up to 28 cores, six DDR4 memory channels with speed up to 2666MT/s, and additional vectorization power with the AVX512 instruction set. Intel also introduces a new cache coherent interconnect named “Ultra Path Interconnect” (UPI), replacing Intel® QPI, that connects multiple CPU sockets.
Skylake offers a new, more powerful AVX512 vectorization technology that provides 512-bit vectors. The Skylake CPUs include models that support two 512-bit Fused Multiply-Add (FMA) units to deliver 32 Double Precision (DP) FLOPS/cycle and models with a single 512-bit FMA unit that is capable of 16 DP FLOPS/cycle. More details on AVX512 are described in the Intel programming reference. With 32 FLOPS/cycle, Skylake doubles the compute capability of the previous generation, Intel Xeon E5-2600 v4 processors (“Broadwell”).
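The FLOPS/cycle figures follow from the vector width and the number of FMA units. The sketch below derives them, assuming 64-bit double precision lanes and counting each FMA as two floating point operations. The peak figure for the Xeon Gold 6150 uses the nominal 2.7 GHz clock as an assumption; sustained clocks under AVX512 load are typically lower, so treat it as an upper bound.

```python
def dp_flops_per_cycle(vector_bits=512, fma_units=2):
    """Double precision FLOPs per cycle per core for FMA-capable vector units."""
    lanes = vector_bits // 64   # 64-bit doubles per vector register
    return lanes * 2 * fma_units  # one FMA = one multiply + one add


skylake_two_fma = dp_flops_per_cycle(512, 2)
skylake_one_fma = dp_flops_per_cycle(512, 1)
broadwell = dp_flops_per_cycle(256, 2)  # AVX2 with two 256-bit FMA units

# Assumption: nominal clock; 18 cores per socket as in the Xeon Gold 6150.
peak_gflops_per_socket = 18 * 2.7 * skylake_two_fma
print(skylake_two_fma, skylake_one_fma, broadwell)
print(f"nominal peak: {peak_gflops_per_socket:.1f} GFLOPS per socket")
```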
Skylake processors are supported in the Dell EMC PowerEdge 14th Generation servers. The new processor architecture allows different tuning knobs, which are exposed in the server BIOS menu. In addition to existing options for performance and power management, the new servers also introduce a clustering mode called Sub NUMA clustering (SNC). On CPU models that support SNC, enabling SNC is akin to splitting the single socket into two NUMA domains, each with half the physical cores and half the memory of the socket. If this sounds familiar, it is similar in utility to the Cluster-on-Die (COD) option that was available in E5-2600 v3 and v4 processors as described here. SNC is implemented differently from COD, and these changes improve remote socket access in Skylake when compared to the previous generation. At the Operating System level, a dual socket server with SNC enabled will display four NUMA domains. Two of the domains will be closer to each other (on the same socket), and the other two will be a larger distance away, across the UPI to the remote socket. This can be seen using OS tools like numactl -H.
In this study, we have used the Performance and PerformancePerWattDAPC system profiles based on our earlier experiences with other system profiles for HPC workloads. The Performance Profile aims to optimize for pure performance. The DAPC profile aims to balance performance with energy efficiency concerns. Both of these system profiles are meta options that, in turn, set multiple performance and power management focused BIOS options like Turbo mode, Cstates, C1E, Pstate management, Uncore frequency, etc.
We have used two HPC benchmarks and two HPC applications to understand the behavior of SNC and System Profile BIOS options with Dell EMC PowerEdge 14th generation servers. This study was performed with a single server only; cluster level performance deltas will be bounded by these single server results. The server configuration used for this study is described below.
Table 1: Test configuration of new 14G server
Server PowerEdge C6420
Processor 2 x Intel Xeon Gold 6150 – 2.7GHz, 18c, 165W
Memory 192GB (12 x 16GB) DDR4 @2666MT/s
Hard drive 1 x 1TB SATA HDD, 7.2k rpm
Operating System Red Hat Enterprise Linux-7.3 (kernel - 3.10.0-514.el7.x86_64)
MPI Intel® MPI 2017 update4
MKL Intel® MKL 2017.0.3
Compiler Intel® compiler 17.0.4
Table 2: HPC benchmarks and applications
Application Version Benchmark
HPL From Intel® MKL Problem size - 92% of total memory
STREAM v5.04 Triad
WRF 3.8.1 conus2.5km
ANSYS Fluent v17.2 truck_poly_14m, Ice_2m
As described above, a system with SNC enabled will expose four NUMA nodes to the OS on a two socket PowerEdge server. Each NUMA node can communicate with three remote NUMA nodes, two in another socket and one within same socket. NUMA domains on different sockets communicate over the UPI interconnect. With the Intel® Xeon Gold 6150 18 cores processor, each NUMA node will have nine cores. Since both sockets are equally populated in terms of memory, each NUMA domain will have one fourth of the total system memory.
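The NUMA layout described above can be summarized with a small helper. The sketch assumes SNC splits each socket into exactly two equal domains and that memory is populated evenly, as stated in the text; the function and key names are illustrative.

```python
def numa_layout(sockets, cores_per_socket, total_memory_gb, snc_enabled):
    """Describe the NUMA domains the OS would see, assuming an even split."""
    domains_per_socket = 2 if snc_enabled else 1
    domains = sockets * domains_per_socket
    return {
        "numa_domains": domains,
        "cores_per_domain": cores_per_socket // domains_per_socket,
        "memory_gb_per_domain": total_memory_gb / domains,
        "remote_domains_same_socket": domains_per_socket - 1,
        "remote_domains_other_socket": domains - domains_per_socket,
    }


# Test server from Table 1: 2 x Xeon Gold 6150 (18 cores), 192GB of memory.
print(numa_layout(2, 18, 192, snc_enabled=True))
print(numa_layout(2, 18, 192, snc_enabled=False))
```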
Figure 1: Memory bandwidth with SNC enabled
Figure 1 plots the memory bandwidth with SNC enabled. Except for SNC and logical processors, all other options are set to BIOS defaults. Full system memory bandwidth is ~195 GB/s on the two socket server. This test uses all 36 available cores for memory access and calculates aggregate memory bandwidth. The “Local socket – 18 threads” data point measures the memory bandwidth of a single socket with 18 threads. As per the graph, local socket memory bandwidth is ~101 GB/s, which is about half of the full system bandwidth. By enabling SNC, a single socket is divided into two NUMA nodes. The memory bandwidth of a single SNC enabled NUMA node is noted by “Local NUMA node – 9 threads”. In this test, the nine local cores access the local memory attached to their NUMA domain. The memory bandwidth here is ~50 GB/s, which is half of the total local socket bandwidth.
The data point “Remote to same socket” measures the memory bandwidth between two NUMA nodes on the same socket, with cores on one NUMA domain accessing the memory of the other NUMA domain. As per the graph, the server measures ~50 GB/s memory bandwidth for this case, the same as the “Local NUMA node – 9 threads” case. That is, with SNC enabled, memory access within the socket is similar in terms of bandwidth even across NUMA domains. This is a big difference from the previous generation, where there was a penalty when accessing memory on the same socket with COD enabled. See Figure 1 in the previous blog, where a 47% drop in bandwidth was observed, and compare that to the 0% performance drop here. The “Remote to other socket” test involves cores on one NUMA domain accessing the memory of a remote NUMA node on the other socket. This bandwidth is 54% lower due to non-local memory access over the UPI interconnect.
These memory bandwidth tests are interesting, but what do they mean? As in previous generations, SNC is a good option for codes that have high NUMA locality. Reducing the size of the NUMA domain can help some codes run faster due to fewer snoops and cache coherence checks within the domain. Additionally, the penalty for remote accesses on Skylake is not as bad as it was for Broadwell.
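The relationships between these data points can be checked with a few lines of arithmetic. The sketch below uses the approximate bandwidths quoted in the text; the remote-to-other-socket value is derived from the stated 54% drop rather than read from the figure, so treat it as an estimate.

```python
# Approximate values quoted in the text, in GB/s.
measured_gbs = {
    "full system (36 threads)": 195,
    "local socket (18 threads)": 101,
    "local NUMA node (9 threads)": 50,
    "remote to same socket": 50,
}


def after_drop(base_gbs, drop_fraction):
    """Bandwidth remaining after a fractional drop from base_gbs."""
    return base_gbs * (1.0 - drop_fraction)


# Derived estimate: 54% below the local NUMA node bandwidth.
remote_other_socket = after_drop(measured_gbs["local NUMA node (9 threads)"], 0.54)

socket_share = measured_gbs["local socket (18 threads)"] / measured_gbs["full system (36 threads)"]
print(f"socket share of full system: {socket_share:.0%}")
print(f"estimated remote-to-other-socket bandwidth: ~{remote_other_socket:.0f} GB/s")
```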
Figure 2: Comparing Sub-NUMA clustering with DAPC
Figure 2 shows the effect of SNC on multiple HPC workloads; note that all of these have good memory locality. All options except SNC and Hyper-Threading are set to BIOS defaults. SNC disabled is the baseline for each workload. As per Figure 2, no test measures more than 2% higher performance with SNC enabled. Although this is well within the run-to-run variation for these applications, SNC enabled consistently shows marginally higher performance for STREAM, WRF and Fluent for these datasets. The performance delta will vary for larger and different datasets. For many HPC clusters, this level of tuning for a few percentage points might not be worth it, especially if applications with sub-optimal memory locality will be penalized.
The Dell EMC default setting for this option is “disabled”, i.e. two sockets show up as just two NUMA domains. The HPC recommendation is to leave this at disabled to accommodate multiple types of codes, including those with inefficient memory locality, and to test this on a case-by-case basis for the applications running on your cluster.
Figure 3 plots the impact of different system profiles on the tests in this study. For these studies, all BIOS options are default except system profiles and logical processors. The DAPC profile with SNC disabled is used as the baseline. Most of these workloads show similar performance on both the Performance and DAPC system profiles. Only HPL performance is higher by a few percent with the Performance profile. As per our earlier studies, the DAPC profile always consumes less power than the Performance profile, which makes it suitable for HPC workloads without compromising too much on performance.
Figure 3: Comparing System Profiles
Figure 4 shows the power consumption of different system profiles with SNC enabled and disabled. The HPL benchmark is well suited to stress the system and use its maximum compute power. We measured idle and peak power consumption with logical processors disabled.
Figure 4: Idle and peak power consumption
As per Figure 4, the DAPC profile with SNC disabled shows the lowest idle power consumption relative to other profiles. Both the Performance and DAPC system profiles consume up to ~5% less power at idle with SNC disabled. At idle, the Performance profile consumes ~28% more power than DAPC.
The peak power consumption is similar with SNC enabled and with SNC disabled. Peak power consumption in DAPC Profile is ~16% less than in Performance Profile.
The Performance system profile is still the best profile to achieve maximum performance for HPC workloads. However, the power saved by DAPC is larger than the performance gained with the Performance profile, which makes DAPC the most suitable system profile.
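The trade-off between the two profiles can be expressed as relative performance per watt. The sketch below is illustrative only: the ~16% peak power saving comes from the text, but the text gives the HPL gain only as "a few percent", so the gain values of 1%, 3% and 5% are assumptions, and the function name is hypothetical.

```python
def relative_perf_per_watt(perf_gain_pct, power_saving_pct):
    """Performance per watt of DAPC relative to the Performance profile.

    perf_gain_pct: how much faster the Performance profile is, in percent.
    power_saving_pct: how much less peak power DAPC draws, in percent.
    """
    dapc_perf = 1.0 / (1.0 + perf_gain_pct / 100.0)   # Performance profile = 1.0
    dapc_power = 1.0 - power_saving_pct / 100.0       # Performance profile = 1.0
    return dapc_perf / dapc_power


# Peak power saving quoted in the text; the gain values are assumed.
for assumed_gain in (1.0, 3.0, 5.0):
    ratio = relative_perf_per_watt(assumed_gain, 16.0)
    print(f"assumed gain {assumed_gain:.0f}% -> DAPC perf/watt ratio {ratio:.3f}")
```

A ratio above 1.0 means DAPC delivers more work per watt than the Performance profile under the assumed gain.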
Authors: Rengan Xu, Frank Han and Nishanth Dandapanthula. Dell EMC HPC Innovation Lab. July 2017
This blog evaluates the performance, scalability and efficiency of deep learning inference on P40 and P4 GPUs on Dell EMC’s PowerEdge R740 server. The purpose is to compare P40 versus P4 in terms of performance and efficiency. It also measures the accuracy differences between high precision and reduced precision floating point in deep learning inference.
Introduction to R740 Server
The PowerEdge™ R740 is Dell EMC’s latest generation 2-socket, 2U rack server designed to run complex workloads using highly scalable memory, I/O, and network options. The system features the Intel Xeon Processor Scalable Family (architecture codenamed Skylake-SP), up to 24 DIMMs, PCI Express (PCIe) 3.0 enabled expansion slots, and a choice of network interface technologies to cover NIC and rNDC. The PowerEdge R740 is a general-purpose platform capable of handling demanding workloads and applications, such as data warehouses, ecommerce, databases, and high performance computing (HPC). It supports up to 3 Tesla P40 GPUs or 4 Tesla P4 GPUs.
Introduction to P40 and P4 GPUs
NVIDIA® launched Tesla® P40 and P4 GPUs for the inference phase of deep learning. Both GPU models are powered by the NVIDIA Pascal™ architecture and designed for deep learning deployment, but they have different purposes. P40 is designed to deliver maximum throughput, while P4 aims to provide better energy efficiency. Aside from high floating point throughput and efficiency, both GPU models introduce two new optimized instructions designed specifically for inference computations: the 8-bit integer (INT8) 4-element vector dot product (DP4A) and the 16-bit 2-element vector dot product (DP2A). Although many HPC applications require high precision computation with FP32 (32-bit floating point) or FP64 (64-bit floating point), deep learning researchers have found that FP16 (16-bit floating point) can achieve the same inference accuracy as FP32, and many applications only require INT8 (8-bit integer) or lower precision to maintain acceptable inference accuracy. Tesla P4 delivers a peak of 21.8 INT8 TIOP/s (Tera Integer Operations per Second), while P40 delivers a peak of 47.0 INT8 TIOP/s. Other differences between these two GPU models are shown in Table 1. This blog uses both types of GPUs in the benchmarking.
Table 1: Comparison between Tesla P40 and P4
GPU memory: 24 GB GDDR5 (P40), 8 GB GDDR5 (P4)
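To make the 4-element INT8 dot product concrete, the following is a minimal Python emulation of the arithmetic it performs: four pairs of 8-bit integers are multiplied and the products are added to an integer accumulator. This is a sketch of the operation only, not of the hardware instruction. The function name `dp4a_emulated` is illustrative, and because Python integers are unbounded, 32-bit accumulator overflow is not modelled.

```python
def dp4a_emulated(a, b, acc):
    """Return acc + a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3].

    a, b: sequences of exactly four signed 8-bit integers (-128..127).
    acc:  integer accumulator (overflow is not modelled in this sketch).
    """
    if len(a) != 4 or len(b) != 4:
        raise ValueError("both inputs must have exactly four elements")
    for value in (*a, *b):
        if not -128 <= value <= 127:
            raise ValueError("elements must fit in a signed 8-bit integer")
    return acc + sum(x * y for x, y in zip(a, b))


# 1*5 + 2*6 + 3*7 + 4*8 = 70, plus the accumulator 10.
print(dp4a_emulated([1, 2, 3, 4], [5, 6, 7, 8], 10))  # prints 80
```

Packing four multiply-accumulate steps into one operation is what lets INT8 inference reach a higher peak rate than wider data types on the same hardware.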
Introduction to NVIDIA TensorRT
NVIDIA TensorRT™, previously called GIE (GPU Inference Engine), is a high performance deep learning inference engine for production deployment of deep learning applications that maximizes inference throughput and efficiency. TensorRT lets users take advantage of the fast reduced precision instructions provided in the Pascal GPUs. TensorRT v2 supports the new INT8 operations that are available on both P40 and P4 GPUs, and to the best of our knowledge it is the only library that supports INT8 to date.
This blog quantifies the performance of deep learning inference using NVIDIA TensorRT on one PowerEdge R740 server, which supports up to 3 Tesla P40 GPUs or 4 Tesla P4 GPUs. Table 2 shows the hardware and software details. The inference benchmark we used was giexec from the TensorRT sample code. This sample uses synthetic images, filled with random non-zero numbers to simulate real images. Two classic neural networks were tested: AlexNet (2012 ImageNet winner) and GoogLeNet (2014 ImageNet winner), which is much deeper and more complicated than AlexNet.
We measured inference performance in images/sec, that is, the number of images processed per second.
Table 2: Hardware configuration and software details
Processor: 2 x Intel Xeon Gold 6150
Memory: 192 GB DDR4 @ 2667 MHz
Shared storage: 9 TB NFS through IPoIB on EDR InfiniBand
GPU: 3x Tesla P40 with 24 GB GPU memory, or 4x Tesla P4 with 8 GB GPU memory
0.58 (beta version)
CUDA and driver version
NVIDIA TensorRT version: 2.0 EA and 2.1 GA
In this section, we present the inference performance with NVIDIA TensorRT on GoogLeNet and AlexNet. We also implemented the benchmark with MPI so that it can be run on multiple GPUs within a server. Figure 1 and Figure 2 show the inference performance with AlexNet and GoogLeNet on up to three P40s and four P4s in one R740 server. In these two figures, batch size 128 was used. The power consumption of each configuration was also measured, and the energy efficiency of each configuration is plotted as a “performance per watt” metric. The power consumption was measured by subtracting the power when the system was idle from the power when running the inference. Both the images/sec and images/sec/watt numbers are relative to the numbers on one P40. Figure 3 shows the performance with different batch sizes on 1 GPU, and both metrics are relative to the numbers on P40 with batch size 1. In all figures, INT8 operations were used. The following conclusions can be observed:
Figure 1: The inference performance with AlexNet on P40 and P4
Figure 2: The performance of inference with GoogLeNet on P40 and P4
Figure 3: P40 vs P4 for AlexNet with different batch sizes
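The “performance per watt” metric described above can be sketched as follows: the power attributed to inference is the running power minus the idle power, and both throughput and efficiency are then normalized to a baseline configuration. All numbers in this sketch are hypothetical placeholders chosen for easy arithmetic, not measurements from this study, and the function and variable names are illustrative.

```python
def net_power(running_watts, idle_watts):
    """Power attributed to inference: running power minus idle power."""
    return running_watts - idle_watts


def perf_per_watt(images_per_sec, running_watts, idle_watts):
    """Energy efficiency in images/sec/watt."""
    return images_per_sec / net_power(running_watts, idle_watts)


def relative(value, baseline):
    """Normalize a metric to the baseline configuration."""
    return value / baseline


# Hypothetical placeholder measurements, NOT results from this study.
config_a = {"images_per_sec": 1000.0, "running_watts": 500.0, "idle_watts": 300.0}
config_b = {"images_per_sec": 500.0, "running_watts": 380.0, "idle_watts": 300.0}

eff_a = perf_per_watt(**config_a)
eff_b = perf_per_watt(**config_b)

# Report configuration B relative to configuration A (the baseline).
rel_throughput = relative(config_b["images_per_sec"], config_a["images_per_sec"])
rel_efficiency = relative(eff_b, eff_a)

print(f"relative images/sec: {rel_throughput:.2f}")
print(f"relative images/sec/watt: {rel_efficiency:.2f}")
```

With these placeholder values, configuration B has half the throughput of A but higher efficiency, which is the kind of trade-off the two metrics are meant to expose.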
In our previous blog, we compared the inference performance using both FP32 and INT8 and concluded that INT8 is ~3x faster than FP32. In this study, we also compare the accuracy of both to verify that INT8 can achieve accuracy comparable to FP32. We used the latest TensorRT 2.1 GA version for this benchmarking. To make INT8 data encode the same information as FP32 data, TensorRT applies a calibration method that converts FP32 to INT8 in a way that minimizes the loss of information. More details of this calibration method can be found in the presentation “8-bit Inference with TensorRT” from GTC 2017. We used the ILSVRC2012 validation dataset for both calibration and benchmarking. The validation dataset has 50,000 images and was divided into batches of 25 images each. The first 50 batches were used for calibration and the rest of the images were used for accuracy measurement. Several pre-trained neural network models were used in our experiments, including ResNet-50, ResNet-101, ResNet-152, VGG-16, VGG-19, GoogLeNet and AlexNet. Both top-1 and top-5 accuracies were recorded using FP32 and INT8, and the accuracy difference between FP32 and INT8 was calculated. The result is shown in Table 3. From this table, we can see that the accuracy difference between FP32 and INT8 is between 0.02% and 0.18%, which means the accuracy loss is minimal while a 3x speedup can be achieved.
Table 3: The accuracy comparison between FP32 and INT8
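The dataset split and the top-1/top-5 metrics described above can be sketched in plain Python. The split uses the numbers from the text (50,000 images, batches of 25, the first 50 batches for calibration). The accuracy function is a generic top-k implementation, and the toy scores and labels are made up for illustration and are not taken from any of the tested networks.

```python
def split_validation_set(num_images, batch_size, calibration_batches):
    """Return (total batches, calibration images, evaluation images)."""
    total_batches = num_images // batch_size
    calibration_images = calibration_batches * batch_size
    evaluation_images = num_images - calibration_images
    return total_batches, calibration_images, evaluation_images


def top_k_accuracy(scores, labels, k):
    """Fraction of images whose true label is among the k highest scores.

    scores: one list of per-class scores per image.
    labels: the true class index of each image.
    """
    hits = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Split described in the text.
print(split_validation_set(50000, 25, 50))  # prints (2000, 1250, 48750)

# Toy example with three images and three classes (made-up values).
toy_scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
toy_labels = [1, 1, 0]
print(top_k_accuracy(toy_scores, toy_labels, 1))
print(top_k_accuracy(toy_scores, toy_labels, 2))
```

Running the same accuracy function on FP32 and INT8 outputs and subtracting the results gives the kind of accuracy difference reported in the table.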
In this blog, we compared the inference performance on both P40 and P4 GPUs in the latest Dell EMC PowerEdge R740 server and concluded that P40 has ~2x higher inference performance compared to P4. However, P4 is more power efficient, with performance/watt ~1.5x that of P40. Also, with the NVIDIA TensorRT library, INT8 can achieve accuracy comparable to FP32 while outperforming it by 3x in terms of performance.