High Performance Computing
A discussion venue for all things high performance computing (HPC), supercomputing, and the technologies that enable scientific research and discovery.
  • NAMD Performance Analysis on Skylake Architecture

    Author: Joseph Stanfield

    The purpose of this blog is to provide a comparative performance analysis of the Intel® Xeon® Gold 6150 processor and the previous generation Xeon® E5-2697 v4 processors using the NAMD benchmark. The Xeon® Gold 6150 CPU features 18 physical cores or 36 logical cores when utilizing hyper threading. This processor is based on Intel’s new micro-architecture codenamed “Skylake”. Intel significantly increased the L2 cache per core from 256 KB on Broadwell to 1 MB on Skylake. The 6150 also touts 24.75 MB of L3 cache and a six channel DDR4 memory interface.

     

    Nanoscale Molecular Dynamics (NAMD) is an application developed using the Charm++ parallel programming model for molecular dynamics simulation. It is popular due to its parallel efficiency, scalability, and the ability to simulate millions of atoms.

    Test Cluster Configurations:

                                   Dell EMC PowerEdge C6420             Dell EMC PowerEdge C6320

    CPU                            2x Xeon® Gold 6150 18c 2.7 GHz       2x Xeon® E5-2697 v4 16c 2.3 GHz
                                   (Skylake)                            (Broadwell)

    RAM                            12x 16GB @2666 MT/s                  8x 16GB @2400 MT/s

    HDD                            1TB SATA                             1TB SATA

    OS                             RHEL 7.3                             RHEL 7.3

    InfiniBand                     EDR ConnectX-4                       EDR ConnectX-4

    CHARM++                        6.7.1

    NAMD                           2.12_Source

    BIOS Settings

    BIOS Options                   Settings

    System Profile                 Performance Optimized

    Logical Processor              Disabled

    Virtualization Technology      Disabled


    The benchmark dataset selected for this series of tests was the Satellite Tobacco Mosaic Virus, or STMV. STMV contains 1,066,628 atoms, which makes it ideal for demonstrating scaling on large clustered environments. Performance is measured in nanoseconds per day (ns/day), which is the amount of simulated time that can be computed in one day of wall-clock time. A larger value indicates faster performance.

     

    The first series of benchmark tests measured CPU scaling performance. The test environment consisted of a single node, two nodes, four nodes, and eight nodes, with the NAMD STMV dataset run three times for each configuration. The network interconnect between the nodes was EDR InfiniBand, as noted in the table above. Average results from a single node showed 0.70 ns/day, while a two-node run increased performance by 80% to 1.25 ns/day. This trend of roughly an 80% performance increase for each doubling of node count remained consistent as the environment was scaled to eight nodes, as seen in Figure 1.

    Figure 1: NAMD STMV performance (ns/day) scaling from one to eight PowerEdge C6420 nodes.
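
    Because ns/day is a throughput metric, the scaling behavior above can be summarized as speedup and parallel efficiency. The short Python sketch below illustrates that arithmetic using only the single-node and two-node figures quoted in the text; it is an illustration, not part of the benchmark harness used for these runs.

        # Minimal sketch: speedup and parallel efficiency from NAMD ns/day results.
        # The two data points are the single-node and two-node STMV figures quoted above.
        results = {1: 0.70, 2: 1.25}   # nodes -> ns/day

        baseline_nodes = 1
        baseline = results[baseline_nodes]

        for nodes, ns_per_day in sorted(results.items()):
            speedup = ns_per_day / baseline                   # relative to the single-node run
            efficiency = speedup / (nodes / baseline_nodes)   # 1.0 == perfect linear scaling
            print(f"{nodes} node(s): {ns_per_day:.2f} ns/day, "
                  f"speedup {speedup:.2f}x, parallel efficiency {efficiency:.0%}")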

     

    The second series of benchmarks was run to compare the Xeon® Gold 6150 against the previous generation Xeon® E5-2697 v4. The same dataset, STMV, was used for both benchmark environments. As shown below in Figure 2, the Xeon® Gold results surpass the Xeon® E5 v4 by 111% on a single node, and the relative performance advantage decreases to 63% at eight nodes.

     


    Figure 2: NAMD STMV performance comparison, Xeon® Gold 6150 (C6420) versus Xeon® E5-2697 v4 (C6320).

     

     

    Summary

    In this blog, we analyzed and presented the performance of a Dell EMC PowerEdge C6420 cluster scaling from a single node to eight nodes running NAMD with the STMV dataset. Results show that NAMD performance scales nearly linearly as nodes are added, with roughly an 80% gain for each doubling of node count.

    At the time of publishing this blog, there is an issue with Intel Parallel Studio 2017.x and NAMD compilation. Intel recommends using Parallel Studio 2016.4 or 2018 (which is still in beta) with -xCORE-AVX512 under the FLOATOPS variable for best performance.

    A comparative analysis was also conducted with the previous generation Dell EMC PowerEdge C6320 server and Xeon® E5 v4 (Broadwell) processor. The Xeon® Gold outperformed the E5 v4 by 111% on a single node and maintained a near-linear performance increase as the cluster was scaled to more nodes.


    Resources

    Intel NAMD Recipe: https://software.intel.com/en-us/articles/building-namd-on-intel-xeon-and-intel-xeon-phi-processor

    Intel Fabric Tuning and Application Performance: https://www.intel.com/content/www/us/en/high-performance-computing-fabrics/omni-path-architecture-application-performance-mpi.html

  • LAMMPS Four Node Comparative Performance Analysis on Skylake Processors

    Author: Joseph Stanfield
     

    The purpose of this blog is to provide a comparative performance analysis of the Intel® Xeon® Gold 6150 processor (architecture code named “Skylake”) and the previous generation Xeon® E5-2697 v4 processor using the LAMMPS benchmark. The Xeon® Gold 6150 CPU features 18 cores or 36 when utilizing hyper threading. Intel significantly increased the L2 cache per core from 256 KB on previous generations of Xeon to 1 MB. The new processor also touts 24.75 MB of L3 cache and a six channel DDR4 memory interface.

    LAMMPS, or Large-scale Atomic/Molecular Massively Parallel Simulator, is an open-source molecular dynamics program originally developed by Sandia National Laboratories, Temple University, and the United States Department of Energy. The main function of LAMMPS is to model particles in a gaseous, liquid, or solid state.

     

    Test cluster configuration

                                   Dell EMC PowerEdge C6420             Dell EMC PowerEdge C6320

    CPU                            2x Xeon® Gold 6150 18c 2.7 GHz       2x Xeon® E5-2697 v4 16c 2.3 GHz
                                   (Skylake)                            (Broadwell)

    RAM                            12x 16GB @2666 MT/s                  8x 16GB @2400 MT/s

    HDD                            1TB SATA                             1TB SATA

    OS                             RHEL 7.3                             RHEL 7.3

    InfiniBand                     EDR ConnectX-4                       EDR ConnectX-4

    BIOS Settings

    BIOS Options                   Settings

    System Profile                 Performance Optimized

    Logical Processor              Disabled

    Virtualization Technology      Disabled


    The LAMMPS version used for testing was lammps-6June-17. The in.eam dataset was used for the analysis on both configurations; it simulates a metallic solid (Cu EAM potential with a 4.95 Angstrom cutoff, 45 neighbors per atom, NVE integration). The simulation was executed for 100 steps with 32,000 atoms. The first series of benchmarks measured performance in units of timesteps/s. The test environment consisted of four servers interconnected with InfiniBand EDR, and LAMMPS was run on a single node, two nodes, and four nodes, three times for each configuration. Average results from a single node showed 106 timesteps per second, while the two-node result nearly doubled performance at 216 timesteps per second. This trend remained consistent as the environment was scaled to four nodes, as seen in Figure 1.

     

    Figure 1: LAMMPS in.eam performance (timesteps/s) scaling from one to four PowerEdge C6420 nodes.
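
    Since timesteps/s is reported directly by LAMMPS, the scaling data above can be collected by pulling the "Performance:" line out of each run's log. The Python sketch below is a minimal example of that parsing, assuming the log format printed by recent LAMMPS releases; the file name in the usage comment is hypothetical.

        import re

        # Minimal sketch: extract the timesteps/s figure from a LAMMPS log file.
        # Assumes the log contains a line such as
        #   Performance: ... 106.0 timesteps/s
        # (the exact fields vary with the unit style and LAMMPS version).
        def timesteps_per_second(log_path):
            pattern = re.compile(r"([\d.]+)\s+timesteps/s")
            rates = []
            with open(log_path) as log:
                for line in log:
                    if line.startswith("Performance:"):
                        match = pattern.search(line)
                        if match:
                            rates.append(float(match.group(1)))
            return rates

        # Example usage (hypothetical file name):
        # print(timesteps_per_second("log.lammps"))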

    The second series of benchmarks were run to compare the Xeon® Gold against the previous generation, Xeon® E5 v4. The same dataset, in.eam, was used with 32,000 atoms and 100 steps per run. As you can see below in Figure 2, the Xeon® Gold CPU outperforms the Xeon® E5 v4 by about 120% with each test, but the performance increase drops slightly as the cluster is scaled.


    Figure 2: LAMMPS in.eam performance comparison, Xeon® Gold 6150 (C6420) versus Xeon® E5-2697 v4 (C6320).

    Conclusion

    In this blog, we analyzed and presented the performance of a Dell EMC PowerEdge C6420 cluster scaling from a single node to four nodes running the LAMMPS benchmark. Results show that performance of LAMMPS scales linearly with the increased number of nodes.

     

    A comparative analysis was also conducted with the previous generation Dell EMC PowerEdge C6320 server with a Xeon® E5 v4 (Broadwell) processor. As with the first test, near-linear scaling was observed on the Xeon® E5 v4 as the node count increased, similar to the Xeon® Gold. However, the Xeon® Gold processor outperformed the previous generation CPU by about 120% in each run.



  • BIOS characterization for HPC with Intel Skylake processor

    Ashish Kumar Singh. Dell EMC HPC Innovation Lab. Aug 2017

    This blog discusses the impact of the different BIOS tuning options available on Dell EMC 14th generation PowerEdge servers with the Intel Xeon® Processor Scalable Family (architecture codenamed “Skylake”) for some HPC benchmarks and applications. A brief description of the Skylake processor, BIOS options and HPC applications is provided below.  

    Skylake is a new 14nm “tock” processor in the Intel “tick-tock” series, which has the same process technology as the previous generation but with a new microarchitecture. Skylake requires a new CPU socket that is available with the Dell EMC 14th Generation PowerEdge servers. Skylake processors are available in two different configurations, with an integrated Omni-Path fabric and without fabric. The Omni-Path fabric supports network bandwidth up to 100Gb/s. The Skylake processor supports up to 28 cores, six DDR4 memory channels with speed up to 2666MT/s, and additional vectorization power with the AVX512 instruction set. Intel also introduces a new cache coherent interconnect named “Ultra Path Interconnect” (UPI), replacing Intel® QPI, that connects multiple CPU sockets.

    Skylake offers a new, more powerful AVX512 vectorization technology that provides 512-bit vectors. The Skylake CPUs include models that support two 512-bit Fuse-Multiply-Add (FMA) units to deliver 32 Double Precision (DP) FLOPS/cycle and models with a single 512-bit FMA unit that is capable of 16 DP FLOPS/cycle. More details on AVX512 are described in the Intel programming reference. With 32 FLOPS/cycle, Skylake doubles the compute capability of the previous generation, Intel Xeon E5-2600 v4 processors (“Broadwell”).
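
    As a back-of-the-envelope check on the doubling claim, peak double-precision throughput can be estimated as cores x FLOPS/cycle x clock. The Python sketch below uses the base clocks of the two processors compared elsewhere in this series; sustained AVX frequencies are lower than the base clock, so these are upper bounds rather than achievable numbers.

        # Rough peak double-precision estimate: cores * FLOPS/cycle * base clock (GHz) -> GFLOPS.
        # Base clocks are used; AVX-512/AVX2 frequency behaviour is ignored, so these are upper bounds.
        def peak_gflops(cores, flops_per_cycle, base_clock_ghz):
            return cores * flops_per_cycle * base_clock_ghz

        skylake_6150 = peak_gflops(cores=18, flops_per_cycle=32, base_clock_ghz=2.7)      # two AVX-512 FMA units
        broadwell_2697v4 = peak_gflops(cores=16, flops_per_cycle=16, base_clock_ghz=2.3)  # AVX2 FMA

        print(f"Xeon Gold 6150 (per socket): ~{skylake_6150:.0f} GFLOPS")
        print(f"Xeon E5-2697 v4 (per socket): ~{broadwell_2697v4:.0f} GFLOPS")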

    Skylake processors are supported in the Dell EMC PowerEdge 14th Generation servers. The new processor architecture allows different tuning knobs, which are exposed in the server BIOS menu. In addition to existing options for performance and power management, the new servers also introduce a clustering mode called Sub NUMA clustering (SNC). On CPU models that support SNC, enabling SNC is akin to splitting the single socket into two NUMA domains, each with half the physical cores and half the memory of the socket. If this sounds familiar, it is similar in utility to the Cluster-on-Die option that was available in E5-2600 v3 and v4 processors as described here. SNC is implemented differently from COD, and these changes improve remote socket access in Skylake when compared to the previous generation. At the Operating System level, a dual socket server with SNC enabled will display four NUMA domains. Two of the domains will be closer to each other (on the same socket), and the other two will be a larger distance away, across the UPI to the remote socket. This can be seen using OS tools like numactl –H.
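
    One way to see the NUMA layout that SNC exposes, besides numactl -H, is to read the Linux sysfs topology directly. The short Python sketch below is a generic example of listing NUMA nodes and their CPUs; on a dual-socket Xeon Gold 6150 server with SNC enabled it would report four nodes of nine cores each (plus their hyper-threads if logical processors are enabled).

        import glob
        import os

        # Minimal sketch: list NUMA nodes and the CPUs in each, using the Linux sysfs topology.
        for node_dir in sorted(glob.glob("/sys/devices/system/node/node[0-9]*")):
            node = os.path.basename(node_dir)
            with open(os.path.join(node_dir, "cpulist")) as f:
                cpulist = f.read().strip()
            print(f"{node}: CPUs {cpulist}")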

    In this study, we have used the Performance and PerformancePerWattDAPC system profiles based on our earlier experiences with other system profiles for HPC workloads. The Performance Profile aims to optimize for pure performance. The DAPC profile aims to balance performance with energy efficiency concerns. Both of these system profiles are meta options that, in turn, set multiple performance and power management focused BIOS options like Turbo mode, Cstates, C1E, Pstate management, Uncore frequency, etc.

    We have used two HPC benchmarks and two HPC applications to understand the behavior of SNC and System Profile BIOS options with Dell EMC PowerEdge 14th generation servers. This study was performed with a single server only; cluster level performance deltas will be bounded by these single server results. The server configuration used for this study is described below.    

    Testbed configuration:

    Table 1: Test configuration of new 14G server

    Components                                          Details

    Server                                                     PowerEdge C6420 

    Processor                                               2 x Intel Xeon Gold 6150 – 2.7GHz, 18c, 165W

    Memory                                                  192GB (12 x 16GB) DDR4 @2666MT/s

    Hard drive                                              1 x 1TB SATA HDD, 7.2k rpm

    Operating System                                   Red Hat Enterprise Linux-7.3 (kernel - 3.10.0-514.el7.x86_64)

    MPI                                                         Intel® MPI 2017 update4

    MKL                                                        Intel® MKL 2017.0.3

    Compiler                                                 Intel® compiler 17.0.4

    Table 2: HPC benchmarks and applications

    Application                 Version                  Benchmark

    HPL                         From Intel® MKL          Problem size - 92% of total memory

    STREAM                      v5.04                    Triad

    WRF                         3.8.1                    conus2.5km

    ANSYS Fluent                v17.2                    truck_poly_14m, Ice_2m, combustor_12m

     

    Sub-NUMA cluster

    As described above, a system with SNC enabled will expose four NUMA nodes to the OS on a two-socket PowerEdge server. Each NUMA node can communicate with three remote NUMA nodes: two on the other socket and one within the same socket. NUMA domains on different sockets communicate over the UPI interconnect. With the Intel® Xeon Gold 6150 18-core processor, each NUMA node has nine cores. Since both sockets are equally populated in terms of memory, each NUMA domain has one fourth of the total system memory.

                    

                                                 Figure 1: Memory bandwidth with SNC enabled

    Figure 1 plots the memory bandwidth with SNC enabled. Except SNC and logical processors, all other options are set to BIOS defaults. Full system memory bandwidth is ~195 GB/s on the two socket server. This test uses all available 36 cores for memory access and calculates aggregate memory bandwidth. The “Local socket – 18 threads” data point measures the memory bandwidth of single socket with 18 threads. As per the graph, local socket memory bandwidth is ~101 GB/s, which is about half of the full system bandwidth. By enabling SNC, a single socket is divided into two NUMA nodes. The memory bandwidth of a single SNC enabled NUMA node is noted by “Local NUMA node – 9 threads”. In this test, the nine local cores access their local memory attached to their NUMA domain. The memory bandwidth here is ~50 GB/s, which is half of the total local socket bandwidth.

    The data point “Remote to same socket” measures the memory bandwidth between two NUMA nodes, which are on the same socket with cores on one NUMA domain accessing the memory of the other NUMA domain. As per the graph, the server measures  ~ 50GB/s memory bandwidth for this case; the same as the “local NUMA node – 9 threads” case. That is, with SNC enabled, memory access within the socket is similar in terms of bandwidth even across NUMA domains. This is a big difference from the previous generation where there was a penalty when accessing memory on the same socket with COD enabled. See Figure 1 in the previous blog where a 47% drop in bandwidth was observed and compare that to the 0% performance drop here. The “Remote to other socket” test involves cores on one NUMA domain accessing the memory of a remote NUMA node on the other socket. This bandwidth is 54% lower due to non-local memory access over UPI interconnect.

    These memory bandwidth tests are interesting, but what do they mean? Like in previous generations, SNC is a good option for codes that have high NUMA locality. Reducing the size of the NUMA domain can help some codes run faster due to less snoops and cache coherence checks within the domain. Additionally, the penalty for remote accesses on Skylake is not as bad as it was for Broadwell.
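
    For reference, the Triad kernel in the STREAM benchmark listed in Table 2 is simply a(i) = b(i) + q*c(i) over arrays much larger than cache. The NumPy sketch below only illustrates how a bandwidth number is derived from bytes moved and elapsed time; the real STREAM code is compiled, OpenMP-threaded C with threads pinned to NUMA domains and will report far higher bandwidth than this single-threaded illustration.

        import time
        import numpy as np

        # Illustrative single-threaded Triad: a = b + q*c over arrays much larger than cache.
        n = 50_000_000                      # 8-byte doubles, ~400 MB per array
        b = np.random.rand(n)
        c = np.random.rand(n)
        q = 3.0

        start = time.perf_counter()
        a = b + q * c
        elapsed = time.perf_counter() - start

        bytes_moved = 3 * n * 8             # read b, read c, write a
        print(f"Triad bandwidth: {bytes_moved / elapsed / 1e9:.1f} GB/s")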

                                 

     

     Figure 2: Comparing Sub-NUMA clustering with DAPC

    Figure 2 shows the effect of SNC on multiple HPC workloads; note that all of these have good memory locality. All options except SNC and Hyper Threading are set to BIOS default. SNC disabled is considered as the baseline for each workload. As per Figure 2, all tests measure no more than 2% higher performance with SNC enabled. Although this is well within the run-to-run variation for these applications, SNC enabled consistently shows marginally higher performance for STREAM, WRF and Fluent for these datasets. The performance delta will vary for larger and different datasets. For many HPC clusters, this level of tuning for a few percentage points might not be worth it, especially if applications with sub-optimal memory locality will be penalized.

     

    The Dell EMC default setting for this option is “disabled”, i.e. two sockets show up as just two NUMA domains. The HPC recommendation is to leave this at disabled to accommodate multiple types of codes, including those with inefficient memory locality, and to test this on a case-by-case basis for the applications running on your cluster.

     

    System Profiles

    Figure 3 plots the impact of different system profiles on the tests in this study. For these tests, all BIOS options are at their defaults except the system profile and logical processor settings. The DAPC profile with SNC disabled is used as the baseline. Most of these workloads show similar performance with both the Performance and DAPC system profiles; only HPL is a few percent faster with the Performance profile. As in our earlier studies, the DAPC profile always consumes less power than the Performance profile, which makes it suitable for HPC workloads without compromising much on performance.

                                                                               

     Figure 3: Comparing System Profiles

    Power Consumption

    Figure 4 shows the power consumption of the different system profiles with SNC enabled and disabled. The HPL benchmark is well suited to stress the system and utilize its maximum compute power. We measured idle power and peak power consumption with the logical processor option disabled.

                       

                                                 Figure 4: Idle and peak power consumption

    As per Figure 4, DAPC Profile with SNC disabled shows the lowest idle power consumption relative to other profiles. Both Performance and DAPC system profiles consume up to ~5% lower power in idle status with SNC disabled. In idle state, Performance Profile consumes ~28% more power than DAPC.

    The peak power consumption is similar with SNC enabled and with SNC disabled. Peak power consumption in DAPC Profile is ~16% less than in Performance Profile. 

    Conclusion

    The Performance system profile is still the best profile for achieving maximum performance for HPC workloads. However, the power savings with the DAPC profile outweigh its small performance deficit relative to the Performance profile, which makes DAPC the most suitable system profile for most HPC deployments.

    Reference:

    http://en.community.dell.com/techcenter/extras/m/white_papers/20444326

  • Deep Learning Inference on P40 vs P4 with Skylake

    Authors: Rengan Xu, Frank Han and Nishanth Dandapanthula. Dell EMC HPC Innovation Lab. July. 2017

    This blog evaluates the performance, scalability, and efficiency of deep learning inference on P40 and P4 GPUs in Dell EMC’s PowerEdge R740 server. The purpose is to compare the P40 and P4 in terms of performance and efficiency. It also measures the accuracy difference between high precision (FP32) and reduced precision (INT8) deep learning inference.

    Introduction to R740 Server

    The PowerEdge™ R740 is Dell EMC’s latest generation 2-socket, 2U rack server designed to run complex workloads using highly scalable memory, I/O, and network options. The system features the Intel Xeon Processor Scalable Family (architecture codenamed Skylake-SP), up to 24 DIMMs, PCI Express (PCIe) 3.0 enabled expansion slots, and a choice of network interface technologies to cover NIC and rNDC. The PowerEdge R740 is a general-purpose platform capable of handling demanding workloads and applications, such as data warehouses, ecommerce, databases, and high performance computing (HPC). It supports up to 3 Tesla P40 GPUs or 4 Tesla P4 GPUs.

    Introduction to P40 and P4 GPUs

    NVIDIA® launched the Tesla® P40 and P4 GPUs for the inference phase of deep learning. Both GPU models are powered by the NVIDIA Pascal™ architecture and designed for deep learning deployment, but they have different purposes: the P40 is designed to deliver maximum throughput, while the P4 is aimed at better energy efficiency. Aside from high floating point throughput and efficiency, both GPU models introduce two new optimized instructions designed specifically for inference computations: the 8-bit integer (INT8) 4-element vector dot product (DP4A) and the 16-bit 2-element vector dot product (DP2A) instructions. Although many HPC applications require high precision computation with FP32 (32-bit floating point) or FP64 (64-bit floating point), deep learning researchers have found that FP16 (16-bit floating point) can achieve the same inference accuracy as FP32, and many applications only require INT8 (8-bit integer) or lower precision to keep an acceptable inference accuracy. The Tesla P4 delivers a peak of 21.8 INT8 TIOP/s (Tera Integer Operations per Second), while the P40 delivers a peak of 47.0 INT8 TIOP/s. Other differences between these two GPU models are shown in Table 1. This blog uses both types of GPUs in the benchmarking.
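
    To make the INT8 instructions concrete, DP4A computes the dot product of two 4-element INT8 vectors and accumulates the result into a 32-bit integer. The NumPy sketch below reproduces that arithmetic purely for illustration; on the GPU it is a single instruction operating on packed 8-bit values.

        import numpy as np

        # Illustration of the DP4A operation: dot product of two 4-element INT8 vectors
        # accumulated into a 32-bit integer. Unpacked here purely to show the arithmetic.
        def dp4a(a, b, accumulator=0):
            a = np.asarray(a, dtype=np.int8)
            b = np.asarray(b, dtype=np.int8)
            return accumulator + int(np.dot(a.astype(np.int32), b.astype(np.int32)))

        print(dp4a([1, -2, 3, 4], [5, 6, -7, 8], accumulator=10))   # 10 + (5 - 12 - 21 + 32) = 14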

    Table 1: Comparison between Tesla P40 and P4

                             Tesla P40              Tesla P4

    CUDA Cores               3840                   2560

    Core Clock               1531 MHz               1063 MHz

    Memory Bandwidth         346 GB/s               192 GB/s

    Memory Size              24 GB GDDR5            8 GB GDDR5

    FP32 Compute             12.0 TFLOPS            5.5 TFLOPS

    INT8 Compute             47 TIOPS               22 TIOPS

    TDP                      250W                   75W

    Introduction to NVIDIA TensorRT

    NVIDIA TensorRT™, previously called GIE (GPU Inference Engine), is a high performance deep learning inference engine for production deployment of deep learning applications; it maximizes inference throughput and efficiency. TensorRT provides users the ability to take advantage of the fast reduced precision instructions available in Pascal GPUs. TensorRT v2 supports the new INT8 operations that are available on both P40 and P4 GPUs, and to the best of our knowledge it is the only library that supports INT8 to date.

    Testing Methodology

    This blog quantifies the performance of deep learning inference using NVIDIA TensorRT on one PowerEdge R740 server which supports up to 3 Tesla P40 GPUs or 4 Tesla P4 GPUs. Table 2 shows the hardware and software details. The inference benchmark we used was giexec in TensorRT sample codes. The synthetic images, which were filled with random non-zero numbers to simulate real images, were used in this sample code. Two classic neural networks were tested: AlexNet (2012 ImageNet winner) and GoogLeNet (2014 ImageNet winner) which is much deeper and more complicated than AlexNet.

    We measured the inference performance in images/sec which means the number of images that can be processed per second.

    Table 2: Hardware configuration and software details

    Platform                     PowerEdge R740

    Processor                    2 x Intel Xeon Gold 6150

    Memory                       192GB DDR4 @ 2667MHz

    Disk                         400GB SSD

    Shared storage               9TB NFS through IPoIB on EDR InfiniBand

    GPU                          3x Tesla P40 with 24GB GPU memory, or
                                 4x Tesla P4 with 8GB GPU memory

    Software and Firmware

    Operating System             RHEL 7.2

    BIOS                         0.58 (beta version)

    CUDA and driver version      8.0.44 (375.20)

    NVIDIA TensorRT Version      2.0 EA and 2.1 GA

    Performance Evaluation

     

    In this section, we will present the inference performance with NVIDIA TensorRT on GoogLeNet and AlexNet. We also implemented the benchmark with MPI so that it can be run on multiple GPUs within a server. Figure 1 and Figure 2 show the inference performance with AlexNet and GoogLeNet on up to three P40s and four P4s in one R740 server. In these two figures, batch size 128 was used. The power consumption of each configuration was also measured and the energy efficiency of the configurations is plotted as a “performance per watt” metric. The power consumption was measured by subtracting the power when the system was idle from the power when running the inference. Both the images/sec and images/sec/watt metrics numbers are relative to the numbers on one P40. Figure 3 shows the performance with different batch sizes with 1 GPU, and both metrics numbers are relative to the numbers on P40 with batch size 1. In all figures, INT8 operations were used. The following conclusions can be observed:

    • Performance: with the same number of GPUs, the inference performance on the P4 is around half of that on the P40. This is consistent with the theoretical INT8 performance of the two GPUs: 22 TIOPS on the P4 vs 47 TIOPS on the P40 for a single GPU. Inference with larger batch sizes gives higher overall throughput but consumes more memory; since the P4 has only 8GB of memory compared to 24GB on the P40, the P4 could not complete the inference with batch size 2048 or larger.
    • Scalability: the performance scales linearly on both P40s and P4s when multiple GPUs are used, because no communication happens between the GPUs used in the test.
    • Efficiency (performance/watt): the performance/watt on the P4 is ~1.5x that of the P40. This is also consistent with the theoretical efficiency difference: the theoretical performance of the P4 is about 1/2 that of the P40 while its TDP is around 1/3 of the P40's (75W vs 250W), so its performance/watt comes out ~1.5x higher (see the sketch below).
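
    The efficiency figure in the last bullet can be sanity-checked directly from the specifications in Table 1, using TDP as a rough proxy for the measured power draw. The short calculation below is only that sanity check, not the measured data behind the figures.

        # Sanity check of the ~1.5x performance/watt claim using the Table 1 specifications.
        # TDP is used as a proxy for measured power, so this is only an approximation.
        p40 = {"int8_tiops": 47.0, "tdp_watts": 250}
        p4  = {"int8_tiops": 22.0, "tdp_watts": 75}

        perf_per_watt_p40 = p40["int8_tiops"] / p40["tdp_watts"]
        perf_per_watt_p4  = p4["int8_tiops"] / p4["tdp_watts"]

        print(f"P40: {perf_per_watt_p40:.3f} TIOPS/W")
        print(f"P4 : {perf_per_watt_p4:.3f} TIOPS/W")
        print(f"P4 advantage: ~{perf_per_watt_p4 / perf_per_watt_p40:.1f}x")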

    Figure 1: The inference performance with AlexNet on P40 and P4

    Figure 2: The performance of inference with GoogLeNet on P40 and P4

    Figure 3: P40 vs P4 for AlexNet with different batch sizes

    In our previous blog, we compared the inference performance using both FP32 and INT8 and concluded that INT8 is ~3x faster than FP32. In this study, we also compare the accuracy when using both operations to verify that INT8 can deliver accuracy comparable to FP32. We used the latest TensorRT 2.1 GA version for this benchmarking. To make INT8 data encode the same information as FP32 data, a calibration method is applied in TensorRT to convert FP32 to INT8 in a way that minimizes the loss of information. More details of this calibration method can be found in the presentation “8-bit Inference with TensorRT” from GTC 2017. We used the ILSVRC2012 validation dataset for both calibration and benchmarking. The validation dataset has 50,000 images and was divided into batches of 25 images each. The first 50 batches were used for calibration and the rest of the images were used for accuracy measurement. Several pre-trained neural network models were used in our experiments, including ResNet-50, ResNet-101, ResNet-152, VGG-16, VGG-19, GoogLeNet and AlexNet. Both top-1 and top-5 accuracy were recorded using FP32 and INT8, and the difference between the two was calculated. The result is shown in Table 3. From this table, we can see the accuracy difference between FP32 and INT8 is between 0.02% and 0.18%, which means minimal accuracy loss while achieving a ~3x speedup.

    Table 3: The accuracy comparison between FP32 and INT8

                        FP32                      INT8                      Difference
    Network             Top-1      Top-5          Top-1      Top-5          Top-1      Top-5

    ResNet-50           72.90%     91.14%         72.84%     91.08%         0.07%      0.06%

    ResNet-101          74.33%     91.95%         74.31%     91.88%         0.02%      0.07%

    ResNet-152          74.90%     92.21%         74.84%     92.16%         0.06%      0.05%

    VGG-16              68.35%     88.45%         68.30%     88.42%         0.05%      0.03%

    VGG-19              68.47%     88.46%         68.38%     88.42%         0.09%      0.03%

    GoogLeNet           68.95%     89.12%         68.77%     89.00%         0.18%      0.12%

    AlexNet             56.82%     79.99%         56.79%     79.94%         0.03%      0.06%
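
    For readers reproducing this comparison, the top-1 and top-5 numbers in Table 3 are computed from the per-image class scores the network outputs. The NumPy sketch below shows only that bookkeeping; it is not the TensorRT-based measurement code used for the table, and the arrays in the usage example are hypothetical.

        import numpy as np

        # Minimal sketch of top-1 / top-5 accuracy bookkeeping.
        # `scores` is an (images x classes) array of network outputs, `labels` the ground truth.
        def topk_accuracy(scores, labels, k):
            topk = np.argsort(scores, axis=1)[:, -k:]          # indices of the k highest scores
            hits = [label in row for row, label in zip(topk, labels)]
            return float(np.mean(hits))

        scores = np.random.rand(100, 1000)                     # hypothetical outputs for 100 images
        labels = np.random.randint(0, 1000, size=100)
        print("top-1:", topk_accuracy(scores, labels, 1))
        print("top-5:", topk_accuracy(scores, labels, 5))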


    Conclusions

    In this blog, we compared the inference performance of P40 and P4 GPUs in the latest Dell EMC PowerEdge R740 server and concluded that the P40 has ~2x higher inference performance than the P4. But the P4 is more power efficient, with ~1.5x the performance/watt of the P40. Also, with the NVIDIA TensorRT library, INT8 can achieve accuracy comparable to FP32 while delivering ~3x higher performance.



  • Dell EMC HPC Systems - SKY is the limit

    Munira Hussain, HPC Innovation Lab, July 2017

    This is an announcement of the Dell EMC HPC refresh that introduces support for 14th Generation servers based on the new Intel® Xeon® Processor Scalable Family (micro-architecture also known as “Skylake”). This includes the addition of the PowerEdge R740, R740xd, R640, R940 and C6420 servers to the portfolio. The portfolio consists of fully tested, validated, and integrated solution offerings that provide high-speed interconnects, storage, and options for both hardware-level and cluster-level system management and monitoring software.

     

    At a high level, the new generation of Dell EMC Skylake servers for HPC provides greater computational power, with support for up to 28 cores, memory speeds up to 2667 MT/s, and the extension of AVX instructions to AVX512. AVX512 instructions can execute up to 32 DP FLOPS per cycle, twice the capability of the previous 13th generation servers that used Intel Xeon E5-2600 v4 (“Broadwell”) processors. Additionally, the core count per socket is about 20% higher than the previous generation, which offered a maximum of 22 cores per socket. Each socket has six memory channels; therefore, a minimum of 12 DIMMs is needed in a dual-socket server to provide full memory bandwidth. The chipset also has 48 PCIe lanes per socket, up from 40 lanes in the previous generation.
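
    The guidance that 12 DIMMs are needed for full memory bandwidth follows directly from the six-channel memory controller. The sketch below estimates theoretical per-socket bandwidth for DDR4-2666; measured STREAM bandwidth will be somewhat lower than this peak.

        # Theoretical peak memory bandwidth per socket: channels * transfer rate * 8 bytes per transfer.
        # One DIMM per channel (6 per socket, 12 per dual-socket server) is needed to reach this peak.
        channels_per_socket = 6
        transfer_rate_mts = 2666            # DDR4-2666, mega-transfers per second
        bytes_per_transfer = 8              # 64-bit channel

        peak_gbps = channels_per_socket * transfer_rate_mts * bytes_per_transfer / 1000
        print(f"Peak per-socket bandwidth: ~{peak_gbps:.0f} GB/s")   # ~128 GB/s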

     

    The table below notes the enhancements in the latest PowerEdge servers over the previous generations:

     

    High Level Comparison of the Dell EMC Server Generations for HPC Offering:

     

     

    The HPC release supporting Dell EMC 14G servers is based on the Red Hat Enterprise Linux 7.3 operating system. It is based on the 3.10.0-514.el7.x86_64 kernel. The release also supports the new version of Bright Cluster Manager 8.0. Bright Cluster Manager (BCM) is integrated with Dell EMC supported tools, drivers, and third-party software components for the ease of deployment, configuration, and management of the cluster. It includes Dell EMC System Management tools based on OpenManage 9.0.1 and Dell EMC Deployment ToolKit 6.0.1 that help manage, monitor, and administer Dell EMC hardware. Additionally, updated third party drivers and development tools from Mellanox OFED for InfiniBand, Intel IFS for Omni-Path, NVIDIA CUDA for latest Accelerators, and other packages for Machine Learning are also included. Details of the components are as below:

    • Based on Red Hat Enterprise Linux 7.3 (Kernel 3.10.0-514.el7.x86_64)
    • Dell EMC System Management tools from Open Manage 9.0.1 and DTK 6.0.1 for 14G and Open Manage 8.5 and DTK 5.5 for up to 13G Dell EMC servers
    • Updated Dell EMC supported drivers for network and storage deployed during install

      • megaraid_sas = 7.700.50
      • igb = 5.3.5.7
      • ixgbe = 4.6.3
      • i40e = 1.6.44
      • tg3 = 1.137q
      • bnx2 = 2.2.5r
      • bnx2x = 1.714.2
    • Mellanox OFED 3.4 and 4.0 for InfiniBand
    • Intel IFS 10.3.1 drivers for Omni-Path
    • CUDA 8.0 drivers for NVidia accelerators
    • Intel XPPSL 1.5.1 for Intel Xeon Phi processors
    • Additional Machine Learning packages such as TensorFlow, Caffe, Cudnn, Digits and required dependencies are also supported and available for download

     

    Below are some images of the Bright Cluster Manager 8.0 BrightView:

    Figure 1: Overview of the cluster, displaying the total capacity, usage, and job status.

     

    Figure 2: Displays the cascading view of Cluster configuration and respective settings within a group. The settings can be modified and applied from the console.


     

    Figure 3: Dell EMC Settings Tab shows the parsed info on hardware configuration and the required BIOS level settings.

     

    Dell EMC HPC Systems based on the 14th Generation servers expand HPC computational capacity to meet growing demands. They are fully balanced, architected solutions that are validated and verified for customers, and the configurations are scalable. Please stay tuned, as follow-on blogs will cover performance and application studies; these will be posted here: http://en.community.dell.com/techcenter/high-performance-computing/


     

  • Dell EMC HPC System for Research - Keeping it fresh

    Dell EMC has announced an update to the PowerEdge C6320p modular server, introducing support for the Intel® Xeon Phi x200 processor with Intel Omni-Path™ fabric integration (KNL-F).  This update is a processor-only change, which means that changes to the PowerEdge C6320p motherboard were not required.  New purchases of the PowerEdge C6320p server can be configured with KNL or KNL-F processors.  For customers utilizing Omni-Path as a fabric, the KNL-F processor will improve cost and power efficiencies, as it eliminates the need to purchase and power discrete Omni-Path adapters.  Figure 1, below, illustrates the conceptual design differences between the KNL and KNL-F solutions.

    Late last year, we introduced the Dell EMC PowerEdge C6320p server, which delivers a high performance processor node based on the Intel Xeon Phi processor (KNL).  This exciting server delivers a compute node optimized for HPC workloads, supporting highly parallelized processes with up to 72 out-of-order cores in a compact half-width 1U package.  High-speed fabric options include InfiniBand or Omni-Path, ideal for data-intensive computational applications such as life sciences and weather simulations.

    Figure 1: Functional design view of KNL and KNL-F Omni-Path support.

    As seen in the figure, the integrated fabric option eliminates the dependency on dual x16 PCIe lanes on the motherboard and allows support for a denser configuration, with two QSFP connectors on a single carrier circuit board.  For continued support of both processors, the PowerEdge C6320p server will retain the PCIe signals to the PCIe slots.  Inserting the KNL-F processor will disable these signals and expose a connector supporting two QSFP ports carried on an optional adapter that uses the same PCIe x16 slot for power.

    Additional improvements to the PowerEdge C6320p server include support for 64GB LRDIMMs, bumping memory capacity to 384GB, and support for the LSI 2008 RAID controller via the PCIe x4 mezzanine slot.

    Current HPC solution offers from Dell EMC

    Dell EMC offers several HPC solutions optimized for customer usage and priorities.  Domain-specific HPC compute solutions from Dell EMC include the following scalable options:

    • HPC System for Life Sciences – A customizable and scalable system optimized for the needs of researchers in the biological sciences.
    • HPC System for Manufacturing – A customizable and scalable system designed and configured specifically for engineering and manufacturing solutions including design simulation, fluid dynamics, or structural analysis.
    • HPC System for Research – A highly configurable and scalable platform for supporting a broad set of HPC-related workloads and research users.

    For HPC storage needs, Dell EMC offers two high performance, scalable, and robust options:

    • Dell EMC HPC Lustre Storage - This enterprise solution handles big data and high-performance computing demands with a balanced configuration — designed for parallel input/output — and no single point of failure.
    • Dell EMC HPC NFS Storage Solution – Provides high data throughput, flexible, reliable, and hassle-free storage.

    Summary

    The Dell EMC HPC System for Research, an ideal HPC platform for IT administrators serving diverse and expanding user demands, now supports KNL-F, with its improved cost and power efficiencies, eliminating the need to purchase and power discrete Omni-Path adapters. 

    Dell EMC is the industry leader in HPC computing, and we are committed to delivering increased capabilities and performance in partnership with Intel and other technology leaders in the HPC community.   To learn more about Dell EMC HPC solutions and services, visit us online.

    http://www.dell.com/en-us/work/learn/high-performance-computing

    http://en.community.dell.com/techcenter/high-performance-computing/

    www.dellhpc.org/

  • Virtualized HPC Performance with VMware vSphere 6.5 on a Dell PowerEdge C6320 Cluster

    This article presents performance comparisons of several typical MPI applications — LAMMPS, WRF, OpenFOAM, and STAR-CCM+ — running on a traditional, bare-metal HPC cluster versus a virtualized cluster running VMware’s vSphere virtualization platform. The tests were performed on a 32-node, EDR-connected Dell PowerEdge C6320 cluster, located in the Dell EMC HPC Innovation Lab in Austin, Texas. In addition to performance results, virtual cluster architecture and configuration recommendations for optimal performance are described.

    Why HPC virtualization

    Interest in HPC virtualization and cloud computing has grown rapidly. While much of the interest stems from the general value of cloud technologies, there are specific benefits to virtualizing HPC and supporting it in a cloud environment, such as centralized operation, cluster resource sharing, research environment reproducibility, multi-tenant data security, fault isolation and resiliency, dynamic load balancing, and efficient power management. Figure 1 illustrates several HPC virtualization benefits.

    Despite the potential benefits of moving HPC workloads to a private, public, or hybrid cloud, performance concerns have been a barrier to adoption. We focus here on the use of on-premises, private clouds for HPC — environments in which appropriate tuning can be applied to deliver maximum application performance. HPC virtualization performance is primarily determined by two factors; hardware virtualization support and virtual infrastructure capability. With advances in both VMware vSphere as well as x86 microprocessor architecture, throughput applications can generally run at close to full speed in the VMware virtualized environment — with less than 5% performance degradation compared to native, and often just 1 – 2% [1]. MPI applications by nature are more challenging, requiring sustained and intensive communication between nodes, making them sensitive to interconnect performance. With our continued performance optimization efforts, we see decreasing overheads running these challenging HPC workloads [2] and this blog post presents some MPI results as examples.

    Figure 1: Illustration of several HPC virtualization benefits

    Testbed Configuration

    As illustrated in Figure 2, the testbed consists of 32 Dell PowerEdge C6320 compute nodes and one management node. vCenter [3], the vSphere management component, as well as NFS and DNS are running in virtual machines (VMs) on the management node. VMware DirectPath I/O technology [4] (i.e., passthrough mode) is used to allow the guest OS (the operating system running within a VM) to directly access the EDR InfiniBand device, which shortens the message delivery path by bypassing the network virtualization layer to deliver best performance. Native tests were run using CentOS on each host, while virtual tests were run with the VMware ESXi hypervisor running on each host along with a single virtual machine running the same CentOS version.

    Figure 2: Testbed Virtual Cluster Architecture

    Table 1 shows all cluster hardware and software details, and Table 2 shows a summary of BIOS and vSphere settings.

    Table 1: Cluster Hardware and Software Details

    Hardware

    Platform                                 Dell PowerEdge C6320

    Processor                                Dual 10-core Intel Xeon E5-2660 v3 processors @2.6GHz (Haswell)

    Memory                                   128GB DDR4

    Interconnect                             Mellanox ConnectX-4 VPI adapter card; EDR IB (100Gb/s)

    Software

    VMware vSphere

    ESXi hypervisor                          6.5

    vCenter management server                6.5

    BIOS, Firmware and OS

    BIOS                                     1.0.3

    Firmware                                 2.23.23.21

    OS Distribution (virtual and native)     CentOS 7.2

    Kernel                                   3.10.0-327.el7.x86_64

    OFED and MPI

    OFED                                     MLNX_OFED_LINUX-3.4-1.0.0.0

    Open MPI (LAMMPS, WRF and OpenFOAM)      1.10.5a1

    Intel MPI (STAR-CCM+)                    5.0.3.048

    Benchmarks

    LAMMPS                                   v20Jul16

    WRF                                      v3.8.1

    OpenFOAM                                 v1612+

    STAR-CCM+                                v11.04.012

     

    Table 2: BIOS and vSphere Settings

    BIOS Settings

    Hardware-assisted virtualization         Enabled

    Power profile                            Performance Per Watt (OS)

    Logical processor                        Enabled

    Node interleaving                        Disabled (default)

    vSphere Settings

    ESXi power policy                        Balanced (default)

    DirectPath I/O                           Enabled for EDR InfiniBand

    VM size                                  20 virtual CPUs, 100GB memory

    Virtual NUMA topology (vNUMA)            Auto detected (default)

    Memory reservation                       Fully reserved

    CPU Scheduler affinity                   None (default)

    Results

    Figures 3-6 show native versus virtual performance ratios with the settings in Table 2 applied. A value of 1.0 means that virtual performance is identical to native. Applications were benchmarked using a strong scaling methodology — problem sizes remained constant as job sizes were scaled. In the Figure legends, ‘nXnpY’ indicates a test run on X nodes using a total of Y MPI ranks. Benchmark problems were selected to achieve reasonable parallel efficiency at the largest scale tested. All MPI processes were consecutively mapped from node 1 to node 32.
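
    The ratios plotted in Figures 3-6 are simply virtual performance divided by native performance at the same node and rank count. A minimal helper for that bookkeeping is sketched below, assuming the performance metric is a higher-is-better rate such as ns/day; the inputs in the usage line are hypothetical.

        # Minimal helper for the ratios plotted in Figures 3-6.
        # `native` and `virtual` are higher-is-better rates at the same node/rank count.
        def virtual_to_native_ratio(native, virtual):
            ratio = virtual / native          # 1.0 means virtual performance equals native
            overhead_pct = (1.0 - ratio) * 100
            return ratio, overhead_pct

        ratio, overhead = virtual_to_native_ratio(native=100.0, virtual=94.0)   # hypothetical rates
        print(f"ratio {ratio:.2f}, overhead {overhead:.0f}%")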

    As can be seen from the results, the majority of tests show degradations under 5%, though overhead increases with scale. At the highest scale tested (n32np640), performance degradation varies by application and benchmark problem, with the largest degradation seen with LAMMPS atomic fluid (25%) and the smallest with STAR-CCM+ EmpHydroCyclone_30M (6%). Single-node STAR-CCM+ results are anomalous and currently under study. As we continue our performance optimization work, we expect to report better and more scalable results in the future.

    Figure 3: LAMMPS native vs. virtual performance. Higher is better.

    Figure 4: WRF native vs. virtual performance. Higher is better.

     

    Figure 5: OpenFOAM native vs. virtual performance. Higher is better.

    Figure 6: STAR-CCM+ native vs. virtual performance. Higher is better.

    Best Practices

    The following configurations are suggested to achieve optimal virtual performance for HPC. For more comprehensive vSphere performance guidance, please see [5] and [6].

    BIOS:

    • Enable hardware-assisted virtualization features, e.g. Intel VT.
    • Enable logical processors. Although logical processors (hyper-threading) usually do not help HPC performance, enable the option but configure the virtual CPUs (vCPUs) of a VM to each use a physical core, leaving the extra logical cores for ESXi hypervisor helper threads to run.
    • Configure BIOS settings to allow ESXi the most flexibility in using power management features. To allow ESXi to control power-saving features, set the power policy to the “OS Controlled” profile.
    • Leave node interleaving disabled to let the ESXi hypervisor detect NUMA and apply NUMA optimizations.

    vSphere:

    • Configure EDR InfiniBand in DirectPath I/O mode for each VM
    • Properly size VMs:

    MPI workloads are CPU-heavy and can make use of all cores, thus requiring a large VM. However, CPU or memory overcommit would greatly impact performance. In our tests, each VM is configured with 20vCPUs, using all physical cores, and 100 GB fully reserved memory, leaving some free memory to consume ESXi hypervisor memory overhead.

    • ESXi power management policy:

    There are four ESXi power management policies: “High Performance”, “Balanced” (default), “Low Power” and “Custom”. Although the “High Performance” policy can slightly increase the performance of latency-sensitive workloads, it prevents the system from entering C/C1E states; in situations where the system load is low enough for Turbo to operate, this reduces the Turbo boost benefit. The “Balanced” power policy reduces host power consumption while having little or no impact on performance, so it is recommended to use this default.

    • Virtual NUMA

    Virtual NUMA (vNUMA) exposes NUMA topology to the guest OS, allowing NUMA-aware OSes and applications to make efficient use of the underlying hardware. This is an out-of-the-box feature in vSphere.

    Conclusion and Future Work

    Virtualization holds promise for HPC, offering new capabilities and increased flexibility beyond what is available in traditional, unvirtualized environments. These values are only useful, however, if high performance can be maintained. In this short post, we have shown that performance degradations for a range of common MPI applications can be kept under 10%, with our highest scale testing showing larger slowdowns in some cases. With throughput applications running at very close to native speeds, and with the results shown here, it is clear that virtualization can be a viable and useful approach for a variety of HPC use-cases. As we continue to analyze and address remaining sources of performance overhead, the value of the approach will only continue to expand.

    If you have any technical questions regarding VMware HPC virtualization, please feel free to contact us!

    Acknowledgements

    These results have been produced in collaboration with our Dell Technology colleagues in the Dell EMC HPC Innovation Lab who have given us access to the compute cluster used to produce these results and to continue our analysis of remaining performance overheads.

    References

    1. J. Simons, E. DeMattia, and C. Chaubal, “Virtualizing HPC and Technical Computing with VMware vSphere,” VMware Technical White Paper, http://www.vmware.com/files/pdf/techpaper/vmware-virtualizing-hpc-technical-computing-with-vsphere.pdf.
    2. N.Zhang, J.Simons, “Performance of RDMA and HPC Applications in Virtual Machines using FDR InfiniBand on VMware vSphere,” VMware Technical White Paper, http://www.vmware.com/files/pdf/techpaper/vmware-fdr-ib-vsphere-hpc.pdf.
    3. vCenter Server for vSphere Management, VMware Documentation, http://www.vmware.com/products/vcenter-server.html
    4. DirectPath I/O, VMware Documentation, http://tpub-review.eng.vmware.com:8080/vsphere-65/index.jsp#com.vmware.vsphere.networking.doc/GUID-BF2770C3-39ED-4BC5-A8EF-77D55EFE924C.html
    5. VMware Performance Team, "Performance Best Practices for VMware vSphere 6.0," VMware Technical White Paper, https://www.vmware.com/content/***/digitalmarketing/vmware/en/pdf/techpaper/vmware-perfbest-practices-vsphere6-0-white-paper.pdf.
    6. Bhavesh Davda, "Best Practices for Performance Tuning of Latency-Sensitive Workloads in vSphere VMs," VMware Technical White Paper, http://www.vmware.com/techpapers/2011/best-practices-for-performance-tuning-of-latency-s-10220.html.

    Na Zhang is a member of the technical staff working on HPC within VMware’s Office of the CTO. Her current focus is on the performance and solutions of HPC virtualization. Na has a Ph.D. in Applied Mathematics from Stony Brook University. Her research primarily focused on the design and analysis of parallel algorithms for large- and multi-scale simulations running on supercomputers.

  • Deep Learning Inference on P40 GPUs

    Authors: Rengan Xu, Frank Han and Nishanth Dandapanthu. Dell EMC HPC Innovation Lab. Mar. 2017

    Introduction to P40 GPU and TensorRT

    Deep Learning (DL) has two major phases: training and inference (also called testing or scoring). The training phase builds a deep neural network (DNN) model from a large amount of existing data, and the inference phase uses the trained model to make predictions from new data. Inference can be done in the data center, in embedded systems, and in automotive and mobile devices, among others. Usually inference must respond to user requests as quickly as possible (often in real time). To meet the low-latency requirement of inference, NVIDIA® launched the Tesla® P4 and P40 GPUs. Aside from high floating point throughput and efficiency, both GPUs introduce two new optimized instructions designed specifically for inference computations: the 8-bit integer (INT8) 4-element vector dot product (DP4A) and the 16-bit 2-element vector dot product (DP2A) instructions. Deep learning researchers have found that FP16 can achieve the same inference accuracy as FP32, and many applications only require INT8 or lower precision to keep an acceptable inference accuracy. The Tesla P4 delivers a peak of 21.8 INT8 TIOP/s (Tera Integer Operations per Second), while the P40 delivers a peak of 47.0 INT8 TIOP/s. This blog focuses only on the P40 GPU.

    TensorRT™, previously called GIE (GPU Inference Engine), is a high performance deep learning inference engine for production deployment of deep learning applications; it maximizes inference throughput and efficiency. TensorRT provides users the ability to take advantage of the fast reduced precision instructions available in Pascal GPUs. TensorRT v2 supports the INT8 reduced precision operations that are available on the P40.

    Testing Methodology

    This blog quantifies the performance of deep learning inference using TensorRT on Dell’s PowerEdge C4130 server, which is equipped with 4 Tesla P40 GPUs. Since TensorRT is only available for the Ubuntu OS, all the experiments were done on Ubuntu. Table 1 shows the hardware and software details. The inference benchmark we used was giexec from the TensorRT sample codes. Synthetic images, filled with random non-zero numbers to simulate real images, were used in this sample code. Two classic neural networks were tested: AlexNet (2012 ImageNet winner) and GoogLeNet (2014 ImageNet winner), which is much deeper and more complicated than AlexNet.

    We measured the inference performance in images/sec, which is the number of images that can be processed per second. To measure the performance improvement of the current generation P40 GPU, we also compared its performance with the previous generation M40 GPU. The most important goal of this testing is to measure the inference performance in INT8 mode, compared to FP32 mode. The P40 uses the new Pascal architecture and supports the new INT8 instructions, while the previous generation M40 uses the Maxwell architecture and does not support INT8 instructions. The theoretical performance of INT8 and FP32 for both the M40 and P40 is shown in Table 2. We measured FP32 performance on both devices, and both FP32 and INT8 performance on the P40.

    Table 1: Hardware configuration and software details

    Platform                     PowerEdge C4130 (configuration G)

    Processor                    2 x Intel Xeon CPU E5-2690 v4 @2.6GHz (Broadwell)

    Memory                       256GB DDR4 @ 2400MHz

    Disk                         400GB SSD

    GPU                          4x Tesla P40 with 24GB GPU memory

    Software and Firmware

    Operating System             Ubuntu 14.04

    BIOS                         2.3.3

    CUDA and driver version      8.0.44 (375.20)

    TensorRT Version             2.0 EA


    Table 2: Comparison between Tesla M40 and P40

                                 Tesla M40          Tesla P40

    INT8 (TIOP/s)                N/A                47.0

    FP32 (TFLOP/s)               6.8                11.8


    Performance Evaluation

    In this section, we will present the inference performance with TensorRT on GoogLeNet and AlexNet. We also implemented the benchmark with MPI so that it can be run on multiple P40 GPUs within a node. We will also compare the performance of P40 with M40. Lastly we will show the performance impact when using different batch sizes.

    Figure 1 shows the inference performance with the TensorRT library for both GoogLeNet and AlexNet. We can see that INT8 mode is ~3x faster than FP32 for both neural networks. This is expected, since the theoretical speedup of INT8 over FP32 is 4x if only multiplications are performed and no other overhead is incurred. However, there are kernel launches, occupancy limits, data movement, and math other than multiplications, so the observed speedup is reduced to about 3x.


    Figure 1: Inference performance with TensorRT library

    Dell’s PowerEdge C4130 supports up to 4 GPUs in a server. To make use of all GPUs, we implemented the inference benchmark using MPI so that each MPI process runs on a separate GPU. Figure 2 and Figure 3 show the multi-GPU inference performance on GoogLeNet and AlexNet, respectively. When using multiple GPUs, linear speedup was achieved for both neural networks. This is because each GPU processes its own images and there is no communication or synchronization among the GPUs.


    Figure 2: Multi-GPU inference performance with TensorRT GoogLeNet


    Figure 3: Multi-GPU inference performance with TensorRT AlexNet

    To highlight the performance advantage of the P40 GPU and its native support for INT8, we compared the inference performance of the P40 with the previous generation M40 GPU. The results are shown in Figure 4 and Figure 5 for GoogLeNet and AlexNet, respectively. In FP32 mode, the P40 is 1.7x faster than the M40, and INT8 mode on the P40 is 4.4x faster than FP32 mode on the M40.


    Figure 4: Inference performance comparison between P40 and M40


    Figure 5: Inference performance comparison between P40 and M40

    Deep learning inference can be applied in different scenarios. Some scenarios require a large batch size, while others require no batching at all (i.e., a batch size of 1). Therefore we also measured the performance at different batch sizes; the result is shown in Figure 6. Note that the purpose here is not to compare the performance of GoogLeNet and AlexNet, but to check how performance changes with batch size for each neural network. It can be seen that without batch processing the inference performance is very low, because the GPU is not given enough work to keep it busy. The larger the batch size, the higher the inference performance, although the rate of improvement slows at larger sizes. With a batch size of 4096, GoogLeNet stopped running because the GPU memory required for this neural network exceeds the GPU memory limit, while AlexNet was able to run because it is a less complicated neural network than GoogLeNet and therefore requires less GPU memory. So the largest usable batch size is limited only by GPU memory.


    Figure 6: Inference performance with different batch sizes
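
    The images/sec metric in Figure 6 follows directly from the batch size and the measured time per batch, which is also why very small batches cannot keep the GPU busy. The sketch below only shows that conversion; the latencies are hypothetical, and the actual measurements in this study came from the giexec sample.

        # Converting per-batch latency into inference throughput (images/sec).
        # The latencies below are hypothetical placeholders, not measured values.
        def images_per_second(batch_size, seconds_per_batch):
            return batch_size / seconds_per_batch

        for batch, latency in [(1, 0.002), (16, 0.009), (128, 0.050)]:   # hypothetical measurements
            print(f"batch {batch:4d}: {images_per_second(batch, latency):8.0f} images/sec")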

    Conclusions and Future Work

    In this blog, we presented deep learning inference performance with the NVIDIA® TensorRT library on P40 and M40 GPUs. INT8 mode on the P40 is about 3x faster than FP32 mode on the P40, and 4.4x faster than FP32 mode on the previous generation M40 GPU. Multiple GPUs increase inference performance linearly because there is no communication or synchronization between them. We also observed that a larger batch size leads to higher inference performance and that the largest batch size is limited only by GPU memory size. In future work, we will evaluate inference performance with real-world deep learning applications.


  • Application Performance on P100-PCIe GPUs

    Authors: Rengan Xu, Frank Han and Nishanth Dandapanthu. Dell EMC HPC Innovation Lab. Feb 2017

    Introduction to P100-PCIe GPU

    This blog describes a performance analysis of NVIDIA® Tesla® P100™ GPUs on a cluster of Dell PowerEdge C4130 servers. There are two types of P100 GPUs: PCIe-based and SXM2-based. In PCIe-based servers, GPUs are connected by PCIe buses and one P100 delivers around 4.7 and 9.3 TeraFLOPS of double and single precision performance, respectively. With P100-SXM2, GPUs are connected by NVLink and one P100 delivers around 5.3 and 10.6 TeraFLOPS of double and single precision performance, respectively. This blog focuses on the P100 for PCIe-based servers, i.e. P100-PCIe. We have already analyzed P100 performance for several deep learning frameworks in this blog. The objective here is to compare the performance of HPL, LAMMPS, NAMD, GROMACS, HOOMD-blue, Amber, ANSYS Mechanical and RELION. The hardware configuration of the cluster is the same as in the deep learning blog: briefly, we used a cluster of four C4130 nodes, each with dual Intel Xeon E5-2690 v4 CPUs and four NVIDIA P100-PCIe GPUs, with all nodes connected by EDR InfiniBand. Table 1 shows detailed information about the hardware and software used in every compute node.

     

    Table 1: Experiment Platform and Software Details

    Platform: PowerEdge C4130 (configuration G)
    Processor: 2 x Intel Xeon E5-2690 v4 @ 2.6 GHz (Broadwell)
    Memory: 256 GB DDR4 @ 2400 MHz
    Disk: 9 TB HDD
    GPU: P100-PCIe with 16 GB GPU memory
    Node Interconnect: Mellanox ConnectX-4 VPI (EDR 100 Gb/s InfiniBand)
    InfiniBand Switch: Mellanox SB7890

    Software and Firmware
    Operating System: RHEL 7.2 x86_64
    Linux Kernel Version: 3.10.0-327.el7
    BIOS: Version 2.3.3
    CUDA Version and Driver: CUDA 8.0.44 (375.20)
    OpenMPI: 2.0.1
    GCC: 4.8.5
    Intel Compiler: 2017.0.098

    Applications
    HPL: hpl_cuda_8_ompi165_gcc_485_pascal_v1
    LAMMPS: Lammps-30Sep16
    NAMD: NAMD_2.12_Source
    GROMACS: 2016.1
    HOOMD-blue: 2.1.2
    Amber: 16update7
    ANSYS Mechanical: 17.0
    RELION: 2.0.3


    High Performance Linpack (HPL)

    HPL is a parallel benchmark that measures how fast computers solve a dense n-by-n system of linear equations using LU decomposition with partial row pivoting, and it is designed to run at very large scale. The HPL runs on this cluster used double precision floating point operations. Figure 1 shows the HPL performance on the tested P100-PCIe cluster. One P100 is 3.6x faster than 2 x E5-2690 v4 CPUs, and HPL scales very well with more GPUs within a node or across nodes. Recall that four P100 fit in one server, so 8, 12 and 16 P100 correspond to 2, 3 and 4 servers. 16 P100 GPUs achieve a speedup of 14.9x compared to 1 P100. The overall efficiency is calculated as HPL Efficiency = rMax / (CPU rPeak + GPU rPeak), where rPeak is the highest theoretical FLOPS that could be achieved at base clock and rMax is the actual performance reported by HPL. HPL cannot run at the maximum boost clock; it typically runs at some clock in between, with an average closer to the base clock than to the maximum boost clock, which is why we used the base clock for the rPeak calculation. Although we included the CPU rPeak in the efficiency calculation, we set DGEMM_SPLIT=1.0 when running HPL on P100, which means the CPUs do not really contribute to DGEMM; they only handle other overhead and therefore contribute few FLOPS. We observed that the CPUs stayed fully utilized, but they were only handling overhead and data movement to keep the GPUs fed. The key result for the P100 is its very large rMax.
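    As a quick illustration of the efficiency bookkeeping above, the Python sketch below computes rPeak from the base clock and plugs in a placeholder rMax. The per-GPU 4.7 TFLOPS figure comes from the P100-PCIe description earlier in this blog; the rMax value and the resulting efficiency are hypothetical, not results from this cluster.

```python
# A minimal sketch of the HPL efficiency calculation described above.
# GPU rPeak uses the ~4.7 TFLOPS double-precision figure quoted for P100-PCIe;
# CPU rPeak is computed from the base clock; the rMax value below is a placeholder,
# not a measured result from this cluster.

def cpu_rpeak_tflops(sockets=2, cores=14, base_ghz=2.6, dp_flops_per_cycle=16):
    # E5-2690 v4: 14 cores per socket; AVX2 FMA gives 16 DP FLOPS per cycle per core.
    return sockets * cores * base_ghz * dp_flops_per_cycle / 1000.0

def hpl_efficiency(rmax_tflops, n_gpus, gpu_rpeak_tflops=4.7):
    rpeak = cpu_rpeak_tflops() + n_gpus * gpu_rpeak_tflops
    return rmax_tflops / rpeak

# Hypothetical example: a 4-GPU node reporting an rMax of 15 TFLOPS.
print(f"Efficiency: {hpl_efficiency(15.0, n_gpus=4):.1%}")
```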

    Figure 1: HPL performance on P100-PCIe

     
    NAMD

    NAMD (NAnoscale Molecular Dynamics) is a molecular dynamics application designed for high-performance simulation of large biomolecular systems. The dataset we used is Satellite Tobacco Mosaic Virus (STMV), a small, icosahedral plant virus that worsens the symptoms of infection by Tobacco Mosaic Virus (TMV). This dataset has 1,066,628 atoms and is the largest dataset on the NAMD utilities website. The performance metric in the application's output log is "days/ns" (lower is better), but we plot its inverse, "ns/day", since that is what most molecular dynamics users focus on; the average of all occurrences of this value in the output log was used. Figure 2 shows the performance within one node. The performance with 2 P100 is better than with 4 P100, probably because of the communication among CPU threads: the application launches a set of worker threads that handle computation and communication threads that handle data exchange, and as more GPUs are used, more communication threads and more synchronization are needed. In addition, profiling with NVIDIA's CUDA profiler nvprof showed that with 1 P100 the GPU computation takes less than 50% of the total application time, so by Amdahl's law the speedup from adding GPUs is limited by the remaining 50% of the work that is not offloaded to the GPU. Based on this observation, we also ran the application on multiple nodes with two different settings (2 GPUs/node and 4 GPUs/node); the result is shown in Figure 3. No matter how many nodes are used, 2 GPUs/node is always faster than 4 GPUs/node. Within a node, 2 P100 GPUs are 9.5x faster than dual CPUs.
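    For readers who want to reproduce the metric handling described above, here is a minimal Python sketch that averages every "days/ns" occurrence in a NAMD log and inverts it to ns/day. The regular expression assumes the value appears directly before the text "days/ns"; the exact log format can differ between NAMD versions, so treat this as illustrative.

```python
# A small helper sketch for the metric conversion described above: NAMD reports
# "days/ns", and we average all occurrences and invert to get ns/day.
# The regex assumes lines contain a floating-point value followed by "days/ns";
# log formats vary between NAMD versions, so treat this as illustrative.
import re

DAYS_PER_NS = re.compile(r"([0-9]*\.?[0-9]+)\s*days/ns")

def ns_per_day(logfile_path):
    values = []
    with open(logfile_path) as log:
        for line in log:
            match = DAYS_PER_NS.search(line)
            if match:
                values.append(float(match.group(1)))
    if not values:
        raise ValueError("no days/ns entries found in log")
    avg_days_per_ns = sum(values) / len(values)
    return 1.0 / avg_days_per_ns   # invert: ns/day (higher is better)

# Usage (hypothetical log file name):
# print(f"{ns_per_day('stmv_p100.log'):.2f} ns/day")
```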

     Figure 2: NAMD Performance within 1 P100-PCIe node

     Figure 3: NAMD Performance across Nodes

    GROMACS

    GROMACS (GROningen MAchine for Chemical Simulations) primarily simulates biochemical molecules (bonded interactions), but because of its efficiency in calculating non-bonded interactions (atoms not linked by covalent bonds), its user base is expanding to non-biological systems. Figure 4 shows the performance of GROMACS on CPUs, K80 GPUs and P100-PCIe GPUs. Since one K80 card contains two internal GPUs, from now on "one K80" always refers to both internal GPUs. For the K80 tests, the same servers used for the P100-PCIe tests were used, so the CPUs and memory were identical and the only difference was that the P100-PCIe GPUs were replaced with K80 GPUs. In all tests there were four GPUs per server and all GPUs were utilized; for example, the 3-node data point uses 3 servers and 12 GPUs in total. P100-PCIe is 4.2x to 2.8x faster than CPU-only from 1 node to 4 nodes, and 1.5x to 1.1x faster than K80 from 1 node to 4 nodes.

      Figure 4: GROMACS Performance on P100-PCIe

    LAMMPS

    LAMMPS (Large-scale Atomic/Molecular Massively Parallel Simulator) is a classical molecular dynamics code capable of simulating solid-state materials (metals, semiconductors), soft matter (biomolecules, polymers) and coarse-grained or mesoscopic systems. It can be used to model atoms or, more generically, as a parallel particle simulator at the atomic, meso or continuum scale. The dataset we used was LJ (the Lennard-Jones liquid benchmark), which contains 512,000 atoms. LAMMPS has two GPU implementations, the GPU package and the Kokkos package; in our experiments we used the Kokkos version since it was much faster than the GPU package.

    Figure 5 shows LAMMPS performance on CPUs and P100-PCIe GPUs. Using 16 P100 GPUs is 5.8x faster than using 1 P100. The application does not scale linearly because the data transfer time (CPU->GPU, GPU->CPU and GPU->GPU) increases as more GPUs are used, even though the computation time shrinks linearly. The data transfer time grows because this application requires communication among all of the GPUs used, while configuration G only allows Peer-to-Peer (P2P) access for two pairs of GPUs: GPU 1 - GPU 2 and GPU 3 - GPU 4. GPU 1/2 cannot communicate with GPU 3/4 directly; any such communication must go through the CPU, which slows it down. Configuration B eases this issue because it allows P2P access among all four GPUs within a node. The comparison between configuration G and configuration B is shown in Figure 6. Running LAMMPS on a configuration B server with 4 P100 improved the performance metric "timesteps/s" to 510 from 505 on configuration G, a 1% improvement. The improvement is small because data communication takes less than 8% of the total application time when running on configuration G with 4 P100. Figure 7 also compares P100-PCIe with CPUs and K80 GPUs for this application: within one node, 4 P100-PCIe are 6.6x faster than 2 E5-2690 v4 CPUs and 1.4x faster than 4 K80 GPUs.
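    A quick Amdahl-style check shows why the configuration B gain is so small for LAMMPS: with communication at roughly 8% of the runtime on configuration G, even an essentially free GPU-to-GPU path would cap the overall gain at about 9%. The sketch below uses the 8% fraction and the 505 vs 510 timesteps/s numbers from the text; the assumed P2P speedup factors are hypothetical.

```python
# Back-of-the-envelope (Amdahl-style) bound on the benefit of faster P2P transfers.
# comm_fraction and the measured rates come from the text above; the P2P speedup
# factors are hypothetical.

def overall_speedup(comm_fraction, comm_speedup):
    return 1.0 / ((1.0 - comm_fraction) + comm_fraction / comm_speedup)

comm_fraction = 0.08                      # <8% of runtime spent communicating on config G
for s in (2.0, 4.0, 1000.0):              # 1000x ~ communication effectively free
    print(f"P2P {s:g}x faster -> overall {overall_speedup(comm_fraction, s):.3f}x")

# Measured with 4 P100: 510 vs 505 timesteps/s, i.e. about a 1% gain,
# well within the bound above.
print(f"measured: {510 / 505 - 1:.1%}")
```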

     Figure 5: LAMMPS Performance on P100-PCIe

     

      

    Figure 6: Comparison between Configuration G and Configuration B


    Figure 7: LAMMPS Performance Comparison

    HOOMD-blue

    HOOMD-blue (Highly Optimized Object-oriented Many-particle Dynamics - Blue Edition) is a general-purpose molecular dynamics simulator. Figure 8 shows the HOOMD-blue performance; note that the y-axis is in logarithmic scale. One P100 is 13.4x faster than dual CPUs, and using 2 P100 gives a reasonable 1.5x speedup over 1 P100. However, from 4 P100 to 16 P100 the speedup only grows from 2.1x to 3.9x, which is not high. The reason is that, like LAMMPS, this application also involves a lot of communication among all of the GPUs used. Based on the LAMMPS analysis, configuration B should reduce this communication bottleneck significantly. To verify this, we ran the application again on a configuration B server: with 4 P100, the performance metric "hours for 10e6 steps" dropped to 10.2 from 11.73 on configuration G, a 13% performance improvement, and the speedup over 1 P100 improved from 2.1x to 2.4x.
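    The configuration B numbers above follow directly from the "hours for 10e6 steps" metric, which is a time and therefore lower is better. The small calculation below simply re-derives the quoted 13% improvement and 2.4x speedup from the values reported in the text.

```python
# Quick arithmetic behind the configuration B numbers quoted above for HOOMD-blue.
# All input values are taken from the text; nothing new is introduced.

time_g, time_b = 11.73, 10.2        # hours for 10e6 steps with 4 P100 (config G vs B)
speedup_g_vs_1gpu = 2.1             # reported speedup of 4 P100 vs 1 P100 on config G

improvement = 1 - time_b / time_g                          # ~13% less time on config B
speedup_b_vs_1gpu = speedup_g_vs_1gpu * time_g / time_b    # ~2.4x vs 1 P100

print(f"config B improvement: {improvement:.0%}")
print(f"config B speedup vs 1 P100: {speedup_b_vs_1gpu:.1f}x")
```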

     Figure 8: HOOMD-blue Performance on CPU and P100-PCIe

    Amber

    Amber is the collective name for a suite of programs that allow users to carry out molecular dynamics simulations, particularly on biomolecules; the term Amber is also used for the empirical force fields implemented in the suite. Figure 9 shows the performance of Amber on CPUs and P100-PCIe. One P100 is 6.3x faster than dual CPUs, and 2 P100 GPUs are 1.2x faster than 1 P100. However, performance drops significantly when 4 or more GPUs are used. The reason is that, like LAMMPS and HOOMD-blue, this application relies heavily on P2P access, which configuration G supports only between two pairs of GPUs. We verified this by testing the application on a configuration B node: the performance with 4 P100 improved to 791 ns/day from 315 ns/day on configuration G, a 151% improvement and a 2.5x speedup over 1 P100. Even on configuration B, however, multi-GPU scaling is still not good. When Amber's multi-GPU support was originally designed, the PCIe bus was Gen2 x16 and the GPUs were C1060s or C2050s; current Pascal-generation GPUs are more than 16x faster than the C1060s, while PCIe bandwidth has only increased by 2x (PCIe Gen2 x16 to Gen3 x16) and InfiniBand interconnects by about the same amount. The Amber website explicitly states: "It should be noted that while the legacy MPI and GPU-Direct methods of multi-GPU communication are still supported, and will be used by the code automatically if peer to peer communication is not available, you are very unlikely to see any speedup by using multiple GPUs for a single job if the GPUs are newer than C2050s. Multi-node runs are almost impossible to get to scale." This is consistent with our multi-node results: as Figure 9 shows, the more nodes are used, the worse the performance.
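    The generational argument above can be made concrete with a simple, purely hypothetical time split. If an older run spent, say, 80% of its time computing and 20% communicating, then making compute 16x faster while the PCIe path only gets 2x faster pushes the communication share from 20% to roughly two thirds of the runtime. The sketch below works through that arithmetic; the 80/20 split is an assumption for illustration, not an Amber measurement.

```python
# An illustrative (hypothetical) look at why PCIe-attached multi-GPU Amber scaling
# degraded across GPU generations: compute got ~16x faster while the PCIe path only
# got ~2x faster, so the communication share of the runtime grows sharply.
# The 80/20 starting split is an assumption for illustration only.

compute_old, comm_old = 80.0, 20.0      # hypothetical time split in the C1060 era
compute_new = compute_old / 16.0        # GPUs ~16x faster (from the text)
comm_new = comm_old / 2.0               # PCIe Gen2 x16 -> Gen3 x16, ~2x faster

old_share = comm_old / (compute_old + comm_old)
new_share = comm_new / (compute_new + comm_new)
print(f"communication share of runtime: {old_share:.0%} -> {new_share:.0%}")
```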

      Figure 9: Amber Performance on CPU and P100-PCIe

    ANSYS Mechanical

     ANSYS® Mechanical software is a comprehensive finite element analysis (FEA) tool for structural analysis, including linear, nonlinear dynamic, hydrodynamic and explicit studies. It provides a complete set of element behaviors, material models and equation solvers for a wide range of mechanical design problems. The finite element method is used to solve the partial differential equations, which is a compute- and memory-intensive task. Our testing focused on the Power Supply Module (V17cg-1) benchmark, a medium-sized job for iterative solvers and a good test of memory bandwidth. Figure 10 shows the performance of ANSYS Mechanical on CPUs and P100-PCIe. Within a node, 4 P100 are 3.8x faster than dual CPUs, and with 4 nodes, 16 P100 are 2.3x faster than 8 CPUs. The figure also shows that performance scales well with more nodes: the speedup with 4 nodes is 2.8x compared to 1 node.

     Figure 10: ANSYS Mechanical Performance on CPU and P100-PCIe

    RELION

    RELION (REgularised LIkelihood OptimisatioN) is a program that employs an empirical Bayesian approach to the refinement of (multiple) 3D reconstructions or 2D class averages in electron cryo-microscopy (cryo-EM). Figure 11 shows the performance of RELION on CPUs and P100-PCIe; note that the y-axis is in logarithmic scale. One P100 is 8.8x faster than dual CPUs. The figure also shows that scaling weakens starting at 4 P100 GPUs. Because of the long execution time we did not profile this application, but the cause of the weak scaling is likely similar to that for LAMMPS, HOOMD-blue and Amber.

     Figure 11: RELION Performance on CPU and P100-PCIe


    Conclusions and Future Work


    In this blog, we presented and analyzed the performance of different applications on Dell PowerEdge C4130 servers with P100-PCIe GPUs. Among the tested applications, HPL, GROMACS and ANSYS Mechanical benefit from the balanced CPU-GPU layout of configuration G because they do not require P2P access among GPUs. LAMMPS, HOOMD-blue, Amber (and possibly RELION), however, rely on P2P access: with configuration G they scale well up to 2 P100 GPUs and then scale weakly with 4 or more. With configuration B they scale better at 4 GPUs, so configuration B is more suitable and recommended for applications that rely on P2P access.

    In future work, we will run these applications on P100-SXM2 and compare the performance of P100-PCIe and P100-SXM2.



  • HPCG Performance study with Intel KNL

    By Ashish Kumar Singh. January 2017 (HPC Innovation Lab)

    This blog presents an in-depth analysis of the High Performance Conjugate Gradient (HPCG) benchmark on the Intel Xeon Phi processor code-named "Knights Landing" (KNL). The analysis was performed on the PowerEdge C6320p platform with the new Intel Xeon Phi 7230 processor.

    Introduction to HPCG and Intel Xeon Phi 7230 processor

    The HPCG benchmark constructs a logically global, physically distributed sparse linear system using a 27-point stencil at each grid point in a 3D domain, such that the equation at point (i, j, k) depends on its own value and those of its 26 surrounding neighbors. The global domain computed by the benchmark is (NRx*Nx) x (NRy*Ny) x (NRz*Nz), where Nx, Ny and Nz are the dimensions of the local subgrid assigned to each MPI process and the number of MPI ranks is NR = NRx x NRy x NRz. These values can be defined in the hpcg.dat file or passed as command line arguments.

    The HPCG benchmark is based on a conjugate gradient solver whose pre-conditioner is a three-level hierarchical multi-grid (MG) method with Gauss-Seidel smoothing. The algorithm starts with MG and contains Symmetric Gauss-Seidel (SymGS) and Sparse Matrix-Vector multiplication (SPMV) routines at each level. Because the data is distributed across nodes, both SymGS and SPMV require data from their neighbors, which is provided by the preceding Exchange Halos routine. The residual, which should be lower than 1e-6, is computed locally by the Dot Product (DDOT) routine, and an MPI_Allreduce follows DDOT to complete the global reduction. WAXPBY simply updates a vector with the sum of two scaled vectors: each output element is calculated by scaling the input vectors with constants and adding the values at the same index. So HPCG has four computational blocks, SPMV, SymGS, WAXPBY and DDOT, and two communication blocks, MPI_Allreduce and Exchange Halos.
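    To make the roles of the simplest blocks concrete, here is a conceptual NumPy sketch of WAXPBY and DDOT as they are described above. This mirrors the definitions in the text rather than the HPCG reference implementation; in the real benchmark the local DDOT result is combined across ranks with MPI_Allreduce.

```python
# A conceptual sketch of the two simplest HPCG compute blocks described above.
# This follows the definitions in the text, not the reference implementation.
import numpy as np

def waxpby(alpha, x, beta, y):
    """Scaled vector addition: w[i] = alpha*x[i] + beta*y[i]."""
    return alpha * x + beta * y

def ddot(x, y):
    """Local dot product; HPCG follows this with an MPI_Allreduce across ranks."""
    return float(np.dot(x, y))

# Tiny usage example
x = np.arange(4, dtype=np.float64)
y = np.ones(4)
print(waxpby(2.0, x, 0.5, y))   # [0.5 2.5 4.5 6.5]
print(ddot(x, y))               # 6.0
```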

    The Intel Xeon Phi processor is a new generation of the Intel Xeon Phi family. Previous generations were available as coprocessors in a PCIe card form factor and required a host Intel Xeon processor. The Intel Xeon Phi 7230 contains 64 cores at a 1.3 GHz base frequency with a turbo speed of 1.5 GHz and 32 MB of L2 cache. It supports up to 384 GB of DDR4-2400 memory and the AVX-512 instruction set. The processor also includes 16 GB of on-package MCDRAM with a sustained memory bandwidth of up to ~480 GB/s as measured by the STREAM benchmark. The Intel Xeon Phi 7230 delivers up to ~1.8 TFLOPS of double precision HPL performance.

    This blog showcases the performance of the HPCG benchmark on the Intel KNL processor and compares it to the performance on the Intel Broadwell E5-2697 v4 processor. The Intel Xeon Phi cluster comprises one PowerEdge R630 head node and 12 PowerEdge C6320p compute nodes, while the Intel Xeon cluster includes one PowerEdge R720 head node and 12 PowerEdge R630 compute nodes. All compute nodes are connected by 100 Gb/s Intel Omni-Path, and each cluster shares the head node's storage over NFS. Detailed information about the clusters is listed in Table 1 below. All HPCG tests on Intel Xeon Phi were performed with the BIOS set to "quadrant" cluster mode and "Memory" memory mode.

    Table 1: Cluster Hardware and Software Details

    Testbed configuration

      

    HPCG Performance analysis with Intel KNL

    Choosing the right problem size for HPCG should follow two rules: the problem size should be large enough that it does not fit in the device's cache, and it should occupy a significant fraction of main memory, at least 1/4 of the total. For HPCG performance characterization, we chose local domain dimensions of 128^3, 160^3 and 192^3 with an execution time of t=30 seconds. The local domain dimension defines the global domain dimension (NRx*Nx) x (NRy*Ny) x (NRz*Nz) described above, where Nx = Ny = Nz is the local dimension and the product NRx x NRy x NRz is the number of MPI processes.

    Figure 1: HPCG Performance comparison with multiple local dimension grid size

    As shown in Figure 1, the local grid size of 160^3 gives the best performance of 48.83 GFLOPS. A problem size larger than 128^3 allows for more parallelism, and 160^3 still fits inside the MCDRAM while 192^3 does not. All of these tests were carried out with 4 MPI processes and 32 OpenMP threads per MPI process on a single Intel KNL server.
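    For readers who want to repeat this kind of sweep, the sketch below writes an hpcg.dat for the 160^3, 30-second case and prints one possible launch line for 4 MPI ranks with 32 OpenMP threads each. The hpcg.dat layout follows the common reference format (two header lines, then the grid dimensions, then the run time), but the exact file format, binary name and mpirun syntax should be checked against your HPCG build; they are assumptions here, not the exact commands used in this study.

```python
# A small sketch that writes an hpcg.dat matching the run described above
# (160^3 local grid, 30-second runs) and prints a possible launch line for
# 4 MPI ranks x 32 OpenMP threads. The file layout, binary name and mpirun
# syntax are assumptions; verify them against your HPCG distribution.

def write_hpcg_dat(nx=160, ny=160, nz=160, seconds=30, path="hpcg.dat"):
    with open(path, "w") as f:
        f.write("HPCG benchmark input file\n")          # header line 1 (free text)
        f.write("Local grid sweep on PowerEdge C6320p\n")  # header line 2 (free text)
        f.write(f"{nx} {ny} {nz}\n")                     # local subgrid dimensions
        f.write(f"{seconds}\n")                          # run time in seconds

write_hpcg_dat()
print("export OMP_NUM_THREADS=32")
print("mpirun -np 4 ./xhpcg")
```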

    Figure 2: HPCG performance comparison with multiple execution time.

    Figure 2 shows HPCG performance at several execution times for the 160^3 grid size on a single Intel KNL server. As the graph shows, HPCG performance does not change with the execution time, so execution time does not appear to be a factor in HPCG performance. This means we may not need to spend hours or days benchmarking large clusters, which saves both time and power. Note, however, that for an official run the execution time reported in the output file should be >=1800 seconds: if you plan to submit results to the TOP500 ranking list, the execution time must be no less than 1800 seconds.

    Figure 3: Time consumed by HPCG computational routines.

    Figure 3 shows the time consumed by each computational routine from 1 to 12 KNL nodes; the time spent in each routine is reported in the HPCG output file, as shown in Figure 4. As the graph shows, HPCG spends most of its time in the compute-intensive SymGS pre-conditioning and the sparse matrix-vector multiplication (SPMV). The vector update phase (WAXPBY) consumes much less time than SymGS, and the residual calculation (DDOT) takes the least time of the four computational routines. Because the local grid size is the same across all multi-node runs, the time spent in the four compute kernels is approximately the same for each run. The output file in Figure 4 reports the performance of all four computational routines; note that MG includes both SymGS and SPMV.

    Figure 4: A slice of HPCG output file

    Performance Comparison

    Here is the multi-node HPCG performance comparison between the Intel Xeon E5-2697 v4 @ 2.3 GHz (Broadwell) processor and the Intel Xeon Phi 7230 (KNL) processor, both with the Intel Omni-Path interconnect.

      

    Figure 5: HPCG performance comparison between Intel Xeon Broadwell processor and Intel Xeon Phi processor

    Figure 5 shows the HPCG performance comparison between dual 18-core Intel Broadwell processors and a single 64-core Intel Xeon Phi processor. The dots in Figure 5 show the performance acceleration of the KNL servers over the dual-socket Broadwell servers. On a single node, HPCG performs 2.23x better on KNL than on Broadwell, and on multi-node runs KNL also shows more than a 100% performance increase over the Broadwell nodes. With 12 Intel KNL nodes, HPCG scales out well and reaches up to ~520 GFLOPS.

    Conclusion

    Overall, HPCG shows ~2x higher performance with the Intel KNL processor in the PowerEdge C6320p than with the Intel Broadwell processor server, and its performance scales out well as more nodes are added. The PowerEdge C6320p platform is therefore a strong choice for HPC applications like HPCG.

    References:

    https://software.sandia.gov/hpcg/doc/HPCG-Specification.pdf

    http://www.hpcg-benchmark.org/custom/index.html?lid=158&slid=281