High Performance Computing Blogs

High Performance Computing

A discussion venue for all things high performance computing (HPC), supercomputing, and the technologies that enable scientific research and discovery.
  • New NVIDIA V100 32GB GPUs, Initial performance results

    Deepthi Cherlopalle, HPC and AI Innovation Lab. June 2018

    GPUs are useful for accelerating large matrix operations, analytics, deep learning workloads and several other use cases. NVIDIA introduced the Pascal line of their Tesla GPUs in 2016, the Volta line of GPUs in 2017, and recently announced their latest Tesla GPU based on the Volta architecture with 32GB of GPU memory. The V100 GPU is available in both PCIe and NVLink versions, allowing GPU-to-GPU communication over PCIe or over NVLink. The NVLink version of the GPU is also called an SXM2 module.

    This blog will give an introduction to the new Volta V100-32GB GPUs and compare the HPL performance between different V100 models. Tests were performed using a Dell EMC PowerEdge C4140 with both PCIe and SXM2 configurations.  There are several other platforms which support GPUs:  PowerEdge R740, PowerEdge R740XD, PowerEdge R840, and PowerEdge R940xa.  A similar study was conducted in the past comparing the performance of the P100 and V100 GPUs with the HPL, HPCG, AMBER, and LAMMPS applications. 

    Table 1 below provides an overview of Volta device specifications.

    Table 1: GPU Specifications

                                      Tesla V100-PCIe    Tesla V100-SXM2
    GPU Architecture                  Volta              Volta
    NVIDIA Tensor Cores               640                640
    NVIDIA CUDA Cores                 5120               5120
    GPU Max Clock Rate                1380 MHz           1530 MHz
    Double Precision Performance      7 TFLOPS           7.8 TFLOPS
    Single Precision Performance      14 TFLOPS          15.7 TFLOPS
    GPU Memory                        16/32 GB           16/32 GB
    Interconnect Bandwidth            32 GB/s            300 GB/s
    System Interface                  PCIe Gen3          NVIDIA NVLink
    Max Power Consumption             250 W              300 W

    The PowerEdge C4140 server is an accelerator-optimized server with support for two Intel Xeon Scalable processors and four NVIDIA Tesla GPUs (PCIe or NVLink) in a 1U form factor. The PCIe version of the GPUs is supported with standard PCIe Gen3 connections between GPU and CPU. The NVLink configuration allows GPU-to-GPU communication over the NVLink interconnect. Applications that can take advantage of the higher NVLink bandwidth and the higher clock rate of the V100-SXM2 module can benefit from this option. The PowerEdge C4140 platform is available in four different configurations: B, C, K, and G. The configurations are distinct in their PCIe lane layout and NVLink capability and are shown in Figure 1 through Figure 4.

    In Configuration B, GPU-to-GPU communication is through a PCIe switch, and the PCIe switch is connected to a single CPU. In Configurations C and G, two GPUs are connected to each CPU; however, in Configuration C the two GPUs are directly connected to each CPU, whereas in Configuration G the GPUs are connected to the CPU via a PCIe switch. The PCIe switch in Configuration G is logically divided into two virtual switches, mapping two GPUs to each CPU. In Configuration K, GPU-to-GPU communication is over NVLink, with all GPUs connected to a single CPU. As seen in the figures below, all the configurations have additional x16 slots available apart from the GPU slots.

    Figure 1: PowerEdge C4140 Configuration B                                          Figure 2: PowerEdge C4140 Configuration C

    Figure 3: PowerEdge C4140 Configuration G                                         Figure 4: PowerEdge C4140 Configuration K

    The PowerEdge C4140 platform can support a variety of Intel Xeon CPU models, up to 1.5 TB of memory with 24 DIMM slots, multiple network adapters and provides several local storage options. For more information on this server click here.

    To evaluate the performance difference between the V100-16GB and the V100-32GB GPUs, a series of tests were conducted. These tests were run on a single PowerEdge C4140 server with the configurations detailed below in Table 2-4.

    Table 2: Tested Configurations Details

    Table 3: Hardware Configuration

     

    Table 4: Software/Firmware Configuration:

    HPL performance

    High Performance Linpack (HPL) is a standard HPC benchmark used to measure computing power. It is also used as a reference benchmark by the Top500 list to rank supercomputers worldwide. This benchmark provides a measurement of the peak computational performance of the entire system. There are a few parameters that are significant in this benchmark:

    • N is the problem size
    • NB is the block size
    • Rpeak is the theoretical peak of the system.
    • Rmax is the maximum measured performance achieved on the system.
    • The efficiency is defined as the ratio of Rmax to Rpeak.

     The resultant performance of HPL is reported in GFLOPS.

    N is the problem size provided as input to the benchmark and determines the size of the dense linear matrix that is solved by HPL. HPL performance tends to increase with increasing N (problem size) until the limits of system memory, CPU or data communication bandwidth begin to limit the performance. For GPU systems, the highest HPL performance commonly occurs when the problem size is close to the size of the GPUs' memory, and performance is higher when a larger problem size fits in that memory.
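
    As a rough illustration of this sizing rule, the sketch below estimates N from the aggregate GPU memory. It is a hypothetical helper, not part of the benchmark: it assumes the HPL matrix is N x N double-precision values (8 bytes each), that roughly 90% of the memory is usable, and that N is rounded down to a multiple of a typical block size NB.

    import math

    def hpl_problem_size(total_mem_gib, mem_fraction=0.9, block_size=192):
        """Estimate an HPL problem size N that fits in the given memory.

        The HPL matrix holds N*N double-precision values (8 bytes each), so
        memory use is roughly 8*N^2 bytes. N is rounded down to a multiple
        of the block size NB, as HPL input files typically require.
        """
        usable_bytes = total_mem_gib * (1024 ** 3) * mem_fraction
        n = int(math.sqrt(usable_bytes / 8))
        return (n // block_size) * block_size

    # Four V100-16GB GPUs vs four V100-32GB GPUs in one server
    print(hpl_problem_size(4 * 16))   # ~88,000
    print(hpl_problem_size(4 * 32))   # ~124,000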

    In this section of the blog, the HPL performance of the NVIDIA V100-16GB and the V100-32GB GPUs is compared using PowerEdge C4140 configuration B and K (refer to Table 2). Recall that configuration B uses PCIe V100s with 250W power limit and configuration K uses SXM2 V100s with higher clocks and 300W power limit. Figure 5 shows the maximum performance that can be achieved on different configurations. We measured a 14% improvement when running HPL on V100-32GB with PCIe versus V100-16GB with PCIe, and there was a 16% improvement between V100-16GB SXM2 and V100-32GB SXM2.  The size of the GPU memory made a big difference in terms of performance as the larger memory GPU can accommodate a larger problem size, a larger N.

    As seen in Table 1, the 16GB and 32GB variants of each V100 model (PCIe and SXM2) have the same number of cores, double precision performance and interconnect bandwidth; they differ only in GPU memory capacity. We also measured a ~6% HPL performance improvement from PCIe to SXM2 GPUs, which is a small delta for HPL, but deep learning frameworks like TensorFlow and Caffe show a much larger performance improvement.

    Running HPL using only CPUs yields ~2.3TFLOPS with the Xeon Gold 6148; therefore, one PowerEdge C4140 system with four GPUs provides floating point capabilities equal to about nine two socket Intel Xeon 6148 servers. 

     

    Figure 5:  HPL Performance on different C4140 configurations.

    Figure 6 and Figure 7 show the performance of the V100 16GB vs 32GB GPUs for different values of N. Table 2 shows the configurations used for this test. These graphs help us visualize how the GPU cards perform with different problem sizes. As explained above, the problem size is calculated based on the size of the GPU memory; the 32GB GPU can accommodate a larger problem size than the 16GB GPU. When a problem size larger than what fits in GPU memory is executed on a GPU system, the system memory attached to the CPU is used, and this leads to a drop in performance as the data must move between system memory and GPU memory. For ease of understanding, the test data is split into two different graphs.

     

    Figure 6:  HPL performance with different problem sizes (N)

    In Figure 6 we notice that HPL performance for both cards is similar until the problem size (N) approximately fills up the V100-16GB memory; the same problem size would fill only about half the memory of the V100-32GB GPUs. In Figure 7 we notice that the performance of the V100 16GB GPU drops as it cannot fit larger problem sizes in GPU memory and must start to use system host memory. The 32GB GPU continues to perform better with larger and larger N until the problem size reaches the maximum capacity of the V100 32GB memory.

     

    Figure 7:  HPL performance with different problem sizes (N)

    Conclusion and Future work:

    The PowerEdge C4140 is one of the most prominent GPU-based server options for HPC solutions. We measured a 14-17% improvement in HPL performance when moving from the smaller memory V100-16GB GPU to the larger memory V100-32GB GPU. For memory bound applications, the new Volta 32GB cards would be the preferred option.

    For future work, we will run molecular dynamics applications and deep learning workloads and compare the performance between different Volta cards and C4140 configurations.

    Please contact the HPC Innovation Lab if you'd like to evaluate the performance of your application on PowerEdge servers.

     

  • Collaboration Showcase: Dell EMC, TACC and Intel join forces on Stampede2 performance studies

    Dell EMC Solutions, June 2018

     

    The Stampede2 system is the result of a collaboration between the Texas Advanced Computing Center (TACC), Dell EMC and Intel. Stampede2 consists of 1,736 Dell EMC PowerEdge C6420 nodes with dual-socket Intel Skylake processors and 4,204 Dell EMC PowerEdge C6320p nodes with bootable Intel Knights Landing processors, for a total of 5,940 compute nodes, plus 24 additional login and management servers and Dell EMC Networking H-series switches, all interconnected by an Intel Omni-Path Architecture (OPA) fabric.

     

    Two technical white papers were recently published through the joint efforts of TACC, Dell EMC and Intel. One white paper describes the Network Integration and Testing Best Practices on the Stampede2 cluster. The other white paper discusses the Application Performance of Intel Skylake and Intel Knights Landing Processors on Stampede2 and highlights the significant performance advantage of the Intel Skylake processor at multi-node scale in four commonly used applications: NAMD, LAMMPS, GROMACS and WRF. For build details, please contact your Dell EMC representative. If you have a VASP license, we are happy to share VASP benchmark results as well.

     

    Deploying Intel Omni-Path Architecture Fabric in Stampede2 at the Texas Advanced Computing Center: Network Integration and Testing Best Practices (H17245)

     

    Application Performance of Intel Skylake and Intel Knights Landing Processors on Stampede2 (H17212)

  • Deep Learning Performance with Intel® Caffe - Training, CPU model choice and Scalability

    Authors: Alex Filby and Nishanth Dandapanthula

    HPC Engineering, HPC Innovation Lab, March 2018

     

    Overview

    To get the most out of deep learning technologies requires careful attention to both hardware and software considerations. There are a myriad of choices for compute, storage and networking. The software component does not stop at choosing a framework; there are many parameters for a particular model that can be tuned to alter performance. The Dell EMC Deep Learning Ready Bundle with Intel provides a complete solution with tuned hardware and software. This blog covers some of the benchmarks and results that influenced the design. Specifically, we studied the training performance across different generations of servers/CPUs, and the scalability of Intel Caffe to hundreds of servers.

    Introduction to Intel® Caffe and Testing Methodology

    Intel Caffe is a fork of BVLC (Berkeley Vision and Learning Center) Caffe, maintained by Intel. The goal of the fork is to provide architecture-specific optimizations for Intel CPUs (Broadwell, Skylake, Knights Landing, etc.). In addition to Caffe optimizations, "Intel optimized" models are also included with the code. These take popular models such as Alexnet, Googlenet and Resnet-50 and tweak their hyperparameters to provide increased performance and accuracy on Intel systems for both single node and multi-node runs. These models are frequently updated as the state of the art advances.

    For these tests we chose the Resnet-50 model, due to its wide availability across frameworks for easy comparison, and since it is computationally more intensive than other common models. Resnet is short for Residual Network, which strives to make deeper networks easier to train and more accurate by learning the residual function of the underlying data set as opposed to the identity mapping. This is accomplished by adding “skip connections” that pass output from upper layers to lower ones and skipping over some number of intervening layers. The two outputs are then added together in an element wise fashion and passed into a nonlinearity (activation) function.
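
    The short NumPy sketch below is a minimal illustration of that idea; the layer_fn stand-in and the shapes are hypothetical, not the actual Resnet-50 layers. The residual branch output is added element-wise to the block input and the sum is passed through the activation.

    import numpy as np

    def relu(x):
        return np.maximum(x, 0.0)

    def residual_block(x, layer_fn):
        """Minimal sketch of a residual (skip) connection: layer_fn stands in
        for a small stack of convolution/batch-norm layers, and its output is
        added element-wise to the block input before the final nonlinearity."""
        fx = layer_fn(x)        # residual branch F(x)
        return relu(fx + x)     # element-wise add ("skip"), then activation

    # Toy usage: the "layers" are just a fixed linear transform here
    x = np.random.randn(8, 16)
    w = np.random.randn(16, 16) * 0.1
    print(residual_block(x, lambda t: t @ w).shape)   # (8, 16)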

     

    Table 1.    Hardware configuration for Skylake, Knights Landing and Broadwell nodes

    Test bed

    Table 2. Software details



    Performance tests were conducted on three generations of servers supporting different Intel CPU technology. The system configuration of these test beds is shown in Table 1 and the software configuration is listed in Table 2. The software landscape is rapidly evolving, with frameworks being regularly updated and optimized. We expect performance to continue to improve with subsequent releases; as such the results are intended to provide insights and not be taken as absolute.

    As shown in Table 2 we used the Intel Caffe optimized multi-node version for all tests. There are differences between Intel’s implementation of the single-node and multi-node Caffe models, and using the multi-node model across all configurations allows for an accurate comparison between single and multi-node scaling results. Unless otherwise stated all tests were run using the compressed ILSVRC 2012 (Imagenet) database which contains 1,281,167 images. The dataset is loaded into /dev/shm before the start of the test. For each data point a parameter sweep was performed across three parameters: batch size, prefetch size, and thread count. Batch size is the number of training examples fed into the model at one time, prefetch is the number of batches (of batch size) buffered in memory, and thread count is the number of threads used per node. The results shown used the best results from the parameter sweep for each test case. The metric used for comparison is images per second, which is calculated by taking the total number of images the model has seen (batch_size * iterations * nodes) divided by the total training time. Training time does not include Caffe startup time.
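
    For clarity, the throughput metric can be written as a one-line helper; the numbers in the example are made up for illustration, not measured results.

    def images_per_second(batch_size, iterations, nodes, training_time_s):
        """Total images seen by the model divided by total training time
        (Caffe startup time excluded), as used throughout this blog."""
        return (batch_size * iterations * nodes) / training_time_s

    # Hypothetical run: 64 images per node per batch, 1000 iterations, 4 nodes
    print(images_per_second(64, 1000, 4, training_time_s=300.0))   # ~853 img/s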


    Single Node Performance

    To determine which processors might be best suited for these workloads, we tested a variety of SKUs including the Intel Xeon E5-2697 v4 (Broadwell - BDW); Silver, Gold and Platinum Intel Xeon Scalable Processor Family CPUs (Skylake - SKL); as well as an Intel Xeon Phi CPU (KNL). The single node results are plotted in Figure 1, with the line graph showing results relative to the performance of the E5-2697 v4 BDW system.



    Figure 1.    Processor model performance comparison relative to Broadwell


    The difference in performance between the Gold 6148 and Platinum 8168 SKUs is around 5%. These results show that for this workload and version of Intel Caffe the higher end Platinum SKUs do not offer much in the way of additional performance over the Gold CPUs. The KNL processor model tested provides very similar results to the Platinum models.

    Multi-node Performance and Scaling

    The multi-node runs were conducted on the HPC Innovation Lab’s Zenith cluster, which is a Top500 ranked cluster (#292 on the Nov 2017 list). Zenith contains over 324 Skylake nodes and 160 KNL nodes configured as listed in Table 1. The system uses Intel’s Omni-Path Architecture for its high speed interconnect. The Omni-Path network consists of a single 768 port director switch, with all nodes directly connected, providing a fully non-blocking fabric.

    Scaling Caffe beyond a single node requires additional software; we used the Intel Machine Learning Scaling Library (MLSL). MLSL provides an interface for common deep learning communication patterns built on top of Intel MPI. It supports various high speed interconnects, and its API can be used by multiple frameworks.

    The performance numbers on Zenith were obtained using /dev/shm, the same as we did for the single node tests. KNL multi-node tests used a Dell EMC NFS Storage Solution (NSS), an optimized NFS solution. Batch sizes were constrained as node count increased to keep the total batch size less than or equal to 8k, to keep it within the bounds of this particular model. As node count increases, the total batch size across all the nodes in the test increases as well (assuming you keep the batch size per node constant). Very large batch sizes complicate the gradient descent algorithm used to optimize the model, causing accuracy to suffer. Facebook has done work getting distributed training methods to scale to 8k batch sizes.

    Figure 2.    Scaling on Zenith with Gold 6148 processors using /dev/shm as the storage

    Figure 2 shows the results of our scalability tests on Skylake. When scaling from 1 node to 128 nodes, speedup is within 90% of perfect scaling. Above that, scaling starts to drop off more rapidly, falling to 83% and 76% of perfect for 256 and 314 nodes respectively. This is most likely due to a combination of factors, the first being decreasing node batch size. Individual nodes tend to offer the best performance with larger batch sizes, but to keep the overall batch below 8k, the node batch size is decreased, so each node runs a suboptimal batch size. The second is communication overhead; the Intel Caffe default for multi-node weight updates uses MPI collectives at the end of each batch to distribute the model weight data to all nodes. This allows each node to 'see' the data from all other nodes without having to process all of the other images in the batch, and it is why training time improves when using multiple nodes instead of just training hundreds of individual models. Communication patterns and overhead are an area we plan to investigate in the future.
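
    The snippet below is a conceptual sketch of that synchronous update pattern, written with mpi4py rather than Intel MLSL itself; the array size and learning rate are illustrative only.

    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    nprocs = comm.Get_size()

    def synchronous_sgd_step(weights, local_grad, lr=0.01):
        """Average the gradients over all workers with an allreduce, then
        apply one SGD step; every rank ends up with identical weights."""
        global_grad = np.empty_like(local_grad)
        comm.Allreduce(local_grad, global_grad, op=MPI.SUM)   # sum over workers
        global_grad /= nprocs                                 # average
        return weights - lr * global_grad

    # Hypothetical layer: in Intel Caffe/MLSL this happens per set of weights
    weights = np.zeros(1000)
    local_grad = np.random.randn(1000)
    weights = synchronous_sgd_step(weights, local_grad)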



    Figure 3. Scaling Xeon 7230 KNL using Dell NFS Storage Solution

    The scalability results on the KNL cluster are shown in Figure 3. The results are similar to the SKL results in Figure 2. For this test, the batch size was able to remain constant due to the smaller number of nodes and the fact that a smaller batch size was optimal on KNL systems. With multi-node runs, some performance is lost because threads are needed for communication rather than pure computation, unlike the single node tests.

    Conclusions and Future Work

    For this blog we have focused on single node deep learning training performance comparing a range of different Intel CPU models and generations, and conducted initial scaling studies for both SKL and KNL clusters. Our key takeaways are summarized as follows:

    • Intel Caffe with Intel MLSL scales to hundreds of nodes.

    • Skylake Gold 6148, 6150, and 6152 processors offer similar performance to Platinum SKUs.

    • KNL performance is also similar to Platinum SKUs.


    Our future work will focus on other aspects of deep learning solutions including performance of other frameworks, inference performance, and I/O considerations. TensorFlow is a very popular framework which we did not discuss here but will do so in a future part of this blog series. Inferencing is a very important part of the workflow, as a model must be deployed for it to be of use! Finally we’ll also compare the various storage options and tradeoffs as well as discuss the I/O patterns (network and storage) of TensorFlow and Intel Caffe.


  • Deep Learning Performance on R740 with V100-PCIe GPUs

    Authors: Rengan Xu, Frank Han, Nishanth Dandapanthula

    Dell EMC HPC Innovation Lab. February 2018

    Overview

    The Dell EMC PowerEdge R740 is a 2-socket, 2U rack server. The system features Intel Skylake processors, up to 24 DIMMs, and up to 3 double-width or 6 single-width GPUs. In our previous blog, Deep Learning Inference on P40 vs P4 with SkyLake, we presented the deep learning inference performance of Dell EMC's PowerEdge R740 server with P40 and P4 GPUs. This blog will present the deep learning training performance on a single R740 with multiple V100-PCIe GPUs. The deep learning frameworks we benchmarked include Caffe2, MXNet and Horovod+TensorFlow. Horovod is a distributed training framework for TensorFlow; we used Horovod because it has a better scalability implementation (using an MPI model) than TensorFlow's own distributed training, as explained in the article "Meet Horovod: Uber's Open Source Distributed Deep Learning Framework for TensorFlow". Table 1 shows the hardware configuration and software details we tested. To test the deep learning performance and scalability on the R740 server, we used the same neural network, dataset and measurement as in our other deep learning blogs, such as Scaling Deep Learning on Multiple V100 Nodes and Deep Learning on V100.
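
    For readers unfamiliar with Horovod, the skeleton below sketches the usual pattern; it is not the exact benchmark script. Each MPI rank pins one GPU, wraps its optimizer so that gradients are averaged with an allreduce, and broadcasts the initial variables from rank 0 so all workers start from the same state.

    import tensorflow as tf
    import horovod.tensorflow as hvd

    hvd.init()

    # Pin one GPU per MPI rank (this config is passed to the training session)
    config = tf.ConfigProto()
    config.gpu_options.visible_device_list = str(hvd.local_rank())

    # Scale the learning rate by the number of workers and wrap the optimizer
    opt = tf.train.MomentumOptimizer(learning_rate=0.1 * hvd.size(), momentum=0.9)
    opt = hvd.DistributedOptimizer(opt)          # allreduce gradients across ranks

    # Ensure every worker starts from rank 0's initial weights
    hooks = [hvd.BroadcastGlobalVariablesHook(0)]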

    Table 1: The hardware configuration and software details


    Performance Evaluation

    Figure 1, Figure 2 and Figure 3 show the Resnet50 performance and speedup of multiple V100 GPUs with Caffe2, MXNet and TensorFlow, respectively. We can draw the following conclusions from these results:

    • Overall the performance of Resnet50 scales well on multiple V100 GPUs within one node. With 3 V100:

      • Caffe2 achieved the speedup of 2.61x and 2.65x in FP32 and FP16 mode, respectively.

      • MXNet achieved the speedup of 2.87x and 2.82x in FP32 and FP16 mode, respectively.

      • Horovod+TensorFlow achieved the speedup of 2.12x in FP32 mode. (FP16 still under development)

    • The performance in FP16 mode is around 80%-90% faster than FP32 for both Caffe2 and MXNet. TensorFlow does not yet support FP16, so we will test its FP16 performance once this feature is supported.

     Figure 1: Caffe2: Performance and speedup of V100


    Figure 2: MXNet: Performance and speedup of V100


    Figure 3: TensorFlow: Performance and speedup of V100

    Conclusions

    In this blog, we presented the deep learning performance and scalability of popular deep learning frameworks like Caffe2, MXNet and Horovod+TensorFlow. Overall, the three frameworks scale as expected with multiple GPUs within a single R740 server.

  • Digital Manufacturing with 14G

    Author: Joshua Weage, HPC Innovation Lab, February 2018

     

    Dell EMC Ready Bundle for HPC Digital Manufacturing

     

    Four technical white papers were recently published describing the recently released Dell EMC Ready Bundle for HPC Digital Manufacturing. These papers discuss the architecture of the system as well as the performance of ANSYS® Mechanical™, ANSYS® Fluent® and ANSYS® CFX®, LSTC LS-DYNA®, Simcenter STAR-CCM+™ and Dassault Systѐmes’ Simulia Abaqus on the Dell EMC Ready Bundle for HPC.

     

    Dell EMC Ready Bundle for HPC Digital Manufacturing: ANSYS Performance

    Dell EMC Ready Bundle for HPC Digital Manufacturing: LSTC LS-DYNA Performance

    Dell EMC Ready Bundle for HPC Digital Manufacturing: Simcenter STAR-CCM+ Performance

    Dell EMC Ready Bundle for HPC Digital Manufacturing: Dassault Systѐmes' Simulia Abaqus Performance

  • HPC Applications Performance on R740 with V100 GPUs

     

    Authors: Frank Han, Rengan Xu, Nishanth Dandapanthula.

    HPC Innovation Lab. February 2018

    Overview

     

    Not long ago, the PowerEdge R740 server was released as part of Dell EMC's 14th generation server portfolio. It is a 2U Intel Skylake-based rack mount server and provides an ideal balance between storage, I/O and application acceleration. Besides VDI and cloud, the server is also designed for HPC workloads. Compared to the previous R730 server, one of the major changes in GPU support is that the R740 supports up to 3 double-width cards, one more than the R730 could support. This blog will focus on the performance of a single R740 server with 1, 2 and 3 Nvidia Tesla V100-PCIe GPUs. Multi-card scaling results for HPC applications like High Performance Linpack (HPL), the High Performance Conjugate Gradients benchmark (HPCG) and the Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS) will be presented.

     

    Table 1: Details of R740 configuration and software version

    Processor                                    2 x Intel Xeon Gold 6150 @ 2.7 GHz, 18 cores
    Memory                                       384 GB (12 x 32 GB @ 2666 MHz)
    Local Disk                                   480 GB SSD
    Operating System                             Red Hat Enterprise Linux Server release 7.3
    GPU                                          Nvidia Tesla V100-PCIe
    CUDA Driver                                  387.26
    CUDA Toolkit                                 9.1.85
    Processor Settings > Logical Processors      Disabled
    System Profiles                              Performance

     

    High Performance Linpack (HPL)

    Figure 1: HPL Performance and efficiency with R740

    Figure 1 shows HPL performance and efficiency numbers. Performance increases nearly linearly with the number of GPUs. The efficiency line isn't flat: the peak of 67.8% appears with 2 cards, which means the 2-GPU configuration is the most optimized one for HPL. The 1-card and 3-card numbers are about 7% lower than the 2-card number, and they are affected by different factors:

    • For the 1-card case, this GPU-enabled HPL implementation is designed to bind CPUs and GPUs. When running at large scale with multiple GPU cards and nodes, binding a GPU to a CPU is known to make data access more efficient. Testing with only 1 V100 is a special case: only the first CPU is bound to the single GPU, and there is no workload on the second CPU. Compared with the 2-card result, the unused portion of Rpeak increases, so efficiency drops. This does not mean the GPU is less efficient; the HPL implementation is designed and optimized for large scales, and the single-GPU case is a special one for HPL.

    • For the 3-card case, one of the major limiting factors is that a 3x1 (P and Q) process grid must be used. HPL is known to perform better with a square PxQ grid; we also verified 2x2 and 4x1 grids on a C4140 with 4 GPUs, and 2x2 did indeed perform better (the short sketch after this list illustrates why 3 GPUs force a skinny grid). But keep in mind that with 3 cards the Rmax still increases significantly. HPL is an extreme benchmark; in the real world, the capability of having the additional 3rd GPU gives a big advantage with different applications and datasets.
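
    The short sketch below illustrates the grid-selection point: a prime GPU count such as 3 only allows a skinny 1x3 (or 3x1) grid, while 4 GPUs allow the preferred square 2x2 grid. The helper is illustrative, not taken from the HPL input generator.

    from math import sqrt

    def process_grid(n_procs):
        """Return the most 'square' P x Q factorization of n_procs, which is
        generally the best-communicating layout for HPL."""
        best = (1, n_procs)
        for p in range(1, int(sqrt(n_procs)) + 1):
            if n_procs % p == 0:
                best = (p, n_procs // p)   # p <= q, closest to square so far
        return best

    for n in (1, 2, 3, 4):
        print(n, process_grid(n))
    # 3 -> (1, 3): forced skinny grid; 4 -> (2, 2): square grid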

     

    HPCG

    Figure 2: HPCG Performance with R740

    As shown in Figure 2, compared with the dual Xeon 6150 CPU-only performance, a single V100 is already 3 times faster, 2 V100s are 8 times faster, and 3 V100s are nearly 12 times faster (CPU-only node vs GPU node). Just like HPL, HPCG is also designed for large scale, so single-card performance isn't as efficient as multi-card performance. But unlike HPL, 3-card HPCG performance scales linearly: it is 1.48 times higher than the 2-card result, which is very close to the theoretical 1.5 times. This is because all of the HPCG workload runs on the GPUs and its data fits in GPU memory. This proves that applications like HPCG can take advantage of having the 3rd GPU.

     

    LAMMPS

    Figure 3: LAMMPS Performance with R740

     

    The LAMMPS version used for this testing is 17Aug2017, the latest stable version at the time of testing. The testing dataset is in.intel.lj, the same one used in all previous GPU LAMMPS testing, and it can be found here. With the same parameters as previous testing, the initial values were x=4, y=z=2, and the simulation executes with 512k atoms. Weak scaling was obvious, as the timesteps/s achieved with 2 and 3 cards was only 1.5 and 1.7 times that of a single card. The reason is that the workload isn't heavy enough for 3 V100 GPUs. After adjusting x, y and z to 8, 16M atoms were generated in the simulation, and the performance then scaled well with multiple cards. As shown in Figure 3, 2 and 3 cards are 1.8 and 2.4 times faster than a single card, respectively. This LAMMPS result is another example of a GPU-accelerated HPC application that benefits from having more GPUs in the system.
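
    As a back-of-the-envelope check of those atom counts, the helper below assumes the usual 32,000-atom base cell of the lj benchmark replicated x*y*z times; the base cell size is an assumption used for illustration, not something stated in the input file here.

    def lj_atom_count(x, y, z, base_atoms=32000):
        """Approximate atom count for the scaled lj benchmark, assuming a
        32,000-atom base cell replicated x * y * z times."""
        return base_atoms * x * y * z

    print(lj_atom_count(4, 2, 2))   # 512,000    -> the original, lighter workload
    print(lj_atom_count(8, 8, 8))   # 16,384,000 -> ~16M atoms, enough for 3 GPUs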

     

    Conclusion

    The R740 server with multiple Nvidia Tesla V100-PCIe GPUs demonstrates exceptional performance for applications like HPL, HPCG and LAMMPS. Besides balanced I/O, the R740 has the flexibility to run HPC applications with 1, 2 or 3 GPUs. The newly added support for a 3rd GPU provides more compute power as well as larger total GPU memory. Many applications work best when data fits in GPU memory, and having the 3rd GPU allows the R740 to fit larger problems.

    References:

    PowerEdge R740 Technical Guide: http://i.dell.com/sites/doccontent/shared-content/data-sheets/en/Documents/PowerEdge_R740_R740xd_Technical_Guide.pdf

     

     

  • Performance of LS-DYNA on Singularity Containers

    Authors: Nirmala Sundararajan, Joshua Weage, Nishanth Dandapanthula

    HPC Innovation Lab, February 2018

     

    Overview

    We often look at containers today and see the potential of accelerated application delivery, scaling and portability and wonder how we ever got by without them. This blog discusses the performance of LSTC LS-DYNA® within Singularity containers and on bare metal. This blog is the third in the series of blogs regarding container technology by the HPC Engineering team. The first blog Containers, Docker, Virtual Machines and HPC explored the use of containers. The second blog Containerizing HPC Applications with Singularity provided an introduction to Singularity and discussed the challenges of using Singularity in HPC environments. This third blog will focus on determining if there is a performance penalty while running the application (LS-DYNA) in a containerized environment.

    Containerizing LS-DYNA using Singularity

    An application-specific container is a lightweight bundle of an application and its dependencies. The primary advantages of an application container are its portability and reproducibility. When we started assessing what types of workloads/applications could be containerized using Singularity, interestingly enough, the first application we tried to containerize was Fluent®, which presented a constraint. Since Fluent bundles its own MPI libraries, mpirun has to be invoked from within the container instead of from outside the container. Adoption of containers is difficult in this scenario and requires an sshd wrapper. So we shifted gears and started working on LS-DYNA, which is a general-purpose finite element analysis (FEA) program capable of simulating complex real world problems. LS-DYNA consists of a single executable file and is entirely command line driven. Therefore, all that is required to run LS-DYNA is a command line shell, the executable, an input file, and enough disk space to run the calculation. It is used to simulate a whole range of engineering problems using its fully automated analysis capabilities. LS-DYNA is used worldwide in multiple engineering disciplines such as the automobile, aerospace, construction, military, manufacturing, and bioengineering industries. It is robust and has worked extremely well over the past 20 years in numerous applications such as crashworthiness, drop testing and occupant safety.

    The first step in creating the container for LS-DYNA would be to have a minimal operating system (CentOS 7.3 in this case), basic tools to run the application and support for InfiniBand within the container. With this, the container gets a runtime environment, system tools, and libraries. The next step would be to install the application binaries within the container. The definition file used to create the LS-DYNA container is given below. In this file, the bootstrap references the kind of base to be used, and a number of options are available: "shub" pulls images hosted on Singularity Hub, "docker" pulls images from Docker Hub, and here yum is used to install CentOS 7.

    Definition File:

    BootStrap: yum

    OSVersion: 7

    MirrorURL: http://vault.centos.org/7.3.1611/os/x86_64/

    Include: yum

     

    %post

    # basic-tools and dev-tools

    yum -y install evince wget vim zip unzip gzip tar perl

    yum -y groupinstall "Development tools" --nogpgcheck

     

    # InfiniBand drivers

    yum -y --setopt=group_package_types=optional,default,mandatory groupinstall "InfiniBand Support"

    yum -y install rdma
    yum -y install libibverbs-devel libsysfs-devel

    yum -y install infinipath-psm

     

    # Platform MPI

    mkdir -p /home/inside/platform_mpi

    chmod -R 777 /home/inside/platform_mpi

    cd /home/inside/platform_mpi

    /usr/bin/wget http://192.168.41.41/platform_mpi-09.1.0.1isv.x64.bin

    chmod 777 platform_mpi-09.1.0.1isv.x64.bin

    ./platform_mpi-09.1.0.1isv.x64.bin -installdir=/home/inside/platform_mpi -silent

    rm platform_mpi-09.1.0.1isv.x64.bin

     

    # Application

    mkdir -p /home/inside/lsdyna/code

    chmod -R 777 /home/inside/lsdyna/code

    cd /home/inside/lsdyna/code

    /usr/bin/wget http://192.168.41.41/ls-dyna_mpp_s_r9_1_113698_x64_redhat54_ifort131_avx2_platformmpi.tar.gz

    tar -xvf ls-dyna_mpp_s_r9_1_113698_x64_redhat54_ifort131_avx2_platformmpi.tar.gz

    rm ls-dyna_mpp_s_r9_1_113698_x64_redhat54_ifort131_avx2_platformmpi.tar.gz

     

    %environment

    PATH=/home/inside/platform_mpi/bin:$PATH

    LSTC_LICENSE=network

    LSTC_MEMORY=auto

    export PATH LSTC_LICENSE LSTC_MEMORY

    %runscript

    exec ls /

    It is quick and easy to build a ready-to-use Singularity image using the above definition file. In addition to the environment variables specified in the definition file, the value of any other variable, such as LSTC_LICENSE_SERVER, can be set inside the container with the prefix "SINGULARITYENV_". Singularity adopts a hybrid model for MPI support, and mpirun is called from outside the container as follows:

    /home/nirmala/platform_mpi/bin/mpirun -env_inherit -np 80 -hostfile hostfile singularity exec completelsdyna.simg /home/inside/lsdyna/code/ls-dyna_mpp_s_r9_1_113698_x64_redhat54_ifort131_avx2_platformmpi i=Caravan-V03c-2400k-main-shell16-120ms.k memory=600m memory2=60m endtime=0.02 

     

    LS_DYNA car2car Benchmark Test

    The car2car benchmark presented here is a simulation of a two-vehicle collision. The performance results for LS-DYNA are presented by using the Elapsed Time metric. This metric is the total elapsed time for the simulation in seconds as reported by LS-DYNA. A lower value represents better performance. The performance tests were conducted on two clusters in the HPC Innovation Lab, one with Intel® Xeon® Scalable Processor Family processors (Skylake) and another with Intel® Xeon® E5-2600 v4 processors (Broadwell). The software installed on the Skylake cluster is shown in Table 1.

    The configuration details of the Skylake cluster are shown below in Table 2.

    Figure 1 shows that there is no perceptible performance loss while running LS-DYNA inside a containerized environment, both on a single node and across the four node cluster. The results for the bare-metal and container tests are within 2% of each other, which is within the expected run-to-run variability of the application itself.

     

    Figure 1 Performance inside singularity containers relative to bare metal on Skylake Cluster.

    The software installed on the Broadwell cluster is shown in Table 3.

    The configuration details of the Broadwell cluster are shown below in Table 4.

    Figure 2 shows that there is no significant performance penalty while running LS-DYNA car2car inside Singularity containers on the Broadwell cluster; the performance delta is within 2%.

    Figure 2 Performance inside singularity containers relative to bare metal on Broadwell Cluster

    Conclusion and Future Work

    In this blog, we discussed how the performance of LS-DYNA within Singularity containers is almost on par with running the application on bare metal servers. The performance difference while running LS-DYNA within Singularity containers remains within 2%, which is within the run-to-run variability of the application itself. The Dell EMC HPC team will focus on containerizing other applications, and this blog series will be continued, so stay tuned for the next blog in this series!


  • Design Principles for HPC

    Dell EMC HPC Innovation Lab, February 2018.

    HPC system configuration can be a complex task, especially at scale, requiring a balance between user requirements, performance targets, data center power, cooling and space constraints, and pricing. Furthermore, the many choices among competing technologies complicate configuration decisions. The document below describes a modular approach to HPC system design at scale, where sub-systems are broken down into modules that integrate well together. The design and options for each module are backed by measured results, including application performance. Data center considerations are taken into account during the design phase. Enterprise-class services, including deployment, management and support, are available for the whole HPC system.

  • Skylake memory study

    Authors: Joseph Stanfield, Garima Kochhar, Donald Russell and Bruce Wagner.

    HPC Engineering and System Performance Analysis Teams, HPC Innovation Lab, January 2018.

    To function efficiently in an HPC environment, a cluster of compute nodes must work in tandem to compile complex data and achieve desired results. The user expects each node to function at peak performance as an individual system, as well as a part of an intricate group of nodes processing data in parallel. To enable efficient cluster performance, we first need good single system performance. With that in mind, we evaluated the impact of different memory configurations on single node memory bandwidth performance using the STREAM benchmark. The servers used here support the latest Intel Skylake processor (Intel Scalable Processor Family) and are from the Dell EMC 14th generation (14G) server product line.

    Fewer than 6 DIMMs per socket

    The Skylake processor has a built-in memory controller similar to previous generation Xeons but now supports *six* memory channels per socket. This is an increase from the four memory channels found in previous generation Xeon E5-2600 v3 and E5-2600 v4 processors. Different Dell EMC server models offer a different number of memory slots based on server density, but all servers offer at least one memory module slot on each of the six memory channels per socket.

    For applications that are sensitive to memory bandwidth and require predictable performance, configuring memory for the underlying architecture is an important consideration. For optimal memory performance, all six memory channels of a CPU should be populated with memory modules (DIMMs), and populated identically. This is called a balanced memory configuration. In a balanced configuration, all DIMMs are accessed uniformly and the full complement of memory channels is available to the application. An unbalanced memory configuration will lead to lower memory performance, as some channels will be unused or used unequally. Even worse, an unbalanced memory configuration can lead to unpredictable memory performance based on how the system fractures the memory space into multiple regions and how Linux maps out these memory domains.

    Figure 1: Relative memory bandwidth with different number of DIMMs on one socket.

    PowerEdge C6420. Platinum 8176. 32 GB 2666 MT/s DIMMs.

    Figure 1 shows the drop in performance when all six memory channels of a 14G server are not populated. Using all six memory channels per socket is the best configuration, and will give the most predictable performance. This data was collected using the Intel Xeon Platinum 8176 processor. While the exact memory performance of a system depends on the CPU model, the general trends and conclusions presented here apply across all CPU models.
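
    The toy helper below is illustrative only, not a configuration tool: it spreads identical DIMMs across the six channels of one socket and reports a configuration as balanced only when every channel ends up populated identically.

    def channel_population(dimms_per_socket, channels=6):
        """Distribute N identical DIMMs over the socket's memory channels and
        check whether the result is a balanced configuration."""
        base, extra = divmod(dimms_per_socket, channels)
        per_channel = [base + (1 if c < extra else 0) for c in range(channels)]
        balanced = base > 0 and len(set(per_channel)) == 1
        return per_channel, balanced

    for n in (4, 6, 8, 12):
        print(n, channel_population(n))
    # 4  -> ([1, 1, 1, 1, 0, 0], False)  two channels unused
    # 6  -> ([1, 1, 1, 1, 1, 1], True)   balanced, 1 DIMM per channel
    # 8  -> ([2, 2, 1, 1, 1, 1], False)  unbalanced
    # 12 -> ([2, 2, 2, 2, 2, 2], True)   balanced, 2 DIMMs per channel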

    Balanced memory configurations

    Focusing now on the recommended configurations that use all 12 memory channels in a two socket 14G system, there are different memory module options that allow different total system memory capacities. Memory performance will also vary depending on whether the DIMMs used are single ranked, double ranked, RDIMMS or LR-DIMMs. These variations are, however, significantly lower than any unbalanced memory configuration as shown in Figure 2.

    • 8GB 2666 MT/s memory modules are single ranked and have lower memory bandwidth than the dual ranked 16GB and 32GB memory modules.

    • 16GB and 32GB memory modules are both dual ranked and have similar memory bandwidth, with the 16GB DIMMs demonstrating higher memory bandwidth.

    • 64GB memory modules are LRDIMMs and have slightly lower memory bandwidth than the dual ranked RDIMMs.

    128GB memory modules are also LR-DIMMS but are lower performing than the 64GB modules and their prime attraction is the additional memory capacity. Note that LR-DIMMs also have higher latency and higher power draw. Here is an older study on previous generation 13G servers that describes these characteristics in detail.

     

    Figure 2: Relative memory bandwidth for different system capacities (12D balanced configs).

    PowerEdge R740. Platinum 8180. DIMM configuration as noted, all 2666 MT/s memory.

    Data for Figure 2 was collected on the Intel Xeon Platinum 8180 processor. As mentioned above, the memory subsystem performance depends on the CPU model since the memory controller is integrated with the processor, and the speed of the processor and number of cores also influence memory performance. The trends presented here will apply across the Skylake CPUs, though the actual percentage differences across configurations may vary. For example, here we see the 96 GB configuration has 7% lower memory bandwidth than the 384 GB configuration. With a different CPU model, that difference could be 9-10%.

    Figure 3 shows another example of balanced configurations, this time using 12 or 24 identical DIMMs in the 2S system where one DIMM per channel is populated (1DPC with 12DIMMs) or two DIMMs per channel are populated (2DPC using 24 DIMMs). The information plotted in Figure 3 was collected across two CPU models and shows the same patterns as Figure 2. Additionally, the following observations can be made:

    • With two 8GB single ranked DIMMs giving two ranks on each channel, some of the memory bandwidth lost with 1DPC SR DIMMs can be recovered with the 2DPC configuration.

    • 16GB dual ranked DIMMS perform better than 32GB DIMMs in 2DPC configurations too.

    We also measured the impact of this memory population when 2400 MT/s memory is used, and the conclusions were identical to those for 2666 MT/s memory. For brevity, the 2400 MT/s results are not presented here.

    Figure 3: Relative memory bandwidth for different system capacities (12D, 24D balanced configs).

    PowerEdge R640. Processor and DIMM configuration as noted. All 2666 MT/s memory.

    Unbalanced memory configurations

    In previous generation systems, the processor supported four memory channels per socket. This led to balanced configurations with eight or sixteen memory modules per dual socket server. Configurations of 8x16GB (128 GB), 16x16 GB or 8x32GB (256 GB), 16x32 GB (512 GB) were popular and recommended.

    With 14G and Skylake, these absolute memory capacities will lead to unbalanced configurations as these memory capacities do not distribute evenly across 12 memory channels. A configuration of 512 GB on 14G Skylake is possible but suboptimal, as shown in Figure 4. Across CPU models (Platinum 8176 down to Bronze 3106), there is a 65% to 35% drop in memory bandwidth when using an unbalanced memory configuration when compared to a balanced memory configuration! The figure compares 512 GB to 384 GB, but the same conclusion holds for 512 GB vs 768 GB as Figure 2 has shown us that a balanced 384 GB configuration performs similarly to a balanced 768 GB configuration.

    Figure 4: Impact of unbalanced memory configurations.

    PowerEdge C6420. Processor and DIMM configuration as noted. All 2666 MT/s memory.

    Near-balanced memory configurations

    The question that arises is: is there a reasonable configuration that would work for capacities close to 256GB without having to go all the way to a 384GB configuration, and close to 512GB without having to raise the capacity all the way to 768GB?

    Dell EMC systems do allow mixing different memory modules, and this is described in more detail in the server owner manual. For example, the Dell EMC PowerEdge R640 has 24 memory slots with 12 slots per processor. Each processor’s set of 12 slots is organized across 6 channels with 2 slots per channel. In each channel, the first slot is identified by the white release tab while the second slot tab is black. Here is an extract of the memory population guidelines that permit mixing DIMM capacities.

    The PowerEdge R640 supports Flexible Memory Configuration, enabling the system to be configured and run in any valid chipset architectural configuration. Dell EMC recommends the following guidelines to install memory modules:

    • RDIMMs and LRDIMMs must not be mixed.

    • Populate all the sockets with white release tabs first, followed by the black release tabs.

    • When mixing memory modules with different capacities, populate the sockets with the highest-capacity memory modules first. For example, if you want to mix 8 GB and 16 GB memory modules, populate the 16 GB memory modules in the sockets with white release tabs and the 8 GB memory modules in the sockets with black release tabs.

    • Memory modules of different capacities can be mixed provided other memory population rules are followed (for example, 8 GB and 16 GB memory modules can be mixed).

    • Mixing of more than two memory module capacities in a system is not supported.

    • In a dual-processor configuration, the memory configuration for each processor should be identical. For example, if you populate socket A1 for processor 1, then populate socket B1 for processor 2, and so on.

    • Populate six memory modules per processor (one DIMM per channel) at a time to maximize performance.

    *One important caveat is that 64 GB LRDIMMs and 128 GB LRDIMMs cannot be mixed; they are different technologies and are not compatible.

    So the question is, how bad are mixed memory configurations for HPC? To address this, we tested valid “near-balanced configurations” as described in Table 1, with the results displayed in Figure 5.

    Table 1: Near balanced memory configurations

    Figure 5: Impact of near-balanced configurations.

    PowerEdge R640. Processor and DIMM configuration as noted. All 2666 MT/s memory.

    Figure 5 illustrates that near-balanced configurations are a reasonable alternative when the memory capacity requirements demand a compromise. All memory channels are populated, and this helps with the memory bandwidth. The 288 GB configuration uses single ranked 8GB DIMMs and we see the penalty single ranked DIMMS impose on the memory bandwidth.

    Conclusion

    The memory subsystem performance differences with balanced vs. unbalanced configurations and with different types of memory modules are not new to Skylake or Dell EMC servers. Previous studies for earlier generations of servers and CPUs are listed below and show similar conclusions.
    • Memory should ideally be populated in a balanced configuration, with all memory channels populated and populated identically, for best performance. The number of memory channels is determined by the CPU and system architecture.
    • DIMM rank, type and memory speed have an impact on performance.
    • Unbalanced configurations are not recommended when optimizing for performance.
    • Some near-balanced configurations are a reasonable alternative when the memory capacity requirements demand a compromise.

    Previous memory configuration studies

    1. Performance and Energy Efficiency of the 14th Generation Dell PowerEdge Servers – Memory bandwidth results across Intel Skylake Processor Family (Skylake) CPU models (14G servers)

    2. http://en.community.dell.com/techcenter/extras/m/white_papers/20444326/download

    3. 13G PowerEdge Server Performance Sensitivity to Memory Configuration – Intel Xeon 2600 v3 and 2600 v4 systems (13G servers)

    4. Unbalanced Memory Performance – Intel Xeon E5-2600 and 2600 v2 systems (12G servers)

    5. Memory Selection Guidelines for HPC and 11G PowerEdge Servers – Intel Xeon 5500 and 5600 systems (11G servers)

  • Scaling Deep Learning on Multiple V100 Nodes

    Authors: Rengan Xu, Frank Han, Nishanth Dandapanthula.

    HPC Innovation Lab. November 2017

     

    Abstract

    In our previous blog, we presented the deep learning performance on a single Dell PowerEdge C4130 node with four V100 GPUs. For very large neural network models, a single node is still not powerful enough to quickly train those models. Therefore, it is important to scale training to multiple nodes to meet the computation demand. In this blog, we will evaluate the multi-node performance of the deep learning frameworks MXNet and Caffe2. The interconnect in use is Mellanox EDR InfiniBand. The results show that both frameworks scale well on multiple V100-SXM2 nodes.

    Overview of MXNet and Caffe2

    In this section, we will give an overview of how MXNet and Caffe2 implement distributed training on multiple nodes. Usually there are two ways to parallelize neural network training on multiple devices: data parallelism and model parallelism. In data parallelism, all devices have the same model but different devices work on different pieces of data. In model parallelism, different devices hold the parameters of different layers of a neural network. In this blog, we focus only on data parallelism and will evaluate model parallelism in the future. Another choice in most deep learning frameworks is whether to use synchronous or asynchronous weight updates. The synchronous implementation aggregates the gradients over all workers in each iteration (or mini-batch) before updating the weights, whereas in an asynchronous implementation each worker updates the weights independently of the others. Since the synchronous approach guarantees model convergence while convergence with the asynchronous approach is still an open question, we only evaluate synchronous weight updates.

    MXNet is able to launch jobs on a cluster in several ways, including SSH, Yarn and MPI. For this evaluation, SSH was chosen. In SSH mode, the processes on different nodes use rsync to synchronize the working directory from the root node to the slave nodes. The purpose of the synchronization is to aggregate the gradients over all workers in each iteration (or mini-batch). Caffe2 uses the Gloo library for multi-node training and the Redis library to facilitate management of nodes in distributed training. Gloo is an MPI-like library that comes with a number of collective operations like barrier, broadcast and allreduce for machine learning applications. The Redis library used by Gloo is used to connect all participating nodes.
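
    The snippet below is a conceptual sketch of the key-value store (parameter-server style) synchronous update that MXNet uses; a local device kvstore is created so the snippet runs stand-alone, whereas real multi-node runs are launched through MXNet's SSH launcher with a distributed synchronous kvstore. The key name and array size are hypothetical.

    import mxnet as mx

    # 'device' aggregates within a node; multi-node jobs would use a
    # distributed synchronous kvstore instead.
    kv = mx.kvstore.create('device')

    shape = (1000,)
    kv.init('resnet50_layer_0', mx.nd.zeros(shape))

    # Each worker pushes its local gradient; a pull returns the aggregated value.
    local_grad = mx.nd.random.normal(shape=shape)
    kv.push('resnet50_layer_0', local_grad)
    aggregated = mx.nd.zeros(shape)
    kv.pull('resnet50_layer_0', out=aggregated)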

    Testing Methodology

    We chose to evaluate two deep learning frameworks for our testing, MXNet and Caffe2. As with our previous benchmarks, we will again use the ILSVRC 2012 dataset which contains 1,281,167 training images and 50,000 validation images. The neural network in the training is called Resnet50 which is a computationally intensive network that both frameworks support. To get the best performance, CUDA 9 compiler, CUDNN 7 library and NCCL 2.0 are used for both frameworks, since they are optimized for V100 GPUs. The testing platform has four Dell EMC’s PowerEdge C4130 servers in configuration K. The system layout of configuration K is shown in Figure 1. As we can see, the server has four V100-SXM2 GPUs and all GPUs are connected by NVLink. The other hardware and software details are shown in Table 1. Table 2 shows the input parameters that are used to train Resnet50 neural network in both frameworks.

    Figure 1: C4130 configuration K

    Table 1: The hardware configuration and software details

    Table 2: Input parameters used in different deep learning frameworks

    Performance Evaluation

    Figure 2 and Figure 3 show the Resnet50 performance and speedup results on multiple nodes with MXNet and Caffe2, respectively. As we can see, the performance scales very well with both frameworks. With MXNet, compared to 1*V100, the speedup of using 16*V100 (in 4 nodes) is 15.4x in FP32 mode and 13.8x in FP16 mode, respectively. And compared to FP32, FP16 improved the performance by 63.28% - 82.79%. This performance improvement is attributed to the Tensor Cores in the V100.

    In Caffe2, compared to 1*V100, the speedup of using 16*V100 (4 nodes) is 14.8x in FP32 and 13.6x in FP16, respectively. And the performance improvement of using FP16 compared to FP32 is 50.42% - 63.75%, excluding the 12*V100 case. With 12*V100, using FP16 is only 29.26% faster than using FP32. We are still investigating the exact reason for this, but one possible explanation is that 12 is not a power of 2, which may make some operations, like reductions, slower.

    Figure 2: Performance of MXNet Resnet50 on multiple nodes

    Figure 3: Performance of Caffe2 Resnet50 on multiple nodes

    Conclusions and Future Work

    In this blog, we presented the performance of MXNet and Caffe2 on multiple V100-SXM2 nodes. The results demonstrate that these deep learning frameworks are able to scale very well on multiple Dell EMC PowerEdge servers. At this time the FP16 support in TensorFlow is still experimental; our evaluation is in progress and the results will be included in future blogs. We are also working on containerizing these frameworks with Singularity to make their deployment much easier.