High Performance Computing Blogs

High Performance Computing
A discussion venue for all things high performance computing (HPC), supercomputing, and the technologies that enable scientific research and discovery.
  • Containerizing HPC Applications with Singularity

    Authors: Rengan Xu, Frank Han, Nishanth Dandapanthula.

    HPC Innovation Lab. October 2017

    Overview

    In this blog, we give an introduction to Singularity containers and how they can be used to containerize HPC applications. We run different deep learning frameworks with and without Singularity containers and show that there is no performance loss with Singularity. We also show that Singularity can easily be used to run MPI applications.

    Introduction to Singularity

    Singularity is a container system developed by Lawrence Berkeley National Laboratory to provide container technology like Docker for High Performance Computing (HPC). It wraps applications into an isolated virtual environment to simplify application deployment. Unlike a virtual machine, a container does not include a virtual hardware layer or its own Linux kernel on top of the host OS; it simply sandboxes the environment, so the overhead and the performance loss are minimal. The goal of the container is reproducibility: the container holds all the environment and libraries an application needs to run, and it can be deployed anywhere so that anyone can reproduce the results the container creator generated for that application.

    Besides Singularity, another popular container technology is Docker, which has been widely used for many applications. However, there are several reasons why Docker is not well suited to an HPC environment, and these are the reasons we chose Singularity rather than Docker:

    • Security concerns. The Docker daemon runs with root privileges, which is a security concern for several high performance computing centers. In contrast, Singularity solves this by running the container with the user’s credentials. The access permissions of a user are the same inside and outside the container, so a non-root user cannot change anything beyond his/her normal permissions.

    • HPC Scheduler. Docker does not support any HPC job scheduler, but Singularity integrates seamlessly with all job schedulers including SLURM, Torque, SGE, etc.

    • GPU support. Docker does not support GPUs natively, while Singularity does. Users can install whatever CUDA version and software they want on the host, and it can be transparently passed to Singularity.

    • MPI support. Docker does not support MPI natively, so if a user wants to use MPI with Docker, an MPI-enabled Docker image needs to be developed. Even then, the network stacks such as TCP that MPI relies on are private to the container, which makes Docker containers unsuitable for more complicated fabrics like InfiniBand. In Singularity, the user’s environment, including the MPI stack, is shared with the container seamlessly (a minimal launch sketch follows below).
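
    Below is a minimal sketch of the usual launch pattern for an MPI application inside a Singularity container: the host's mpirun starts one container instance per rank, and because Singularity shares the host environment (MPI launcher, InfiniBand user-space libraries), the ranks communicate as they would on bare metal. The image name, binary path and rank count are placeholders, not the exact artifacts used for the results in this blog.

      import subprocess

      # Hypothetical container image and MPI binary; replace with your own.
      IMAGE = "hpc_app.simg"
      APP = "./my_mpi_app"
      NUM_RANKS = 16

      # The host's mpirun launches one "singularity exec" per rank; the containerized
      # ranks then talk to each other over the host's MPI and network stack.
      cmd = ["mpirun", "-np", str(NUM_RANKS), "singularity", "exec", IMAGE, APP]
      print("Launching:", " ".join(cmd))
      subprocess.run(cmd, check=True)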

    Challenges with Singularity in HPC and Workaround

    Many HPC applications, especially deep learning applications, have deep library dependencies, and it is time consuming to figure out these dependencies and debug build issues. Most deep learning frameworks are developed on Ubuntu, but they often need to be deployed to Red Hat Enterprise Linux, so it is beneficial to build those applications once in a container and then deploy them anywhere. The most important goal of Singularity is portability, which means that once a Singularity container is created, it can be run on any system. However, so far it is still not easy to achieve this goal. Usually we build a container on a laptop, a server, a cluster or a cloud, and then deploy that container on a server, a cluster or a cloud.

    When building a container, one challenge arises on GPU-based systems that have the GPU driver installed. If we install a GPU driver inside the container and the driver version does not match the host GPU driver, an error will occur, so the container should always use the host GPU driver. The next option is to bind the paths of the GPU driver binaries and libraries into the container so that these paths are visible inside it. However, if the container OS is different from the host OS, such binding may cause problems. For instance, assume the container OS is Ubuntu while the host OS is RHEL, and on the host the GPU driver binaries are installed in /usr/bin and the driver libraries in /usr/lib64. The container OS also has /usr/bin and /usr/lib64, so if we bind those paths from the host into the container, the other binaries and libraries inside the container may no longer work because they may not be compatible across Linux distributions. One workaround is to move all the driver-related files to a central place that does not exist in the container and then bind that central place.

    The second solution is to implement the above workaround inside the container so that the container can pick up the driver-related files automatically. This feature has already been implemented in the development branch of the Singularity repository; a user just needs to add the option “--nv” when launching the container. However, based on our experience, a cluster usually installs the GPU driver on a shared file system instead of the default local path on all nodes, and Singularity is then not able to find the GPU driver path because the driver is not in the default or common locations (e.g. /usr/bin, /usr/sbin, /bin, etc.). Even if the container is able to find the GPU driver and the corresponding driver libraries and we build the container successfully, an error will still occur at deployment time if the host driver version is not new enough to support the GPU libraries that were linked to the application when the container was built. Because the GPU driver is backward compatible, the deployment system should keep its GPU driver up to date to ensure the driver is equal to or newer than the GPU libraries that were used to build the container.
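
    To make the driver-compatibility requirement concrete, the sketch below queries the host driver version through nvidia-smi and compares it against the minimum driver the CUDA libraries inside the container need. The minimum-version value is a hypothetical example and the check itself is ours, not part of Singularity; it only illustrates the kind of sanity check worth running before deploying a GPU container.

      import subprocess

      # Hypothetical example: the CUDA libraries baked into the container are assumed
      # to require at least driver 384.xx on the host (check NVIDIA's release notes
      # for the toolkit version you actually built against).
      MIN_DRIVER_FOR_CONTAINER = (384, 59)

      def host_driver_version():
          """Return the host NVIDIA driver version as a tuple of ints, e.g. (384, 59)."""
          out = subprocess.check_output(
              ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
              universal_newlines=True,
          )
          first_gpu = out.strip().splitlines()[0]        # e.g. "384.59"
          return tuple(int(part) for part in first_gpu.split("."))

      host = host_driver_version()
      if host < MIN_DRIVER_FOR_CONTAINER:
          raise SystemExit("Host driver %s is older than the minimum %s required "
                           "by the GPU libraries inside the container."
                           % (host, MIN_DRIVER_FOR_CONTAINER))
      print("Host driver %s is new enough for this container." % (host,))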

    Another challenge is using InfiniBand with the container, because the InfiniBand driver is kernel dependent. There is no issue if the container OS and host OS are the same or compatible: for instance, RHEL and CentOS are compatible, as are Debian and Ubuntu. But if the two OSs are not compatible, there will be library compatibility issues if we let the container reuse the host InfiniBand driver and libraries; and if we instead install the InfiniBand driver inside the container, the driver in the container and the kernel on the host are not compatible. The Singularity community is still working hard to solve this InfiniBand issue. Our current solution is to keep the container OS and host OS compatible and let the container reuse the InfiniBand driver and libraries on the host.

    Singularity on Single Node

    To measure the performance impact of using a Singularity container, we ran the neural network Inception-V3 with three deep learning frameworks: NV-Caffe, MXNet and TensorFlow. The test server was a Dell PowerEdge C4130 configuration G. We compared the training speed in images/sec with a Singularity container and on bare metal (without a container). The performance comparison is shown in Figure 1. As we can see, there is no overhead or performance penalty when using a Singularity container.

    Figure 1: Performance comparison with and without Singularity

     

    Singularity at Scale

    We ran HPL across multiple nodes and compared the performance with and without a container. All nodes are Dell PowerEdge C4130 configuration G with four P100-PCIe GPUs, connected via Mellanox EDR InfiniBand. The result comparison is shown in Figure 2. As we can see, the performance difference is within ±0.5%, which is within the normal run-to-run variation of HPL. This indicates that MPI applications such as HPL can be run at scale with Singularity without performance loss.

     

    Figure 2: HPL performance on multiple nodes

    Conclusions

    In this blog, we introduced what Singularity is and how it can be used to containerize HPC applications. We discussed the benefits of using Singularity over Docker. We also mentioned the challenges of using Singularity in a cluster environment and the workarounds available. We compared the performance of bare metal vs Singularity container and the results indicated that there is no performance loss when using Singularity. We also showed that MPI applications can be run at scale without performance penalty with Singularity.

  • HPC Applications Performance on V100

    Authors: Frank Han, Rengan Xu, Nishanth Dandapanthula.

    HPC Innovation Lab. August 2017

     

    Overview

    This is one of two articles in our Tesla V100 blog series. In this blog, we present the initial benchmark results of NVIDIA® Tesla® Volta-based V100™ GPUs on four different HPC benchmarks, along with a comparative analysis against the previous generation Tesla P100 GPUs. The other blog in this series discusses V100 and deep learning applications; if you haven't read it yet, we highly recommend taking a look here.

    PowerEdge C4130 with V100 GPU support

    The NVIDIA® Tesla® V100 accelerator is one of the most advanced accelerators available in the market right now and was launched within one year of the P100 release. In fact, Dell EMC is the first in the industry to integrate Tesla V100 and bring it to market. As was the case with the P100, V100 supports two form factors: V100-PCIe and the mezzanine version V100-SXM2. The Dell EMC PowerEdge C4130 server supports both types of V100 and P100 GPU cards. Table 1 below notes the major enhancements in V100 over P100:

    Table 1: The comparison between V100 and P100

    |                                               | P100 (PCIe) | V100 (PCIe) | Improvement | P100 (SXM2) | V100 (SXM2) | Improvement |
    |-----------------------------------------------|-------------|-------------|-------------|-------------|-------------|-------------|
    | Architecture                                  | Pascal      | Volta       |             | Pascal      | Volta       |             |
    | CUDA Cores                                    | 3584        | 5120        |             | 3584        | 5120        |             |
    | GPU Max Clock Rate (MHz)                      | 1329        | 1380        |             | 1481        | 1530        |             |
    | Memory Clock Rate (MHz)                       | 715         | 877         | 23%         | 715         | 877         | 23%         |
    | Tensor Cores                                  | N/A         | 640         |             | N/A         | 640         |             |
    | Tensor Cores/SM                               | N/A         | 8           |             | N/A         | 8           |             |
    | Memory Bandwidth (GB/s)                       | 732         | 900         | 23%         | 732         | 900         | 23%         |
    | Interconnect Bandwidth, Bi-Directional (GB/s) | 32          | 32          |             | 160         | 300         |             |
    | Deep Learning (TFlops)                        | 18.6        | 112         | 6x          | 21.2        | 125         | 6x          |
    | Single Precision (TFlops)                     | 9.3         | 14          | 1.5x        | 10.6        | 15.7        | 1.5x        |
    | Double Precision (TFlops)                     | 4.7         | 7           | 1.5x        | 5.3         | 7.8         | 1.5x        |
    | TDP (Watt)                                    | 250         | 250         |             | 300         | 300         |             |

     

    V100 not only significantly improves performance and scalability as will be shown below, but also comes with new features. Below are some highlighted features important for HPC Applications:

    • Second-Generation NVIDIA NVLink™

      All four V100-SXM2 GPUs in the C4130 are connected by NVLink™ and each GPU has six links. The bi-directional bandwidth of each link is 50 GB/s, so the total bi-directional NVLink bandwidth per GPU is 300 GB/s. This is useful for applications requiring a lot of peer-to-peer data transfers between GPUs.

    • New Streaming Multiprocessor (SM)

      The single precision and double precision capability of the new SM is 50% higher than that of the previous P100 for both the PCIe and SXM2 form factors. The TDP (Thermal Design Power) of both cards is the same, which means V100 is ~1.5 times more energy efficient than the previous P100.

    • HBM2 Memory: Faster, Higher Efficiency

      The 900 GB/sec peak memory bandwidth delivered by V100 is 23% higher than that of P100. In addition, the DRAM utilization has been improved from 76% to 95%, which allows for a 1.5x improvement in delivered memory bandwidth.

      More in-depth details of all new features of V100 GPU card can be found at this Nvidia website.

       

    Hardware and software specification update

     

    All the performance results in this blog were measured on a PowerEdge C4130 server using Configuration G (4x V100-PCIe) and Configuration K (4x V100-SXM2). Both configurations were also used in our earlier P100 testing, and apart from the GPUs the hardware components remain identical to those used in the P100 tests: dual Intel Xeon E5-2690 v4 processors, 256GB of memory (16 x 16GB DIMMs at 2400 MHz) and an NFS file system mounted via IPoIB on InfiniBand EDR. Complete specification details are included in our previous blog. If you are interested in other C4130 configurations besides G and K, you can find them in our K80 blog.

    There are some changes on the software front. In order to unleash the power of the V100, it was necessary to use the latest version of all software components. Table 2 lists the versions used for this set of performance tests. To keep the comparison fair, the P100 tests were re-run with the new software stack to normalize for the upgraded software.

    Table 2: The changes in software versions

    | Software     | Current Version            | Previous version in P100 blog |
    |--------------|----------------------------|-------------------------------|
    | OS           | RHEL 7.3                   | RHEL 7.2                      |
    | GPU Driver   | 384.59                     | 361.77/375.20                 |
    | CUDA Toolkit | 9.0.103RC                  | 8.0.44                        |
    | OpenMPI      | 1.10.7 & 2.1.2             | 1.10.1 & 2.0.1                |
    | HPL          | Compiled with sm7.0        | Compiled with sm6.0           |
    | HPCG         | Compiled with sm7.0        | -                             |
    | AMBER        | 16, AmberTools17 update 20 | 16, AmberTools16 update 3     |
    | LAMMPS       | patch_17Aug2017            | 30Sep16                       |

     

    p2pBandwidthLatencyTest

     

    p2pBandwidthLatencyTest is a micro-benchmark included in the CUDA SDK. It tests the card-to-card bandwidth and latency with and without GPUDirect™ Peer-to-Peer enabled. Since the full output matrix is fairly long, only the unidirectional P2P result is listed below as an example to demonstrate how to verify the NVLink speed on both V100 and P100.

    In theory, V100 has 6x 25 GB/s uni-directional links, giving 150 GB/s of throughput. The previous P100-SXM2 only has 4x 20 GB/s links, delivering 80 GB/s. The results of p2pBandwidthLatencyTest on both cards are in Table 3. “D\D” represents “device-to-device”, that is, the bandwidth available between two devices (GPUs). The achievable bandwidth of GPU0 was calculated by aggregating the second, third and fourth values in the first line, which represent the throughput from GPU0 to GPU1, GPU2 and GPU3 respectively.
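
    The aggregation just described is easy to reproduce; the snippet below applies it to the measured V100 GPU0 row from Table 3 below (the bandwidth numbers are copied from the table, the rest is illustrative bookkeeping).

      # First row of the V100 unidirectional P2P bandwidth matrix (GB/s) from Table 3:
      # the diagonal entry is GPU0 to itself, the others are GPU0 -> GPU1/2/3.
      v100_gpu0_row = [727.38, 47.88, 47.90, 47.93]

      # Achievable P2P throughput of GPU0 = sum of the off-diagonal entries.
      achievable = sum(v100_gpu0_row[1:])

      # Theoretical peak: 6 NVLink links x 25 GB/s unidirectional each.
      theoretical = 6 * 25.0

      print("GPU0 achievable throughput: %.2f GB/s (%.1f%% of %.0f GB/s peak)"
            % (achievable, 100 * achievable / theoretical, theoretical))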

    Table 3: Unidirectional peer-to-peer bandwidth

    Unidirectional P2P=Enabled Bandwidth Matrix (GB/s), four GPU cards in the server.

    P100
    | D\D | 0      | 1      | 2      | 3      |
    |-----|--------|--------|--------|--------|
    | 0   | 231.53 | 18.36  | 18.31  | 36.39  |
    | 1   | 18.31  | 296.74 | 36.54  | 18.33  |
    | 2   | 18.35  | 36.08  | 351.51 | 18.36  |
    | 3   | 36.59  | 18.42  | 18.42  | 354.79 |

    V100
    | D\D | 0      | 1      | 2      | 3      |
    |-----|--------|--------|--------|--------|
    | 0   | 727.38 | 47.88  | 47.9   | 47.93  |
    | 1   | 47.92  | 725.61 | 47.88  | 47.89  |
    | 2   | 47.91  | 47.91  | 726.41 | 47.95  |
    | 3   | 47.96  | 47.89  | 47.9   | 725.02 |

    It is clearly seen that V100-SXM2 on C4130 configuration K is significantly faster than P100-SXM2, in terms of:

    1. Achievable throughput. V100-SXM2 has 47.88 + 47.9 + 47.93 = 143.71 GB/s of aggregated achievable throughput, which is 95.8% of the theoretical value of 150 GB/s and significantly higher than the 73.06 GB/s (91.3% of theoretical) achieved on P100-SXM2. The bandwidth for bidirectional traffic is twice that of unidirectional traffic and is also very close to the theoretical 300 GB/s throughput.

    2. Real world applications. Symmetric access is key for real world applications. On each P100 chip there are 4 links: three of them connect to each of the other three GPUs, and the remaining fourth link connects to just one of those three. So there are two links between GPU0 and GPU3, but only one link between GPU0 and GPU1 and between GPU0 and GPU2. This is not symmetrical. The p2pBandwidthLatencyTest numbers above show this imbalance: the value from GPU0 to GPU3 reaches 36.39 GB/s, double the bandwidth between GPU0 and GPU1 or GPU0 and GPU2. In most real world applications, it is common for the developer to treat all cards equally and not take such architectural differences into account, so the faster pairs of GPUs will likely have to wait for the slowest transfers, which means that 18.31 GB/s is the effective speed between all pairs of GPUs.

       On the other hand, V100 has a symmetrical design with 6 links, as seen in Figure 1: each pair of GPUs (GPU0-GPU1, GPU0-GPU2, GPU0-GPU3) has 2 links between them. So 47.88 GB/s is the achievable link bandwidth for every pair, which is 2.6 times faster than the P100.


    Figure 1: V100 and P100 Topologies on C4130 configuration K

     

    High Performance Linpack (HPL)

    Figure 2: HPL Multi-GPU results with V100 and P100 on C4130 configurations G and K

    Figure 2 shows the HPL performance on the C4130 platform with 1, 2 and 4 V100-PCIe and V100-SXM2 GPUs installed. P100 performance numbers are also listed for comparison. It can be observed that:

    1. Both P100 and V100 scale well: performance increases as more GPUs are added.

    2. V100 is ~30% faster than P100 on both PCIe (Config G) and SXM2 (Config K).

    3. A single C4130 server with 4x V100 reaches over 20TFlops on PCIe (Config G).

      HPL is a system-level benchmark and its performance is limited by other components such as the CPU, memory and PCIe bandwidth. Configuration G is a balanced design with two PCIe links between the CPUs and the GPUs, which is why it outperforms configuration K with 4x GPUs in the HPL benchmark. We do see some other applications perform better in Configuration K, since SXM2 (Config K) supports NVLink, a higher core clock speed and peer-to-peer data transfer; these are described below.

       

    HPCG

    Figure 3: HPCG Performance results with 4x V100 and P100 on C4130 configuration G and K

    HPCG, the High Performance Conjugate Gradients benchmark, is another well-known metric for HPC system ranking. Unlike HPL, its performance is strongly influenced by memory bandwidth. Thanks to the faster and more efficient HBM2 memory of V100, the observed performance improvement is 44% over P100 on both Configuration G and Configuration K.

    AMBER

    Figure 4: AMBER Multi-GPU results with V100 and P100 on C4130 configuration G and K

    Figure 4 illustrates AMBER’s results with the Satellite Tobacco Mosaic Virus (STMV) dataset. On the SXM2 system (Config K), AMBER scales weakly with 2 and 4 GPUs. Even though the scaling is not strong, V100 shows a noticeable improvement over P100, giving a ~78% increase in single-card runs; in fact, 1x V100 is 23% faster than 4x P100. On the PCIe (Config G) side, 1 and 2 cards perform similarly to SXM2, but the 4-card result drops sharply. This is because PCIe (Config G) only supports peer-to-peer access between GPU0/1 and GPU2/3 and not among all four GPUs. Since AMBER has redesigned the way data is transferred among GPUs to address the PCIe bottleneck, it relies heavily on peer-to-peer access for multi-GPU performance. Hence a fast, direct interconnect like NVLink between all GPUs in SXM2 (Config K) is vital for AMBER multi-GPU performance.

    Figure 5: AMBER Multi-GPU Aggregate results with V100 and P100 on C4130 configuration G and K

    To compensate for a single job’s weak scaling on multiple GPUs, there is another use case promoted by the AMBER developers: running multiple jobs on the same node concurrently, where each job uses only 1 or 2 GPUs. Figure 5 shows the results of 1-4 individual jobs on one C4130 with V100s, and the numbers indicate that those individual jobs have little impact on each other. This is because AMBER is designed to run almost entirely on the GPUs and has very low dependency on the CPU. The aggregate throughput of multiple individual jobs scales linearly in this case. Without any card-to-card communication, the 5% better performance on SXM2 is attributable to its higher clock speed.

     

    LAMMPS

     

    Figure 6: LAMMPS 4-GPU results with V100 and P100 on C4130 configuration G and K

    Figure 6 shows LAMMPS performance on both configurations G and K. The test dataset is the Lennard-Jones liquid dataset, which contains 512,000 atoms, and LAMMPS was compiled with the Kokkos package. V100 is 71% and 81% faster than P100 on Config G and Config K respectively. Comparing V100-SXM2 (Config K) and V100-PCIe (Config G), the former is 5% faster due to NVLink and the higher CUDA core frequency.

     

    Conclusion

     

    Figure 7: V100 Speedups on C4130 configuration G and K

    The C4130 server with NVIDIA® Tesla® V100™ GPUs demonstrates exceptional performance for HPC applications that require fast computational speed and high data throughput. Applications like HPL and HPCG benefit from the additional PCIe links between CPU and GPU that are offered by the Dell PowerEdge C4130 configuration G. On the other hand, applications like AMBER and LAMMPS are boosted by C4130 configuration K, owing to P2P access, the higher bandwidth of NVLink and the higher CUDA core clock speed. Overall, a PowerEdge C4130 with Tesla V100 GPUs performs 1.24x to 1.8x faster than a C4130 with P100 for HPL, HPCG, AMBER and LAMMPS.

     

  • Deep Learning on V100

    Authors: Rengan Xu, Frank Han, Nishanth Dandapanthula.

    HPC Innovation Lab. September 2017

    Overview

    In this blog, we introduce the NVIDIA Tesla Volta-based V100 GPU and evaluate it with different deep learning frameworks, comparing the performance of the V100 and P100 GPUs. We evaluate two types of V100: V100-PCIe and V100-SXM2. The results indicate that in training V100 is ~40% faster than P100 with FP32 and more than 100% faster than P100 with FP16, and that in inference V100 is 3.7x faster than P100. This is one blog in our Tesla V100 series; another blog in this series covers general HPC application performance on V100, and you can read it here.

    Introduction to V100 GPU

    At the 2017 GPU Technology Conference (GTC), NVIDIA announced the Volta-based V100 GPU. As with P100, there are two types of V100: V100-PCIe and V100-SXM2. V100-PCIe GPUs are interconnected by PCIe buses and the bi-directional bandwidth is up to 32 GB/s. V100-SXM2 GPUs are interconnected by NVLink; each GPU has six links and the bi-directional bandwidth of each link is 50 GB/s, so the total bi-directional bandwidth per GPU is up to 300 GB/s. A new type of core added in V100, the Tensor Core, was designed specifically for deep learning. These cores are essentially a collection of ALUs for performing 4x4 matrix operations: specifically a fused multiply-add (A*B+C), multiplying two 4x4 FP16 matrices together and then adding the result to a 4x4 FP16/FP32 matrix to generate a final 4x4 FP16/FP32 matrix. By fusing the matrix multiplication and add in one unit, the GPU can achieve high FLOPS for this operation. A single Tensor Core performs the equivalent of 64 FMA operations per clock (128 FLOPS total), and with 8 such cores per Streaming Multiprocessor (SM), 1024 FLOPS per clock per SM. By comparison, even with pure FP16 operations, the standard CUDA cores in an SM only generate 256 FLOPS per clock. So in scenarios where these cores can be used, V100 is able to deliver 4x the performance of P100. The detailed comparison between V100 and P100 is in Table 1.
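
    As a quick sanity check on those numbers, the arithmetic below reproduces the Tensor Core peak quoted in Table 1 from the per-SM figures in the text: 640 Tensor Cores at 8 per SM implies 80 SMs, and 1530 MHz is the V100-SXM2 max clock from Table 1. This is only a back-of-the-envelope calculation, not a measurement.

      # Per-SM Tensor Core math, as described above.
      tensor_cores_per_sm = 8
      flops_per_tensor_core_per_clock = 64 * 2          # 64 FMAs, each counted as 2 FLOPS
      flops_per_sm_per_clock = tensor_cores_per_sm * flops_per_tensor_core_per_clock
      assert flops_per_sm_per_clock == 1024

      # Chip-level peak for V100-SXM2: 640 Tensor Cores / 8 per SM = 80 SMs,
      # 1530 MHz max clock (both figures from Table 1).
      num_sms = 640 // 8
      max_clock_hz = 1530e6
      peak_tensor_tflops = num_sms * flops_per_sm_per_clock * max_clock_hz / 1e12
      print("Peak Tensor Core throughput: ~%.0f TFLOPS" % peak_tensor_tflops)   # ~125 TFLOPS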

                                                                      

    Table 1: The comparison between V100 and P100

    Testing Methodology

    As in our previous deep learning blog, we use the three most popular deep learning frameworks: NVIDIA’s fork of Caffe (NV-Caffe), MXNet and TensorFlow. Both NV-Caffe and MXNet have been optimized for V100. TensorFlow does not yet have an official release supporting V100, but we applied some patches obtained from the TensorFlow developers so that it is also optimized for V100 in these tests. For the dataset, we again use the ILSVRC 2012 dataset, whose training set contains 1,281,167 training images and 50,000 validation images. For the test neural network, we chose ResNet-50 as it is a computationally intensive network. To get the best performance, we used the CUDA 9 RC compiler and the cuDNN library in all three frameworks since they are optimized for V100. The test platform is Dell EMC’s PowerEdge C4130 server. The C4130 server has multiple configurations, and we evaluated the PCIe GPUs in configuration G and the SXM2 GPUs in configuration K. The difference between configuration G and configuration K is shown in Figure 1. There are two main differences: configuration G has two x16 PCIe links connecting the dual CPUs to the four GPUs, while configuration K has only one x16 PCIe bus connecting one CPU to the four GPUs; and the GPUs are connected by PCIe buses in configuration G but by NVLink in configuration K. The other hardware and software details are shown in Table 2.

      

    Figure 1: Comparison between configuration G and configuration K

    Table 2: The hardware configuration and software details


    In this experiment, we trained with each deep learning framework for one pass over the whole dataset, since we were comparing only the training speed, not the training accuracy. Other important input parameters for the different deep learning frameworks are listed in Table 3. For NV-Caffe and MXNet, we doubled the batch size for the FP16 tests, since FP16 consumes half the memory for floating point values that FP32 does. As TensorFlow does not support FP16 yet, we did not evaluate its FP16 performance in this blog. Because of implementation differences, NV-Caffe consumes more memory than MXNet and TensorFlow for the same neural network, so its batch size in FP32 mode is only half of that used in MXNet and TensorFlow. In NV-Caffe, if FP16 is used, then the data types of several parameters need to be changed: solver_data_type controls the data type for the master weights; default_forward_type and default_backward_type control the data type for the training values; and default_forward_math and default_backward_math control the data type for the matrix-multiply accumulator. In this blog we used FP16 for the training values, FP32 for the matrix-multiply accumulator and FP32 for the master weights. We will explore other combinations in future blogs. In MXNet, we tried different values for the parameter “--data-nthreads”, which controls the number of threads used for data decoding.
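
    The sketch below shows, in spirit, how the MXNet training command for these runs is assembled: the batch size is doubled when FP16 is selected and --data-nthreads is raised for V100. The script name (train_imagenet.py) and flag spellings follow MXNet's image-classification example of that era, but the values and paths here are illustrative placeholders rather than the exact command lines behind the published numbers.

      def mxnet_train_cmd(dtype="float32", base_batch_size=64, data_nthreads=16):
          """Build an illustrative MXNet ResNet-50 training command."""
          # FP16 needs half the memory of FP32, so the batch size is doubled
          # for FP16 runs, as described in the text.
          batch_size = base_batch_size * (2 if dtype == "float16" else 1)
          return ["python", "train_imagenet.py",          # MXNet image-classification example
                  "--network", "resnet", "--num-layers", "50",
                  "--gpus", "0,1,2,3",
                  "--batch-size", str(batch_size),
                  "--dtype", dtype,
                  "--data-nthreads", str(data_nthreads)]  # decode threads; 16 worked well on V100

      print(" ".join(mxnet_train_cmd(dtype="float16")))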
     

    Table 3: Input parameters used in different deep learning frameworks

     

    Performance Evaluation

    Figure 2, Figure 3, and Figure 4 show the performance of V100 versus P100 with NV-Caffe, MXNet and TensorFlow, respectively, and Table 4 shows the performance improvement of V100 compared to P100. From these results, we draw the following conclusions:

    • In both the PCIe and SXM2 versions, V100 is >40% faster than P100 in FP32 for both NV-Caffe and MXNet. This matches the theoretical speedup: FP32 is single precision, and V100 is 1.5x faster than P100 in single precision. With TensorFlow, V100 is more than 30% faster than P100; its improvement is lower than that of the other two frameworks, and we believe that is because of different algorithm implementations in these frameworks.

    • In both the PCIe and SXM2 versions, V100 is >2x faster than P100 in FP16. Based on the specifications, V100 tensor performance is ~6x P100 FP16 performance. The reason the actual speedup does not match the theoretical speedup is that not all data is stored in FP16, so not all operations are tensor operations (the fused matrix multiply-and-add operation).

    • On V100, the performance of FP16 is close to 2x that of FP32. This is because FP16 requires only half the storage of FP32, and therefore we could double the batch size in FP16 to improve the computation speed.

    • In MXNet, we set “--data-nthreads” to 16 instead of the default value of 4. The default value is often sufficient to decode more than 1K images per second, but that is still not fast enough for the V100 GPU. In our testing, we found the default value of 4 is enough for P100, but for V100 we needed to set it to at least 12 to achieve good performance, with a value of 16 being ideal.

    Figure 2: Performance of V100 vs P100 with NV-Caffe

    Figure 3: Performance of V100 vs P100 with MXNet

     

     

    Figure 4: Performance of V100 vs P100 with TensorFlow

    Table 4: Improvement of V100 compared to P100

     

    Since V100 supports both deep learning training and inference, we also tested the inference performance of V100 using the latest TensorRT 3.0.0. The testing was done in FP16 mode on both V100-SXM2 and P100-PCIe, and the result is shown in Figure 5. We used a batch size of 39 for V100 and 10 for P100; different batch sizes were chosen so that their inference latencies would be close to each other (~7 ms in the figure). The result shows that, at comparable latency, the inference throughput of V100 is 3.7x that of P100.

     Figure 5: Resnet50 inference performance on V100 vs P100

    Conclusions and Future Work

    After evaluating the performance of V100 with three popular deep learning frameworks, we conclude that in training V100 is more than 40% faster than P100 in FP32 and more than 100% faster in FP16, and that in inference V100 is 3.7x faster than P100. This demonstrates the performance benefit when the V100 tensor cores are used. In future work, we will evaluate different data type combinations in FP16 and study the accuracy impact of FP16 in deep learning training. We will also evaluate TensorFlow with FP16 once support is added to the software. Finally, we plan to scale the training to multiple nodes with these frameworks.

  • 14G with Skylake – how much better for HPC?

    Garima Kochhar, Kihoon Yoon, Joshua Weage. HPC Innovation Lab. 25 Sep 2017.

    The document below presents performance results on Skylake-based Dell EMC 14th generation systems for a variety of HPC benchmarks and applications (STREAM, HPL, WRF, BWA, ANSYS Fluent, STAR-CCM+ and LS-DYNA). It compares the performance of these new systems to previous generations, going as far back as Westmere (Intel Xeon 5500 series) in Dell EMC's 11th generation servers, showing the potential improvements when moving to this latest generation of server technology.

  • HPCG Performance study with Intel Skylake processors

    Authors: Somanath Moharana and Ashish Kumar Singh, Dell EMC HPC Innovation Lab, September 2017

    This blog presents an analysis of the High Performance Conjugate Gradients (HPCG) benchmark on the Intel(R) Xeon(R) Gold 6150 CPU, codenamed “Skylake”. It also compares the performance of Intel(R) Xeon(R) Gold 6150 processors with the previous generation Intel(R) Xeon(R) CPU E5-2697 v4 processors, codenamed “Broadwell-EP”.

    Introduction to HPCG

    The High Performance Conjugate Gradients (HPCG) benchmark is a metric for ranking HPC systems and can be considered a complement to the High Performance LINPACK (HPL) benchmark. HPCG is designed to exercise computational and data access patterns that more closely match a broad set of important applications, and thereby better reflect the collective performance of those applications.

    The HPCG benchmark is based on a 3D regular 27-point discretization of an elliptic partial differential equation. The 3D domain is scaled to fill a 3D virtual process grid for all of the available MPI ranks. The preconditioned conjugate gradient (CG) algorithm is used to solve the intermediate systems of equations and incorporates a local and symmetric Gauss-Seidel pre-conditioning step that requires a triangular forward solve and a backward solve. The benchmark exhibits irregular accesses to memory and fine-grain recursive computations.
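
    For readers less familiar with the method at the heart of HPCG, the snippet below is a textbook, unpreconditioned conjugate gradient loop in NumPy. It only illustrates the SPMV, dot-product and vector-update operations the benchmark times; the real HPCG adds the symmetric Gauss-Seidel preconditioner and MPI halo exchanges on top of this structure.

      import numpy as np

      def conjugate_gradient(A, b, tol=1e-8, max_iter=500):
          """Solve A x = b for a symmetric positive-definite A (plain CG, no preconditioner)."""
          x = np.zeros_like(b)
          r = b - A @ x                 # initial residual
          p = r.copy()                  # search direction
          rs_old = r @ r                # DDOT-style dot product
          for _ in range(max_iter):
              Ap = A @ p                # SPMV: the matrix-vector product HPCG times
              alpha = rs_old / (p @ Ap)
              x += alpha * p            # WAXPBY-style vector updates
              r -= alpha * Ap
              rs_new = r @ r
              if np.sqrt(rs_new) < tol:
                  break
              p = r + (rs_new / rs_old) * p
              rs_old = rs_new
          return x

      # Tiny self-check on a random symmetric positive-definite system.
      n = 200
      M = np.random.rand(n, n)
      A = M @ M.T + n * np.eye(n)
      b = np.random.rand(n)
      x = conjugate_gradient(A, b)
      print("residual norm:", np.linalg.norm(A @ x - b))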

     

    HPCG has four computational blocks: sparse matrix-vector multiplication (SPMV), symmetric Gauss-Seidel (SymGS), the vector update phase (WAXPBY) and the dot product (DDOT), plus two communication blocks: MPI_Allreduce and halo exchange.

     

    Introduction to Intel Skylake processor

     

    Intel Skylake is a microarchitecture redesign using the same 14 nm manufacturing process technology, with support for up to 28 cores per socket, serving as a “tock” in Intel's “tick-tock” manufacturing and design model. It supports 6 DDR4 memory channels per socket with 2 DPC (DIMMs per channel), and memory speeds of up to 2666 MT/s.

    Please see the BIOS characterization for HPC with Intel Skylake processors blog for a better understanding of Skylake processors and their BIOS features on Dell EMC platforms.

    Table 1: Details of Servers used for HPCG analysis

    | Platform                                 | PowerEdge C6420                             | PowerEdge R630                                   |
    |------------------------------------------|---------------------------------------------|--------------------------------------------------|
    | Processor                                | 2 x Intel(R) Xeon(R) Gold 6150 @2.7GHz, 18c | 2 x Intel(R) Xeon(R) CPU E5-2697 v4 @2.3GHz, 18c |
    | Memory                                   | 192GB (12 x 16GB) DDR4 @2666MT/s            | 128GB (8 x 16GB) DDR4 @2400MT/s                  |
    | Interconnect                             | Intel Omni-Path                             | Intel Omni-Path                                  |
    | Operating System                         | Red Hat Enterprise Linux Server release 7.3 | Red Hat Enterprise Linux Server release 7.2      |
    | Compiler                                 | Intel compiler version 2017.0.4.196         | Intel compiler version 2017.0.098                |
    | MKL                                      | Intel® MKL 2017.0.3                         | Intel® MKL 2017.0.0                              |
    | Processor Settings > Logical Processors  | Disabled                                    | Disabled                                         |
    | Processor Settings > Sub NUMA Cluster    | Disabled                                    | NA                                               |
    | System Profile                           | Performance                                 | Performance                                      |
    | HPCG                                     | Version 3.0                                 | Version 3.0                                      |

    HPCG Performance analysis with Intel Skylake

    In HPCG, the problem size has to be set appropriately to get the best results. For a valid run, the problem size should be large enough that the arrays accessed in the CG iteration loop do not fit in the cache of the device, and it should occupy a significant fraction of main memory, at least 1/4 of the total.

    Adjusting the local domain dimensions changes the global problem size. For HPCG performance characterization, we chose local domain dimensions of 160^3, 192^3 and 224^3 with an execution time of t=30 seconds. The local domain dimension defines the global domain dimension as (NR*Nx) x (NR*Ny) x (NR*Nz), where Nx=Ny=Nz=160, 192 or 224 and NR is the number of MPI processes used for the benchmark.
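
    The local problem size is supplied to HPCG through its hpcg.dat input file; the sketch below generates one file per local domain size used here. The four-line layout (two header lines, the local dimensions, the target run time) follows the reference HPCG distribution, but the header text and file names are our own placeholders.

      def write_hpcg_dat(nx, ny, nz, runtime_s, path="hpcg.dat"):
          """Write a reference-style hpcg.dat: two header lines, local dims, run time."""
          with open(path, "w") as f:
              f.write("HPCG benchmark input file\n")
              f.write("Local-domain sweep for Skylake characterization\n")
              f.write("%d %d %d\n" % (nx, ny, nz))
              f.write("%d\n" % runtime_s)

      # The three local domain sizes characterized in this blog, each run for 30 seconds.
      for n in (160, 192, 224):
          write_hpcg_dat(n, n, n, runtime_s=30, path="hpcg_%d.dat" % n)
          print("wrote hpcg_%d.dat for a %d^3 local domain" % (n, n))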

    Figure 1: HPCG Performance on multiple grid sizes with Intel Xeon Gold 6150 processors

    As shown in Figure 1, the local grid size of 192^3 gives the best performance compared to the other local grid sizes (160^3 and 224^3). Here we achieve 36.14 GFLOP/s on a single node, and we observe a linear increase in performance with the number of nodes. All of these tests were carried out with 4 MPI processes and 9 OpenMP threads per MPI process.

    Figure 2: Time consumed by HPCG computational routines Intel Xeon Gold 6150 processors

    The time spent in each routine is reported in the HPCG output file, as shown in Figure 2. As the graph shows, HPCG spends most of its time in the compute-intensive SymGS pre-conditioning and in the sparse matrix-vector multiplication (SPMV). The vector update phase (WAXPBY) consumes far less time than SymGS, and the residual calculation (DDOT) takes the least time of all four computational routines. As the local grid size is the same across all multi-node runs, the time spent in each of the four compute kernels is approximately the same for every multi-node run.


    Figure 3: HPCG performance over multiple generation of Intel processors

    Figure 3 compares HPCG performance between Intel Broadwell-EP processors and Intel Skylake processors. The dots in the figure show the performance improvement of Intel Skylake over Broadwell-EP processors. For a single node, we observe ~65% better performance with Skylake processors than with Broadwell-EP processors, and ~67% better performance for both two nodes and four nodes.

    Conclusion

    HPCG with Intel(R) Xeon(R) Gold 6150 processors shows ~65% higher performance than with Intel(R) Xeon(R) CPU E5-2697 v4 processors. HPCG also scales well, showing a linear increase in performance as the number of nodes increases.

    References

    1. http://www.hpcg-benchmark.org/index.html
    2. http://en.community.dell.com/techcenter/high-performance-computing/b/general_hpc/archive/2017/01/17/hpcg-performance-study-with-intel-knl
    3. https://software.intel.com/en-us/mkl-linux-developer-guide-overview-of-the-intel-optimized-hpcg
    4. https://www.intel.in/content/www/in/en/silicon-innovations/intel-tick-tock-model-general.html
    5. http://en.community.dell.com/techcenter/high-performance-computing/b/general_hpc/archive/2017/08/01/bios-characterization-for-hpc-with-intel-skylake-processor
    6. https://www.top500.org/project/linpack/

     

  • Performance study of four Socket PowerEdge R940 Server with Intel Skylake processors

    Author: Somanath Moharana, Dell EMC HPC Innovation Lab, August 2017

    This blog explores the performance of the four-socket Dell EMC PowerEdge R940 server with Intel Skylake processors. The latest Dell EMC 14th generation servers support the new Intel® Xeon® Processor Scalable Family (processor architecture codenamed “Skylake”), and the increased number of cores and higher memory speed benefit a wide variety of HPC applications.

    The PowerEdge R940 is Dell EMC’s latest 4-socket, 3U rack server designed to run complex workloads; it supports up to 6TB of DDR4 memory and up to 122TB of storage. The system features the Intel® Xeon® Scalable Processor Family, 48 DDR4 DIMMs, up to 13 PCI Express® (PCIe) 3.0 enabled expansion slots and a choice of embedded NIC technologies. It is a general-purpose platform capable of handling demanding workloads and applications, such as data warehouses, ecommerce, databases, and high performance computing (HPC). The increased storage capacity also makes the PowerEdge R940 well suited for data-intensive applications that require a large amount of storage.

    This blog also describes the impact of BIOS tuning options on HPL, STREAM and the scientific applications ANSYS Fluent and WRF, and compares the performance of the new PowerEdge R940 to the previous generation PowerEdge R930 platform. It also analyzes performance with the Sub NUMA Cluster (SNC) modes (SNC=Enabled and SNC=Disabled). With SNC enabled, eight NUMA nodes are exposed to the OS on a four-socket PowerEdge R940. Each NUMA node can communicate with seven remote NUMA nodes: six in the other three sockets and one within the same socket. NUMA domains on different sockets communicate over the UPI interconnect. Please see the BIOS characterization for HPC with Intel Skylake processors blog for more details on BIOS options. Table 1 lists the server configuration and the application details used for this study.

     Table 1: Details of Server and HPC Applications used for R940 analysis

    | Platform | PowerEdge R930 | PowerEdge R930 | PowerEdge R940 |
    |----------|----------------|----------------|----------------|
    | Processor | 4 x Intel Xeon E7-8890 v3 @2.5GHz (18 cores), 45MB L3 cache, 165W (Codename = Haswell-EX) | 4 x Intel Xeon E7-8890 v4 @2.2GHz (24 cores), 60MB L3 cache, 165W (Codename = Broadwell-EX) | 4 x Intel Xeon Platinum 8180 @2.5GHz, 10.4GT/s cross-bar connection (Codename = Skylake) |
    | Memory | 1024GB = 64 x 16GB DDR4 @1866MHz | 1024GB = 32 x 32GB DDR4 @1866MHz | 384GB = 24 x 16GB DDR4 @2666MT/s |
    | CPU Interconnect | Intel QuickPath Interconnect (QPI) 8GT/s | Intel QuickPath Interconnect (QPI) 8GT/s | Intel Ultra Path Interconnect (UPI) 10.4GT/s |
    | BIOS Settings | | | |
    | BIOS | Version 1.0.9 | Version 2.0.1 | Version 1.0.7 |
    | Processor Settings > Logical Processors | Disabled | Disabled | Disabled |
    | Processor Settings > UPI Speed | Maximum Data Rate | Maximum Data Rate | Maximum Data Rate |
    | Processor Settings > Sub NUMA Cluster | N/A | N/A | Enabled, Disabled |
    | System Profiles | PerfOptimized (Performance), PerfPerWattOptimizedDapc (DAPC) | PerfOptimized (Performance), PerfPerWattOptimizedDapc (DAPC) | PerfOptimized (Performance), PerfPerWattOptimizedDapc (DAPC) |
    | Software and Firmware | | | |
    | Operating System | Red Hat Enterprise Linux Server release 6.6 | Red Hat Enterprise Linux Server release 7.2 | Red Hat Enterprise Linux Server release 7.3 (3.10.0-514.el7.x86_64) |
    | Intel Compiler | Version 15.0.2 | Version 16.0.3 | Version 17.0.4 |
    | Intel MKL | Version 11.2 | Version 11.3 | 2017 Update 3 |
    | Benchmark and Applications | | | |
    | LINPACK | V2.1 from MKL 11.3 | V2.1 from MKL 11.3 | V2.1 from MKL update 3 |
    | STREAM | v5.10, Array Size 1800000000, Iterations 100 | v5.10, Array Size 1800000000, Iterations 100 | v5.4, Array Size 1800000000, Iterations 100 |
    | WRF | v3.5.1, Input Data Conus12KM, NetCDF 4.3.1 | v3.8, Input Data Conus12KM, NetCDF 4.4.0 | v3.8.1, Input Data Conus12KM and Conus2.5KM, NetCDF 4.4.2 |
    | ANSYS Fluent | v15, Input Data: truck_poly_14m | v16, Input Data: truck_poly_14m | v17.2, Input Data: truck_poly_14m, aircraft_wing_14m, ice_2m, combustor_12m, exhaust_system_33m |

     

    Note: The software versions were different for the older generation processors, and results are compared against what was the best configuration at that time. Due to the big architectural changes in servers and processors from generation to generation, the differences in software versions are not a significant factor.

    The High Performance Linpack (HPL) benchmark is a measure of a system's floating point computing power. It measures how fast a computer solves a dense n by n system of linear equations Ax = b, a common task in engineering. HPL was run with a block size of NB=384 and a problem size of N=217754. Since HPL is an AVX-512-enabled workload, its theoretical maximum performance is calculated as (rated base frequency of the processor * number of cores * 32 FLOP/cycle).
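
    For reference, the arithmetic below applies that formula to the four-socket R940 configuration in Table 1 (Xeon Platinum 8180: 28 cores per socket at a 2.5 GHz rated base clock; the 32 FLOP/cycle figure corresponds to two AVX-512 FMA units per core). Sustained AVX-512 clocks are lower than the rated base clock, so the measured HPL number will land below this value.

      def hpl_rpeak_gflops(sockets, cores_per_socket, base_ghz, flops_per_cycle=32):
          """Theoretical HPL peak: cores x clock x FLOP/cycle (32 for AVX-512 with 2 FMA units)."""
          return sockets * cores_per_socket * base_ghz * flops_per_cycle

      # Four-socket R940 with Xeon Platinum 8180 (28 cores, 2.5 GHz rated base clock).
      rpeak = hpl_rpeak_gflops(sockets=4, cores_per_socket=28, base_ghz=2.5)
      print("Theoretical peak: %.0f GFLOPS (%.2f TFLOPS)" % (rpeak, rpeak / 1000))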

          

                     Figure 1: Comparing HPL Performance across BIOS profiles                                        

    Figure 1 depicts the performance of the PowerEdge R940 server described in Table 1 with different BIOS options. The Performance profile with SNC=Disabled gives the best performance among the BIOS profiles tested; with SNC=Disabled we observe 1-2% better performance than with SNC=Enabled across all BIOS profiles.

    Figure 2: HPL performance with AVX2 and AVX512 instructions sets          Figure 3: HPL Performance over multiple generations of processors

    Figure 2 compares the performance of HPL run with the AVX2 and AVX-512 instruction sets on the PowerEdge R940 (AVX = Advanced Vector Extensions). AVX-512 is a set of 512-bit extensions to the 256-bit AVX SIMD instructions for the x86 instruction set architecture. The instruction set used by MKL can be selected by setting the MKL_ENABLE_INSTRUCTIONS=AVX2/AVX512 environment variable. Running HPL with the AVX-512 instruction set gives around a 75% improvement in performance compared to the AVX2 instruction set.

    Figure 3 compares the results of the four-socket R930 powered by Haswell-EX and Broadwell-EX processors with the R940 powered by Skylake processors. For HPL, the R940 server performed ~192% better than the R930 with four Haswell-EX processors and ~99% better than the R930 with Broadwell-EX processors. The performance improvement of Skylake over Broadwell-EX is due to the larger number of cores (28 versus 24 per socket) and the 75% increase in performance from the AVX-512 vector instructions.

    The Stream benchmark is a synthetic benchmark program that measures sustainable memory bandwidth and the corresponding computation rate for simple vector kernels. The STREAM benchmark calculates the memory bandwidth by counting only the bytes that the user program requested to be loaded or stored. This study uses the results reported by the TRIAD function of the stream bandwidth test.

    Figure 4: STREAM Performance across BIOS profiles                                Figure 5: STREAM Performance over multiple generations of processors

    As shown in Figure 4, SNC=Enabled delivers up to 3% better bandwidth than SNC=Disabled across all BIOS profiles. Figure 5 compares the memory bandwidth of the PowerEdge R930 server with Haswell-EX and Broadwell-EX processors against the PowerEdge R940 server with Skylake processors. Haswell-EX and Broadwell-EX support DDR3 and DDR4 memory respectively, while the platform in this configuration supports the same memory frequency for both generations of processors. Because DIMMs of the same memory frequency were used for both generations of processors on the PowerEdge R930, Broadwell-EX and Haswell-EX show the same memory bandwidth, but there is a ~51% increase in memory bandwidth with Skylake compared to the Broadwell-EX processors. This is due to the use of 2666 MT/s RDIMMs, which give around a 66% increase in maximum memory bandwidth compared to Broadwell-EX, and to the 50% increase in the number of memory channels per socket: 6 channels per socket for Skylake processors versus 4 channels per socket for Broadwell-EX.

    Figure 6: Comparing STREAM Performance with “SNC = Enabled”                    Figure 7: Comparing STREAM Performance with “SNC = Disabled”

    Figure 6 and Figure 7 describe the impact on memory bandwidth of traversing the UPI links between sockets on the PowerEdge R940 server. With SNC=Enabled, local memory bandwidth and remote-to-same-socket memory bandwidth are nearly the same (0-1% variation), but remote-to-other-socket memory bandwidth is ~57% lower than local memory bandwidth. With SNC=Disabled, remote memory bandwidth is 77% lower than local memory bandwidth.

    The Weather Research and Forecasting (WRF) Model is a mesoscale numerical weather prediction system designed for both atmospheric research and operational forecasting needs. It features two dynamical cores, a data assimilation system, and a software architecture facilitating parallel computation and system extensibility. The model serves a wide range of meteorological applications across scales from tens of meters to thousands of kilometers. WRF can generate atmospheric simulations using real data or idealized conditions. We used the CONUS12km and CONUS2.5km benchmark datasets for this study. 

    CONUS12km is a single-domain, small-size benchmark (a 48-hour, 12 km resolution case over the Continental U.S. (CONUS) domain from October 24, 2001) with a 72-second time step. CONUS2.5km is a single-domain, large-size benchmark (the final 3 hours of a 9-hour, 2.5 km resolution case over the CONUS domain from June 4, 2005) with a 15-second time step.

    WRF decomposes the domain into tasks or patches, and each patch can be further decomposed into tiles that are processed separately; by default there is only one tile per run. If the single tile is too large to fit into the cache of the CPU and/or core, computation slows down because of WRF's memory bandwidth sensitivity. To reduce the size of the tiles, the number of tiles can be increased by defining “numtile = x” in the input file or by setting the environment variable “WRF_NUM_TILES = x”. For both CONUS12km and CONUS2.5km the number of tiles was chosen based on the best observed performance.
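
    A minimal sketch of that tuning knob is shown below: it sets WRF_NUM_TILES in the environment before handing off to the MPI launcher. The tile count, rank count and binary path are placeholders to be swept for best performance on your own system, not the exact values behind the results in this blog.

      import os
      import subprocess

      def run_wrf(num_tiles, num_ranks, wrf_binary="./wrf.exe"):
          """Launch WRF with an explicit tile count (equivalent to numtile in the input file)."""
          env = os.environ.copy()
          env["WRF_NUM_TILES"] = str(num_tiles)   # more, smaller tiles -> better cache behaviour
          subprocess.run(["mpirun", "-np", str(num_ranks), wrf_binary],
                         env=env, check=True)

      # Example sweep: try a few tile counts and keep whichever gives the best runtime.
      for tiles in (1, 4, 8, 16):
          run_wrf(num_tiles=tiles, num_ranks=112)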

    Figure 8: WRF Performance across BIOS profiles (Conus12KM)              Figure 9: WRF Performance across BIOS profiles (Conus2.5KM)                                                 

    Figure 8 and Figure 9 compare the WRF datasets across the different BIOS profiles. For both CONUS12km and CONUS2.5km, the Performance profile with SNC=Enabled gives the best results, with SNC=Enabled performing ~1-2% better than SNC=Disabled. The performance difference across BIOS profiles is nearly negligible for CONUS12km because of its smaller dataset size, while for CONUS2.5km, which has a larger dataset and utilizes the large number of processors more efficiently, we observe 1-2% performance variation across the system profiles.

    Figure 10: Comparison over multiple generations of processors                     Figure 11: Comparison over multiple generations of processors

    Figure 10 and Figure 11 show the performance comparison between the PowerEdge R940 powered by Skylake processors and the PowerEdge R930 powered by Broadwell-EX and Haswell-EX processors. For CONUS12km, the PowerEdge R940 with Skylake is ~18% faster than the PowerEdge R930 with Broadwell-EX and ~45% faster than with Haswell-EX processors. For CONUS2.5km, Skylake performs ~29% better than Broadwell-EX and ~38% better than Haswell-EX processors.

    ANSYS Fluent is a computational fluid dynamics (CFD) software tool. Fluent includes well-validated physical modeling capabilities to deliver fast and accurate results across the widest range of CFD and multiphysics applications.

    Figure 12: Ansys Fluent Performance across BIOS profiles

    We used five different datasets for our analysis: truck_poly_14m, combustor_12m, exhaust_system_33m, ice_2m and aircraft_wing_14m, with Fluent's ‘Solver Rating’ (higher is better) as the performance metric. Figure 12 shows that all datasets performed better with the Performance SNC=Enabled BIOS option than with the others; for all datasets, SNC=Enabled performs 2% to 4% better than SNC=Disabled.

      

    Figure 13: Ansys Fluent (truck_poly_14m) performance over multiple generations of Intel processors

    Figure 13 shows the performance comparison of truck_poly_14m on the PowerEdge R940 with Skylake processors and the PowerEdge R930 with Broadwell-EX and Haswell-EX processors. On the PowerEdge R940, Fluent showed 46% better performance than on the PowerEdge R930 with Broadwell-EX and 87% better performance than with Haswell-EX processors.

     Conclusion:

    The PowerEdge R940 is a highly efficient 4-socket, next-generation platform that provides up to 122TB of storage capacity and about 6.3 TFLOPS of computing power, making it well suited for data-intensive applications without sacrificing performance. The Skylake processors give the PowerEdge R940 a performance boost compared to the previous generation server (PowerEdge R930); we observed more than 45% performance improvement across all the applications.

    Considering the above analysis, when comparing system profiles, the “Performance” profile gives better performance than the other system profiles.

    In conclusion, the PowerEdge R940 with Skylake processors is a good platform for a wide variety of applications and can meet the demand for more compute power in HPC applications.

  • Application Performance on P100-PCIe GPUs

    Authors: Rengan Xu, Frank Han and Nishanth Dandapanthu. Dell EMC HPC Innovation Lab. Feb 2017

    Introduction to P100-PCIe GPU

    This blog describes a performance analysis of NVIDIA® Tesla® P100™ GPUs on a cluster of Dell PowerEdge C4130 servers. There are two types of P100 GPUs: PCIe-based and SXM2-based. In the PCIe-based server, GPUs are connected by PCIe buses and one P100 delivers around 4.7 and 9.3 TeraFLOPS of double and single precision performance, respectively. In P100-SXM2, GPUs are connected by NVLink and one P100 delivers around 5.3 and 10.6 TeraFLOPS of double and single precision performance, respectively. This blog focuses on the P100 for PCIe-based servers, i.e. P100-PCIe. We have already analyzed P100 performance for several deep learning frameworks in this blog; the objective of this blog is to evaluate the performance of HPL, LAMMPS, NAMD, GROMACS, HOOMD-blue, Amber, ANSYS Mechanical and RELION. The hardware configuration of the cluster is the same as in the deep learning blog. Briefly, we used a cluster of four C4130 nodes; each node has dual Intel Xeon E5-2690 v4 CPUs and four NVIDIA P100-PCIe GPUs, and all nodes are connected with EDR InfiniBand. Table 1 shows detailed information about the hardware and software used in every compute node.

     

    Table 1: Experiment Platform and Software Details

    | Platform                | PowerEdge C4130 (configuration G)                 |
    |-------------------------|---------------------------------------------------|
    | Processor               | 2 x Intel Xeon CPU E5-2690 v4 @2.6GHz (Broadwell) |
    | Memory                  | 256GB DDR4 @ 2400MHz                              |
    | Disk                    | 9TB HDD                                           |
    | GPU                     | P100-PCIe with 16GB GPU memory                    |
    | Nodes Interconnects     | Mellanox ConnectX-4 VPI (EDR 100Gb/s InfiniBand)  |
    | InfiniBand Switch       | Mellanox SB7890                                   |
    | Software and Firmware   |                                                   |
    | Operating System        | RHEL 7.2 x86_64                                   |
    | Linux Kernel Version    | 3.10.0-327.el7                                    |
    | BIOS                    | Version 2.3.3                                     |
    | CUDA version and driver | CUDA 8.0.44 (375.20)                              |
    | OpenMPI compiler        | Version 2.0.1                                     |
    | GCC compiler            | 4.8.5                                             |
    | Intel Compiler          | Version 2017.0.098                                |
    | Applications            |                                                   |
    | HPL                     | Version hpl_cuda_8_ompi165_gcc_485_pascal_v1      |
    | LAMMPS                  | Version Lammps-30Sep16                            |
    | NAMD                    | Version NAMD_2.12_Source                          |
    | GROMACS                 | Version 2016.1                                    |
    | HOOMD-blue              | Version 2.1.2                                     |
    | Amber                   | Version 16update7                                 |
    | ANSYS Mechanical        | Version 17.0                                      |
    | RELION                  | Version 2.0.3                                     |


    High Performance Linpack (HPL)

    HPL is a parallel multicomputer application that measures how fast computers solve a dense n by n system of linear equations using LU decomposition with partial row pivoting, and it is designed to be run at very large scale. The HPL runs on this cluster use double precision floating point operations. Figure 1 shows the HPL performance on the tested P100-PCIe cluster. It can be seen that 1 P100 is 3.6x faster than 2 x E5-2690 v4 CPUs. HPL also scales very well with more GPUs within a node or across nodes. Recall that 4 P100 GPUs fit within one server, so the 8-, 12- and 16-GPU runs span 2, 3 and 4 servers. 16 P100 GPUs achieve a speedup of 14.9x compared to 1 P100.

    Note that the overall efficiency is calculated as: HPL efficiency = rMax / (CPU rPeak + GPU rPeak), where rPeak is the highest theoretical FLOPS result and rMax, the number reported by HPL, is the real performance that can be achieved. HPL cannot be run at the max boost clock; it typically runs at some clock in between, with the average closer to the base clock than to the max boost clock, which is why the efficiency is not very high. Although we also included the CPU rPeak in the efficiency calculation, when running HPL on P100 we set DGEMM_SPLIT=1.0, which means the CPU is not really contributing to the DGEMM but only handling other overhead, so it does not actually contribute many FLOPS. Although we observed that the CPUs stayed fully utilized, they were just handling the overhead and data movement needed to keep the GPUs fed. What matters most for P100 is that rMax is very large.
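
    The efficiency formula above is straightforward to evaluate; the sketch below does so for a single hypothetical node, using the per-device peaks quoted in the introduction (4.7 TFLOPS double precision per P100-PCIe) and the E5-2690 v4 CPU peak (14 cores x 2.6 GHz x 16 FLOP/cycle per socket). The rMax value is a placeholder, not a measured result from Figure 1.

      def hpl_efficiency(rmax_tflops, cpu_rpeak_tflops, gpu_rpeak_tflops):
          """HPL efficiency = rMax / (CPU rPeak + GPU rPeak)."""
          return rmax_tflops / (cpu_rpeak_tflops + gpu_rpeak_tflops)

      # One node, illustrative numbers:
      gpu_rpeak = 4 * 4.7                          # 4 x P100-PCIe, 4.7 TFLOPS DP each
      cpu_rpeak = 2 * 14 * 2.6e9 * 16 / 1e12       # 2 x E5-2690 v4 (AVX2 FMA), in TFLOPS
      rmax = 15.0                                  # placeholder rMax in TFLOPS
      print("Efficiency: %.1f%%" % (100 * hpl_efficiency(rmax, cpu_rpeak, gpu_rpeak)))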

    Figure 1: HPL performance on P100-PCIe

     
    NAMD

    NAMD (NAnoscale Molecular Dynamics) is a molecular dynamics application designed for high-performance simulation of large biomolecular systems. The dataset we used is Satellite Tobacco Mosaic Virus (STMV), a small icosahedral plant virus that worsens the symptoms of infection by Tobacco Mosaic Virus (TMV). This dataset has 1,066,628 atoms and is the largest dataset on the NAMD utilities website. The performance metric in the application's output log is “days/ns” (lower is better), but its inverse, “ns/day”, is used in our plots since that is what most molecular dynamics users focus on; the average of all occurrences of this value in the output log was used. Figure 2 shows the performance within 1 node. It can be seen that the performance with 2 P100 GPUs is better than with 4 P100 GPUs, probably because of the communication among different CPU threads: this application launches a set of worker threads that handle the computation and communication threads that handle the data communication, and as more GPUs are used, more communication threads and more synchronization are needed. In addition, based on profiling results from NVIDIA's CUDA profiler, nvprof, with 1 P100 the GPU computation takes less than 50% of the whole application time. According to Amdahl's law, the speedup with more GPUs will be limited by the remaining 50% of the work that is not parallelized on the GPU. Based on this observation, we further ran this application on multiple nodes with two different settings (2 GPUs/node and 4 GPUs/node), and the result is shown in Figure 3. The result shows that no matter how many nodes are used, the performance with 2 GPUs/node is always better than with 4 GPUs/node. Within a node, 2 P100 GPUs are 9.5x faster than dual CPUs.

     Figure 2: NAMD Performance within 1 P100-PCIe node

     Figure 3: NAMD Performance across Nodes

    GROMACS

    GROMACS (for GROningen MAchine for Chemical Simulations) primarily simulates biochemical molecules (bonded interactions), but because of its efficiency in calculating non-bonded interactions (atoms not linked by covalent bonds), its user base is expanding to non-biological systems. Figure 4 shows the performance of GROMACS on CPU, K80 GPUs and P100-PCIe GPUs. Since one K80 card contains two internal GPUs, from now on when we mention one K80 it always refers to both internal GPUs rather than one of the two. When testing with K80 GPUs, the same servers used for the P100-PCIe tests were used; the CPUs and memory were kept the same and the only difference is that the P100-PCIe GPUs were replaced with K80 GPUs. In all tests, there were four GPUs per server and all GPUs were utilized. For example, the 3-node data point uses 3 servers and 12 total GPUs. P100-PCIe is 4.2x to 2.8x faster than CPU from 1 node to 4 nodes, and 1.5x to 1.1x faster than K80 from 1 node to 4 nodes.

      Figure 4: GROMACS Performance on P100-PCIe

    LAMMPS

    LAMMPS (for Large Scale Atomic/Molecular Massively Parallel Simulator) is a classical molecular dynamics code capable of simulating solid-state materials (metals, semiconductors), soft matter (biomolecules, polymers) and coarse-grained or mesoscopic systems. It can be used to model atoms or, more generically, as a parallel particle simulator at the atomic, meso or continuum scale. The dataset we used was LJ (Lennard-Jones liquid benchmark), which contains 512,000 atoms. There are two GPU implementations in LAMMPS: the GPU library version and the Kokkos version. In our experiments we used the Kokkos version since it was much faster than the GPU library version.

    Figure 5 shows LAMMPS performance on CPU and P100-PCIe GPUs. Using 16 P100 GPUs is 5.8x faster than using 1 P100. The reason this application does not scale linearly is that the data transfer time (CPU->GPU, GPU->CPU and GPU->GPU) increases as more GPUs are used, even though the computation part shrinks linearly. The data transfer time increases because this application requires data communication among all of the GPUs used. However, configuration G only allows Peer-to-Peer (P2P) access for two pairs of GPUs: GPU 1 - GPU 2 and GPU 3 - GPU 4. GPUs 1/2 cannot communicate with GPUs 3/4 directly; when that communication is needed, the data must go through the CPU, which slows it down. Configuration B eases this issue since it allows P2P access among all four GPUs within a node. The comparison between configuration G and configuration B is shown in Figure 6. Running LAMMPS on a configuration B server with 4 P100 improved the performance metric "timesteps/s" to 510, compared to 505 in configuration G, a 1% improvement. The improvement is small because data communication takes less than 8% of the total application time when running on configuration G with 4 P100. Figure 7 also compares the performance of P100-PCIe with that of CPU and K80 GPUs for this application. It shows that within 1 node, 4 P100-PCIe are 6.6x faster than 2 E5-2690 v4 CPUs and 1.4x faster than 4 K80 GPUs.
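
    The P2P topology described above can also be inspected programmatically. The sketch below assumes CuPy is installed and that its cupy.cuda.runtime module exposes the CUDA runtime's getDeviceCount and deviceCanAccessPeer calls; it prints a peer-access matrix for all visible GPUs. On a configuration G node we would expect only the 0-1 and 2-3 device pairs (GPU 1/2 and GPU 3/4 in the text's numbering) to report peer access.

        # Sketch: print the GPU peer-to-peer (P2P) access matrix with CuPy.
        # Assumes CuPy is installed; cupy.cuda.runtime wraps the CUDA runtime API.
        from cupy.cuda import runtime

        num_gpus = runtime.getDeviceCount()
        for src in range(num_gpus):
            row = []
            for dst in range(num_gpus):
                if src == dst:
                    row.append("-")
                else:
                    # 1 if src can access dst's memory directly (P2P), else 0
                    row.append(str(runtime.deviceCanAccessPeer(src, dst)))
            print("GPU", src, ":", " ".join(row))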

     Figure 5: LAMMPS Performance on P100-PCIe

     

      

    Figure 6 : Comparison between Configuration G and Configuration B


    Figure 7: LAMMPS Performance Comparison

    HOOMD-blue

    HOOMD-blue (for Highly Optimized Object-oriented Many-particle Dynamics - blue) is a general-purpose molecular dynamics simulator. Figure 8 shows the HOOMD-blue performance; note that the y-axis is in logarithmic scale. We observe that 1 P100 is 13.4x faster than dual CPUs, and the speedup of 2 P100 over 1 P100 is 1.5x, which is reasonable. However, from 4 P100 to 16 P100 the speedup only grows from 2.1x to 3.9x, which is not high. The reason is that, similar to LAMMPS, this application also involves a lot of communication among all of the GPUs used. Based on the analysis for LAMMPS, using configuration B should reduce this communication bottleneck significantly. To verify this, we ran the same application again on a configuration B server. With 4 P100, the performance metric "hours for 10e6 steps" was reduced to 10.2 from 11.73 in configuration G, a 13% performance improvement, and the speedup compared to 1 P100 improved from 2.1x to 2.4x.

     Figure 8: HOOMD-blue Performance on CPU and P100-PCIe

    Amber

    Amber is the collective name for a suite of programs that allow users to carry out molecular dynamics simulations, particularly on biomolecules. The term Amber is also used to refer to the empirical force fields that are implemented in this suite. Figure 9 shows the performance of Amber on CPU and P100-PCIe. It can be seen that 1 P100 is 6.3x faster than dual CPUs, and 2 P100 GPUs are 1.2x faster than 1 P100. However, the performance drops significantly when 4 or more GPUs are used. The reason is that, similar to LAMMPS and HOOMD-blue, this application relies heavily on P2P access, but configuration G only supports it between two pairs of GPUs. We verified this by again testing the application on a configuration B node. As a result, the performance with 4 P100 improved to 791 ns/day from 315 ns/day in configuration G, a 151% performance improvement (a 2.5x speedup). Even in configuration B, however, the multi-GPU scaling is still not good. This is because when Amber's multi-GPU support was originally designed, the PCIe bus speed was Gen2 x16 and the GPUs were C1060s or C2050s. The current Pascal generation GPUs are more than 16x faster than the C1060s, while the PCIe bus speed has only increased by 2x (PCIe Gen2 x16 to Gen3 x16) and InfiniBand interconnects by about the same amount. The Amber website explicitly states that "It should be noted that while the legacy MPI and GPU-Direct methods of multi-GPU communication are still supported, and will be used by the code automatically if peer to peer communication is not available, you are very unlikely to see any speedup by using multiple GPUs for a single job if the GPUs are newer than C2050s. Multi-node runs are almost impossible to get to scale." This is consistent with our multi-node results: as Figure 9 shows, the more nodes are used, the worse the performance becomes.

      Figure 9: Amber Performance on CPU and P100-PCIe

    ANSYS Mechanical

     ANSYS® Mechanical software is a comprehensive finite element analysis (FEA) tool for structural analysis, including linear, nonlinear dynamic, hydrodynamic and explicit studies. It provides a complete set of element behaviors, material models and equation solvers for a wide range of mechanical design problems. The finite element method is used to solve the partial differential equations, which is a compute- and memory-intensive task. Our testing focused on the Power Supply Module (V17cg-1) benchmark. This is a medium-sized job for iterative solvers and a good test of memory bandwidth. Figure 10 shows the performance of ANSYS Mechanical on CPU and P100-PCIe. Within a node, 4 P100 are 3.8x faster than dual CPUs, and with 4 nodes, 16 P100 are 2.3x faster than 8 CPUs. The figure also shows that performance scales well with more nodes: the speedup with 4 nodes is 2.8x compared to 1 node.

     Figure 10: ANSYS Mechanical Performance on CPU and P100-PCIe

    RELION

    RELION (for REgularised LIkelihood OptimisatioN) is a program that employs an empirical Bayesian approach to the refinement of (multiple) 3D reconstructions or 2D class averages in electron cryo-microscopy (cryo-EM). Figure 11 shows the performance of RELION on CPU and P100-PCIe; note that the y-axis is in logarithmic scale. It demonstrates that 1 P100 is 8.8x faster than dual CPUs. From the figure we also notice that scaling is poor starting at 4 P100 GPUs. Because of the long execution time, we did not profile this application, but it is likely that the reason for the weak scaling is similar to LAMMPS, HOOMD-blue and Amber.

     Figure 11: RELION Performance on CPU and P100-PCIe


    Conclusions and Future Work


    In this blog, we presented and analyzed the performance of different applications on Dell PowerEdge C4130 servers with P100-PCIe GPUs. Among the tested applications, HPL, GROMACS and ANSYS Mechanical benefit from the balanced CPU-GPU layout of configuration G, because they do not require P2P access among GPUs. However, LAMMPS, HOOMD-blue, Amber (and possibly RELION) rely on P2P access. With configuration G, they therefore scale well up to 2 P100 GPUs but scale weakly with 4 or more P100 GPUs. With configuration B they scale better than with G at 4 GPUs, so configuration B is more suitable and recommended for applications implemented with P2P access.

    In future work, we will run these applications on P100-SXM2 and compare the performance difference between P100-PCIe and P100-SXM2.



  • NAMD Performance Analysis on Skylake Architecture

    Author: Joseph Stanfield

    The purpose of this blog is to provide a comparative performance analysis of the Intel® Xeon® Gold 6150 processor and the previous generation Xeon® E5-2697 v4 processor using the NAMD benchmark. The Xeon® Gold 6150 CPU features 18 physical cores, or 36 logical cores when utilizing Hyper-Threading. This processor is based on Intel's new microarchitecture, codenamed "Skylake". Intel significantly increased the L2 cache per core from 256 KB on Broadwell to 1 MB on Skylake. The 6150 also touts 24.75 MB of L3 cache and a six-channel DDR4 memory interface.

     

    Nanoscale Molecular Dynamics (NAMD) is an application developed using the Charm++ parallel programming model for molecular dynamics simulation. It is popular due to its parallel efficiency, scalability, and the ability to simulate millions of atoms.

    Test Cluster Configurations:

     

                     Dell EMC PowerEdge C6420                      Dell EMC PowerEdge C6320

    CPU              2x Xeon® Gold 6150 18c 2.7 GHz (Skylake)      2x Xeon® E5-2697 v4 16c 2.3 GHz (Broadwell)
    RAM              12x 16GB @2666 MHz                            8x 16GB @2400 MHz
    HDD              1TB SATA                                      1TB SATA
    OS               RHEL 7.3                                      RHEL 7.3
    InfiniBand       EDR ConnectX-4                                EDR ConnectX-4
    CHARM++          6.7.1
    NAMD             2.12_Source

     
    BIOS Settings

    BIOS Options                   Settings
    System Profile                 Performance Optimized
    Logical Processor              Disabled
    Virtualization Technology      Disabled


    The benchmark dataset selected for this series of tests was the Satellite Tobacco Mosaic Virus, or STMV. STMV contains 1,066,628 atoms, which makes it ideal for demonstrating scaling to large clustered environments. Performance is measured in nanoseconds per day (ns/day), the amount of simulated time computed per day of wall-clock time. A larger value indicates faster performance.

     

    The first series of benchmark tests measured CPU performance. The test environment consisted of a single node, two nodes, four nodes, and eight nodes, with the NAMD STMV dataset run three times for each configuration. The network interconnect between the nodes was EDR InfiniBand, as noted in the table above. Average results from a single node showed 0.70 ns/day, while a two-node run increased performance by 80% to 1.25 ns/day. This trend of roughly 80% more performance for each doubling of node count remained relatively consistent as the environment was scaled to eight nodes, as seen in Figure 1.
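
    As a quick illustration of this scaling trend, the snippet below computes the per-doubling gain from the single-node and two-node averages quoted above; extending the dictionary with the four- and eight-node averages from Figure 1 would give the full trend. Only the two values stated in the text are filled in here.

        # Sketch: per-doubling scaling gain from measured NAMD throughput (ns/day).
        ns_per_day = {1: 0.70, 2: 1.25}          # node count -> measured ns/day

        nodes = sorted(ns_per_day)
        for prev, cur in zip(nodes, nodes[1:]):
            gain = ns_per_day[cur] / ns_per_day[prev] - 1.0
            print("{} -> {} nodes: +{:.0%}".format(prev, cur, gain))
        # Expected output for the values above: "1 -> 2 nodes: +79%"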

    Figure 1.

     

    The second series of benchmarks compared the Xeon® Gold 6150 against the previous generation Xeon® E5-2697 v4. The same STMV dataset was used for both benchmark environments. As shown in Figure 2 below, the Xeon® Gold CPU results surpass the Xeon® E5 v4 by 111% on a single node, and the relative performance advantage decreases to 63% at eight nodes.

     


    Figure 2.

     

     

    Summary

    In this blog, we analyzed and presented the performance of a Dell EMC PowerEdge C6420 cluster scaling from a single node to eight nodes running NAMD with the STMV dataset. Results show that NAMD performance scales nearly linearly with increasing node count.

    At the time of publishing this blog, there is an issue with Intel Parallel Studio 2017.x and NAMD compilation. Intel recommends using Parallel Studio 2016.4 or 2018 (which is still in beta) with -xCORE-AVX512 under the FLOATOPS variable for best performance.

    A comparative analysis was also conducted with the previous generation Dell EMC PowerEdge C6320 server and Xeon® E5 v4 (Broadwell) processor. The Xeon® Gold outperformed the E5 v4 by 111% on a single node and maintained a substantial advantage as the cluster was scaled, although the gap narrowed at higher node counts.


    Resources

    Intel NAMD Recipe: https://software.intel.com/en-us/articles/building-namd-on-intel-xeon-and-intel-xeon-phi-processor

    Intel Fabric Tuning and Application Performance: https://www.intel.com/content/www/us/en/high-performance-computing-fabrics/omni-path-architecture-application-performance-mpi.html

  • LAMMPS Four Node Comparative Performance Analysis on Skylake Processors

    Author: Joseph Stanfield
     

    The purpose of this blog is to provide a comparative performance analysis of the Intel® Xeon® Gold 6150 processor (architecture codenamed "Skylake") and the previous generation Xeon® E5-2697 v4 processor using the LAMMPS benchmark. The Xeon® Gold 6150 CPU features 18 physical cores, or 36 logical cores when utilizing Hyper-Threading. Intel significantly increased the L2 cache per core from 256 KB on previous Xeon generations to 1 MB. The new processor also touts 24.75 MB of L3 cache and a six-channel DDR4 memory interface.

    LAMMPS, or Large Scale Atomic/Molecular Massively Parallel Simulator, is an open-source molecular dynamics program originally developed by Sandia National Laboratories, Temple University, and the United States Department of Energy. The main function of LAMMPS is to model particles in a gaseous, liquid, or solid state.

     

    Test cluster configuration

     

                     Dell EMC PowerEdge C6420                      Dell EMC PowerEdge C6320

    CPU              2x Xeon® Gold 6150 18c 2.7 GHz (Skylake)      2x Xeon® E5-2697 v4 16c 2.3 GHz (Broadwell)
    RAM              12x 16GB @2666 MT/s                           8x 16GB @2400 MT/s
    HDD              1TB SATA                                      1TB SATA
    OS               RHEL 7.3                                      RHEL 7.3
    InfiniBand       EDR ConnectX-4                                EDR ConnectX-4

    BIOS Settings

    BIOS Options                   Settings
    System Profile                 Performance Optimized
    Logical Processor              Disabled
    Virtualization Technology      Disabled


    The LAMMPS release used for testing was lammps-6June-17. The in.eam dataset was used for the analysis on both configurations. in.eam simulates a metallic solid with a Cu EAM potential, a 4.95 Angstrom cutoff (45 neighbors per atom), and NVE integration. The simulation was executed for 100 steps with 32,000 atoms. The first series of benchmarks measured performance in units of timesteps/s. The test environment consisted of four servers interconnected with InfiniBand EDR, and tests were run on a single node, two nodes, and four nodes, three times for each configuration. Average results from a single node showed 106 timesteps per second, while two nodes nearly doubled performance at 216 timesteps per second. This trend remained consistent as the environment was scaled to four nodes, as seen in Figure 1.

     

    Figure 1.

    The second series of benchmarks were run to compare the Xeon® Gold against the previous generation, Xeon® E5 v4. The same dataset, in.eam, was used with 32,000 atoms and 100 steps per run. As you can see below in Figure 2, the Xeon® Gold CPU outperforms the Xeon® E5 v4 by about 120% with each test, but the performance increase drops slightly as the cluster is scaled.


    Figure 2.

    Conclusion

    In this blog, we analyzed and presented the performance of a Dell EMC PowerEdge C6420 cluster scaling from a single node to four nodes running the LAMMPS benchmark. Results show that performance of LAMMPS scales linearly with the increased number of nodes.

     

    A comparative analysis was also conducted with the previous generation Dell EMC PowerEdge C6320 server with a Xeon® E5 v4 (Broadwell) processor. As with the first test, near-linear scaling with increasing node count was also observed on the Xeon® E5 v4, similar to the Xeon® Gold. However, the Xeon® Gold processor outperformed the previous generation CPU by about 120% in each run.


    Resources

     

     

  • BIOS characterization for HPC with Intel Skylake processor

    Ashish Kumar Singh. Dell EMC HPC Innovation Lab. Aug 2017

    This blog discusses the impact of the different BIOS tuning options available on Dell EMC 14th generation PowerEdge servers with the Intel Xeon® Processor Scalable Family (architecture codenamed “Skylake”) for some HPC benchmarks and applications. A brief description of the Skylake processor, BIOS options and HPC applications is provided below.  

    Skylake is a new 14nm “tock” processor in the Intel “tick-tock” series, which has the same process technology as the previous generation but with a new microarchitecture. Skylake requires a new CPU socket that is available with the Dell EMC 14th Generation PowerEdge servers. Skylake processors are available in two different configurations, with an integrated Omni-Path fabric and without fabric. The Omni-Path fabric supports network bandwidth up to 100Gb/s. The Skylake processor supports up to 28 cores, six DDR4 memory channels with speed up to 2666MT/s, and additional vectorization power with the AVX512 instruction set. Intel also introduces a new cache coherent interconnect named “Ultra Path Interconnect” (UPI), replacing Intel® QPI, that connects multiple CPU sockets.

    Skylake offers a new, more powerful AVX512 vectorization technology that provides 512-bit vectors. The Skylake CPUs include models that support two 512-bit Fused Multiply-Add (FMA) units to deliver 32 Double Precision (DP) FLOPS/cycle and models with a single 512-bit FMA unit that is capable of 16 DP FLOPS/cycle. More details on AVX512 are described in the Intel programming reference. With 32 FLOPS/cycle, Skylake doubles the compute capability of the previous generation, Intel Xeon E5-2600 v4 processors ("Broadwell").
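
    The 16 and 32 DP FLOPS/cycle figures follow directly from the vector width and the FMA unit count; the short sketch below simply restates that arithmetic.

        # Sketch: DP FLOPS/cycle for AVX-512, as quoted above.
        vector_bits   = 512
        dp_bits       = 64
        lanes         = vector_bits // dp_bits   # 8 doubles per AVX-512 register
        flops_per_fma = 2                        # a fused multiply-add counts as 2 FLOPS
        for fma_units in (1, 2):
            print(fma_units, "FMA unit(s):", lanes * flops_per_fma * fma_units, "DP FLOPS/cycle")
        # Prints 16 for one FMA unit and 32 for two FMA units per core.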

    Skylake processors are supported in the Dell EMC PowerEdge 14th Generation servers. The new processor architecture allows different tuning knobs, which are exposed in the server BIOS menu. In addition to existing options for performance and power management, the new servers also introduce a clustering mode called Sub NUMA clustering (SNC). On CPU models that support SNC, enabling SNC is akin to splitting the single socket into two NUMA domains, each with half the physical cores and half the memory of the socket. If this sounds familiar, it is similar in utility to the Cluster-on-Die option that was available in E5-2600 v3 and v4 processors as described here. SNC is implemented differently from COD, and these changes improve remote socket access in Skylake when compared to the previous generation. At the Operating System level, a dual socket server with SNC enabled will display four NUMA domains. Two of the domains will be closer to each other (on the same socket), and the other two will be a larger distance away, across the UPI to the remote socket. This can be seen using OS tools like numactl -H.
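
    Besides numactl -H, the NUMA layout can also be read directly from sysfs on Linux. The sketch below lists each NUMA node with its CPU list and memory size; on an SNC-enabled dual-socket server it should report four nodes. The sysfs paths used are standard on Linux, but treat this as a rough illustration rather than a supported tool.

        # Sketch: list NUMA nodes, their CPUs and memory from Linux sysfs
        # (rough equivalent of the information shown by "numactl -H").
        import glob, os, re

        for node_dir in sorted(glob.glob("/sys/devices/system/node/node[0-9]*")):
            node = os.path.basename(node_dir)
            with open(os.path.join(node_dir, "cpulist")) as f:
                cpus = f.read().strip()
            with open(os.path.join(node_dir, "meminfo")) as f:
                # meminfo reports "Node N MemTotal:  <kB> kB" on its first line
                total_kb = int(re.search(r"MemTotal:\s+(\d+)", f.read()).group(1))
            print("{}: cpus {}, memory {:.1f} GB".format(node, cpus, total_kb / 1024**2))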

    In this study, we have used the Performance and PerformancePerWattDAPC system profiles based on our earlier experiences with other system profiles for HPC workloads. The Performance Profile aims to optimize for pure performance. The DAPC profile aims to balance performance with energy efficiency concerns. Both of these system profiles are meta options that, in turn, set multiple performance and power management focused BIOS options like Turbo mode, Cstates, C1E, Pstate management, Uncore frequency, etc.

    We have used two HPC benchmarks and two HPC applications to understand the behavior of SNC and System Profile BIOS options with Dell EMC PowerEdge 14th generation servers. This study was performed with a single server only; cluster level performance deltas will be bounded by these single server results. The server configuration used for this study is described below.    

    Testbed configuration:

    Table 1: Test configuration of new 14G server

    Components                                          Details

    Server                                                     PowerEdge C6420 

    Processor                                               2 x Intel Xeon Gold 6150 – 2.7GHz, 18c, 165W

    Memory                                                  192GB (12 x 16GB) DDR4 @2666MT/s

    Hard drive                                              1 x 1TB SATA HDD, 7.2k rpm

    Operating System                                   Red Hat Enterprise Linux-7.3 (kernel - 3.10.0-514.el7.x86_64)

    MPI                                                         Intel® MPI 2017 update4

    MKL                                                        Intel® MKL 2017.0.3

    Compiler                                                 Intel® compiler 17.0.4

    Table 2: HPC benchmarks and applications

    Application                                Version                                               Benchmark

    HPL                                             From Intel® MKL                                Problem size - 92% of total memory

    STREAM                                      v5.04                                                  Triad

    WRF                                            3.8.1                                                  conus2.5km

    ANSYS Fluent                              v17.2                                                  truck_poly_14m, Ice_2m, combustor_12m

     

    Sub-NUMA cluster

    As described above, a system with SNC enabled will expose four NUMA nodes to the OS on a two-socket PowerEdge server. Each NUMA node can communicate with three remote NUMA nodes, two in the other socket and one within the same socket. NUMA domains on different sockets communicate over the UPI interconnect. With the Intel® Xeon Gold 6150 18-core processor, each NUMA node will have nine cores. Since both sockets are equally populated in terms of memory, each NUMA domain will have one fourth of the total system memory.

                    

    Figure 1: Memory bandwidth with SNC enabled

    Figure 1 plots the memory bandwidth with SNC enabled. Except for SNC and Logical Processor, all other options are set to BIOS defaults. Full system memory bandwidth is ~195 GB/s on the two-socket server. This test uses all available 36 cores for memory access and calculates aggregate memory bandwidth. The "Local socket - 18 threads" data point measures the memory bandwidth of a single socket with 18 threads. As per the graph, local socket memory bandwidth is ~101 GB/s, which is about half of the full system bandwidth. By enabling SNC, a single socket is divided into two NUMA nodes. The memory bandwidth of a single SNC-enabled NUMA node is denoted by "Local NUMA node - 9 threads". In this test, the nine local cores access their local memory attached to their NUMA domain. The memory bandwidth here is ~50 GB/s, which is half of the total local socket bandwidth.

    The data point "Remote to same socket" measures the memory bandwidth between two NUMA nodes that are on the same socket, with cores on one NUMA domain accessing the memory of the other NUMA domain. As per the graph, the server measures ~50 GB/s memory bandwidth for this case, the same as the "Local NUMA node - 9 threads" case. That is, with SNC enabled, memory access within the socket delivers similar bandwidth even across NUMA domains. This is a big difference from the previous generation, where there was a penalty when accessing memory on the same socket with COD enabled. See Figure 1 in the previous blog where a 47% drop in bandwidth was observed and compare that to the 0% performance drop here. The "Remote to other socket" test involves cores on one NUMA domain accessing the memory of a remote NUMA node on the other socket. This bandwidth is 54% lower due to non-local memory access over the UPI interconnect.

    These memory bandwidth tests are interesting, but what do they mean? As in previous generations, SNC is a good option for codes that have high NUMA locality. Reducing the size of the NUMA domain can help some codes run faster due to fewer snoops and cache coherence checks within the domain. Additionally, the penalty for remote accesses on Skylake is not as bad as it was for Broadwell.

                                 

     

     Figure 2: Comparing Sub-NUMA clustering with DAPC

    Figure 2 shows the effect of SNC on multiple HPC workloads; note that all of these have good memory locality. All options except SNC and Hyper-Threading are set to BIOS defaults. SNC disabled is considered the baseline for each workload. As per Figure 2, all tests measure no more than 2% higher performance with SNC enabled. Although this is well within the run-to-run variation for these applications, SNC enabled consistently shows marginally higher performance for STREAM, WRF and Fluent for these datasets. The performance delta will vary for larger and different datasets. For many HPC clusters, this level of tuning for a few percentage points might not be worth it, especially if applications with sub-optimal memory locality would be penalized.

     

    The Dell EMC default setting for this option is “disabled”, i.e. two sockets show up as just two NUMA domains. The HPC recommendation is to leave this at disabled to accommodate multiple types of codes, including those with inefficient memory locality, and to test this on a case-by-case basis for the applications running on your cluster.

     

    System Profiles

    Figure 3 plots the impact of different system profiles on the tests in this study. For these studies, all BIOS options are default except System Profile and Logical Processor. The DAPC profile with SNC disabled is used as the baseline. Most of these workloads show similar performance on both the Performance and DAPC system profiles; only HPL performance is a few percent higher with the Performance profile. As per our earlier studies, the DAPC profile always consumes less power than the Performance profile, which makes it suitable for HPC workloads without compromising too much on performance.

                                                                               

     Figure 3: Comparing System Profiles

    Power Consumption

    Figure 4 shows the power consumption of different system profiles with SNC enabled and disabled. The HPL benchmark is well suited to stressing the system and utilizing its maximum compute capability. We measured idle power and peak power consumption with Logical Processor disabled.

                       

    Figure 4: Idle and peak power consumption

    As per Figure 4, the DAPC profile with SNC disabled shows the lowest idle power consumption relative to the other profiles. Both the Performance and DAPC system profiles consume up to ~5% less power at idle with SNC disabled. At idle, the Performance profile consumes ~28% more power than DAPC.

    The peak power consumption is similar with SNC enabled and with SNC disabled. Peak power consumption in DAPC Profile is ~16% less than in Performance Profile. 

    Conclusion

    The Performance system profile is still the best profile for achieving maximum performance for HPC workloads. However, the power savings of the DAPC profile outweigh the small performance gain of the Performance profile, which makes DAPC the most suitable system profile overall.

    Reference:

    http://en.community.dell.com/techcenter/extras/m/white_papers/20444326