Dell Community

Latest Blog Posts
  • Life Sciences

    Variant Calling (BWA-GATK) pipeline benchmark with Dell EMC Ready Bundle for HPC Life Sciences

    13G/14G server performance comparisons with Dell EMC Isilon and Lustre Storage

    Overview

    Variant calling is the process of identifying variants from sequence data. It determines whether single nucleotide polymorphisms (SNPs), insertions and deletions (indels), and/or structural variants (SVs) exist at a given position in an individual genome or transcriptome. The main goal of identifying genomic variations is to link them to human diseases. Although not all human diseases are associated with genetic variations, variant calling can provide valuable guidance for geneticists working on a particular disease caused by genetic variations. BWA-GATK is one of the Next Generation Sequencing (NGS) computational pipelines designed to identify germline and somatic mutations from human NGS data. There are a handful of variant identification tools, and no single tool performs perfectly (Pabinger et al., 2014). However, we chose GATK, one of the most popular tools, as our benchmark to demonstrate how well the Dell EMC Ready Bundle for HPC Life Sciences can process complex and massive NGS workloads.
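    To make the benchmark workload concrete, below is a minimal sketch of the core steps of a BWA-GATK germline pipeline using the tool versions listed in Table 1. The file names, thread counts and reference genome name are illustrative only, and the full pipeline documented by the Broad Institute includes additional steps (for example, base quality score recalibration).

    # Minimal BWA-GATK germline sketch (illustrative names and thread counts)
    REF=hs37d5.fa                    # reference, pre-indexed (bwa index, samtools faidx, sequence dictionary)
    SAMPLE=ERR091571

    # 1. Align paired-end reads and convert to BAM
    bwa mem -t 20 -M ${REF} ${SAMPLE}_1.fastq.gz ${SAMPLE}_2.fastq.gz | \
        samtools view -bS - > ${SAMPLE}.bam

    # 2. Sort, mark duplicates and index
    sambamba sort -t 20 -o ${SAMPLE}.sorted.bam ${SAMPLE}.bam
    sambamba markdup -t 20 ${SAMPLE}.sorted.bam ${SAMPLE}.dedup.bam
    sambamba index -t 20 ${SAMPLE}.dedup.bam

    # 3. Call germline variants with GATK 3.6 HaplotypeCaller (GVCF mode)
    java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
        -R ${REF} -I ${SAMPLE}.dedup.bam --emitRefConfidence GVCF \
        -o ${SAMPLE}.g.vcf.gz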

    The purpose of this blog is to provide valuable performance information on the Intel® Xeon® Gold 6148 processor and the previous generation Xeon® E5-2697 v4 processor using a BWA-GATK pipeline benchmark on Dell EMC Isilon F800/H600 and Dell EMC Lustre Storage. The Xeon® Gold 6148 CPU features 20 physical cores or 40 logical cores when utilizing hyper threading. This processor is based on Intel’s new micro-architecture codenamed “Skylake”. Intel significantly increased the L2 cache per core from 256 KB on Broadwell to 1 MB on Skylake. The 6148 also touts 27.5 MB of L3 cache and a six channel DDR4 memory interface. The test cluster configurations are summarized in Table 1.

    Table 1 Test Cluster Configurations

                                  Dell EMC PowerEdge C6420                    Dell EMC PowerEdge C6320
    CPU                           2x Xeon® Gold 6148 20c 2.4GHz (Skylake)     2x Xeon® E5-2697 v4 18c 2.3GHz (Broadwell)
    RAM                           12x 16GB @2666 MHz                          8x 16GB @2400 MHz
    OS                            RHEL 7.3                                    RHEL 7.3
    Interconnect                  Intel® Omni-Path                            Intel® Omni-Path
    BIOS System Profile           Performance Optimized                       Performance Optimized
    Logical Processor             Disabled                                    Disabled
    Virtualization Technology     Disabled                                    Disabled
    BWA                           0.7.15-r1140                                0.7.15-r1140
    sambamba                      0.6.5                                       0.6.5
    samtools                      1.3.1                                       1.3.1
    GATK                          3.6                                         3.6

    The test clusters and F800/H600 storage systems were connected via 4x 100GbE links between two Dell Networking Z9100-ON switches. Each of the compute nodes was connected to the test-cluster-side Dell Networking Z9100-ON switch via a single 10GbE link. The four storage nodes in the Dell EMC Isilon F800/H600 were connected to the other switch via 8x 40GbE links. The configuration of the storage is listed in Table 2. For the Dell EMC Lustre Storage connection, four servers (a Metadata Server (MDS) pair and an Object Storage Server (OSS) pair) were connected to the 13th/14th generation servers via Intel® Omni-Path. The detailed network topology is illustrated in Figure 1.

    Figure 1 Illustration of Networking Topology: only 8 compute nodes are illustrated here for simplicity; 64 nodes of 13G/14G servers were used for the actual tests.

    Table 2 Storage Configurations

                          Dell EMC Isilon F800             Dell EMC Isilon H600             Dell EMC Lustre Storage
    Number of nodes       4                                4                                2x Dell EMC PowerEdge R730 as MDS,
                                                                                            2x Dell EMC PowerEdge R730 as OSS
    CPU per node          Intel® Xeon™ E5-2697A v4         Intel® Xeon™ E5-2680 v4          2x Intel® Xeon™ E5-2630 v4
                          @2.60 GHz                        @2.40 GHz                        @2.20 GHz
    Memory per node       256GB                            256GB                            256GB
    Storage Capacity      Total usable: 166.8 TB           Total usable: 126.8 TB           960 TB raw, 768 TB (698 TiB) usable
                          (41.7 TB per node)               (31.7 TB per node)               MDS array: 1x Dell EMC PowerVault MD3420
                                                                                            (24x 2.5" 300GB 15K RPM SAS)
                                                                                            OSS array: 4x Dell EMC PowerVault MD3460
                                                                                            (240x 3.5" 4TB 7.2K RPM NL SAS)
    SSD L3 Cache          N/A                              2.9 TB per node                  N/A
    Network               Front end: 40GbE                 Front end: 40GbE                 Front end: Intel® Omni-Path
                          Back end: 40GbE                  Back end: IB QDR                 Internal connections: 12Gbps SAS
    OS                    Isilon OneFS v8.1.0 DEV.0        Isilon OneFS v8.1.0.0            Red Hat Enterprise Linux 7.2
                                                           B_8_1_0_011                      (3.10.0-327.el7.x86_64)

    The test data was chosen from one of Illumina’s Platinum Genomes. ERR091571 was sequenced on an Illumina HiSeq 2000, submitted by Illumina, and can be obtained from EMBL-EBI. The DNA identifier for this individual is NA12878. Although the description of the data on the linked website shows that this sample has a 30x depth of coverage, in reality it is closer to 10x coverage according to the number of reads counted.
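    As a quick sanity check on the depth of coverage, the read count can be verified directly from the FASTQ files. A minimal sketch, assuming paired-end reads of roughly 100 bp and the usual 4-lines-per-read FASTQ layout (the file name is illustrative):

    # Count read pairs and estimate depth of coverage over a ~3.2 Gbp genome
    READS=$(( $(zcat ERR091571_1.fastq.gz | wc -l) / 4 ))
    echo "scale=1; ${READS} * 2 * 100 / 3200000000" | bc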

    Performance Evaluation

    BWA-GATK Benchmark over Generations

    Dell EMC PowerEdge C6320s and C6420s were configured as listed in Table 1. The tests performed here are designed to demonstrate performance at the server level, not to compare individual components. At the same time, the tests were also designed to estimate sizing information for Dell EMC Isilon F800/H600 and Dell EMC Lustre Storage. The data points in Figure 2 are calculated based on the total number of samples (the X axis in the figure) that were processed concurrently. The genomes-per-day metric is obtained from the total running time taken to process the total number of samples in a test, as sketched below. The smoothed curves are generated with a cubic B-spline (piecewise polynomials of degree 3). Details of the BWA-GATK pipeline can be obtained from the Broad Institute web site.
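    For reference, each data point in Figure 2 follows directly from the wall-clock time of a run; a minimal sketch of the genomes-per-day calculation (the variable values below are placeholders, not measured results):

    # genomes per day = concurrent samples * 86400 / total elapsed seconds
    SAMPLES=320        # total number of samples processed concurrently in the test
    ELAPSED=98000      # total pipeline wall-clock time in seconds (placeholder)
    echo "scale=1; ${SAMPLES} * 86400 / ${ELAPSED}" | bc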

    Figure 2 BWA-GATK Benchmark over the 13th/14th generations with Dell EMC Isilon and Dell EMC Lustre Storage: the numbers of compute nodes used for the tests were 64x C6420s and 63x C6320s (64x C6320s for testing H600). The number of samples per node was increased to reach the desired total number of samples processed concurrently. For the C6320 (13G), 3 samples per node was the maximum each node could process. The 64, 104, and 126 sample results for the 13G system (blue) were obtained with 2 samples per node, while the 129, 156, 180, 189 and 192 sample results were obtained with 3 samples per node. For the C6420 (14G), the tests were performed with a maximum of 5 samples per node; the 14G plot was generated by processing 1, 2, 3, 4, and 5 samples per node. The number of samples per node is limited by the amount of memory in a system: 128 GB and 192 GB of RAM were used in the 13G and 14G systems, respectively, as shown in Table 1. The C6420s show better scaling behavior than the C6320s. The 13G server with Broadwell CPUs appears more sensitive to the number of samples loaded onto the system, as shown by the 126 vs. 129 sample results on all the storage systems tested in this study.

    The results with Dell EMC Isilon F800 indicate that the C6320 with Broadwell/128GB RAM delivers roughly 50 fewer genomes per day when 3 samples are processed per compute node, and 30 fewer genomes per day when 2 samples are processed per compute node, compared to the C6420. It is not clear whether the C6320’s performance will drop again when more samples are added to each compute node; however, the C6420 does not show this behavior as the number of samples per compute node is increased. The results also allow estimating the maximum performance of Dell EMC Isilon F800: as the total number of genomes increases, the growth in the genomes-per-day metric slows down. Unfortunately, we were not able to identify the exact number of C6420s that would saturate a four-node Dell EMC Isilon F800. However, it is safe to say that more than 64x C6420s will require additional Dell EMC Isilon F800/H600 nodes to maintain high performance with more than 320 10x whole human genome samples. Dell EMC Lustre Storage did not scale as well as Dell EMC Isilon F800/H600. However, we observed that some optimizations are necessary to make Dell EMC Lustre Storage perform better. For example, the aligning, sorting, and duplicate-marking steps in the pipeline perform extremely well when the file system’s stripe size is set to 2MB, while other steps perform very poorly with a 2MB stripe size (an example of setting the stripe size per directory is sketched below). This suggests that Dell EMC Lustre Storage needs to be tuned further for the heterogeneous workloads in this pipeline. Since there is no single configuration that suits every step of the pipeline, we will further investigate the idea of using multiple tiers of file systems to cover the different requirements of each step for both Dell EMC Isilon and Lustre Storage.
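    As an example of the per-step tuning mentioned above, Lustre stripe settings can be applied per directory so that only the alignment/sorting/duplicate-marking scratch space uses the larger stripe size. A minimal sketch (the mount point and directory names are illustrative):

    # Use a 2 MB stripe size, striped across all OSTs, for the alignment scratch directory only
    lfs setstripe -S 2M -c -1 /lustre/scratch/alignment
    # Verify the layout
    lfs getstripe /lustre/scratch/alignment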

    Dell EMC PowerEdge C6320 with Dell EMC Isilon H600 performance reached the maximum around 140 concurrent 10x human whole genomes. Running three 10x samples concurrently on a single node is not ideal. This limit appears to be on the compute node side, since H600 performance is much better with C6420s running a similar number of samples.

    Conclusion

    Dell EMC PowerEdge C6420 has at least a 12% performance gain compared to the previous generation. Each C6420 compute node with 192 GB RAM can process about seven 10x whole human genomes per day. This number could be increased if the C6420 compute node is configured with more memory. In addition to the improvement on the 14G server side, four Isilon F800 nodes in a 4U chassis can support 64x C6420s and 320 10x whole human genomes concurrently.

    Resources

    Internal web page http://en.community.dell.com/techcenter/blueprints/blueprint_for_hpc/m/mediagallery/20442903

    External web page https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3956068/

    Contacts
    Americas
    Kihoon Yoon
    Sr. Principal Systems Dev Eng
    Kihoon.Yoon@dell.com
    +1 512 728 4191

  • Custom Solutions Engineering Blog

    Modular System CPU and Memory Configurations can affect Performance

    Written by: Donald Russell

    This blog will help you understand how CPU and memory choices affect memory performance on Dell PowerEdge servers with the new Skylake processors.

    Let’s start with the Skylake processor architecture. The new Skylake processors have 6 memory channels per CPU, controlled by 2 internal memory controllers, each handling 3 memory channels. See Figures 1 and 5 below to better visualize how this is implemented. These memory channels must be populated in certain ways, and combined with the right processor choice, to achieve the best memory performance that Dell PowerEdge servers can deliver.

     

    Figure 1. SkyLake Memory controller layout.

    Another consideration on the Skylake processors is that the memory controller speed differs between the classes of Skylake CPUs (Platinum, Gold, Silver and Bronze). Below is a memory speed table for the different classes of Skylake processors. Use this table when choosing the CPU as well as the memory for the server workload requirements of Dell PowerEdge systems.

    Figure 2. Memory Controller Speed Vs. CPU selection

    The CPU vs. memory controller speed table above shows there is up to a 20% difference in memory controller speed between the 81xx/61xx Skylake processors and the 31xx Skylake processors, and a 10% difference between the 81xx/61xx and the 51xx/41xx Skylake processors. If the server workload is memory intensive, it would be better served by choosing an 81xx or 61xx Skylake processor over the other classes of Skylake processors.

    Figure 3. DIMM types vs. Ranks and Data Width

    Memory ranks and data width can also affect memory performance with Skylake processors. The table above helps explain why 16GB RDIMMs give the best memory performance on Skylake processors, as seen in Figure 4. The 16GB RDIMMs have 2 ranks and a data width of x8, whereas the other RDIMMs and LRDIMMs have a data width of x4 with 2 ranks. The Skylake internal memory controller, when combined with 16GB RDIMMs, gives up to 5% better memory performance than the other DIMM sizes. Combining the effects of memory type and memory controller speed, there can be up to a 48% difference in memory performance between the highest Skylake CPU, the 8180 (2667 MHz), and the lowest, the 3106 (2133 MHz). The table below shows the memory speed differences between the Skylake classes of CPUs, as well as the differences due to DIMM ranking and data width. 8GB RDIMMs are only single rank and are generally 2 to 5% slower than the dual-ranked 16GB and 32GB DIMMs. 32GB RDIMMs have a data width of x4 vs. x8 for the 16GB RDIMMs, so the 16GB DIMMs are generally 2 to 3% faster than the 32GB RDIMMs.

    Figure 4. Memory speed difference by CPU class and memory size

     

    In this section of the blog, we will discuss the difference between balanced and unbalanced memory configurations and how it affects memory performance.

    Figure 5. Balanced vs. Unbalanced Memory vs. DIMM Count

    Let’s start with keeping the memory balanced while adding DIMMs to the Skylake memory configuration. As you can see in the figure above, memory must be added to the Skylake processors in a certain way to keep the memory controllers and channels balanced. Sometimes it is better to keep the memory on just one controller, especially when the configuration has only 1 or 3 DIMMs. But with 2, 4, or 6 DIMMs it is better to place 1, 2, or 3 DIMMs on each integrated memory controller. In addition, the table shows that there is no way to make 5-DIMM or 7 through 11 DIMM configurations balanced: there will be either 2 DIMMs on some memory channels (in the 7 through 11 DIMM configurations) or a memory channel without a DIMM (as in the 5 DIMM configuration in the table above). This also explains why a Dell PowerEdge system with 5, 7 through 11, or 13 through 23 DIMMs cannot be made balanced. Keeping the memory balanced is important because an unbalanced configuration can result in a performance impact of up to 65%. See Figure 8 for more details.

     

    Now that we understand how the memory controllers work on a Skylake processor, let’s put it into practice on Dell systems that have more than 6 memory slots per CPU. The Dell modular systems have 8 memory slots per CPU, which means that, because of how the Skylake memory controllers work, populating all 8 memory slots achieves the maximum memory size of the Dell PowerEdge modular system, but the memory will be unbalanced and memory performance will be lower, as seen in Figure 8. The figure below shows the Dell PowerEdge modular systems’ 8 DIMM slots per CPU: if the black slots on memory channels 0 and 3 are populated, the result is an unbalanced memory configuration that affects memory performance. Placing memory in these black slots puts memory channels 0 and 3 in 2 DIMMs per channel (DPC) mode while leaving memory channels 1, 2, 4 and 5 in single DPC mode, as explained in the section above.

    Figure 6. Skylake memory controller implementation on C6420, FC\M 640

    Below is a table of the maximum memory supported in a balanced configuration for the C6420 & FC\M 640 PowerEdge servers. A good rule on these Dell modular systems is that as the memory size requirement increases, the DIMM size must increase to keep the memory balanced and avoid impacting memory performance.

    Figure 7. Max Performance Balanced Memory DIMM configurations

    The black slots on the C6420 & FC\M640 can be used to increase the overall memory size of the system, but at the cost of up to 65% of the memory bandwidth. This must be kept in mind when choosing the memory for the server workload. To achieve the max memory performance on the Dell PowerEdge C6420 & FC\M640, only use the 12 white DIMM slots in a dual CPU configuration to increase the memory size. It is better to increase the DIMM size as seen in the table above to achieve maximum memory performance on these systems.
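    One quick way to verify on a running node that only the intended (white) slots are populated is to read the DIMM inventory from SMBIOS. A small sketch (slot labels such as A1-A8 and B1-B8 follow the system board labeling; empty sockets report "No Module Installed"):

    # List every DIMM socket with its size, rank and slot locator
    sudo dmidecode -t memory | grep -E 'Locator:|Size:|Rank:'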

    The table below shows the unbalanced memory performance impact of up to 65% by using all 16 DIMM slots to achieve the max configurable memory size of the Dell Modular Systems.

    Figure 8. CPU choice (memory controller speed) vs. DIMM count vs. DIMM size.

    The figures below of the Dell PowerEdge C6420 and FC\M 640 show that the memory slots A7, B7, A8, B8 place memory controller channels 0 and 3 in 2 DIMMs per channel mode. Since the other 4 memory controller channels are in single DIMM per channel mode, this will cause a drop in memory performance, as reflected in the performance tables above.

    Figure 9. FC\M640 Memory Configuration table

    Figure 10. C6400\C6420 Memory Configuration table

     

    In conclusion, processor and memory configurations must be considered when creating system configurations for 14G Dell PowerEdge modular systems. Failing to choose the correct processor and memory configuration can cause dramatic performance impacts on Dell PowerEdge modular servers.

    To learn more about Dell Custom Solutions Engineering visit www.dell.com/customsolutions

  • General HPC

    Deep Learning Performance on R740 with V100-PCIe GPUs

    Authors: Rengan Xu, Frank Han, Nishanth Dandapanthula

    Dell EMC HPC Innovation Lab. February 2018

    Overview

    The Dell EMC PowerEdge R740 is a 2-socket, 2U rack server. The system features Intel Skylake processors, up to 24 DIMMs, and up to 3 double-width or 6 single-width GPUs. In our previous blog, Deep Learning Inference on P40 vs P4 with SkyLake, we presented deep learning inference performance on Dell EMC’s PowerEdge R740 server with P40 and P4 GPUs. This blog presents deep learning training performance on a single R740 with multiple V100-PCIe GPUs. The deep learning frameworks we benchmarked include Caffe2, MXNet and Horovod+TensorFlow. Horovod is a distributed training framework for TensorFlow; we used it because its MPI-based implementation scales better than TensorFlow’s built-in distribution, as explained in the article “Meet Horovod: Uber’s Open Source Distributed Deep Learning Framework for TensorFlow”. Table 1 shows the hardware configuration and software details we tested. To test deep learning performance and scalability on the R740 server, we used the same neural network, the same dataset and the same measurements as in our other deep learning blogs, such as Scaling Deep Learning on Multiple V100 Nodes and Deep Learning on V100.
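    Because Horovod launches one training process per GPU through MPI, the single-node multi-GPU runs take the form of an mpirun invocation similar to the sketch below. The training script name is a placeholder, and the flags follow Horovod's documented single-node recipe:

    # One Horovod/TensorFlow process per V100 on a single R740 (script name is illustrative)
    mpirun -np 3 -H localhost:3 \
           -bind-to none -map-by slot \
           -x LD_LIBRARY_PATH -x PATH \
           python tf_resnet50_horovod.py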

    Table 1: The hardware configuration and software details


    Performance Evaluation

    Figures 1, 2 and 3 show the Resnet50 performance and speedup with multiple V100 GPUs for Caffe2, MXNet and TensorFlow, respectively. We can draw the following conclusions from these results:

    • Overall, the performance of Resnet50 scales well on multiple V100 GPUs within one node. With 3 V100 GPUs:

      • Caffe2 achieved the speedup of 2.61x and 2.65x in FP32 and FP16 mode, respectively.

      • MXNet achieved the speedup of 2.87x and 2.82x in FP32 and FP16 mode, respectively.

      • Horovod+TensorFlow achieved the speedup of 2.12x in FP32 mode. (FP16 still under development)

    • Performance in FP16 mode is around 80%-90% higher than FP32 for both Caffe2 and MXNet. TensorFlow does not support FP16 yet, so we will test its FP16 performance once this feature is supported.

     Figure 1: Caffe2: Performance and speedup of V100


    Figure 2: MXNet: Performance and speedup of V100


    Figure 3: TensorFlow: Performance and speedup of V100

    Conclusions

    In this blog, we presented the deep learning performance and scalability of popular deep learning frameworks such as Caffe2, MXNet and Horovod+TensorFlow. Overall, the three frameworks scale as expected across the GPUs within a single R740 server.

  • General HPC

    Digital Manufacturing with 14G

    Author: Joshua Weage, HPC Innovation Lab, February 2018

     

    Dell EMC Ready Bundle for HPC Digital Manufacturing

     

    Four technical white papers were recently published describing the newly released Dell EMC Ready Bundle for HPC Digital Manufacturing. These papers discuss the architecture of the system as well as the performance of ANSYS® Mechanical™, ANSYS® Fluent® and ANSYS® CFX®, LSTC LS-DYNA®, Simcenter STAR-CCM+™ and Dassault Systèmes’ Simulia Abaqus on the Dell EMC Ready Bundle for HPC.

     

    Dell EMC Ready Bundle for HPC Digital Manufacturing - ANSYS Performance

      

    Dell EMC Ready Bundle for HPC Digital Manufacturing - LSTC LS-DYNA Performance

     

    Dell EMC Ready Bundle for HPC Digital Manufacturing - Simcenter STAR-CCM+ Performance

     

    Dell EMC Ready Bundle for HPC Digital Manufacturing - Dassault Systèmes’ Simulia Abaqus Performance

  • General HPC

    HPC Applications Performance on R740 with V100 GPUs

     

    Authors: Frank Han, Rengan Xu, Nishanth Dandapanthula.

    HPC Innovation Lab. February 2018

    Overview

     

    Not long ago, the PowerEdge R740 server was released as part of Dell’s 14th generation server portfolio. It is a 2U Intel Skylake based rack mount server that provides an ideal balance between storage, I/O and application acceleration. Besides VDI and cloud, the server is also designed for HPC workloads. Compared to the previous R730 server, one of the major changes in GPU support is that the R740 supports up to 3 double-width cards, one more than the R730 could support. This blog focuses on the performance of a single R740 server with 1, 2 and 3 Nvidia Tesla V100-PCIe GPUs. Multi-card scaling numbers for HPC applications such as High Performance Linpack (HPL), the High Performance Conjugate Gradients benchmark (HPCG) and the Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS) are presented.

     

    Table 1: Details of R740 configuration and software version

    Processor                                   2x Intel(R) Xeon(R) Gold 6150 @2.7GHz, 18c
    Memory                                      384GB (12x 32GB @2666 MHz)
    Local Disk                                  480GB SSD
    Operating System                            Red Hat Enterprise Linux Server release 7.3
    GPU                                         Nvidia Tesla V100-PCIe
    CUDA Driver                                 387.26
    CUDA Toolkit                                9.1.85
    Processor Settings > Logical Processors     Disabled
    System Profile                              Performance

     

    High Performance Linpack (HPL)

    Figure 1: HPL Performance and efficiency with R740

    Figure 1 shows HPL performance and efficiency numbers. Performance increases nearly linearly with multiple GPUs. The efficiency line is not flat: the peak of 67.8% appears with 2 cards, which means the 2-GPU configuration is the most optimized one for HPL. The 1-card and 3-card results are about 7% lower than the 2-card result, and they are affected by different factors:

    • For the 1-card case, this GPU-accelerated HPL implementation is designed to bind GPUs to CPUs. When running at large scale with multiple GPU cards and nodes, binding each GPU to its local CPU is known to make data access more efficient. Testing with only 1 V100 is a special case: only the first CPU is bound to the single GPU, and there is no workload on the second CPU. Compared with the 2-card result, the unused portion of Rpeak increases, so efficiency drops. This does not mean the GPU is less efficient; it simply reflects that HPL is designed and optimized for large scales, and the single-GPU case is a corner case for HPL.

    • For the 3-card case, one of the major limiting factors is that a 3x1 (P x Q) process grid must be used. HPL is known to perform better with a square P x Q grid; we also verified 2x2 and 4x1 grids on a C4140 with 4 GPUs, and the 2x2 grid did perform better (see the HPL.dat excerpt below). Keep in mind, however, that with 3 cards Rmax increases significantly as well. HPL is an extreme benchmark; in the real world, the ability to add a third GPU gives a big advantage with different applications and different datasets.
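    The process grid is controlled by the Ps and Qs lines in HPL.dat; a hedged excerpt of only the process-grid section used for the 3-card runs is shown below (all other HPL.dat parameters are omitted):

    1            # of process grids (P x Q)
    3            Ps
    1            Qs

    For the square 2x2 grid used in the 4-GPU C4140 comparison, Ps and Qs would both be set to 2.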

     

    HPCG

    Figure 2: HPCG Performance with R740

    As shown in Figure 2, compared with the CPU-only performance of dual Xeon Gold 6150 processors, a single V100 is already 3 times faster, 2 V100s are 8 times faster, and 3 V100s (CPU-only node vs. GPU node) are nearly 12 times faster. Just like HPL, HPCG is designed for large scale, so single-card performance is not as efficient as multi-card performance. But unlike HPL, 3-card HPCG performance scales linearly: it is 1.48 times higher than the 2-card result, which is very close to the theoretical 1.5x. This is because the entire HPCG workload runs on the GPUs and its data fits in GPU memory. This proves that applications like HPCG can take advantage of having the third GPU.

     

    LAMMPS

    Figure 3: LAMMPS Performance with R740

     

    The LAMMPS version used for this testing is 17Aug2017, the latest stable version at the time of testing. The test dataset is in.intel.lj, the same one used in all previous GPU LAMMPS testing, and it can be found here. With the same parameters as previous testing, the initial box-replication values were x=4 and y=z=2, so the simulation runs with 512K atoms. Scaling was poor at this size: the timesteps/s numbers with 2 and 3 cards were only 1.5 and 1.7 times that of a single card, because the workload is not heavy enough for 3 V100 GPUs. After increasing x, y and z to 8, the simulation generates 16M atoms and performance scales well with multiple cards. As shown in Figure 3, 2 and 3 cards are 1.8 and 2.4 times faster than a single card, respectively. This LAMMPS result is another example of a GPU-accelerated HPC application that benefits from having more GPUs in the system; a sketch of the run command follows below.
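    For reference, a hedged sketch of how such a GPU-accelerated in.intel.lj run can be launched is shown below. The executable name and MPI process count depend on the local build, and the -var settings correspond to the 16M-atom case described above:

    # Replicate the LJ box 8x8x8 (~16M atoms) and offload the pair computation to 3 GPUs
    mpirun -np 36 lmp -in in.intel.lj \
           -var x 8 -var y 8 -var z 8 \
           -sf gpu -pk gpu 3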

     

    Conclusion

    The R740 server with multiple Nvidia Tesla V100-PCIe GPUs demonstrates exceptional performance for applications like HPL, HPCG and LAMMPS. Besides balanced I/O, the R740 has the flexibility to run HPC applications with 1, 2 or 3 GPUs. The newly added support for a third GPU provides more compute power as well as larger total GPU memory. Many applications work best when their data fits in GPU memory, and having the third GPU allows the R740 to fit larger problems.

    References:

    PowerEdge R740 Technical Guide: http://i.dell.com/sites/doccontent/shared-content/data-sheets/en/Documents/PowerEdge_R740_R740xd_Technical_Guide.pdf

     

     

  • General HPC

    Performance of LS-DYNA on Singularity Containers

    Authors: Nirmala Sundararajan, Joshua Weage, Nishanth Dandapanthula

    HPC Innovation Lab, February 2018

     

    Overview

    We often look at containers today and see the potential of accelerated application delivery, scaling and portability and wonder how we ever got by without them. This blog discusses the performance of LSTC LS-DYNA® within Singularity containers and on bare metal. This blog is the third in the series of blogs regarding container technology by the HPC Engineering team. The first blog Containers, Docker, Virtual Machines and HPC explored the use of containers. The second blog Containerizing HPC Applications with Singularity provided an introduction to Singularity and discussed the challenges of using Singularity in HPC environments. This third blog will focus on determining if there is a performance penalty while running the application (LS-DYNA) in a containerized environment.

    Containerizing LS-DYNA using Singularity

    An application specific container is a lightweight bundle of an application and its dependencies. The primary advantages of an application container are its portability and reproducibility. When we started assessing what type of workloads/applications could be containerized using singularity, interestingly enough, the first application that we tried to containerize using singularity was Fluent®, which presented a constraint. Since Fluent bundles MPI libraries, mpirun has to be invoked from within the container, instead of calling mpirun from outside the container. Adoption of containers is difficult in this scenario and requires an sshd wrapper. So we shifted gears, and started working on LS-DYNA  which is a general-purpose finite element analysis (FEA) program, capable of simulating complex real world problems. LS-DYNA consists of a single executable file and is entirely command line driven. Therefore, all that is required to run LS-DYNA is a command line shell, the executable, an input file, and enough disk space to run the calculation. It is used to simulate a whole range of different engineering problems using its fully automated analysis capabilities. LS-DYNA is used worldwide in multiple engineering disciplines such as automobile, aerospace, construction, military, manufacturing, and bioengineering industries. It is robust and has worked extremely well over the past 20 years in numerous applications such as crashworthiness, drop testing and occupant safety.

    The first step in creating the container for LS-DYNA is to have a minimal operating system (CentOS 7.3 in this case), basic tools to run the application, and support for InfiniBand within the container. With this, the container gets a runtime environment, system tools, and libraries. The next step is to install the application binaries within the container. The definition file used to create the LS-DYNA container is given below. In this file, the bootstrap line references the kind of base to be used, and a number of options are available: “shub” pulls images hosted on Singularity Hub, “docker” pulls images from Docker Hub, and here “yum” is used to install CentOS 7.

    Definition File:

    BootStrap: yum

    OSVersion: 7

    MirrorURL: http://vault.centos.org/7.3.1611/os/x86_64/

    Include: yum

     

    %post

    # basic-tools and dev-tools

    yum -y install evince wget vim zip unzip gzip tar perl

    yum -y groupinstall "Development tools" --nogpgcheck

     

    # InfiniBand drivers

    yum -y --setopt=group_package_types=optional,default,mandatory groupinstall "InfiniBand Support"

    yum -y install rdma
    yum -y install libibverbs-devel libsysfs-devel

    yum -y install infinipath-psm

     

    # Platform MPI

    mkdir -p /home/inside/platform_mpi

    chmod -R 777 /home/inside/platform_mpi

    cd /home/inside/platform_mpi

    /usr/bin/wget http://192.168.41.41/platform_mpi-09.1.0.1isv.x64.bin

    chmod 777 platform_mpi-09.1.0.1isv.x64.bin

    ./platform_mpi-09.1.0.1isv.x64.bin -installdir=/home/inside/platform_mpi -silent

    rm platform_mpi-09.1.0.1isv.x64.bin

     

    # Application

    mkdir -p /home/inside/lsdyna/code

    chmod -R 777 /home/inside/lsdyna/code

    cd /home/inside/lsdyna/code

    /usr/bin/wget http://192.168.41.41/ls-dyna_mpp_s_r9_1_113698_x64_redhat54_ifort131_avx2_platformmpi.tar.gz

    tar -xvf ls-dyna_mpp_s_r9_1_113698_x64_redhat54_ifort131_avx2_platformmpi.tar.gz

    rm ls-dyna_mpp_s_r9_1_113698_x64_redhat54_ifort131_avx2_platformmpi.tar.gz

     

    %environment

    PATH=/home/inside/platform_mpi/bin:$PATH

    LSTC_LICENSE=network

    LSTC_MEMORY=auto

    export PATH LSTC_LICENSE LSTC_MEMORY

    %runscript

    exec ls /

    It is quick and easy to build a ready-to-use Singularity image using the above definition file. In addition to the environment variables specified in the definition file, the value of any other variable, such as LSTC_LICENSE_SERVER, can be set inside the container by using the “SINGULARITYENV_” prefix. Singularity adopts a hybrid model for MPI support, and mpirun is called from outside the container as follows:

    /home/nirmala/platform_mpi/bin/mpirun -env_inherit -np 80 -hostfile hostfile singularity exec completelsdyna.simg /home/inside/lsdyna/code/ls-dyna_mpp_s_r9_1_113698_x64_redhat54_ifort131_avx2_platformmpi i=Caravan-V03c-2400k-main-shell16-120ms.k memory=600m memory2=60m endtime=0.02 
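    For completeness, a minimal sketch of building the image from the definition file above and injecting an additional license variable into the container at runtime (the definition file name and license server hostname are illustrative):

    # Build the image from the definition file (Singularity 2.4 'build' syntax)
    sudo singularity build completelsdyna.simg lsdyna.def

    # Any variable prefixed with SINGULARITYENV_ is exported inside the container at runtime
    export SINGULARITYENV_LSTC_LICENSE_SERVER=license-server.example.com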

     

    LS_DYNA car2car Benchmark Test

    The car2car benchmark presented here is a simulation of a two vehicle collision. The performance results for LS-DYNA are presented by using the Elapsed Time metric. This metric is the total elapsed time for the simulation in seconds as reported by LS-DYNA. A lower value represents better performance. The performance tests were conducted on two clusters in the HPC Innovation Lab, one with Intel® Xeon® Scalable Processor Family processors (Skylake) and another with Intel® Xeon® E5-2600 v4 processors (Broadwell). The software installed on the Skylake cluster is shown in Table 1

    The configuration details of the Skylake cluster are shown below in Table 2.

    Figure 1 shows that there is no perceptible performance loss while running LS-DYNA inside a containerized environment, both on a single node and across the four-node cluster. The results for the bare-metal and container tests are within 2% of each other, which is within the expected run-to-run variability of the application itself.

     

    Figure 1 Performance inside singularity containers relative to bare metal on Skylake Cluster.

    The software installed on the Broadwell cluster is shown in Table 3.

    The configuration details of the Broadwell cluster are shown below in Table 4 .

    Figure 2 shows that there is no significant performance penalty while running LS-DYNA car2car inside Singularity containers on the Broadwell cluster. The performance delta is within 2%.

    Figure 2 Performance inside singularity containers relative to bare metal on Broadwell Cluster

    Conclusion and Future Work

    In this blog, we discussed how the performance of LS-DYNA within Singularity containers is almost on par with running the application on bare-metal servers. The performance difference while running LS-DYNA within Singularity containers remains within 2%, which is within the run-to-run variability of the application itself. The Dell EMC HPC team will focus on containerizing other applications, and this blog series will be continued. So stay tuned for the next blog in this series!


  • HPC Storage and File Systems

    DELL EMC ISILON F800 AND H600 PERFORMANCE EVALUATION

    Two white papers were recently published, presenting comprehensive performance studies of the Dell EMC Isilon F800 and H600 storage systems.

    “DELL EMC ISILON F800 AND H600 I/O PERFORMANCE” describes sequential and random I/O performance results for Dell EMC Isilon F800 and H600 node types. The data is intended to inform administrators on the suitability of Isilon storage clusters for various HPC workflows.

    “DELL EMC ISILON F800 AND H600 WHOLE GENOME ANALYSIS PERFORMANCE” describes whole genome analysis performance results for Dell EMC Isilon F800 and H600 storage clusters (4 Isilon nodes per cluster). The data is intended to inform administrators on the suitability of Isilon storage clusters for high performance genomic analysis.

  • General HPC

    Design Principles for HPC

    Dell EMC HPC Innovation Lab, February 2018.

    HPC system configuration can be a complex task, especially at scale, requiring a balance between user requirements, performance targets, data center power, cooling and space constraints, and pricing. Furthermore, the many choices among competing technologies complicate configuration decisions. The document below describes a modular approach to HPC system design at scale, where sub-systems are broken down into modules that integrate well together. The design and options for each module are backed by measured results, including application performance. Data center considerations are taken into account during the design phase. Enterprise-class services, including deployment, management and support, are available for the whole HPC system.

  • Dell TechCenter

    Custom DCUI screen for DellEMC customized VMware ESXi

    This blog post is written by Amardeep Kahali and Krishnaprasad K from Dell Hypervisor Engineering team.

    This is a continuation of the blog post "DellEMC factory install changes for VMware ESXi". DellEMC 14th generation servers shipped from the factory with Dell customized VMware ESXi pre-installed have the service tag set as the root user password by default.

    From October 2017 onwards, DellEMC modified the default DCUI (Direct Console User Interface) screen to proactively communicate this change. A sample screenshot is shown below.

    NOTE: Dell 13th generation servers shipped from the factory with Dell customized VMware ESXi pre-installed continue to have no password set for the root user.
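    If the service tag is not easily readable from the chassis, it can be queried from the iDRAC. A hedged example using remote RACADM (the iDRAC address and credentials are placeholders):

    # The service tag returned here is the default ESXi root password on 14G factory installs
    racadm -r <idrac-ip-address> -u <idrac-user> -p <idrac-password> getsvctag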