Latest Blog Posts
  • General HPC

    NAMD Performance Analysis on Skylake Architecture

    Author: Joseph Stanfield

    The purpose of this blog is to provide a comparative performance analysis of the Intel® Xeon® Gold 6150 processor and the previous generation Xeon® E5-2697 v4 processor using the NAMD benchmark. The Xeon® Gold 6150 CPU features 18 physical cores, or 36 logical cores when Hyper-Threading is enabled. This processor is based on Intel’s new micro-architecture codenamed “Skylake”. Intel significantly increased the L2 cache per core from 256 KB on Broadwell to 1 MB on Skylake. The 6150 also touts 24.75 MB of L3 cache and a six-channel DDR4 memory interface.

     

    Nanoscale Molecular Dynamics (NAMD) is an application developed using the Charm++ parallel programming model for molecular dynamics simulation. It is popular due to its parallel efficiency, scalability, and the ability to simulate millions of atoms.

    Test Cluster Configurations:

     

                                 Dell EMC PowerEdge C6420             Dell EMC PowerEdge C6320
    CPU                          2x Xeon® Gold 6150 18c 2.7 GHz       2x Xeon® E5-2697 v4 16c 2.3 GHz
                                 (Skylake)                            (Broadwell)
    RAM                          12x 16GB @2666 MHz                   8x 16GB @2400 MHz
    HDD                          1 TB SATA                            1 TB SATA
    OS                           RHEL 7.3                             RHEL 7.3
    InfiniBand                   EDR ConnectX-4                       EDR ConnectX-4
    CHARM++                      6.7.1
    NAMD                         2.12_Source

     
    BIOS Settings

    BIOS Options                 Settings
    System Profile               Performance Optimized
    Logical Processor            Disabled
    Virtualization Technology    Disabled


    The benchmark dataset selected for this series of tests was the Satellite Tobacco Mosaic Virus, or STMV. STMV contains 1,066,628 atoms, which makes it ideal for demonstrating scaling to large clustered environments. Performance is measured in nanoseconds per day (ns/day), the amount of simulated time that can be computed in one day of wall-clock time. A larger value indicates faster performance.
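
    The ns/day figure can be read straight from the NAMD log. The sketch below is a hypothetical launch, assuming a Charm++/verbs build of NAMD 2.12 started with charmrun (the node list, core count, and file paths are placeholders) and NAMD's usual "Info: Benchmark time: ... days/ns ..." log lines:

        # Hypothetical 8-node (288-core) STMV run; adjust +p, the node list and the
        # binary path to match the local build.
        charmrun +p288 ++nodelist ./nodelist ./namd2 stmv/stmv.namd > stmv_8node.log

        # NAMD reports days/ns in its benchmark lines; ns/day is simply the inverse.
        grep "Benchmark time" stmv_8node.log | \
            awk '{for (i = 2; i <= NF; i++) if ($i == "days/ns") printf "%.2f ns/day\n", 1/$(i-1)}'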

     

    The first series of benchmark tests was conducted to measure CPU performance. The test environment consisted of a single node, two nodes, four nodes, and eight nodes, with the NAMD STMV dataset run three times for each configuration. The interconnect between nodes was EDR InfiniBand, as noted in the table above. Average results from a single node showed 0.70 ns/day, while a two-node run increased performance by 80% to 1.25 ns/day. This trend of roughly an 80% performance increase for each doubling of the node count remained consistent as the environment was scaled to eight nodes, as seen in Figure 1.

    Figure 1.

     

    The second series of benchmarks was run to compare the Xeon® Gold 6150 against the previous generation Xeon® E5-2697 v4. The same dataset, STMV, was used for both benchmark environments. As shown in Figure 2, the Xeon® Gold CPU results surpass the Xeon® E5 v4 by 111% on a single node, and the relative performance advantage decreases to 63% at eight nodes.

     


    Figure 2.

     

     

    Summary

    In this blog, we analyzed and presented the performance of a Dell EMC PowerEdge C6420 cluster scaling from a single node to eight nodes running NAMD with the STMV dataset. Results show that NAMD performance scales nearly linearly with the increased number of nodes.

    At the time of publishing this blog, there is an issue with Intel Parallel Studio 2017.x and NAMD compilation. Intel recommends using Parallel Studio 2016.4 or 2018 (which is still in beta) with -xCORE-AVX512 set in the FLOATOPTS variable for best performance.
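
    For reference, a minimal build sketch along those lines (in the spirit of the Intel NAMD recipe linked under Resources) is shown below. It assumes the stock NAMD 2.12 source tree, a Charm++ 6.7.1 verbs build for InfiniBand, and the Intel compiler arch file that exposes its floating-point flags through the FLOATOPTS line; directory and arch names may differ locally:

        # Inside the NAMD_2.12_Source tree, after building Charm++ 6.7.1 for verbs:
        sed -i 's/^FLOATOPTS.*/FLOATOPTS = -ip -xCORE-AVX512/' arch/Linux-x86_64-icc.arch
        ./config Linux-x86_64-icc --charm-arch verbs-linux-x86_64-icc
        cd Linux-x86_64-icc && make -j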

    A comparative analysis was also conducted with the previous generation Dell EMC PowerEdge C6320 server and Xeon® E5 v4 (Broadwell) processor. The Xeon® Gold outperformed the E5 v4 by 111% on a single node and maintained a near-linear performance increase as the cluster was scaled out to more nodes.


    Resources

    Intel NAMD Recipe: https://software.intel.com/en-us/articles/building-namd-on-intel-xeon-and-intel-xeon-phi-processor

    Intel Fabric Tuning and Application Performance: https://www.intel.com/content/www/us/en/high-performance-computing-fabrics/omni-path-architecture-application-performance-mpi.html

  • Dell TechCenter

    Simplified Rack Scale Management

    Ed Bailey – Distinguished Eng., ESI Architecture, Dell EMC

    “What if you could manage more devices, with fewer interfaces?”

    In today’s large scale environments you are always managing more and more devices – more compute, more storage, more networking. As your infrastructure scales, management becomes increasingly complicated, time consuming, and expensive.

    You need a new approach; one that simplifies your operations, your procurement, your service and your management. Buying integrated racks is key, but also critical is the ability to manage at a higher level – at rack scale. Rack scale management treats the entire rack as the unit of management, enabling faster scaling and more efficient resource utilization.

    The Dell EMC DSS 9000 rack scale solution is unique in that it delivers both the extreme configuration flexibility large scale infrastructures require for a wide range of workloads and the simplified management and operations that improve their bottom-line efficiency.  Part of how it does that is by making all aspects of rack scale management easy:

    • Easy to Purchase – A consistent architecture and components streamline scale-out procurement
    • Easy to Optimize – Highly flexible configuration options simplify multiple workload optimization
    • Easy to Deploy – Complete racks are delivered pre-configured, pre-integrated & fully tested
    • Easy to Manage – A single, open rack scale interface addresses the entire infrastructure
    • Easy to Scale – You can add fully configured sleds or full racks in a single step
    • Easy to Service – Modular components, cold aisle service, proven global service network

    The rest of this blog describes in more detail how the rack scale approach can help you address the most pressing large scale infrastructure management challenges.

    Simplicity
    As part of a simplified rack scale solution, the DSS 9000 offers a single interface to manage all the compute and storage devices in a rack. This rack management interface is based on the industry-accepted open Redfish APIs. With a pre-integrated DSS 9000, you roll the rack into the data center, plug it in, and then begin provisioning the system - talking Redfish to a single point of management. Immediately you know everything that is in the rack – and immediately have access to it all. You understand what needs to be provisioned and can issue commands to each device. You can provision that rack as easily and quickly as possible.
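
    As a rough illustration of what “talking Redfish” looks like, the standard DMTF Redfish entry points can be walked with nothing more than curl. The commands below are a hypothetical sketch: the rack manager address and credentials are placeholders, and the exact resource tree exposed by the DSS 9000 Rack Manager should be confirmed against its documentation.

        # Enumerate the compute systems visible behind the single rack-level endpoint.
        curl -sk -u admin:password https://rack-manager.example.com/redfish/v1/Systems

        # Drill into one member returned above to inventory or provision it.
        curl -sk -u admin:password https://rack-manager.example.com/redfish/v1/Systems/<SystemId>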

    Another aspect of the DSS 9000 solution is its powerful management infrastructure, with a gigabit management network that is independent of the data network and a Rack Manager module that consolidates rack-wide communication and vastly simplifies cabling. Instead of needing to communicate to each individual compute or storage node – you just talk to the Rack Manager, and instead of connecting to all the nodes individually, connections are consolidated at the block level and cabling is reduced.

    Capability
    DSS 9000 rack management gives you more capability for configuration and provisioning of the rack. The DSS 9000 implements Intel ® Rack Scale Design (RSD) Pooled System Management Engine (PSME) APIs and firmware that enables the comprehensive inventory of all devices in the rack. This gives you the ability to manage with much greater efficiency using the Redfish APIs. For example, instead of performing firmware updates to each device in the rack individually, rack management allows you to perform firmware updates to all the devices in the rack at once.

    Other significant rack management capabilities include power-capping of blocks and nodes to improve energy efficiency, as well as monitoring and control of cooling fans.  Both of these capabilities can be administered at the block level to provide greater granularity of control.

    Efficiency
    Managing at the rack level also delivers efficiency by allowing you to define a full rack configuration that meets your infrastructure’s needs and employ it as the “unit of purchase” or the “unit of deployment”. This tremendously simplifies ongoing operational procurement and deployments. Improving time-to-value in this way – by reducing the time it takes to define, configure, order and deploy incremental infrastructure – provides another level of cost savings for you. At the same time, it has the added benefit of accelerating your organization’s responsiveness in terms of delivery of services and improving reliability.

    The modular block infrastructure of the DSS 9000 also delivers higher efficiency. You can define workload-specific configurations for compute or storage nodes that can be available for rapid scaling of your infrastructure as demand increases. For example, some nodes in your infrastructure may be optimally configured for Hadoop workload performance with specific processors, memory capacity and storage. When the need to scale arises, an identically preconfigured node can quickly and easily be added to the rack.  Procurement and deployment are streamlined and management of the new node is consistent… and instantaneous. The newly introduced node is automatically inventoried and immediately accessible for management commands.

    Conclusion
    Simplicity, capability, efficiency – DSS 9000 rack management delivers answers to the challenges of administering IT infrastructure at massive scale. Inquiries about ESI Rack Scale solutions can be made at ESI@dell.com.

     

  • Dell TechCenter

    Which OpenManage solution is best for you? Ask the Dell Systems Management Advisor.

    Dell provides a vast array of Enterprise Systems Management solutions for many different IT needs and use cases. With so many useful options available, it's not always immediately obvious which Dell OpenManage solution will work best for you based on your environment and requirements, whether you need to deploy, update, monitor, or maintain systems.

    Luckily, Dell recently created an advisor tool that will recommend which OpenManage products will work best for you based on (but not limited to) factors in your environment such as:

    • Functionality you wish to implement (Monitor / Manage / Deploy / Maintain)
    • Size and brand mix of your server environment
    • Features of your Dell servers
    • Mix of physical vs virtual hosts
    • Current Dell and 3rd-party systems management tools you utilize
    • Need for one-to-one or one-to-many tools

     

    Once you complete a short questionnaire, the advisor will suggest the OpenManage tools that best suit your needs and provide useful information and links so that you can learn more.

    Dell Systems Management Advisor

    In addition to the Systems Management Advisor, Dell TechCenter provides a wealth of information if you would like to evaluate our Systems Management technologies. Please visit our additional Dell OpenManage links:

     
     
  • vWorkspace - Blog

    What's new for vWorkspace - July 2017

    Updated monthly, this publication provides you with new and recently revised information and is organized into the following categories: Documentation, Notifications, Patches, Product Life Cycle, Release, and Knowledge Base Articles.

    Subscribe to the RSS (Use IE only)

     

    Knowledgebase Articles

    New 

    None at this time

     

    Revised

    178358 - Windows 10 Support

    Microsoft Windows 10 is now supported from vWorkspace 8.6.2 onwards, with the exception of the Creators Update.

    Revised: July 13, 2017

     

    182253 - Poor performance with using EOP USB to redirect USB drive

    When using EOP USB to redirect a USB drive or memory stick to a VDI session, redirection is sporadic and the data transfer rate is poor.

    Revised: July 21, 2017

     

    Product Life Cycle - vWorkspace

    Revised: July 27, 2017

  • Dell TechCenter

    DellEMC PowerEdge 14G Servers certified for VMware ESXi

    This blog is written by Murugan Sekar & Revathi from the Dell Hypervisor Engineering Team.

    DellEMC has introduced the next generation (14G) of PowerEdge servers, which support the Intel Xeon Processor Scalable Family (Skylake-SP). This blog highlights 14G DellEMC PowerEdge server features related to VMware ESXi.

    • As part of the initial release, the following servers have been launched and certified with ESXi6.5 & ESXi6.0U3. Refer to the VMware HCL for more details.
      • R940, R740xd, R740, R640, C6420

    • The above listed servers are certified with the Trusted Boot (TXT) feature for ESXi6.5 & ESXi6.0U3 in both BIOS & UEFI boot modes. On these servers, DellEMC offers TPM as a plug-in module solution which supports:
      • TPM1.2
      • TPM2.0
      • TPM2.0(China NationZ)

        Note: Trusted Platform Module (TPM) 2.0 is not supported in ESXi6.5 & ESXi6.0U3. Only TPM 1.2 is supported in the current releases of VMware ESXi.

    • VMware supports UEFI Secure Boot from ESXi6.5 onwards. This feature is certified on R940, R740, R740xd, R640 & C6420. Refer to the white paper for details on UEFI Secure Boot.
    • GPU Passthrough (vDGA) is a graphics acceleration feature offered by VMware. Currently this feature is certified for the following GPUs on the R740 server with ESXi6.5 & ESXi6.0U3:
      • Tesla M60
      • AMD S7150
      • AMD S7150x2

       Note: Ensure that Memory Mapped I/O Base is set to 12TB under BIOS settings to power on Windows VMs with GPU pass-through. Refer to the link for details.

    • 14G servers introduce a new IDSDM module which combines IDSDM and/or vFlash into a single module. This module supports only microSD cards. Currently DellEMC offers 16/32/64GB microSD cards for IDSDM, which are certified for VMware ESXi. Refer to this link for the list of certified SD cards for these servers.

    • NVDIMM-N is not supported in ESXi6.5 & ESXi6.0U3.

  • Dell TechCenter

    Dell announces PowerEdge VRTX support for VMware ESXi 6.5

    This blog post is written by Thiru Navukkarasu and Krishnaprasad K from Dell Hypervisor Engineering. 

    Dell PowerEdge VRTX was not supported on the VMware ESXi 6.5 branch until now. Dell has announced support for VRTX from the Dell customized version of ESXi 6.5 A04 onwards. From VMware ESXi 6.5 onwards, the Shared PERC8 controller in VRTX uses the dell_shared_perc8 native driver instead of the megaraid_sas vmklinux driver used in the ESXi 6.0.x branch.

    You may look at the following command outputs in ESXi to verify if you have the supported image installed on PowerEdge VRTX blades. 

    ~] vmware -lv

    VMware ESXi 6.5.0 build-5310538

    VMware ESXi 6.5.0 GA

     ~] cat /etc/vmware/oem.xml

    You are running DellEMC Customized Image ESXi 6.5 A04 (based on ESXi VMKernel Release Build 5310538)

    ~] esxcli storage core adapter list

    HBA Name  Driver             Link State  UID                   Capabilities  Description

    --------  -----------------  ----------  --------------------  ------------  ----------------------------------------------------------

    vmhba3    dell_shared_perc8  link-n/a    sas.0                               (0000:0a:00.0) LSI / Symbios Logic Shared PERC 8 Mini

    vmhba4    dell_shared_perc8  link-n/a    sas.c000016000c00                   (0000:15:00.0) LSI / Symbios Logic Shared PERC 8 Mini

     References

  • General HPC

    LAMMPS Four Node Comparative Performance Analysis on Skylake Processors

    Author: Joseph Stanfield
     

    The purpose of this blog is to provide a comparative performance analysis of the Intel® Xeon® Gold 6150 processor (architecture code named “Skylake”) and the previous generation Xeon® E5-2697 v4 processor using the LAMMPS benchmark. The Xeon® Gold 6150 CPU features 18 physical cores, or 36 logical cores when Hyper-Threading is enabled. Intel significantly increased the L2 cache per core from 256 KB on previous generations of Xeon to 1 MB. The new processor also touts 24.75 MB of L3 cache and a six-channel DDR4 memory interface.

    LAMMPS, or Large Scale Atom/Molecular Massively Parallel Simulator, is an open-source molecular dynamics program originally developed by Sandia National Laboratories, Temple University, and the United States Department of Energy. The main function of LAMMPS is to model particles in a gaseous, liquid, or solid state.

     

    Test cluster configuration

     

                                 Dell EMC PowerEdge C6420             Dell EMC PowerEdge C6320
    CPU                          2x Xeon® Gold 6150 18c 2.7 GHz       2x Xeon® E5-2697 v4 16c 2.3 GHz
                                 (Skylake)                            (Broadwell)
    RAM                          12x 16GB @2666 MT/s                  8x 16GB @2400 MT/s
    HDD                          1 TB SATA                            1 TB SATA
    OS                           RHEL 7.3                             RHEL 7.3
    InfiniBand                   EDR ConnectX-4                       EDR ConnectX-4

    BIOS Settings

    BIOS Options                 Settings
    System Profile               Performance Optimized
    Logical Processor            Disabled
    Virtualization Technology    Disabled


    The LAMMPS version used for testing was the lammps-6June-17 release. The in.eam dataset was used for the analysis on both configurations. In.eam simulates a metallic solid, using a Cu EAM potential with a 4.95 Angstrom cutoff (45 neighbors per atom) and NVE integration. The simulation was executed for 100 steps with 32,000 atoms. The first series of benchmarks was conducted to measure performance in units of timesteps/s. The test environment consisted of four servers interconnected with InfiniBand EDR, and tests were run on a single node, two nodes, and four nodes, three times for each configuration. Average results from a single node showed 106 timesteps per second, while a two-node run roughly doubled performance at 216 timesteps per second. This trend remained consistent as the environment was scaled to four nodes, as seen in Figure 1.
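
    For readers who want to reproduce a run of this kind, a hypothetical Intel MPI launch of the bundled benchmark input is sketched below; the executable name, host file, and core count are placeholders that depend on the local build, and the achieved timesteps/s is reported in the "Performance:" summary of the LAMMPS log.

        # Hypothetical 4-node (144-core) run of the stock in.eam benchmark.
        mpirun -np 144 -hostfile ./hosts ./lmp_intel_cpu_intelmpi -in bench/in.eam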

     

    Figure 1.

    The second series of benchmarks was run to compare the Xeon® Gold against the previous generation Xeon® E5 v4. The same dataset, in.eam, was used with 32,000 atoms and 100 steps per run. As shown in Figure 2, the Xeon® Gold CPU outperforms the Xeon® E5 v4 by about 120% in each test, though the performance advantage drops slightly as the cluster is scaled.


    Figure 2.

    Conclusion

    In this blog, we analyzed and presented the performance of a Dell EMC PowerEdge C6420 cluster scaling from a single node to four nodes running the LAMMPS benchmark. Results show that performance of LAMMPS scales linearly with the increased number of nodes.

     

    A comparative analysis was also conducted with the previous generation Dell EMC PowerEdge C6320 server with the Xeon® E5 v4 (Broadwell) processor. As with the first test, near-linear scaling with increasing node count was also observed on the Xeon® E5 v4, similar to the Xeon® Gold, but the Xeon® Gold processor outperformed the previous generation CPU by about 120% in each run.


    Resources

     

     

  • Dell TechCenter

    Dell EMC Unity 4.2 Release

    Blog author: Chuck Armstrong, Dell EMC Storage Engineering

     

    Dell EMC has just released a new Dell EMC Unity All-Flash portfolio: the 350F, 450F, 550F, and 650F.

    These new all-flash arrays are based on the latest Intel® Broadwell chip. Additionally, they are loaded up with twice the memory and up to 40 percent more processor cores than previous Dell EMC Unity models. What does all of this mean for customers? Finally, a midrange all-flash storage platform built to get the most out of virtualized and mixed workloads.

    Let’s talk about workloads:

    If you plan on deploying Microsoft® SQL Server®, Exchange Server, Hyper-V®, or VMware vSphere® with one of the new Dell EMC Unity All-Flash array models, keep reading to find a trove of information.

    Microsoft SQL Server

    There are several performance considerations that need to be understood when implementing Microsoft SQL Server on Dell EMC Unity All-Flash arrays to provide a highly efficient environment for users. These considerations fall into the categories of database types, operating system settings and configuration, and the storage design (layout for the database). All of this, and more, is found in the Dell EMC Unity Storage with Microsoft SQL Server best practices paper.

    Microsoft Exchange

    Deploying Microsoft Exchange Server on Dell EMC Unity All-Flash arrays has its own set of considerations to maximize performance. One of these is the version of Exchange being deployed, as different versions have differing performance characteristics. Another is the design and layout of the database and log locations. These considerations and many more can be found in the Dell EMC Unity Storage with Microsoft Exchange Server best practices paper.

     Microsoft Hyper-V

    If your environment utilizes Microsoft Hyper-V, the Dell EMC Unity Storage with Microsoft Hyper-V paper provides important best practices. Some of the many points of interest provided include guest virtual machine storage recommendations, virtual machine placement recommendations, and thin provisioning best practices.

     VMware vSphere

    For those environments where VMware vSphere is the hypervisor, and deploying a new Dell EMC Unity All-Flash array is on the horizon, the Dell EMC Unity Storage with VMware vSphere best practices paper provides vital information to get the job done. Some items of interest found in this paper are: getting the most out of multipathing, configuring datastores (Fibre Channel, iSCSI, NFS, and VVol), and determining where to thin provision: within vSphere, within the storage, or both.

      

     


    All of the best practices papers mentioned also provide information about several of the features available on the Dell EMC Unity All-Flash arrays. Additional information on features, and for a more general best practices guide, please check out the Dell EMC Unity Best Practices Guide.

  • Dell TechCenter

    Capability for disabling TLS1.0 on iDRAC6 in 11th generation of PowerEdge Servers

    iDRAC6, the Dell Remote Access Controller in the 11th generation of PowerEdge Servers, supports TLS version 1.0, TLS version 1.1, and TLS version 1.2 (cryptographic protocols designed to provide communications security over a computer network). Starting with firmware version 2.90 for Monolithic and version 3.85 for Modular, we have added the capability to optionally disable TLS 1.0 in iDRAC6. This facilitates running the system in a highly secured environment, given the known security vulnerabilities in TLS 1.0.

    TLS 1.0, along with SSL 3.0, is known to expose the system to the following security vulnerabilities:

    1. POODLE, a vulnerability that could allow attackers to intercept and decrypt traffic between a user's browser and an SSL-secured website.

    2. BEAST, an attack in which an adversary can decrypt data exchanged between the two parties by exploiting a weakness in the implementation of the Cipher Block Chaining (CBC) mode in TLS 1.0, allowing a chosen-plaintext attack.

    Disabling TLS 1.0 gives users the option to run the system with TLS 1.1 and above, thereby isolating the system from the above-mentioned vulnerabilities.

    The capability to enable/disable TLS 1.0 is supported only through the iDRAC6 command-line interface, RACADM. By default, TLS 1.0 is enabled.
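
    Once TLS 1.0 has been disabled, the change can be verified from a management workstation with a standard OpenSSL client. In the hypothetical check below (the iDRAC address is a placeholder), the first handshake should be rejected while the second completes:

        # Should now fail: force a TLS 1.0 handshake against the iDRAC6 web server.
        openssl s_client -connect 192.168.0.120:443 -tls1 < /dev/null

        # Should succeed: TLS 1.1 remains enabled.
        openssl s_client -connect 192.168.0.120:443 -tls1_1 < /dev/null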

     Limitations of disabling TLS 1.0:

    • Certain versions of Windows OS may not support TLS1.1 and above by default. On such systems WSMan access to iDRAC6 may not work seamlessly. 

    More details, and the patches from Microsoft that allow certain OS versions to work with TLS 1.1 and above, can be found at:

    https://blogs.msdn.microsoft.com/kaushal/2011/10/02/support-for-ssltls-protocols-on-windows/

    https://support.microsoft.com/en-us/help/3140245/update-to-enable-tls-1-1-and-tls-1-2-as-a-default-secure-protocols-in

  • General HPC

    BIOS characterization for HPC with Intel Skylake processor

    Ashish Kumar Singh. Dell EMC HPC Innovation Lab. Aug 2017

    This blog discusses the impact of the different BIOS tuning options available on Dell EMC 14th generation PowerEdge servers with the Intel Xeon® Processor Scalable Family (architecture codenamed “Skylake”) for some HPC benchmarks and applications. A brief description of the Skylake processor, BIOS options and HPC applications is provided below.  

    Skylake is a new 14nm “tock” processor in the Intel “tick-tock” series, which has the same process technology as the previous generation but with a new microarchitecture. Skylake requires a new CPU socket that is available with the Dell EMC 14th Generation PowerEdge servers. Skylake processors are available in two different configurations, with an integrated Omni-Path fabric and without fabric. The Omni-Path fabric supports network bandwidth up to 100Gb/s. The Skylake processor supports up to 28 cores, six DDR4 memory channels with speed up to 2666MT/s, and additional vectorization power with the AVX512 instruction set. Intel also introduces a new cache coherent interconnect named “Ultra Path Interconnect” (UPI), replacing Intel® QPI, that connects multiple CPU sockets.

    Skylake offers a new, more powerful AVX512 vectorization technology that provides 512-bit vectors. The Skylake CPUs include models that support two 512-bit Fuse-Multiply-Add (FMA) units to deliver 32 Double Precision (DP) FLOPS/cycle and models with a single 512-bit FMA unit that is capable of 16 DP FLOPS/cycle. More details on AVX512 are described in the Intel programming reference. With 32 FLOPS/cycle, Skylake doubles the compute capability of the previous generation, Intel Xeon E5-2600 v4 processors (“Broadwell”).

    Skylake processors are supported in the Dell EMC PowerEdge 14th Generation servers. The new processor architecture allows different tuning knobs, which are exposed in the server BIOS menu. In addition to existing options for performance and power management, the new servers also introduce a clustering mode called Sub NUMA clustering (SNC). On CPU models that support SNC, enabling SNC is akin to splitting the single socket into two NUMA domains, each with half the physical cores and half the memory of the socket. If this sounds familiar, it is similar in utility to the Cluster-on-Die option that was available in E5-2600 v3 and v4 processors as described here. SNC is implemented differently from COD, and these changes improve remote socket access in Skylake when compared to the previous generation. At the Operating System level, a dual socket server with SNC enabled will display four NUMA domains. Two of the domains will be closer to each other (on the same socket), and the other two will be a larger distance away, across the UPI to the remote socket. This can be seen using OS tools like numactl -H.

    In this study, we have used the Performance and PerformancePerWattDAPC system profiles based on our earlier experiences with other system profiles for HPC workloads. The Performance Profile aims to optimize for pure performance. The DAPC profile aims to balance performance with energy efficiency concerns. Both of these system profiles are meta options that, in turn, set multiple performance and power management focused BIOS options like Turbo mode, Cstates, C1E, Pstate management, Uncore frequency, etc.

    We have used two HPC benchmarks and two HPC applications to understand the behavior of SNC and System Profile BIOS options with Dell EMC PowerEdge 14th generation servers. This study was performed with a single server only; cluster level performance deltas will be bounded by these single server results. The server configuration used for this study is described below.    

    Testbed configuration:

    Table 1: Test configuration of new 14G server

    Components                                          Details

    Server                                                     PowerEdge C6420 

    Processor                                               2 x Intel Xeon Gold 6150 – 2.7GHz, 18c, 165W

    Memory                                                  192GB (12 x 16GB) DDR4 @2666MT/s

    Hard drive                                              1 x 1TB SATA HDD, 7.2k rpm

    Operating System                                   Red Hat Enterprise Linux-7.3 (kernel - 3.10.0-514.el7.x86_64)

    MPI                                                         Intel® MPI 2017 update4

    MKL                                                        Intel® MKL 2017.0.3

    Compiler                                                 Intel® compiler 17.0.4

    Table 2: HPC benchmarks and applications

    Application                                Version                                               Benchmark

    HPL                                             From Intel® MKL                                Problem size - 92% of total memory

    STREAM                                      v5.04                                                  Triad

    WRF                                            3.8.1                                                  conus2.5km

    ANSYS Fluent                              v17.2                                                  truck_poly_14m, Ice_2m, combustor_12m

     

    Sub-NUMA cluster

    As described above, a system with SNC enabled will expose four NUMA nodes to the OS on a two-socket PowerEdge server. Each NUMA node can communicate with three remote NUMA nodes: two in the other socket and one within the same socket. NUMA domains on different sockets communicate over the UPI interconnect. With the 18-core Intel® Xeon Gold 6150 processor, each NUMA node has nine cores. Since both sockets are equally populated in terms of memory, each NUMA domain has one fourth of the total system memory.

                    

                                                 Figure 1: Memory bandwidth with SNC enabled

    Figure 1 plots the memory bandwidth with SNC enabled. Except SNC and logical processors, all other options are set to BIOS defaults. Full system memory bandwidth is ~195 GB/s on the two socket server. This test uses all available 36 cores for memory access and calculates aggregate memory bandwidth. The “Local socket – 18 threads” data point measures the memory bandwidth of single socket with 18 threads. As per the graph, local socket memory bandwidth is ~101 GB/s, which is about half of the full system bandwidth. By enabling SNC, a single socket is divided into two NUMA nodes. The memory bandwidth of a single SNC enabled NUMA node is noted by “Local NUMA node – 9 threads”. In this test, the nine local cores access their local memory attached to their NUMA domain. The memory bandwidth here is ~50 GB/s, which is half of the total local socket bandwidth.

    The data point “Remote to same socket” measures the memory bandwidth between two NUMA nodes that are on the same socket, with cores in one NUMA domain accessing the memory of the other NUMA domain. As per the graph, the server measures ~50 GB/s memory bandwidth for this case, the same as the “Local NUMA node – 9 threads” case. That is, with SNC enabled, memory access within the socket delivers similar bandwidth even across NUMA domains. This is a big difference from the previous generation, where there was a penalty when accessing memory on the same socket with COD enabled. See Figure 1 in the previous blog, where a 47% drop in bandwidth was observed, and compare that to the 0% performance drop here. The “Remote to other socket” test involves cores in one NUMA domain accessing the memory of a remote NUMA node on the other socket. This bandwidth is 54% lower due to non-local memory access over the UPI interconnect.
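
    The same local-versus-remote behavior can be reproduced on an SNC-enabled node with numactl and any OpenMP STREAM binary. The sketch below assumes ./stream is a local STREAM build and that NUMA nodes 0 and 1 sit on the same socket while node 2 is on the remote socket; the actual node numbering should be checked with numactl -H first.

        # Local access: 9 threads on NUMA node 0 against node 0 memory.
        OMP_NUM_THREADS=9 numactl --cpunodebind=0 --membind=0 ./stream

        # Remote access within the same socket: node 0 cores, node 1 memory.
        OMP_NUM_THREADS=9 numactl --cpunodebind=0 --membind=1 ./stream

        # Remote access across sockets over UPI: node 0 cores, node 2 memory.
        OMP_NUM_THREADS=9 numactl --cpunodebind=0 --membind=2 ./stream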

    These memory bandwidth tests are interesting, but what do they mean? Like in previous generations, SNC is a good option for codes that have high NUMA locality. Reducing the size of the NUMA domain can help some codes run faster due to fewer snoops and cache coherence checks within the domain. Additionally, the penalty for remote accesses on Skylake is not as bad as it was for Broadwell.

                                 

     

     Figure 2: Comparing Sub-NUMA clustering with DAPC

    Figure 2 shows the effect of SNC on multiple HPC workloads; note that all of these have good memory locality. All options except SNC and Hyper Threading are set to BIOS default. SNC disabled is considered as the baseline for each workload. As per Figure 2, all tests measure no more than 2% higher performance with SNC enabled. Although this is well within the run-to-run variation for these applications, SNC enabled consistently shows marginally higher performance for STREAM, WRF and Fluent for these datasets. The performance delta will vary for larger and different datasets. For many HPC clusters, this level of tuning for a few percentage points might not be worth it, especially if applications with sub-optimal memory locality will be penalized.

     

    The Dell EMC default setting for this option is “disabled”, i.e. two sockets show up as just two NUMA domains. The HPC recommendation is to leave this at disabled to accommodate multiple types of codes, including those with inefficient memory locality, and to test this on a case-by-case basis for the applications running on your cluster.

     

    System Profiles

    Figure 3 plots the impact of different system profiles on the tests in this study. For these studies, all BIOS options are default except system profiles and logical processors. The DAPC profile with SNC disabled is used as the baseline. Most of these workloads show similar performance on both Performance and DAPC system profile. Only HPL performance is higher by a few percent. As per our earlier studies, DAPC profile always consumes less power than performance profile, which makes it suitable for HPC workloads without compromising too much on performance.  

                                                                               

     Figure 3: Comparing System Profiles

    Power Consumption

    Figure 4 shows the power consumption of different system profiles with SNC enabled and disabled. The HPL benchmark is well suited to stressing the system and utilizing its maximum compute power. We measured idle and peak power consumption with the Logical Processor option set to disabled.

                       

                                                 Figure 4: Idle and peak power consumption

    As per Figure 4, DAPC Profile with SNC disabled shows the lowest idle power consumption relative to other profiles. Both Performance and DAPC system profiles consume up to ~5% lower power in idle status with SNC disabled. In idle state, Performance Profile consumes ~28% more power than DAPC.

    The peak power consumption is similar with SNC enabled and with SNC disabled. Peak power consumption in DAPC Profile is ~16% less than in Performance Profile. 

    Conclusion

    The Performance system profile is still the best profile to achieve maximum performance for HPC workloads. However, DAPC consumes noticeably less power for only a few percent lower performance, which makes DAPC the most suitable system profile for most HPC deployments.

    Reference:

    http://en.community.dell.com/techcenter/extras/m/white_papers/20444326