Latest Blog Posts
  • General HPC

    Performance study of the four-socket PowerEdge R940 server with Intel Skylake processors

    Author:    Somanath Moharana, Dell EMC HPC Innovation Lab, August 2017

    This blog explores the performance of the four-socket Dell EMC PowerEdge R940 server with Intel Skylake processors. The latest Dell EMC 14th generation servers support the new Intel® Xeon® Processor Scalable Family (processor architecture codenamed “Skylake”), and the increased number of cores and higher memory speed benefit a wide variety of HPC applications.

    The PowerEdge R940 is Dell EMC’s latest 4-socket, 3U rack server designed to run complex workloads. It supports up to 6TB of DDR4 memory and up to 122TB of storage. The system features the Intel® Xeon® Scalable Processor Family, 48 DDR4 DIMMs, up to 13 PCI Express® (PCIe) 3.0 enabled expansion slots and a choice of embedded NIC technologies. It is a general-purpose platform capable of handling demanding workloads and applications, such as data warehouses, ecommerce, databases, and high-performance computing (HPC). Its increased storage capacity also makes the PowerEdge R940 well suited for data-intensive applications that require more local storage.

    This blog also describes the impact of BIOS tuning options on HPL, STREAM and the scientific applications ANSYS Fluent and WRF, and compares the performance of the new PowerEdge R940 to the previous generation PowerEdge R930 platform. It also analyses performance with the Sub NUMA Cluster (SNC) modes (SNC=Enabled and SNC=Disabled). With SNC enabled, eight NUMA nodes are exposed to the OS on a four-socket PowerEdge R940. Each NUMA node can communicate with seven other remote NUMA nodes: six in the other three sockets and one within the same socket. NUMA domains on different sockets communicate over the UPI interconnect. Please see the BIOS characteristics of Skylake processor blog for more details on BIOS options. Table 1 lists the server configuration and the application details used for this study.

     Table 1: Details of Server and HPC Applications used for R940 analysis

    (Columns: PowerEdge R930 with Haswell-EX | PowerEdge R930 with Broadwell-EX | PowerEdge R940 with Skylake)

    Platform: PowerEdge R930 | PowerEdge R930 | PowerEdge R940
    Processor: 4 x Intel Xeon E7-8890 v3 @2.5GHz (18 cores), 45MB L3 cache, 165W, Codename=Haswell-EX | 4 x Intel Xeon E7-8890 v4 @2.2GHz (24 cores), 60MB L3 cache, 165W, Codename=Broadwell-EX | 4 x Intel Xeon Platinum 8180 @2.5GHz, 10.4GT/s (cross-bar connection), Codename=Skylake
    Memory: 1024 GB = 64 x 16GB DDR4 @1866MHz | 1024 GB = 32 x 32GB DDR4 @1866MHz | 384GB = 24 x 16GB DDR4 @2666MT/s
    CPU Interconnect: Intel QuickPath Interconnect (QPI) 8GT/s | Intel QuickPath Interconnect (QPI) 8GT/s | Intel Ultra Path Interconnect (UPI) 10.4GT/s

    BIOS Settings
    BIOS: Version 1.0.9 | Version 2.0.1 | Version 1.0.7
    Processor Settings > Logical Processors: Disabled | Disabled | Disabled
    Processor Settings > UPI Speed: Maximum Data Rate | Maximum Data Rate | Maximum Data Rate
    Processor Settings > Sub NUMA Cluster: N/A | N/A | Enabled, Disabled
    System Profiles: PerfOptimized (Performance), PerfPerWattOptimizedDapc (DAPC) | PerfOptimized (Performance), PerfPerWattOptimizedDapc (DAPC) | PerfOptimized (Performance), PerfPerWattOptimizedDapc (DAPC)

    Software and Firmware
    Operating System: Red Hat Enterprise Linux Server release 6.6 | Red Hat Enterprise Linux Server release 7.2 | Red Hat Enterprise Linux Server release 7.3 (3.10.0-514.el7.x86_64)
    Intel Compiler: Version 15.0.2 | Version 16.0.3 | Version 17.0.4
    Intel MKL: Version 11.2 | Version 11.3 | 2017 Update 3

    Benchmark and Applications
    LINPACK: V2.1 from MKL 11.3 | V2.1 from MKL 11.3 | V2.1 from MKL 2017 Update 3
    STREAM: v5.10, Array Size 1800000000, Iterations 100 | v5.10, Array Size 1800000000, Iterations 100 | v5.4, Array Size 1800000000, Iterations 100
    WRF: v3.5.1, Input Data Conus12KM, Netcdf-4.3.1 | v3.8, Input Data Conus12KM, Netcdf-4.4.0 | v3.8.1, Input Data Conus12KM and Conus2.5KM, Netcdf-4.4.2
    ANSYS Fluent: v15, Input Data: truck_poly_14m | v16, Input Data: truck_poly_14m | v17.2, Input Data: truck_poly_14m, aircraft_wing_14m, ice_2m, combustor_12m, exhaust_system_33m

    Note: The software versions were different for the older generation processors, and results are compared against what was the best configuration at that time. Given the large architectural changes in servers and processors from generation to generation, the differences in software versions are not a significant factor.

    The High Performance Linpack (HPL) Benchmark is a measure of a system's floating point computing power. It measures how fast a computer solves a dense n by n system of linear equations Ax = b, which is a common task in engineering. HPL was run with a block size of NB=384 and a problem size of N=217754. Since HPL is an AVX-512-enabled workload, we calculate the HPL theoretical maximum performance as (rated base frequency of the processor * number of cores * 32 FLOPs per cycle).
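    As a quick check of that formula, here is a minimal sketch (in Python) of the theoretical peak for the four-socket R940 in Table 1; the 28-core count for the Platinum 8180 and the 32 FLOPs per cycle for AVX-512 FMA are assumptions not stated in the table:

        # Theoretical HPL peak = base frequency * total cores * FLOPs per cycle
        sockets = 4
        cores_per_socket = 28      # Intel Xeon Platinum 8180 (assumed)
        base_freq_ghz = 2.5        # rated base frequency from Table 1
        flops_per_cycle = 32       # two AVX-512 FMA units: 2 * 8 doubles * 2 ops

        peak_gflops = sockets * cores_per_socket * base_freq_ghz * flops_per_cycle
        print(f"Theoretical peak: {peak_gflops:.0f} GFLOPS ({peak_gflops / 1000:.2f} TFLOPS)")
        # ~8960 GFLOPS; the measured HPL number (rMax) comes in lower because
        # AVX-512 code runs below the rated base frequency.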

          

                     Figure 1: Comparing HPL Performance across BIOS profiles                                        

    Figure 1 depicts the performance of the PowerEdge R940 server described in Table 1 with different BIOS options. Here the Performance profile with SNC=Disabled gives the best performance among the BIOS profiles tested. With SNC=Disabled we observe 1-2% better performance compared to SNC=Enabled across all the BIOS profiles.

    Figure 2: HPL performance with AVX2 and AVX512 instructions sets          Figure 3: HPL Performance over multiple generations of processors

    Figure 2 compares the performance of HPL run with the AVX2 and AVX-512 instruction sets on the PowerEdge R940 (AVX = Advanced Vector Extensions). AVX-512 is a set of 512-bit extensions to the 256-bit AVX SIMD instructions for the x86 instruction set architecture. The instruction set MKL uses can be selected by setting the MKL_ENABLE_INSTRUCTIONS=AVX2/AVX512 environment variable. Here we observe that running HPL with the AVX-512 instruction set gives around 75% better performance than with the AVX2 instruction set.
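    The environment-variable switch is easy to script; the sketch below (Python, with an assumed launch wrapper path) simply runs the MKL HPL binary once per instruction set so the reported GFLOPS can be compared:

        import os
        import subprocess

        def run_hpl(isa):
            """Run the MKL HPL benchmark with MKL restricted to the given ISA."""
            env = os.environ.copy()
            env["MKL_ENABLE_INSTRUCTIONS"] = isa      # "AVX2" or "AVX512"
            # runme_intel64 is the launch script shipped with the MKL HPL benchmark (assumed path)
            subprocess.run(["./runme_intel64"], env=env, check=True)

        for isa in ("AVX2", "AVX512"):
            run_hpl(isa)   # compare the GFLOPS reported for each run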

    Figure 3 compares the results of the four-socket R930 powered by Haswell-EX and Broadwell-EX processors with the R940 powered by Skylake processors. For HPL, the R940 server performed ~192% better than the R930 server with four Haswell-EX processors and ~99% better than with Broadwell-EX processors. The improvement we observed with Skylake over Broadwell-EX is due to a 27% increase in the number of cores and a 75% increase in performance from the AVX-512 vector instructions.

    The STREAM benchmark is a synthetic benchmark program that measures sustainable memory bandwidth and the corresponding computation rate for simple vector kernels. STREAM calculates the memory bandwidth by counting only the bytes that the user program requested to be loaded or stored. This study uses the results reported by the TRIAD function of the STREAM bandwidth test.
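    For reference, this is how the TRIAD number follows from the bytes the benchmark explicitly moves; a minimal sketch using the array size from Table 1 and a purely illustrative timing:

        # TRIAD kernel: a[i] = b[i] + scalar * c[i]
        # STREAM counts one store (a) and two loads (b, c) per element.
        array_size = 1_800_000_000        # elements, as configured in Table 1
        bytes_per_element = 8             # double precision
        triad_bytes = 3 * bytes_per_element * array_size

        best_time_s = 0.15                # hypothetical best time for one TRIAD pass
        bandwidth_gb_s = triad_bytes / best_time_s / 1e9
        print(f"TRIAD bandwidth: {bandwidth_gb_s:.1f} GB/s")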

    Figure 4: STREAM Performance across BIOS profiles                                Figure 5: STREAM Performance over multiple generations of processors

    As per Figure 4, with SNC=Enabled we get up to 3% better bandwidth than with SNC=Disabled across all BIOS profiles. Figure 5 shows the comparison of the memory bandwidth of the PowerEdge R930 server with Haswell-EX and Broadwell-EX processors and the PowerEdge R940 server with Skylake processors. Haswell-EX and Broadwell-EX support DDR3 and DDR4 memory respectively, while the platform in this configuration supports a memory frequency of 1600MT/s for both generations of processors. Because DIMMs of the same memory frequency were used for both generations of processors on the PowerEdge R930, Broadwell-EX and Haswell-EX show the same memory bandwidth, but there is a ~51% increase in memory bandwidth with Skylake compared to Broadwell-EX. This is due to the use of 2666MT/s RDIMMs, which give around a ~66% increase in maximum memory bandwidth compared to Broadwell-EX, and to a ~50% increase in the number of memory channels per socket: six channels per socket for Skylake versus four channels per socket for Broadwell-EX.

    Figure 6: Comparing STREAM Performance with “SNC = Enabled”                    Figure 7: Comparing STREAM Performance with “SNC = Disabled”

    Figure 6 and Figure 7 describe the impact on memory bandwidth of traversing the UPI link to go across sockets on the PowerEdge R940 server. With SNC=Enabled, the local memory bandwidth and the remote-to-same-socket memory bandwidth are nearly the same (0-1% variation), but for remote access to another socket the memory bandwidth shows a ~57% decrease compared to local memory bandwidth. With SNC=Disabled, remote memory bandwidth is 77% lower than local memory bandwidth.

    The Weather Research and Forecasting (WRF) Model is a mesoscale numerical weather prediction system designed for both atmospheric research and operational forecasting needs. It features two dynamical cores, a data assimilation system, and a software architecture facilitating parallel computation and system extensibility. The model serves a wide range of meteorological applications across scales from tens of meters to thousands of kilometers. WRF can generate atmospheric simulations using real data or idealized conditions. We used the CONUS12km and CONUS2.5km benchmark datasets for this study. 

    CONUS12km is a single-domain, small-size benchmark (a 48-hour, 12km resolution case over the Continental U.S. (CONUS) domain from October 24, 2001) with a 72-second time step. CONUS2.5km is a single-domain, large-size benchmark (the latter 3 hours of a 9-hour, 2.5km resolution case over the Continental U.S. (CONUS) domain from June 4, 2005) with a 15-second time step.

    WRF decomposes the domain into tasks or patches. Each patch can be further decomposed into tiles that are processed separately, but by default there is only one tile per run. If that single tile is too large to fit into the cache of the CPU and/or core, computation slows down because of WRF’s memory bandwidth sensitivity. To reduce the size of the tiles, the number of tiles can be increased by defining “numtile = x” in the input file or by setting the environment variable “WRF_NUM_TILES = x”, as sketched below. For both CONUS 12km and CONUS 2.5km the number of tiles was chosen for best performance.
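    A minimal sketch of the environment-variable approach (the launcher, rank count and tile count below are illustrative, not the values used in this study):

        import os
        import subprocess

        env = os.environ.copy()
        env["WRF_NUM_TILES"] = "4"        # increase until each tile fits in cache
        subprocess.run(["mpirun", "-np", "112", "./wrf.exe"], env=env, check=True)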

    Figure 8: WRF Performance across BIOS profiles (Conus12KM)              Figure 9: WRF Performance across BIOS profiles (Conus2.5KM)                                                 

    Figure 8 and Figure 9 compare the WRF datasets across BIOS profiles. For both the CONUS 12km and CONUS 2.5km data, the Performance profile with SNC=Enabled gives the best performance; for both datasets, SNC=Enabled performs ~1-2% better than SNC=Disabled. The performance difference across BIOS profiles is nearly equal for CONUS 12km because of its smaller dataset size, while with CONUS 2.5km, which has a larger dataset and utilizes the larger number of processors more efficiently, we observe 1-2% performance variation across the system profiles.

    Figure 10: Comparison over multiple generations of processors                     Figure 11: Comparison over multiple generations of processors

    Figure 10 and Figure 11 show the performance comparison between the PowerEdge R940 powered by Skylake processors and the PowerEdge R930 powered by Broadwell-EX and Haswell-EX processors. From Figure 11 we can observe that for CONUS 12km the performance of the PowerEdge R940 with Skylake is ~18% better than the PowerEdge R930 with Broadwell-EX and ~45% better than with Haswell-EX processors. For CONUS 2.5km, Skylake performs ~29% better than Broadwell-EX and ~38% better than Haswell-EX processors.

    ANSYS Fluent is a computational fluid dynamics (CFD) software tool. Fluent includes well-validated physical modeling capabilities to deliver fast and accurate results across the widest range of CFD and multiphysics applications.

    Figure 12: Ansys Fluent Performance across BIOS profiles

    We used five different datasets for our analysis: truck_poly_14m, combustor_12m, exhaust_system_33m, ice_2m and aircraft_wing_14m, with Fluent’s ‘Solver Rating’ (higher is better) as the performance metric. Figure 12 shows that all datasets performed better with the Performance profile and SNC=Enabled than with the other BIOS options. For all datasets, SNC=Enabled performs 2% to 4% better than SNC=Disabled.

      

    Figure 13: Ansys Fluent (truck_poly_14m) performance over multiple generations of Intel processors

    Figure 13 shows the performance comparison of truck_poly_14m on the PowerEdge R940 with Skylake processors and the PowerEdge R930 with Broadwell-EX and Haswell-EX processors. On the PowerEdge R940, Fluent showed 46% better performance than on the PowerEdge R930 with Broadwell-EX and 87% better performance than with Haswell-EX processors.

     Conclusion:

    The PowerEdge R940 is a highly efficient four-socket next-generation platform that provides up to 122TB of storage capacity and 6.3TF of computing power, making it well suited for data-intensive applications without sacrificing performance. The Skylake processors give the PowerEdge R940 a performance boost over the previous generation server (PowerEdge R930); we observed more than 45% performance improvement across all the applications tested.

    Comparing system profiles across this analysis, the “Performance” profile delivers better performance than the other system profiles.

    In conclusion, the PowerEdge R940 with Skylake processors is a good platform for a wide variety of applications and can meet the demand for more compute power in HPC applications.

  • General HPC

    Application Performance on P100-PCIe GPUs

    Authors: Rengan Xu, Frank Han and Nishanth Dandapanthu. Dell EMC HPC Innovation Lab. Feb 2017

    Introduction to P100-PCIe GPU

    This blog describes a performance analysis of NVIDIA® Tesla® P100™ GPUs on a cluster of Dell PowerEdge C4130 servers. There are two types of P100 GPUs: PCIe-based and SXM2-based. In PCIe-based servers, GPUs are connected by PCIe buses, and one P100 delivers around 4.7 and 9.3 TeraFLOPS of double and single precision performance, respectively. In P100-SXM2 servers, GPUs are connected by NVLink, and one P100 delivers around 5.3 and 10.6 TeraFLOPS of double and single precision performance, respectively. This blog focuses on the P100 for PCIe-based servers, i.e. the P100-PCIe. We have already analyzed P100 performance for several deep learning frameworks in this blog. The objective of this blog is to compare the performance of HPL, LAMMPS, NAMD, GROMACS, HOOMD-blue, Amber, ANSYS Mechanical and RELION. The hardware configuration of the cluster is the same as in the deep learning blog. Briefly, we used a cluster of four C4130 nodes; each node has dual Intel Xeon E5-2690 v4 CPUs and four NVIDIA P100-PCIe GPUs, and all nodes are connected with EDR InfiniBand. Table 1 shows the detailed information about the hardware and software used in every compute node.

     

    Table 1: Experiment Platform and Software Details

    Platform: PowerEdge C4130 (configuration G)
    Processor: 2 x Intel Xeon CPU E5-2690 v4 @2.6GHz (Broadwell)
    Memory: 256GB DDR4 @ 2400MHz
    Disk: 9TB HDD
    GPU: P100-PCIe with 16GB GPU memory
    Nodes Interconnects: Mellanox ConnectX-4 VPI (EDR 100Gb/s Infiniband)
    Infiniband Switch: Mellanox SB7890

    Software and Firmware
    Operating System: RHEL 7.2 x86_64
    Linux Kernel Version: 3.10.0-327.el7
    BIOS: Version 2.3.3
    CUDA version and driver: CUDA 8.0.44 (375.20)
    OpenMPI compiler: Version 2.0.1
    GCC compiler: 4.8.5
    Intel Compiler: Version 2017.0.098

    Applications
    HPL: Version hpl_cuda_8_ompi165_gcc_485_pascal_v1
    LAMMPS: Version Lammps-30Sep16
    NAMD: Version NAMD_2.12_Source
    GROMACS: Version 2016.1
    HOOMD-blue: Version 2.1.2
    Amber: Version 16update7
    ANSYS Mechanical: Version 17.0
    RELION: Version 2.0.3


    High Performance Linpack (HPL)

    HPL is a multicomputer parallel application that measures how fast computers solve a dense n by n system of linear equations using LU decomposition with partial row pivoting, and it is designed to be run at very large scale. The HPL runs on this cluster used double precision floating point operations. Figure 1 shows the HPL performance on the tested P100-PCIe cluster. It can be seen that 1 P100 is 3.6x faster than 2 x E5-2690 v4 CPUs. HPL also scales very well with more GPUs within nodes or across nodes. Recall that 4 P100 are within a server, and therefore 8, 12 and 16 P100 are in 2, 3 and 4 servers. 16 P100 GPUs achieve a speedup of 14.9x compared to 1 P100. Note that the overall efficiency is calculated as: HPL Efficiency = rMax / (CPUs rPeak + GPUs rPeak), where rPeak is the highest theoretical FLOPS result, and rMax is the number reported by HPL, the real performance that can be achieved. HPL cannot run at the max boost clock; it typically runs at some clock in between, but on average closer to the base clock than to the max boost clock. That is why the efficiency is not very high. Although we included CPU rPeak in the efficiency calculation, when running HPL on P100 we set DGEMM_SPLIT=1.0, which means the CPUs do not really contribute to the DGEMM; they only handle other overhead, so they do not actually contribute many FLOPS. Although we observed that the CPUs stayed fully utilized, they were just handling the overhead and data movement to keep the GPUs fed. What matters most for P100 is that rMax is very large.
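    The efficiency formula can be written out as follows; the rPeak and rMax numbers below are illustrative placeholders (the CPU rPeak assumes 16 FLOPs per cycle for AVX2 FMA on the 14-core E5-2690 v4), not measured results from this study:

        def hpl_efficiency(rmax_tflops, cpu_rpeak_tflops, gpu_rpeak_tflops):
            """HPL Efficiency = rMax / (CPUs rPeak + GPUs rPeak)."""
            return rmax_tflops / (cpu_rpeak_tflops + gpu_rpeak_tflops)

        gpu_rpeak = 4 * 4.7                      # four P100-PCIe, ~4.7 TFLOPS FP64 each
        cpu_rpeak = 2 * 14 * 2.6 * 16 / 1000     # two E5-2690 v4 CPUs, in TFLOPS
        rmax = 15.0                              # hypothetical measured HPL result, TFLOPS

        print(f"Efficiency: {hpl_efficiency(rmax, cpu_rpeak, gpu_rpeak):.1%}")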

    Figure 1: HPL performance on P100-PCIe

     
    NAMD

    NAMD (for NAnoscale Molecular Dynamics) is a molecular dynamics application designed for high-performance simulation of large biomolecular systems. The dataset we used is Satellite Tobacco Mosaic Virus (STMV), a small, icosahedral plant virus that worsens the symptoms of infection by Tobacco Mosaic Virus (TMV). This dataset has 1,066,628 atoms and is the largest dataset on the NAMD utilities website. The performance metric in the output log of this application is “days/ns” (lower is better), but its inverse, “ns/day”, is used in our plots since that is what most molecular dynamics users focus on. The average of all occurrences of this value in the output log was used. Figure 2 shows the performance within 1 node. It can be seen that the performance with 2 P100 is better than with 4 P100. This is probably because of the communication among different CPU threads. This application launches a set of worker threads that handle the computation and communication threads that handle the data communication. As more GPUs are used, more communication threads are used and more synchronization is needed. In addition, based on the profiling result from NVIDIA’s CUDA profiler, nvprof, with 1 P100 the GPU computation takes less than 50% of the whole application time. According to Amdahl’s law, the speedup with more GPUs will be limited by the other 50% of the work that is not parallelized on the GPU. Based on this observation, we further ran this application on multiple nodes with two different settings (2 GPUs/node and 4 GPUs/node), and the result is shown in Figure 3. The result shows that no matter how many nodes are used, the performance with 2 GPUs/node is always better than with 4 GPUs/node. Within a node, 2 P100 GPUs are 9.5x faster than dual CPUs.
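    The Amdahl’s law argument above can be made concrete with a small sketch; the 50% GPU-accelerated fraction is the nvprof observation quoted in the text, and the GPU counts are illustrative:

        def amdahl_speedup(parallel_fraction, n):
            """Overall speedup when only parallel_fraction of the runtime is sped up n times."""
            return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n)

        p = 0.5                      # GPU-accelerated fraction with 1 P100 (from nvprof)
        for n in (1, 2, 4, 8):
            print(f"{n} GPUs -> at most {amdahl_speedup(p, n):.2f}x")
        # the limit approaches 1 / (1 - p) = 2x no matter how many GPUs are added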

     Figure 2: NAMD Performance within 1 P100-PCIe node

     Figure 3: NAMD Performance across Nodes

    GROMACS

    GROMACS (for GROningen MAchine for Chemical Simulations) primarily does simulations for biochemical molecules (bonded interactions), but because of its efficiency in calculating non-bonded interactions (atoms not linked by covalent bonds), its user base is expanding to non-biological systems. Figure 4 shows the performance of GROMACS on CPUs, K80 GPUs and P100-PCIe GPUs. Since one K80 has two internal GPUs, from now on when we mention one K80 it always refers to both internal GPUs, not to one of the two. When testing with K80 GPUs, the same servers used for the P100-PCIe testing were used; the CPUs and memory were kept the same, and the only difference is that the P100-PCIe GPUs were replaced with K80 GPUs. In all tests, there were four GPUs per server and all GPUs were utilized. For example, the 3 node data point is with 3 servers and 12 total GPUs. The performance of P100-PCIe is 4.2x – 2.8x faster than CPU from 1 node to 4 nodes, and 1.5x – 1.1x faster than K80 from 1 node to 4 nodes.

      Figure 4: GROMACS Performance on P100-PCIe

    LAMMPS

    LAMMPS (for Large-scale Atomic/Molecular Massively Parallel Simulator) is a classical molecular dynamics code, capable of simulations for solid-state materials (metals, semiconductors), soft matter (biomolecules, polymers) and coarse-grained or mesoscopic systems. It can be used to model atoms or, more generically, as a parallel particle simulator at the atomic, meso or continuum scale. The dataset we used was LJ (Lennard-Jones liquid benchmark), which contains 512,000 atoms. There are two GPU implementations in LAMMPS: the GPU library version and the Kokkos version. In our experiments we used the Kokkos version since it was much faster than the GPU library version.

    Figure 5 shows LAMMPS performance on CPU and P100-PCIe GPUs. Using 16 P100 GPUs is 5.8x faster than using 1 P100. The reason this application does not scale linearly is that the data transfer time (CPU->GPU, GPU->CPU and GPU->GPU) increases as more GPUs are used, even though the computation part decreases linearly. The data transfer time increases because this application requires data communication among all GPUs used. However, the configuration G we used only allows Peer-to-Peer (P2P) access for two pairs of GPUs: GPU 1 - GPU 2 and GPU 3 - GPU 4. GPU 1/2 cannot communicate with GPU 3/4 directly; when that communication is needed, the data must go through the CPU, which slows the communication. Configuration B eases this issue as it allows P2P access among all four GPUs within a node. The comparison between configuration G and configuration B is shown in Figure 6. By running LAMMPS on a configuration B server with 4 P100, the performance metric “timesteps/s” improved to 510 compared to 505 in configuration G, a 1% improvement. The improvement is not significant because the data communication takes less than 8% of the whole application time when running on configuration G with 4 P100. Figure 7 also compares the performance of P100-PCIe with that of CPU and K80 GPUs for this application. It shows that within 1 node, 4 P100-PCIe are 6.6x faster than 2 E5-2690 v4 CPUs and 1.4x faster than 4 K80 GPUs.

     Figure 5: LAMMPS Performance on P100-PCIe

     

      

    Figure 6 : Comparison between Configuration G and Configuration B


    Figure 7: LAMMPS Performance Comparison

    HOOMD-blue

    HOOMD-blue (for Highly Optimized Object-oriented Many-particle Dynamics - blue) is a general-purpose molecular dynamics simulator. Figure 8 shows the HOOMD-blue performance. Note that the y-axis is in logarithmic scale. It is observed that 1 P100 is 13.4x faster than dual CPUs. The speedup of using 2 P100 is 1.5x compared to using only 1 P100, which is a reasonable speedup. However, from 4 P100 to 16 P100 the speedup is only 2.1x to 3.9x, which is not high. The reason is that, similar to LAMMPS, this application also involves lots of communication among all of the GPUs used. Based on the analysis in LAMMPS, using configuration B should reduce this communication bottleneck significantly. To verify this, we ran the same application again on a configuration B server. With 4 P100, the performance metric “hours for 10e6 steps” was reduced to 10.2 compared to 11.73 in configuration G, a 13% performance improvement, and the speedup compared to 1 P100 improved from 2.1x to 2.4x.

     Figure 8: HOOMD-blue Performance on CPU and P100-PCIe

    Amber

    Amber is the collective name for a suite of programs that allow users to carry out molecular dynamics simulations, particularly on biomolecules. The term Amber is also used to refer to the empirical force fields that are implemented in this suite. Figure 9 shows the performance of Amber on CPU and P100-PCIe. It can be seen that 1 P100 is 6.3x faster than dual CPUs. Using 2 P100 GPUs is 1.2x faster than using 1 P100. However, the performance drops significantly when 4 or more GPUs are used. The reason is that, similar to LAMMPS and HOOMD-blue, this application heavily relies on P2P access, but configuration G only supports it between two pairs of GPUs. We verified this by again testing this application on a configuration B node. As a result, the performance with 4 P100 improved to 791 ns/day compared to 315 ns/day in configuration G, a 151% performance improvement and a speedup of 2.5x. But even in configuration B, the multi-GPU scaling is still not good. This is because when the Amber multi-GPU support was originally designed, the PCIe bus speed was Gen2 x16 and the GPUs were C1060s or C2050s. The current Pascal generation GPUs are more than 16x faster than the C1060s, while the PCIe bus speed has only increased by 2x (PCIe Gen2 x16 to Gen3 x16) and InfiniBand interconnects by about the same amount. The Amber website explicitly states: “It should be noted that while the legacy MPI and GPU-Direct methods of multi-GPU communication are still supported, and will be used by the code automatically if peer to peer communication is not available, you are very unlikely to see any speedup by using multiple GPUs for a single job if the GPUs are newer than C2050s. Multi-node runs are almost impossible to get to scale.” This is consistent with our multi-node results: as Figure 9 shows, the more nodes are used, the worse the performance.

      Figure 9: Amber Performance on CPU and P100-PCIe

    ANSYS Mechanical

    ANSYS® Mechanical software is a comprehensive finite element analysis (FEA) tool for structural analysis, including linear, nonlinear dynamic, hydrodynamic and explicit studies. It provides a complete set of element behaviors, material models and equation solvers for a wide range of mechanical design problems. The finite element method is used to solve the partial differential equations, which is a compute- and memory-intensive task. Our testing focused on the Power Supply Module (V17cg-1) benchmark. This is a medium-sized job for iterative solvers and a good test of memory bandwidth. Figure 10 shows the performance of ANSYS Mechanical on CPU and P100-PCIe. Within a node, 4 P100 are 3.8x faster than dual CPUs, and with 4 nodes, 16 P100 are 2.3x faster than 8 CPUs. The figure also shows that the performance scales well with more nodes: the speedup with 4 nodes is 2.8x compared to 1 node.

     Figure 10: ANSYS Mechanical Performance on CPU and P100-PCIe

    RELION

    RELION (for REgularised LIkelihood OptimisatioN) is a program that employs an empirical Bayesian approach to the refinement of (multiple) 3D reconstructions or 2D class averages in electron cryo-microscopy (cryo-EM). Figure 11 shows the performance of RELION on CPU and P100-PCIe. Note that the y-axis is in logarithmic scale. It demonstrates that 1 P100 is 8.8x faster than dual CPUs. From the figure we also notice that it does not scale well starting from 4 P100 GPUs. Because of the long execution time, we did not profile this application, but it is possible that the reason for the weak scaling is similar to that for LAMMPS, HOOMD-blue and Amber.

     Figure 11: RELION Performance on CPU and P100-PCIe


    Conclusions and Future Work


    In this blog, we presented and analyzed the performance of different applications on Dell PowerEdge C4130 servers with P100-PCIe GPUs. Of the tested applications, HPL, GROMACS and ANSYS Mechanical benefit from the balanced CPU-GPU layout of configuration G, because they do not require P2P access among GPUs. However, LAMMPS, HOOMD-blue, Amber (and possibly RELION) rely on P2P access. Therefore, with configuration G they scale well up to 2 P100 GPUs and then scale weakly with 4 or more P100 GPUs. With configuration B they scale better than with configuration G at 4 GPUs, so configuration B is more suitable and recommended for applications implemented with P2P access.

    In future work, we will run these applications on P100-SXM2 and compare the performance difference between P100-PCIe and P100-SXM2.



  • General HPC

    NAMD Performance Analysis on Skylake Architecture

    Author: Joseph Stanfield

    The purpose of this blog is to provide a comparative performance analysis of the Intel® Xeon® Gold 6150 processor and the previous generation Xeon® E5-2697 v4 processor using the NAMD benchmark. The Xeon® Gold 6150 CPU features 18 physical cores, or 36 logical cores when hyper-threading is enabled. This processor is based on Intel’s new micro-architecture codenamed “Skylake”. Intel significantly increased the L2 cache per core from 256 KB on Broadwell to 1 MB on Skylake. The 6150 also touts 24.75 MB of L3 cache and a six-channel DDR4 memory interface.

     

    Nanoscale Molecular Dynamics (NAMD) is an application developed using the Charm++ parallel programming model for molecular dynamics simulation. It is popular due to its parallel efficiency, scalability, and the ability to simulate millions of atoms.

    Test Cluster Configurations:

    Server: Dell EMC PowerEdge C6420 | Dell EMC PowerEdge C6320
    CPU: 2x Xeon® Gold 6150 18c 2.7 GHz (Skylake) | 2x Xeon® E5-2697 v4 16c 2.3 GHz (Broadwell)
    RAM: 12x 16GB @2666 MHz | 8x 16GB @2400 MHz
    HDD: 1TB SATA | 1TB SATA
    OS: RHEL 7.3 | RHEL 7.3
    InfiniBand: EDR ConnectX-4 | EDR ConnectX-4
    CHARM++: 6.7.1
    NAMD: 2.12_Source

    BIOS Settings
    System Profile: Performance Optimized
    Logical Processor: Disabled
    Virtualization Technology: Disabled


    The benchmark dataset selected for this series of tests was the Satellite Tobacco Mosaic Virus, or STMV. STMV contains 1,066,628 atoms, which makes it ideal for demonstrating scaling to large clustered environments. Performance is measured in nanoseconds per day (ns/day), the amount of simulated time that can be computed in one day of wall-clock time. A larger value indicates faster performance.

     

    The first series of benchmark tests measured CPU performance. The test environment consisted of a single node, two nodes, four nodes, and eight nodes, with the NAMD STMV dataset run three times for each configuration. The network interconnect between the nodes was EDR InfiniBand, as noted in the table above. Average results from a single node showed 0.70 ns/day, while for a two-node run performance increased by roughly 80% to 1.25 ns/day. This trend of an average 80% increase in performance for each doubling of node count remained relatively consistent as the environment was scaled to eight nodes, as seen in Figure 1.
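    For clarity, this is the arithmetic behind the “~80% per doubling” observation, using the two numbers quoted above:

        single_node_ns_day = 0.70
        two_node_ns_day = 1.25

        speedup = two_node_ns_day / single_node_ns_day    # ~1.79x, i.e. ~80% faster
        efficiency = speedup / 2                          # ~89% parallel efficiency
        print(f"Speedup: {speedup:.2f}x, parallel efficiency: {efficiency:.0%}")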

    Figure 1.

     

    The second series of benchmarks were run to compare the Xeon® Gold 6150 against the previous generation Xeon® E5-2697v4. The same dataset, STMV was used for both benchmark environments. As you can see below in Figure 2, the Xeon® Gold CPU results surpass the Xeon E5 V4 by 111% on a single node, and the relative performance advantage decreases to 63% at eight nodes.

     


    Figure 2.

     

     

    Summary

    In this blog, we analyzed and presented the performance of a Dell EMC PowerEdge C6420 cluster scaling from a single node to eight nodes running NAMD with the STMV dataset. Results show that NAMD performance scales nearly linearly with the increased number of nodes.

    At the time of publishing this blog, there is an issue with Intel Parallel Studio 2017.x and NAMD compilation. Intel recommends using Parallel Studio 2016.4 or 2018 (which is still in beta) with -xCORE-AVX512 under the FLOATOPTS variable for best performance.

    A comparative analysis was also conducted with the previous generation Dell EMC PowerEdge C6320 server and Xeon® E5 v4 (Broadwell) processor. The Xeon® Gold outperformed the E5 V4 by 111% and maintained a linear performance increase as the cluster was scaled and the number of nodes multiplied.


    Resources

    Intel NAMD Recipe: https://software.intel.com/en-us/articles/building-namd-on-intel-xeon-and-intel-xeon-phi-processor

    Intel Fabric Tuning and Application Performance: https://www.intel.com/content/www/us/en/high-performance-computing-fabrics/omni-path-architecture-application-performance-mpi.html

  • Dell TechCenter

    Simplified Rack Scale Management

    Ed Bailey – Distinguished Eng., ESI Architecture, Dell EMC

    “What if you could manage more devices, with fewer interfaces?”

    In today’s large scale environments you are always managing more and more devices – more compute, more storage, more networking. As your infrastructure scales, management becomes increasingly complicated, time consuming, and expensive.

    You need a new approach; one that simplifies your operations, your procurement, your service and your management. Buying integrated racks is key, but also critical is the ability to manage at a higher level – at rack scale. Rack scale management treats the entire rack as the unit of management, enabling faster scaling and more efficient resource utilization.

    The Dell EMC DSS 9000 rack scale solution is unique in that it delivers both the extreme configuration flexibility large scale infrastructures require for a wide range of workloads and the simplified management and operations that improve their bottom-line efficiency.  Part of how it does that is by making all aspects of rack scale management easy:

    •           Easy to Purchase – A consistent architecture and components streamline scale-out procurement
    •           Easy to Optimize – Highly flexible configuration options simplify multiple workload optimization
    •           Easy to Deploy – Complete racks are delivered pre-configured, pre-integrated and fully tested
    •           Easy to Manage – A single, open rack scale interface addresses the entire infrastructure
    •           Easy to Scale – You can add fully configured sleds or full racks in a single step
    •           Easy to Service – Modular components, cold aisle service, proven global service network

    The rest of this blog describes in more detail how the rack scale approach can help you address the most pressing large scale infrastructure management challenges.

    Simplicity
    As part of a simplified rack scale solution, the DSS 9000 offers a single interface to manage all the compute and storage devices in a rack. This rack management interface is based on the industry-accepted open Redfish APIs. With a pre-integrated DSS 9000, you roll the rack into the data center, plug it in, and then begin provisioning the system - talking Redfish to a single point of management. Immediately you know everything that is in the rack – and immediately have access to it all. You understand what needs to be provisioned and can issue commands to each device. You can provision that rack as easily and quickly as possible.
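    As a sketch of what “talking Redfish” looks like in practice, the snippet below walks the standard Redfish Systems collection to inventory the rack; the Rack Manager address and credentials are placeholders, not values from a real DSS 9000 deployment:

        import requests

        RACK_MANAGER = "https://rack-manager.example.com"   # placeholder address
        AUTH = ("admin", "password")                        # placeholder credentials

        systems = requests.get(f"{RACK_MANAGER}/redfish/v1/Systems",
                               auth=AUTH, verify=False).json()
        for member in systems.get("Members", []):
            node = requests.get(RACK_MANAGER + member["@odata.id"],
                                auth=AUTH, verify=False).json()
            print(node.get("Name"), node.get("Model"), node.get("PowerState"))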

    Another aspect of the DSS 9000 solution is its powerful management infrastructure, with a gigabit management network that is independent of the data network and a Rack Manager module that consolidates rack-wide communication and vastly simplifies cabling. Instead of needing to communicate to each individual compute or storage node – you just talk to the Rack Manager, and instead of connecting to all the nodes individually, connections are consolidated at the block level and cabling is reduced.

    Capability
    DSS 9000 rack management gives you more capability for configuration and provisioning of the rack. The DSS 9000 implements Intel ® Rack Scale Design (RSD) Pooled System Management Engine (PSME) APIs and firmware that enables the comprehensive inventory of all devices in the rack. This gives you the ability to manage with much greater efficiency using the Redfish APIs. For example, instead of performing firmware updates to each device in the rack individually, rack management allows you to perform firmware updates to all the devices in the rack at once.

    Other significant rack management capabilities include power-capping of blocks and nodes to improve energy efficiency, as well as monitoring and control of cooling fans.  Both of these capabilities can be administered at the block level to provide greater granularity of control.

    Efficiency
    Managing at the rack level also delivers efficiency by allowing you to define a full rack configuration that meets your infrastructure’s needs and employ it as the “unit of purchase” or the “unit of deployment”. This tremendously simplifies ongoing operational procurement and deployments. Improving time-to-value in this way – by reducing the time it takes to define, configure, order and deploy incremental infrastructure – provides another level of cost savings. At the same time, it has the added benefit of accelerating your organization’s responsiveness in delivering services and improving reliability.

    The modular block infrastructure of the DSS 9000 also delivers higher efficiency. You can define workload-specific configurations for compute or storage nodes that can be available for rapid scaling of your infrastructure as demand increases. For example, some nodes in your infrastructure may be optimally configured for Hadoop workload performance with specific processors, memory capacity and storage. When the need to scale arises, an identically preconfigured node can quickly and easily be added to the rack.  Procurement and deployment are streamlined and management of the new node is consistent… and instantaneous. The newly introduced node is automatically inventoried and immediately accessible for management commands.

    Conclusion
    Simplicity, capability, efficiency – DSS 9000 rack management delivers answers to the challenges of administering IT infrastructure at massive scale. Inquiries about ESI Rack Scale solutions can be made at ESI@dell.com.

     

  • Dell TechCenter

    Which OpenManage solution is best for you? Ask the Dell Systems Management Advisor.

    Dell provides a vast array of Enterprise Systems Management solutions for many different IT needs and use cases. With so many useful options available, sometimes it's not immediately obvious which Dell OpenManage solution will work best for you based on your environment and requirements, whether you need to deploy, update, monitor, or maintain systems.

    Luckily, Dell recently created an advisor tool that will recommend which OpenManage products will work best for you based on (but not limited to) factors in your environment such as:

    • Functionality you wish to implement (Monitor / Manage / Deploy / Maintain)
    • Size and brand mix of your server environment
    • Features of your Dell servers
    • Mix of physical vs virtual hosts
    • Current Dell and 3rd party systems management tools you utilize
    • Need for 1:1 or 1-to-many tools

     

    Once you complete a short questionnaire, the advisor will suggest the OpenManage tools that best suit your needs and provide useful information and links so that you can learn more.

    Dell Systems Management Advisor

    Other than the Systems Management Advisor, Dell TechCenter provides a wealth of information if you would like to evaluate our Systems Management technologies.  Please visit our additional Dell OpenManage links:

     
     
  • vWorkspace - Blog

    What's new for vWorkspace - July 2017

    Updated monthly, this publication provides you with new and recently revised information, organized in the following categories: Documentation, Notifications, Patches, Product Life Cycle, Release, Knowledge Base Articles.

    Subscribe to the RSS (Use IE only)

     

    Knowledgebase Articles

    New 

    None at this time

     

    Revised

    178358 - Windows 10 Support

    Microsoft Windows 10 is now supported from vWorkspace 8.6.2 and above, with the exception of the Creators Update

    Revised: July 13, 2017

     

    182253 - Poor performance with using EOP USB to redirect USB drive

    When using EOP USB to redirect a USB drive or memory stick to a VDI session, redirection is sporadic and the data transfer rate is poor once the device is redirected.

    Revised: July 21, 2017

     

    Product Life Cycle - vWorkspace

    Revised: July 27, 2017

  • Dell TechCenter

    DellEMC PowerEdge 14G Servers certified for VMware ESXi

    This blog is written by Murugan Sekar & Revathi from the Dell Hypervisor Engineering team.

    DellEMC has introduced the next generation (14G) of PowerEdge servers, which support the Intel Xeon Processor Scalable Family (Skylake-SP). This blog highlights 14G DellEMC PowerEdge server features related to VMware ESXi.

    • As part of the initial release, the following servers are launched and certified with ESXi 6.5 & ESXi 6.0 U3. Refer to the VMware HCL for more details.
      • R940, R740xd, R740, R640, C6420

    • The above listed servers are certified with the Trusted Boot (TXT) feature for ESXi 6.5 & ESXi 6.0 U3 in both BIOS & UEFI boot modes. On these servers, DellEMC offers TPM as a plug-in module solution which supports:
      • TPM1.2
      • TPM2.0
      • TPM2.0(China NationZ)

        Note: Trusted Platform Module (TPM) 2.0 is not supported in ESXi6.5 & ESXi6.0U3. Only TPM 1.2 is supported in the current releases of VMware ESXi.

    • VMware supports UEFI secure boot from ESXi6.5 onwards. This feature is certified on R940, R740, R740xd, R640 & C6420. Refer to the white paper  for details on UEFI Secure boot.
    • GPU Passthrough (vDGA) is a graphics acceleration function offered by VMware, and currently this feature is certified for the following GPUs on the R740 server with ESXi 6.5 & ESXi 6.0 U3.
      • Tesla M60
      • AMD S7150
      • AMD S7150x2

       Note: Ensure that Memory Mapped I/O Base is set to 12TB under BIOS settings to power on Windows VMs with GPU pass-through. Refer to the link for details.

    • 14G servers introduce a new IDSDM module which combines IDSDM and/or vFlash into a single module. This module supports only microSD cards. Currently DellEMC offers 16/32/64GB microSD cards for IDSDM which are certified for VMware ESXi. Refer to this link for the list of certified SD cards for these servers.

    • NVDIMM-N is not supported in ESXi6.5 & ESXi6.0U3.

  • Dell TechCenter

    Dell announces PowerEdge VRTX support for VMware ESXi 6.5

    This blog post is written by Thiru Navukkarasu and Krishnaprasad K from Dell Hypervisor Engineering. 

    Dell PowerEdge VRTX was not supported on the VMware ESXi 6.5 branch until now. Dell has announced support for VRTX from the Dell customized version of ESXi 6.5, revision A04, onwards. From VMware ESXi 6.5 onwards, the Shared PERC8 controller in VRTX uses the dell_shared_perc8 native driver instead of the megaraid_sas vmklinux driver used in the ESXi 6.0.x branch.

    You may look at the following command outputs in ESXi to verify if you have the supported image installed on PowerEdge VRTX blades. 

    ~] vmware -lv

    VMware ESXi 6.5.0 build-5310538

    VMware ESXi 6.5.0 GA

     ~] cat /etc/vmware/oem.xml

    You are running DellEMC Customized Image ESXi 6.5 A04 (based on ESXi VMKernel Release Build 5310538)

    ~] esxcli storage core adapter list

    HBA Name  Driver             Link State  UID                   Capabilities  Description

    --------  -----------------  ----------  --------------------  ------------  ----------------------------------------------------------

    vmhba3    dell_shared_perc8  link-n/a    sas.0                               (0000:0a:00.0) LSI / Symbios Logic Shared PERC 8 Mini

    vmhba4    dell_shared_perc8  link-n/a    sas.c000016000c00                   (0000:15:00.0) LSI / Symbios Logic Shared PERC 8 Mini


  • General HPC

    LAMMPS Four Node Comparative Performance Analysis on Skylake Processors

    Author: Joseph Stanfield
     

    The purpose of this blog is to provide a comparative performance analysis of the Intel® Xeon® Gold 6150 processor (architecture codenamed “Skylake”) and the previous generation Xeon® E5-2697 v4 processor using the LAMMPS benchmark. The Xeon® Gold 6150 CPU features 18 physical cores, or 36 logical cores with hyper-threading. Intel significantly increased the L2 cache per core from 256 KB on previous generations of Xeon to 1 MB. The new processor also touts 24.75 MB of L3 cache and a six-channel DDR4 memory interface.

    LAMMPS, or Large-scale Atomic/Molecular Massively Parallel Simulator, is an open-source molecular dynamics program originally developed by Sandia National Laboratories, Temple University, and the United States Department of Energy. The main function of LAMMPS is to model particles in a gaseous, liquid, or solid state.

     

    Test cluster configuration

     

    Server: Dell EMC PowerEdge C6420 | Dell EMC PowerEdge C6320
    CPU: 2x Xeon® Gold 6150 18c 2.7 GHz (Skylake) | 2x Xeon® E5-2697 v4 16c 2.3 GHz (Broadwell)
    RAM: 12x 16GB @2666 MT/s | 8x 16GB @2400 MT/s
    HDD: 1TB SATA | 1TB SATA
    OS: RHEL 7.3 | RHEL 7.3
    InfiniBand: EDR ConnectX-4 | EDR ConnectX-4

    BIOS Settings
    System Profile: Performance Optimized
    Logical Processor: Disabled
    Virtualization Technology: Disabled


    The LAMMPS release used for testing was lammps-6June-17. The in.eam dataset was used for the analysis on both configurations. in.eam is a dataset that simulates a metallic solid: a Cu EAM potential with a 4.95 Angstrom cutoff (45 neighbors per atom) and NVE integration. The simulation was executed for 100 steps with 32,000 atoms. The first series of benchmarks measured performance in units of timesteps/s. The test environment consisted of four servers interconnected with InfiniBand EDR, and tests were run on a single node, two nodes, and four nodes with LAMMPS, three times for each configuration. Average results from a single node showed 106 timesteps per second, while the two-node result nearly doubled performance at 216 timesteps per second. This trend remained consistent as the environment was scaled to four nodes, as seen in Figure 1.
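    The scaling implied by those two data points works out as follows:

        one_node_steps = 106
        two_node_steps = 216

        speedup = two_node_steps / one_node_steps   # ~2.04x
        efficiency = speedup / 2                    # ~102%, slightly super-linear
        print(f"Speedup: {speedup:.2f}x, scaling efficiency: {efficiency:.0%}")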

     

    Figure 1.

    The second series of benchmarks were run to compare the Xeon® Gold against the previous generation, Xeon® E5 v4. The same dataset, in.eam, was used with 32,000 atoms and 100 steps per run. As you can see below in Figure 2, the Xeon® Gold CPU outperforms the Xeon® E5 v4 by about 120% with each test, but the performance increase drops slightly as the cluster is scaled.


    Figure 2.

    Conclusion

    In this blog, we analyzed and presented the performance of a Dell EMC PowerEdge C6420 cluster scaling from a single node to four nodes running the LAMMPS benchmark. Results show that performance of LAMMPS scales linearly with the increased number of nodes.

     

    A comparative analysis was also conducted with the previous generation Dell EMC PowerEdge C6320 server with the Xeon® E5 v4 (Broadwell) processor. As with the first test, near-linear scaling with node count was observed on the Xeon® E5 v4, similar to the Xeon® Gold, but the Xeon® Gold processor outperformed the previous generation CPU by about 120% in each run.



  • Dell TechCenter

    Dell EMC Unity 4.2 Release

    Blog author: Chuck Armstrong, Dell EMC Storage Engineering

     

    Dell EMC has just released a new Dell EMC Unity All-Flash portfolio: the 350F, 450F, 550F, and 650F.

    These new all-flash arrays are based on the latest Intel® Broadwell chip. Additionally, they are loaded up with twice the memory and up to 40 percent more processor cores than previous Dell EMC Unity models. What does all of this mean for customers? Finally, a midrange all-flash storage platform built to get the most out of virtualized and mixed workloads.

    Let’s talk about workloads:

    If you plan on deploying Microsoft® SQL Server®, Exchange Server, Hyper-V®, or VMware vSphere® with one of the new Dell EMC Unity All-Flash array models, keep reading to find a trove of information.

    Microsoft SQL Server

    There are several performance considerations that need to be understood when implementing Microsoft SQL Server on Dell EMC Unity All-Flash arrays to provide a highly efficient environment for users. These considerations fall into the categories of database types, operating system settings and configuration, and storage design (layout for the database). All of this, and more, is found in the Dell EMC Unity Storage with Microsoft SQL Server best practices paper.

    Microsoft Exchange

    Deploying Microsoft Exchange Server on Dell EMC Unity All-Flash arrays has its own set of considerations to maximize performance. One of these is the version of Exchange being deployed, as different versions have different performance characteristics. Another is the design and layout of the database and log locations. These considerations and many more can be found in the Dell EMC Unity Storage with Microsoft Exchange Server best practices paper.

     Microsoft Hyper-V

    If your environment utilizes Microsoft Hyper-V, the Dell EMC Unity Storage with Microsoft Hyper-V paper provides important best practices. Some of the many points of interest provided include guest virtual machine storage recommendations, virtual machine placement recommendations, and thin provisioning best practices.

     VMware vSphere

    For those environments where VMware vSphere is the hypervisor, and deploying a new Dell EMC Unity All-Flash array is on the horizon, the Dell EMC Unity Storage with VMware vSphere best practices paper provides vital information to get the job done. Some items of interest found in this paper are: getting the most out of multipathing, configuring datastores (Fibre Channel, iSCSI, NFS, and VVol), and determining where to thin provision: within vSphere, within the storage, or both.

      

     


    All of the best practices papers mentioned also provide information about several of the features available on the Dell EMC Unity All-Flash arrays. Additional information on features, and for a more general best practices guide, please check out the Dell EMC Unity Best Practices Guide.