Author: Somanath Moharana, Dell EMC HPC Innovation Lab, August 2017
This blog explores the performance of the four-socket Dell EMC PowerEdge R940 server with Intel Skylake processors. The latest Dell EMC 14th generation servers support the new Intel® Xeon® Processor Scalable Family (processor architecture codenamed “Skylake”), and the increased number of cores and higher memory speed benefit a wide variety of HPC applications.
The PowerEdge R940 is Dell EMC’s latest 4-socket, 3U rack server designed to run complex workloads. It supports up to 6TB of DDR4 memory and up to 122 TB of storage. The system features the Intel® Xeon® Scalable Processor Family, 48 DDR4 DIMMs, up to 13 PCI Express® (PCIe) 3.0 enabled expansion slots and a choice of embedded NIC technologies. It is a general-purpose platform capable of handling demanding workloads and applications, such as data warehouses, ecommerce, databases, and high-performance computing (HPC). Its increased storage capacity makes the PowerEdge R940 well suited to data-intensive applications that require more storage.
This blog also describes the impact of BIOS tuning options on HPL, STREAM and the scientific applications ANSYS Fluent and WRF, and compares the performance of the new PowerEdge R940 to the previous generation PowerEdge R930 platform. It also analyses the performance with the Sub NUMA Cluster (SNC) modes (SNC=Enabled and SNC=Disabled). With SNC enabled, eight NUMA nodes are exposed to the OS on a four-socket PowerEdge R940. Each NUMA node can communicate with seven remote NUMA nodes: six in the other three sockets and one within the same socket. NUMA domains on different sockets communicate over the UPI interconnect. Please see the blog on BIOS characteristics of Skylake processors for more details on BIOS options. Table 1 lists the server configuration and the application details used for this study.
Table 1: Details of Server and HPC Applications used for R940 analysis
4 x Intel Xeon E7-8890 firstname.lastname@example.orgGHz (18 cores) 45MB L3 cache 165W
4 x Intel Xeon E7-8890 email@example.comGHz (24 cores) 60MB L3 cache 165W
4 x Intel Xeon Platinum firstname.lastname@example.orgGHz, 10.4GT/s (Cross-bar connection)
1024 GB = 64 x 16GB DDR4 @1866MHz
1024 GB = 32 x 32GB DDR4 @1866 MHz
384GB = (24 x 16GB) DDR4@2666MT/s
Intel QuickPath Interconnect (QPI) 8GT/s
Intel Ultra Path Interconnect (UPI) 10.4GT/s
Processor Settings > Logical Processors
Processor Settings > UPI Speed
Maximum Data Rate
Processor Settings > Sub NUMA cluster
Software and Firmware
Red Hat Enterprise Linux Server release 6.6
Red Hat Enterprise Linux Server release 7.2
Red Hat Enterprise Linux Server release 7.3 (3.10.0-514.el7.x86_64)
2017 Update 3
Benchmark and Applications
V2.1 from MKL 11.3
V2.1 from MKL update 3
v5.10, Array Size 1800000000, Iterations 100
v5.4, Array Size 1800000000, Iterations 100
V3.5.1 Input Data Conus12KM, Netcdf-4.3.1
V3.8 Input Data Conus12KM, Netcdf-4.4.0
v3.8.1, Input Data Conus12KM, Conus2.5KM, Netcdf-4.4.2
v15, Input Data: truck_poly_14m
v16, Input Data: truck_poly_14m
v17.2, Input Data: truck_poly_14m, aircraft_wing_14m, ice_2m, combustor_12m, exhaust_system_33m
Note: The software versions differ for the older generation processors, and results are compared against the best configuration available at that time. Because of the large architectural changes in servers and processors from generation to generation, the changes in software versions are not a significant factor.
The High Performance Linpack (HPL) Benchmark is a measure of a system's floating point computing power. It measures how fast a computer solves a dense n by n system of linear equations Ax = b, which is a common task in engineering. HPL was run with a block size of NB=384 and a problem size of N=217754. Since HPL is an AVX-512-enabled workload, we calculate the HPL theoretical maximum performance as (rated base frequency of processor * number of cores * 32 FLOP/cycle).
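As a rough illustration of the peak formula above, the following sketch computes a theoretical peak and the memory footprint of the HPL matrix. The clock speed and core count in the example call are placeholders rather than the rated values of the tested processors, and the default of 32 FLOP per cycle per core assumes two AVX-512 FMA units per core.

```python
def hpl_peak_gflops(base_ghz, total_cores, flop_per_cycle=32):
    """Theoretical double-precision peak in GFLOPS.

    base_ghz * 1e9 cycles/s * cores * FLOP/cycle, expressed in units of 1e9 FLOP/s.
    """
    return base_ghz * total_cores * flop_per_cycle


def hpl_matrix_gib(n, bytes_per_element=8):
    """Memory needed for the dense n x n double-precision matrix, in GiB."""
    return n * n * bytes_per_element / 2**30


if __name__ == "__main__":
    # Placeholder inputs (assumptions): 112 cores at a hypothetical 2.0 GHz base clock.
    print(f"peak:   {hpl_peak_gflops(2.0, 112):.0f} GFLOPS")
    # N is the problem size quoted in the text.
    print(f"matrix: {hpl_matrix_gib(217754):.1f} GiB")
```

For N=217754 the matrix alone occupies roughly 353 GiB, which is most of the 384GB installed in the R940 configuration listed in Table 1.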
Figure 1: Comparing HPL Performance across BIOS profiles
Figure 1 depicts the performance of the PowerEdge R940 server described in Table 1 with different BIOS options. Here “Performance SNC = Disabled” gives better performance than the other BIOS profiles. With “SNC = Disabled” we observe 1-2% better performance than with “SNC = Enabled” for all the BIOS profiles.
Figure 2: HPL performance with AVX2 and AVX512 instruction sets
Figure 3: HPL Performance over multiple generations of processors
Figure 2 compares the performance of HPL run with the AVX2 and AVX-512 instruction sets on the PowerEdge R940 (AVX = Advanced Vector Extensions). AVX-512 is a set of 512-bit extensions to the 256-bit AVX SIMD instructions for the x86 instruction set architecture. The instruction set can be selected by setting the “MKL_ENABLE_INSTRUCTIONS” environment variable to AVX2 or AVX512. Running HPL with the AVX-512 instruction set gives around 75% better performance than running it with the AVX2 instruction set.
Figure 3 compares the results of the four-socket R930 powered by Haswell-EX and Broadwell-EX processors with the R940 powered by Skylake processors. For HPL, the R940 server performed ~192% better than the R930 server with four Haswell-EX processors and ~99% better than with Broadwell-EX processors. The improvement we observed in Skylake over Broadwell-EX is due to a 27% increase in the number of cores and a 75% increase in performance from AVX-512 vector instructions.
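One way to switch between the two instruction sets from a job script is to set the variable in the environment of the launched process. The sketch below is only illustrative: the `run_hpl` helper and the example command line are assumptions, not the launch procedure used for these tests.

```python
import os
import subprocess


def mkl_env(isa):
    """Return a copy of the current environment with MKL_ENABLE_INSTRUCTIONS set."""
    if isa not in ("AVX2", "AVX512"):
        raise ValueError(f"unsupported instruction set for this sketch: {isa}")
    env = dict(os.environ)
    env["MKL_ENABLE_INSTRUCTIONS"] = isa
    return env


def run_hpl(command, isa):
    """Run a (hypothetical) HPL command line under the chosen instruction set."""
    return subprocess.run(command, env=mkl_env(isa), check=True)


# Example, not executed here; the binary name and rank count are placeholders:
# run_hpl(["mpirun", "-np", "4", "./xhpl"], "AVX2")
```

Keeping the setting in a per-run environment copy rather than exporting it globally makes it harder to leave the AVX2 restriction in place by accident for later runs.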
The Stream benchmark is a synthetic benchmark program that measures sustainable memory bandwidth and the corresponding computation rate for simple vector kernels. The STREAM benchmark calculates the memory bandwidth by counting only the bytes that the user program requested to be loaded or stored. This study uses the results reported by the TRIAD function of the stream bandwidth test.
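The byte counting used for the TRIAD result can be sketched as follows. TRIAD computes a[i] = b[i] + q * c[i], so each element involves two loads and one store of an 8-byte double. The timing in the usage note is a made-up placeholder, not a measured value.

```python
def triad_bytes(n, bytes_per_element=8):
    """Bytes counted for one TRIAD pass over arrays of n elements.

    Two arrays are read (b and c) and one is written (a).
    """
    return 3 * n * bytes_per_element


def triad_bandwidth_gbs(n, seconds):
    """Bandwidth in GB/s (1e9 bytes per second) for one TRIAD pass."""
    return triad_bytes(n) / seconds / 1e9
```

With the array size of 1800000000 listed in Table 1, one pass moves 43.2 GB of counted traffic, so a hypothetical pass time of 0.1 s would correspond to 432 GB/s.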
Figure 4: STREAM Performance across BIOS profiles
Figure 5: STREAM Performance over multiple generations of processors
As per Figure 4, “SNC = Enabled” delivers up to 3% better bandwidth than “SNC = Disabled” across all BIOS profiles. Figure 5 compares the memory bandwidth of the PowerEdge R930 server with Haswell-EX and Broadwell-EX processors against the PowerEdge R940 server with Skylake processors. Both Haswell-EX and Broadwell-EX support DDR3 and DDR4 memory, but the platform in this configuration supports a memory frequency of 1600MT/s for both generations of processors. Because DIMMs of the same memory frequency were used for both generations on the PowerEdge R930, Broadwell-EX and Haswell-EX have the same memory bandwidth, whereas Skylake shows a ~51% increase in memory bandwidth over Broadwell-EX. Two factors contribute to this: the use of 2666MT/s RDIMMs, which gives a ~66% increase in maximum memory bandwidth per channel compared to Broadwell-EX, and a ~50% increase in the number of memory channels per socket (6 channels per socket for Skylake versus 4 channels per socket for Broadwell-EX).
Figure 6: Comparing STREAM Performance with “SNC = Enabled”
Figure 7: Comparing STREAM Performance with “SNC = Disabled”
Figure 6 and Figure 7 describe the impact on memory bandwidth of traversing the UPI link to go across sockets on the PowerEdge R940 server. With “SNC = Enabled”, the local memory bandwidth and the remote-to-same-socket memory bandwidth are nearly the same (0-1% variation), but the remote-to-other-socket memory bandwidth is ~57% lower than the local memory bandwidth. With “SNC = Disabled”, remote memory bandwidth is 77% lower than local memory bandwidth.
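The two factors named above can be turned into a back-of-the-envelope calculation. This is a simplified model that assumes 8 bytes per transfer per channel and ignores memory buffers, DIMM population and achievable efficiency, so it gives an upper bound on the hardware ratio rather than a prediction of the measured STREAM gain.

```python
def peak_bandwidth_gbs(channels_per_socket, mt_per_s, sockets=4, bytes_per_transfer=8):
    """Theoretical peak memory bandwidth in GB/s for the whole server."""
    return sockets * channels_per_socket * mt_per_s * bytes_per_transfer / 1000


def pct_increase(new, old):
    """Relative increase of new over old, in percent."""
    return (new - old) / old * 100.0


if __name__ == "__main__":
    skylake = peak_bandwidth_gbs(6, 2666)    # 6 channels per socket at 2666MT/s
    broadwell = peak_bandwidth_gbs(4, 1600)  # 4 channels per socket at 1600MT/s
    print(f"frequency factor: +{pct_increase(2666, 1600):.1f}%")
    print(f"channel factor:   +{pct_increase(6, 4):.1f}%")
    print(f"combined peak:    {skylake:.1f} vs {broadwell:.1f} GB/s")
```

The measured ~51% gain is well below the combined theoretical ratio, which is expected because STREAM reports sustained rather than peak bandwidth and the two platforms differ in more than these two parameters.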
The Weather Research and Forecasting (WRF) Model is a mesoscale numerical weather prediction system designed for both atmospheric research and operational forecasting needs. It features two dynamical cores, a data assimilation system, and a software architecture facilitating parallel computation and system extensibility. The model serves a wide range of meteorological applications across scales from tens of meters to thousands of kilometers. WRF can generate atmospheric simulations using real data or idealized conditions. We used the CONUS12km and CONUS2.5km benchmark datasets for this study.
CONUS12km is a small, single-domain benchmark (a 48-hour, 12km resolution case over the Continental U.S. (CONUS) domain from October 24, 2001) with a 72-second time step. CONUS2.5km is a large, single-domain benchmark (the latter 3 hours of a 9-hour, 2.5km resolution case over the Continental U.S. (CONUS) domain from June 4, 2005) with a 15-second time step.
WRF decomposes the domain into tasks or patches. Each patch can be further decomposed into tiles that are processed separately, but by default there is only one tile for every run. If the single tile is too large to fit into the cache of the CPU and/or core, computation slows down because WRF is sensitive to memory bandwidth. To reduce the size of the tile, the number of tiles can be increased by defining “numtiles = x” in the input file or by defining the environment variable “WRF_NUM_TILES = x”. For both CONUS 12km and CONUS 2.5km, the number of tiles was chosen based on best performance.
Figure 8: WRF Performance across BIOS profiles (Conus12KM)
Figure 9: WRF Performance across BIOS profiles (Conus2.5KM)
Figure 8 and Figure 9 compare the WRF datasets on different BIOS profiles. With both the Conus 12KM and CONUS 2.5KM data, “Performance SNC = Enabled” gives the best performance. For both conus12km and conus2.5km, “SNC = Enabled” performs ~1%-2% better than “SNC = Disabled”. Performance is nearly equal across the BIOS profiles for Conus12Km because it uses a smaller dataset, while for CONUS2.5km, which has a larger dataset, we observe 1-2% performance variation across the system profiles because it uses a larger number of processors more efficiently.
Figure 10: Comparison over multiple generations of processors
Figure 11: Comparison over multiple generations of processors
Figure 10 and Figure 11 show the performance comparison between the PowerEdge R940 powered by Skylake processors and the PowerEdge R930 powered by Broadwell-EX and Haswell-EX processors. From these figures we can observe that for Conus12KM the PowerEdge R940 with Skylake performs ~18% better than the PowerEdge R930 with Broadwell-EX and ~45% better than with Haswell-EX processors. For Conus2.5KM, Skylake performs ~29% better than Broadwell-EX and ~38% better than Haswell-EX processors.
ANSYS Fluent is a computational fluid dynamics (CFD) software tool. Fluent includes well-validated physical modeling capabilities to deliver fast and accurate results across the widest range of CFD and multi physics applications.
Figure 12: Ansys Fluent Performance across BIOS profiles
We used five different datasets for our analysis: truck_poly_14m, combustor_12m, exhaust_system_33m, ice_2m and aircraft_wing_14m. The performance metric for Fluent is ‘Solver Rating’ (higher is better). Figure 12 shows that all datasets performed better with the “Performance SNC = Enabled” BIOS option than with the others. For all datasets, “SNC = Enabled” performs 2% to 4% better than “SNC = Disabled”.
Figure 13: Ansys Fluent (truck_poly_14m) performance over multiple generations of Intel processors
Figure 13 shows the performance comparison of truck_poly_14m on the PowerEdge R940 with Skylake processors and the PowerEdge R930 with Broadwell-EX and Haswell-EX processors. On the PowerEdge R940, Fluent showed 46% better performance than on the PowerEdge R930 with Broadwell-EX and 87% better performance than with Haswell-EX processors.
The PowerEdge R940 is a highly efficient 4-socket next generation platform that provides up to 122TB of storage capacity with up to 6.3TF of computing power, making it well suited to data-intensive applications without sacrificing performance. The Skylake processors give the PowerEdge R940 a performance boost over the previous generation server (PowerEdge R930): across the applications tested, the improvement ranges from ~18% (WRF Conus12KM versus Broadwell-EX) to ~192% (HPL versus Haswell-EX).
Comparing system profiles in the analysis above, the “Performance” profile gives better performance than the other system profiles.
In conclusion, the PowerEdge R940 with Skylake processors is a good platform for a wide variety of applications and can meet the demand for more compute power in HPC applications.
Authors: Rengan Xu, Frank Han and Nishanth Dandapanthu. Dell EMC HPC Innovation Lab. Feb 2017
Introduction to P100-PCIe GPU
This blog describes the performance analysis of NVIDIA® Tesla® P100™ GPUs on a cluster of Dell PowerEdge C4130 servers. There are two types of P100 GPUs: PCIe-based and SXM2-based. In PCIe-based servers, GPUs are connected by PCIe buses and one P100 delivers around 4.7 and 9.3 TeraFLOPS of double and single precision performance, respectively. In P100-SXM2, GPUs are connected by NVLink and one P100 delivers around 5.3 and 10.6 TeraFLOPS of double and single precision performance, respectively. This blog focuses on P100 for PCIe-based servers, i.e. P100-PCIe. We have already analyzed the P100 performance for several deep learning frameworks in this blog. The objective of this blog is to compare the performance of HPL, LAMMPS, NAMD, GROMACS, HOOMD-BLUE, Amber, ANSYS Mechanical and RELION. The hardware configuration of the cluster is the same as in the deep learning blog. Briefly, we used a cluster of four C4130 nodes; each node has dual Intel Xeon E5-2690 v4 CPUs and four NVIDIA P100-PCIe GPUs, and all nodes are connected with EDR InfiniBand. Table 1 shows the detailed information about the hardware and software used in every compute node.
Table 1: Experiment Platform and Software Details
PowerEdge C4130 (configuration G)
2 x Intel Xeon CPU E5-2690 v4 @2.6GHz (Broadwell)
256GB DDR4 @ 2400MHz
P100-PCIe with 16GB GPU memory
Mellanox ConnectX-4 VPI (EDR 100Gb/s Infiniband)
RHEL 7.2 x86_64
Linux Kernel Version
CUDA version and driver
CUDA 8.0.44 (375.20)
High Performance Linpack (HPL)
HPL is a multicomputer parallel application that measures how fast computers solve a dense n by n system of linear equations using LU decomposition with partial row pivoting, and it is designed to be run at very large scale. HPL on the tested cluster uses double precision floating point operations. Figure 1 shows the HPL performance on the tested P100-PCIe cluster. It can be seen that 1 P100 is 3.6x faster than 2 x E5-2690 v4 CPUs. HPL also scales very well with more GPUs within nodes or across nodes. Recall that there are 4 P100 GPUs within a server, so 8, 12 and 16 P100 GPUs correspond to 2, 3 and 4 servers. 16 P100 GPUs have a speedup of 14.9x compared to 1 P100. Note that the overall efficiency is calculated as: HPL Efficiency = rMax / (CPUs rPeak + GPUs rPeak), where rPeak is the highest theoretical FLOPS result and rMax is the number reported by HPL, i.e. the real performance that can be achieved. HPL cannot be run at the max boost clock. It typically runs at some clock in between, but the average is closer to the base clock than to the max boost clock. That is why the efficiency is not very high. Although we also included CPU rPeak in the efficiency calculation, when running HPL on P100 we set DGEMM_SPLIT=1.0, which means the CPU does not really contribute to the DGEMM and only handles other overhead, so it does not actually contribute many FLOPS. Although we observed that the CPUs stayed fully utilized, they were just handling the overhead and data movement needed to keep the GPUs fed. What matters most for P100 is that rMax is very high.
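The efficiency formula above can be written out as a short sketch. The rMax and CPU rPeak values in the usage note are placeholders chosen only to show the arithmetic, not results from this study; the 4.7 TFLOPS default is the approximate double-precision figure for one P100-PCIe quoted in the introduction.

```python
def gpu_rpeak_tflops(n_gpus, per_gpu_tflops=4.7):
    """Combined theoretical double-precision peak of the GPUs, in TFLOPS."""
    return n_gpus * per_gpu_tflops


def hpl_efficiency(rmax_tflops, cpu_rpeak_tflops, gpus_rpeak_tflops):
    """HPL Efficiency = rMax / (CPUs rPeak + GPUs rPeak)."""
    return rmax_tflops / (cpu_rpeak_tflops + gpus_rpeak_tflops)
```

For example, with four GPUs (18.8 TFLOPS), a hypothetical CPU rPeak of 1.2 TFLOPS and a hypothetical rMax of 15.0 TFLOPS, `hpl_efficiency(15.0, 1.2, gpu_rpeak_tflops(4))` evaluates to 0.75. Because the CPU rPeak stays in the denominator even when DGEMM_SPLIT=1.0 sends all DGEMM work to the GPUs, the reported efficiency is lower than a GPU-only ratio would be.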
Figure 1: HPL performance on P100-PCIe
NAMD (for NAnoscale Molecular Dynamics) is a molecular dynamics application designed for high-performance simulation of large biomolecular systems. The dataset we used is Satellite Tobacco Mosaic Virus (STMV), a small, icosahedral plant virus that worsens the symptoms of infection by Tobacco Mosaic Virus (TMV). This dataset has 1,066,628 atoms and is the largest dataset on the NAMD utilities website. The performance metric in the output log of this application is “days/ns” (lower is better), but its inverse, “ns/day”, is used in our plot since that is what most molecular dynamics users focus on. The average of all occurrences of this value in the output log was used. Figure 2 shows the performance within 1 node. It can be seen that the performance with 2 P100 GPUs is better than with 4 P100 GPUs. This is probably because of the communication among different CPU threads. The application launches a set of worker threads that handle the computation and a set of communication threads that handle the data communication. As more GPUs are used, more communication threads are used and more synchronization is needed. In addition, based on the profiling result from nvprof, NVIDIA’s CUDA profiler, with 1 P100 the GPU computation takes less than 50% of the whole application time. According to Amdahl’s law, the speedup with more GPUs is limited by the remaining 50% of the work that is not parallelized on the GPU. Based on this observation, we further ran this application on multiple nodes with two different settings (2 GPUs/node and 4 GPUs/node), and the result is shown in Figure 3. The result shows that no matter how many nodes are used, the performance of 2 GPUs/node is always better than that of 4 GPUs/node. Within a node, 2 P100 GPUs are 9.5x faster than dual CPUs.
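The Amdahl's law argument can be made concrete with a small sketch. It treats the GPU share of the runtime as the only part that speeds up with more GPUs and uses 50% as that share, which is the upper end of what the profiling suggests; the extra communication and synchronization overhead described above is not part of this model.

```python
def amdahl_speedup(accelerated_fraction, n):
    """Ideal speedup when only `accelerated_fraction` of the runtime scales with n."""
    serial = 1.0 - accelerated_fraction
    return 1.0 / (serial + accelerated_fraction / n)


if __name__ == "__main__":
    for gpus in (2, 4, 16):
        print(gpus, "GPUs ->", round(amdahl_speedup(0.5, gpus), 2), "x over 1 GPU")
```

With a 50% accelerated share, two GPUs give at most about 1.33x over one GPU, four give at most 1.6x, and no number of GPUs can exceed 2x, so even modest added overhead can erase the benefit of going from 2 to 4 GPUs.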
Figure 2: NAMD Performance within 1 P100-PCIe node
Figure 3: NAMD Performance across Nodes
GROMACS (for GROningen MAchine for Chemical Simulations) primarily simulates biochemical molecules (bonded interactions). But because of its efficiency in calculating non-bonded interactions (atoms not linked by covalent bonds), the user base is expanding to non-biological systems. Figure 4 shows the performance of GROMACS on CPU, K80 GPUs and P100-PCIe GPUs. Since one K80 has two internal GPUs, from now on “one K80” always refers to both internal GPUs rather than one of the two. When testing with K80 GPUs, the same servers used for the P100-PCIe GPUs were used. Therefore, the CPUs and memory were kept the same and the only difference is that the P100-PCIe GPUs were replaced with K80 GPUs. In all tests, there were four GPUs per server and all GPUs were utilized. For example, the 3 node data point is with 3 servers and 12 total GPUs. P100-PCIe is between 4.2x and 2.8x faster than CPU from 1 node to 4 nodes, and between 1.5x and 1.1x faster than K80 GPUs from 1 node to 4 nodes.
Figure 4: GROMACS Performance on P100-PCIe
LAMMPS (for Large Scale Atomic/Molecular Massively Parallel Simulator) is a classic molecular dynamics code, capable of simulations for solid-state materials (metals, semi-conductors), soft matter (biomolecules, polymers) and coarse-grained or mesoscopic systems. It can be used to model atoms or, more generically, as a parallel particle simulator at the atomic, meso or continuum scale. The dataset we used was LJ (Lennard-Jones liquid benchmark), which contains 512000 atoms. There are two GPU implementations in LAMMPS: the GPU library version and the Kokkos version. In the experiment, we used the Kokkos version since it was much faster than the GPU library version.
Figure 5 shows LAMMPS performance on CPU and P100-PCIe GPUs. Using 16 P100 GPUs is 5.8x faster than using 1 P100. This application did not scale linearly because the data transfer time (CPU->GPU, GPU->CPU and GPU->GPU) increases as more GPUs are used, although the computation time decreases linearly. The data transfer time increases because this application requires data communication among all GPUs used. However, the configuration G we used only allows Peer-to-Peer (P2P) access for two pairs of GPUs: GPU 1 - GPU 2 and GPU 3 - GPU 4. GPU 1/2 cannot communicate with GPU 3/4 directly. If such communication is needed, the data must go through the CPU, which slows the communication. Configuration B eases this issue as it allows P2P access among all four GPUs within a node. The comparison between configuration G and configuration B is shown in Figure 6. Running LAMMPS on a configuration B server with 4 P100 GPUs improved the performance metric “timesteps/s” to 510 compared to 505 in configuration G, a 1% improvement. The improvement is small because the data communication takes less than 8% of the whole application time when running on configuration G with 4 P100 GPUs. Figure 7 compares the performance of P100-PCIe with that of CPU and K80 GPUs for this application. Within 1 node, 4 P100-PCIe GPUs are 6.6x faster than 2 E5-2690 v4 CPUs and 1.4x faster than 4 K80 GPUs.
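Two quick calculations help put the LAMMPS numbers in context: the parallel efficiency implied by a speedup, and the most that could be gained by removing a phase that takes a given share of the runtime. This is a generic sketch rather than the authors' analysis, and the 8% input is the upper limit quoted in the text, not a measured value.

```python
def parallel_efficiency(speedup, n_units):
    """Fraction of ideal linear scaling achieved with n_units devices."""
    return speedup / n_units


def max_gain_if_removed(fraction):
    """Upper bound on speedup if a phase taking `fraction` of runtime cost nothing."""
    return 1.0 / (1.0 - fraction)


if __name__ == "__main__":
    print(f"16 GPUs at 5.8x: {parallel_efficiency(5.8, 16):.1%} efficiency")
    print(f"removing an 8% phase: at most {max_gain_if_removed(0.08):.3f}x")
```

A 5.8x speedup on 16 GPUs corresponds to roughly 36% parallel efficiency. Since communication takes less than 8% of the 4-GPU runtime, even eliminating it entirely could not yield more than about a 9% gain, which is consistent with the small improvement seen when moving to configuration B.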
Figure 5: LAMMPS Performance on P100-PCIe
Figure 6: Comparison between Configuration G and Configuration B
Figure 7: LAMMPS Performance Comparison
HOOMD-blue (for Highly Optimized Object-oriented Many-particle Dynamics - blue) is a general purpose molecular dynamics simulator. Figure 8 shows the HOOMD-blue performance. Note that the y-axis is in logarithmic scale. It is observed that 1 P100 is 13.4x faster than dual CPUs. The speedup of using 2 P100 GPUs is 1.5x compared to using only 1 P100, which is reasonable. However, with 4 P100 to 16 P100 GPUs, the speedup ranges only from 2.1x to 3.9x, which is not high. The reason is that, similar to LAMMPS, this application involves a lot of communication among all GPUs used. Based on the analysis for LAMMPS, using configuration B should reduce this communication bottleneck significantly. To verify this, we ran the same application again on a configuration B server. With 4 P100 GPUs, the performance metric “hours for 10e6 steps” was reduced to 10.2 compared to 11.73 in configuration G, a 13% performance improvement, and the speedup compared to 1 P100 improved from 2.1x to 2.4x.
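Because the applications in this post report different kinds of metrics, it helps to be explicit about how a percentage improvement is computed. The helper below is a small sketch: for a time-based metric such as HOOMD-blue's hours per run, lower is better, while for a rate such as LAMMPS timesteps/s or Amber ns/day, higher is better.

```python
def improvement_pct(old, new, lower_is_better=False):
    """Relative improvement of `new` over `old`, in percent."""
    if lower_is_better:
        return (old - new) / old * 100.0
    return (new - old) / old * 100.0


if __name__ == "__main__":
    # HOOMD-blue: hours for the run, configuration G -> configuration B
    print(f"HOOMD-blue: {improvement_pct(11.73, 10.2, lower_is_better=True):.1f}%")
    # LAMMPS: timesteps/s, configuration G -> configuration B
    print(f"LAMMPS:     {improvement_pct(505, 510):.1f}%")
```

These reproduce the roughly 13% and 1% improvements quoted for configuration B, and the same helper applied to the Amber figures (315 to 791 ns/day) gives about 151%.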
Figure 8: HOOMD-blue Performance on CPU and P100-PCIe
Amber is the collective name for a suite of programs that allow users to carry out molecular dynamics simulations, particularly on biomolecules. The term Amber is also used to refer to the empirical force fields that are implemented in this suite. Figure 9 shows the performance of Amber on CPU and P100-PCIe. It can be seen that 1 P100 is 6.3x faster than dual CPUs. Using 2 P100 GPUs is 1.2x faster than using 1 P100. However, the performance drops significantly when 4 or more GPUs are used. The reason is that, similar to LAMMPS and HOOMD-blue, this application relies heavily on P2P access, but configuration G only supports it within two pairs of GPUs. We verified this by testing the application again on a configuration B node. As a result, the performance with 4 P100 GPUs improved to 791 ns/day compared to 315 ns/day in configuration G, a 151% performance improvement and a speedup of 2.5x. But even in configuration B, the multi-GPU scaling is still not good enough. This is because when the Amber multi-GPU support was originally designed, the PCI-E bus speed was gen 2 x 16 and the GPUs were C1060 or C2050s. However, the current Pascal generation GPUs are > 16x faster than the C1060s, while the PCI-E bus speed has only increased by 2x (PCI Gen2 x 16 to PCI Gen3 x 16) and InfiniBand interconnects by about the same amount. The Amber website explicitly states that “It should be noted that while the legacy MPI and GPU-Direct methods of multi-GPU communication are still supported, and will be used by the code automatically if peer to peer communication is not available, you are very unlikely to see any speedup by using multiple GPUs for a single job if the GPUs are newer than C2050s. Multi-node runs are almost impossible to get to scale.” This is consistent with our multi-node results: Figure 9 shows that the more nodes are used, the worse the performance is.
Figure 9: Amber Performance on CPU and P100-PCIe
ANSYS® Mechanical™ software is a comprehensive finite element analysis (FEA) tool for structural analysis, including linear, nonlinear dynamic, hydrodynamic and explicit studies. It provides a complete set of element behavior, material models and equation solvers for a wide range of mechanical design problems. The finite element method is used to solve the partial differential equations which is a compute and memory intensive task. Our testing focused on the Power Supply Module (V17cg-1) benchmark. This is a medium sized job for iterative solvers and a good test for memory bandwidth. Figure 10 shows the performance of ANSYS Mechanical on CPU and P100-PCIe. It is shown that within a node, 4 P100 is 3.8x faster than dual CPUs. And with 4 nodes, 16 P100 is 2.3x faster than 8 CPUs. The figure also shows that the performance scales well with more nodes. The speedup with 4 nodes is 2.8x compared to 1 node.
Figure 10: ANSYS Mechanical Performance on CPU and P100-PCIe
RELION (for REgularised LIkelihood OptimisatioN) is a program that employs an empirical Bayesian approach to refinement of (multiple) 3D reconstructions or 2D class averages in electron cryo-microscopy (cryo-EM). Figure 11 shows the performance of RELION on CPU and P100-PCIe. Note that the y-axis is in logarithmic scale. It demonstrates that 1 P100 is 8.8x faster than dual CPUs. From the figure we also notice that it does not scale well starting from 4 P100 GPUs. Because of the long execution time, we did not profile this application, but it is possible that the reason for the weak scaling is similar to that for LAMMPS, HOOMD-blue and Amber.
Figure 11: RELION Performance on CPU and P100-PCIe
Conclusions and Future Work
In this blog, we presented and analyzed the performance of different applications on Dell PowerEdge C4130 servers with P100-PCIe GPUs. Among the tested applications, HPL, GROMACS and ANSYS Mechanical benefit from the balanced CPU-GPU configuration in configuration G, because they do not require P2P access among GPUs. However, LAMMPS, HOOMD-blue, Amber (and possibly RELION) rely on P2P accesses. Therefore, with configuration G, they scale well up to 2 P100 GPUs and then scale weakly with 4 or more P100 GPUs. With configuration B, they scale better than with configuration G at 4 GPUs, so configuration B is more suitable and is recommended for applications implemented with P2P accesses.
In future work, we will run these applications on P100-SXM2 and compare the performance difference between P100-PCIe and P100-SXM2.
Author: Joseph Stanfield
The purpose of this blog is to provide a comparative performance analysis of the Intel® Xeon® Gold 6150 processor and the previous generation Xeon® E5-2697 v4 processors using the NAMD benchmark. The Xeon® Gold 6150 CPU features 18 physical cores or 36 logical cores when utilizing hyper threading. This processor is based on Intel’s new micro-architecture codenamed “Skylake”. Intel significantly increased the L2 cache per core from 256 KB on Broadwell to 1 MB on Skylake. The 6150 also touts 24.75 MB of L3 cache and a six channel DDR4 memory interface.
Nanoscale Molecular Dynamics (NAMD) is an application developed using the Charm++ parallel programming model for molecular dynamics simulation. It is popular due to its parallel efficiency, scalability, and the ability to simulate millions of atoms.
Test Cluster Configurations:
Dell EMC PowerEdge C6420
Dell EMC PowerEdge C6320
2x Xeon® Gold 6150 18c 2.7 GHz (Skylake)
2x Xeon® E5-2697 v4 16c 2.3 GHz (Broadwell)
12x 16GB @2666 MHz
8x 16GB @2400 MHz
1 TB SATA
The benchmark dataset selected for this series of tests was the Satellite Tobacco Mosaic Virus, or STMV. STMV contains 1,066,628 atoms, which makes it ideal for demonstrating scaling to large clustered environments. The performance is measured in nanoseconds per day (ns/day), which is the number of nanoseconds of simulated time computed per day of wall-clock time. A larger value indicates faster performance.
The first series of benchmark tests measured CPU performance. The test environment consisted of a single node, two nodes, four nodes, and eight nodes, with the NAMD STMV dataset run three times for each configuration. The network interconnect between the nodes was EDR InfiniBand, as noted in the table above. Average results from a single node showed 0.70 ns/day, while for a two-node run performance increased by 80% to 1.25 ns/day. This trend of an average 80% increase in performance for each doubling of node count remained relatively consistent as the environment was scaled to eight nodes, as seen in Figure 1.
The second series of benchmarks compared the Xeon® Gold 6150 against the previous generation Xeon® E5-2697 v4. The same dataset, STMV, was used for both benchmark environments. As you can see below in Figure 2, the Xeon® Gold CPU results surpass the Xeon® E5 v4 by 111% on a single node, and the relative performance advantage decreases to 63% at eight nodes.
In this blog, we analyzed and presented the performance of a Dell EMC PowerEdge C6420 cluster scaling from a single node to eight nodes running NAMD with the STMV dataset. Results show that NAMD performance scales close to linearly with the number of nodes, gaining about 80% for each doubling of the node count.
At the time of publishing this blog, there is an issue with Intel Parallel Studio 2017.x and NAMD compilation. Intel recommends using Parallel Studio 2016.4 or 2018 (which is still in beta) with -xCORE-AVX512 under the FLOATOPS variable for best performance.
A comparative analysis was also conducted with the previous generation Dell EMC PowerEdge C6320 server and Xeon® E5 v4 (Broadwell) processor. The Xeon® Gold outperformed the E5 v4 by 111% on a single node and kept its lead as the cluster was scaled, with an advantage of 63% at eight nodes.
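To make the metric concrete, the sketch below converts between ns/day and its inverse, days/ns, and estimates how long a simulation of a given length would take at a given rate. The function names are illustrative only.

```python
def ns_per_day(days_per_ns):
    """Convert days of wall-clock time per simulated nanosecond into ns/day."""
    return 1.0 / days_per_ns


def days_to_simulate(target_ns, rate_ns_per_day):
    """Wall-clock days needed to simulate target_ns nanoseconds at the given rate."""
    return target_ns / rate_ns_per_day
```

At the single-node rate of 0.70 ns/day reported for this cluster, simulating 1 ns of STMV takes about 1.43 days of wall-clock time, which is why a higher ns/day value means faster performance.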
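The scaling trend described above can be expressed as a simple model in which every doubling of the node count multiplies throughput by a fixed factor. This is a sketch under that assumption, with 1.8 standing in for the roughly 80% gain per doubling; the projected values are model outputs, not measurements.

```python
import math


def projected_rate(single_node_rate, nodes, gain_per_doubling=1.8):
    """Projected ns/day assuming a constant gain for each doubling of node count."""
    return single_node_rate * gain_per_doubling ** math.log2(nodes)


def scaling_efficiency(nodes, gain_per_doubling=1.8):
    """Achieved speedup divided by ideal linear speedup."""
    return gain_per_doubling ** math.log2(nodes) / nodes


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        rate = projected_rate(0.70, n)
        print(f"{n} node(s): {rate:.2f} ns/day, {scaling_efficiency(n):.0%} efficiency")
```

Under this model, two nodes land at about 1.26 ns/day, close to the measured 1.25 ns/day, and eight nodes retain roughly 73% of ideal linear scaling.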
Intel NAMD Recipe: https://software.intel.com/en-us/articles/building-namd-on-intel-xeon-and-intel-xeon-phi-processor
Intel Fabric Tuning and Application Performance: https://www.intel.com/content/www/us/en/high-performance-computing-fabrics/omni-path-architecture-application-performance-mpi.html
Ed Bailey – Distinguished Eng., ESI Architecture, Dell EMC
“What if you could manage more devices, with fewer interfaces?”
In today’s large scale environments you are always managing more and more devices – more compute, more storage, more networking. As your infrastructure scales, management becomes increasingly complicated, time consuming, and expensive.
You need a new approach; one that simplifies your operations, your procurement, your service and your management. Buying integrated racks is key, but also critical is the ability to manage at a higher level – at rack scale. Rack scale management treats the entire rack as the unit of management, enabling faster scaling and more efficient resource utilization.
The Dell EMC DSS 9000 rack scale solution is unique in that it delivers both the extreme configuration flexibility large scale infrastructures require for a wide range of workloads and the simplified management and operations that improve their bottom-line efficiency. Part of how it does that is by making all aspects of rack scale management easy:
• Easy to Purchase - A consistent architecture and components streamline scale-out procurement
• Easy to Optimize - Highly flexible configuration options simplify multiple workload optimization
• Easy to Deploy - Complete racks are delivered pre-configured, pre-integrated & fully tested
• Easy to Manage - A single, open rack scale interface addresses the entire infrastructure
• Easy to Scale - You can add fully configured sleds or full racks in a single step
• Easy to Service - Modular components, cold aisle service, proven global service network
The rest of this blog describes in more detail how the rack scale approach can help you address the most pressing large scale infrastructure management challenges.
Simplicity
As part of a simplified rack scale solution, the DSS 9000 offers a single interface to manage all the compute and storage devices in a rack. This rack management interface is based on the industry-accepted open Redfish APIs. With a pre-integrated DSS 9000, you roll the rack into the data center, plug it in, and then begin provisioning the system - talking Redfish to a single point of management. Immediately you know everything that is in the rack – and immediately have access to it all. You understand what needs to be provisioned and can issue commands to each device. You can provision that rack as easily and quickly as possible.
Another aspect of the DSS 9000 solution is its powerful management infrastructure, with a gigabit management network that is independent of the data network and a Rack Manager module that consolidates rack-wide communication and vastly simplifies cabling. Instead of needing to communicate to each individual compute or storage node – you just talk to the Rack Manager, and instead of connecting to all the nodes individually, connections are consolidated at the block level and cabling is reduced.
Capability
DSS 9000 rack management gives you more capability for configuration and provisioning of the rack. The DSS 9000 implements Intel® Rack Scale Design (RSD) Pooled System Management Engine (PSME) APIs and firmware that enable a comprehensive inventory of all devices in the rack. This gives you the ability to manage with much greater efficiency using the Redfish APIs. For example, instead of performing firmware updates on each device in the rack individually, rack management allows you to update the firmware of all the devices in the rack at once.
Other significant rack management capabilities include power-capping of blocks and nodes to improve energy efficiency, as well as monitoring and control of cooling fans. Both of these capabilities can be administered at the block level to provide greater granularity of control.
Efficiency
Managing at the rack level also delivers efficiency by allowing you to define a full rack configuration that meets your infrastructure’s needs and employ it as the “unit of purchase” or the “unit of deployment”. This tremendously simplifies ongoing operational procurement and deployments. Improving time-to-value in this way, by reducing the time it takes to define, configure, order and deploy incremental infrastructure, provides another level of cost savings for you. At the same time, it has the added benefit of accelerating your organization’s responsiveness in delivering services and of improving reliability.
The modular block infrastructure of the DSS 9000 also delivers higher efficiency. You can define workload-specific configurations for compute or storage nodes that can be available for rapid scaling of your infrastructure as demand increases. For example, some nodes in your infrastructure may be optimally configured for Hadoop workload performance with specific processors, memory capacity and storage. When the need to scale arises, an identically preconfigured node can quickly and easily be added to the rack. Procurement and deployment are streamlined and management of the new node is consistent… and instantaneous. The newly introduced node is automatically inventoried and immediately accessible for management commands.
Conclusion
Simplicity, capability, efficiency: DSS 9000 rack management delivers answers to the challenges of administering IT infrastructure at massive scale. Inquiries about ESI rack scale solutions can be made at ESI@dell.com.
Dell provides a vast array of Enterprise Systems Management solutions for many different IT needs and use cases. With so many useful options available, it is not always immediately obvious which Dell OpenManage solution will work best for your environment and requirements, whether you need to deploy, update, monitor, or maintain systems.
Luckily, Dell recently created an advisor tool that will recommend which OpenManage products will work best for you based on (but not limited to) factors in your environment such as:
Once you complete a short questionnaire, the advisor will suggest the OpenManage tools that best suit your needs and provide useful information and links so that you can learn more.
Dell Systems Management Advisor
Other than the Systems Management Advisor, Dell TechCenter provides a wealth of information if you would like to evaluate our Systems Management technologies. Please visit our additional Dell OpenManage links:
This blog is written by Murugan Sekar & Revathi from Dell Hypervisor Engineering Team.
Dell EMC has introduced the next generation (14G) of PowerEdge servers, which support the Intel Xeon Processor Scalable Family (Skylake-SP). This blog highlights 14G Dell EMC PowerEdge server features related to VMware ESXi.
Note: Trusted Platform Module (TPM) 2.0 is not supported in ESXi 6.5 and ESXi 6.0 U3. Only TPM 1.2 is supported in the current releases of VMware ESXi.
Note: Set Memory Mapped I/O Base to 12TB in the BIOS settings in order to power on Windows VMs with GPU pass-through. Refer to the link for details.
Author: Joseph Stanfield
The purpose of this blog is to provide a comparative performance analysis of the Intel® Xeon® Gold 6150 processor (architecture code named “Skylake”) and the previous generation Xeon® E5-2697 v4 processor using the LAMMPS benchmark. The Xeon® Gold 6150 CPU features 18 cores, or 36 logical processors with Hyper-Threading enabled. Intel significantly increased the L2 cache per core from 256 KB on previous generations of Xeon to 1 MB. The new processor also touts 24.75 MB of L3 cache and a six-channel DDR4 memory interface.
LAMMPS, or Large-scale Atomic/Molecular Massively Parallel Simulator, is an open-source molecular dynamics program originally developed by Sandia National Laboratories, Temple University, and the United States Department of Energy. The main function of LAMMPS is to model particles in a gaseous, liquid, or solid state.
Test cluster configuration
Memory, Xeon® Gold 6150 nodes: 12x 16GB @2666 MT/s
Memory, Xeon® E5-2697 v4 nodes: 8x 16GB @2400 MT/s
The LAMMPS release used for testing was lammps-6June-17. The in.eam dataset was used for the analysis on both configurations. In.eam simulates a metallic solid: Cu EAM potential with 4.95 Angstrom cutoff (45 neighbors per atom), NVE integration. The simulation was executed using 100 steps with 32,000 atoms. The first series of benchmarks measured performance in timesteps/s. The test environment consisted of four servers interconnected with InfiniBand EDR. LAMMPS was run on a single node, two nodes, and four nodes, three times for each configuration. Average results from a single node showed 106 timesteps per second, while two nodes nearly doubled performance to 216 timesteps per second. This trend remained consistent as the environment was scaled to four nodes, as seen in Figure 1.
The second series of benchmarks was run to compare the Xeon® Gold against the previous generation, Xeon® E5 v4. The same dataset, in.eam, was used with 32,000 atoms and 100 steps per run. As you can see below in Figure 2, the Xeon® Gold CPU outperforms the Xeon® E5 v4 by about 120% with each test, but the performance increase drops slightly as the cluster is scaled.
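The scaling claim can be checked with a few lines of arithmetic. The sketch below computes speedup and parallel efficiency relative to a single node from the two averages quoted above (106 and 216 timesteps per second); the function name `scaling` is illustrative, and the four-node figure is left out because the text does not state it.

```python
def scaling(rates):
    """Map node count -> (speedup, parallel efficiency) relative to one node."""
    base = rates[1]
    return {n: (r / base, r / (base * n)) for n, r in rates.items()}


# Average timesteps/s reported in the text for one and two nodes.
measured = {1: 106.0, 2: 216.0}

for nodes, (speedup, efficiency) in sorted(scaling(measured).items()):
    print(f"{nodes} node(s): speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

With these inputs the two-node speedup comes out at about 2.04x, i.e. marginally better than linear, which is within what run-to-run variation could explain.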
In this blog, we analyzed and presented the performance of a Dell EMC PowerEdge C6420 cluster scaling from a single node to four nodes running the LAMMPS benchmark. Results show that performance of LAMMPS scales linearly with the increased number of nodes.
A comparative analysis was also conducted with the previous generation Dell EMC PowerEdge C6320 server with a Xeon® E5 v4 (Broadwell) processor. As with the first test, linear scaling of the application was observed on the Xeon® E5 v4 as the node count was increased, similar to the Xeon® Gold results. However, the Xeon® Gold processor outperformed the previous generation CPU by about 120% in each run.
Blog author: Chuck Armstrong, Dell EMC Storage Engineering
Dell EMC™ has just released a new Dell EMC Unity All-Flash portfolio: the 350F, 450F, 550F, and 650F.
These new all-flash arrays are based on the latest Intel® Broadwell chip. Additionally, they are loaded up with twice the memory and up to 40 percent more processor cores than previous Dell EMC Unity models. What does all of this mean for customers? Finally, a midrange all-flash storage platform built to get the most out of virtualized and mixed workloads.
Let’s talk about workloads:
If you plan on deploying Microsoft® SQL Server®, Exchange Server, Hyper-V®, or VMware vSphere® with one of the new Dell EMC Unity All-Flash array models, keep reading to find a trove of information.
There are several performance considerations that need to be understood when implementing Microsoft SQL Server on Dell EMC Unity All-Flash arrays to provide a highly efficient environment for the users. These considerations fall into the categories of database types, operating system settings and configuration, and storage design (layout for the database). All of this, and more, can be found in the Dell EMC Unity Storage with Microsoft SQL Server best practices paper.
Deploying Microsoft Exchange Server on Dell EMC Unity All-Flash arrays has its own set of considerations to maximize performance. One of these is the version of Microsoft Exchange being deployed, as different versions have differing performance characteristics. Another is the design and layout of the database and log locations. These considerations and many more can be found in the Dell EMC Unity Storage with Microsoft Exchange Server best practices paper.
If your environment utilizes Microsoft Hyper-V, the Dell EMC Unity Storage with Microsoft Hyper-V paper provides important best practices. Some of the many points of interest provided include guest virtual machine storage recommendations, virtual machine placement recommendations, and thin provisioning best practices.
For those environments where VMware vSphere is the hypervisor and deploying a new Dell EMC Unity All-Flash array is on the horizon, the Dell EMC Unity Storage with VMware vSphere best practices paper provides vital information to get the job done. Some items of interest found in this paper are: getting the most out of multipathing, configuring datastores (Fibre Channel, iSCSI, NFS, and VVol), and determining where to thin provision: within vSphere, within the storage, or both.
All of the best practices papers mentioned also provide information about several of the features available on the Dell EMC Unity All-Flash arrays. For additional information on features, and for a more general best practices guide, please check out the Dell EMC Unity Best Practices Guide.
iDRAC6, the Dell Remote Access Controller in the 11th generation of PowerEdge servers, supports TLS versions 1.0, 1.1 and 1.2 (cryptographic protocols designed to provide communications security over a computer network). Starting with firmware version 2.90 for Monolithic and version 3.85 for Modular, we have added the capability of optionally disabling TLS 1.0 in iDRAC6. This facilitates running the system in a highly secured environment, given the known security vulnerabilities in TLS 1.0.
TLS 1.0, along with SSL 3.0, is known to expose the system to the following security vulnerabilities:
POODLE, the vulnerability which could allow hackers to intercept and decrypt the traffic between a user's browser and an SSL-secured website.
BEAST, an attack in which an attacker can “decrypt” data exchanged between the two parties by taking advantage of a vulnerability in the implementation of the Cipher Block Chaining (CBC) mode in TLS 1.0, which allows a chosen-plaintext attack.
Disabling TLS1.0 provides the users an option to run the system with TLS1.1 and above, thereby isolating the system from the above mentioned vulnerabilities.
The capability to enable/disable TLS1.0 is supported only through the command line interface in iDRAC6 - RACadm. By default, TLS 1.0 is enabled.
Limitations of disabling TLS 1.0:
Certain versions of Windows OS may not support TLS1.1 and above by default. On such systems WSMan access to iDRAC6 may not work seamlessly.
More details, and the patches from Microsoft for certain OS versions to work with TLS1.1 and above:
Ashish Kumar Singh. Dell EMC HPC Innovation Lab. Aug 2017
This blog discusses the impact of the different BIOS tuning options available on Dell EMC 14th generation PowerEdge servers with the Intel Xeon® Processor Scalable Family (architecture codenamed “Skylake”) for some HPC benchmarks and applications. A brief description of the Skylake processor, BIOS options and HPC applications is provided below.
Skylake is a new 14nm “tock” processor in the Intel “tick-tock” series, which has the same process technology as the previous generation but with a new microarchitecture. Skylake requires a new CPU socket that is available with the Dell EMC 14th Generation PowerEdge servers. Skylake processors are available in two different configurations, with an integrated Omni-Path fabric and without fabric. The Omni-Path fabric supports network bandwidth up to 100Gb/s. The Skylake processor supports up to 28 cores, six DDR4 memory channels with speed up to 2666MT/s, and additional vectorization power with the AVX512 instruction set. Intel also introduces a new cache coherent interconnect named “Ultra Path Interconnect” (UPI), replacing Intel® QPI, that connects multiple CPU sockets.
Skylake offers a new, more powerful AVX512 vectorization technology that provides 512-bit vectors. The Skylake CPUs include models that support two 512-bit Fused Multiply-Add (FMA) units to deliver 32 Double Precision (DP) FLOPS/cycle and models with a single 512-bit FMA unit that is capable of 16 DP FLOPS/cycle. More details on AVX512 are described in the Intel programming reference. With 32 FLOPS/cycle, Skylake doubles the compute capability of the previous generation, Intel Xeon E5-2600 v4 processors (“Broadwell”).
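The FLOPS/cycle figures follow directly from the vector width and the number of FMA units. The sketch below reproduces that arithmetic; the function names are illustrative, and the peak-rate helper assumes every core sustains its nominal clock while running AVX512 code, which real processors generally do not, so its result is only an upper bound.

```python
def dp_flops_per_cycle(vector_bits=512, fma_units=2):
    """Double-precision FLOPs a core can retire per cycle."""
    lanes = vector_bits // 64        # 64-bit doubles per vector register
    return lanes * 2 * fma_units     # one FMA = multiply + add = 2 FLOPs per lane


def peak_gflops(sockets, cores_per_socket, ghz, flops_per_cycle):
    """Theoretical peak in GFLOPS, assuming all cores hold the given clock."""
    return sockets * cores_per_socket * ghz * flops_per_cycle


print(dp_flops_per_cycle(512, 2))   # Skylake models with two 512-bit FMA units
print(dp_flops_per_cycle(512, 1))   # Skylake models with one 512-bit FMA unit
print(dp_flops_per_cycle(256, 2))   # Broadwell: two 256-bit (AVX2) FMA units
```

The three calls give 32, 16 and 16 FLOPS/cycle respectively, which is where the "doubles the compute capability" statement comes from.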
Skylake processors are supported in the Dell EMC PowerEdge 14th Generation servers. The new processor architecture allows different tuning knobs, which are exposed in the server BIOS menu. In addition to existing options for performance and power management, the new servers also introduce a clustering mode called Sub NUMA clustering (SNC). On CPU models that support SNC, enabling SNC is akin to splitting the single socket into two NUMA domains, each with half the physical cores and half the memory of the socket. If this sounds familiar, it is similar in utility to the Cluster-on-Die (COD) option that was available in E5-2600 v3 and v4 processors as described here. SNC is implemented differently from COD, and these changes improve remote socket access in Skylake when compared to the previous generation. At the operating system level, a dual socket server with SNC enabled will display four NUMA domains. Two of the domains will be closer to each other (on the same socket), and the other two will be a larger distance away, across the UPI to the remote socket. This can be seen using OS tools like numactl -H.
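To make the "closer" and "farther" domains concrete, the sketch below parses the distance table that `numactl -H` prints and picks out the nearest remote node for a given node. The sample text is a hypothetical four-node layout written for illustration, not output captured from the server in this study, and the distance values in it are assumptions.

```python
# Hypothetical excerpt of `numactl -H` output for a two-socket server with
# SNC enabled. The distance values are illustrative only.
SAMPLE = """\
available: 4 nodes (0-3)
node distances:
node   0   1   2   3
  0:  10  11  21  21
  1:  11  10  21  21
  2:  21  21  10  11
  3:  21  21  11  10
"""


def parse_distances(text):
    """Return the NUMA distance matrix as a list of integer rows."""
    lines = text.splitlines()
    # Skip the "node distances:" line and the column header that follows it.
    start = next(i for i, l in enumerate(lines) if l.strip() == "node distances:") + 2
    matrix = []
    for line in lines[start:]:
        if ":" not in line:
            break
        matrix.append([int(v) for v in line.split(":", 1)[1].split()])
    return matrix


def nearest_remote(matrix, node):
    """Remote nodes with the smallest distance to `node` (same-socket peers)."""
    others = {n: d for n, d in enumerate(matrix[node]) if n != node}
    best = min(others.values())
    return [n for n, d in others.items() if d == best]


distances = parse_distances(SAMPLE)
print(nearest_remote(distances, 0))
```

On a live system the text could be obtained by running `numactl -H` and passing its standard output to `parse_distances`.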
In this study, we have used the Performance and PerformancePerWattDAPC system profiles based on our earlier experiences with other system profiles for HPC workloads. The Performance Profile aims to optimize for pure performance. The DAPC profile aims to balance performance with energy efficiency concerns. Both of these system profiles are meta options that, in turn, set multiple performance and power management focused BIOS options like Turbo mode, Cstates, C1E, Pstate management, Uncore frequency, etc.
We have used two HPC benchmarks and two HPC applications to understand the behavior of SNC and System Profile BIOS options with Dell EMC PowerEdge 14th generation servers. This study was performed with a single server only; cluster level performance deltas will be bounded by these single server results. The server configuration used for this study is described below.
Table 1: Test configuration of new 14G server
Server | PowerEdge C6420
Processor | 2 x Intel Xeon Gold 6150 – 2.7GHz, 18c, 165W
Memory | 192GB (12 x 16GB) DDR4 @2666MT/s
Hard drive | 1 x 1TB SATA HDD, 7.2k rpm
Operating System | Red Hat Enterprise Linux 7.3 (kernel - 3.10.0-514.el7.x86_64)
MPI | Intel® MPI 2017 update4
MKL | Intel® MKL 2017.0.3
Compiler | Intel® compiler 17.0.4
Table 2: HPC benchmarks and applications
Application | Version | Benchmark
HPL | From Intel® MKL | Problem size - 92% of total memory
STREAM | v5.04 | Triad
WRF | 3.8.1 | conus2.5km
ANSYS Fluent | v17.2 | truck_poly_14m, Ice_2m
As described above, a system with SNC enabled will expose four NUMA nodes to the OS on a two socket PowerEdge server. Each NUMA node can communicate with three remote NUMA nodes, two on the other socket and one within the same socket. NUMA domains on different sockets communicate over the UPI interconnect. With the 18-core Intel® Xeon Gold 6150 processor, each NUMA node will have nine cores. Since both sockets are equally populated in terms of memory, each NUMA domain will have one fourth of the total system memory.
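The node counts above can be derived mechanically. The following sketch, with an illustrative function name and under the assumption that memory is populated equally across sockets, computes the NUMA layout for the server in Table 1 with SNC enabled.

```python
def numa_layout(sockets, cores_per_socket, total_mem_gb, snc_enabled):
    """Describe the NUMA layout the OS sees (assumes evenly populated memory)."""
    domains_per_socket = 2 if snc_enabled else 1
    nodes = sockets * domains_per_socket
    return {
        "numa_nodes": nodes,
        "cores_per_node": cores_per_socket // domains_per_socket,
        "mem_gb_per_node": total_mem_gb / nodes,
        "remote_nodes_same_socket": domains_per_socket - 1,
        "remote_nodes_other_sockets": nodes - domains_per_socket,
    }


# Two sockets, 18 cores each, 192 GB total, SNC enabled (values from Table 1).
for key, value in numa_layout(2, 18, 192, True).items():
    print(f"{key}: {value}")
```

For this configuration the result is four NUMA nodes of nine cores and 48 GB each, with one remote node on the same socket and two on the other socket.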
Figure 1: Memory bandwidth with SNC enabled
Figure 1 plots the memory bandwidth with SNC enabled. Except SNC and logical processors, all other options are set to BIOS defaults. Full system memory bandwidth is ~195 GB/s on the two socket server. This test uses all available 36 cores for memory access and calculates aggregate memory bandwidth. The “Local socket – 18 threads” data point measures the memory bandwidth of single socket with 18 threads. As per the graph, local socket memory bandwidth is ~101 GB/s, which is about half of the full system bandwidth. By enabling SNC, a single socket is divided into two NUMA nodes. The memory bandwidth of a single SNC enabled NUMA node is noted by “Local NUMA node – 9 threads”. In this test, the nine local cores access their local memory attached to their NUMA domain. The memory bandwidth here is ~50 GB/s, which is half of the total local socket bandwidth.
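For context, the measured figures can be compared with the theoretical DDR4 peak. The sketch below assumes six memory channels per socket running at 2666 MT/s with a 64-bit (8-byte) data path per channel, and compares the resulting peak with the ~195 GB/s measured above; the function name is illustrative.

```python
def peak_bandwidth_gbs(channels, megatransfers_per_s, bytes_per_transfer=8):
    """Theoretical peak memory bandwidth in GB/s for one socket."""
    return channels * megatransfers_per_s * bytes_per_transfer / 1000.0


socket_peak = peak_bandwidth_gbs(6, 2666)   # six DDR4-2666 channels
system_peak = 2 * socket_peak               # two sockets
measured = 195.0                            # full-system figure from the text

print(f"per-socket peak: {socket_peak:.1f} GB/s")
print(f"system peak:     {system_peak:.1f} GB/s")
print(f"measured/peak:   {measured / system_peak:.0%}")
```

Under these assumptions the measured full-system bandwidth is roughly three quarters of the theoretical peak of about 256 GB/s.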
The data point “Remote to same socket” measures the memory bandwidth between two NUMA nodes on the same socket, with cores in one NUMA domain accessing the memory of the other NUMA domain. As per the graph, the server measures ~50 GB/s memory bandwidth for this case, the same as the “Local NUMA node – 9 threads” case. That is, with SNC enabled, memory access within the socket is similar in terms of bandwidth even across NUMA domains. This is a big difference from the previous generation, where there was a penalty when accessing memory on the same socket with COD enabled. See Figure 1 in the previous blog, where a 47% drop in bandwidth was observed, and compare that to the 0% performance drop here. The “Remote to other socket” test involves cores in one NUMA domain accessing the memory of a remote NUMA node on the other socket. This bandwidth is 54% lower due to non-local memory access over the UPI interconnect.
These memory bandwidth tests are interesting, but what do they mean? As in previous generations, SNC is a good option for codes that have high NUMA locality. Reducing the size of the NUMA domain can help some codes run faster due to fewer snoops and cache coherence checks within the domain. Additionally, the penalty for remote accesses on Skylake is not as bad as it was for Broadwell.
Figure 2: Comparing Sub-NUMA clustering with DAPC
Figure 2 shows the effect of SNC on multiple HPC workloads; note that all of these have good memory locality. All options except SNC and Hyper Threading are set to BIOS default. SNC disabled is considered as the baseline for each workload. As per Figure 2, all tests measure no more than 2% higher performance with SNC enabled. Although this is well within the run-to-run variation for these applications, SNC enabled consistently shows marginally higher performance for STREAM, WRF and Fluent for these datasets. The performance delta will vary for larger and different datasets. For many HPC clusters, this level of tuning for a few percentage points might not be worth it, especially if applications with sub-optimal memory locality will be penalized.
The Dell EMC default setting for this option is “disabled”, i.e. two sockets show up as just two NUMA domains. The HPC recommendation is to leave this at disabled to accommodate multiple types of codes, including those with inefficient memory locality, and to test this on a case-by-case basis for the applications running on your cluster.
Figure 3 plots the impact of different system profiles on the tests in this study. For these studies, all BIOS options are default except system profiles and logical processors. The DAPC profile with SNC disabled is used as the baseline. Most of these workloads show similar performance on both the Performance and DAPC system profiles. Only HPL performance is higher, by a few percent, with the Performance profile. As per our earlier studies, the DAPC profile always consumes less power than the Performance profile, which makes it suitable for HPC workloads without compromising too much on performance.
Figure 3: Comparing System Profiles
Figure 4 shows the power consumption of different system profiles with SNC enabled and disabled. The HPL benchmark is suited to put stress on the system and utilize the maximum compute power of the system. We have measured idle power and peak power consumption with logical processor set to disabled.
Figure 4: Idle and peak power consumption
As per Figure 4, DAPC Profile with SNC disabled shows the lowest idle power consumption relative to other profiles. Both Performance and DAPC system profiles consume up to ~5% lower power in idle status with SNC disabled. In idle state, Performance Profile consumes ~28% more power than DAPC.
The peak power consumption is similar with SNC enabled and with SNC disabled. Peak power consumption in DAPC Profile is ~16% less than in Performance Profile.
The Performance system profile is still the best profile for achieving maximum performance on HPC workloads. However, the power saved with DAPC is larger than the performance gained with the Performance profile, which makes DAPC the most suitable system profile.