Our community is talking about the new Dell Technologies. Join the discussion in the Dell EMC Community Network:
13G/14G server performance comparisons with Dell EMC Isilon and Lustre Storage
Variant calling is a process by which we identify variants from sequence data. This process helps determine if there is single nucleotide polymorphisms (SNPs), insertions and deletions (indels), and/or structural variants (SVs) at a given position in an individual genome or transcriptome. The main goal of identifying genomic variations is linking to human diseases. Although not all human diseases are associated with genetic variations, variant calling can provide a valuable guideline for geneticists working on a particular disease caused by genetic variations. BWA-GATK is one of the Next Generation Sequencing (NGS) computational tools that is designed to identify germline and somatic mutations from human NGS data. There are a handful of variant identification tools, and we understand that there is not a single tool performs perfectly (Pabinger et al., 2014). However, we chose GATK which is one of most popular tools as our benchmarking tool to demonstrate how well Dell EMC Ready Bundle for HPC Life Sciences can process complex and massive NGS workloads.
The purpose of this blog is to provide valuable performance information on the Intel® Xeon® Gold 6148 processor and the previous generation Xeon® E5-2697 v4 processor using a BWA-GATK pipeline benchmark on Dell EMC Isilon F800/H600 and Dell EMC Lustre Storage. The Xeon® Gold 6148 CPU features 20 physical cores or 40 logical cores when utilizing hyper threading. This processor is based on Intel’s new micro-architecture codenamed “Skylake”. Intel significantly increased the L2 cache per core from 256 KB on Broadwell to 1 MB on Skylake. The 6148 also touts 27.5 MB of L3 cache and a six channel DDR4 memory interface. The test cluster configurations are summarized in Table 1.
The test clusters and F800/H600 storage systems were connected via 4 x 100GbE links between two Dell Networking Z9100-ON switches. Each of the compute nodes was connected to the test cluster side Dell Networking Z9100-ON switch via single 10GbE. Four storage nodes in the Dell EMC Isilon F800/H600 were connected to the other switch via 8x 40GbE links. The configuration of the storage is listed in Table 2. For the Dell EMC Lustre Storage connection, four servers (a Metadata Server pair (MDS) and an Object Storage Server pair (OSS)) were connected to 13/14th Generation servers via Intel® Omni-Path. The detailed network topology is illustrated in Figure 1.
Figure 1 Illustration of Networking Topology: only 8 compute nodes are illustrated here for the simplicity, 64 nodes of 13G/14G servers used for the actual tests.
2x Dell EMC PowerEdge R730 as MDS
Front end network: 40GbE
Back end network: 40GbE
Back end network: IB QDR
Front end network: Intel® Omni-Path
Internal connections: 12Gbps SAS
The test data was chosen from one of Illumina’s Platinum Genomes. ERR091571 was processed with Illumina HiSeq 2000 submitted by Illumina and can be obtained from EMBL-EBI. The DNA identifier for this individual is NA12878. Although the description of the data from the linked website shows that this sample has a 30x depth of coverage, in reality it is closer to 10x coverage according to the number of reads counted.
Dell EMC PowerEdge C6320s and C6420s were configured as listed in Table 1. The tests performed here are designed to demonstrate performance at the server level, not for comparisons on individual components. At the same time, the tests were also designed to estimate the sizing information of Dell EMC Isilon F800/H600 and Dell EMC Lustre Storage. The data points in Figure 2 are calculated based on the total number of samples (X axis in the figure) that were processed concurrently. The number of genomes per day metric is obtained from total running time taken to process the total number of samples in a test. The smoothed curves are generated by using a polynomial spline with the piecewise polynomial degree of 3 generating B-spline basis matrix. The details of BWA-GATK pipeline information can be obtained from the Broad Institute web site.
Figure 2 BWA-GATK Benchmark over 13/14 generation with Dell EMC Isilon and Dell EMC Lustre Storage: the number of compute nodes used for the tests are 64x C6420s and 63x C6320s (64x C6320s for testing H600). The number of samples per node was increased to get the desired total number of samples processed concurrently. For C6320 (13G), 3 samples per node was the maximum number of samples each node can process. 64, 104, and 126 test results for 13G system (blue) were with 2 samples per node while 129, 156, 180, 189 and 192 sample test results were obtained from 3 samples per node. For C6420 (14G), the tests were performed with maximum 5 samples per node. The plot for 14G was generated by processing 1, 2, 3, 4, and 5 samples per node. The number of samples per node is limited by the amount of memory in a system. 128 GB and 192 GB of RAM were used in 13G and 14G system, respectively as shown in Table 1. C6420s show a better scaling behavior than C6320s. 13G server with Broadwell CPUs seems to be more sensitive to the number of samples loaded onto system as shown from the results of 126 vs 129 sample tests on all the storages tested in this study.
The results with Dell EMC Isilon F800 that indicate C6320 with Broadwell/128GB RAM performs roughly 50 genomes per day less when 3 samples are processed per compute node and 30 genomes per day less when 2 samples are processed in each compute node compared to C6420. It is not clear if C6320’s performance will drop again when more samples are added to each compute node; however, it is obvious that C6420 does not show this behavior when the number of samples is increased on each compute node. The results also allow estimating the maximum performance of Dell EMC Isilon F800. As the total number of genomes increases, the increment of the number of genomes per day metric is slow down. Unfortunately, we were not able to identify the exact number of C6420s that would saturate Dell EMC Isilon F800 with four nodes. However, it is safe to say that more than 64x C6420s will require additional Dell EMC Isilon F800/H600 nodes to maintain high performance with more than 320 10x whole human genome samples. Dell EMC Lustre Storage did not scale as well as Dell EMC Isilon F800/H600. However, we observed that some optimizations are necessary to make Dell EMC Lustre Storage perform better. For example, the aligning, sorting, and marking duplicates steps in the pipeline perform extremely well when the file system’s stripe size was set to 2MB while other steps perform very poorly with 2MB stripe size. This suggests that Dell EMC Lustre Storage needs to be optimized further for these heterogeneous workloads in the pipeline. Since there is not any concrete configuration for the pipeline, we will further investigate the idea of using multiple tier file systems to cover different requirements in each step for both Dell EMC Isilon and Lustre Storage.
Dell EMC PowerEdge C6320 with Dell EMC Isilon H600 performance reached the maximum around 140 concurrent 10x human whole genomes. Running three 10x samples concurrently on a single node is not ideal. This limit appears to be on the compute node side, since H600 performance is much better with C6420s running a similar number of samples.
Dell EMC PowerEdge C6420 has at least a 12% performance gain compared to the previous generation. Each C6420 compute node with 192 GB RAM can process about seven 10x whole human genomes per day. This number could be increased if the C6420 compute node is configured with more memory. In addition to the improvement on the 14G server side, four Isilon F800 nodes in a 4U chassis can support 64x C6420s and 320 10x whole human genomes concurrently.
Internal web page http://en.community.dell.com/techcenter/blueprints/blueprint_for_hpc/m/mediagallery/20442903
External web page https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3956068/
Contacts Americas Kihoon Yoon Sr. Principal Systems Dev Eng Kihoon.Yoon@dell.com +1 512 728 4191
Authors: Rengan Xu, Frank Han, Nishanth Dandapanthula
Dell EMC HPC Innovation Lab. February 2018
The Dell EMC PowerEdge R740 is a 2-socket, 2U rack server. The system features the Intel Skylake processors, up to 24 DIMMs, and up to 3 double width or 6 single width GPUs. In our previous blog Deep Learning Inference on P40 vs P4 with SkyLake, we presented the deep learning inference performance on Dell EMC’s PowerEdge R740 server with P40 and P4 GPUs. This blog will present the performance of the deep learning training performance on single R740 with multiple V100-PCIe GPUs. The deep learning frameworks we benchmarked include Caffe2, MXNet and Horovod+TensorFlow. Horovod is a distributed framework for TensorFlow. We used Horovod because it has better scalability implementation (using MPI model) than TensorFlow, which has been explained in the article “Meet Horovod: Uber’s Open Source Distributed Deep Learning Framework for TensorFlow”. Table 1 shows the hardware configuration and software details we tested. To test the deep learning performance and scalability on R740 server, we used the same neural network, the same dataset and the same measurement as in our other deep learning blog series such as Scaling Deep Learning on Multiple V100 Nodes and Deep Learning on V100.
Table 1: The hardware configuration and software details
The Figure 1, Figure 2 and Figure 3 show the Resnet50 performance and speedup of multiple V100 GPUs with Caffe2, MXNet and TensorFlow, respectively. We can obtain the following conclusions based on these results:
Overall the performance of Resnet50 scales well on multiple V100 GPUs within one node. With 3 V100:
Caffe2 achieved the speedup of 2.61x and 2.65x in FP32 and FP16 mode, respectively.
MXNet achieved the speedup of 2.87x and 2.82x in FP32 and FP16 mode, respectively.
Horovod+TensorFlow achieved the speedup of 2.12x in FP32 mode. (FP16 still under development)
Figure 1: Caffe2: Performance and speedup of V100
Figure 2: MXNet: Performance and speedup of V100
Figure 3: TensorFlow: Performance and speedup of V100
In this blog, we presented the deep learning performance and scalability of popular deep learning frameworks like Caffe2, MXNet and Horovod+TensorFlow. Overall the three frameworks scale as expected on all GPUs within single R740 server.
Four technical white papers were recently published describing the recently released Dell EMC Ready Bundle for HPC Digital Manufacturing. These papers discuss the architecture of the system as well as the performance of ANSYS® Mechanical™, ANSYS® Fluent® and ANSYS® CFX®, LSTC LS-DYNA®, Simcenter STAR-CCM+™ and Dassault Systѐmes’ Simulia Abaqus on the Dell EMC Ready Bundle for HPC.
Dell EMC Ready Bundle for HPC Digital Manufacturing–ANSYS Performance
Dell EMC Ready Bundle for HPC Digital Manufacturing–LSTC LS-DYNA Performance
Dell EMC Ready Bundle for HPC Digital Manufacturing–Simcenter STAR-CCM+ Performance
Dell EMC Ready Bundle for HPC Digital Manufacturing–Dassault Systѐmes’ Simulia Abaqus Performance
Authors: Frank Han, Rengan Xu, Nishanth Dandapanthula.
HPC Innovation Lab. February 2018
Not long ago, PowerEdge R740 Server was released as part of Dell’s 14th Generation server portfolio. It is a 2U Intel SkyLake based rack mount server and provides the ideal balance between storage, I/O and application acceleration. Besides VDI and Cloud, the server is also designed for HPC workloads. Compared to the previous R730 server, one of the major changes on GPU support is that, R740 supports up to 3 double width cards, which is one more than what R730 could support. This blog will focus on the performance of a single R740 server with 1, 2 and 3 Nvidia Tesla V100-PCIe GPUs. Multiple cards scaling number for HPC applications like High-Performance Linpack (HPL), High Performance Conjugate Gradients benchmark (HPCG) and Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS) will be presented.
Table1: Details of R740 configuration and software version
2 x Intel(R) Xeon(R) Gold 6150 @2.7GHz, 18c
Red Hat Enterprise Linux Server release 7.3
Nvidia Tesla V100-PCIe
Processor Settings > Logical Processors
High Performance Linpack (HPL)
Figure 1: HPL Performance and efficiency with R740
Figure 1 shows HPL performance and efficiency numbers. Performance data increases with multiple GPUs nearly linearly. Efficiency line isn’t flat, the peak 67.8% appears with the 2-card, which means configuration with 2 GPUs is the most optimized one for HPL. Number of 1 and 3 cards are about 7% lower than 2 card and they are affected by different factors:
For the 1 card case, this GPU based HPL application designs to bond CPU and GPU. While running in large scale with multiple GPU cards and nodes, it is known to make data access more efficient to bond GPU to the CPU. But testing with only 1 V100 here is a special case that only the 1st CPU bonded with the only GPU, and no workload on the 2nd CPU. Comparing with 2 cards result, the non-using part in Rpeak increased so efficiency dropped. This doesn’t mean GPU less efficient, just because HPL application designs and optimizes for large scales, and 1 GPU only situation is special for HPL.
Figure 2: HPCG Performance with R740
As shown in Figure 2, comparing with dual Xeon 6150 CPU only performance, single V100 is already 3 times faster, 2 V100 is 8 times and 3 V100 (CPU only node vs GPU node) is nearly 12 times. Just like HPL, HPCG is also designed for a large scale, so single card performance isn’t as efficient as multiple cards on HPCG. But unlike HPL, 3 card performance on HPCG is linearly scaled. It is 1.48 times higher than 2 cards’, which is very close to the theoretical 1.5 times. This is because all the HPCG workload is run on GPU and its data fits in GPU memory. This proves application like HPCG can take the advantage of having the 3rd GPU.
Figure 3: LAMMPS Performance with R740
The LAMMPS version used for this testing is 17Aug2017, which is the latest stable version at the time of testing. The testing dataset is in.intel.lj, which is the same one in all pervious GPU LAMMPS testing and it can be found here. With the same parameters set from previous testing, the initial values of space were x=4, y=z=2, the simulation executes with 512k atoms. Weak scaling obvious as timesteps/s number of 2 and 3 cards only 1.5 and 1.7 times than single card’s. The reason for this is that the workload isn’t heavy enough for 3 V100 GPUs. After adjusting all x,y,z to 8, 16M atoms generated in simulation, and then the performance scaled well with multiple cards. As shown in Figure 3, 2 and 3 cards is 1.8 and 2.4 times faster than single card, respectively. This results of LAMMPS is another example for GPU accelerated HPC applications that can benefit from having more GPUs in the system.
The R740 server with multiple Nvidia Tesla V100-PCIe GPUs demonstrates exceptional performance for applications like HPL, HPCG and LAMMPS. Besides balanced I/O, R740 has the flexibility for running HPC applications with 1, 2 or 3 GPUs. The newly added support for an additional 3rd GPU provides more compute power as well as larger total memory in GPU. Many applications work best when data fits in GPU memory and having the 3rd GPU allows fitting larger problems with R740.
PowerEdge R740 Technical Guide: http://i.dell.com/sites/doccontent/shared-content/data-sheets/en/Documents/PowerEdge_R740_R740xd_Technical_Guide.pdf
We often look at containers today and see the potential of accelerated application delivery, scaling and portability and wonder how we ever got by without them. This blog discusses the performance of LSTC LS-DYNA® within Singularity containers and on bare metal. This blog is the third in the series of blogs regarding container technology by the HPC Engineering team. The first blog Containers, Docker, Virtual Machines and HPC explored the use of containers. The second blog Containerizing HPC Applications with Singularity provided an introduction to Singularity and discussed the challenges of using Singularity in HPC environments. This third blog will focus on determining if there is a performance penalty while running the application (LS-DYNA) in a containerized environment.
Containerizing LS-DYNA using Singularity
An application specific container is a lightweight bundle of an application and its dependencies. The primary advantages of an application container are its portability and reproducibility. When we started assessing what type of workloads/applications could be containerized using singularity, interestingly enough, the first application that we tried to containerize using singularity was Fluent®, which presented a constraint. Since Fluent bundles MPI libraries, mpirun has to be invoked from within the container, instead of calling mpirun from outside the container. Adoption of containers is difficult in this scenario and requires an sshd wrapper. So we shifted gears, and started working on LS-DYNA which is a general-purpose finite element analysis (FEA) program, capable of simulating complex real world problems. LS-DYNA consists of a single executable file and is entirely command line driven. Therefore, all that is required to run LS-DYNA is a command line shell, the executable, an input file, and enough disk space to run the calculation. It is used to simulate a whole range of different engineering problems using its fully automated analysis capabilities. LS-DYNA is used worldwide in multiple engineering disciplines such as automobile, aerospace, construction, military, manufacturing, and bioengineering industries. It is robust and has worked extremely well over the past 20 years in numerous applications such as crashworthiness, drop testing and occupant safety.
The first step in creating the container for LS-DYNA would be to have a minimal operating system (CentOS 7.3 in this case), basic tools to run the application and support for InfiniBand within the container. With this, the container gets a runtime environment, system tools, and libraries. The next step would be to install the application binaries within the container. The definition file used to create the LS-DYNA container is given below. In this file, the bootstrap references the kind of base to be used and a number of options are available. “shub” pulls images hosted on Singularity Hub, “docker” pulls images from Docker Hub and here yum is used to install Centos-7.
# basic-tools and dev-tools
yum -y install evince wget vim zip unzip gzip tar perl
yum -y groupinstall "Development tools" --nogpgcheck
# InfiniBand drivers
yum -y --setopt=group_package_types=optional,default,mandatory groupinstall "InfiniBand Support"
yum -y install rdma yum -y install libibverbs-devel libsysfs-devel
yum -y install infinipath-psm
# Platform MPI
mkdir -p /home/inside/platform_mpi
chmod -R 777 /home/inside/platform_mpi
chmod 777 platform_mpi-09.1.0.1isv.x64.bin
./platform_mpi-09.1.0.1isv.x64.bin -installdir=/home/inside/platform_mpi -silent
mkdir -p /home/inside/lsdyna/code
chmod -R 777 /home/inside/lsdyna/code
cd /home/inside/lsdyna/code/usr/bin/wget http://192.168.41.41/ls-dyna_mpp_s_r9_1_113698_x64_redhat54_ifort131_avx2_platformmpi.tar.gz
tar -xvf ls-dyna_mpp_s_r9_1_113698_x64_redhat54_ifort131_avx2_platformmpi.tar.gz
export PATH LSTC_LICENSE LSTC_MEMORY
It is quick and easy to build a ready to use singularity image using the above definition file. In addition to the environment variables specified in the definition file, the value of any other variable such as LSTC_LICENSE_SERVER, can be set inside the container, with the use of the prefix “SINGULARITYENV_”. Singularity adopts a hybrid model for MPI support and the ‘mpirun’ is called, from outside the container, as follows:
/home/nirmala/platform_mpi/bin/mpirun -env_inherit -np 80 -hostfile hostfile singularity exec completelsdyna.simg /home/inside/lsdyna/code/ls-dyna_mpp_s_r9_1_113698_x64_redhat54_ifort131_avx2_platformmpi i=Caravan-V03c-2400k-main-shell16-120ms.k memory=600m memory2=60m endtime=0.02
LS_DYNA car2car Benchmark Test
The car2car benchmark presented here is a simulation of a two vehicle collision. The performance results for LS-DYNA are presented by using the Elapsed Time metric. This metric is the total elapsed time for the simulation in seconds as reported by LS-DYNA. A lower value represents better performance. The performance tests were conducted on two clusters in the HPC Innovation Lab, one with Intel® Xeon® Scalable Processor Family processors (Skylake) and another with Intel® Xeon® E5-2600 v4 processors (Broadwell). The software installed on the Skylake cluster is shown in Table 1
The configuration details of the Skylake cluster are shown below in Table 2.
Figure1 shows that there is no perceptible performance loss while running LS-DYNA inside a containerized environment, both on a single node and across the four node cluster. The results for the bare-metal and container tests are within 2% of each other, which is within the expected run-to-run variability of the application itself.
Figure 1 Performance inside singularity containers relative to bare metal on Skylake Cluster.
The software installed on the Broadwell cluster is shown in Table 3
The configuration details of the Broadwell cluster are shown below in Table 4 .
Figure 2 shows that there is no significant performance penalty while running LS-DYNA car2car inside Singularity containers on Broadwell cluster. The performance delta is within 2% only.
Figure 2 Performance inside singularity containers relative to bare metal on Broadwell Cluster
Conclusion and Future Work
In this blog, we discussed how the performance of LS-DYNA within singularity containers is almost at par with running the application on bare metal servers. The performance difference while running LS-DYNA within Singularity containers remains within 2%, which is within the run-to-run variability of the application itself.. The Dell EMC HPC team will focus on containerizing other applications, and this blog series will be continued. So stay tuned for the next blog in this series!
Two white papers were recently published, which presented comprehensive performance studies of both Dell EMC lsilon F800 and H600 storage systems.
“DELL EMC ISILON F800 AND H600 I/O PERFORMANCE” describes sequential and random I/O performance results for Dell EMC Isilon F800 and H600 node types. The data is intended to inform administrators on the suitability of Isilon storage clusters for HPC various workflows.
“DELL EMC ISILON F800 AND H600 WHOLE GENOME ANALYSIS PERFORMANCE” describes whole genome analysis performance results for Dell EMC Isilon F800 and H600 storage clusters (4 Isilon nodes per cluster). The data is intended to inform administrators on the suitability of Isilon storage clusters for high performance genomic analysis.
HPC system configuration can be a complex task, especially at scale, requiring a balance between user requirements, performance targets, data center power, cooling and space constraints and pricing. Furthermore, many choices among competing technologies complicates configuration decisions. The document below describes a modular approach to HPC system design at scale where sub-systems are broken down into modules that integrate well together. The design and options for each module are backed by measured results, including application performance. Data center considerations are taking into account during design phase. Enterprise class services, including deployment, management and support, are available for the whole HPC system.
This blog post is written by Amardeep Kahali and Krishnaprasad K from Dell Hypervisor Engineering team.
This is a continuation of the blog post "DellEMC factory install changes for VMware ESXi" . DellEMC 14th generation of servers shipped from factory with Dell customized VMware ESXi pre-installed has service tag set as the root user password by default.
From October 2017 onwards, DellEMC modified the default DCUI (Direct Console User Interface) screen to proactively communicate this change. A sample screenshot of the same is as below.
NOTE: Dell 13th generation of servers shipped from factory with Dell customized VMware ESXi pre-installed continue to have no password set for the root user.
This blog was originally posted by Cantwell, Thomas.
Dell has introduced the AMD Epyc server processor to Dell PowerEdge servers – What to know when installing and using operating systems on the R6415, R7415, and R7425 servers.
Dell PowerEdge servers have traditionally used Intel processors, but the new AMD Epyc processor is now an option on the server models listed above, and there are some specific nuances you should to be aware of when installing/using some operating systems.
General OSDell AMD servers will initially ship with only UEFI mode. Legacy bios mode will not be an option. This will limit OS selections to those that are UEFI-capable.WindowsTo install the Windows Server 2016 RTM using the OS install media:
1. Go to System Settings (Press F2 during the POST), disable the Virtualization Technology for the processor settings. 2. Install Windows Server 2016 with the OS media. 3. Apply latest Update Rollup available via Windows Update. Windows Server 2016 RTM requires a hotfix (Available on June 2017 Update Rollup and up) for a known AMD IOMMU issue. For Windows Server version 1709 deployment, you no longer need this hot fix.4. Go to System Settings, and enable Virtualization Technology for processor.
If you wish to use Windows live debugging and want to use USB 3.0 as your debug transport, for the R6415 and R7415 platforms, only the internal USB port on the motherboard can be used for debugging, so the case must be opened. On the R7425, no USB ports can be used for USB debugging and you must resort to serial debugging for live debug.VMWare For VMWare, be sure you update to the latest release:* Navigate to the Platform’s support page and under Drivers, change OS to ESXi 6.5 and then in the Enterprise Solutions category, download the latest image.
LinuxFor RHEL7.4, there is a bug that will prevent system boot up due to a kernel panic. The fix for the kernel panic is in this errata kernel:
To function efficiently in an HPC environment, a cluster of compute nodes must work in tandem to compile complex data and achieve desired results. The user expects each node to function at peak performance as an individual system, as well as a part of an intricate group of nodes processing data in parallel. To enable efficient cluster performance, we first need good single system performance. With that in mind, we evaluated the impact of different memory configurations on single node memory bandwidth performance using the STREAM benchmark. The servers used here support the latest Intel Skylake processor (Intel Scalable Processor Family) and are from the Dell EMC 14th generation (14G) server product line.
The Skylake processor has a built-in memory controller similar to previous generation Xeons but now supports *six* memory channels per socket. This is an increase from the four memory channels found in previous generation Xeon E5-2600 v3 and E5-2600 v4 processors. Different Dell EMC server models offer a different number of memory slots based on server density, but all servers offer at least one memory module slot on each of the six memory channels per socket.
For applications that are sensitive to memory bandwidth and require predictable performance, configuring memory for the underlying architecture is an important consideration. For optimal memory performance, all six memory channels of a CPU should be populated with memory modules (DIMMs), and populated identically. This is a called a balanced memory configuration. In a balanced configuration all DIMMs are accessed uniformly and the full complement of memory channels are available to the application. An unbalanced memory configuration will lead to lower memory performance as some channels will be unused or used unequally. Even worse, an unbalanced memory configuration can lead to unpredictable memory performance based on how the system fractures the memory space into multiple regions and how Linux maps out these memory domains.
Figure 1: Relative memory bandwidth with different number of DIMMs on one socket.
PowerEdge C6420. Platinum 8176. 32 GB 2666 MT/s DIMMs.
Figure 1 shows the drop in performance when all six memory channels of a 14G server are not populated. Using all six memory channels per socket is the best configuration, and will give the most predictable performance. This data was collected using the Intel Xeon Platinum 8176 processor. While the exact memory performance of a system depends on the CPU model, the general trends and conclusions presented here apply across all CPU models.
Focusing now on the recommended configurations that use all 12 memory channels in a two socket 14G system, there are different memory module options that allow different total system memory capacities. Memory performance will also vary depending on whether the DIMMs used are single ranked, double ranked, RDIMMS or LR-DIMMs. These variations are, however, significantly lower than any unbalanced memory configuration as shown in Figure 2.
8GB 2666 MT/s memory is single ranked and have lower memory bandwidth than the dual ranked 16GB and 32GB memory modules.
16GB and 32GB are both dual ranked and have similar memory bandwidth with 16G DIMMs demonstrating higher memory bandwidth.
128GB memory modules are also LR-DIMMS but are lower performing than the 64GB modules and their prime attraction is the additional memory capacity. Note that LR-DIMMs also have higher latency and higher power draw. Here is an older study on previous generation 13G servers that describes these characteristics in detail.
Figure 2: Relative memory bandwidth for different system capacities (12D balanced configs).
PowerEdge R740. Platinum 8180. DIMM configuration as noted, all 2666 MT/s memory.
Data for Figure 2 was collected on the Intel Xeon Platinum 8180 processor. As mentioned above, the memory subsystem performance depends on the CPU model since the memory controller is integrated with the processor, and the speed of the processor and number of cores also influence memory performance. The trends presented here will apply across the Skylake CPUs, though the actual percentage differences across configurations may vary. For example, here we see the 96 GB configuration has 7% lower memory bandwidth than the 384 GB configuration. With a different CPU model, that difference could be 9-10%.
Figure 3 shows another example of balanced configurations, this time using 12 or 24 identical DIMMs in the 2S system where one DIMM per channel is populated (1DPC with 12DIMMs) or two DIMMs per channel are populated (2DPC using 24 DIMMs). The information plotted in Figure 3 was collected across two CPU models and shows the same patterns as Figure 2. Additionally, the following observations can be made:
With two 8GB single ranked DIMMs giving two ranks on each channel, some of the memory bandwidth lost with 1DPC SR DIMMs can be recovered with the 2DPC configuration.
We also measured the impact of this memory population when 2400 MT/s memory is used, and the conclusions were identical to those for 2666 MT/s memory. For brevity, the 2400 MT/s results are not presented here.
Figure 3: Relative memory bandwidth for different system capacities (12D, 24D balanced configs).
PowerEdge R640. Processor and DIMM configuration as noted. All 2666 MT/s memory.
In previous generation systems, the processor supported four memory channels per socket. This led to balanced configurations with eight or sixteen memory modules per dual socket server. Configurations of 8x16GB (128 GB), 16x16 GB or 8x32GB (256 GB), 16x32 GB (512 GB) were popular and recommended.
With 14G and Skylake, these absolute memory capacities will lead to unbalanced configurations as these memory capacities do not distribute evenly across 12 memory channels. A configuration of 512 GB on 14G Skylake is possible but suboptimal, as shown in Figure 4. Across CPU models (Platinum 8176 down to Bronze 3106), there is a 65% to 35% drop in memory bandwidth when using an unbalanced memory configuration when compared to a balanced memory configuration! The figure compares 512 GB to 384 GB, but the same conclusion holds for 512 GB vs 768 GB as Figure 2 has shown us that a balanced 384 GB configuration performs similarly to a balanced 768 GB configuration.
Figure 4: Impact of unbalanced memory configurations.
PowerEdge C6420. Processor and DIMM configuration as noted. All 2666 MT/s memory.
The question that arises is - Is there a reasonable configuration that would work for capacities close to 256GB without having to go all the way to a 384GB configuration, and close to 512GB without having to raise the capacity all the way to 768GB?
Dell EMC systems do allow mixing different memory modules, and this is described in more detail in the server owner manual. For example, the Dell EMC PowerEdge R640 has 24 memory slots with 12 slots per processor. Each processor’s set of 12 slots is organized across 6 channels with 2 slots per channel. In each channel, the first slot is identified by the white release tab while the second slot tab is black. Here is an extract of the memory population guidelines that permit mixing DIMM capacities.
The PowerEdge R640 supports Flexible Memory Configuration, enabling the system to be configured and run in any valid chipset architectural configuration. Dell EMC recommends the following guidelines to install memory modules:
RDIMMs and LRDIMMs must not be mixed.
Populate all the sockets with white release tabs first, followed by the black release tabs.
When mixing memory modules with different capacities, populate the sockets with memory modules with the highest capacity first. For example, if you want to mix 8 GB and 16 GB memory modules, populate 16 GB memory modules in the sockets with white release tabs and 8 GB memory modules in the sockets with black release tabs.
Memory modules of different capacities can be mixed provided other memory population rules are followed (for example, 8 GB and 16 GB memory modules can be mixed).
Mixing of more than two memory module capacities in a system is not supported.
In a dual-processor configuration, the memory configuration for each processor should be identical. For example, if you populate socket A1 for processor 1, then populate socket B1 for processor 2, and so on.
Populate six memory modules per processor (one DIMM per channel) at a time to maximize performance.
*One important caveat is that 64 GB LRDIMMs and 128 GB LRDIMMs cannot be mixed; they are different technologies and are not compatible.
So the question is, how bad are mixed memory configurations for HPC? To address this, we tested valid “near-balanced configurations” as described in Table 1, with the results displayed in Figure 5.
Table 1: Near balanced memory configurations
Figure 5: Impact of near-balanced configurations.
Figure 5 illustrates that near-balanced configurations are a reasonable alternative when the memory capacity requirements demand a compromise. All memory channels are populated, and this helps with the memory bandwidth. The 288 GB configuration uses single ranked 8GB DIMMs and we see the penalty single ranked DIMMS impose on the memory bandwidth.
Performance and Energy Efficiency of the 14th Generation Dell PowerEdge Servers – Memory bandwidth results across Intel Skylake Processor Family (Skylake) CPU models (14G servers)
13G PowerEdge Server Performance Sensitivity to Memory Configuration – Intel Xeon 2600 v3 and 2600 v4 systems (13G servers)
Unbalanced Memory Performance – Intel Xeon E5-2600 and 2600 v2 systems (12G servers)
Memory Selection Guidelines for HPC and 11G PowerEdge Servers – Intel Xeon 5500 and 5600 systems (11G servers)