Four technical white papers were recently published describing the recently released Dell EMC Ready Bundle for HPC Digital Manufacturing. These papers discuss the architecture of the system as well as the performance of ANSYS® Mechanical™, ANSYS® Fluent® and ANSYS® CFX®, LSTC LS-DYNA®, Simcenter STAR-CCM+™ and Dassault Systѐmes’ Simulia Abaqus on the Dell EMC Ready Bundle for HPC.
Dell EMC Ready Bundle for HPC Digital Manufacturing–ANSYS Performance
Dell EMC Ready Bundle for HPC Digital Manufacturing–LSTC LS-DYNA Performance
Dell EMC Ready Bundle for HPC Digital Manufacturing–Simcenter STAR-CCM+ Performance
Dell EMC Ready Bundle for HPC Digital Manufacturing–Dassault Systѐmes’ Simulia Abaqus Performance
Authors: Frank Han, Rengan Xu, Nishanth Dandapanthula.
HPC Innovation Lab. February 2018
Not long ago, PowerEdge R740 Server was released as part of Dell’s 14th Generation server portfolio. It is a 2U Intel SkyLake based rack mount server and provides the ideal balance between storage, I/O and application acceleration. Besides VDI and Cloud, the server is also designed for HPC workloads. Compared to the previous R730 server, one of the major changes on GPU support is that, R740 supports up to 3 double width cards, which is one more than what R730 could support. This blog will focus on the performance of a single R740 server with 1, 2 and 3 Nvidia Tesla V100-PCIe GPUs. Multiple cards scaling number for HPC applications like High-Performance Linpack (HPL), High Performance Conjugate Gradients benchmark (HPCG) and Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS) will be presented.
Table1: Details of R740 configuration and software version
2 x Intel(R) Xeon(R) Gold 6150 @2.7GHz, 18c
Red Hat Enterprise Linux Server release 7.3
Nvidia Tesla V100-PCIe
Processor Settings > Logical Processors
High Performance Linpack (HPL)
Figure 1: HPL Performance and efficiency with R740
Figure 1 shows HPL performance and efficiency numbers. Performance data increases with multiple GPUs nearly linearly. Efficiency line isn’t flat, the peak 67.8% appears with the 2-card, which means configuration with 2 GPUs is the most optimized one for HPL. Number of 1 and 3 cards are about 7% lower than 2 card and they are affected by different factors:
For the 1 card case, this GPU based HPL application designs to bond CPU and GPU. While running in large scale with multiple GPU cards and nodes, it is known to make data access more efficient to bond GPU to the CPU. But testing with only 1 V100 here is a special case that only the 1st CPU bonded with the only GPU, and no workload on the 2nd CPU. Comparing with 2 cards result, the non-using part in Rpeak increased so efficiency dropped. This doesn’t mean GPU less efficient, just because HPL application designs and optimizes for large scales, and 1 GPU only situation is special for HPL.
Figure 2: HPCG Performance with R740
As shown in Figure 2, comparing with dual Xeon 6150 CPU only performance, single V100 is already 3 times faster, 2 V100 is 8 times and 3 V100 (CPU only node vs GPU node) is nearly 12 times. Just like HPL, HPCG is also designed for a large scale, so single card performance isn’t as efficient as multiple cards on HPCG. But unlike HPL, 3 card performance on HPCG is linearly scaled. It is 1.48 times higher than 2 cards’, which is very close to the theoretical 1.5 times. This is because all the HPCG workload is run on GPU and its data fits in GPU memory. This proves application like HPCG can take the advantage of having the 3rd GPU.
Figure 3: LAMMPS Performance with R740
The LAMMPS version used for this testing is 17Aug2017, which is the latest stable version at the time of testing. The testing dataset is in.intel.lj, which is the same one in all pervious GPU LAMMPS testing and it can be found here. With the same parameters set from previous testing, the initial values of space were x=4, y=z=2, the simulation executes with 512k atoms. Weak scaling obvious as timesteps/s number of 2 and 3 cards only 1.5 and 1.7 times than single card’s. The reason for this is that the workload isn’t heavy enough for 3 V100 GPUs. After adjusting all x,y,z to 8, 16M atoms generated in simulation, and then the performance scaled well with multiple cards. As shown in Figure 3, 2 and 3 cards is 1.8 and 2.4 times faster than single card, respectively. This results of LAMMPS is another example for GPU accelerated HPC applications that can benefit from having more GPUs in the system.
The R740 server with multiple Nvidia Tesla V100-PCIe GPUs demonstrates exceptional performance for applications like HPL, HPCG and LAMMPS. Besides balanced I/O, R740 has the flexibility for running HPC applications with 1, 2 or 3 GPUs. The newly added support for an additional 3rd GPU provides more compute power as well as larger total memory in GPU. Many applications work best when data fits in GPU memory and having the 3rd GPU allows fitting larger problems with R740.
PowerEdge R740 Technical Guide: http://i.dell.com/sites/doccontent/shared-content/data-sheets/en/Documents/PowerEdge_R740_R740xd_Technical_Guide.pdf
We often look at containers today and see the potential of accelerated application delivery, scaling and portability and wonder how we ever got by without them. This blog discusses the performance of LSTC LS-DYNA® within Singularity containers and on bare metal. This blog is the third in the series of blogs regarding container technology by the HPC Engineering team. The first blog Containers, Docker, Virtual Machines and HPC explored the use of containers. The second blog Containerizing HPC Applications with Singularity provided an introduction to Singularity and discussed the challenges of using Singularity in HPC environments. This third blog will focus on determining if there is a performance penalty while running the application (LS-DYNA) in a containerized environment.
Containerizing LS-DYNA using Singularity
An application specific container is a lightweight bundle of an application and its dependencies. The primary advantages of an application container are its portability and reproducibility. When we started assessing what type of workloads/applications could be containerized using singularity, interestingly enough, the first application that we tried to containerize using singularity was Fluent®, which presented a constraint. Since Fluent bundles MPI libraries, mpirun has to be invoked from within the container, instead of calling mpirun from outside the container. Adoption of containers is difficult in this scenario and requires an sshd wrapper. So we shifted gears, and started working on LS-DYNA which is a general-purpose finite element analysis (FEA) program, capable of simulating complex real world problems. LS-DYNA consists of a single executable file and is entirely command line driven. Therefore, all that is required to run LS-DYNA is a command line shell, the executable, an input file, and enough disk space to run the calculation. It is used to simulate a whole range of different engineering problems using its fully automated analysis capabilities. LS-DYNA is used worldwide in multiple engineering disciplines such as automobile, aerospace, construction, military, manufacturing, and bioengineering industries. It is robust and has worked extremely well over the past 20 years in numerous applications such as crashworthiness, drop testing and occupant safety.
The first step in creating the container for LS-DYNA would be to have a minimal operating system (CentOS 7.3 in this case), basic tools to run the application and support for InfiniBand within the container. With this, the container gets a runtime environment, system tools, and libraries. The next step would be to install the application binaries within the container. The definition file used to create the LS-DYNA container is given below. In this file, the bootstrap references the kind of base to be used and a number of options are available. “shub” pulls images hosted on Singularity Hub, “docker” pulls images from Docker Hub and here yum is used to install Centos-7.
# basic-tools and dev-tools
yum -y install evince wget vim zip unzip gzip tar perl
yum -y groupinstall "Development tools" --nogpgcheck
# InfiniBand drivers
yum -y --setopt=group_package_types=optional,default,mandatory groupinstall "InfiniBand Support"
yum -y install rdma yum -y install libibverbs-devel libsysfs-devel
yum -y install infinipath-psm
# Platform MPI
mkdir -p /home/inside/platform_mpi
chmod -R 777 /home/inside/platform_mpi
chmod 777 platform_mpi-09.1.0.1isv.x64.bin
./platform_mpi-09.1.0.1isv.x64.bin -installdir=/home/inside/platform_mpi -silent
mkdir -p /home/inside/lsdyna/code
chmod -R 777 /home/inside/lsdyna/code
cd /home/inside/lsdyna/code/usr/bin/wget http://192.168.41.41/ls-dyna_mpp_s_r9_1_113698_x64_redhat54_ifort131_avx2_platformmpi.tar.gz
tar -xvf ls-dyna_mpp_s_r9_1_113698_x64_redhat54_ifort131_avx2_platformmpi.tar.gz
export PATH LSTC_LICENSE LSTC_MEMORY
It is quick and easy to build a ready to use singularity image using the above definition file. In addition to the environment variables specified in the definition file, the value of any other variable such as LSTC_LICENSE_SERVER, can be set inside the container, with the use of the prefix “SINGULARITYENV_”. Singularity adopts a hybrid model for MPI support and the ‘mpirun’ is called, from outside the container, as follows:
/home/nirmala/platform_mpi/bin/mpirun -env_inherit -np 80 -hostfile hostfile singularity exec completelsdyna.simg /home/inside/lsdyna/code/ls-dyna_mpp_s_r9_1_113698_x64_redhat54_ifort131_avx2_platformmpi i=Caravan-V03c-2400k-main-shell16-120ms.k memory=600m memory2=60m endtime=0.02
LS_DYNA car2car Benchmark Test
The car2car benchmark presented here is a simulation of a two vehicle collision. The performance results for LS-DYNA are presented by using the Elapsed Time metric. This metric is the total elapsed time for the simulation in seconds as reported by LS-DYNA. A lower value represents better performance. The performance tests were conducted on two clusters in the HPC Innovation Lab, one with Intel® Xeon® Scalable Processor Family processors (Skylake) and another with Intel® Xeon® E5-2600 v4 processors (Broadwell). The software installed on the Skylake cluster is shown in Table 1
The configuration details of the Skylake cluster are shown below in Table 2.
Figure1 shows that there is no perceptible performance loss while running LS-DYNA inside a containerized environment, both on a single node and across the four node cluster. The results for the bare-metal and container tests are within 2% of each other, which is within the expected run-to-run variability of the application itself.
Figure 1 Performance inside singularity containers relative to bare metal on Skylake Cluster.
The software installed on the Broadwell cluster is shown in Table 3
The configuration details of the Broadwell cluster are shown below in Table 4 .
Figure 2 shows that there is no significant performance penalty while running LS-DYNA car2car inside Singularity containers on Broadwell cluster. The performance delta is within 2% only.
Figure 2 Performance inside singularity containers relative to bare metal on Broadwell Cluster
Conclusion and Future Work
In this blog, we discussed how the performance of LS-DYNA within singularity containers is almost at par with running the application on bare metal servers. The performance difference while running LS-DYNA within Singularity containers remains within 2%, which is within the run-to-run variability of the application itself.. The Dell EMC HPC team will focus on containerizing other applications, and this blog series will be continued. So stay tuned for the next blog in this series!
HPC system configuration can be a complex task, especially at scale, requiring a balance between user requirements, performance targets, data center power, cooling and space constraints and pricing. Furthermore, many choices among competing technologies complicates configuration decisions. The document below describes a modular approach to HPC system design at scale where sub-systems are broken down into modules that integrate well together. The design and options for each module are backed by measured results, including application performance. Data center considerations are taking into account during design phase. Enterprise class services, including deployment, management and support, are available for the whole HPC system.
To function efficiently in an HPC environment, a cluster of compute nodes must work in tandem to compile complex data and achieve desired results. The user expects each node to function at peak performance as an individual system, as well as a part of an intricate group of nodes processing data in parallel. To enable efficient cluster performance, we first need good single system performance. With that in mind, we evaluated the impact of different memory configurations on single node memory bandwidth performance using the STREAM benchmark. The servers used here support the latest Intel Skylake processor (Intel Scalable Processor Family) and are from the Dell EMC 14th generation (14G) server product line.
The Skylake processor has a built-in memory controller similar to previous generation Xeons but now supports *six* memory channels per socket. This is an increase from the four memory channels found in previous generation Xeon E5-2600 v3 and E5-2600 v4 processors. Different Dell EMC server models offer a different number of memory slots based on server density, but all servers offer at least one memory module slot on each of the six memory channels per socket.
For applications that are sensitive to memory bandwidth and require predictable performance, configuring memory for the underlying architecture is an important consideration. For optimal memory performance, all six memory channels of a CPU should be populated with memory modules (DIMMs), and populated identically. This is a called a balanced memory configuration. In a balanced configuration all DIMMs are accessed uniformly and the full complement of memory channels are available to the application. An unbalanced memory configuration will lead to lower memory performance as some channels will be unused or used unequally. Even worse, an unbalanced memory configuration can lead to unpredictable memory performance based on how the system fractures the memory space into multiple regions and how Linux maps out these memory domains.
Figure 1: Relative memory bandwidth with different number of DIMMs on one socket.
PowerEdge C6420. Platinum 8176. 32 GB 2666 MT/s DIMMs.
Figure 1 shows the drop in performance when all six memory channels of a 14G server are not populated. Using all six memory channels per socket is the best configuration, and will give the most predictable performance. This data was collected using the Intel Xeon Platinum 8176 processor. While the exact memory performance of a system depends on the CPU model, the general trends and conclusions presented here apply across all CPU models.
Focusing now on the recommended configurations that use all 12 memory channels in a two socket 14G system, there are different memory module options that allow different total system memory capacities. Memory performance will also vary depending on whether the DIMMs used are single ranked, double ranked, RDIMMS or LR-DIMMs. These variations are, however, significantly lower than any unbalanced memory configuration as shown in Figure 2.
8GB 2666 MT/s memory is single ranked and have lower memory bandwidth than the dual ranked 16GB and 32GB memory modules.
16GB and 32GB are both dual ranked and have similar memory bandwidth with 16G DIMMs demonstrating higher memory bandwidth.
128GB memory modules are also LR-DIMMS but are lower performing than the 64GB modules and their prime attraction is the additional memory capacity. Note that LR-DIMMs also have higher latency and higher power draw. Here is an older study on previous generation 13G servers that describes these characteristics in detail.
Figure 2: Relative memory bandwidth for different system capacities (12D balanced configs).
PowerEdge R740. Platinum 8180. DIMM configuration as noted, all 2666 MT/s memory.
Data for Figure 2 was collected on the Intel Xeon Platinum 8180 processor. As mentioned above, the memory subsystem performance depends on the CPU model since the memory controller is integrated with the processor, and the speed of the processor and number of cores also influence memory performance. The trends presented here will apply across the Skylake CPUs, though the actual percentage differences across configurations may vary. For example, here we see the 96 GB configuration has 7% lower memory bandwidth than the 384 GB configuration. With a different CPU model, that difference could be 9-10%.
Figure 3 shows another example of balanced configurations, this time using 12 or 24 identical DIMMs in the 2S system where one DIMM per channel is populated (1DPC with 12DIMMs) or two DIMMs per channel are populated (2DPC using 24 DIMMs). The information plotted in Figure 3 was collected across two CPU models and shows the same patterns as Figure 2. Additionally, the following observations can be made:
With two 8GB single ranked DIMMs giving two ranks on each channel, some of the memory bandwidth lost with 1DPC SR DIMMs can be recovered with the 2DPC configuration.
We also measured the impact of this memory population when 2400 MT/s memory is used, and the conclusions were identical to those for 2666 MT/s memory. For brevity, the 2400 MT/s results are not presented here.
Figure 3: Relative memory bandwidth for different system capacities (12D, 24D balanced configs).
PowerEdge R640. Processor and DIMM configuration as noted. All 2666 MT/s memory.
In previous generation systems, the processor supported four memory channels per socket. This led to balanced configurations with eight or sixteen memory modules per dual socket server. Configurations of 8x16GB (128 GB), 16x16 GB or 8x32GB (256 GB), 16x32 GB (512 GB) were popular and recommended.
With 14G and Skylake, these absolute memory capacities will lead to unbalanced configurations as these memory capacities do not distribute evenly across 12 memory channels. A configuration of 512 GB on 14G Skylake is possible but suboptimal, as shown in Figure 4. Across CPU models (Platinum 8176 down to Bronze 3106), there is a 65% to 35% drop in memory bandwidth when using an unbalanced memory configuration when compared to a balanced memory configuration! The figure compares 512 GB to 384 GB, but the same conclusion holds for 512 GB vs 768 GB as Figure 2 has shown us that a balanced 384 GB configuration performs similarly to a balanced 768 GB configuration.
Figure 4: Impact of unbalanced memory configurations.
PowerEdge C6420. Processor and DIMM configuration as noted. All 2666 MT/s memory.
The question that arises is - Is there a reasonable configuration that would work for capacities close to 256GB without having to go all the way to a 384GB configuration, and close to 512GB without having to raise the capacity all the way to 768GB?
Dell EMC systems do allow mixing different memory modules, and this is described in more detail in the server owner manual. For example, the Dell EMC PowerEdge R640 has 24 memory slots with 12 slots per processor. Each processor’s set of 12 slots is organized across 6 channels with 2 slots per channel. In each channel, the first slot is identified by the white release tab while the second slot tab is black. Here is an extract of the memory population guidelines that permit mixing DIMM capacities.
The PowerEdge R640 supports Flexible Memory Configuration, enabling the system to be configured and run in any valid chipset architectural configuration. Dell EMC recommends the following guidelines to install memory modules:
RDIMMs and LRDIMMs must not be mixed.
Populate all the sockets with white release tabs first, followed by the black release tabs.
When mixing memory modules with different capacities, populate the sockets with memory modules with the highest capacity first. For example, if you want to mix 8 GB and 16 GB memory modules, populate 16 GB memory modules in the sockets with white release tabs and 8 GB memory modules in the sockets with black release tabs.
Memory modules of different capacities can be mixed provided other memory population rules are followed (for example, 8 GB and 16 GB memory modules can be mixed).
Mixing of more than two memory module capacities in a system is not supported.
In a dual-processor configuration, the memory configuration for each processor should be identical. For example, if you populate socket A1 for processor 1, then populate socket B1 for processor 2, and so on.
Populate six memory modules per processor (one DIMM per channel) at a time to maximize performance.
*One important caveat is that 64 GB LRDIMMs and 128 GB LRDIMMs cannot be mixed; they are different technologies and are not compatible.
So the question is, how bad are mixed memory configurations for HPC? To address this, we tested valid “near-balanced configurations” as described in Table 1, with the results displayed in Figure 5.
Table 1: Near balanced memory configurations
Figure 5: Impact of near-balanced configurations.
Figure 5 illustrates that near-balanced configurations are a reasonable alternative when the memory capacity requirements demand a compromise. All memory channels are populated, and this helps with the memory bandwidth. The 288 GB configuration uses single ranked 8GB DIMMs and we see the penalty single ranked DIMMS impose on the memory bandwidth.
Performance and Energy Efficiency of the 14th Generation Dell PowerEdge Servers – Memory bandwidth results across Intel Skylake Processor Family (Skylake) CPU models (14G servers)
13G PowerEdge Server Performance Sensitivity to Memory Configuration – Intel Xeon 2600 v3 and 2600 v4 systems (13G servers)
Unbalanced Memory Performance – Intel Xeon E5-2600 and 2600 v2 systems (12G servers)
Memory Selection Guidelines for HPC and 11G PowerEdge Servers – Intel Xeon 5500 and 5600 systems (11G servers)
Authors: Rengan Xu, Frank Han, Nishanth Dandapanthula.
HPC Innovation Lab. November 2017
In our previous blog, we presented the deep learning performance on single Dell PowerEdge C4130 node with four V100 GPUs. For very large neural network models, a single node is still not powerful enough to quickly train those models. Therefore, it is important to scale the training model to multiple nodes to meet the computation demand. In this blog, we will evaluate the multi-node performance of deep learning frameworks MXNet and Caffe2. The Interconnect in use is Mellanox EDR InfiniBand. The results will show that both frameworks scale well on multiple V100-SXM2 nodes.
Overview of MXNet and Caffe2
In this section, we will give an overview about how MXNet and Caffe2 are implemented for distributed training on multiple nodes. Usually there are two ways to parallelize neural network training on multiple devices: data parallelism and model parallelism. In data parallelism, all devices have the same model but different devices work on different pieces of data. While in model parallelism, difference devices have parameters of different layers of a neural network. In this blog, we only focus on the data parallelism in deep learning frameworks and will evaluate the model parallelism in the future. Another choice in most deep learning frameworks is whether to use synchronous or asynchronous weight update. The synchronous implementation aggregates the gradients over all workers in each iteration (or mini-batch) before updating the weights. However, in asynchronous implementation, each worker updates the weight independently with each other. Since the synchronous way guarantees the model convergence while the asynchronous way is still an open question, we only evaluate the synchronous weight update.
MXNet is able to launch jobs on a cluster in several ways including SSH, Yarn, MPI. For this evaluation, SSH was chosen. In SSH mode, the processes in different nodes use rsync to synchronize the working directory from root node into slave nodes. The purpose of synchronization is to aggregate the gradients over all workers in each iteration (or mini-batch). Caffe2 uses Gloo library for multi-node training and Redis library to facilitate management of nodes in distributed training. Gloo is a MPI like library that comes with a number of collective operations like barrier, broadcast and allreduce for machine learning applications. The Redis library used by Gloo is used to connect all participating nodes.
We chose to evaluate two deep learning frameworks for our testing, MXNet and Caffe2. As with our previous benchmarks, we will again use the ILSVRC 2012 dataset which contains 1,281,167 training images and 50,000 validation images. The neural network in the training is called Resnet50 which is a computationally intensive network that both frameworks support. To get the best performance, CUDA 9 compiler, CUDNN 7 library and NCCL 2.0 are used for both frameworks, since they are optimized for V100 GPUs. The testing platform has four Dell EMC’s PowerEdge C4130 servers in configuration K. The system layout of configuration K is shown in Figure 1. As we can see, the server has four V100-SXM2 GPUs and all GPUs are connected by NVLink. The other hardware and software details are shown in Table 1. Table 2 shows the input parameters that are used to train Resnet50 neural network in both frameworks.
Figure 1: C4130 configuration K
Table 1: The hardware configuration and software details
Table 2: Input parameters used in different deep learning frameworks
Figure 2 and Figure 3 show the Resnet50 performance and speedup results on multiple nodes with MXNet and Caffe2, respectively. As we can see, the performance scales very well with both frameworks. With MXNet, compared to 1*V100, the speedup of using 16*V100 (in 4 nodes) is 15.4x in FP32 mode and 13.8x in FP16, respectively. And compared to FP32, FP16 improved the performance by 63.28% - 82.79%. Such performance improvement was contributed to the Tensor Cores in V100.
In Caffe2, compared to 1*V100, the speedup of using 16*V100 (4 nodes) is 14.8x in FP32 and 13.6x in FP16, respectively. And the performance improvement of using FP16 compared to FP32 is 50.42% - 63.75% excluding the 12*V100 case. With 12*V100, using FP16 is only 29.26% faster than using FP32. We are still investigating the exact reason of it, but one possible explanation is that 12 is not the power of 2, which may make some operations like reductions slower.
Figure 2: Performance of MXNet Resnet50 on multiple nodes
Figure 3: Performance of Caffe2 Resnet50 on multiple nodes
Conclusions and Future Work
In this blog, we present the performance of MXNet and Caffe2 on multiple V100-SXM2 nodes. The results demonstrate that the deep learning frameworks are able to scale very well on multiple Dell EMC’s PowerEdge servers. At this time the FP16 support in TensorFlow is still experimental, our evaluation is in progress and the results will be included in future blogs. We are also working on containerizing these frameworks with Singularity to make their deployment much easier.
HPC Innovation Lab. October 2017
In this blog, we will give an introduction to Singularity containers and how they should be used to containerize HPC applications. We run different deep learning frameworks with and without Singularity containers and show that there is no performance loss with Singularity containers. We also show that Singularity can be easily used to run MPI applications.
Introduction to Singularity
Singularity is a container system developed by Lawrence Berkeley Lab to provide container technology like Docker for High Performance Computing (HPC). It wraps applications into an isolated virtual environment to simplify application deployment. Unlike virtual machines, the container does not have a virtual hardware layer and its own Linux kernel inside the host OS. It is just sandboxing the environment; therefore, the overhead and the performance loss are minimal. The goal of the container is reproducibility. The container has all environment and libraries an application needs to run, and it can be deployed anywhere so that anyone can reproduce the results the container creator generated for that application.
Besides Singularity, another popular container is Docker, which has been widely used for many applications. However, there are several reasons that Docker is not suitable for an HPC environment. The following are various reasons that we choose Singularity rather than Docker:
Security concern. The Docker daemon has root privileges and this is a security concern for several high performance computing centers. In contrast, Singularity solves this by running the container with the user’s credentials. The access permissions of a user are the same both inside the container and outside the container. Thus, a non-root user cannot change anything outside of his/her permission.
HPC Scheduler. Docker does not support any HPC job scheduler, but Singularity integrates seamlessly with all job schedulers including SLURM, Torque, SGE, etc.
GPU support. Docker does not support GPU natively. Singularity is able to support GPUs natively. Users can install whatever CUDA version and software they want on the host which can be transparently passed to Singularity.
MPI support. Docker does not support MPI natively. So if a user wants to use MPI with Docker, a MPI-enabled Docker needs to be developed. If a MPI-enabled Docker is available, the network stacks such as TCP and those needed by MPI are private to the container which makes Docker containers not suitable for more complicated networks like Infiniband. In Singularity, the user’s environment is shared to the container seamlessly.
Challenges with Singularity in HPC and Workaround
Many HPC applications, especially deep learning applications, have deep library dependences and it is time consuming to figure out these dependences and debug build issues. Most deep learning frameworks are developed in Ubuntu but they need to be deployed to Red Hat Enterprise Linux. So it is beneficial to build those applications once in a container and then deploy them anywhere. The most important goal of Singularity is portability which means once a Singularity container is created, the container can be run on any system. However, so far it is still not easy to achieve this goal. Usually we build a container on our own laptop, a server, a cluster or a cloud, and then deploy that container on a server, a cluster or a cloud. When building a container, one challenge is in GPU-based systems which have GPU driver installed. If we choose to install GPU driver inside the container, but the driver version does not match the host GPU driver, then an error will occur. So the container should always use the host GPU driver. The next option is to bind the paths of the GPU driver binary file and libraries to the container so that these paths are visible to the container. However, if the container OS is different than the host OS, such binding may have problems. For instance, assume the container OS is Ubuntu while the host OS is RHEL, and on the host the GPU driver binaries are installed in /usr/bin and the driver libraries are installed in /usr/lib64. Note that the container OS also have /usr/bin and /usr/lib64; therefore, if we bind those paths from the host to the container, the other binaries and libraries inside the container may not work anymore because they may not be compatible in different Linux distributions. One workaround is to move all those driver related files to a central place that does not exist in the container and then bind that central place.
The second solution is to implement the above workaround inside the container so that the container can use those driver related files automatically. This feature has already been implemented in the development branch of Singularity repository. A user just need to use the option “--nv” when launching the container. However, based on our experience, a cluster usually installs GPU driver in a shared file system instead of the default local path on all nodes, and then Singularity is not able to find the GPU driver path if the driver is not installed in the default or common paths (e.g. /usr/bin, /usr/sbin, /bin, etc.). Even if the container is able to find the GPU driver and the corresponding driver libraries and we build the container successfully, if in the deployment system the host driver version is not new enough to support the GPU libraries which were linked to the application when building the container, then an error will occur. Because of the backward compatibility of GPU driver, the deployment system should keep the GPU driver up to date to ensure its libraries are equal to or newer than the GPU libraries that were used for building the container.
Another challenge is to use InfiniBand with the container because InfiniBand driver is kernel dependent. There is no issue if the container OS and host OS are the same or compatible. For instance, RHEL and Centos are compatible, and Debian and Ubuntu are compatible. But if these two OSs are not compatible, then it will have library compatibility issue if we let the container use the host InfiniBand driver and libraries. If we choose to install the InfiniBand driver inside the container, then the drivers in the container and the host are not compatible. The Singularity community is still trying hard to solve this InfiniBand issue. Our current solution is to make the container OS and host OS be compatible and let the container reuse the InfiniBand driver and libraries on the host.
Singularity on Single Node
To measure the performance impact of using Singularity container, we ran the neural network Inception-V3 with three deep learning frameworks: NV-Caffe, MXNet and TensorFlow. The test server is a Dell PowerEdge C4130 configuration G. We compared the training speed in images/sec with Singularity container and bare-metal (without container). The performance comparison is shown in Figure 1. As we can see, there is no overhead or performance penalty when using Singularity container.
Figure 1: Performance comparison with and without Singularity
Singularity at Scale
We ran HPL across multiple nodes and compared the performance with and without container. All nodes are Dell PowerEdge C4130 configuration G with four P100-PCIe GPUs, and they are connected via Mellanox EDR InfiniBand. The result comparison is shown in Figure 2. As we can see, the percent performance difference is within ± 0.5%. This is within normal variation range since the HPL performance is slightly different in each run. This indicates that MPI applications such as HPL can be run at scale without performance loss with Singularity.
Figure 2: HPL performance on multiple nodes
In this blog, we introduced what Singularity is and how it can be used to containerize HPC applications. We discussed the benefits of using Singularity over Docker. We also mentioned the challenges of using Singularity in a cluster environment and the workarounds available. We compared the performance of bare metal vs Singularity container and the results indicated that there is no performance loss when using Singularity. We also showed that MPI applications can be run at scale without performance penalty with Singularity.
HPC Innovation Lab. August 2017
This is one of two articles in our Tesla V100 blog series. In this blog, we present the initial benchmark results of NVIDIA® Tesla® Volta-based V100™ GPUs on 4 different HPC benchmarks, as well as a comparative analysis against previous generation Tesla P100 GPUs. We are releasing another V100 series blog, which discusses our V100 and deep learning applications. If you haven’t read it yet, it is highly recommend to take a look here.
PowerEdge C4130 with V100 GPU support
The NVIDIA® Tesla® V100 accelerator is one of the most advanced accelerators available in the market right now and was launched within one year of the P100 release. In fact, Dell EMC is the first in the industry to integrate Tesla V100 and bring it to market. As was the case with the P100, V100 supports two form factors: V100-PCIe and the mezzanine version V100-SXM2. The Dell EMC PowerEdge C4130 server supports both types of V100 and P100 GPU cards. Table 1 below notes the major enhancements in V100 over P100:
Table 1: The comparison between V100 and P100
GPU Max Clock rate (MHz)
Memory Clock rate (MHz)
Memory Bandwidth (GB/s)
Interconnect Bandwidth Bi-Directional (GB/s)
Deep Learning (TFlops)
Single Precision (TFlops)
Double Precision (TFlops)
V100 not only significantly improves performance and scalability as will be shown below, but also comes with new features. Below are some highlighted features important for HPC Applications:
Second-Generation NVIDIA NVLink™
All four V100-SXM2 GPUs in the C4130 are connected by NVLink™ and each GPU has six links. The bi-directional bandwidth of each link is 50 GB/s, so the bi-directional bandwidth between different GPUs is 300 GB/s. This is useful for applications requiring a lot of peer-to-peer data transfers between GPUs.
New Streaming Multiprocessor (SM)
Single precision and double precision capability of the new SM is 50% more than the previous P100 for both PCIe and SXM2 form factors. The TDP (Thermal Design Power) of both cards are the same, which means V100 is ~1.5 times more energy efficient than the previous P100.
HBM2 Memory: Faster, Higher Efficiency
The 900 GB/sec peak memory bandwidth delivered by V100, is 23% higher than P100. Also the DRAM utilization has been improved from 76% to 95%, which allows for a 1.5x improvement in delivered memory bandwidth.
More in-depth details of all new features of V100 GPU card can be found at this Nvidia website.
Hardware and software specification update
All the performance results in this blog were measured on a PowerEdge Server C4130 using Configuration G (4x PCIe V100) and Configuration K (4x V100-SXM2). Both these configurations have been used previously in P100 testing. Also except for the GPU, the hardware components remain identical to those used in the P100 tests as well: dual Intel Xeon E5-2690 v4 processors, 256GB (16GB*16 2400 MHz) Memory and an NFS file system mounted via IPoIB on InfiniBand EDR were used. Complete specs details are included in our previous blog. Moreover, if you are interested in other C4130 configurations besides G and K, you can find them in our K80 blog.
There are some changes on the software front. In order to unleash the power of the V100, it was necessary to use the latest version of all software components. Table 2 lists the versions used for this set of performance tests. To keep the comparison fair, the P100 tests were reran using the new software stack to normalize for the upgraded software.
Table 2: The changes in software versions
Previous version in P100 blog
1.10.7 & 2.1.2
1.10.1 & 2.0.1
Compiled with sm7.0
Compiled with sm6.0
16, AmberTools17 update 20
16 AmberTools16 update3
p2pBandwidthLatencyTest is a micro-benchmark included in the CUDA SDK. It tests the card to card bandwidth and latency with and without GPUDirect™ Peer-to-Peer enabled. Since the full output matrix is fairly long, the unidirectional P2P result is listed below as an example here to demonstrate the way to verify the NVLINKs speed on both V100 and P100.
In theory, V100 has 6x 25GB/s uni-directional links, giving 150GB/s throughput. The previous P100-SXM2 only has 4x 20GB/s links, delivering 80GB/s. The results of p2pBandwitdhtLatencyTest on both cards are in Table 3. “D/D” represents “device-to-device”, that is the bandwidth available between two devices (GPUs). The achievable bandwidth of GPU0 was calculated by aggregating the second, third and fourth value in the first line, which represent the throughput from GPU0 to GPU1, GPU2 and GPU3 respectively.
Table 3: Unidirectional peer-to-peer bandwidth
Unidirectional P2P=Enabled Bandwidth Matrix (GB/s). Four GPUs cards in the server.
It is clearly seen that V100-SXM2 on C4130 configuration K is significant faster than P100-SXM2, on:
Achievable throughput. V100-SXM2 has 47.88+47.9+47.93= 143.71 GB/s aggregated achievable throughput, which is 95.8% of the theoretical value 150GB/s and significant higher than 73.06GB/s and 91.3% on P100-SXM2. The bandwidth for bidirectional traffic is twice that of unidirectional traffic and is also very close to the theoretically 300 GB/s throughput.
Real world application. Symmetric access is the key for real world applications, on each chipset, P100 has 4 links, out of which three are connected to each of the other three GPUS. The remaining fourth link is connected to one of the other three GPUs. So, there are two links between GPU0 and GPU3, but only 1 link between GPU0 and GPU1 as well as GPU0 and GPU2. This is not symmetrical. The above numbers of p2pBandwidthLatencyTest in blue show this imbalance, as the value between GPU0 to GPU3 reaches 36.39 GB/s, which is double the bandwidth between GPU0 and GPU1 or GPU0 and GPU2. In most real world applications, it is common for the developer to treat all cards equally and not take such architectural differences into account. Therefore it will be likely that the faster pair of GPUs will need to wait for the slowest transfers, which means that 18.31 GB/s is the actual speed between all pairs of GPUs.
On the other hand, V100 has a symmetrical design with 6 links as seen in Figure 1. GPU0 to GPU1, GPU2, or GPU3 all have 2 links between each pair. So 47.88 GB/s is the achievable link bandwidth for each, which is 2.6 times faster than the P100.
Figure1: V100 and P100 Topologies on C4130 configuration K
High Performance Linpack (HPL)
Figure2: HPL Multi-GPU results with V100 and P100 on C4130 configuration G and K
Figure 2 shows the HPL performance on the C4130 platform with 1, 2 and 4 V100-PCIe and V100-SXM2 installed. P100’s performance number is also listed for comparison. It can be observed:
Both P100 and V100 scaling well, performance increases as more GPUs are added.
V100 is ~30% faster than P100 on both PCIe (Config G) and SMX2 (Config K).
A single C4130 server with 4x V100 reaches over 20TFlops on PCIe (Config G).
HPL is a system level benchmark and its performance is limited by other components like CPU, memory and PCIe bandwidth. Configuration G is a balanced design, which has 2 PCIe links between CPU and GPU and this is why it outperforms configuration K with 4x GPUs in the HPL benchmark. We do see some other applications perform better in Configuration K, since SXM2 (Config K) supports NVLink, higher core clock speed and peer-to-peer data transfer, these are described below.
Figure 3: HPCG Performance results with 4x V100 and P100 on C4130 configuration G and K
HPCG, the High Performance Conjugate Gradients benchmark, is another well-known metric for HPC system ranking. Unlike HPL, its performance is strongly influenced by memory bandwidth. Credit to the faster and higher efficient HBM2 memory of V100, the performance improvement observed is 44% over P100 on both Configuration G and K.
Figure 4: AMBER Multi-GPU results with V100 and P100 on C4130 configuration G and K
Figure 4 illustrates AMBER’s results with Satellite Tobacco Mosaic Virus (STMV) dataset. On SXM2 system (Config K), AMBER scales weakly with 2 and 4 GPUs. Even though the scaling is not strong, V100 has noticeable improvement than P100, giving ~78% increase in single card runs, and 1x V100 is actually 23% faster than 4x P100. On the PCIe (Config G) side, 1 and 2 cards perform similar to SXM2, but 4 cards’ results dropped sharply. This is because PCIe (Config G) only supports Peer-to-Peer access between GPU0/1 and GPU2/3 and not among all four GPUs. Since AMBER has redesigned the way data transfers among GPUs to address the PCIe bottleneck, it relies heavily on Peer-to-Peer access for performance with multiple GPU cards. Hence a fast, direct interconnect like NVLink between all GPUs in SXM2 (Config K) is vital for AMBER multiple GPU performance.
Figure 5: AMBER Multi-GPU Aggregate results with V100 and P100 on C4130 configuration G and K
To compensate for a single job’s weak scaling on multiple GPUs, there is another use case promoted by AMBER developers, which is running multiple jobs in the same node concurrently but where each job uses only 1 or 2 GPUs. Figure 5 shows the results of 1-4 individual jobs on one C4130 with V100s and the numbers indicate that those individual jobs have little impact on each other. This is because AMBER is designed to run pretty much entirely on the GPUs and has very low dependency on the CPU. The aggregate throughput of multiple individual jobs scales linearly in this case. Without any card to card communication, the 5% better performance on SXM2 is contributed by its higher clock speed.
Figure 6: LAMMPS 4-GPU results with V100 and P100 on C4130 configuration G and K
Figure 6 shows LAMMPS performance on both configurations G and K. The testing dataset is Lennard-Jones liquid dataset, which contains 512000 atoms, and LAMMPS compiled with the kokkos package. V100 is 71% and 81% faster on Config G and Config K respectively. Comparing V100-SXM2 (Config K) and V100-PCIe (Config G), the former is 5% faster due to NVLINK and higher CUDA core frequency.
Figure 7: V100 Speedups on C4130 configuration G and K
The C4130 server with NVIDIA® Tesla® V100™ GPUs demonstrates exceptional performance for HPC applications that require faster computational speed and highest data throughput. Applications like HPL, HPCG benefit from the additional PCIe links between CPU and GPU that are offered by Dell PowerEdge C4130 configuration G. On the other hand, applications like AMBER and LAMMPS were boosted with C4130 configuration K, owing to P2P access, higher bandwidth of NVLink and higher CUDA core clock speed. Overall, a PowerEdge C4130 with Tesla V100 GPUs performs 1.24x to 1.8x faster than a C4130 with P100 for HPL, HPCG, AMBER and LAMMPS.
HPC Innovation Lab. September 2017
In this blog, we will introduce the NVIDIA Tesla Volta-based V100 GPU and evaluate it with different deep learning frameworks. We will compare the performance of the V100 and P100 GPUs. We will also evaluate two types of V100: V100-PCIe and V100-SXM2. The results indicate that in training V100 is ~40% faster than P100 with FP32 and >100% faster than P100 with FP16, and in inference V100 is 3.7x faster than P100. This is one blog of our Tesla V100 blog series. Another blog of this series is about the general HPC applications performance on V100 and you can read it here.
Introduction to V100 GPU
In the 2017 GPU Technology Conference (GTC), NVIDIA announced the Volta-based V100 GPU. Similar to P100, there are also two types of V100: V100-PCIe and V100-SXM2. V100-PCIe GPUs are inter-connected by PCIe buses and the bi-directional bandwidth is up to 32 GB/s. V100-SXM2 GPUs are inter-connected by NVLink and each GPU has six links and the bi-directional bandwidth of each link is 50 GB/s, so the bi-directional bandwidth between different GPUs is up to 300 GB/s. A new type of core added in V100 is called tensor core which was designed specifically for deep learning. These cores are essentially a collection of ALUs for performing 4x4 matrix operations: specifically a fused multiply add (A*B+C), multiplying two 4x4 FP16 matrices together, and then adding to a FP16/FP32 4x4 matrix to generate a final 4x4 FP16/FP32 matrix. By fusing matrix multiplication and add in one unit, the GPU can achieve high FLOPS for this operation. A single Tensor Core performs the equivalent of 64 FMA operations per clock (for 128 FLOPS total), and with 8 such cores per Streaming Multiprocessor (SM), 1024 FLOPS per clock per SM. By comparison, even with pure FP16 operations, the standard CUDA cores in a SM only generate 256 FLOPS per clock. So in scenarios where these cores can be used, V100 is able to deliver 4x the performance versus P100. The detailed comparison between V100 and P100 is in Table 1.
As in our previous deep learning blog, we still use the three most popular deep learning frameworks: NVIDIA’s fork of Caffe (NV-Caffe), MXNet and TensorFlow. Both NV-Caffe and MXNet have been optimized for V100. TensorFlow still does not have any official release to support V100, but we applied some patches obtained from TensorFlow developers so that it is also optimized for V100 in these tests. For the dataset, we still use ILSVRC 2012 dataset whose training set contains 1281167 training images and 50000 validation images. For the testing neural network, we chose Resnet50 as it is a computationally intensive network. To get best performance, we used CUDA 9-rc compiler and CUDNN library in all of the three frameworks since they are optimized for V100. The testing platform is Dell EMC’s PowerEdge C4130 server. The C4130 server has multiple configurations, and we evaluated both P100-PCIe in configuration G and P100-SXM2 in configuration K. The difference between configuration G and configuration K is shown in Figure 1. There are mainly two differences: one is that configure G has two x16 PCIe link connecting from dual CPUs to the four GPUs, while configure K has only one x16 PCIe bus connecting from one CPU to four GPUs; another difference is that GPUs are connected by PCIe buses in configure G but by NVLink in configure K. The other hardware and software details are shown in Table 2.
Figure 1: Comparison between configure G and configure K
Table 2: The hardware configuration and software details
In this experiment, we trained various deep learning frameworks with one pass on the whole dataset since we were comparing only the training speed, not the training accuracy. Other important input parameters for different deep learning frameworks are listed in Table 3. For NV-Caffe and MXNet, in terms of different batch size, we doubled the batch size for FP16 tests since FP16 consumes half the memory for floating points as FP32. As TensorFlow does not support FP16 yet, we did not evaluate its FP16 performance in this blog. Because of different implementations, NV-Caffe consumes more memory than MXNet and TensorFlow for the same neural network, the batch size in FP32 mode is only half of that in MXNet and TensorFlow. In NV-Caffe, if FP16 is used, then the data type of several parameters need to be changed. We explain these parameters as follows: the solver_data_type controls the data type for master weights; the default_forward_type and default_backward_type controls the data type for training values; the default_forward_math and default_backward_math controls the data type for matrix-multiply accumulator. In this blog we used FP16 for training values, FP32 for matrix-multiply accumulator and FP32 for master weights. We will explore other combinations in our future blogs. In MXNet, we tried different values for the parameter “--data-nthreads” which controls the number of threads for data decoding.
Table 3: Input parameters used in different deep learning frameworks
Figure 1, Figure 2, and Figure 3 show the performance of V100 versus P100 with NV-Caffe, MXNet and TensorFlow, respectively. And Table 4 shows the performance improvement of V100 compared to P100. From these results, we can obtain the following conclusions:
In both PCIe and SXM2 versions, V100 is >40% faster than P100 in FP32 for both NV-Caffe and MXNet. This matches the theoretical speedup. Because FP32 is single precision floating points, and V100 is 1.5x faster than P100 in single precision. With TensorFlow, V100 is more than 30% faster than P100. Its performance improvement is lower than the other two frameworks and we think that is because of different algorithm implementations in these frameworks.
In both PCIe and SXM2 versions, V100 is >2x faster than P100 in FP16. Based on the specification, V100 tensor performance is ~6x than P100 FP16. The reason that the actual speedup does not match the theoretical speedup is that not all data are stored in FP16 and so not all operations are tensor operations (the FMA matrix multiply and add operation).
In V100, the performance of FP16 is close to 2x than that of FP32. This is because FP16 only requires half storage compared to FP32 and therefore we could double the batch size in FP16 to improve the computation speed.
In MXNet, we set the “--data-nthreads” to 16 instead of the default value 4. The default value is often sufficient to decode more than 1K images per second but still not fast enough for V100 GPU. In our testing, we found the default value 4 is enough for P100 but for V100 we need to set it at least 12 to achieve good performance, with a value of 16 being ideal.
Figure 2: Performance of V100 vs P100 with NV-Caffe
Figure 3: Performance of V100 vs P100 with MXNet
Figure 4: Performance of V100 vs P100 with TensorFlow
Table 4: Improvement of V100 compared to P100
Since V100 supports both deep learning training and inference, we also tested the inference performance with V100 using the latest TensorRT 3.0.0. The testing was done in FP16 mode on both V100-SXM2 and P100-PCIe and the result is shown in Figure 5. We used batch size 39 for V100 and 10 for P100. Different batches were chosen to make their inference latencies are close to each other (~7ms in the figure). The result shows that when their latencies are close, the inference throughput of V100 is 3.7x faster compared to P100.
Figure 5: Resnet50 inference performance on V100 vs P100
After evaluating the performance of V100 with three popular deep learning frameworks, we conclude that in training V100 is more than 40% faster than P100 in FP32 and more than 100% faster in FP16, and in inference V100 is 3.7x faster than P100. This demonstrates the performance benefits when the V100 tensor cores are used. In the future work, we will evaluate different data type combinations in FP16 and study the accuracy impact with FP16 in deep learning training. We will also evaluate the TensorFlow with FP16 once support is added into the software. Finally, we plan to scale the training to multiple nodes with these frameworks.
Garima Kochhar, Kihoon Yoon, Joshua Weage. HPC Innovation Lab. 25 Sep 2017.
The document below presents performance results on Skylake based Dell EMC 14th generation systems for a variety of HPC benchmarks and applications (Stream, HPL, WRF, BWA, Ansys Fluent, STAR-CCM+ and LS-DYNA). It compares the performance of these new systems to previous generations, going as far back as Westmere (Intel Xeon 5500 series) in Dell EMC's 11th generation servers, showing the potential improvements when moving to this latest instance of server technology.