Dell Social Media | Webinar Series
Thanks to everyone that tuned in for this month’s technical support webinar. Don’t worry if you missed the live session though!
We’ve uploaded a recording of the webinar to our support page so you can share it and catch up.
You can also download the deck used in the presentation <Here>.
For all those who were able to join, we hope you found our presentation both enjoyable and informative. To learn more about our Technical Support Webinar series and sign up for future events, visit dell.com/webcast.
For additional BitLocker support, below are a few of the most popular solutions we’ve identified.
In this quick 30-minute session we’ll take you through an introduction to BitLocker, along with some basic and advanced troubleshooting tips and tricks that answer the questions we’ve seen trending on social media, specifically regarding BitLocker key prompts.
Webinar | Archive
Want to learn more? You can find recordings of our previous Webinars below and at Dell.com/webcast.
Create & Use Dell Windows 10 Media
Here we discuss how to create bootable USB media for the installation of Windows 10 on your Dell computer, general operating system installation advice, and how to use command prompt for driver installation (pre-OS).
Ubuntu Basics & Installation
In this quick 40-minute session we’ll take you through the steps to complete an Ubuntu OS installation while keeping your current Windows install and all your data intact. We’ll also provide a brief introduction to Ubuntu and talk with Barton George, who will share his insights into Dell’s Project Sputnik Ubuntu collaboration.
Thunderbolt & TB16 Troubleshooting
In this webinar we take you through some of the intricacies relating to Thunderbolt technology, including technical specification overviews and comparisons, along with Docking/Adapter hardware solutions available from Dell that will help you best take advantage of the technology.
Have a topic that you’d be interested to see discussed?
Submit your suggestions to us @DellCaresPRO
Dell EMC | Social Media
Updated monthly, this publication provides you with new and recently revised information and is organized in the following categories: Documentation, Notifications, Patches, Product Life Cycle, Release, Knowledge Base Articles.
Subscribe to the RSS (Use IE only)
226670 - Mandatory Hotfix 653995 for 8.6 MR3 Connection Broker
This mandatory hotfix addresses the following issues: Broker CPU usage has increased and log file size...
Created: March 1, 2017
226832 - Users getting Your active session has expired when trying to log in to web portal after upgrade
After upgrading vWorkspace to 8.6.3 users immediately receive the message "Your active session has expired. Please log in to continue" and are...
Created: March 6, 2017
227674 - Video: How to Configure vWorkspace Web Access for Defender two factor authentication
Created: March 27, 2017
227676 - Quest Defender and Quest Desktop Virtualization integration
227768 - What are the subversion for the different Service Packs of vWorkspace 8.6.x
vWorkspace in its version 8.6 has 3 different Service Packs on top of its Main Release Those are however not the version number you will...
Created: March 29, 2017
227781 - How to set the minimum memory with HyperV
The Hyper-V role in Windows 2012 has an improved Dynamic memory feature that adds a property called Minimum Memory. This allows you to specify a...
227791 - Broker service will not start after server updates.
Server updates were performed on the vWorkspace Connection Broker and now the broker service will not start.
137081 - Are Generation 2 Hyper-V Virtual machines supported in vWorkspace
When attempting to create a template machine, the following error is seen when trying to install the instant provisioning tools: Floppy disk...
Revised: March 2, 2017
225565 - Is VMware 6.5 currently supported?
Is VMware vSphere 6.5 supported in any of the current versions of vWorkspace?
Revised: March 7, 2017
This mandatory hotfix addresses the following issues: Broker CPU usage has increased and when logging...
Revised: March 17, 2017
155517 - What are the Requirements for a Connector Broker 8.5?
What are the requirements to deploy the Connection Broker properly?
Revised: March 22, 2017
Product Life Cycle -vWorkspace
Revised: March 2017
This article presents performance comparisons of several typical MPI applications — LAMMPS, WRF, OpenFOAM, and STAR-CCM+ — running on a traditional, bare-metal HPC cluster versus a virtualized cluster running VMware’s vSphere virtualization platform. The tests were performed on a 32-node, EDR-connected Dell PowerEdge C6320 cluster, located in the Dell EMC HPC Innovation Lab in Austin, Texas. In addition to performance results, virtual cluster architecture and configuration recommendations for optimal performance are described.
Interest in HPC virtualization and cloud has grown rapidly. While much of the interest stems from gaining the general value of cloud technologies, there are specific benefits of virtualizing HPC and supporting it in a cloud environment, such as centralized operation, cluster resource sharing, research environment reproducibility, multi-tenant data security, fault isolation and resiliency, dynamic load balancing, and efficient power management. Figure 1 illustrates several HPC virtualization benefits.
Despite the potential benefits of moving HPC workloads to a private, public, or hybrid cloud, performance concerns have been a barrier to adoption. We focus here on the use of on-premises, private clouds for HPC, environments in which appropriate tuning can be applied to deliver maximum application performance. HPC virtualization performance is primarily determined by two factors: hardware virtualization support and virtual infrastructure capability. With advances in both VMware vSphere and x86 microprocessor architecture, throughput applications can generally run at close to full speed in the VMware virtualized environment, with less than 5% performance degradation compared to native, and often just 1–2%. MPI applications are by nature more challenging: they require sustained and intensive communication between nodes, which makes them sensitive to interconnect performance. With our continued performance optimization efforts, we see decreasing overheads when running these challenging HPC workloads, and this blog post presents some MPI results as examples.
Figure 1: Illustration of several HPC virtualization benefits
As illustrated in Figure 2, the testbed consists of 32 Dell PowerEdge C6320 compute nodes and one management node. vCenter, the vSphere management component, as well as NFS and DNS run in virtual machines (VMs) on the management node. VMware DirectPath I/O technology (i.e., passthrough mode) is used to allow the guest OS (the operating system running within a VM) to directly access the EDR InfiniBand device, which shortens the message delivery path by bypassing the network virtualization layer and delivers the best performance. Native tests were run using CentOS on each host, while virtual tests were run with the VMware ESXi hypervisor running on each host along with a single virtual machine running the same CentOS version.
Figure 2: Testbed Virtual Cluster Architecture
Table 1 shows all cluster hardware and software details, and Table 2 shows a summary of BIOS and vSphere settings.
Table 1: Cluster Hardware and Software Details
Dell PowerEdge C6320
Dual 10-core Intel Xeon E5-2660 v3 @ 2.6GHz (Haswell)
Mellanox ConnectX-4 VPI adapter card; EDR IB (100Gb/s)
vCenter management server
BIOS, Firmware and OS
OS Distribution (virtual and native)
OFED and MPI
(LAMMPS, WRF and OpenFOAM)
Intel MPI (STAR-CCM+)
Table 2: BIOS and vSphere Settings
Performance Per Watt (OS)
ESXi power policy
Enabled for EDR InfiniBand
20 virtual CPUs, 100GB memory
Virtual NUMA topology (vNUMA)
Auto detected (default)
CPU Scheduler affinity
Figures 3-6 show native versus virtual performance ratios with the settings in Table 2 applied. A value of 1.0 means that virtual performance is identical to native. Applications were benchmarked using a strong scaling methodology — problem sizes remained constant as job sizes were scaled. In the Figure legends, ‘nXnpY’ indicates a test run on X nodes using a total of Y MPI ranks. Benchmark problems were selected to achieve reasonable parallel efficiency at the largest scale tested. All MPI processes were consecutively mapped from node 1 to node 32.
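As a small illustration of the legend convention and the ratio described above, the sketch below parses an ‘nXnpY’ label and computes a native-versus-virtual ratio. It assumes performance is measured as inverse elapsed time, so the ratio is native time divided by virtual time; the function names and the sample timings are hypothetical and are not taken from the benchmark scripts.

```python
import re
from typing import Tuple


def parse_run_label(label: str) -> Tuple[int, int]:
    """Split a legend label such as 'n32np640' into (nodes, total MPI ranks)."""
    match = re.fullmatch(r"n(\d+)np(\d+)", label)
    if match is None:
        raise ValueError(f"unrecognized run label: {label!r}")
    return int(match.group(1)), int(match.group(2))


def ranks_per_node(label: str) -> float:
    """Average number of MPI ranks placed on each node for a given label."""
    nodes, ranks = parse_run_label(label)
    return ranks / nodes


def performance_ratio(native_seconds: float, virtual_seconds: float) -> float:
    """Virtual performance relative to native (1.0 means identical).

    Assumption: performance is the inverse of elapsed time, so a slower
    virtual run yields a ratio below 1.0.
    """
    return native_seconds / virtual_seconds


if __name__ == "__main__":
    print(parse_run_label("n32np640"))      # (32, 640)
    print(ranks_per_node("n32np640"))       # 20.0
    # Hypothetical timings, for illustration only.
    print(performance_ratio(100.0, 125.0))
```

With 20 physical cores per node, ‘n32np640’ corresponds to one MPI rank per core on all 32 nodes.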
As can be seen from the results, the majority of tests show degradations under 5%, though overheads increase as we scale. At the highest scale tested (n32np640), performance degradation varies by application and benchmark problem, with the largest degradation seen with LAMMPS atomic fluid (25%) and the smallest seen with STAR-CCM+ EmpHydroCyclone_30M (6%). Single-node STAR-CCM+ results are anomalous and currently under study. As we continue our performance optimization work, we expect to report better and more scalable results in the future.
Figure 3: LAMMPS native vs. virtual performance. Higher is better.
Figure 4: WRF native vs. virtual performance. Higher is better.
Figure 5: OpenFOAM native vs. virtual performance. Higher is better.
Figure 6: STAR-CCM+ native vs. virtual performance. Higher is better.
The following configurations are suggested to achieve optimal virtual performance for HPC. More comprehensive vSphere performance guidance is available in VMware’s published performance documentation.
MPI workloads are CPU-heavy and can make use of all cores, thus requiring a large VM. However, CPU or memory overcommitment would greatly impact performance. In our tests, each VM is configured with 20 vCPUs, using all physical cores, and 100 GB of fully reserved memory, leaving some memory free to cover the ESXi hypervisor’s memory overhead.
There are four ESXi power management policies: “High Performance”, “Balanced” (default), “Low Power” and “Custom”. Although the “High Performance” policy slightly increases the performance of latency-sensitive workloads, in situations in which a system’s load is low enough to allow Turbo to operate, it prevents the system from going into C/C1E states, which reduces the benefit of Turbo boost. The “Balanced” power policy reduces host power consumption while having little or no impact on performance. We recommend using this default.
Virtual NUMA (vNUMA) exposes NUMA topology to the guest OS, allowing NUMA-aware OSes and applications to make efficient use of the underlying hardware. This is an out-of-the-box feature in vSphere.
Virtualization holds promise for HPC, offering new capabilities and increased flexibility beyond what is available in traditional, unvirtualized environments. These values are only useful, however, if high performance can be maintained. In this short post, we have shown that performance degradations for a range of common MPI applications can be kept under 10%, with our highest scale testing showing larger slowdowns in some cases. With throughput applications running at very close to native speeds, and with the results shown here, it is clear that virtualization can be a viable and useful approach for a variety of HPC use-cases. As we continue to analyze and address remaining sources of performance overhead, the value of the approach will only continue to expand.
If you have any technical questions regarding VMware HPC virtualization, please feel free to contact us!
These results have been produced in collaboration with our Dell Technology colleagues in the Dell EMC HPC Innovation Lab who have given us access to the compute cluster used to produce these results and to continue our analysis of remaining performance overheads.
Na Zhang is a member of the technical staff working on HPC within VMware’s Office of the CTO. Her current focus is on the performance of, and solutions for, HPC virtualization. Na has a Ph.D. degree in Applied Mathematics from Stony Brook University. Her research primarily focused on the design and analysis of parallel algorithms for large- and multi-scale simulations running on supercomputers.
Authors: Rengan Xu, Frank Han and Nishanth Dandapanthu. Dell EMC HPC Innovation Lab. Mar. 2017
Introduction to P40 GPU and TensorRT
Deep Learning (DL) has two major phases: training and inference/testing/scoring. The training phase builds a deep neural network (DNN) model from a large amount of existing data. The inference phase uses the trained model to make predictions from new data. Inference can be done in the data center, in embedded systems, and in automotive and mobile devices. Usually inference must respond to user requests as quickly as possible (often in real time). To meet the low-latency requirement of inference, NVIDIA® launched the Tesla® P4 and P40 GPUs. Aside from high floating point throughput and efficiency, both GPUs introduce two new optimized instructions designed specifically for inference computations: the 8-bit integer (INT8) 4-element vector dot product (DP4A) and the 16-bit 2-element vector dot product (DP2A). Deep learning researchers have found that FP16 can achieve the same inference accuracy as FP32, and many applications only require INT8 or lower precision to keep an acceptable inference accuracy. Tesla P4 delivers a peak of 21.8 INT8 TIOP/s (Tera Integer Operations per Second), while P40 delivers a peak of 47.0 INT8 TIOP/s. This blog focuses only on the P40 GPU.
TensorRT™, previously called GIE (GPU Inference Engine), is a high performance deep learning inference engine for production deployment of deep learning applications that maximizes inference throughput and efficiency. TensorRT lets users take advantage of the fast reduced precision instructions provided in the Pascal GPUs. TensorRT v2 supports the INT8 reduced precision operations that are available on the P40.
This blog quantifies the performance of deep learning inference using TensorRT on Dell’s PowerEdge C4130 server, which is equipped with 4 Tesla P40 GPUs. Since TensorRT is only available for Ubuntu OS, all the experiments were done on Ubuntu. Table 1 shows the hardware and software details. The inference benchmark we used was giexec from the TensorRT sample code. This sample uses synthetic images filled with random non-zero numbers to simulate real images. Two classic neural networks were tested: AlexNet (2012 ImageNet winner) and GoogLeNet (2014 ImageNet winner), which is much deeper and more complicated than AlexNet.
We measured the inference performance in images/sec, i.e. the number of images that can be processed per second. To measure the performance improvement of the current generation GPU, the P40, we also compared its performance with the previous generation GPU, the M40. The most important goal of this testing is to measure the inference performance in INT8 mode compared to FP32 mode. The P40 uses the new Pascal architecture and supports the new INT8 instructions. The previous generation M40 uses the Maxwell architecture and does not support INT8 instructions. The theoretical INT8 and FP32 performance of both the M40 and the P40 is shown in Table 2. We measured FP32 performance on both devices and both FP32 and INT8 performance on the P40.
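To make the DP4A operation concrete, here is a minimal Python sketch of what a 4-element INT8 dot product with accumulation computes. It only emulates the arithmetic; the function name `dp4a_emulated` is hypothetical, and 32-bit overflow behavior is not modeled because Python integers are unbounded.

```python
from typing import Sequence


def dp4a_emulated(a: Sequence[int], b: Sequence[int], acc: int = 0) -> int:
    """Emulate a 4-element INT8 dot product added to an integer accumulator.

    Each input holds four values in the signed 8-bit range [-128, 127].
    The products are summed and added to `acc`, which on the GPU would be
    a 32-bit integer (overflow is not modeled in this sketch).
    """
    if len(a) != 4 or len(b) != 4:
        raise ValueError("DP4A operates on exactly four elements per operand")
    for value in list(a) + list(b):
        if not -128 <= value <= 127:
            raise ValueError(f"{value} does not fit in a signed 8-bit integer")
    return acc + sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    print(dp4a_emulated([1, 2, 3, 4], [5, 6, 7, 8]))          # 70
    print(dp4a_emulated([1, 2, 3, 4], [5, 6, 7, 8], acc=10))  # 80
    # Largest positive case: the result needs far more than 8 bits,
    # which is why the accumulator is wider than the inputs.
    print(dp4a_emulated([127] * 4, [127] * 4))                # 64516
```

One such operation performs four multiplications and accumulates them in a single step, which is where INT8's theoretical throughput advantage over FP32 comes from.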
Table 1: Hardware configuration and software details
PowerEdge C4130 (configuration G)
2 x Intel Xeon CPU E5-2690 v4 @2.6GHz (Broadwell)
256GB DDR4 @ 2400MHz
4x Tesla P40 with 24GB GPU memory
Software and Firmware
CUDA and driver version
Table 2: Comparison between Tesla M40 and P40
In this section, we will present the inference performance with TensorRT on GoogLeNet and AlexNet. We also implemented the benchmark with MPI so that it can be run on multiple P40 GPUs within a node. We will also compare the performance of P40 with M40. Lastly we will show the performance impact when using different batch sizes.
Figure 1 shows the inference performance with TensorRT library for both GoogLeNet and AlexNet. We can see that INT8 mode is ~3x faster than FP32 in both neural networks. This is expected since the theoretical speedup of INT8 is 4x compared to FP32 if only multiplications are performed and no other overhead is incurred. However, there are kernel launches, occupancy limits, data movement and math other than multiplications, so the speedup is reduced to about 3x faster.
Figure 1: Inference performance with TensorRT library
Dell’s PowerEdge C4130 supports up to 4 GPUs in a server. To make use of all GPUs, we implemented the inference benchmark using MPI so that each MPI process runs on its own GPU. Figure 2 and Figure 3 show the multi-GPU inference performance on GoogLeNet and AlexNet, respectively. When using multiple GPUs, linear speedup was achieved for both neural networks. This is because each GPU processes its own images and there is no communication or synchronization among the GPUs.
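The one-process-per-GPU layout can be sketched in a few lines. This is not the benchmark code, which is not shown in the post; it is a plain-Python illustration in which the rank-to-GPU rule (`local_rank % gpus_per_node`) and all function names are assumptions, and the throughput figures are placeholders.

```python
from typing import List


def gpu_for_rank(local_rank: int, gpus_per_node: int = 4) -> int:
    """Pick the GPU index an MPI process would bind to on its node.

    Assumption: ranks on a node are numbered 0..N-1 and mapped round-robin.
    """
    if gpus_per_node <= 0:
        raise ValueError("gpus_per_node must be positive")
    return local_rank % gpus_per_node


def aggregate_throughput(per_gpu_images_per_sec: List[float]) -> float:
    """Total images/sec when every GPU processes its own images independently."""
    return sum(per_gpu_images_per_sec)


def scaling_efficiency(per_gpu_images_per_sec: List[float],
                       single_gpu_images_per_sec: float) -> float:
    """Measured aggregate divided by ideal linear scaling (1.0 = linear)."""
    ideal = single_gpu_images_per_sec * len(per_gpu_images_per_sec)
    return aggregate_throughput(per_gpu_images_per_sec) / ideal


if __name__ == "__main__":
    print([gpu_for_rank(rank) for rank in range(4)])  # [0, 1, 2, 3]
    # Placeholder numbers, not measured results.
    rates = [100.0, 100.0, 100.0, 100.0]
    print(aggregate_throughput(rates), scaling_efficiency(rates, 100.0))
```

Because no process waits on another, the aggregate is simply the sum of the per-GPU rates, which is why the scaling comes out linear.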
Figure 2: Multi-GPU inference performance with TensorRT GoogLeNet
Figure 3: Multi-GPU inference performance with TensorRT AlexNet
To highlight the performance advantage of the P40 GPU and its native support for INT8, we compared the inference performance of the P40 with that of the previous generation GPU, the M40. The results are shown in Figure 4 and Figure 5 for GoogLeNet and AlexNet, respectively. In FP32 mode, the P40 is 1.7x faster than the M40, and INT8 mode on the P40 is 4.4x faster than FP32 mode on the M40.
Figure 4: Inference performance comparison between P40 and M40 (GoogLeNet)
Figure 5: Inference performance comparison between P40 and M40 (AlexNet)
Deep learning inference can be applied in different scenarios. Some scenarios require a large batch size and some require no batching at all (i.e. a batch size of 1). We therefore also measured the performance with different batch sizes, and the result is shown in Figure 6. Note that the purpose here is not to compare the performance of GoogLeNet and AlexNet, but to check how the performance changes with batch size for each neural network. It can be seen that without batch processing the inference performance is very low. This is because the GPU is not given enough work to keep it busy. The larger the batch size, the higher the inference performance, although the rate of improvement slows down. At a batch size of 4096, GoogLeNet stopped running because the GPU memory required for this neural network exceeded the GPU memory limit. AlexNet was still able to run because it is a less complicated neural network than GoogLeNet and therefore requires less GPU memory. The largest batch size is thus limited only by GPU memory.
Figure 6: Inference performance with different batch sizes
Conclusions and Future Work
In this blog, we presented the deep learning inference performance of the NVIDIA® TensorRT library on P40 and M40 GPUs. INT8 mode on the P40 is about 3x faster than FP32 mode on the P40 and 4.4x faster than FP32 mode on the previous generation M40. Using multiple GPUs increases inference performance linearly because no communication or synchronization between GPUs is needed. We also found that a larger batch size leads to higher inference performance, and that the largest batch size is limited only by GPU memory size. In future work, we will evaluate inference performance with real-world deep learning applications.
- by Matt Halsey
Have a connected home? Have an internet connection? Then you too can have a conversation with Chinese website Baidu.
Huge Vulnerability Discovered in the Ring Doorbell
This article highlights the intrinsic need for a means to secure IoT devices.
It was only a few months ago that the Mirai botnet, using home video surveillance cameras, was able to launch the largest DDoS attack in history.
Read the article.
Then you can read the comments from someone claiming to be the head of security at Ring, named Matt, here (italics added):
Hi I'm the VP of Security at Ring and I thought it might be helpful to give you all some background on what you are seeing.
Occasionally at the end of live call or motion, we will lose connectivity. Rather than abandoning the entire call, we send the last few audio packets that are corrupted anyway to a non-routable address on a protocol no one uses. The right way to do that is to use a virtual interface or the loopback to discard the packets. The choice to send it to somewhere across the world and let the ISP deal with blocking is a poor design choice that the teams on working on addressing ASAP.
From a risk/disclosure perspective, it's relatively benign but like the everyone else, when my team first saw it in the wild we had similar concerns.
i will circle back when we have updated firmware.
Ring Pro doorbell - calling China?
So what to do:
1. Go to Industrial Internet Consortium and see how Dell and EMC, now Dell|EMC and Dell Technologies are helping to secure the IoT world.
2. Realize that IoT is in its infancy if not earlier where security is concerned....like when we used to leave Telnet, TFTP, and FTP ports open on our internet facing servers....
3. Be ready to help our customers understand that encryption, especially our products, can help protect them when vendors of IoT devices don't finish their job in securing the devices.
By Brett Roberts with Debra Slapak
The amount of machine-generated data being created each day is massive and--as we all know--can be extremely valuable. Insights extracted from this data have the potential to help you improve operational efficiency, customer experience, security and much more. But getting started can present real challenges and really big questions, such as "How do we consolidate all of this complex data and analyze it to deliver actionable insights?" Dell EMC works with Splunk to address these challenges and simplify those first steps.
Splunk’s proven platform for real-time operational intelligence helps reduce the complexity of harnessing machine-generated data by providing users with an end-to-end platform to collect, search, analyze and visualize this data. For the Splunk platform to be used to its full potential, organizations need infrastructure that meets or exceeds Splunk’s reference architecture specifications. Dell EMC has partnered with Splunk to create highly-optimized and powerful solutions that help solve machine-generated data challenges. Read more in a recently posted blog about how Splunk and Dell EMC can help you on your journey to valuable insights with machine-generated data.
Authors: Rengan Xu, Frank Han and Nishanth Dandapanthu. Dell EMC HPC Innovation Lab. Feb 2017
Introduction to P100-PCIe GPU
This blog describes the performance analysis of NVIDIA® Tesla® P100™ GPUs on a cluster of Dell PowerEdge C4130 servers. There are two types of P100 GPUs: PCIe-based and SXM2-based. In PCIe-based servers, GPUs are connected by PCIe buses and one P100 delivers around 4.7 and 9.3 TeraFLOPS of double and single precision performance, respectively. In P100-SXM2, GPUs are connected by NVLink and one P100 delivers around 5.3 and 10.6 TeraFLOPS of double and single precision performance, respectively. This blog focuses on P100 for PCIe-based servers, i.e. P100-PCIe. We have already analyzed the P100 performance for several deep learning frameworks in this blog. The objective of this blog is to compare the performance of HPL, LAMMPS, NAMD, GROMACS, HOOMD-BLUE, Amber, ANSYS Mechanical and RELION. The hardware configuration of the cluster is the same as in the deep learning blog. Briefly, we used a cluster of four C4130 nodes. Each node has dual Intel Xeon E5-2690 v4 CPUs and four NVIDIA P100-PCIe GPUs, and all nodes are connected with EDR Infiniband. Table 1 shows the detailed information about the hardware and software used in every compute node.
Table 1: Experiment Platform and Software Details
P100-PCIe with 16GB GPU memory
Mellanox ConnectX-4 VPI (EDR 100Gb/s Infiniband)
RHEL 7.2 x86_64
Linux Kernel Version
CUDA version and driver
CUDA 8.0.44 (375.20)
High Performance Linpack (HPL)
HPL is a multicomputer parallel application that measures how fast computers solve a dense n by n system of linear equations using LU decomposition with partial row pivoting, and it is designed to be run at very large scale. HPL on the tested cluster uses double precision floating point operations. Figure 1 shows the HPL performance on the tested P100-PCIe cluster. It can be seen that 1 P100 is 3.6x faster than 2 x E5-2690 v4 CPUs. HPL also scales very well with more GPUs within nodes or across nodes. Recall that each server holds 4 P100 GPUs, so 8, 12 and 16 P100 GPUs correspond to 2, 3 and 4 servers. 16 P100 GPUs achieve a speedup of 14.9x compared to 1 P100. Note that the overall efficiency is calculated as: HPL Efficiency = rMax / (CPUs rPeak + GPUs rPeak), where rPeak is the highest theoretical FLOPS result that could be achieved at the base clock, and rMax is the number reported by HPL, i.e. the real performance that can be achieved. HPL cannot be run at the max boost clock. It typically runs at some clock in between, but the average is closer to the base clock than to the max boost clock. That is why we used the base clock for the rPeak calculation. Although we included CPU rPeak in the efficiency calculation, when running HPL on P100 we set DGEMM_SPLIT=1.0, which means the CPU does not really contribute to the DGEMM and therefore does not contribute many FLOPS. We observed that the CPUs stayed fully utilized, but they were just handling the overhead and data movement needed to keep the GPUs fed. What matters most for P100 is that rMax is very large.
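The efficiency formula above, together with the multi-GPU scaling figure, can be written out as a short sketch. The function names are hypothetical, and the rMax and rPeak values in the example are placeholders rather than measurements; only the 14.9x speedup on 16 GPUs comes from the text.

```python
def hpl_efficiency(r_max: float, cpu_r_peak: float, gpu_r_peak: float) -> float:
    """HPL Efficiency = rMax / (CPUs rPeak + GPUs rPeak).

    All three values must use the same unit (for example TFLOPS), with
    rPeak computed at the base clock as described in the text.
    """
    total_peak = cpu_r_peak + gpu_r_peak
    if total_peak <= 0:
        raise ValueError("combined rPeak must be positive")
    return r_max / total_peak


def scaling_efficiency(speedup: float, num_gpus: int) -> float:
    """Fraction of ideal linear scaling achieved relative to one GPU."""
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    return speedup / num_gpus


if __name__ == "__main__":
    # From the text: 16 P100 GPUs are 14.9x faster than 1 P100.
    print(f"{scaling_efficiency(14.9, 16):.1%}")  # 93.1%
    # Placeholder rMax/rPeak values, for illustration only.
    print(hpl_efficiency(r_max=3.0, cpu_r_peak=1.0, gpu_r_peak=4.0))
```

Including the CPU rPeak in the denominator lowers the reported efficiency when, as here, the CPUs contribute little to the DGEMM.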
Figure 1: HPL performance on P100-PCIe
NAMD (for NAnoscale Molecular Dynamics) is a molecular dynamics application designed for high-performance simulation of large biomolecular systems. The dataset we used is Satellite Tobacco Mosaic Virus (STMV), a small, icosahedral plant virus that worsens the symptoms of infection by Tobacco Mosaic Virus (TMV). This dataset has 1,066,628 atoms and is the largest dataset on the NAMD utilities website. The performance metric in the output log of this application is “days/ns” (the lower the better), but its inverse, “ns/day”, is used in our plot since that is what most molecular dynamics users focus on. The average of all occurrences of this value in the output log was used. Figure 2 shows the performance within 1 node. It can be seen that the performance with 2 P100 GPUs is better than with 4 P100 GPUs. This is probably because of the communication among different CPU threads. The application launches a set of worker threads that handle the computation and communication threads that handle the data communication. As more GPUs are used, more communication threads are used and more synchronization is needed. In addition, based on the profiling result from nvprof, NVIDIA’s CUDA profiler, with 1 P100 the GPU computation takes less than 50% of the whole application time. According to Amdahl’s law, the speedup with more GPUs will be limited by the other 50% of the work that is not parallelized on the GPU. Based on this observation, we further ran this application on multiple nodes with two different settings (2 GPUs/node and 4 GPUs/node), and the result is shown in Figure 3. The result shows that no matter how many nodes are used, the performance of 2 GPUs/node is always better than that of 4 GPUs/node. Within a node, 2 P100 GPUs are 9.5x faster than dual CPUs.
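The Amdahl’s law argument can be checked numerically. The sketch below takes 50% as the GPU-parallelizable fraction, which is the upper bound implied by the profiling result (the text says GPU computation takes less than 50%); the function names are hypothetical.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Amdahl's law: speedup = 1 / ((1 - p) + p / n)."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


def amdahl_limit(parallel_fraction: float) -> float:
    """Upper bound on speedup as the number of workers grows without limit."""
    if not 0.0 <= parallel_fraction < 1.0:
        raise ValueError("parallel_fraction must be in [0, 1)")
    return 1.0 / (1.0 - parallel_fraction)


if __name__ == "__main__":
    for gpus in (1, 2, 4):
        print(gpus, round(amdahl_speedup(0.5, gpus), 3))
    print("limit:", amdahl_limit(0.5))  # limit: 2.0
```

Even in this best case the speedup over 1 GPU cannot exceed 2x, and going from 2 to 4 GPUs adds little. Amdahl’s law alone does not predict a slowdown, so the drop observed with 4 GPUs points to the extra communication and synchronization cost described above.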
Figure 2: NAMD Performance within 1 P100-PCIe node
Figure 3: NAMD Performance across Nodes
GROMACS (for GROningen MAchine for Chemical Simulations) primarily does simulations for biochemical molecules (bonded interactions). But because of its efficiency in calculating non-bonded interactions (atoms not linked by covalent bonds), the user base is expanding to non-biological systems. Figure 4 shows the performance of GROMACS on CPU, K80 GPUs and P100-PCIe GPUs. Since one K80 has two internal GPUs, from now on “one K80” always refers to both internal GPUs rather than one of them. When testing with K80 GPUs, the same servers used for the P100-PCIe tests were used. Therefore, the CPUs and memory were kept the same and the only difference is that the P100-PCIe GPUs were replaced with K80 GPUs. In all tests, there were four GPUs per server and all GPUs were utilized. For example, the 3-node data point uses 3 servers and 12 GPUs in total. Going from 1 node to 4 nodes, P100-PCIe is 4.2x to 2.8x faster than CPU and 1.5x to 1.1x faster than K80.
Figure 4: GROMACS Performance on P100-PCIe
LAMMPS (for Large Scale Atomic/Molecular Massively Parallel Simulator) is a classic molecular dynamics code, capable of simulations for solid-state materials (metals, semi-conductors), soft matter (biomolecules, polymers) and coarse-grained or mesoscopic systems. It can be used to model atoms or more generically as a parallel particle simulator at the atomic, meso or continuum scale. The dataset we used was LJ (Lennard-Jones liquid benchmark) which contains 512000 atoms. There are two GPU implementations in LAMMPS: GPU library version and kokkos version. In the experiment, we used kokkos version since it was much faster than the GPU library version.
Figure 5 shows LAMMPS performance on CPU and P100-PCIe GPUs. Using 16 P100 GPUs is 5.8x faster than using 1 P100. The reason that this application did not scale linearly is that the data transfer time (CPU->GPU, GPU->CPU and GPU->GPU) increases as more GPUs are used, although the computation time decreases linearly. The data transfer time increases because this application requires data communication among all GPUs used. However, the configuration G we used only allows Peer-to-Peer (P2P) access for two pairs of GPUs: GPU 1 - GPU 2 and GPU 3 - GPU 4. GPU 1/2 cannot communicate with GPU 3/4 directly. If such communication is needed, the data must go through the CPU, which slows the communication. Configuration B eases this issue as it allows P2P access among all four GPUs within a node. The comparison between configuration G and configuration B is shown in Figure 6. By running LAMMPS on a configuration B server with 4 P100 GPUs, the performance metric “timesteps/s” improved to 510 compared to 505 in configuration G, a 1% improvement. The improvement is not significant because data communication takes less than 8% of the whole application time when running on configuration G with 4 P100 GPUs. Figure 7 compares the performance of P100-PCIe with that of CPU and K80 GPUs for this application. It shows that within 1 node, 4 P100-PCIe GPUs are 6.6x faster than 2 E5-2690 v4 CPUs and 1.4x faster than 4 K80 GPUs.
Figure 5: LAMMPS Performance on P100-PCIe
Figure 6 : Comparison between Configuration G and Configuration B
Figure 7: LAMMPS Performance Comparison
HOOMD-blue (for Highly Optimized Object-oriented Many-particle Dynamics - blue) is a general purpose molecular dynamics simulator. Figure 8 shows the HOOMD-blue performance. Note that the y-axis is in logarithmic scale. It is observed that 1 P100 is 13.4x faster than dual CPUs. The speedup of using 2 P100 GPUs is 1.5x compared to using only 1 P100, which is reasonable. However, with 4 to 16 P100 GPUs, the speedup ranges from 2.1x to 3.9x, which is not high. The reason is that, similar to LAMMPS, this application involves a lot of communication among all GPUs used. Based on the analysis for LAMMPS, using configuration B should reduce this communication bottleneck significantly. To verify this, we ran the same application again on a configuration B server. With 4 P100 GPUs, the performance metric “hours for 10e6 steps” was reduced to 10.2 compared to 11.73 in configuration G, a 13% performance improvement, and the speedup compared to 1 P100 improved from 2.1x to 2.4x.
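The improvement figures quoted for configuration B depend on whether the metric is lower-is-better or higher-is-better. The sketch below reproduces the arithmetic for the HOOMD-blue numbers (hours for 10e6 steps) and the LAMMPS numbers quoted earlier (timesteps/s); the function names are hypothetical.

```python
def time_reduction_pct(old_time: float, new_time: float) -> float:
    """Percent reduction for a lower-is-better metric such as run time."""
    return (old_time - new_time) / old_time * 100.0


def throughput_gain_pct(old_rate: float, new_rate: float) -> float:
    """Percent increase for a higher-is-better metric such as timesteps/s."""
    return (new_rate - old_rate) / old_rate * 100.0


if __name__ == "__main__":
    # HOOMD-blue, 4 P100: 11.73 hours (configuration G) vs 10.2 hours (B).
    print(f"HOOMD-blue: {time_reduction_pct(11.73, 10.2):.1f}% less time")
    # Relative speedup of B over G, which scales the 2.1x figure to about 2.4x.
    print(f"B over G: {11.73 / 10.2:.2f}x")
    # LAMMPS, 4 P100: 505 timesteps/s (configuration G) vs 510 (B).
    print(f"LAMMPS: {throughput_gain_pct(505, 510):.1f}% more timesteps/s")
```

The 13% figure is a reduction in run time; expressed as a throughput ratio, the same change is roughly a 1.15x speedup, which matches the move from 2.1x to 2.4x.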
Figure 8: HOOMD-blue Performance on CPU and P100-PCIe
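Because the HOOMD-blue metric is a run time (lower is better), the 13% improvement and the 2.4x speedup follow from the same two measurements. The sketch below reproduces that arithmetic; the helper name is hypothetical, and it assumes both speedups are measured against the same single-P100 baseline.

```python
def improvement_lower_is_better(old, new):
    """Relative improvement for a metric where smaller values are better."""
    return (old - new) / old


# Figures quoted in the text for 4 P100 GPUs.
hours_g = 11.73    # configuration G, hours for the benchmark run
hours_b = 10.2     # configuration B, hours for the benchmark run
speedup_g = 2.1    # configuration G speedup over 1 P100

improvement = improvement_lower_is_better(hours_g, hours_b)

# With a shared baseline, speedup scales with the ratio of the run times.
speedup_b = speedup_g * hours_g / hours_b

print(f"improvement: {100 * improvement:.0f}%")
print(f"configuration B speedup over 1 P100: {speedup_b:.1f}x")
```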
Amber is the collective name for a suite of programs that allow users to carry out molecular dynamics simulations, particularly on biomolecules. The term Amber is also used to refer to the empirical force fields that are implemented in this suite. Figure 9 shows the performance of Amber on CPU and P100-PCIe. 1 P100 is 6.3x faster than dual CPU, and using 2 P100 GPUs is 1.2x faster than using 1 P100. However, the performance drops significantly when 4 or more GPUs are used. The reason is that, like LAMMPS and HOOMD-blue, this application relies heavily on P2P access, but configuration G only supports it within two pairs of GPUs. We verified this by testing the application on a configuration B node. The performance with 4 P100 improved to 791 ns/day from 315 ns/day in configuration G, a 151% performance improvement and a speedup of 2.5x. But even in configuration B, the multi-GPU scaling is still not good enough. This is because when the Amber multi-GPU support was originally designed, the PCI-E bus speed was Gen2 x16 and the GPUs were C1060s or C2050s. The current Pascal generation GPUs are more than 16x faster than the C1060s, while the PCI-E bus speed has only increased by 2x (PCI-E Gen2 x16 to PCI-E Gen3 x16) and InfiniBand interconnects by about the same amount. The Amber website explicitly states that “It should be noted that while the legacy MPI and GPU-Direct methods of multi-GPU communication are still supported, and will be used by the code automatically if peer to peer communication is not available, you are very unlikely to see any speedup by using multiple GPUs for a single job if the GPUs are newer than C2050s. Multi-node runs are almost impossible to get to scale.” This is consistent with our multi-node results: Figure 9 shows that the more nodes are used, the worse the performance is.
Figure 9: Amber Performance on CPU and P100-PCIe
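The scaling argument for Amber rests on how much faster GPUs have become relative to the bus that connects them. A rough sketch of that ratio, together with the improvement figure quoted above, is shown below. The variable and function names are hypothetical, and the 16x value is the lower bound given in the text, not a measurement.

```python
def improvement_higher_is_better(old, new):
    """Relative improvement for a metric where larger values are better."""
    return (new - old) / old


# Growth factors quoted in the text (C1060 era to Pascal era).
gpu_speed_growth = 16.0   # "> 16x", so treat this as a lower bound
bus_speed_growth = 2.0    # PCI-E Gen2 x16 to Gen3 x16

# How much more compute each unit of bus bandwidth now has to feed.
compute_to_bus_imbalance = gpu_speed_growth / bus_speed_growth

# Amber throughput with 4 P100 GPUs, in ns/day.
ns_per_day_g = 315.0   # configuration G
ns_per_day_b = 791.0   # configuration B

improvement = improvement_higher_is_better(ns_per_day_g, ns_per_day_b)

print(f"compute-to-bus imbalance: at least {compute_to_bus_imbalance:.0f}x")
print(f"configuration B improvement: {100 * improvement:.0f}%")
```

By this estimate, communication weighs at least 8 times more heavily than it did when the multi-GPU code was designed, which is consistent with the weak scaling seen even on configuration B.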
ANSYS® Mechanical™ software is a comprehensive finite element analysis (FEA) tool for structural analysis, including linear, nonlinear dynamic, hydrodynamic and explicit studies. It provides a complete set of element behavior, material models and equation solvers for a wide range of mechanical design problems. The finite element method is used to solve the partial differential equations, which is a compute- and memory-intensive task. Our testing focused on the Power Supply Module (V17cg-1) benchmark. This is a medium-sized job for iterative solvers and a good test for memory bandwidth. Figure 10 shows the performance of ANSYS Mechanical on CPU and P100-PCIe. Within a node, 4 P100 is 3.8x faster than dual CPUs. With 4 nodes, 16 P100 is 2.3x faster than 8 CPUs. The figure also shows that the performance scales well with more nodes: the speedup with 4 nodes is 2.8x compared to 1 node.
Figure 10: ANSYS Mechanical Performance on CPU and P100-PCIe
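One way to compare how well the applications scale is parallel efficiency, the measured speedup divided by the ideal linear speedup. The sketch below applies it to speedups quoted in this blog; the function and dictionary names are hypothetical. Note that the ANSYS figure is relative to 1 node while the LAMMPS and HOOMD-blue figures are relative to 1 GPU, so the comparison is only indicative.

```python
def parallel_efficiency(speedup, units):
    """Measured speedup as a fraction of ideal linear speedup on `units`."""
    return speedup / units


# (speedup, number of units) pairs quoted in the text.
quoted = {
    "ANSYS Mechanical, 4 nodes vs 1 node": (2.8, 4),
    "LAMMPS, 16 P100 vs 1 P100": (5.8, 16),
    "HOOMD-blue, 16 P100 vs 1 P100": (3.9, 16),
}

efficiencies = {
    name: parallel_efficiency(speedup, units)
    for name, (speedup, units) in quoted.items()
}

for name, efficiency in efficiencies.items():
    print(f"{name}: {100 * efficiency:.0f}% of linear scaling")
```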
RELION (for REgularised LIkelihood OptimisatioN) is a program that employs an empirical Bayesian approach to the refinement of (multiple) 3D reconstructions or 2D class averages in electron cryo-microscopy (cryo-EM). Figure 11 shows the performance of RELION on CPU and P100-PCIe. Note that the y-axis is in logarithmic scale. 1 P100 is 8.8x faster than dual CPU. The figure also shows that RELION does not scale well starting from 4 P100 GPUs. Because of the long execution time, we did not profile this application, but the reason for the weak scaling may be similar to that for LAMMPS, HOOMD-blue and Amber.
Figure 11: RELION Performance on CPU and P100-PCIe
In this blog, we presented and analyzed the performance of different applications on Dell PowerEdge C4130 servers with P100-PCIe GPUs. Among the tested applications, HPL, GROMACS and ANSYS Mechanical benefit from the balanced CPU-GPU configuration in configuration G because they do not require P2P access among GPUs. However, LAMMPS, HOOMD-blue and Amber (and possibly RELION) rely on P2P access. With configuration G, they therefore scale well up to 2 P100 GPUs and then scale weakly with 4 or more P100 GPUs. With configuration B, they scale better with 4 GPUs than with configuration G, so configuration B is more suitable and recommended for applications implemented with P2P access.
In future work, we will run these applications on P100-SXM2 and compare the performance of P100-PCIe and P100-SXM2.
About the Author: Shyam Iyer is a Software Sys Sr Principal Engineer in the Server Solutions Office of the CTO focused on accelerating S/W stacks and Applications with H/W assists.
To understand the typical architecture of a Hyperconverged Infrastructure (HCI) offering in broad terms, a picture speaks a thousand words.
The hypervisor/host is the glue connecting the compute with the storage. If horizontal scaling is the name of the game, then HCI solves this by isolating the storage network from the compute network using abstracted storage stacks, more commonly branded as software-defined storage (SDS).
The idea has led to significant changes in the storage industry in the last five years, not just in how storage is viewed but also in the way commodity servers and components, with their huge supply chain advantage, have democratized the aggregation of resources. Storage happened to be the first to be revolutionized, but networking was in lockstep, with “software defined” being the buzzword for moving anything to an x86 server.
But the real truth lay in the supply chain economics being just right for the type of workload being tested.
VDI was the first winner, and the benefits could be quickly realized. For a virtual workstation demanded by a consumer in a school district or a hospital, the administrator/CIO didn’t have to shop for expensive large systems just to keep the shop running. And when companies provided the agility to scale on demand, customers lapped it up all too readily.
This led to a sort of word-of-mouth revolution that built trust in the architecture and emboldened customers to try newer workloads. And that is exactly what has happened…
In an IDC briefing held against the backdrop of VMworld 2016, Eric Sheppard described the changing profile of workloads being deployed on HCI.
Essentially, an HCI architecture is beginning to look more conducive to customers as a primary storage architecture for more demanding applications.
If that is an artifact of a changing customer usage model, then the underlying technology trends are moving right towards it, creating a perfect storm at the compute.
And while this is happening, the compute is undergoing its own revolution.
So, while software defined is cool, it has to run on something. The demands of an application mean you can’t just dumb down the hardware and layer software on it; you also need to solve the bottlenecks and pain points with a solution that leverages hardware innovatively.
For example, the picture shown here shows the vSwitch/network latency between a VM and a hypervisor.
Two observations stand out in this picture.
1) The latency increases with packet size.
2) The latency is higher when the system is loaded.
As core counts and VM/container density increase, latency is going to be a critical metric. I believe latency needs to be solved outside of the realm of a compute/storage network. I also envision a need for data services to depend on H/W assists.
One approach could be to take an off-the-shelf H/W part and use it innovatively to fit into the HCI deployment to solve a problem. This is valiant and sometimes necessary too. The cost economics of an off-the-shelf part can be hard to beat. But many times this can be more limiting than liberating. Flexibility in the H/W architecture to solve customer problems is imperative for a solution provider. It allows you to have a solution for the next problem that an application’s demands present. Enter programmable hardware like FPGAs. Once thought of as being useful for simulating ASIC designs, FPGAs are becoming interesting enough that an entire workload acceleration industry is taking off. And sooner rather than later, the ecosystem gravity will catch up.
So if you are a data center geek like me, watching this industry and wondering where the action is, you just stumbled on it. As for me, I am going to be rolling up my sleeves and getting back to work.
Product Release Notification – vWorkspace 8.6.3
Type: Patch Release
Created: February 2017
Created: February 1, 2017
226202 - What's new in version 8.6.3?
Please see below for what is new in version 8.6.3.
Created: February 16, 2017
226365 - With Windows 10 - 1607 the Client Session window disappears after minimizing
This occurs with Windows 10 Anniversary version (1607) with Connector installed, while the vWorkspace is configured to display the...
Created: February 22, 2017
223593 - Optional Hotfix 653818 for 8.6 MR2 Windows Connector
This is an optional hotfix for the vWorkspace Windows Connector. Below is the list of issues addressed in this hotfix: Client proxy...
Revised: February 1, 2017
223804 - Local Printer Issue - slow printing
When using the Universal Printers setting to redirect local printers it may be slow to print large documents.
105489 - Video: How to configure the Webaccess Timeout Warning
vWorkspace 8.0 introduces a Timeout warning that allows the user to stay logged into the website. This shows you how to configure it.
Revised: February 3, 2017
204908 - What are the screen resolutions supported by vWorkspace for Windows?
Up to which resolution can vWorkspace 8.5+ support?
Revised: February 6, 2017
224308 - Hyper-V host is showing offline and is unable to be initialized.
Hyper-V host fails to initialize and is showing offline. The following message may be seen in the vWorkspace console: Remote computer could...
225412 - How to make vWorkspace more tolerant of a bad network
When a network is known to be having issues, is there a setting that can help the vWorkspace Connection stay connected during packet drops?
181327 - Blank screen when connecting through HTML5 connector
When trying to connect to any published applications using the HTML5 connector, the user is presented with a black screen and does not logon.
Revised: February 8, 2017
204417 - Hypercache VM Count is wrong and prevents deletion of old Parent VHDs
When viewing the Hypercache report, it shows the VM Count per template as the total number of machines across all templates. This means that old...
120107 - Data collector service fails to start automatically after reboot
On some servers, the Data Collector service fails to start when the server is rebooted.
Revised: February 14, 2017
102751 - How to subscribe to RSS Feeds/Product Notifications
How to subscribe to RSS Feeds/Product Notifications to opt into support notifications to receive emails about the latest software patches, version releases, and updates to our Knowledge Base.
Revised: February 19, 2017
137215 - Server Updates always in Pending State
Server updates show as pending within vWorkspace management console. Any task submitted into vWorkspace console shows as pending, and the task is...
Revised: February 24, 2017
106284 - vWorkspace steps to upgrade a vWorkspace Farm to the new version
How to upgrade from a vWorkspace environment to the new version
Revised: February 27, 2017
Revised: February 2017
This is a mandatory hotfix and can be installed on the following vWorkspace roles -
This release provides support for the following -
Broker CPU usage has increased and log file size increases quickly when logging is enabled.
This hotfix is available for download at: https://support.quest.com/vworkspace/kb/226670