Dell Community

Latest Blog Posts
  • General HPC

    Deep Learning Inference on P40 GPUs

    Authors: Rengan Xu, Frank Han and Nishanth Dandapanthu. Dell EMC HPC Innovation Lab. Mar. 2017

    Introduction to P40 GPU and TensorRT

    Deep Learning (DL) has two major phases: training and inference (also called testing or scoring). The training phase builds a deep neural network (DNN) model from a large amount of existing data, and the inference phase uses the trained model to make predictions from new data. Inference can be performed in the data center, in embedded systems, and on automotive and mobile devices, among others. Inference usually must respond to user requests as quickly as possible (often in real time). To meet this low-latency requirement, NVIDIA® launched the Tesla® P4 and P40 GPUs. Aside from high floating-point throughput and efficiency, both GPUs introduce two new instructions designed specifically for inference computations: an 8-bit integer (INT8) 4-element vector dot product (DP4A) and a 16-bit 2-element vector dot product (DP2A). Deep learning researchers have found that FP16 can achieve the same inference accuracy as FP32, and that many applications require only INT8 or lower precision to keep acceptable inference accuracy. The Tesla P4 delivers a peak of 21.8 INT8 TIOP/s (tera integer operations per second), while the P40 delivers a peak of 47.0 INT8 TIOP/s. This blog focuses only on the P40 GPU.
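
    To make the new instructions concrete, here is a minimal CUDA sketch (our own illustration, not code from the lab) of the DP4A operation via the __dp4a intrinsic, which CUDA 8.0 exposes on compute capability 6.1 (Pascal) devices; compile with nvcc -arch=sm_61.

        #include <cstdio>
        #include <cuda_runtime.h>

        // Each int packs four signed 8-bit values. __dp4a(a, b, c) computes the
        // 4-element dot product of the bytes of a and b and accumulates it into
        // the 32-bit integer c: 8 integer ops in a single instruction.
        __global__ void dp4aDemo(const int* a, const int* b, int* out, int n) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) out[i] = __dp4a(a[i], b[i], 0);
        }

        int main() {
            // Bytes (1,2,3,4) dot (5,6,7,8) = 1*5 + 2*6 + 3*7 + 4*8 = 70.
            int ha = 0x04030201, hb = 0x08070605, hout = 0;
            int *da, *db, *dout;
            cudaMalloc(&da, sizeof(int));
            cudaMalloc(&db, sizeof(int));
            cudaMalloc(&dout, sizeof(int));
            cudaMemcpy(da, &ha, sizeof(int), cudaMemcpyHostToDevice);
            cudaMemcpy(db, &hb, sizeof(int), cudaMemcpyHostToDevice);
            dp4aDemo<<<1, 1>>>(da, db, dout, 1);
            cudaMemcpy(&hout, dout, sizeof(int), cudaMemcpyDeviceToHost);
            printf("dp4a result: %d\n", hout);  // prints 70
            cudaFree(da); cudaFree(db); cudaFree(dout);
            return 0;
        }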

    TensorRT™, previously called GIE (GPU Inference Engine), is a high-performance deep learning inference engine for production deployment of deep learning applications that maximizes inference throughput and efficiency. TensorRT lets users take advantage of the fast reduced-precision instructions in Pascal GPUs. TensorRT v2 supports the INT8 reduced-precision operations available on the P40.
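
    For context, here is a rough C++ sketch of the TensorRT v2-era workflow: parse a Caffe model, build an optimized engine, and run batched inference. This is a hedged illustration, not the lab's benchmark code; the API names follow the TensorRT 2.x headers as documented at the time, the model files and blob names ("googlenet.prototxt", "data", "prob") are placeholders, and INT8 mode additionally requires a calibrator that is not shown here.

        #include <iostream>
        #include <cuda_runtime.h>
        #include "NvInfer.h"
        #include "NvCaffeParser.h"

        using namespace nvinfer1;
        using namespace nvcaffeparser1;

        // TensorRT requires a logger implementation from the application.
        class Logger : public ILogger {
            void log(Severity severity, const char* msg) override {
                if (severity != Severity::kINFO) std::cerr << msg << std::endl;
            }
        } gLogger;

        int main() {
            const int batch = 128;

            // Build phase: parse the Caffe model and compile an optimized engine.
            IBuilder* builder = createInferBuilder(gLogger);
            INetworkDefinition* network = builder->createNetwork();
            ICaffeParser* parser = createCaffeParser();
            auto* blobs = parser->parse("googlenet.prototxt", "googlenet.caffemodel",
                                        *network, DataType::kFLOAT);
            network->markOutput(*blobs->find("prob"));   // classification output blob
            builder->setMaxBatchSize(batch);
            builder->setMaxWorkspaceSize(1 << 30);       // 1 GB of scratch space
            // builder->setInt8Mode(true);               // INT8 also needs a calibrator
            ICudaEngine* engine = builder->buildCudaEngine(*network);

            // Inference phase: bind device buffers and execute a batch.
            IExecutionContext* context = engine->createExecutionContext();
            void* buffers[2];
            cudaMalloc(&buffers[engine->getBindingIndex("data")],
                       batch * 3 * 224 * 224 * sizeof(float));  // input images
            cudaMalloc(&buffers[engine->getBindingIndex("prob")],
                       batch * 1000 * sizeof(float));           // class probabilities
            context->execute(batch, buffers);            // synchronous batched inference
            return 0;
        }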

    Testing Methodology

    This blog quantifies the performance of deep learning inference using TensorRT on Dell’s PowerEdge C4130 server equipped with four Tesla P40 GPUs. Since TensorRT is only available for Ubuntu, all experiments were run on Ubuntu. Table 1 shows the hardware and software details. The inference benchmark we used was giexec from the TensorRT sample codes; it uses synthetic images, filled with random non-zero numbers to simulate real images. Two classic neural networks were tested: AlexNet (the 2012 ImageNet winner) and GoogLeNet (the 2014 ImageNet winner), which is much deeper and more complicated than AlexNet.

    We measured inference performance in images/sec, meaning the number of images that can be processed per second. To quantify the improvement of the current-generation P40, we also compared its performance with the previous-generation M40. The most important goal of this testing was to measure inference performance in INT8 mode versus FP32 mode. The P40 uses the new Pascal architecture and supports the new INT8 instructions; the previous-generation M40 uses the Maxwell architecture and does not. The theoretical INT8 and FP32 performance of both GPUs is shown in Table 2. We measured FP32 performance on both devices, and both FP32 and INT8 performance on the P40.

    Table 1: Hardware configuration and software details

    Platform:                 PowerEdge C4130 (configuration G)
    Processor:                2 x Intel Xeon E5-2690 v4 @ 2.6 GHz (Broadwell)
    Memory:                   256 GB DDR4 @ 2400 MHz
    Disk:                     400 GB SSD
    GPU:                      4 x Tesla P40 with 24 GB GPU memory

    Software and Firmware

    Operating System:         Ubuntu 14.04
    BIOS:                     2.3.3
    CUDA and driver version:  8.0.44 (driver 375.20)
    TensorRT version:         2.0 EA


    Table 2: Comparison between Tesla M40 and P40

                        Tesla M40    Tesla P40
    INT8 (TIOP/s)       N/A          47.0
    FP32 (TFLOP/s)      6.8          11.8
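
    The peaks in Table 2 can be sanity-checked with back-of-the-envelope arithmetic. The specs below are assumptions on our part (publicly cited figures, not stated in this blog): 3840 CUDA cores at a ~1531 MHz boost clock for the P40, and 3072 cores at ~1114 MHz for the M40, with an FP32 FMA counting 2 ops per core per cycle and DP4A counting 8.

        #include <cstdio>

        // Back-of-the-envelope check of the peaks in Table 2.
        // Assumed specs (publicly cited, not from this blog):
        //   Tesla P40: 3840 CUDA cores at ~1531 MHz boost
        //   Tesla M40: 3072 CUDA cores at ~1114 MHz boost
        // FP32 FMA = 2 ops/core/cycle; INT8 DP4A = 8 ops/core/cycle.
        int main() {
            double p40 = 3840 * 1531e6;   // core-cycles per second
            double m40 = 3072 * 1114e6;
            printf("P40 FP32: %.1f TFLOP/s\n", p40 * 2 / 1e12);  // ~11.8
            printf("P40 INT8: %.1f TIOP/s\n",  p40 * 8 / 1e12);  // ~47.0
            printf("M40 FP32: %.1f TFLOP/s\n", m40 * 2 / 1e12);  // ~6.8
            return 0;
        }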


    Performance Evaluation

    In this section, we present the inference performance with TensorRT on GoogLeNet and AlexNet. We implemented the benchmark with MPI so that it can be run on multiple P40 GPUs within a node. We also compare the performance of the P40 with the M40, and finally show the performance impact of different batch sizes.

    Figure 1 shows the inference performance with the TensorRT library for both GoogLeNet and AlexNet. INT8 mode is ~3x faster than FP32 for both neural networks. This is expected: the theoretical speedup of INT8 over FP32 is 4x if only multiplications are performed and no other overhead is incurred. In practice, kernel launches, occupancy limits, data movement, and math other than multiplications reduce the speedup to about 3x.


    Figure 1: Inference performance with TensorRT library

    Dell’s PowerEdge C4130 supports up to four GPUs in a server. To make use of all GPUs, we implemented the inference benchmark using MPI so that each MPI process runs on one GPU. Figures 2 and 3 show the multi-GPU inference performance on GoogLeNet and AlexNet, respectively. Linear speedup was achieved for both neural networks when using multiple GPUs, because each GPU processes its own images and there is no communication or synchronization among the GPUs (a minimal sketch of this pattern follows Figure 3).


    Figure 2: Multi-GPU inference performance with TensorRT GoogLeNet


    Figure 3: Multi-GPU inference performance with TensorRT AlexNet
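
    Here is a minimal sketch of the one-MPI-process-per-GPU arrangement described above. It is our illustration of the pattern, not the lab's code; the actual benchmark would run the TensorRT engine inside each rank.

        #include <cstdio>
        #include <mpi.h>
        #include <cuda_runtime.h>

        // Each MPI rank pins itself to one GPU and runs an independent inference
        // loop, so there is no inter-GPU communication or synchronization to
        // limit scaling. Launch with: mpirun -np 4 ./bench
        int main(int argc, char** argv) {
            MPI_Init(&argc, &argv);
            int rank = 0, ngpus = 0;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            cudaGetDeviceCount(&ngpus);
            cudaSetDevice(rank % ngpus);   // rank 0 -> GPU 0, rank 1 -> GPU 1, ...
            printf("rank %d using GPU %d\n", rank, rank % ngpus);
            // ... build the TensorRT engine and run the benchmark on this GPU ...
            // A single cluster-wide total can be reduced at the end, e.g.:
            // MPI_Reduce(&local_ips, &total_ips, 1, MPI_DOUBLE, MPI_SUM,
            //            0, MPI_COMM_WORLD);
            MPI_Finalize();
            return 0;
        }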

    To highlight the performance advantage of the P40 GPU and its native INT8 support, we compared inference performance between the P40 and the previous-generation M40. The results are shown in Figures 4 and 5 for GoogLeNet and AlexNet, respectively. In FP32 mode, the P40 is 1.7x faster than the M40, and INT8 mode on the P40 is 4.4x faster than FP32 mode on the M40.


    Figure 4: Inference performance comparison between P40 and M40 (GoogLeNet)


    Figure 5: Inference performance comparison between P40 and M40 (AlexNet)

    Deep learning inference is applied in different scenarios. Some scenarios require a large batch size, and some require no batching at all (i.e., batch size 1). Therefore we also measured performance at different batch sizes; the result is shown in Figure 6. Note that the purpose here is not to compare GoogLeNet and AlexNet, but to check how performance changes with batch size for each neural network. Without batching, inference performance is very low, because the GPU is not given enough work to keep it busy. The larger the batch size, the higher the inference performance, although the rate of improvement tapers off. At batch size 4096, GoogLeNet stopped running because the GPU memory required by the network exceeded the GPU memory limit; AlexNet was still able to run because it is a less complicated network and therefore needs less GPU memory. So the largest batch size is limited only by GPU memory.


    Figure 6: Inference performance with different batch sizes
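
    The images/sec metric at each batch size falls straight out of per-batch latency. The sketch below is our stand-in, with a dummy kernel in place of a real network (a real benchmark would call the TensorRT execution context instead); it shows the measurement pattern and why throughput grows with batch size.

        #include <cstdio>
        #include <cuda_runtime.h>

        // Stand-in for one forward pass over a batch; work scales with batch size.
        __global__ void fakeForward(float* buf, int n) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) buf[i] = buf[i] * 0.5f + 1.0f;
        }

        int main() {
            const int perImage = 3 * 224 * 224;   // one 224x224 RGB image
            const int iters = 100;
            for (int batch = 1; batch <= 1024; batch *= 4) {
                int n = batch * perImage;
                float* d;
                cudaMalloc(&d, n * sizeof(float));
                cudaEvent_t start, stop;
                cudaEventCreate(&start); cudaEventCreate(&stop);
                cudaEventRecord(start);
                for (int i = 0; i < iters; ++i)
                    fakeForward<<<(n + 255) / 256, 256>>>(d, n);
                cudaEventRecord(stop);
                cudaEventSynchronize(stop);
                float ms = 0.f;
                cudaEventElapsedTime(&ms, start, stop);
                // Throughput = images processed / elapsed seconds; small batches
                // underutilize the GPU, so images/sec grows with batch size.
                printf("batch %4d: %.0f images/sec\n",
                       batch, batch * iters / (ms / 1e3));
                cudaEventDestroy(start); cudaEventDestroy(stop);
                cudaFree(d);
            }
            return 0;
        }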

    Conclusions and Future Work

    In this blog, we presented deep learning inference performance with the NVIDIA® TensorRT library on P40 and M40 GPUs. INT8 mode on the P40 is about 3x faster than FP32 mode on the P40, and 4.4x faster than FP32 mode on the previous-generation M40. Multiple GPUs increase inference performance linearly because there is no communication or synchronization between them. We also observed that larger batch sizes lead to higher inference performance, with the largest batch size limited only by GPU memory size. In future work, we will evaluate inference performance with real-world deep learning applications.


  • Data Security

    Ding Dong...Baidu at the Door

    - by Matt Halsey

    Have a connected home? Have an internet connection? Then you too can have a conversation with the Chinese website Baidu.

    "Huge Vulnerability Discovered in the Ring Doorbell" highlights the intrinsic need for a means to secure IoT devices.

    It was only a few months ago that the Mirai botnet, using home video surveillance cameras, was able to launch the largest DDoS attack in history.

    Read the article.

    Then you can read the comments from someone claiming to be the head of security at Ring, named Matt, here (italics added):

    Hi I'm the VP of Security at Ring and I thought it might be helpful to give you all some background on what you are seeing.

    Occasionally at the end of live call or motion, we will lose connectivity. Rather than abandoning the entire call, we send the last few audio packets that are corrupted anyway to a non-routable address on a protocol no one uses. The right way to do that is to use a virtual interface or the loopback to discard the packets. The choice to send it to somewhere across the world and let the ISP deal with blocking is a poor design choice that the teams on working on addressing ASAP.

    From a risk/disclosure perspective, it's relatively benign but like the everyone else, when my team first saw it in the wild we had similar concerns.

    i will circle back when we have updated firmware.

    -Matt

    Ring Pro doorbell - calling China?

    So what to do:

    1. Go to the Industrial Internet Consortium and see how Dell and EMC (now Dell EMC and Dell Technologies) are helping to secure the IoT world.

    2. Realize that where security is concerned, IoT is in its infancy, if not earlier: like when we used to leave Telnet, TFTP, and FTP ports open on our internet-facing servers.

    3. Be ready to help our customers understand that encryption, especially our products, can help protect them when vendors of IoT devices don't finish the job of securing those devices.

  • Dell Big Data - Blog

    Getting started with machine-generated data

    By Brett Roberts with Debra Slapak

     

    The amount of machine-generated data being created each day is massive and, as we all know, can be extremely valuable. Insights extracted from this data have the potential to help you improve operational efficiency, customer experience, security and much more. But getting started can present real challenges and really big questions, such as "How do we consolidate all of this complex data and analyze it to deliver actionable insights?" Dell EMC works with Splunk to address these challenges and simplify those first steps.

     

    Splunk’s proven platform for real-time operational intelligence reduces the complexity of harnessing machine-generated data by giving users an end-to-end platform to collect, search, analyze and visualize that data. For the Splunk platform to be used to its full potential, organizations need infrastructure that meets or exceeds Splunk’s reference architecture specifications. Dell EMC has partnered with Splunk to create highly optimized and powerful solutions that help solve machine-generated data challenges. Read more in a recently posted blog about how Splunk and Dell EMC can help you on your journey to valuable insights with machine-generated data.

     

  • Dell TechCenter

    HCI architecture poised for disruption?

    About the Author: Shyam Iyer is a Software Systems Senior Principal Engineer in the Server Solutions Office of the CTO, focused on accelerating software stacks and applications with hardware assists.

    If you take a look at the typical architecture of a Hyperconverged Infrastructure (HCI) offering in broad terms, a picture speaks a thousand words.

    The hypervisor/host is the glue connecting compute with storage. If horizontal scaling is the name of the game, HCI solves it by isolating the storage network from the compute network using abstracted storage stacks, more commonly branded as software-defined storage (SDS).


    The idea has led to significant changes in the storage industry over the last five years, not just in how storage is viewed but also in how commodity servers and components, with their huge supply-chain advantage, have democratized the aggregation of resources. Storage happened to be the first area to be revolutionized, but networking was in lock step, with "software defined" becoming the buzzword for moving anything to an x86 server.

    But the real truth was that the supply-chain economics were just right for the type of workload being tested.

    VDI was the first winner, and the benefits were quickly realized. For a virtual workstation demanded by a consumer in a school district or a hospital, the administrator/CIO didn’t have to shop for expensive large systems just to keep the shop running. And when companies provided the agility to scale on demand, customers lapped it up.

    This led to a sort of revolution by word of mouth that built trust in the architecture, and customers became emboldened to try newer workloads. And that is exactly what has happened…

    In an IDC briefing organized in the backdrop of VMworld 2016, Eric Sheppard described the changing workload profile being deployed on HCI.


    Essentially, an HCI architecture is beginning to look more attractive to customers as a primary storage architecture for more demanding applications.

    If that is an artifact of a changing customer usage model, then the underlying technology trends are moving right toward it, creating a perfect storm at the compute layer.


    And while this is happening, the compute itself is undergoing its own revolution.

    http://en.community.dell.com/dell-blogs/dell4enterprise/b/dell4enterprise/archive/2017/01/26/memory-centric-architecture-vision

    So, while software defined is cool, it has to run on something. The demands of an application mean you can’t just dumb down the hardware and layer software on it; you also need to solve the bottlenecks/pain points with a solution that leverages hardware innovatively.

    For example, the picture shown here plots the vSwitch/network latency between a VM and its hypervisor.

    There are two observations in this picture (a measurement sketch follows the list):

    1) The latency increases with packet size.

    2) The latency is higher when the system is loaded.
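
    A hypothetical way to reproduce that trend yourself: round-trip UDP payloads of increasing size from a VM to an echo responder on its hypervisor and time them. The address below is a placeholder, and the sketch assumes a UDP echo service is listening at the other end.

        #include <arpa/inet.h>
        #include <netinet/in.h>
        #include <sys/socket.h>
        #include <unistd.h>
        #include <chrono>
        #include <cstdio>
        #include <vector>

        // Round-trips a UDP payload to an echo service and reports average
        // latency per packet size, surfacing the vSwitch-path trend above.
        int main() {
            const char* host = "192.0.2.10";   // placeholder echo-server address
            int sock = socket(AF_INET, SOCK_DGRAM, 0);
            sockaddr_in addr{};
            addr.sin_family = AF_INET;
            addr.sin_port = htons(7);          // classic UDP echo port
            inet_pton(AF_INET, host, &addr.sin_addr);

            for (int size : {64, 256, 1024, 4096, 8192}) {
                std::vector<char> buf(size, 'x');
                const int reps = 1000;
                auto t0 = std::chrono::steady_clock::now();
                for (int i = 0; i < reps; ++i) {
                    sendto(sock, buf.data(), size, 0, (sockaddr*)&addr, sizeof(addr));
                    recv(sock, buf.data(), size, 0);   // blocks for the echo
                }
                auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                              std::chrono::steady_clock::now() - t0).count();
                printf("%5d bytes: %.1f us round-trip\n", size, (double)us / reps);
            }
            close(sock);
            return 0;
        }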

    As core counts and VM/container density increase, latency is going to be a critical metric. I believe latency needs to be solved outside the realm of the compute/storage network. I also envision a need for data services to depend on H/W assists.

    One approach is to take an off-the-shelf H/W part and use it innovatively to solve a problem in an HCI deployment. This is valiant, and sometimes necessary too; the cost economics of an off-the-shelf part can be hard to beat. But many times this is more limiting than liberating. Flexibility in an H/W architecture to solve customer problems is imperative for a solution provider, because it gives you a solution to the next problem an application presents. Enter programmable hardware like FPGAs. Once thought of as useful mainly for simulating ASIC designs, FPGAs are becoming interesting enough that an entire workload-acceleration industry is taking off. And sooner rather than later, the ecosystem gravity will catch up.

    So if you are a data center geek like me, watching this industry and wondering where the action is, you just stumbled on it. As for me, I am going to roll up my sleeves and get back to work.

  • vWorkspace - Blog

    What's new for vWorkspace - February 2017

    Updated monthly, this publication provides you with new and recently revised information, organized into the following categories: Documentation, Notifications, Patches, Product Life Cycle, Releases, Knowledge Base Articles.

    Subscribe to the RSS feed (Internet Explorer only).

     

    Downloads

    Product Release Notification – vWorkspace 8.6.3

    Type: Patch Release. Created: February 2017.

      

    Knowledgebase Articles

    New 

     

    225565 - Is VMware 6.5 currently supported?

    Is VMware vSphere 6.5 supported in any of the current versions of vWorkspace?

    Created: February 1, 2017

     

    226202 - What's new in version 8.6.3?

    Please see below for what is new in version 8.6.3.

    Created: February 16, 2017

     

    226365 - With Windows 10 - 1607 the Client Session window disappears after minimizing

    This occurs with Windows 10 Anniversary version (1607) with the Connector installed, while vWorkspace is configured to display the...

    Created: February 22, 2017 

     

    Revised

    223593 - Optional Hotfix 653818 for 8.6 MR2 Windows Connector

    This is an optional hotfix for the vWorkspace Windows Connector. Below is the list of issues addressed in this hotfix: Client proxy...

    Revised: February 1, 2017

     

    223804 - Local Printer Issue - slow printing

    When using the Universal Printers setting to redirect local printers it may be slow to print large documents.

    Revised: February 1, 2017

     

    105489 - Video: How to configure the Webaccess Timeout Warning

    vWorkspace 8.0 introduces a Timeout warning that allows the user to stay logged into the website. This shows you how to configure it.

    Revised: February 3, 2017

     

    204908 - What screen resolutions are supported by vWorkspace for Windows?

    What is the maximum resolution that vWorkspace 8.5+ can support?

    Revised: February 6, 2017

     

    224308 - Hyper-V host shows offline and cannot be initialized

    Hyper-V host fails to initialize and is showing offline. The following message may be seen in the vWorkspace console: Remote computer could...

    Revised: February 6, 2017

     

    225412 - How to make vWorkspace more tolerant of a bad network

    When a network is known to be having issues, is there a setting that can help the vWorkspace connection stay connected during packet drops?

    Revised: February 6, 2017

     

    181327 - Blank screen when connecting through HTML5 connector

    When trying to connect to any published applications using the HTML5 connector, the user is presented with a black screen and does not logon.

    Revised: February 8, 2017

     

    204417 - Hypercache VM Count is wrong and prevents deletion of old Parent VHDs

    When viewing the Hypercache report, it shows the VM Count per template as the total number of machines across all templates. This means that old...

    Revised: February 8, 2017

     

    120107 - Data collector service fails to start automatically after reboot

    On some servers, the Data Collector service fails to start when the server is rebooted.

    Revised: February 14, 2017

     

    102751 - How to subscribe to RSS Feeds/Product Notifications

    How to subscribe to RSS feeds/product notifications to receive emails about the latest software patches, version releases, and updates to our Knowledge Base.

    Revised: February 19, 2017

     

    137215 - Server Updates always in Pending State

    Server updates show as pending within vWorkspace management console. Any task submitted into vWorkspace console shows as pending, and the task is...

    Revised: February 24, 2017

     

    106284 - vWorkspace steps to upgrade a vWorkspace Farm to the new version

    How to upgrade a vWorkspace environment to the new version.

    Revised: February 27, 2017

     

    Product Life Cycle - vWorkspace

    Revised: February 2017

  • Hotfixes

    Mandatory Hotfix 653995 for 8.6 MR3 Connection Broker Released

    This is a mandatory hotfix and can be installed on the following vWorkspace roles:

     

    • vWorkspace Connection Broker

     

    This release provides support for the following:

    Feature:      Connection Broker
    Description:  Broker CPU usage has increased and log file size increases quickly when logging is enabled.
    Feature ID:   653992

     

    This hotfix is available for download at: https://support.quest.com/vworkspace/kb/226670 


  • Dell TechCenter

    Dell EMC Elect 2017 Nominations

    The Dell TechCenter Rockstar program started several years ago and has been a leading community recognition program since its inception, as has the EMC Elect program. Now the two programs have been combined into one: Dell EMC Elect.

    Dell EMC Elect is a community-driven recognition program that recognizes and rewards individuals' engagement with the Dell EMC brand over the last calendar year.

    Key characteristics of an outstanding Dell EMC Elect candidate include:

    • Engagement - Engaging with the community on various social media channels.
    • Commitment - Being optimistic about the Dell EMC brand day in and day out while still offering honest feedback.
    • Leadership - Helping lead the effort for engagement in their social circle.

    Use this link to submit yourself or your peers for nomination.

    Nominations are open until March 17th and will be vetted by an expert panel of Dell EMC Elect members.

    Members can enjoy benefits such as VIP treatment at Dell EMC events, blogger briefing sessions, NDA sessions, whisper suite access, and more.

    If you have any questions, please email me or DM me on Twitter:

    satish.singh@dell.com

    @thesatishsingh 

    Thank you and have a great day,