Dell Community

Blog Group Posts
Application Performance Monitoring Blog Foglight APM 105
Blueprint for HPC - Blog Blueprint for High Performance Computing 0
Custom Solutions Engineering Blog Custom Solutions Engineering 8
Data Security Data Security 8
Dell Big Data - Blog Dell Big Data 68
Dell Cloud Blog Cloud 42
Dell Cloud OpenStack Solutions - Blog Dell Cloud OpenStack Solutions 0
Dell Lifecycle Controller Integration for SCVMM - Blog Dell Lifecycle Controller Integration for SCVMM 0
Dell Premier - Blog Dell Premier 3
Dell TechCenter TechCenter 1,858
Desktop Authority Desktop Authority 25
Featured Content - Blog Featured Content 0
Foglight for Databases Foglight for Databases 35
Foglight for Virtualization and Storage Management Virtualization Infrastructure Management 256
General HPC High Performance Computing 227
High Performance Computing - Blog High Performance Computing 35
Hotfixes vWorkspace 66
HPC Community Blogs High Performance Computing 27
HPC GPU Computing High Performance Computing 18
HPC Power and Cooling High Performance Computing 4
HPC Storage and File Systems High Performance Computing 21
Information Management Welcome to the Dell Software Information Management blog! Our top experts discuss big data, predictive analytics, database management, data replication, and more. Information Management 229
KACE Blog KACE 143
Life Sciences High Performance Computing 9
On Demand Services Dell On-Demand 3
Open Networking: The Whale that swallowed SDN TechCenter 0
Product Releases vWorkspace 13
Security - Blog Security 3
SharePoint for All SharePoint for All 388
Statistica Statistica 24
Systems Developed by and for Developers Dell Big Data 1
TechCenter News TechCenter Extras 47
The NFV Cloud Community Blog The NFV Cloud Community 0
Thought Leadership Service Provider Solutions 0
vWorkspace - Blog vWorkspace 511
Windows 10 IoT Enterprise (WIE10) - Blog Wyse Thin Clients running Windows 10 IoT Enterprise Windows 10 IoT Enterprise (WIE10) 4
Latest Blog Posts
  • General HPC

    Scaling Deep Learning on Multiple V100 Nodes

    Authors: Rengan Xu, Frank Han, Nishanth Dandapanthula.

    HPC Innovation Lab. November 2017



    In our previous blog, we presented the deep learning performance on single Dell PowerEdge C4130 node with four V100 GPUs. For very large neural network models, a single node is still not powerful enough to quickly train those models. Therefore, it is important to scale the training model to multiple nodes to meet the computation demand. In this blog, we will evaluate the multi-node performance of deep learning frameworks MXNet and Caffe2. The Interconnect in use is Mellanox EDR InfiniBand. The results will show that both frameworks scale well on multiple V100-SXM2 nodes.

    Overview of MXNet and Caffe2

    In this section, we will give an overview about how MXNet and Caffe2 are implemented for distributed training on multiple nodes. Usually there are two ways to parallelize neural network training on multiple devices: data parallelism and model parallelism. In data parallelism, all devices have the same model but different devices work on different pieces of data. While in model parallelism, difference devices have parameters of different layers of a neural network. In this blog, we only focus on the data parallelism in deep learning frameworks and will evaluate the model parallelism in the future. Another choice in most deep learning frameworks is whether to use synchronous or asynchronous weight update. The synchronous implementation aggregates the gradients over all workers in each iteration (or mini-batch) before updating the weights. However, in asynchronous implementation, each worker updates the weight independently with each other. Since the synchronous way guarantees the model convergence while the asynchronous way is still an open question, we only evaluate the synchronous weight update.  

    MXNet is able to launch jobs on a cluster in several ways including SSH, Yarn, MPI. For this evaluation, SSH was chosen. In SSH mode, the processes in different nodes use rsync to synchronize the working directory from root node into slave nodes. The purpose of synchronization is to aggregate the gradients over all workers in each iteration (or mini-batch). Caffe2 uses Gloo library for multi-node training and Redis library to facilitate management of nodes in distributed training. Gloo is a MPI like library that comes with a number of collective operations like barrier, broadcast and allreduce for machine learning applications. The Redis library used by Gloo is used to connect all participating nodes.

    Testing Methodology

    We chose to evaluate two deep learning frameworks for our testing, MXNet and Caffe2. As with our previous benchmarks, we will again use the ILSVRC 2012 dataset which contains 1,281,167 training images and 50,000 validation images. The neural network in the training is called Resnet50 which is a computationally intensive network that both frameworks support. To get the best performance, CUDA 9 compiler, CUDNN 7 library and NCCL 2.0 are used for both frameworks, since they are optimized for V100 GPUs. The testing platform has four Dell EMC’s PowerEdge C4130 servers in configuration K. The system layout of configuration K is shown in Figure 1. As we can see, the server has four V100-SXM2 GPUs and all GPUs are connected by NVLink. The other hardware and software details are shown in Table 1. Table 2 shows the input parameters that are used to train Resnet50 neural network in both frameworks.

    Figure 1: C4130 configuration K

    Table 1: The hardware configuration and software details

    Table 2: Input parameters used in different deep learning frameworks

    Performance Evaluation

    Figure 2 and Figure 3 show the Resnet50 performance and speedup results on multiple nodes with MXNet and Caffe2, respectively. As we can see, the performance scales very well with both frameworks. With MXNet, compared to 1*V100, the speedup of using 16*V100 (in 4 nodes) is 15.4x in FP32 mode and 13.8x in FP16, respectively. And compared to FP32, FP16 improved the performance by 63.28% - 82.79%. Such performance improvement was contributed to the Tensor Cores in V100.

    In Caffe2, compared to 1*V100, the speedup of using 16*V100 (4 nodes) is 14.8x in FP32 and 13.6x in FP16, respectively. And the performance improvement of using FP16 compared to FP32 is 50.42% - 63.75% excluding the 12*V100 case. With 12*V100, using FP16 is only 29.26% faster than using FP32. We are still investigating the exact reason of it, but one possible explanation is that 12 is not the power of 2, which may make some operations like reductions slower.

    Figure 2: Performance of MXNet Resnet50 on multiple nodes

    Figure 3: Performance of Caffe2 Resnet50 on multiple nodes

    Conclusions and Future Work

    In this blog, we present the performance of MXNet and Caffe2 on multiple V100-SXM2 nodes. The results demonstrate that the deep learning frameworks are able to scale very well on multiple Dell EMC’s PowerEdge servers. At this time the FP16 support in TensorFlow is still experimental, our evaluation is in progress and the results will be included in future blogs. We are also working on containerizing these frameworks with Singularity to make their deployment much easier.

  • Dell TechCenter

    Dell PowerEdge VRTX support for VMware ESXi 6.5

    This blog post is written by Thiru Navukkarasu and Krishnaprasad K from Dell Hypervisor Engineering. 

    Dell PowerEdge VRTX was not supported for VMware ESXi 6.5 branch of ESXi thus far. Dell announced support for VRTX from Dell customized version of ESXi 6.5 A04 revision onwards. However there was a late breaking issue identified on dell-shared-perc8 (Version: 06.806.89.00) and ESXi 6.5.x. To resolve this issue, install/upgrade to version 06.806.90.00 and above. This revised driver is part of Dell customized VMware ESXi 6.5 Update1 A04 which is available for download from here

    From VMware ESXi 6.5 onwards, the shared PERC8 controller in VRTX use dell_shared_perc8 native driver instead of megaraid_sas vmklinux driver in ESXi 6.0.x branch. 

    You may look at the following command outputs in ESXi to verify if you have the supported image installed on PowerEdge VRTX blades. 

    ~] vmware –lv

    VMware ESXi 6.5.0 build-6765664 

    VMware ESXi 6.5.0 Update 1

     ~] cat /etc/vmware/oem.xml

    You are running DellEMC Customized Image ESXi 6.5 Update 1 A04 (based on ESXi VMKernel Release Build 6765664)

    ~] esxcli storage core adapter list

    HBA Name  Driver             Link State  UID                   Capabilities  Description

    --------  -----------------  ----------  --------------------  ------------  ----------------------------------------------------------

    vmhba3    dell_shared_perc8  link-n/a    sas.0                               (0000:0a:00.0) LSI / Symbios Logic Shared PERC 8 Mini

    vmhba4    dell_shared_perc8  link-n/a    sas.c000016000c00        (0000:15:00.0) LSI / Symbios Logic Shared PERC 8 Mini


  • Dell TechCenter

    Building the Optimal Machine Learning Platform


    Austin Shelnutt ESI Architecture team

    While various forms of machine learning have existed for several decades, neural brainthe past few years of development have yielded some extraordinary progress in democratizing the capabilities and use cases for artificial intelligence in a wide multitude of industries. Image classification, voice recognition, fraud detection, medical diagnostics, and process automation are just a handful of the burgeoning use cases for machine learning that are reinventing the very world we live in.  This blog provides a brief overview of some of the basic principles of Machine Learning and describes the challenges and trade-offs involved in constructing the optimal Machine Learning platform for different use cases.

    Neural Networks are key to machine learning

    At the center of the growth in machine learning is a modeling technique referred to as neural networks (also known as deep neural networks, or deep learning), which is based on our understanding of how the human brain learns and processes information. Neural networks are not a new concept, and have been proposed as a model for computational learning since the 1940’s. What makes neural networks so attractive for machine learning is that they provide a mathematical ecosystem that allows the decision making accuracy of a computer to scale beyond explicit programming rules and, in a sense, learn from experience.

    Previously, the limiting factor of neural network models has been that they are extremely computation intensive and require a tremendous amount of labeled data input to be able to “learn”. This double hurdle of processing power and available data had prevented them from becoming relevant…. until now.

    Today, we stand at the intersection of huge data sets being generated in all corners of industry and the rise of massively-parallel compute infrastructure in the form of enhanced CPU instruction sets, GPUs, FPGAs, and new ASICs - designed specifically to accelerate neural network math. Neural network models that would have taken weeks or even months to run even a couple of years ago can learn (or be “trained”) in just a few hours on today’s hardware. 

    While this convergence of available data and compute has opened up seemingly limitless potential applicability to end users, it presents a new challenge to hardware architects: What does the ultimate machine learning platform look like? To answer that, we need to better understand the machine learning platform stack.

    The machine learning stack consists of:

    The neural network (application) layer – this is the data analysis model

    The framework layer - provides the specialized software neural networks run on

    The math libraries layer - houses the math routines the frameworks call

    The operating system layer – choice of OS

    The hardware platform layer -offers a number of different accelerator options

    The platform choices made at each of these layers can impact the performance and capabilities of the targeted learning function. The following sections expand on important points that should be considered for these layers.

    Neural Network layer

    Neural networks are symbolic representations of the mathematical models created for a specific learned function (for example, speech recognition). Neural networks come in many different shapes, sizes and functions, depending upon both the type of data being ingested and the intended goal (output) of the learned function. The complexity of neural network construction can vary by:nn

    • Specific activation functions
    • Number of activation layers
    • Data set manipulation types:  forward/backward propagation, convolutions, recurrence, LSTM, etc.

    At the highest level, all neural networks break down input features  (such as the pixels in a photo) into multi-dimensional arrays of data (tensors) and then pass them through one or more  layers of parameters (or weights) into activation functions which can be represented as neurons in the neural network.

    Input tensors are multiplied by parameter tensors and activation functions to yield a hypothesis that can be used for a decision – for example, classification that an object shown in a given picture is a cat or a dog. The size and dimensions of the input features and the number of activation layers is what determines how to handle the necessary math operations in the hardware layer (i.e., you may require multiple GPUs).

    When designing the optimal platform to use for a neural network, how that particular neural network is constructed is crucial in determining what options are best for it at other layers of the stack. In general, the platform designer’s goal is to understand how data is moved in, out, and around inside of the system to tune features in a manner that most efficiently eliminates data choke points or bottlenecks.


    For example, small neural networks that can be computed relatively quickly might create a tremendous demand on data set ingest bandwidth either from local storage or remote data pools and consequently would be potentially bottlenecked by slow storage devices or narrow I/O bandwidth.  Pairing this type of neural model with a high performance accelerator platform that lacks significant I/O bandwidth would also result in under-utilized compute hardware.

    As another example, very large neural networks with a large number of input features and/or activation layers, may not fit comfortably inside of a single accelerator’s onboard memory or need to swap weight calculations in and out of the page file during each iteration. This type of model might operate most efficiently when the stored weights can be exchanged and multiplied across multiple accelerators. So, a hardware platform that offers multiple accelerators would be the right choice in this case.  But note that the distribution of operations to multiple accelerators is handled differently by different hardware offerings and frameworks, so the efficiency of distribution varies accordingly. Also note that not every neural network equally benefits from a multiple accelerators – or at least not at the same scaling efficiency. (See the following sections.)

    Framework layer

    Neural network models run on deep learning software frameworks. The proliferation of frameworks, while primarily open source in nature, has largely stemmed from academia and a number of hyperscale service providers – each attempting to advance their own particular code.  You can run virtually any neural network on any deep learning framework, but they are certainly not all created equal. The manner in which frameworks utilize the underpinning hardware varies from framework to framework. While end users often choose a framework based on coding familiarity, there are a number of factors to consider that impact neural network performance:

    • How a framework makes math library calls (and which libraries it uses), how it pulls apart the tensor multiplication operations, and how it maps these operations into the physical hardware are all unique to that framework.
    • Some frameworks are better at scaling outside of a single server to use multiple servers working together - and some are not capable of scaling out at all.
    • Some frameworks are well suited to orchestrating neural network mathematics across a large number of parallel compute devices (i.e. GPUs) within a single server, while others scale very poorly on multiple accelerators.

    Each of these points needs to be considered in light of the characteristics of the specific neural network. They may ultimately influence the choice of framework and the accelerator options.

    Hardware Platform layer

    Choosing the right hardware technology to support a given machine learning application is another challenge for platform design. While CPUs can be used for deep learning, they are scalar multiplication engines by nature, and poorly suited to the higher-order tensor operations common to deep learning (vectors, matrices, and beyond).  So machine learning platforms typically incorporate some form of accelerator technology - GPU, FPGA, or ASIC. But even at that level there are trade-offs to consider – particularly concerning the distribution of operations across multiple accelerators and how that impacts scaling.  These considerations are described below.


    GPUs have been the cornerstone of the deep learning growth in recent years because of their powerful parallel compute capabilities derived from their relatively large number of independent logic cores. The different models for how data is exchanged between GPUs is a differentiating feature when considering platform design.

    FPGAs & ASICs

    Though GPUs currently occupy a fortress on the deep learning market today, technology vendors from across the globe are lining up to take aim at specific soft spots in the GPU’s dominance. The latest FPGA and ASIC technology delivers new levels of component-level performance-per-dollar, performance-per-watt and small-batch efficiency that will result in competitive offerings in 2018 that are intended to blow the doors off contemporary deep learning hardware.

    PCIe-based Accelerators

    bottleneckUsing PCI-Express accelerators for machine learning has become popular for a number of previously discussed reasons, however, one primary benefit is the ability to ‘scale-up’ to use multiple accelerators in the same server. The challenge for effectively using more than one accelerator is data exchange between the cards. The latency and bandwidth limitations for data going back through the host CPU’s PCIE root complex, for example, can be a large performance penalty that negates the multi-accelerator benefit.

    PCIe switchModern non-blocking PCIE switches can be a great solution to this challenge by allowing the PCIE accelerators to exchange data directly without passing through the host root complex, if the framework comprehends this type of communication path.

    Again, here, balance is the key. As you add accelerators to the switch, eventually the host bandwidth between the switch and the (single host) CPU becomes the new bottleneck. Unfortunately, due to the variations in neural networks, data sets, and frameworks, this point is a moving target, and very difficult to predict.

    Specialized accelerator-to-accelerator communication

    Many technology companies are now implementing specialized accelerator to accelerator connection links. Nvidia’s NVlink is a great example of a specialized communication path that dramatically improves bandwidth between accelerators for applications that benefit from peer-to-peer data exchange.

    SpecializedTo be clear, while these auxiliary connection types are extremely valuable for some end customer use cases, there are other deep learning applications that yield very little benefit from this type of interconnect. Furthermore, these specialty interconnects can be costly, both in terms of materials and design changes required to accommodate them.

    In fact, the current proprietary interconnect trend is driving unique server designs - just to support the interconnect; resulting in wide variations in hardware from vendor to vendor. Accelerator technology vendors are, seemingly, abandoning all forms of conventional design guidelines in their own pursuit of maximum peer-to-peer bandwidth. This may be the single biggest pain point for designing a truly optimized deep learning platform.

    What’s next for Machine Learning platforms?

    Physical manifestation of the peer-to-peer interconnect is not the only place where deep learning technology providers are departing from conventional techniques. In the pursuit of ever-improved performance, some vendors are moving beyond PCIE form factor, pushing beyond the accepted power/heat limits, and writing new math libraries.  Platform designers need to be aware that the technology underpinning the explosive growth in machine learning is still very fluid and divergent.


    Machine learning customers have more choices than ever for neural network models and frameworks. Those choices impact the type, number, and form factor of the preferred accelerator, the dataflow topology between accelerators and CPUs, the amount and speed of direct attached storage, and the necessary bandwidth of I/O devices. The resulting platform must:

    • Serve ‘their’ specific learning model– not an unrelated deep learning model.
    • Stay within their data center requirements for server form factor, rack depth, power and cooling
    • Be management agnostic

    Solving a platform optimization challenge with this many degrees of freedom may seem daunting, but Dell EMC is committed to helping our customers meet this challenge. Today, we are already working with a wide range of customers, across a number of industries to solve some of the most complex and interesting machine learning problems.  And going forward, we are committing resources to ensure we remain a technology leader in this arena.

    For more information on what Dell EMC Extreme Scale Infrastructure is doing with Machine Learning, contact .



  • Hotfixes

    Mandatory Hotfix 654693 for 8.6 MR3 Linux Connector Released

    This is a mandatory hotfix for: 


    • Linux Connector


    The following is a list of issues resolved in this release.



    Feature ID


    The Linux Connector might close during the second attempt to download the configuration using an incorrect FQDN path to the Connection Broker



    The error message is displayed during the first connection when an auto-launch application is configured according to the user`s configuration, but not assigned to him



    Configuration is not saved in the Linux Connector if the user clicks Cancel on the Credentials screen after configuration has already been downloaded


    Dynamic Resize

    There is an inform message every time when user resizes remote window



    In Notepad, the Menu drop-down list is displayed shifted when Notepad is placed near the screen border



    The Seamless application does not detect the Unity taskbar



    The session window of an MS Seamless application might disappear after reconnection



    Multiple Monitors

    The resize frame has incorrect position and size when resizing an MS Seamless application if the start point of the primary monitor is not 0,0



    Multiple Monitors

    An MS seamless application might be displayed incorrectly after the left click on it when the upper borders of the monitors are not aligned


    Multiple Monitors

    The session window might be cropped if the Span Multiple Monitors option is not selected and the start point of the primary monitor is not 0,0


    Password Management

    The error message is displayed after attempt to change password for users with non-English names


    Password Management

    If the Require Authentication option is not selected for the farm, the Change Password window is not displayed when the Change Password option is checked either on the Welcome or on the Auto-Configuration screens


    Password Management

    The Password Management messages are not displayed in the client’s language


    This hotfix is available for download at: 

  • Dell TechCenter

    Network Automation with Dell EMC OS9, OS10, Open Switch OPX and Ansible

    This blog describes the Open Switch OPX network automation demo based on the Ansible framework, as delivered by Dell EMC at AnsibleFest2017. For more details on the demo including videos, configuration playbooks and other details please contact

    Demo Overview

    This demo goes over the process of deploying a BGP fabric in a leaf-spine topology using Ansible. We use a single Ansible playbook to configure and bring BGP across OS9, OS10 Enterprise Edition and OPX (Open Switch).

     Ansible Playbook Details

    The Ansible playbook is deployed on the Ansible server that is part of the switch management subnet. The Ansible server is configured to connect with all the switches with SSH and deliver the configuration.

    The playbook deployment includes three primary components switch inventory file, host variable files and the main playbook. This playbook directory structure is shown below.


     Switch Inventory File

    The file lists the switches that would be configured by the playbook. Each node in the data center topology is listed with the OS name, Mgmt. IP address and node name.


    Host Var Files

    Each node in the topology has host variable file associated with it. Host var file for leaf1 node is shown below.

    Ansible Playbook

    The first line in the playbook shows target hosts as datacenter. From the switch inventory we can see that datacenter list includes all the leaf and spine nodes.

    The role entry shows the Dell EMC roles for configuring BGP, interface, system and version information. Ansible runs each of these roles with the help from the host vars files to build the CLI commands from a set of predefined templates and deliver to the devices.

    Running the Playbook


    The arguments to the ansible-playbook command includes the switch list from the inventory file and the playbook file datacenter.yaml. YAML  is new serialization standard used to write the Ansible script.

    The execution of playbook results in the switch configuration generated by programming the jinja2 templates defined for each role used in the playbook. The commands are then delivered to the devices via SSH client connection.

    Repeated execution of the Ansible playbook, will only update any changes made to the playbook or host var file or switch inventory file and thus safe to repeat. All the modules have idempotence baked in. That is, running a module multiple times in a sequence should have the same effect as running it just once.

    Playbook Results!

    A portion of the output on the console is shown below.


    As can be seen, all 4 leaf and 2 spine that are part of the data center topology have been configured with Quagga [BGP] to run a L3 network.

    You can also look at the CLI command file for each node that were delivered to the device under the /tmp directory. This can be a way to troubleshoot deployment.

    For more detailed console output, you can use –vvvv option. This makes the console output very granular and shows the details of all the steps that Ansible takes to deploy to the switches. A useful way to debug the playbooks.

    You can login to one of the switches to check out the deployed configuration. A sample is shown below.


    The demo highlights Ansible as the right, flexible automation framework for switch manageability and a simple programmable environment.

    Ansible can prove to be pretty powerful and the network as a whole doesn’t have to be automated overnight. It’s about thinking a little differently and exploring some automation to see if it makes sense for any given environment. There are steps that can be taken to learn about these new processes and tools. And the best part is, all of these configuration management tools are open source and OPX as well can be tried at no cost.

    Questions? Help? Contact: Open Networking Team at Dell EMC