Our community is talking about the new Dell Technologies. Join the discussion in the Dell EMC Community Network:
Authors: Rengan Xu, Frank Han, Nishanth Dandapanthula.
HPC Innovation Lab. November 2017
In our previous blog, we presented the deep learning performance on single Dell PowerEdge C4130 node with four V100 GPUs. For very large neural network models, a single node is still not powerful enough to quickly train those models. Therefore, it is important to scale the training model to multiple nodes to meet the computation demand. In this blog, we will evaluate the multi-node performance of deep learning frameworks MXNet and Caffe2. The Interconnect in use is Mellanox EDR InfiniBand. The results will show that both frameworks scale well on multiple V100-SXM2 nodes.
Overview of MXNet and Caffe2
In this section, we will give an overview about how MXNet and Caffe2 are implemented for distributed training on multiple nodes. Usually there are two ways to parallelize neural network training on multiple devices: data parallelism and model parallelism. In data parallelism, all devices have the same model but different devices work on different pieces of data. While in model parallelism, difference devices have parameters of different layers of a neural network. In this blog, we only focus on the data parallelism in deep learning frameworks and will evaluate the model parallelism in the future. Another choice in most deep learning frameworks is whether to use synchronous or asynchronous weight update. The synchronous implementation aggregates the gradients over all workers in each iteration (or mini-batch) before updating the weights. However, in asynchronous implementation, each worker updates the weight independently with each other. Since the synchronous way guarantees the model convergence while the asynchronous way is still an open question, we only evaluate the synchronous weight update.
MXNet is able to launch jobs on a cluster in several ways including SSH, Yarn, MPI. For this evaluation, SSH was chosen. In SSH mode, the processes in different nodes use rsync to synchronize the working directory from root node into slave nodes. The purpose of synchronization is to aggregate the gradients over all workers in each iteration (or mini-batch). Caffe2 uses Gloo library for multi-node training and Redis library to facilitate management of nodes in distributed training. Gloo is a MPI like library that comes with a number of collective operations like barrier, broadcast and allreduce for machine learning applications. The Redis library used by Gloo is used to connect all participating nodes.
We chose to evaluate two deep learning frameworks for our testing, MXNet and Caffe2. As with our previous benchmarks, we will again use the ILSVRC 2012 dataset which contains 1,281,167 training images and 50,000 validation images. The neural network in the training is called Resnet50 which is a computationally intensive network that both frameworks support. To get the best performance, CUDA 9 compiler, CUDNN 7 library and NCCL 2.0 are used for both frameworks, since they are optimized for V100 GPUs. The testing platform has four Dell EMC’s PowerEdge C4130 servers in configuration K. The system layout of configuration K is shown in Figure 1. As we can see, the server has four V100-SXM2 GPUs and all GPUs are connected by NVLink. The other hardware and software details are shown in Table 1. Table 2 shows the input parameters that are used to train Resnet50 neural network in both frameworks.
Figure 1: C4130 configuration K
Table 1: The hardware configuration and software details
Table 2: Input parameters used in different deep learning frameworks
Figure 2 and Figure 3 show the Resnet50 performance and speedup results on multiple nodes with MXNet and Caffe2, respectively. As we can see, the performance scales very well with both frameworks. With MXNet, compared to 1*V100, the speedup of using 16*V100 (in 4 nodes) is 15.4x in FP32 mode and 13.8x in FP16, respectively. And compared to FP32, FP16 improved the performance by 63.28% - 82.79%. Such performance improvement was contributed to the Tensor Cores in V100.
In Caffe2, compared to 1*V100, the speedup of using 16*V100 (4 nodes) is 14.8x in FP32 and 13.6x in FP16, respectively. And the performance improvement of using FP16 compared to FP32 is 50.42% - 63.75% excluding the 12*V100 case. With 12*V100, using FP16 is only 29.26% faster than using FP32. We are still investigating the exact reason of it, but one possible explanation is that 12 is not the power of 2, which may make some operations like reductions slower.
Figure 2: Performance of MXNet Resnet50 on multiple nodes
Figure 3: Performance of Caffe2 Resnet50 on multiple nodes
Conclusions and Future Work
In this blog, we present the performance of MXNet and Caffe2 on multiple V100-SXM2 nodes. The results demonstrate that the deep learning frameworks are able to scale very well on multiple Dell EMC’s PowerEdge servers. At this time the FP16 support in TensorFlow is still experimental, our evaluation is in progress and the results will be included in future blogs. We are also working on containerizing these frameworks with Singularity to make their deployment much easier.
This blog post is written by Thiru Navukkarasu and Krishnaprasad K from Dell Hypervisor Engineering.
Dell PowerEdge VRTX was not supported for VMware ESXi 6.5 branch of ESXi thus far. Dell announced support for VRTX from Dell customized version of ESXi 6.5 A04 revision onwards. However there was a late breaking issue identified on dell-shared-perc8 (Version: 06.806.89.00) and ESXi 6.5.x. To resolve this issue, install/upgrade to version 06.806.90.00 and above. This revised driver is part of Dell customized VMware ESXi 6.5 Update1 A04 which is available for download from here.
From VMware ESXi 6.5 onwards, the shared PERC8 controller in VRTX use dell_shared_perc8 native driver instead of megaraid_sas vmklinux driver in ESXi 6.0.x branch.
You may look at the following command outputs in ESXi to verify if you have the supported image installed on PowerEdge VRTX blades.
~] vmware –lv
VMware ESXi 6.5.0 build-6765664
VMware ESXi 6.5.0 Update 1
~] cat /etc/vmware/oem.xml
You are running DellEMC Customized Image ESXi 6.5 Update 1 A04 (based on ESXi VMKernel Release Build 6765664)
~] esxcli storage core adapter list
HBA Name Driver Link State UID Capabilities Description
-------- ----------------- ---------- -------------------- ------------ ----------------------------------------------------------
vmhba3 dell_shared_perc8 link-n/a sas.0 (0000:0a:00.0) LSI / Symbios Logic Shared PERC 8 Mini
vmhba4 dell_shared_perc8 link-n/a sas.c000016000c00 (0000:15:00.0) LSI / Symbios Logic Shared PERC 8 Mini
Austin Shelnutt - ESI Architecture team
While various forms of machine learning have existed for several decades, the past few years of development have yielded some extraordinary progress in democratizing the capabilities and use cases for artificial intelligence in a wide multitude of industries. Image classification, voice recognition, fraud detection, medical diagnostics, and process automation are just a handful of the burgeoning use cases for machine learning that are reinventing the very world we live in. This blog provides a brief overview of some of the basic principles of Machine Learning and describes the challenges and trade-offs involved in constructing the optimal Machine Learning platform for different use cases.
Neural Networks are key to machine learning
At the center of the growth in machine learning is a modeling technique referred to as neural networks (also known as deep neural networks, or deep learning), which is based on our understanding of how the human brain learns and processes information. Neural networks are not a new concept, and have been proposed as a model for computational learning since the 1940’s. What makes neural networks so attractive for machine learning is that they provide a mathematical ecosystem that allows the decision making accuracy of a computer to scale beyond explicit programming rules and, in a sense, learn from experience.
Previously, the limiting factor of neural network models has been that they are extremely computation intensive and require a tremendous amount of labeled data input to be able to “learn”. This double hurdle of processing power and available data had prevented them from becoming relevant…. until now.
Today, we stand at the intersection of huge data sets being generated in all corners of industry and the rise of massively-parallel compute infrastructure in the form of enhanced CPU instruction sets, GPUs, FPGAs, and new ASICs - designed specifically to accelerate neural network math. Neural network models that would have taken weeks or even months to run even a couple of years ago can learn (or be “trained”) in just a few hours on today’s hardware.
While this convergence of available data and compute has opened up seemingly limitless potential applicability to end users, it presents a new challenge to hardware architects: What does the ultimate machine learning platform look like? To answer that, we need to better understand the machine learning platform stack.
The machine learning stack consists of:
The neural network (application) layer – this is the data analysis model
The framework layer - provides the specialized software neural networks run on
The math libraries layer - houses the math routines the frameworks call
The operating system layer – choice of OS
The hardware platform layer -offers a number of different accelerator options
The platform choices made at each of these layers can impact the performance and capabilities of the targeted learning function. The following sections expand on important points that should be considered for these layers.
Neural Network layer
Neural networks are symbolic representations of the mathematical models created for a specific learned function (for example, speech recognition). Neural networks come in many different shapes, sizes and functions, depending upon both the type of data being ingested and the intended goal (output) of the learned function. The complexity of neural network construction can vary by:
At the highest level, all neural networks break down input features (such as the pixels in a photo) into multi-dimensional arrays of data (tensors) and then pass them through one or more layers of parameters (or weights) into activation functions which can be represented as neurons in the neural network.
Input tensors are multiplied by parameter tensors and activation functions to yield a hypothesis that can be used for a decision – for example, classification that an object shown in a given picture is a cat or a dog. The size and dimensions of the input features and the number of activation layers is what determines how to handle the necessary math operations in the hardware layer (i.e., you may require multiple GPUs).
When designing the optimal platform to use for a neural network, how that particular neural network is constructed is crucial in determining what options are best for it at other layers of the stack. In general, the platform designer’s goal is to understand how data is moved in, out, and around inside of the system to tune features in a manner that most efficiently eliminates data choke points or bottlenecks.
For example, small neural networks that can be computed relatively quickly might create a tremendous demand on data set ingest bandwidth either from local storage or remote data pools and consequently would be potentially bottlenecked by slow storage devices or narrow I/O bandwidth. Pairing this type of neural model with a high performance accelerator platform that lacks significant I/O bandwidth would also result in under-utilized compute hardware.
As another example, very large neural networks with a large number of input features and/or activation layers, may not fit comfortably inside of a single accelerator’s onboard memory or need to swap weight calculations in and out of the page file during each iteration. This type of model might operate most efficiently when the stored weights can be exchanged and multiplied across multiple accelerators. So, a hardware platform that offers multiple accelerators would be the right choice in this case. But note that the distribution of operations to multiple accelerators is handled differently by different hardware offerings and frameworks, so the efficiency of distribution varies accordingly. Also note that not every neural network equally benefits from a multiple accelerators – or at least not at the same scaling efficiency. (See the following sections.)
Neural network models run on deep learning software frameworks. The proliferation of frameworks, while primarily open source in nature, has largely stemmed from academia and a number of hyperscale service providers – each attempting to advance their own particular code. You can run virtually any neural network on any deep learning framework, but they are certainly not all created equal. The manner in which frameworks utilize the underpinning hardware varies from framework to framework. While end users often choose a framework based on coding familiarity, there are a number of factors to consider that impact neural network performance:
Each of these points needs to be considered in light of the characteristics of the specific neural network. They may ultimately influence the choice of framework and the accelerator options.
Hardware Platform layer
Choosing the right hardware technology to support a given machine learning application is another challenge for platform design. While CPUs can be used for deep learning, they are scalar multiplication engines by nature, and poorly suited to the higher-order tensor operations common to deep learning (vectors, matrices, and beyond). So machine learning platforms typically incorporate some form of accelerator technology - GPU, FPGA, or ASIC. But even at that level there are trade-offs to consider – particularly concerning the distribution of operations across multiple accelerators and how that impacts scaling. These considerations are described below.
GPUs have been the cornerstone of the deep learning growth in recent years because of their powerful parallel compute capabilities derived from their relatively large number of independent logic cores. The different models for how data is exchanged between GPUs is a differentiating feature when considering platform design.
FPGAs & ASICs
Though GPUs currently occupy a fortress on the deep learning market today, technology vendors from across the globe are lining up to take aim at specific soft spots in the GPU’s dominance. The latest FPGA and ASIC technology delivers new levels of component-level performance-per-dollar, performance-per-watt and small-batch efficiency that will result in competitive offerings in 2018 that are intended to blow the doors off contemporary deep learning hardware.
Using PCI-Express accelerators for machine learning has become popular for a number of previously discussed reasons, however, one primary benefit is the ability to ‘scale-up’ to use multiple accelerators in the same server. The challenge for effectively using more than one accelerator is data exchange between the cards. The latency and bandwidth limitations for data going back through the host CPU’s PCIE root complex, for example, can be a large performance penalty that negates the multi-accelerator benefit.
Modern non-blocking PCIE switches can be a great solution to this challenge by allowing the PCIE accelerators to exchange data directly without passing through the host root complex, if the framework comprehends this type of communication path.
Again, here, balance is the key. As you add accelerators to the switch, eventually the host bandwidth between the switch and the (single host) CPU becomes the new bottleneck. Unfortunately, due to the variations in neural networks, data sets, and frameworks, this point is a moving target, and very difficult to predict.
Specialized accelerator-to-accelerator communication
Many technology companies are now implementing specialized accelerator to accelerator connection links. Nvidia’s NVlink is a great example of a specialized communication path that dramatically improves bandwidth between accelerators for applications that benefit from peer-to-peer data exchange.
To be clear, while these auxiliary connection types are extremely valuable for some end customer use cases, there are other deep learning applications that yield very little benefit from this type of interconnect. Furthermore, these specialty interconnects can be costly, both in terms of materials and design changes required to accommodate them.
In fact, the current proprietary interconnect trend is driving unique server designs - just to support the interconnect; resulting in wide variations in hardware from vendor to vendor. Accelerator technology vendors are, seemingly, abandoning all forms of conventional design guidelines in their own pursuit of maximum peer-to-peer bandwidth. This may be the single biggest pain point for designing a truly optimized deep learning platform.
What’s next for Machine Learning platforms?
Physical manifestation of the peer-to-peer interconnect is not the only place where deep learning technology providers are departing from conventional techniques. In the pursuit of ever-improved performance, some vendors are moving beyond PCIE form factor, pushing beyond the accepted power/heat limits, and writing new math libraries. Platform designers need to be aware that the technology underpinning the explosive growth in machine learning is still very fluid and divergent.
Machine learning customers have more choices than ever for neural network models and frameworks. Those choices impact the type, number, and form factor of the preferred accelerator, the dataflow topology between accelerators and CPUs, the amount and speed of direct attached storage, and the necessary bandwidth of I/O devices. The resulting platform must:
Solving a platform optimization challenge with this many degrees of freedom may seem daunting, but Dell EMC is committed to helping our customers meet this challenge. Today, we are already working with a wide range of customers, across a number of industries to solve some of the most complex and interesting machine learning problems. And going forward, we are committing resources to ensure we remain a technology leader in this arena.
For more information on what Dell EMC Extreme Scale Infrastructure is doing with Machine Learning, contact ESI@dell.com .
This is a mandatory hotfix for:
The following is a list of issues resolved in this release.
The Linux Connector might close during the second attempt to download the configuration using an incorrect FQDN path to the Connection Broker
The error message is displayed during the first connection when an auto-launch application is configured according to the user`s configuration, but not assigned to him
Configuration is not saved in the Linux Connector if the user clicks Cancel on the Credentials screen after configuration has already been downloaded
There is an inform message every time when user resizes remote window
In Notepad, the Menu drop-down list is displayed shifted when Notepad is placed near the screen border
The Seamless application does not detect the Unity taskbar
The session window of an MS Seamless application might disappear after reconnection
The resize frame has incorrect position and size when resizing an MS Seamless application if the start point of the primary monitor is not 0,0
An MS seamless application might be displayed incorrectly after the left click on it when the upper borders of the monitors are not aligned
The session window might be cropped if the Span Multiple Monitors option is not selected and the start point of the primary monitor is not 0,0
The error message is displayed after attempt to change password for users with non-English names
If the Require Authentication option is not selected for the farm, the Change Password window is not displayed when the Change Password option is checked either on the Welcome or on the Auto-Configuration screens
The Password Management messages are not displayed in the client’s language
This hotfix is available for download at: https://support.quest.com/kb/234432
This blog describes the Open Switch OPX network automation demo based on the Ansible framework, as delivered by Dell EMC at AnsibleFest2017. For more details on the demo including videos, configuration playbooks and other details please contact feedback-ansible-dell-networking@Dell.com
This demo goes over the process of deploying a BGP fabric in a leaf-spine topology using Ansible. We use a single Ansible playbook to configure and bring BGP across OS9, OS10 Enterprise Edition and OPX (Open Switch).
Ansible Playbook Details
The Ansible playbook is deployed on the Ansible server that is part of the switch management subnet. The Ansible server is configured to connect with all the switches with SSH and deliver the configuration.
The playbook deployment includes three primary components switch inventory file, host variable files and the main playbook. This playbook directory structure is shown below.
Switch Inventory File
The file lists the switches that would be configured by the playbook. Each node in the data center topology is listed with the OS name, Mgmt. IP address and node name.
Host Var Files
Each node in the topology has host variable file associated with it. Host var file for leaf1 node is shown below.
The first line in the playbook shows target hosts as datacenter. From the switch inventory we can see that datacenter list includes all the leaf and spine nodes.
The role entry shows the Dell EMC roles for configuring BGP, interface, system and version information. Ansible runs each of these roles with the help from the host vars files to build the CLI commands from a set of predefined templates and deliver to the devices.
Running the Playbook
The arguments to the ansible-playbook command includes the switch list from the inventory file and the playbook file datacenter.yaml. YAML is new serialization standard used to write the Ansible script.
The execution of playbook results in the switch configuration generated by programming the jinja2 templates defined for each role used in the playbook. The commands are then delivered to the devices via SSH client connection.
Repeated execution of the Ansible playbook, will only update any changes made to the playbook or host var file or switch inventory file and thus safe to repeat. All the modules have idempotence baked in. That is, running a module multiple times in a sequence should have the same effect as running it just once.
A portion of the output on the console is shown below.
As can be seen, all 4 leaf and 2 spine that are part of the data center topology have been configured with Quagga [BGP] to run a L3 network.
You can also look at the CLI command file for each node that were delivered to the device under the /tmp directory. This can be a way to troubleshoot deployment.
For more detailed console output, you can use –vvvv option. This makes the console output very granular and shows the details of all the steps that Ansible takes to deploy to the switches. A useful way to debug the playbooks.
You can login to one of the switches to check out the deployed configuration. A sample is shown below.
The demo highlights Ansible as the right, flexible automation framework for switch manageability and a simple programmable environment.
Ansible can prove to be pretty powerful and the network as a whole doesn’t have to be automated overnight. It’s about thinking a little differently and exploring some automation to see if it makes sense for any given environment. There are steps that can be taken to learn about these new processes and tools. And the best part is, all of these configuration management tools are open source and OPX as well can be tried at no cost.
Questions? Help? Contact: Open Networking Team at Dell EMC