Latest Blog Posts
  • General HPC

    Containerizing HPC Applications with Singularity

    Authors: Rengan Xu, Frank Han, Nishanth Dandapanthula.

    HPC Innovation Lab. October 2017

    Overview

    In this blog, we give an introduction to Singularity containers and how they can be used to containerize HPC applications. We run different deep learning frameworks with and without Singularity containers and show that there is no performance loss with Singularity. We also show that Singularity can easily be used to run MPI applications.

    Introduction to Singularity

    Singularity is a container system developed at Lawrence Berkeley National Laboratory to provide container technology like Docker for High Performance Computing (HPC). It wraps applications into an isolated virtual environment to simplify application deployment. Unlike a virtual machine, the container does not have a virtual hardware layer or its own Linux kernel inside the host OS; it simply sandboxes the environment, so the overhead and performance loss are minimal. The goal of the container is reproducibility: it holds the entire environment and all the libraries an application needs to run, and it can be deployed anywhere so that anyone can reproduce the results the container creator generated for that application.

    Besides Singularity, another popular container technology is Docker, which has been widely used for many applications. However, there are several reasons why Docker is not suitable for an HPC environment, and why we chose Singularity instead:

    • Security concern. The Docker daemon has root privileges and this is a security concern for several high performance computing centers. In contrast, Singularity solves this by running the container with the user’s credentials. The access permissions of a user are the same both inside the container and outside the container. Thus, a non-root user cannot change anything outside of his/her permission.

    • HPC Scheduler. Docker does not support any HPC job scheduler, but Singularity integrates seamlessly with all job schedulers including SLURM, Torque, SGE, etc.

    • GPU support. Docker does not support GPUs natively, while Singularity does. Users can install whatever CUDA version and software they want on the host, and it can be transparently passed to Singularity.

    • MPI support. Docker does not support MPI natively, so if a user wants to use MPI with Docker, an MPI-enabled Docker image needs to be developed. Even if an MPI-enabled Docker image is available, the network stacks such as TCP and those needed by MPI are private to the container, which makes Docker containers unsuitable for more complicated networks like InfiniBand. In Singularity, the user's environment is shared with the container seamlessly (a minimal launch sketch follows this list).
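    As an aside on the last point, the usual pattern is a hybrid model in which the host's MPI launcher starts one container instance per rank. The image and application names below are placeholders rather than anything from this blog; treat it as a minimal sketch of that model.

    # Host-side mpirun launches the ranks; each rank executes the application from
    # inside the Singularity image (image and binary names are examples)
    mpirun -np 16 -hostfile ./hosts \
        singularity exec my_hpc_app.img /opt/my_app/bin/my_mpi_app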

    Challenges with Singularity in HPC and Workaround

    Many HPC applications, especially deep learning applications, have deep library dependencies, and it is time consuming to figure out these dependencies and debug build issues. Most deep learning frameworks are developed on Ubuntu, but they often need to be deployed to Red Hat Enterprise Linux (RHEL). So it is beneficial to build those applications once in a container and then deploy them anywhere. The most important goal of Singularity is portability, which means that once a Singularity container is created, it can be run on any system. However, so far it is still not easy to achieve this goal. Usually we build a container on a laptop, a server, a cluster or a cloud, and then deploy that container on a server, a cluster or a cloud.

    When building a container, one challenge arises on GPU-based systems, which have a GPU driver installed. If we choose to install a GPU driver inside the container and the driver version does not match the host GPU driver, an error will occur, so the container should always use the host GPU driver. The next option is to bind the paths of the GPU driver binaries and libraries into the container so that these paths are visible to the container. However, if the container OS is different from the host OS, such binding may cause problems. For instance, assume the container OS is Ubuntu while the host OS is RHEL, and on the host the GPU driver binaries are installed in /usr/bin and the driver libraries in /usr/lib64. The container OS also has /usr/bin and /usr/lib64; therefore, if we bind those paths from the host into the container, the other binaries and libraries inside the container may no longer work, because they may not be compatible across different Linux distributions. One workaround is to move all the driver-related files to a central place that does not exist in the container and then bind that central place.

    The second solution is to implement the above workaround inside the container so that the container can use those driver-related files automatically. This feature has already been implemented in the development branch of the Singularity repository: a user just needs to add the option "--nv" when launching the container. However, based on our experience, a cluster usually installs the GPU driver in a shared file system instead of the default local path on all nodes, and then Singularity is not able to find the GPU driver path if the driver is not installed in the default or common paths (e.g. /usr/bin, /usr/sbin, /bin, etc.). Even if the container is able to find the GPU driver and the corresponding driver libraries and we build the container successfully, an error will still occur if the host driver version on the deployment system is not new enough to support the GPU libraries that were linked to the application when building the container. Because GPU drivers are backward compatible, the deployment system should keep its GPU driver up to date to ensure the driver libraries are equal to or newer than the GPU libraries that were used to build the container.
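    For reference, the two approaches described above look roughly like the following on the command line. The image name and the central driver directory are placeholders, not paths from this blog, so treat this as a hedged sketch rather than a tested recipe.

    # Option 1: let Singularity bind the host NVIDIA driver automatically (requires a
    # Singularity build that supports the --nv option)
    singularity exec --nv my_dl_framework.img nvidia-smi

    # Option 2: bind a central, container-neutral directory holding the host driver
    # binaries and libraries (directory name is an example); PATH and LD_LIBRARY_PATH
    # inside the container must also point at this directory
    singularity exec -B /opt/nvidia-host-driver:/opt/nvidia-host-driver \
        my_dl_framework.img nvidia-smi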

    Another challenge is using InfiniBand with the container, because the InfiniBand driver is kernel dependent. There is no issue if the container OS and host OS are the same or compatible; for instance, RHEL and CentOS are compatible, and Debian and Ubuntu are compatible. But if the two OSs are not compatible, there will be library compatibility issues if we let the container use the host InfiniBand driver and libraries. If we instead install the InfiniBand driver inside the container, then the driver in the container and the host kernel are not compatible. The Singularity community is still trying hard to solve this InfiniBand issue. Our current solution is to make the container OS and host OS compatible and let the container reuse the InfiniBand driver and libraries on the host.

    Singularity on Single Node

    To measure the performance impact of using a Singularity container, we ran the neural network Inception-V3 with three deep learning frameworks: NV-Caffe, MXNet and TensorFlow. The test server is a Dell PowerEdge C4130 configuration G. We compared the training speed in images/sec with a Singularity container and on bare metal (without a container). The performance comparison is shown in Figure 1. As we can see, there is no overhead or performance penalty when using a Singularity container.

    Figure 1: Performance comparison with and without Singularity

     

    Singularity at Scale

    We ran HPL across multiple nodes and compared the performance with and without the container. All nodes are Dell PowerEdge C4130 configuration G with four P100-PCIe GPUs, and they are connected via Mellanox EDR InfiniBand. The result comparison is shown in Figure 2. As we can see, the performance difference is within ±0.5%. This is within the normal variation range, since HPL performance varies slightly from run to run. This indicates that MPI applications such as HPL can be run at scale without performance loss with Singularity.

     

    Figure 2: HPL performance on multiple nodes

    Conclusions

    In this blog, we introduced what Singularity is and how it can be used to containerize HPC applications. We discussed the benefits of using Singularity over Docker. We also mentioned the challenges of using Singularity in a cluster environment and the workarounds available. We compared the performance of bare metal vs Singularity container and the results indicated that there is no performance loss when using Singularity. We also showed that MPI applications can be run at scale without performance penalty with Singularity.

  • Hotfixes

    Mandatory Hotfix 654624 for 8.6 MR3 Mac Connector Released

    This is a mandatory hotfix for: 

     

    • Mac Connector

       

    The following is a list of issues resolved in this release (Feature, Description, Feature ID).

    • BYOD: It is possible to create two connections for the same user if using different registry letters in the user name (test, Test, etc.) (Feature ID 615712)

    • BYOD: Configuration is not updated automatically if the user target has a higher priority during the first connection to the configuration (Feature ID 653742)

    • DI mode: The notification message should be shown after each attempt to launch the connector when it is already launched in the DI mode (Feature ID 628347)

    • DI mode: Connector may close if the user cancels the connection action when the connector is in the DI mode (Feature ID 654343)

    • Microphone: Microphone redirection is not displayed in the connection setting policies (Feature ID 654556)

    • Microphone: Add the Microphone redirection support for the Mac connector (Feature ID 654549)

    • Session launching: The application will not launch until the tutorial window is closed (Feature ID 653906)

    • Session launching: Session may close during launching on the 10.10.5 system (Feature ID 654341)

    • Connection settings: Connector may close after clicking on the Connection Settings button on the 10.9.5 system (Feature ID 654345)

    • Password Management: The Change Password checkbox is not displayed on the Welcome screen if the Require Authentication checkbox is unchecked and PM is configured (Feature ID 653754)

    • UI: User's credentials aren't filled if the user cancels finding of the configuration process (Feature ID 654136)

    • UI: The Details button is displayed in the Adding Manual Connection window (Feature ID 630956)

    • UI: The Next button is disabled when using the clipboard to paste into the User Name field on the Auto-Configure screen (Feature ID 654164)

    • UI: There is an empty space in the extended settings if the Detect connection quality automatically connection speed is selected in the configuration properties (Feature ID 654570)

    • UI: Mac connector doesn't display the non-login broker errors (Feature ID 654344)

    • UI: There are alignment issues in the connection settings if the user saves credentials after the connection setting was opened at least once (Feature ID 654558)

    This hotfix is available for download at: https://support.quest.com/vworkspace/kb/233661 

  • Hotfixes

    Mandatory Hotfix 654525 for 8.6 MR3 iOS Connector Released

    This is a mandatory hotfix for: 

     

    • iOS Connector

       

    The following is a list of issues resolved in this release (Feature, Description, Feature ID).

    • Rendering: Corruptions appear in Microsoft Excel 2010 if the session is launched from a 2008R2 server with Windows 7 (Feature ID 654402)

    • Rendering: The cursor is not displayed in Microsoft Word 2010 if the session is launched from a 2008R2 server with Windows 7 (Feature ID 654513)

    • UI: The Connection bar is not resized in the Split View mode (Feature ID 616879)

    • UI: The remote session is not resized properly in the Split View mode (Feature ID 654490)

    • UI: The width of the vWs website\email address field changes when the user enters data into it (Feature ID 654530)

    This hotfix is available for download from the Apple App Store at: https://itunes.apple.com/us/app/vworkspace/id406043462 

  • Custom Solutions Engineering Blog

    Windows Server 2008 Install on Dell EMC's 14th Generation PowerEdge R640/R740 Servers


    Written by Andrew Bachler

     

    Installation of the Microsoft Windows Server 2008 operating system is not officially supported on Dell PowerEdge 14th generation (14G) servers. (This install procedure was only tested on the R640 and R740 platforms.)

    Dell highly recommends that you migrate to a supported OS. This should not be considered a long-term solution; use it only temporarily until you migrate to a supported OS and platform.

    The 14th generation PowerEdge servers only support USB 3.0, while Windows Server 2008 only supports USB 2.0 from the base installation media. To work around this issue, we purchased a generic PCIe USB 2.0 card with 3 external USB 2.0 ports to connect the DVD drive, mouse, and keyboard. We also used a USB hub to daisy-chain an additional USB key to install the required PERC driver during the installation process.

    Windows Server 2008 installation is supported in a virtualized environment; this method is only for installation on a physical PowerEdge server.

    Using this USB card, the PERC drivers, and the steps shown below will result in a successful installation of Windows Server 2008 on 14G servers. There is no official Dell support for this process, and it should be used at your own risk.

    Please contact your Sales/Account Team if you are interested to learn more about this solution.

    To learn more about Dell Custom Solutions Engineering visit www.dell.com/customsolutions

    We also require a disclaimer form to be signed as an acknowledgment that the customer is aware this is not supported by Dell and that there are associated risks the customer must assume.

     

     

  • vWorkspace - Blog

    What's new for vWorkspace - July / August / September 2017

    Now updated quarterly, this publication provides you with new and recently revised information and is organized into the following categories: Documentation, Notifications, Patches, Product Life Cycle, Release, and Knowledge Base Articles.

    Subscribe to the RSS (Use IE only)

     

    Knowledgebase Articles

    New 

    232045 - Error 1904.Module C:\Windows\SysWOW64\pnsftdll.dll failed to register on Windows Server 2016.

    vWorkspace Connection broker, User Profile Management role installation on Windows Server 2016 fails or partially succeeds. Connection broker...

    Created: August 14, 2017

     

    232665 - Can SSLv3 be disabled and connections forced to use only TLS 1.2?

    For security reasons, one can choose to disable SSLv3 and use only TLS 1.2. How can this be done within vWorkspace?

    Created: September 6, 2017

     

    232979 - Mandatory Hotfix 654335 for 8.6 MR3 Windows connector

    This is a Mandatory hotfix and can be installed on the following vWorkspace roles: Windows Connector Please see the full Release Notes...

    Created: September 18, 2017

     

    232974 - Mandatory Hotfix 654337 for 8.6 MR3 Web Access

    This is a Mandatory hotfix and can be installed on the following vWorkspace roles: Web Access Please see the full Release Notes attached below. 

    Created: September 18, 2017

     

    233001 - Optional Hotfix 654339 for 8.6 MR3 Password Management

    This is a Mandatory hotfix and can be installed on the following vWorkspace roles: Password Management Please see the full Release Notes...

    Created: September 19, 2017

     

    Revised

    182253 - Poor performance with using EOP USB to redirect USB drive

    When using EOP USB to redirect a USB drive or memory stick to a VDI, redirection is sporadic and the data transfer rate is poor when it is redirected.

    Revised: July 21, 2017

     

    232952 - Mandatory Hotfix 654338 for 8.6 MR3 Management Console

    This is a Mandatory hotfix and can be installed on the following vWorkspace roles: Management Console Please see the full Release Notes...

    Revised: September 18, 2017

     

    232958 - Mandatory Hotfix 654514 for PNTools/RDSH role

    This is a Mandatory hotfix and can be installed on the following vWorkspace roles: Remote Desktop Session Host (RDSH) PNTools (VDI...

    Revised: September 18, 2017

     

    228613 - Mandatory Hotfix 654037 for 8.6 MR3 Mac Connector

    This mandatory hotfix addresses the following issues: Picture is pasted instead of copied text when copying from Excel 2016 to Mac version of...

    Revised: September 18, 2017

     

    232947 - Mandatory Hotfix 654336 for 8.6 MR3 Connection Broker

    This is a Mandatory hotfix and can be installed on the following vWorkspace roles: Connection Broker Please see the full Release Notes...

    Revised: September 18, 2017

     

    Product Life Cycle - vWorkspace

  • Windows 10 IoT Enterprise (WIE10) - Blog

    Write Filter (WF) in Windows Thin Clients

    The Unified Write Filter (UWF) is a Microsoft feature built into Windows 10 IoT Enterprise (previous versions like XP Embedded and WES7 used the Enhanced Write Filter). Both provide similar protection for your local NAND-based storage device.

    The WF intercepts all writes to disk and stores them in an overlay in RAM. Some applications, like Windows Defender, need writes to be committed, so you can create exclusions for these programs so that they operate correctly. The WF has a two-fold advantage and should be turned ON for normal thin client operation (except during maintenance such as Windows Updates or application installs/upgrades).

    1. Prevents unauthorized changes (applications, data, etc. by end user or malicious software) from being stored on a thin client filling up local storage
    2. More importantly, it reduces the wear and tear on your NAND-based storage medium, such as eMMC flash or SSDs. In typical use cases, the life of the storage is extended several times. HDD-based systems also see benefits, albeit not as significant as NAND-based ones.


  • Custom Solutions Engineering Blog

    Dell EMC, Redfish & Docker – Simplifying Modern Datacenter Management

    Written by Ajeet Raina

     

    The growing scale of cloud and web-based datacenter infrastructure is reshaping the needs of IT administration worldwide. New approaches to systems management are needed to scale the datacenter as the server infrastructure grows. Compared to the traditional datacenter, modern scale-out infrastructure demands vendor-independent APIs, interfaces, and systems management tools and solutions.

    There is great demand for modern, software-defined approaches and standards that can support the full range of server architectures, from monolithic to converged hyper-scale. RESTful APIs provide a new way for computer systems on the network to interoperate. Because of their simplicity, usability, encrypted connections and security, and because they offer a programmatic interface that can easily be driven from scripts, they are widely accepted as the standard for developing web APIs and data exchange formats.

     

    Most Dell server management software offerings, as well as the entire Software Defined Infrastructure, are built upon a standard implemented using a RESTful architecture called Redfish. Redfish is a next-generation systems management standard using a data model representation inside a hypermedia RESTful interface. The data model is defined in terms of a standard, machine-readable schema, with the payload of the messages expressed in JSON and the protocol using OData v4. Since it is a hypermedia API, Redfish is capable of representing a variety of implementations through a consistent interface. It has mechanisms for discovering and managing datacenter resources, handling events, and managing long-lived tasks. It is easy to implement, easy to consume, and offers scalability advantages over older technologies. In short, Redfish is a RESTful interface over HTTPS in JSON format, based on OData v4, usable by clients, scripts, and browser-based GUIs.

     

    The Dell EMC iDRAC RESTful API is available on Dell 12th generation (12G) servers and later running Dell EMC iDRAC with firmware version 2.40.40.40 or later and the iDRAC Standard license. To access the RESTful API, you need an HTTPS-capable client, such as a web browser, the Postman REST client plugin, or cURL (a popular command-line HTTP utility).
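    As a quick illustration of what such a call looks like, the following cURL requests query the Redfish service root and the system resource on an iDRAC. The IP address and credentials are placeholders, and exact resource paths can vary with iDRAC firmware, so treat this as a hedged sketch rather than output captured for this post.

    # Query the Redfish service root (-k skips certificate validation for a
    # self-signed iDRAC certificate; replace the address and credentials)
    curl -k -u root:password https://192.168.0.120/redfish/v1/

    # Drill into the server's system resource for inventory and health information
    curl -k -u root:password https://192.168.0.120/redfish/v1/Systems/System.Embedded.1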

     Why Docker?

    Docker containers are rapidly gaining adoption inside the datacenter. Docker is an open platform that makes applications and workloads more portable and distributable in an effective and standardized way. Combining Docker containers and microservices with software-defined infrastructure makes the datacenter more agile and allows quicker resource reallocation. Hence, this architecture works well to improve datacenter operations.

    Figure 1.1 shows the architecture of Dell EMC Universal Systems Manager, a modern approach to the Dell EMC server management solution that integrates a monitoring and logging pipeline using Docker and microservices. This management solution runs entirely on Docker containers and has been tested with the latest stable Docker 17.06 release. Dell EMC iDRAC9 comes with enhancements to the Dell EMC iDRAC RESTful API for Server Configuration Profiles (SCP) support and iDRAC configuration, including firmware inventory and update. The underlying solution simplifies and automates BIOS configuration changes for one or multiple systems, builds a centralized logging pipeline using a modern logging stack such as ELKF, and records time-series data using a modern monitoring stack consisting of Prometheus, Grafana, and Alertmanager.

     


    Figure: 1.1

     

    Please contact your Sales/Account Team if you are interested to learn more about this solution.

    To learn more about Dell Custom Solutions Engineering visit www.dell.com/customsolutions


  • Dell TechCenter

    Linux Multipathing

    Author: Steven Lemons

    What exactly is Linux multipathing? It’s the process used to configure multiple I/O access paths from Dell™ PowerEdge™ Servers to the Dell EMC™ SC Series storage block devices.

    When thinking about how best to access these SC Series block devices from a Linux host, there are generally two configuration options available when needing the fail-over and load-balancing benefits of multipathing:

    • Dell EMC PowerPath™
    • Native Device Mapper Multipathing (DM-Multipath)

    Much like gaming, where each of us prefers a specific character configuration, each Linux admin has their preferred method of multipathing configuration. This blog calls out PowerPath with its support for SC Series storage, in addition to highlighting some of the native DM-Multipath configuration attributes that might be taken at "default" value (you seasoned Linux admins will get that joke). While highlighting these additional configuration attributes, which might enrich your DM-Multipath journey (let's face it, the wrong choice here is a headache generator), we'd also like to present some simple BASH functions for inclusion within your administrative tool belt.

    The scenarios discussed in this blog adhere to the following Dell EMC recommended best practices. If you’re already familiar with our online resources, give yourself +10 Wisdom.

    Note: All referenced testing or scripting was completed on RHEL 6.9, RHEL 7.3 & SLES 12 SP2.

    Dell EMC PowerPath

    PowerPath supports SC Series storage. Go ahead, take a moment to let that sink in.

    YES! This exciting announcement was made at Dell EMC World 2016, in the live music capital of the world – Austin, TX.

    So, let’s see here….

    • PowerPath supports the following Operating Systems (OSes): AIX®, HP-UX, Linux, Oracle Solaris®, Windows®, and vSphere ESXi®.
    • PowerPath supports many Dell EMC Storage Arrays, including Dell EMC Unity, SC Series, and Dell EMC VPLEX
    • PowerPath provides a single management interface across expanding physical and virtual environments, easing the learning curve
    • PowerPath provides a single configuration script for dynamic LUN scanning for addition or removal from the host (/etc/opt/emcpower/emcplun_linux – this is a huge time saver vs. crafting a bunch of for loops with echo statements at the CLI when needing to scan SAN changes at the Linux host level; see the sketch after this list)
    • PowerPath provides built-in performance monitoring (disabled by default) for all of its managed devices
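    For context, the manual alternative alluded to in the LUN-scanning bullet above usually looks something like the loop below. This is a generic SCSI-rescan sketch, not an excerpt from the emcplun_linux script, and host numbering varies per system.

    # Ask every SCSI host adapter to rescan its bus for newly presented LUNs
    for host in /sys/class/scsi_host/host*; do
        echo "- - -" > "$host/scan"
    done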

    The below screenshot of the powermt (PowerPath Management Utility) output highlights the attached SC Series storage array (red box) and the policy in use for the configured I/O paths (blue box) of the attached LUNs.

    Since we’re talking about multipathing, I’d be remiss if I did not cover PowerPath’s configuration ability to provide load balancing and failover policies specific to your individual storage array. A licensed version of PowerPath will default to the proprietary Adaptive load balancing and failover policy that assigns I/O requests based on an algorithmic decision of path load and logical device priority. The below screenshot of the powermt man page shows some of these available policies.

    If you haven’t taken the time to introduce PowerPath into your environment for multipathing needs, hopefully this blog sparked an interest that will get you in front of this outstanding multipathing management utility. You can give PowerPath a try, free for a 45-day trial, by reviewing PowerPath Downloads.

    Native Device Mapper Multipathing (DM-MPIO)

    Do you spend hours kludging that perfect /etc/multipath.conf file with your bash one-liners? For those keeping score, give yourself +10 Experience.

    Configuring DM-MPIO natively on Linux, whether it’s RHEL 6.3, RHEL 7.3 or SLES 12 SP2, follows a programmatic approach by first finding those WWIDs being presented to your host from your SC Series array and then applying various configuration attributes on top of the multipath device (/dev/mapper/SC_VOL_01, for example) being created.

    Speaking of finding WWIDs, the below find_sc_wwids function (available for your .bashrc consideration) should help and increase your score by +5 Ability.

    find_sc_wwids () {
      # Print "<device>:<WWID>" for every SC Series (COMPELNT) device, sorted by WWID
      for x in `/usr/bin/lsscsi|/usr/bin/egrep -i compelnt|/usr/bin/awk -F' ' '{print $7}'`; do
        /usr/bin/echo -n "$x:"; /usr/lib/udev/scsi_id -g -u $x
      done | /usr/bin/sort -t":" -k2
    }

    Note: All .bashrc functions provided in this blog post were written and tested on RHEL 6.3, RHEL 7.3 & SLES 12 SP2.

    While there are configurable attributes within the multipaths section of the configuration file for each specified device (alias, uid, gid, mode, etc.), there are some attributes within the defaults section that may need adjusting according to the performance requirements of your project. The next couple of sections focus on these attributes while providing some coding examples to help with your unique implementation.

    Default value for path_selector

    Configuring DM-MPIO on RHEL 6.3, RHEL 7.3 or SLES 12 SP2 changes the default path_selector algorithm from "round-robin 0" to the more latency-sensitive "service-time 0". Performance gains can be obtained by moving to "service-time 0" due to its ability to monitor the latency of configured I/O paths and load balance accordingly. This is a much more efficient I/O path selection policy than the prior round-robin method, which simply balanced the load across all active paths regardless of the latency impacting those paths.
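    For orientation, a minimal /etc/multipath.conf excerpt that pins this selector and aliases one SC Series volume might look like the following. The WWID and alias are placeholders (the WWID would come from find_sc_wwids above), and these values are illustrative rather than Dell EMC recommended settings.

    defaults {
        path_selector        "service-time 0"
        user_friendly_names  yes
    }

    multipaths {
        multipath {
            wwid   36000d31000example0000000000000001
            alias  SC_VOL_01
        }
    }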

    Multipath device configuration attributes

    Now that it’s time to start digging into the weeds of those configurable attributes of each multipath device being created, save yourself some cli kludging with the following dm_iosched_values function (available for your .bashrc consideration) and increase your score by +5 Ability.

    dm_iosched_values () {
    for x in `/usr/bin/ls -al /dev/mapper/SC*|/usr/bin/awk -F' ' '{gsub(/\.\.\//,"");print $11}'`;
    do
      printf "%b\n" "Config files for /sys/block/$x/queue/"
      for q in `/usr/bin/find /sys/block/$x/queue/ -type f`;
      do
        printf "%b" "\t$x/queue$q:"
        /usr/bin/cat "$q"
      done
    done
    }

    The above dm_iosched_values function will give you a quick reference to all configurable attributes, as shown in the below screenshot, which should help with configuring the host for optimal DM-MPIO performance.

    max_sectors_kb

    Depending on the performance requirements of your project, the "max_sectors_kb" value of the host-configured DM-MPIO devices may need to be changed to match application requirements. The below .bashrc functions can help with both discovering and changing this value (available for your .bashrc consideration) and increase your score by +5 Ability.

    sc_max_sectors_kb () {
      for x in `/usr/bin/ls -al /dev/mapper/SC*|/usr/bin/awk -F' ' '{gsub(/\.\.\//,"");print $11}'`; do /usr/bin/echo -n "$x:Existing max_sectors_kb ->"; /usr/bin/cat /sys/block/$x/queue/max_sectors_kb;done
    }

    set_512_MaxSectorsKB () {
    for x in `/usr/bin/ls -al /dev/mapper/SC*|/usr/bin/awk -F' ' '{gsub(/\.\.\//,"");print $11}'`;
    do
      printf "%b" "$x:\tExisting max_sectors_kb:"
      /usr/bin/cat /sys/block/$x/queue/max_sectors_kb
      /usr/bin/echo 512 > /sys/block/$x/queue/max_sectors_kb
      printf "%b" "\tNew max_sectors_kb:"
      /usr/bin/cat /sys/block/$x/queue/max_sectors_kb
    done
    }

    set_2048_MaxSectorsKB () {
      for x in `/usr/bin/ls -al /dev/mapper/SC*|/usr/bin/awk -F' ' '{gsub(/\.\.\//,"");print $11}'`; do /usr/bin/echo -n "$x:Existing max_sectors_kb:";/usr/bin/cat /sys/block/$x/queue/max_sectors_kb; /usr/bin/echo 2048 > /sys/block/$x/queue/max_sectors_kb; /usr/bin/echo -n "$x:New max_sectors_kb:";/usr/bin/cat /sys/block/$x/queue/max_sectors_kb; done
    }

    When SC Series arrays present LUNs to the host, this value is automatically set based on the SC array's page pool size. For example, if the SC Series array has a 512 KB page pool, the host sets the "max_sectors_kb" value to 512; likewise, for a 2 MB page pool the host assigns a "max_sectors_kb" value of 2048.

    queue_depth

    Queue depth is one of those configuration values that must be tested and tuned to achieve optimal performance within your SAN environment. The best way to get there is to evaluate your SAN environment, test it, and then adjust the values to meet your performance requirements. The below .bashrc functions can help with both discovering and changing this queue_depth value (available for your .bashrc consideration):

    sc_queue_depths () {
      for x in `/usr/bin/lsscsi|/usr/bin/egrep -i compelnt|/usr/bin/awk -F' ' '{gsub(/\/dev\//,"");print $7}'`; do /usr/bin/echo -n "$x:Existing Queue Depth -> ";/usr/bin/cat /sys/block/$x/device/queue_depth; done
    }

    set_64_queue_depth () {
      for x in `/usr/bin/lsscsi|/usr/bin/egrep -i compelnt|/usr/bin/awk -F' ' '{gsub(/\/dev\//,"");print $7}'`; do /usr/bin/echo -n "$x:Existing Queue Depth:";/usr/bin/cat /sys/block/$x/device/queue_depth; /usr/bin/echo 64 > /sys/block/$x/device/queue_depth; /usr/bin/echo -n "$x:New Queue Depth:";/usr/bin/cat /sys/block/$x/device/queue_depth; done
    }

    set_32_queue_depth () {
      for x in `/usr/bin/lsscsi|/usr/bin/egrep -i compelnt|/usr/bin/awk -F' ' '{gsub(/\/dev\//,"");print $7}'`; do /usr/bin/echo -n "$x:Existing Queue Depth:";/usr/bin/cat /sys/block/$x/device/queue_depth; /usr/bin/echo 32 > /sys/block/$x/device/queue_depth; /usr/bin/echo -n "$x:New Queue Depth:";/usr/bin/cat /sys/block/$x/device/queue_depth; done
    }

    Additional information

    I know, this post almost qualified for a tl;dr (Too Long, Didn't Read) opt-out, yet we're at the end. Please give yourself +10 Stamina points.

    With this post highlighting the PowerPath multipathing management utility while also touching on some of the unique configuration scenarios that come up when working with native DM-MPIO, hopefully you are now thinking about your own multipathing policies and how they could be tuned up or replaced altogether.

    Dell EMC PowerPath

    SUSE Storage Administration Guide

    Red Hat Enterprise Linux 6 – DM Multipath

    Red Hat Enterprise Linux 7 – DM Multipath

  • General HPC

    HPC Applications Performance on V100

    Authors: Frank Han, Rengan Xu, Nishanth Dandapanthula.

    HPC Innovation Lab. August 2017

     

    Overview

    This is one of two articles in our Tesla V100 blog series. In this blog, we present the initial benchmark results of NVIDIA® Tesla® Volta-based V100™ GPUs on 4 different HPC benchmarks, as well as a comparative analysis against the previous generation Tesla P100 GPUs. We are releasing another V100 series blog that discusses V100 and deep learning applications. If you haven't read it yet, it is highly recommended to take a look here.

    PowerEdge C4130 with V100 GPU support

    The NVIDIA® Tesla® V100 accelerator is one of the most advanced accelerators available in the market right now and was launched within one year of the P100 release. In fact, Dell EMC is the first in the industry to integrate Tesla V100 and bring it to market. As was the case with the P100, V100 supports two form factors: V100-PCIe and the mezzanine version V100-SXM2. The Dell EMC PowerEdge C4130 server supports both types of V100 and P100 GPU cards. Table 1 below notes the major enhancements in V100 over P100:

    Table 1: The comparison between V100 and P100

    ---------------------------------------------------------------------------------------------
                                              PCIe                            SXM2
                                    P100     V100    Improvement    P100     V100    Improvement
    ---------------------------------------------------------------------------------------------
    Architecture                    Pascal   Volta                  Pascal   Volta
    CUDA Cores                      3584     5120                   3584     5120
    GPU Max Clock rate (MHz)        1329     1380                   1481     1530
    Memory Clock rate (MHz)         715      877     23%            715      877     23%
    Tensor Cores                    N/A      640                    N/A      640
    Tensor Cores/SM                 N/A      8                      N/A      8
    Memory Bandwidth (GB/s)         732      900     23%            732      900     23%
    Interconnect Bandwidth,
      Bi-Directional (GB/s)         32       32                     160      300
    Deep Learning (TFlops)          18.6     112     6x             21.2     125     6x
    Single Precision (TFlops)       9.3      14      1.5x           10.6     15.7    1.5x
    Double Precision (TFlops)       4.7      7       1.5x           5.3      7.8     1.5x
    TDP (Watt)                      250      250                    300      300
    ---------------------------------------------------------------------------------------------

     

    V100 not only significantly improves performance and scalability as will be shown below, but also comes with new features. Below are some highlighted features important for HPC Applications:

    • Second-Generation NVIDIA NVLink™

      All four V100-SXM2 GPUs in the C4130 are connected by NVLink™ and each GPU has six links. The bi-directional bandwidth of each link is 50 GB/s, so the bi-directional bandwidth between different GPUs is 300 GB/s. This is useful for applications requiring a lot of peer-to-peer data transfers between GPUs.

    • New Streaming Multiprocessor (SM)

      The single precision and double precision capability of the new SM is 50% more than that of the previous P100 for both the PCIe and SXM2 form factors. The TDP (Thermal Design Power) of both cards is the same, which means V100 is ~1.5 times more energy efficient than the previous P100.

    • HBM2 Memory: Faster, Higher Efficiency

      The 900 GB/s peak memory bandwidth delivered by V100 is 23% higher than that of P100. Also, the DRAM utilization has been improved from 76% to 95%, which allows for a 1.5x improvement in delivered memory bandwidth.

      More in-depth details of all new features of V100 GPU card can be found at this Nvidia website.

       

    Hardware and software specification update

     

    All the performance results in this blog were measured on a PowerEdge C4130 server using Configuration G (4x V100-PCIe) and Configuration K (4x V100-SXM2). Both of these configurations have been used previously in P100 testing. Also, except for the GPUs, the hardware components remain identical to those used in the P100 tests: dual Intel Xeon E5-2690 v4 processors, 256 GB of memory (16 x 16 GB, 2400 MHz), and an NFS file system mounted via IPoIB on InfiniBand EDR. Complete specification details are included in our previous blog. Moreover, if you are interested in other C4130 configurations besides G and K, you can find them in our K80 blog.

    There are some changes on the software front. In order to unleash the power of the V100, it was necessary to use the latest version of all software components. Table 2 lists the versions used for this set of performance tests. To keep the comparison fair, the P100 tests were rerun using the new software stack to normalize for the upgraded software.

    Table 2: The changes in software versions

    Software          Current Version                  Previous version in P100 blog
    --------------------------------------------------------------------------------
    OS                RHEL 7.3                         RHEL 7.2
    GPU Driver        384.59                           361.77 / 375.20
    CUDA Toolkit      9.0.103RC                        8.0.44
    OpenMPI           1.10.7 & 2.1.2                   1.10.1 & 2.0.1
    HPL               Compiled with sm7.0              Compiled with sm6.0
    HPCG              Compiled with sm7.0              -
    AMBER             16, AmberTools17 update 20       16, AmberTools16 update 3
    LAMMPS            patch_17Aug2017                  30Sep16

     

    p2pBandwidthLatencyTest

     

    p2pBandwidthLatencyTest is a micro-benchmark included in the CUDA SDK. It tests card-to-card bandwidth and latency with and without GPUDirect™ Peer-to-Peer enabled. Since the full output matrix is fairly long, only the unidirectional P2P result is listed below as an example to demonstrate how to verify the NVLink speed on both V100 and P100.
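    For readers who want to reproduce this check, the benchmark ships with the CUDA samples and can be built and run as shown below (the path assumes a default CUDA 9 toolkit installation; adjust it for your system).

    # Build and run the peer-to-peer bandwidth/latency micro-benchmark from the CUDA samples
    cd /usr/local/cuda-9.0/samples/1_Utilities/p2pBandwidthLatencyTest
    make
    ./p2pBandwidthLatencyTest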

    In theory, V100 has 6 x 25 GB/s uni-directional links, giving 150 GB/s of throughput. The previous P100-SXM2 only has 4 x 20 GB/s links, delivering 80 GB/s. The results of p2pBandwidthLatencyTest on both cards are in Table 3. "D\D" represents "device-to-device", that is, the bandwidth available between two devices (GPUs). The achievable bandwidth of GPU0 was calculated by aggregating the second, third and fourth values in the first line, which represent the throughput from GPU0 to GPU1, GPU2 and GPU3 respectively.

    Table 3: Unidirectional peer-to-peer bandwidth

    Unidirectional P2P=Enabled Bandwidth Matrix (GB/s). Four GPU cards in the server.

    P100
    D\D        0         1         2         3
     0      231.53     18.36     18.31     36.39
     1       18.31    296.74     36.54     18.33
     2       18.35     36.08    351.51     18.36
     3       36.59     18.42     18.42    354.79

    V100
    D\D        0         1         2         3
     0      727.38     47.88     47.90     47.93
     1       47.92    725.61     47.88     47.89
     2       47.91     47.91    726.41     47.95
     3       47.96     47.89     47.90    725.02

    It is clearly seen that V100-SXM2 on C4130 configuration K is significantly faster than P100-SXM2 on:

    1. Achievable throughput. V100-SXM2 has 47.88 + 47.9 + 47.93 = 143.71 GB/s of aggregated achievable throughput, which is 95.8% of the theoretical value of 150 GB/s and significantly higher than the 73.06 GB/s (91.3% of theoretical) achieved on P100-SXM2. The bandwidth for bidirectional traffic is twice that of unidirectional traffic and is also very close to the theoretical 300 GB/s throughput.

    2. Real-world applications. Symmetric access is key for real-world applications. On each P100-SXM2, there are 4 links; three of them connect to each of the other three GPUs, and the remaining fourth link connects to only one of those three GPUs. So there are two links between GPU0 and GPU3, but only 1 link between GPU0 and GPU1 and between GPU0 and GPU2. This is not symmetrical. The p2pBandwidthLatencyTest numbers above show this imbalance, as the value between GPU0 and GPU3 reaches 36.39 GB/s, which is double the bandwidth between GPU0 and GPU1 or GPU0 and GPU2. In most real-world applications, it is common for the developer to treat all cards equally and not take such architectural differences into account. Therefore it is likely that the faster pair of GPUs will need to wait for the slowest transfers, which means that 18.31 GB/s is the effective speed between all pairs of GPUs.

      On the other hand, V100 has a symmetrical design with 6 links, as seen in Figure 1. GPU0 to GPU1, GPU2, or GPU3 all have 2 links per pair. So 47.88 GB/s is the achievable bandwidth for each pair, which is 2.6 times faster than on the P100.


    Figure 1: V100 and P100 Topologies on C4130 configuration K

     

    High Performance Linpack (HPL)

    Figure 2: HPL Multi-GPU results with V100 and P100 on C4130 configuration G and K

    Figure 2 shows the HPL performance on the C4130 platform with 1, 2 and 4 V100-PCIe and V100-SXM2 installed. P100’s performance number is also listed for comparison. It can be observed:

    1. Both P100 and V100 scale well; performance increases as more GPUs are added.

    2. V100 is ~30% faster than P100 on both PCIe (Config G) and SXM2 (Config K).

    3. A single C4130 server with 4x V100 reaches over 20TFlops on PCIe (Config G).

      HPL is a system-level benchmark and its performance is limited by other components like the CPU, memory and PCIe bandwidth. Configuration G is a balanced design with 2 PCIe links between the CPUs and the GPUs, which is why it outperforms configuration K with 4x GPUs in the HPL benchmark. We do see some other applications perform better in Configuration K, since SXM2 (Config K) supports NVLink, a higher core clock speed and peer-to-peer data transfer; these are described below.

       

    HPCG

    Figure 3: HPCG Performance results with 4x V100 and P100 on C4130 configuration G and K

    HPCG, the High Performance Conjugate Gradients benchmark, is another well-known metric for HPC system ranking. Unlike HPL, its performance is strongly influenced by memory bandwidth. Thanks to the faster and more efficient HBM2 memory of V100, the observed performance improvement is 44% over P100 on both Configuration G and K.

    AMBER

    Figure 4: AMBER Multi-GPU results with V100 and P100 on C4130 configuration G and K

    Figure 4 illustrates AMBER's results with the Satellite Tobacco Mosaic Virus (STMV) dataset. On the SXM2 system (Config K), AMBER scales weakly with 2 and 4 GPUs. Even though the scaling is not strong, V100 shows a noticeable improvement over P100, giving a ~78% increase in single-card runs, and 1x V100 is actually 23% faster than 4x P100. On the PCIe (Config G) side, 1 and 2 cards perform similarly to SXM2, but the 4-card results drop sharply. This is because PCIe (Config G) only supports peer-to-peer access between GPU0/1 and GPU2/3, and not among all four GPUs. Since AMBER has redesigned the way data is transferred among GPUs to address the PCIe bottleneck, it relies heavily on peer-to-peer access for performance with multiple GPU cards. Hence a fast, direct interconnect like NVLink between all GPUs in SXM2 (Config K) is vital for AMBER multi-GPU performance.

    Figure 5: AMBER Multi-GPU Aggregate results with V100 and P100 on C4130 configuration G and K

    To compensate for a single job's weak scaling on multiple GPUs, there is another use case promoted by the AMBER developers, which is running multiple jobs on the same node concurrently, where each job uses only 1 or 2 GPUs. Figure 5 shows the results of 1-4 individual jobs on one C4130 with V100s, and the numbers indicate that those individual jobs have little impact on each other. This is because AMBER is designed to run pretty much entirely on the GPUs and has very low dependency on the CPU. The aggregate throughput of multiple individual jobs scales linearly in this case. Without any card-to-card communication, the 5% better performance on SXM2 comes from its higher clock speed.

     

    LAMMPS

     

    Figure 6: LAMMPS 4-GPU results with V100 and P100 on C4130 configuration G and K

    Figure 6 shows LAMMPS performance on both configurations G and K. The test dataset is the Lennard-Jones liquid dataset, which contains 512,000 atoms, and LAMMPS was compiled with the Kokkos package. V100 is 71% and 81% faster on Config G and Config K respectively. Comparing V100-SXM2 (Config K) and V100-PCIe (Config G), the former is 5% faster due to NVLink and its higher CUDA core frequency.

     

    Conclusion

     

    Figure 7: V100 Speedups on C4130 configuration G and K

    The C4130 server with NVIDIA® Tesla® V100™ GPUs demonstrates exceptional performance for HPC applications that require fast computation and high data throughput. Applications like HPL and HPCG benefit from the additional PCIe links between CPU and GPU offered by Dell PowerEdge C4130 configuration G. On the other hand, applications like AMBER and LAMMPS are boosted by C4130 configuration K, owing to its P2P access, the higher bandwidth of NVLink, and the higher CUDA core clock speed. Overall, a PowerEdge C4130 with Tesla V100 GPUs performs 1.24x to 1.8x faster than a C4130 with P100 for HPL, HPCG, AMBER and LAMMPS.

     

  • General HPC

    Deep Learning on V100

    Authors: Rengan Xu, Frank Han, Nishanth Dandapanthula.

    HPC Innovation Lab. September 2017

    Overview

    In this blog, we introduce the NVIDIA Tesla Volta-based V100 GPU and evaluate it with different deep learning frameworks. We compare the performance of the V100 and P100 GPUs, and we also evaluate two types of V100: V100-PCIe and V100-SXM2. The results indicate that in training V100 is ~40% faster than P100 with FP32 and >100% faster than P100 with FP16, and in inference V100 is 3.7x faster than P100. This is one blog in our Tesla V100 blog series. Another blog in this series covers general HPC application performance on V100, and you can read it here.

    Introduction to V100 GPU

    At the 2017 GPU Technology Conference (GTC), NVIDIA announced the Volta-based V100 GPU. Similar to P100, there are two types of V100: V100-PCIe and V100-SXM2. V100-PCIe GPUs are interconnected by PCIe buses and the bi-directional bandwidth is up to 32 GB/s. V100-SXM2 GPUs are interconnected by NVLink; each GPU has six links and the bi-directional bandwidth of each link is 50 GB/s, so the bi-directional bandwidth between different GPUs is up to 300 GB/s. A new type of core added in V100, the Tensor Core, was designed specifically for deep learning. These cores are essentially a collection of ALUs for performing 4x4 matrix operations: specifically a fused multiply-add (A*B+C), multiplying two 4x4 FP16 matrices together and then adding the result to a 4x4 FP16/FP32 matrix to generate a final 4x4 FP16/FP32 matrix. By fusing matrix multiplication and add into one unit, the GPU can achieve high FLOPS for this operation. A single Tensor Core performs the equivalent of 64 FMA operations per clock (for 128 FLOPS total), and with 8 such cores per Streaming Multiprocessor (SM), that is 1024 FLOPS per clock per SM. By comparison, even with pure FP16 operations, the standard CUDA cores in an SM only generate 256 FLOPS per clock. So in scenarios where these cores can be used, V100 is able to deliver 4x the performance versus P100. The detailed comparison between V100 and P100 is in Table 1.
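    As a quick sanity check (my arithmetic, not from the original post), combining these per-core rates with the core counts and the 1530 MHz V100-SXM2 boost clock listed in the V100/P100 comparison table of the companion HPC blog reproduces the quoted peak figures:

    \[ 640 \ \text{Tensor Cores} \times 128 \ \tfrac{\text{FLOPS}}{\text{clock}} \times 1.53 \ \text{GHz} \approx 125 \ \text{TFLOPS (tensor / deep learning)} \]
    \[ 5120 \ \text{CUDA cores} \times 2 \ \tfrac{\text{FLOPS}}{\text{clock}} \times 1.53 \ \text{GHz} \approx 15.7 \ \text{TFLOPS (FP32)} \]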


    Table 1: The comparison between V100 and P100

    Testing Methodology

    As in our previous deep learning blog, we use the three most popular deep learning frameworks: NVIDIA's fork of Caffe (NV-Caffe), MXNet and TensorFlow. Both NV-Caffe and MXNet have been optimized for V100. TensorFlow does not yet have an official release that supports V100, but we applied some patches obtained from TensorFlow developers so that it is also optimized for V100 in these tests. For the dataset, we again use the ILSVRC 2012 dataset, whose training set contains 1,281,167 training images and 50,000 validation images. For the test neural network, we chose Resnet50, as it is a computationally intensive network. To get the best performance, we used the CUDA 9-rc compiler and the cuDNN library in all three frameworks, since they are optimized for V100. The testing platform is Dell EMC's PowerEdge C4130 server. The C4130 server has multiple configurations; we evaluated the PCIe GPUs in configuration G and the SXM2 GPUs in configuration K. The difference between configuration G and configuration K is shown in Figure 1. There are mainly two differences: configuration G has two x16 PCIe links connecting the dual CPUs to the four GPUs, while configuration K has only one x16 PCIe bus connecting one CPU to the four GPUs; the other difference is that the GPUs are connected by PCIe buses in configuration G but by NVLink in configuration K. The other hardware and software details are shown in Table 2.

      

    Figure 1: Comparison between configuration G and configuration K

    Table 2: The hardware configuration and software details


    In this experiment, we trained the various deep learning frameworks with one pass over the whole dataset, since we were comparing only the training speed, not the training accuracy. Other important input parameters for the different deep learning frameworks are listed in Table 3. For NV-Caffe and MXNet, we doubled the batch size for the FP16 tests, since FP16 consumes half the memory for floating point values that FP32 does. As TensorFlow does not support FP16 yet, we did not evaluate its FP16 performance in this blog. Because of implementation differences, NV-Caffe consumes more memory than MXNet and TensorFlow for the same neural network, so its batch size in FP32 mode is only half of that used in MXNet and TensorFlow. In NV-Caffe, if FP16 is used, then the data type of several parameters needs to be changed. These parameters are as follows: solver_data_type controls the data type for the master weights; default_forward_type and default_backward_type control the data type for training values; and default_forward_math and default_backward_math control the data type for the matrix-multiply accumulator. In this blog we used FP16 for training values, FP32 for the matrix-multiply accumulator and FP32 for the master weights. We will explore other combinations in future blogs. In MXNet, we tried different values for the parameter "--data-nthreads", which controls the number of threads used for data decoding.
     

    Table 3: Input parameters used in different deep learning frameworks

     

    Performance Evaluation

    Figure 2, Figure 3, and Figure 4 show the performance of V100 versus P100 with NV-Caffe, MXNet and TensorFlow, respectively, and Table 4 shows the performance improvement of V100 compared to P100. From these results, we can draw the following conclusions:

    • In both PCIe and SXM2 versions, V100 is >40% faster than P100 in FP32 for both NV-Caffe and MXNet. This matches the theoretical speedup, because FP32 is single-precision floating point and V100 is 1.5x faster than P100 in single precision. With TensorFlow, V100 is more than 30% faster than P100; its improvement is lower than in the other two frameworks, and we think that is because of different algorithm implementations in these frameworks.

    • In both PCIe and SXM2 versions, V100 is >2x faster than P100 in FP16. Based on the specifications, V100 tensor performance is ~6x the P100 FP16 performance. The reason the actual speedup does not match the theoretical speedup is that not all data is stored in FP16, and so not all operations are tensor operations (the fused matrix multiply-and-add operation).

    • In V100, FP16 performance is close to 2x that of FP32. This is because FP16 only requires half the storage of FP32, and therefore we could double the batch size in FP16 to improve the computation speed.

    • In MXNet, we set "--data-nthreads" to 16 instead of the default value of 4. The default value is often sufficient to decode more than 1K images per second, but that is still not fast enough for the V100 GPU. In our testing, we found the default value of 4 is enough for P100, but for V100 we needed to set it to at least 12 to achieve good performance, with a value of 16 being ideal (see the example invocation after this list).
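    For context, the snippet below shows where this flag sits in a typical MXNet ImageNet training command. The script path, dataset location, and other flag values are illustrative assumptions based on MXNet's bundled image-classification example, not the exact command used for these tests.

    # Hypothetical Resnet50 training invocation; --data-nthreads is the value of
    # interest here, the remaining arguments are placeholders
    python example/image-classification/train_imagenet.py \
        --network resnet --num-layers 50 \
        --gpus 0,1,2,3 \
        --batch-size 256 \
        --data-train /path/to/ilsvrc2012_train.rec \
        --data-nthreads 16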

    Figure 2: Performance of V100 vs P100 with NV-Caffe

    Figure 3: Performance of V100 vs P100 with MXNet

     

     

    Figure 4: Performance of V100 vs P100 with TensorFlow

    Table 4: Improvement of V100 compared to P100

     

    Since V100 supports both deep learning training and inference, we also tested the inference performance of V100 using the latest TensorRT 3.0.0. The testing was done in FP16 mode on both V100-SXM2 and P100-PCIe, and the result is shown in Figure 5. We used a batch size of 39 for V100 and 10 for P100. Different batch sizes were chosen to make their inference latencies close to each other (~7 ms in the figure). The result shows that, at comparable latency, the inference throughput of V100 is 3.7x that of P100.

     Figure 5: Resnet50 inference performance on V100 vs P100

    Conclusions and Future Work

    After evaluating the performance of V100 with three popular deep learning frameworks, we conclude that in training V100 is more than 40% faster than P100 in FP32 and more than 100% faster in FP16, and in inference V100 is 3.7x faster than P100. This demonstrates the performance benefits when the V100 Tensor Cores are used. In future work, we will evaluate different data type combinations in FP16 and study the accuracy impact of FP16 in deep learning training. We will also evaluate TensorFlow with FP16 once support is added to the software. Finally, we plan to scale the training to multiple nodes with these frameworks.