Dell Community
High Performance Computing Blogs

High Performance Computing
A discussion venue for all things high performance computing (HPC), supercomputing, and the technologies that enable scientific research and discovery.
  • HPC Community Honors Dell EMC with Highly Coveted HPCwire 2016 Editor’s Choice Awards

    Ed Turkel, customer Karen Green (CRC) with Tom Tabor receiving HPCwire's Editors' Choice Award for Best Use of High Performance Data Analytics
    Ed Turkel (HPC Sr. Strategist) with Tom Tabor (Tabor Communications CEO) receiving HPCwire's Editors' Choice Award for Top Five Vendors to Watch

    Just before the kick-off of the opening gala for the SC16 international supercomputing conference, HPCwire unveiled the winners of the 2016 HPCwire Editors’ Choice Awards. Each year, this awards program recognizes the best and the brightest developments that have happened in high performance computing over the past 12 months. Selected by a panel of HPCwire editors and thought leaders in HPC, these awards are highly coveted as prestigious recognition of achievements by the HPC community.

    Traditionally revealed and presented each year to kick off the Supercomputing Conference (SC16), which showcases high performance computing, networking, storage, and data analysis, the awards are an annual feature of the publication and spotlight outstanding breakthroughs and achievements in HPC.

    Tom Tabor, CEO of Tabor Communications, the publisher of HPCwire, announced the list of winners in Salt Lake City, UT.

    “From thought leaders to end users, the HPCwire readership reaches and engages every corner of the high performance computing community,” said Tabor. “Receiving their recognition signifies community support across the entire HPC space, as well as the breadth of industries it serves.”

    Dell EMC was honored to be presented with two 2016 HPCwire Editors’ Choice Awards:

    Best Use of High Performance Data Analytics:
    The Best Use of High Performance Data Analytics award was presented to UNC-Chapel Hill Institute of Marine Sciences (IMS) and Coastal Resilience Center of Excellence (CRC), Renaissance Computing Institute (RENCI), and Dell EMC. UNC-Chapel Hill IMS and CRC work with the Dell EMC-powered RENCI Hatteras Supercomputer to predict dangerous coastal storm surges, including Hurricane Matthew, a long-lived, powerful and deadly tropical cyclone which became the first Category 5 Atlantic hurricane since 2007.

    Top 5 Vendors to Watch:
    Dell EMC was recognized by the 2016 HPCwire Editors’ Choice Awards panel, along with Fujitsu, HPE, IBM and NVIDIA, as one of the Top 5 Vendors to Watch in high performance computing. As the only true end-to-end solutions provider in the HPC market, Dell EMC is committed to serving customer needs. And with the combination of Dell, EMC and VMware, we are a leader in the technology of today, with the world’s greatest franchises in servers, storage, virtualization, cloud software and PCs. Looking forward, we will occupy a very strong position in the most strategic areas of technology of tomorrow: digital transformation, software defined data center, hybrid cloud, converged infrastructure, mobile and security.

    To learn more about HPC at Dell EMC, join the Dell EMC HPC Community at www.Dellhpc.org, or visit us online at www.Dell.com/hpc and www.HPCatDell.com.

  • With Blazing Speed: Some of the Fastest Systems on the Planet are Powered by Dell EMC

    MIT Lincoln Laboratory Supercomputing Center created a 1 petaflop system in less than a month to further research in autonomous systems, device physics and machine learning.

    Twice each year, the TOP500 list ranks the 500 most powerful general-purpose computer systems known. In the present list, released at the SC16 conference in Salt Lake City, UT, computers in common use for high-end applications are ranked by their performance on the LINPACK Benchmark. Sixteen of these world-class systems are powered by Dell EMC. Collectively, these customers are accomplishing amazing results, continually innovating and breaking new ground to solve the biggest, most important challenges of today and tomorrow while also making major contributions to the advancement of HPC.

    Here are just a few examples:

    • Texas Advanced Computing Center/University of Texas
      Stampede - PowerEdge C8220, Xeon E5-2680 8C 2.700GHz, Infiniband FDR, Intel Xeon Phi SE10P and
      Stampede-KNL - Intel S7200AP Cluster, Intel Xeon Phi 7250 68C 1.4GHz, Intel Omni-Path
      The Texas Advanced Computing Center (TACC), a Dell EMC HPC Innovation Center, designs and operates some of the world's most powerful computing resources. The Center's mission is to enable discoveries that advance science and society through the application of advanced computing technologies. TACC supports the University of Texas System and National Science Foundation researchers with the newest version of their Stampede high-performance computing cluster.

    • MIT Lincoln Laboratory
      TX-Green - S7200AP Cluster, Intel Xeon Phi 7210 64C 1.3GHz, Intel Omni-Path
      MIT Lincoln Laboratory Supercomputing Center (LLSC) supports research and development aimed at solutions to problems that are critical to the Nation. The research spans diverse fields such as space observations, robotic vehicles, communications, cyber security, machine learning, sensor processing, electronic devices, bioinformatics, and air traffic control. LLSC addresses the supercomputing needs of thousands of MIT scientists and engineers by providing interactive, on-demand supercomputing and big data capabilities with zero carbon footprint.

    • Centre for High Performance Computing, South Africa         
      Lengau - PowerEdge C6320, Xeon E5-2690v3 12C 2.6GHz, Infiniband FDR
      The Centre for High Performance Computing (CHPC), a Dell EMC HPC Innovation Center, is part of South Africa’s Council for Scientific and Industrial Research and hosts the fastest computer on the African continent. CHPC’s new Dell EMC-powered Lengau system will enable new opportunities and avenues in cutting-edge research, such as building the world's largest radio telescope, and will provide the computational capacity to build the private sector and non-academic user base of CHPC to help spur national economic growth.

    • University of Florida
      HiPerGator 2.0 - PowerEdge C6320, Xeon E5-2698v3 16C 2.3GHz, Infiniband
      The University of Florida’s HiPerGator 2.0 system performs complex calculations and data analyses for researchers and scholars at UF and their collaborators worldwide. It is helping researchers find life-saving drugs and get them from the computer to the clinic more quickly, make more accurate, decades-long weather forecasts and improve body armor for troops.

    • Ohio Supercomputer Center
      Owens - Dell PowerEdge C6320/R730, Xeon E5-2680v4 14C 2.4GHz, Infiniband EDR
      The Ohio Supercomputer Center empowers a wide array of groundbreaking innovation and economic development activities in the fields of bioscience, advanced materials, data exploitation and other areas of state focus by providing a powerful high performance computing, research and educational cyberinfrastructure for a diverse statewide/regional constituency.

    • Dell EMC HPC Innovation Lab
      Zenith - Dell PowerEdge C6320 & Dell PowerEdge R630, Xeon E5-2697v4 18C 2.3GHz, Intel Omni-Path
      The Dell EMC HPC Innovation Lab is dedicated to HPC research, development and innovation. Its engineers are meeting real-life, workload-specific challenges through collaboration with the global HPC research community and are publishing whitepapers on their research findings. They are utilizing the lab’s world-class infrastructure to characterize performance behavior and to test and validate upcoming technologies. The Dell EMC HPC Innovation Lab is also an OpenHPC R&D contributor.
  • Advancing HPC: A Closer Look at Cool New Tech

    Dell EMC has partnered with Scientific Computing, publisher of HPC Source, and NVIDIA to produce an exclusive high performance computing supplement that takes a look at some of today’s cool new HPC technologies, as well as some of the work being done to extend HPC capabilities and opportunities.

    This special publication, “New Technologies in HPC,” highlights topics such as innovative technologies in HPC and the impact they are having on the industry, HPC trends to watch, and advancing science with AI. It also looks at how organizations are extending supercomputing with cloud, machine learning technologies for the modern data center, and getting started with deep learning.

    This digital supplement can be viewed on-screen or downloaded as a PDF.

    Taking our dive into new HPC technologies a bit deeper — we also brought together technology experts Paul Teich, Principal Analyst at TIRIAS Research, and Will Ramey, Senior Product Manager for GPU Computing at NVIDIA, for a live, interactive discussion with contributing editor Tim Studt: “Accelerate Your Big Data Strategy with Deep Learning.”

    Paul and Will share their unique perspectives on where artificial intelligence is leading the next wave of industry transformation, helping companies go from data deluge to data-hungry. They provide insights on how organizations can accelerate their big data strategies with deep learning, the fastest growing field in AI, and discuss how data-driven algorithms powered by GPU accelerators can deliver faster insights, reveal dynamic correlations, and yield actionable knowledge about their business.

    For those who couldn't make the live broadcast, it is available for on-demand viewing.

    To learn more about HPC at Dell EMC, join the Dell EMC HPC Community at www.Dellhpc.org, or visit us online at www.Dell.com/hpc and www.HPCatDell.com.

  • Dell China Receives AI Innovation Award

    Innovation Award of Artificial Intelligence Technology and Practice presented to Dell China by CCF (China Computer Federation)

    Dell China has been honored with an “Innovation Award of Artificial Intelligence in Technology & Practice” in recognition of Dell’s collaboration with the Institute of Automation, Chinese Academy of Sciences (CASIA) in establishing the Artificial Intelligence and Advanced Computing Joint-Lab. The advanced computing platform was jointly unveiled by Dell China and CASIA in November 2015, and the AI award was presented by the Technical Committee of High Performance Computing (TCHPC), China Computer Federation (CCF), at the HPC China 2016 conference in Xi’an City, Shaanxi Province, China, on October 27, 2016. About a half dozen additional awards were presented at HPC China, an annual national conference on high performance computing organized by TCHPC. However, Dell China was the only vendor to receive an award in the emerging field of artificial intelligence in HPC.

    The Artificial Intelligence and Advanced Computing Joint-Lab’s focus is on research and applications of new computing architectures in brain information processing and artificial intelligence, including cognitive function simulation, deep learning, brain computer simulation, and related new computing systems. The lab also supports innovation and development of brain science and intelligence technology research, promoting Chinese innovation and breakthroughs at the forefront of science, and working to produce and industrialize these core technologies in accordance with market and industry development needs.

    CASIA, a leading AI research organization in China, has huge requirements for computing and storage, and the new advanced computing platform — designed and set up by engineers and professors from Dell and CASIA — is just the tip of the iceberg with respect to CASIA’s research requirements. It features leading Dell HPC system components designed by the Dell USA team, including servers, storage, networking and software, as well as leading global HPC partner products, including Intel CPUs, NVIDIA GPUs, Mellanox InfiniBand networking and Bright Computing software. The Dell China Services team implemented installation and deployment of the system, which was completed in February 2016.

    The November 3, 2015, unveiling ceremony for the Artificial Intelligence and Advanced Computing Joint-Lab was held in Beijing. Marius Haas, Chief Commercial Officer and President, Enterprise Solutions of Dell; Dr. Chenhong Huang, President of Dell Greater China; and Xu Bo, Director of CASIA attended the ceremony and addressed the audience.

    “As a provider of end-to-end solutions and services, Dell has been focusing on and promoting the development of frontier science and technologies, and applying the latest technologies to its solutions and services to help customers achieve business transformation and meet their ever-changing demands,” Haas said at the unveiling. “We’re glad to cooperate with CASIA in artificial intelligence, which once again shows Dell’s commitment to China’s market and will drive innovation in China’s scientific research.”

    “Dell is well-positioned to provide innovative end-to-end solutions. Under the new 4.0 strategy of ‘In China, For China’, we will strengthen the cooperation with Chinese research institutes and advance the development of frontier technologies,” Huang explained. “Dell’s cooperation with CASIA represents a combination of computing and scientific research resources, which demonstrates a major trend in artificial intelligence and industrial development.”

    China is a role model for emerging market development and practice sharing for other emerging countries. Partnering with CASIA and other strategic partners is Dell’s way of embracing the “Internet+” national strategy, promoting Chinese innovation and breakthroughs at the forefront of science.

    “China’s strategy in innovation-driven development raises the bar for scientific research institutes. The fast development of information technologies in recent years also brings unprecedented challenges to CASIA,” added Bo. “CASIA always has intelligence technologies in mind as their main focus of strategic development. The cooperation with Dell China on the lab will further the computing advantages of the Institute of Automation, strengthen the integration between scientific research and industries, and advance artificial intelligence innovation.”

    Dell China is looking forward to continued cooperation with CASIA in driving artificial intelligence across many more fields, such as meteorology, biology and medical research, transportation, and manufacturing.

  • Cryo-EM in HPC with KNL

    By Garima Kochhar and Kihoon Yoon. Dell EMC HPC Innovation Lab. October 2016

    This blog presents performance results for the 2D alignment and 2D classification phases of the Cryo-electron microscopy (Cryo-EM) data processing workflow using the new Intel Knights Landing architecture, and compares these results to the performance of the Intel Xeon E5-2600 v4 family. A quick description of Cryo-EM and the different phases in the process of reconstructing 3D molecular structures with electron microscopy is provided below, followed by the specific tests conducted in this study and the performance results.

    Cryo-EM allows molecular samples to be studied in near-native states and down to nearly atomic resolutions. Studying the 3D structure of these biological specimens can lead to new insights into their functioning and interactions, especially with proteins and nucleic acids, and allows structural biologists to examine how alterations in their structures affect their functions. This information can be used in systems biology research to understand the cell signaling network, which is part of a complex communication system. This communication system controls fundamental cell activities and actions to maintain normal cell homeostasis. Errors in the cellular signaling process can lead to diseases such as cancer, autoimmune disorders, and diabetes. Studying the functioning of the proteins responsible for an illness enables a biologist to develop specific drugs that can interact with the protein effectively, thus improving the efficacy of treatment.

    The workflow from the time a molecular sample is created to the creation of a 3D model of its molecular structure involves multiple steps. These steps are briefly (and simplistically!) described below.

    1. Samples of the molecule (protein, enzyme, etc.) are purified and concentrated in a solution.
    2. This sample is placed on an electron microscope grid and plunge-frozen. This forms a very thin layer of vitreous ice that surrounds and immobilizes the sample in its near-native state.
    3. The frozen sample is now placed in the microscope for imaging.
    4. The output of the microscope consists of many large image files across multiple fields of view (many terabytes of data).
    5. Due to the low energy beams used in Cryo-EM (to avoid damaging the structures being studied), the images produced by the microscope have a poor signal-to-noise ratio. To improve the results, the microscope takes multiple images for each field of view. Motion-correction techniques are then applied so that the multiple images of the same molecule can be added together into an image with less noise.
    6. The next step is a manual process of picking “good-looking” molecule images from a few fields of view.
    7. The frozen sample consists of many molecules that are in many different positions. The resultant Cryo-EM images therefore also consist of images, or shadows, of the particle from different angles. So, the next step is a 2D alignment phase to uniformly orient the images by image rotation and translation.
    8. Next, a 2D classification phase searches through these oriented images and sorts them into “classes”, grouping images that have the same view.
    9. After alignment and classification, there should be multiple collections of images, where each collection contains images showing a view of the molecule from the same angle and showing the same shape of the molecule (a “class”).  The images in a class are now combined into a composite image that provides a higher quality representation of that shape.
    10. Finally a 3D reconstruction of the molecule is built from all the composite 2D images.
    11. This 3D model can then be handed back to the structural biologist for further analysis, visualization, etc.

    As is now clear, the Cryo-EM processing workflow must handle a large amount of data, requires rich compute algorithms and considerable compute power for the 2D and 3D phases, and must move data efficiently across the multiple phases in the workflow. Our goal is to design a complete HPC system that can support the Cryo-EM workflow from start to finish and is optimized for performance, energy efficiency and data efficiency.

     

    Performance Tests and Configuration

    Focusing for now on the 2D phases of the workflow, this blog presents results for steps #7 and #8 listed above - the 2D alignment and 2D classification phases. Two software packages in this domain, ROME and RELION, were benchmarked on the Knights Landing (KNL, code name for the Intel Xeon Phi 7200 family) and Broadwell (BDW, code name for the Intel Xeon E5-2600 v4 family) processors.

    The tests were run on systems with the following configuration.

    Broadwell-based systems
      Server: 12 x Dell PowerEdge C6320
      Processor: Intel Xeon E5-2697 v4, 18 cores per socket, 2.3 GHz
      Memory: 128 GB at 2400 MT/s
      Interconnect: Intel Omni-Path fabric

    KNL-based systems
      Server: 12 x Dell PowerEdge C6320p
      Processor: Intel Xeon Phi 7230, 64 cores, 1.3 GHz
      Memory: 96 GB at 2400 MT/s
      Interconnect: Intel Omni-Path fabric

    Software
      Operating System: Red Hat Enterprise Linux 7.2
      Compilers: Intel 2017, 17.0.0.098 Build 20160721
      MPI: Intel MPI 5.1.3
      ROME: 1.0a
      RELION: 1.4

    Benchmark Datasets

      RING11_ALL: Set1. Inflammasome data: 16,306 images of NLRC4/NAIP2 inflammasome with a size of 250² pixels
      DATA6: Set4. RP-a: 57,001 images of proteasome regulatory particles (RP) with a size of 160² pixels
      DATA8: Set2. RP-b: 35,407 images of proteasome regulatory particles (RP) with a size of 160² pixels

     

    ROME

    ROME performs the 2D alignment (step #7 above) and the 2D classification (step #8 above) in two separate phases called the MAP phase and the SML phase respectively. For our tests we used “-k” for MAP equal to 50 (i.e. 50 initial classes) and “-k” for SML equal to 1000 (i.e. 1000 final 2D classes).

    The first set of graphs below, Figure 1 and Figure 2, shows the performance of the SML phase on KNL. The compute portion of the SML phase scales linearly as more KNL systems are added to the test bed, from 1 to 12 servers, as shown in Figure 1. The total time to run, shown in Figure 2, scales slightly below linear because it includes an I/O component as well as the compute component. The test bed used in this study did not have a parallel file system and used just the local disks on the KNL servers. Future work for this project includes evaluating the impact of adding a Lustre parallel file system to this test bed and its effect on the total time for SML.

    Figure 1 - ROME SML scaling on KNL, compute time

    Figure 2 - ROME SML scaling on KNL, total time

    The next set of graphs compares the ROME SML performance on KNL and Broadwell. Figure 3, Figure 4 and Figure 5 plot the compute time for SML on 1 to 12 servers. The black circle on each graph shows the improvement in KNL runtime when compared to BDW. For all three datasets that were benchmarked, KNL is about 3x faster than BDW. Note that we’re comparing a single-socket KNL server to a dual-socket Broadwell server, so this is a server-to-server comparison (not socket-to-socket). KNL is 3x faster than BDW across different numbers of servers, showing that ROME SML scales well over Omni-Path on both KNL and BDW, and that the absolute compute time on KNL is 3x faster irrespective of the number of servers in the test.

    Considering total time to run on KNL versus BDW, we measured KNL to be 2.4x to 3.4x faster than BDW at all node counts. Specifically, DATA6 is ~2.4x faster on KNL, DATA8 is 3x faster on KNL and RING11_ALL is 3.4x faster on KNL when considering total time to run. As mentioned before, the total time includes an I/O component, and one of the next steps in this study is to evaluate the performance improvement from adding a parallel file system to the test bed.

    Figure 3 - DATA8 ROME SML on KNL and BDW

    Figure 4 - DATA6 ROME SML on KNL and BDW.

      

    Figure 5 - RING11_ALL ROME SML on KNL and BDW

     

    RELION

    RELION accomplishes the 2D alignment and classification steps mentioned above in one phase. Figure 6 shows our preliminary results for RELION on KNL across 12 servers and on two of the test datasets. The “--K” parameter for RELION was set to 300, i.e., 300 classes for 2D classification. There are several things still to be tried here – the impact of a parallel file system on RELION (as discussed for ROME earlier) and dataset sensitivity to the parallel file system. Additionally, we plan to benchmark RELION on Broadwell, across different node counts and with different input parameters.

    Figure 6 - RELION 2D alignment and classification on KNL

    Next Steps

    The next steps in this project include adding a parallel file system to measure the impact on the workflow, tuning the test parameters for ROME MAP, SML and RELION, and testing on more datasets. We also plan to measure the power consumption of the cluster when running Cryo-EM workloads to analyze performance per watt and performance per dollar metrics for KNL.

  • Deep Learning Performance with P100 GPUs

    Authors: Rengan Xu and Nishanth Dandapanthu. Dell EMC HPC Innovation Lab. October 2016

    Introduction to Deep Learning and P100 GPU

    Deep Learning (DL), an area of Machine Learning, has achieved significant progress in recent years. Its application areas include pattern recognition, image classification, Natural Language Processing (NLP), autonomous driving and so on. Deep learning attempts to learn multiple levels of features from large input data sets with multi-layer neural networks and then make predictive decisions for new data. This implies two phases in deep learning: first, the neural network is trained with a large number of input samples; second, the trained neural network is used to test/inference/predict on new data. Due to the large number of parameters (the weight matrices connecting neurons in different layers, the bias in each layer, etc.) and the training set size, the training phase requires tremendous amounts of computation power.

    To approach this problem, we utilize accelerators such as GPUs, FPGAs and DSPs. This blog focuses on the GPU accelerator. A GPU is a massively parallel architecture that employs thousands of small but efficient cores to accelerate computationally intensive tasks. In particular, the NVIDIA® Tesla® P100™ GPU uses the new Pascal™ architecture to deliver very high performance for HPC and hyperscale workloads. In PCIe-based servers, the P100 delivers around 4.7 and 9.3 TeraFLOPS of double and single precision performance, respectively. In NVLink™-optimized servers, the P100 delivers around 5.3 and 10.6 TeraFLOPS of double and single precision performance, respectively. This blog focuses on the P100 for PCIe-based servers. The P100 is also equipped with High Bandwidth Memory 2 (HBM2), which offers higher bandwidth than traditional GDDR5 memory. The high compute capability and high memory bandwidth make the GPU an ideal candidate for accelerating deep learning applications.
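    As a quick sanity check on the PCIe figures above, the peak rates can be estimated from the GPU's core count and clock. The core count, boost clock and FP64 ratio used in the sketch below are assumed public specifications, not values taken from this blog.

      # Rough peak-FLOPS estimate for the P100-PCIe numbers quoted above (Python sketch)
      cuda_cores = 3584              # assumed FP32 core count of the P100
      boost_clock_ghz = 1.30         # assumed approximate P100-PCIe boost clock
      flops_per_core_per_cycle = 2   # one fused multiply-add per core per cycle

      sp_tflops = cuda_cores * flops_per_core_per_cycle * boost_clock_ghz / 1000.0
      dp_tflops = sp_tflops / 2      # FP64 runs at half the FP32 rate on Pascal

      print("single precision ~%.1f TFLOPS, double precision ~%.1f TFLOPS" % (sp_tflops, dp_tflops))
      # prints roughly 9.3 and 4.7 TFLOPS, in line with the PCIe figures above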

    Deep Learning Frameworks and Dataset

    In this blog, we present the performance and scalability of P100 GPUs with different deep learning frameworks on a cluster. Three deep learning frameworks were chosen: NVIDIA’s fork of Caffe (NV-Caffe), MXNet and TensorFlow. Caffe is a well-known and widely used deep learning framework developed by the Berkeley Vision and Learning Center (BVLC) and by community contributors. It focuses mainly on image classification, and it supports multiple GPUs within a node but not across nodes. MXNet, jointly developed by collaborators from multiple universities and companies, is a lightweight, portable and flexible deep learning framework designed for both efficiency and flexibility. This framework scales to multiple GPUs within a node and across nodes. TensorFlow, developed by Google’s Brain team, is a library for numerical computation using data flow graphs. TensorFlow also supports multiple GPUs and can scale to multiple nodes.

    All three of the deep learning frameworks we chose are able to perform the image classification task. With this in mind, we chose the well-known ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012 dataset. This dataset contains 1,281,167 training images and 50,000 validation images. All images are grouped into 1,000 categories or classes. Another reason we chose the ILSVRC 2012 dataset is that its workload is large enough to sustain long training runs and it is a benchmark dataset used by many deep learning researchers.

    Testing Methodology

    This blog quantifies the performance of deep learning frameworks using NVIDIA’s P100-PCIe GPU and Dell’s PowerEdge C4130 server architecture. Figure 1 shows the testing cluster. The cluster includes one head node which is Dell’s PowerEdge R630 and four compute nodes which are Dell’s PowerEdge C4130. All nodes are connected by an InfiniBand network and they share disk storage through NFS. Each compute node has 2 CPUs and 4 P100-PCIe GPUs. All of the four compute nodes have the same configurations. Table 1 shows the detailed information about the hardware configuration and software used in every compute node.

    Figure 1: Testing Cluster for Deep Learning

     

    Table 1: Hardware Configuration and Software Details

    Platform: PowerEdge C4130 (configuration G)
    Processor: 2 x Intel Xeon CPU E5-2690 v4 @ 2.6GHz (Broadwell)
    Memory: 256GB DDR4 @ 2400MHz
    Disk: 9TB HDD
    GPU: P100-PCIe with 16GB GPU memory
    Node Interconnect: Mellanox ConnectX-4 VPI (EDR 100Gb/s InfiniBand)
    InfiniBand Switch: Mellanox SB7890

    Software and Firmware
      Operating System: RHEL 7.2 x86_64
      Linux Kernel Version: 3.10.0-327.el7
      BIOS: Version 2.1.6
      CUDA version and driver: CUDA 8.0 (361.77)
      NCCL version: Version 1.2.3
      cuDNN library: Version 5.1.3
      Intel Compiler: Version 2017.0.098
      Python: 2.7.5

    Deep Learning Frameworks
      NV-Caffe: Version 0.15.13
      Intel-Caffe: Version 1.0.0-rc3
      MXNet: Version 0.7.0
      TensorFlow: Version 0.11.0-rc2

    We measured the metrics of both images/sec and training time.

    The images/sec metric measures training speed, while the training time is the wall-clock time for training, I/O operations and other overhead. The images/sec number was obtained from the “samples/sec” values in the MXNet and TensorFlow output log files. NV-Caffe reports “M s/N iter”, meaning that M seconds were taken to process N iterations, or N batches, so the images/sec metric was calculated as “batch_size*N/M”. The batch size is the number of training samples in one forward/backward pass through all layers of a neural network. The images/sec number was averaged across all iterations to account for deviations.

    The training time was obtained from the “Time cost” entries in the MXNet output logs. For NV-Caffe and TensorFlow, the output log files contained wall-clock timestamps during the whole training, so the time difference from the start to the end of the training was taken as the training time.
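    The sketch below illustrates the images/sec calculation described above for NV-Caffe style output; the example log lines and the regular expression are illustrative only, not copied from an actual NV-Caffe log.

      import re

      batch_size = 128
      log_lines = [
          "Iteration 100, 12.8 s / 20 iter",   # hypothetical example lines, not real NV-Caffe output
          "Iteration 120, 13.1 s / 20 iter",
      ]

      rates = []
      for line in log_lines:
          m = re.search(r"([\d.]+) s / (\d+) iter", line)
          if m:
              seconds, iters = float(m.group(1)), int(m.group(2))
              rates.append(batch_size * iters / seconds)   # images/sec = batch_size*N/M

      # average across all sampled intervals, as described above
      print("average training speed: %.1f images/sec" % (sum(rates) / len(rates)))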

    Since NV-Caffe did not support distributed training, it was not executed on multiple nodes. The MXNet framework was able to run on multiple nodes. However, the caveat was that it could only use the Ethernet interface (10 Gb/s) on the compute nodes by default, and therefore the performance was not as high as expected. To solve this issue, we have manually changed its source code so that the high-speed InfiniBand interface (EDR 100 Gb/s) was used. The training with TensorFlow on multiple nodes was able to run but with poor performance and the reason is still under investigation.

    Table 2 shows the input parameters used in the different deep learning frameworks. In all deep learning frameworks, neural network training requires many epochs or iterations. Whether the term epoch or iteration is used is determined by each framework. An epoch is a complete pass through all samples in a given dataset, while one iteration processes only one batch of samples. Therefore, the relationship between iterations and epochs is: epochs = (iterations*batch_size)/training_samples. Each framework needs only one of epochs or iterations; the other can be derived from this formula. Since our goal was to measure the performance and scalability of Dell’s server and not to train an end-to-end image classification model, the training was a subset of the full model training, but one large enough to reflect performance. Therefore we chose a smaller number of epochs or iterations so that the runs could finish in a reasonable time. Although only partial training was performed, the training speed (images/sec) remained relatively constant over this period.
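    A small worked example of the epochs/iterations relationship above, using the ILSVRC 2012 training set size from this blog; the batch size and iteration count are illustrative values, not settings from Table 2.

      training_samples = 1281167   # ILSVRC 2012 training images
      batch_size = 256             # illustrative value
      iterations = 4000            # illustrative value

      # epochs = (iterations * batch_size) / training_samples
      epochs = iterations * batch_size / float(training_samples)
      print("%d iterations at batch size %d ~= %.2f epochs" % (iterations, batch_size, epochs))

      # and the other direction: iterations needed for one full epoch
      iters_per_epoch = training_samples / float(batch_size)
      print("one epoch ~= %.0f iterations" % iters_per_epoch)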

    The batch size is one of the hyperparameters the user needs to tune when training a neural network model with mini-batch Stochastic Gradient Descent (SGD). The batch sizes in the table are commonly used sizes. Whether these batch sizes are optimal for model accuracy is left for future work. For all neural networks in all frameworks, we increased the batch size proportionally with the number of GPUs. At the same time, the number of iterations was adjusted so that the total number of samples was fixed no matter how many GPUs were used. Since the number of epochs has nothing to do with batch size, its value was not changed when a different number of GPUs was used. For MXNet GoogleNet, there was a runtime error if different batch sizes were used for different numbers of GPUs, so we used a constant batch size. The learning rate is another hyperparameter that needs to be tuned; in this experiment, the default value in each framework was used.

     

    Table 2: Input parameters used in different deep learning frameworks

     

    NV-Caffe GoogleNet (image shape 224)
      CPU: batch size 128, 4000 iterations
      1 P100: batch size 128, 4000 iterations
      2 P100: batch size 256, 2000 iterations
      4 P100: batch size 512, 1000 iterations

    TensorFlow Inception-V3 (image shape 299)
      1 P100: batch size 64, 4000 iterations
      2 P100: batch size 128, 2000 iterations
      4 P100: batch size 256, 1000 iterations

    MXNet GoogleNet (image shape 256)
      1-16 P100: batch size 144, 1 epoch

    MXNet Inception-BN (image shape 224)
      1 P100: batch size 64, 1 epoch
      2 P100: batch size 128, 1 epoch
      4 P100: batch size 256, 1 epoch
      8 P100: batch size 256, 1 epoch
      12 P100: batch size 256, 1 epoch
      16 P100: batch size 256, 1 epoch

     

    Performance Evaluation

    Figure 2 shows the training speed (images/sec) and training time (wall-clock time) of the GoogleNet neural network in NV-Caffe using P100 GPUs. It can be seen that the training speed increased as the number of P100 GPUs increased, and as a result the training time decreased. The CPU result in Figure 2 was obtained from Intel-Caffe on two Intel Xeon CPU E5-2690 v4 (14-core Broadwell) processors within one node. We chose Intel-Caffe for the pure CPU test because it has better CPU optimizations than NV-Caffe. From Figure 2, we can see that 1 P100 GPU is ~5.3x and 4 P100 GPUs are ~19.7x faster than a Broadwell-based CPU server. Since NV-Caffe does not yet support distributed training, we only ran it on up to 4 P100 GPUs in one node.

     

    Figure 2: The training speed and time of GoogleNet in NV-Caffe using P100 GPUs

    Figure 3 and Figure 4 show the training speed and time of the GoogleNet and Inception-BN neural networks in MXNet using P100 GPUs. In both figures, 8 P100 used 2 nodes, 12 P100 used 3 nodes and 16 P100 used 4 nodes. As we can see from both figures, MXNet had great scalability in training speed and training time when more P100 GPUs were used. As mentioned in the Testing Methodology section, if the Ethernet interfaces in all nodes were used, the training speed and training time would be impacted significantly, since the I/O operations were not fast enough to feed the GPU computations. Based on our observations, the training speed when using Ethernet was only half of that when using the InfiniBand interfaces. In both MXNet and TensorFlow, the CPU implementation was extremely slow and we believe it was not CPU optimized, so we did not compare their P100 performance with CPU performance.

    Figure 3: The training speed and time of GoogleNet in MXNet using P100 GPUs

    Figure 4: The training speed and time of Inception-BN in MXNet using P100 GPUs

    Figure 5 shows the training speed and training time of the Inception-V3 neural network in TensorFlow using P100 GPUs. Similar to NV-Caffe and MXNet, TensorFlow also showed good scalability in training speed when more P100 GPUs were used. The training with TensorFlow on multiple nodes was able to run but with poor performance, so that result is not shown here and the reason is still under investigation.

    Figure 5: The training speed and time of Inception-V3 in TensorFlow using P100 GPUs

    Figure 6 shows the speedup when using multiple P100 GPUs in different deep learning frameworks and neural networks. The purpose of this figure is to demonstrate the speedup in each framework when more GPUs are used; it is not meant as a comparison among the different frameworks, since their input parameters were different. When using 4 P100 GPUs for NV-Caffe GoogleNet and TensorFlow Inception-V3, we observed speedups of up to 3.8x and 3.0x, respectively. For MXNet, using 16 P100 GPUs achieved a 13.5x speedup in GoogleNet and a 14.7x speedup in Inception-BN, which are close to the ideal speedup of 16x. In particular, we observed linear speedup when using 8 P100 and 12 P100 GPUs for the Inception-BN neural network.

     

    Figure 6: Speedup of multiple P100 GPUs in different DL frameworks and networks
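    For reference, speedups like those in Figure 6 are typically derived as the ratio of multi-GPU throughput to the single-GPU baseline, as in the small sketch below; the throughput numbers used here are placeholders, not measured values from this study.

      baseline_images_per_sec = 250.0                   # 1 GPU throughput (placeholder)
      multi_gpu_images_per_sec = {2: 480.0, 4: 950.0}   # multi-GPU throughput (placeholders)

      for gpus in sorted(multi_gpu_images_per_sec):
          speedup = multi_gpu_images_per_sec[gpus] / baseline_images_per_sec
          print("%d GPUs: %.1fx speedup (ideal would be %dx)" % (gpus, speedup, gpus))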

    In practice, a real user application can take days or weeks to train a model. Although our benchmarking cases run in a few minutes or a few hours, they are just small snapshots from much longer runs that would be needed to really train a network. For example, the training of a real application might take 90 epochs of 1.2M images. A Dell C4130 with P100 GPUs can turn in results in less than a day, while a CPU takes more than a week – that is the real benefit to end users. The effect for a real use case is saving weeks of time per run, not seconds.
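    A back-of-the-envelope version of the "90 epochs of 1.2M images" example above; the sustained throughput figures below are assumptions for illustration only, not measurements from this study.

      epochs = 90
      images_per_epoch = 1200000
      gpu_images_per_sec = 1500.0    # assumed sustained multi-GPU rate
      cpu_images_per_sec = 100.0     # assumed sustained CPU-only rate

      total_images = epochs * images_per_epoch
      print("GPU: ~%.1f days" % (total_images / gpu_images_per_sec / 86400))   # under a day
      print("CPU: ~%.0f days" % (total_images / cpu_images_per_sec / 86400))   # well over a week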

    Conclusions and Future Work

    Overall, we observed great speedup and scalability in neural network training when multiple P100 GPUs were used in Dell’s PowerEdge C4130 server and multiple server nodes were used. The training speed increased and the training time decreased as the number of P100 GPUs increased. From the results shown, it is clear that Dell’s PowerEdge C4130 cluster is a powerful tool for significantly speeding up neural network training.

    In future work, we will try the P100 for NVLink-optimized servers with the same deep learning frameworks, neural networks and dataset to see how much performance improvement can be achieved. This blog experimented with the PowerEdge C4130 configuration G, in which only GPU 1 and GPU 2, and GPU 3 and GPU 4, have peer-to-peer access. In the future, we will try C4130 configuration B, in which all four GPUs connected to one socket have peer-to-peer access, and check the performance impact of that configuration. We will also investigate the impact of hyperparameters (e.g. batch size and learning rate) on both training performance and model accuracy. The reason for the slow training performance with TensorFlow on multiple nodes will also be examined.

     

  • Application Performance Study on Intel Broadwell EX processors

    Authors: Yogendra Sharma and Ashish Singh, September 2016 (HPC Innovation Lab)

    This blog describes a performance analysis of a PowerEdge R930 server powered by four Intel Xeon E7-8890 v4 @2.2GHz processors (code named Broadwell-EX). The primary objective of this blog is to compare the performance of HPL, STREAM and a few scientific applications, ANSYS Fluent and WRF, against the previous generation Intel processor, the Intel Xeon E7-8890 v3 @2.5GHz, codenamed Haswell-EX. Below are the configurations used for this study.

    Haswell-EX system  |  Broadwell-EX system

    Platform: PowerEdge R930  |  PowerEdge R930
    Processor: 4 x Intel Xeon E7-8890 v3 @2.5GHz (18 cores) 45MB L3 cache 165W  |  4 x Intel Xeon E7-8890 v4 @2.2GHz (24 cores) 60MB L3 cache 165W
    Memory: 1024 GB = 64 x 16GB DDR4 @2400MHz RDIMMs  |  1024 GB = 32 x 32GB DDR4 @2400MHz RDIMMs

    BIOS Settings
      BIOS: Version 1.0.9  |  Version 2.0.1
      Processor Settings > Logical Processors: Disabled  |  Disabled
      Processor Settings > QPI Speed: Maximum Data Rate  |  Maximum Data Rate
      Processor Settings > System Profile: Performance  |  Performance

    Software and Firmware
      Operating System: RHEL 6.6 x86_64  |  RHEL 7.2 x86_64
      Intel Compiler: Version 15.0.2  |  Version 16.0.3
      Intel MKL: Version 11.2  |  Version 11.3
      Intel MPI: Version 5.0  |  Version 5.1.3

    Benchmark and Applications
      LINPACK: V2.1 from MKL 11.2  |  V2.1 from MKL 11.3
      STREAM: v5.10, Array Size 1800000000, Iterations 100  |  v5.10, Array Size 1800000000, Iterations 100
      WRF: v3.5.1, Input Data Conus12KM, Netcdf-4.3.1.1  |  v3.8, Input Data Conus12KM, Netcdf-4.4.0
      ANSYS Fluent: v15, Input Data: truck_poly_14m, sedan_4m, aircraft_2m  |  v16, Input Data: truck_poly_14m, sedan_4m, aircraft_2m

           Table 1: Details of server and HPC applications used with Haswell-EX and Broadwell-EX processors

    ____________________________________________________________________________________________________________________________________

    In this section of the blog, we compare benchmark numbers for the two generations of processors on the same server platform, the PowerEdge R930, as well as the performance of Broadwell-EX processors with different CPU profiles and memory snoop modes, namely Home Snoop (HS) and Cluster-on-Die (COD).

    The High Performance Linpack (HPL) benchmark is a measure of a system's floating point computing power. It measures how fast a computer solves a dense n by n system of linear equations Ax = b, which is a common task in engineering. The HPL benchmark was run on both PowerEdge R930 servers (with Broadwell-EX and Haswell-EX) with a block size of NB=192 and a problem size of N=340992.
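    For reference, the problem size and block size above correspond directly to the Ns and NBs lines of a standard HPL.dat input file. The excerpt below is a sketch showing only those key lines; the process grid (P x Q) is illustrative and is not taken from this study.

      1            # of problems sizes (N)
      340992       Ns
      1            # of NBs
      192          NBs
      1            # of process grids (P x Q)
      8            Ps    (illustrative)
      12           Qs    (illustrative; 8 x 12 = 96 ranks would match the 96 cores of the Broadwell-EX system)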

      

    Figure 1: Comparing HPL Performance across BIOS profiles
    Figure 2: Comparing HPL Performance over two generations of processors

    Figure 1 depicts the performance of the PowerEdge R930 server with Broadwell-EX processors for the different BIOS options. HS (Home snoop mode) performs better than COD (Cluster-on-die) under both system profiles, Performance and DAPC. Figure 2 compares the performance of four-socket Intel Xeon E7-8890 v3 and Intel Xeon E7-8890 v4 servers. HPL showed a 47% performance improvement with four Intel Xeon E7-8890 v4 processors on the R930 server in comparison to four Intel Xeon E7-8890 v3 processors. This was due to a ~33% increase in the number of cores and a further ~13% increase from the new, improved versions of the Intel compiler and Intel MKL.

    The STREAM benchmark is a synthetic benchmark program that measures sustainable memory bandwidth and the corresponding computation rate for simple vector kernels.

     

      

    Figure 3: Comparing STREAM Performance across BIOS profiles
    Figure 4: Comparing STREAM Performance over two generations of processors

     

    As per Figure 3, the memory bandwidth of the PowerEdge R930 server with Intel Broadwell-EX processors is the same across the different BIOS profiles. Figure 4 shows the memory bandwidth of both Intel Xeon Broadwell-EX and Intel Xeon Haswell-EX processors in the PowerEdge R930 server. Haswell-EX and Broadwell-EX support DDR3 and DDR4 memory respectively, while the platform in this configuration supports a memory frequency of 1600MT/s for both generations of processors. Because the PowerEdge R930 platform runs memory at the same frequency for both generations of processors, both deliver the same memory bandwidth of 260GB/s on the PowerEdge R930 server.

    The Weather Research and Forecasting (WRF) Model is a mesoscale numerical weather prediction system designed for both atmospheric research and operational forecasting needs. It features two dynamical cores, a data assimilation system, and a software architecture facilitating parallel computation and system extensibility. The model serves a wide range of meteorological applications across scales from tens of meters to thousands of kilometers. WRF can generate atmospheric simulations using real data or idealized conditions. We used the CONUS12km and CONUS2.5km benchmark datasets for this study. CONUS12km is a single-domain, small-size benchmark (48-hour, 12km resolution case over the Continental U.S. (CONUS) domain from October 24, 2001) with a 72-second time step. CONUS2.5km is a single-domain, large-size benchmark (the latter 3 hours of a 9-hour, 2.5km resolution case over the Continental U.S. (CONUS) domain from June 4, 2005) with a 15-second time step. WRF decomposes the domain into tasks or patches. Each patch can be further decomposed into tiles that are processed separately, but by default there is only one tile per run. If the single tile is too large to fit into the cache of the CPU and/or core, it slows down computation due to WRF’s memory bandwidth sensitivity. In order to reduce the size of the tile, it is possible to increase the number of tiles by defining “numtile = x” in the input file or setting the environment variable “WRF_NUM_TILES = x”. For both CONUS 12km and CONUS 2.5km the number of tiles was chosen based on best performance, which was 56 (see the example below).
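    As a concrete example of the tiling setting just described, the tile count can be exported before launching WRF; the launch line below is illustrative and the rank count is an assumption, not the exact command used in this study.

      export WRF_NUM_TILES=56    # 56 tiles gave the best performance in this study
      mpirun -np 96 ./wrf.exe    # illustrative launch command; rank count is an assumption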

      Figure 5: Comparing WRF Performance across BIOS profiles

    Figure 5 demonstrates the comparison of the WRF datasets across the different BIOS profiles. With the CONUS 12KM data, all the BIOS profiles perform equally well because of the smaller data size, while for CONUS 2.5KM, Perf.COD (Performance system profile with Cluster-on-Die snoop mode) gives the best performance. As per Figure 5, the Cluster-on-Die snoop mode performs 2% better than Home snoop mode, while the Performance system profile gives 1% better performance than DAPC.

     Figure 6: Comparing WRF Performance over two generations of processors

    Figure 6 shows the performance comparison between Intel Xeon Haswell-EX and Intel Xeon Broadwell-EX processors with PowerEdge R930 server. As shown in the graph, Broadwell-EX performs 24% better than Haswell-EX for CONUS 12KM data set and 6% better for CONUS 2.5KM.

    ANSYS Fluent is a computational fluid dynamics (CFD) software tool. Fluent includes well-validated physical modeling capabilities to deliver fast and accurate results across the widest range of CFD and multiphysics applications.

    Figure 7: Comparing Fluent Performance across BIOS profiles

    We used three different datasets for Fluent with ‘Solver Rating’ (higher is better) as the performance metric. Figure 7 above shows that all three datasets performed 4% better with the Perf.COD (Performance system profile with Cluster-On-Die snoop mode) BIOS profile than with the others, while the DAPC.HS (DAPC system profile with Home snoop mode) BIOS profile shows the lowest performance. For all three datasets, the COD snoop mode performs 2% to 3% better than Home snoop mode, and the Performance system profile performs 2% to 4% better than DAPC. The behaviour of Fluent is consistent across all three datasets.

    Figure 8: Comparing Fluent Performance over two generations of processors

     

    As shown above in Figure 8, for all the test cases on the PowerEdge R930 with Broadwell-EX, Fluent showed a 13% to 27% performance improvement in comparison to the PowerEdge R930 with Haswell-EX.

    ________________________________________________________________________________________________

     

    Conclusion:

    Overall, the Broadwell-EX processor makes the PowerEdge R930 server more powerful and more efficient. With Broadwell-EX, HPL performance increases in line with the increase in the number of cores in comparison to Haswell-EX. There is also an increase in performance for real applications, depending on the nature of their computation. So, it can be a good upgrade choice for those who are running compute-hungry applications.

     


  • Introducing 100GBps with Intel® Omni-Path Fabric in HPC

     By Munira Hussain, Deepthi Cherlopalle

    This blog introduces the Omni-Path Fabric from Intel® as a cluster network fabric used for inter-node communication for application, management and storage traffic in High Performance Computing (HPC). It is part of the new technology behind the Intel® Scalable System Framework, based on IP from the QLogic TrueScale and Cray Aries interconnects. The goal of Omni-Path is to eventually be able to meet the demands of exascale data centers in performance and scalability.

    Dell provides a complete validated and supported solution offering, which includes the Networking H-series Fabric switches and Host Fabric Interface (HFI) adapters. The Omni-Path HFI is a PCIe Gen3 x16 adapter capable of 100 Gbps unidirectional bandwidth per port. The card has 4 lanes, each supporting 25Gbps.

    HPC Program Overview with Omni-Path:

    The current solution program is based on Red Hat Enterprise Linux 7.2 (kernel version 3.10.0-327.el7.x86_64). The Intel Fabric Suite (IFS) drivers are integrated into the current software solution stack, Bright Cluster Manager 7.2, which helps to deploy, provision, install and configure an Omni-Path cluster seamlessly.

    The following Dell servers support Intel® Omni-Path Host Fabric Interface (HFI) cards:

    PowerEdge R430, PowerEdge R630, PowerEdge R730, PowerEdge R730XD, PowerEdge R930, PowerEdge C4130, PowerEdge C6320

    The management and monitoring of the fabric is done using the Fabric Manager (FM) GUI available from Intel®. The FM GUI provides in-depth analysis and a graphical overview of fabric health, including a detailed breakdown of port status and mapping, as well as investigative reports on errors.

     Figure 1: Fabric Manager GUI

    The IFS tools include various debugging and management tools such as opareports, opainfo, opaconfig, opacaptureall, opafabricinfoall, opapingall, opafastfabric, etc. These help to capture a snapshot of the fabric and to troubleshoot. The host-based subnet manager service, known as opafm, is also available with IFS and is able to scale to thousands of nodes.

    The fabric relies on the PSM2 libraries to provide optimal performance. The IFS package provides precompiled versions of the open source OpenMPI and MVAPICH2 MPI libraries, along with some of the micro-benchmarks, such as OSU and IMB, used for bandwidth and latency measurements of the cluster.

    Basic Performance Benchmarking Results:

    The performance numbers below were taken on the Dell PowerEdge R630 server. The server configuration consisted of dual-socket Intel® Xeon® CPU E5-2697 v4 @ 2.3GHz processors with 18 cores per socket and 8 x 16 GB of memory @ 2400MHz. The BIOS version was 2.0.2, and the system profile was set to Performance.

    OSU Micro-benchmarks were used to determine latency. These latency tests were done in a ping-pong fashion. HPC applications need low latency and high throughput. As shown in Figure 2, the back-to-back latency is 0.77µs and the switch latency is 0.9µs, which is on par with industry standards.

    Figure 2: OSU Latency - E5-2697 v4

    Figure 3 below shows the OSU uni-directional and bi-directional bandwidth results with the OpenMPI-1.10-hfi version. At 4MB, uni-directional bandwidth is around 12.3 GB/s and bi-directional bandwidth is around 24.3GB/s, which is on par with the theoretical peak values.

    Figure 3: OSU Bandwidth – E5-2697 v4
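    A minimal sketch of how such point-to-point OSU runs are typically launched with the MPI builds shipped in IFS; the hostnames and working directory are placeholders, not taken from this study.

      # ping-pong latency between two ranks on two different nodes
      mpirun -np 2 --host node001,node002 ./osu_latency
      # uni-directional and bi-directional bandwidth between the same pair of nodes
      mpirun -np 2 --host node001,node002 ./osu_bw
      mpirun -np 2 --host node001,node002 ./osu_bibw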

    Conclusion:

     

     The Omni-Path Fabric provides a value add to the HPC solution. It is a technology that integrates well as the high-speed fabric needed for designing flexible reference architectures with a growing need for computation. Users can benefit from the open source fabric tools like FMGUI, Chassis Viewer and FastFabric that are packaged with IFS. The solution is automated and validated with Bright Cluster Manager 7.2 on Dell servers.

    More details on how Omni-Path performs in other domains are available here. This document describes the key features of Intel® Omni-Path Fabric technology and provides a reference to performance data collected on various commercial and open source applications.


  • New vs. Old: Comparing Broadwell Performance for CAE Applications Across Generations

    Authors: Mayura Deshmukh, Ashish K Singh, Neha Kashyap

    With the refresh of Dell’s 13th generation servers with the recently released Broadwell (BDW) processors, some obvious questions come to mind such as how the new processors compare with the older generation processors. This blog, fourth in the series of “Broadwell Performance for HPC,” focuses on answering this question. It compares the performance of various CAE applications for five Broadwell Intel Xeon E5-2600 v4 series processor models with previous generation Intel processors.

    Last week’s blog talked about the impact of BIOS options for each of the CAE applications. Here we focus on how much better the performance of the Broadwell processors is compared to the previous generation Haswell (HSW) and Ivy Bridge (IVB) processors for these CAE applications. Table 1 shows the applications that we are comparing and Table 2 describes the server configuration used for the study. For LS-DYNA, the benchmarks run on IVB and HSW (sse binary), and for ANSYS Fluent, the benchmarks run on Westmere (WSM), Ivy Bridge (IVB), Sandy Bridge (SB) and HSW, used different software versions (the latest available at the time) than those mentioned in Table 1. The STAR-CCM+ and OpenFOAM versions used for the benchmarks run on HSW and BDW were the same.

    Table 1 - Applications and benchmarks

    LS-DYNA® 8.0.0
      Metric: Elapsed time
      MPI: Platform MPI 9.1.0
      Benchmark: car2car with endtime=0.02

    STAR-CCM+® 10.04.011
      Metric: Average Elapsed time
      MPI: Platform MPI 9.1.3
      Benchmarks: Civil_20m, EglinStoreSeparation, HlMach10, Kcs, LeMans_100M, Lemans_17m, Reactor9M, TurboCharger, Vtm

    ANSYS® Fluent® v16
      Metric: Solver rating
      MPI: Platform MPI 9.1.2.1
      Benchmark: truck_poly_14m

    OpenFOAM 2.4.0
      Metric: Clock time
      MPI: Open MPI 1.10.0
      Benchmark: Motorbike 11M

     

    Table 2 - Server configuration

    Server: PowerEdge R630
    Processors (one of the following per server):
      • 2 x E5-2650 v4 12c, 2.2/1.8 GHz, 105W, 30MB cache
      • 2 x E5-2690 v4 14c, 2.6/2.1 GHz, 135W, 35MB cache
      • 2 x E5-2697A v4 16c, 2.6/2.2 GHz, 145W, 40MB cache
      • 2 x E5-2698 v4 20c, 2.2/1.8 GHz, 135W, 50MB cache
      • 2 x E5-2699 v4 22c, 2.2/1.8 GHz, 145W, 55MB cache
    Memory: 256GB - 16 x 16GB 2400 MHz DDR4 RDIMMs
    Hard drive: 6 x 300GB SAS 6Gbps 10K rpm
    RAID controller: PERC H330 mini
    Operating System: Red Hat Enterprise Linux 7.2 (3.10.0-327.el7.x86_64)
    BIOS options:
      System profile - Performance
      Logical Processor - Disabled
      Power Supply Redundant Policy - Not Redundant
      Power Supply Hot Spare Policy - Disabled
      I/O Non-Posted Prefetch - Disabled
      Snoop Mode - Opportunistic Snoop Broadcast (OSB) for OpenFOAM and Cluster on Die (COD) for all the other applications
      Node interleaving - Disabled
    BIOS: 2.0.0
    iDRAC Firmware: 2.30.30.02

    Figure 1 compares the performance of the five BDW Intel Xeon E5-2600 v4 series processor models with HSW Intel Xeon E5-2600 v3 series processors and the IVB E5-2680 v2 for the LS-DYNA car2car benchmark (with end time set to 0.02).


    Figure 1: IVB vs. HSW vs BDW for LS-DYNA

    The performance of all the processors is compared to the E5-2680 v2, which is shown as the red baseline set at 1. The green bars show the performance of the HSW processors with the LS-DYNA single precision sse binary, the grey bar represents data for the HSW E5-2697 v3 with the LS-DYNA single precision avx2 binary, the blue bars show the data for the BDW processors with the LS-DYNA single precision sse binary, and the orange bars represent the BDW data with the LS-DYNA single precision avx2 binary. For BDW, the avx2 binaries perform 12-19% better than the sse binaries across all the processor models. The purple diamonds describe the performance per core compared to the E5-2680 v2. The percentages at the top of the BDW avx2 orange bars describe the percentage improvement of the BDW processors over the HSW E5-2697 v3 avx2 (grey bar in the graph). The 12-core BDW E5-2650 v4, which has fewer cores and a lower frequency, understandably performs 11% lower than the Haswell E5-2697 v3 processor. The 14-core E5-2690 v4, which has the same number of cores and similar avx2 frequencies, performs 7% better than the E5-2697 v3; this can be attributed to the increase in memory bandwidth for Broadwell, and BDW processors also measure better power efficiency than Haswell processors. The performance of the 16-core, 20-core and 22-core processors is 16% to 30% higher than the HSW E5-2697 v3 (avx2). Comparing the performance, performance per core and the higher memory bandwidth per core, the E5-2690 v4 14c and E5-2697A v4 16c look like attractive options for CAE/CFD codes, particularly when considering per-core licensing costs.

    CD-adapco’s STAR-CCM+ is another CFD application widely-used by industry for solving problems involving fluid flows, heat transfer, and other phenomena. STAR-CCM+ shows similar performance patterns to LS-DYNA.

    Figure 2: HSW vs BDW for STAR-CCM+

    Figure 2 compares the performance of the five BDW Intel Xeon E5-2600 v4 series processor models (shown as the five bars in the graph) with the HSW E5-2697 v3, shown as the red line set at one. The numbers at the top of the bars show the per-core performance relative to the E5-2697 v3. As seen from the bars, the relative performance of the 14-core, 16-core, 20-core and 22-core parts is higher by 8% to 40% across all the benchmarks. The lower core count, lower frequency 12-core E5-2650 v4 performs 11-20% lower than the E5-2697 v3. Similar to LS-DYNA, the per-core performance of the 14-core and 16-core parts is 2% to 11% better than the HSW E5-2697 v3, making them good options for STAR-CCM+ as well.

    ANSYS Fluent is a computational fluid dynamics application. The graph in Figure 3 shows the performance of truck_poly_14m for Sandy Bridge (SB), Ivy Bridge (IVB), HSW and BDW processors compared to the Westmere (WSM) processor, shown as the red line set at one.

    Figure 3: WSM vs. SB vs. IVB vs. HSW vs. BDW for ANSYS Fluent

    The Fluent benchmark exhibits a pattern similar to the LS-DYNA and STAR-CCM+ benchmarks. The purple diamonds in Figure 3 describe the performance per core compared to the 2.93GHz WSM processor. The percentages at the top of the BDW blue bars describe the percentage improvement of the BDW processors over the HSW E5-2697 v3 (green bar in the graph). The 12-core BDW E5-2650 v4, which has fewer cores and a lower frequency, performs 14% lower than the Haswell E5-2697 v3. With higher performance per core and higher memory bandwidth per core, the 14-core E5-2690 v4 and 16-core E5-2697A v4 are good options, particularly when considering per-core software licensing costs, and perform 11% and 21% better than the E5-2697 v3 respectively. The 20-core and 22-core BDW processors perform 32%-39% better than the HSW E5-2697 v3.

    OpenFOAM (Open source Field Operation And Manipulation) is free, open-source software for computational fluid dynamics (CFD).

    Figure 4: HSW vs. BDW for OpenFOAM Motorbike 11M benchmark

    As shown in Figure 4 for the OpenFOAM Motorbike 11M benchmark, all the Broadwell processors perform 12% to 21% better than the Haswell E5-2697 v3, shown as the red line set at one. Per-core performance for the 16-core, 14-core and 12-core parts is 4% to 30% better than the E5-2697 v3. The performance of the 20-core and 22-core BDW processors is the same for the Motorbike 11M benchmark; increasing the number of cores does not provide a significant performance boost for the 20- and 22-core parts, likely due to lower memory bandwidth per core, as explained in the first blog's STREAM results.

    Conclusion

    Along with offering more cores than HSW, BDW measures better power efficiency than HSW. Looking at the absolute performance, performance per core and the higher memory bandwidth per core, the 14-core E5-2690 v4 and 16-core E5-2697A v4 are attractive options for CAE/CFD codes, particularly if per-core licensing costs are involved. For applications like OpenFOAM (motorbike case), all the BDW processors performed better than the Haswell E5-2697 v3, but the increase in the number of cores does not provide a significant performance boost for the 20- and 22-core parts due to lower memory bandwidth per core.

     

  • Impact of Broadwell BIOS Options On CAE Applications

    Authors: Mayura Deshmukh, Ashish K Singh, Neha Kashyap

    Last week’s blog in the “Broadwell Performance for HPC” series described the BIOS options and compared performance across generations of processors for a molecular dynamics application (NAMD) and Weather Research and Forecasting (WRF). This blog, the third in the series, focuses on BIOS options for some HPC CAE applications on five different Broadwell Intel Xeon E5-2600 v4 series processor models. It aims to answer questions like: which snoop mode works best for my application and processor? Which BIOS System Profile would give the best performance?

    There have been a few changes in the BIOS options for Broadwell as compared with the previous generation (Haswell). One of the major additions in the Broadwell BIOS is the “Opportunistic Snoop Broadcast” snoop mode in the Memory settings. This blog discusses the performance of the applications for all four snoop modes: Opportunistic Snoop Broadcast (OSB), Early Snoop (ES), Home Snoop (HS) and Cluster on Die (COD). For more information on the new BIOS options and snoop modes, check blog one of this series.

    The Dell BIOS “System Profile” setting can be set to any of four pre-configured profiles: Performance Per Watt (DAPC), Performance Per Watt (OS), Performance (Perf.) and Dense Configuration, or to Custom. In the pre-configured profiles, the Turbo Boost, C States, C1E, CPU Power Management, Memory Frequency, Memory Patrol Scrub, Memory Refresh Rate and Uncore Frequency options are preset, whereas with Custom the user can choose values for these options. For more information on System Profiles check the link. DAPC and OS have been shown to perform similarly in past studies, and Dense Configuration performs lower for HPC workloads, so we will be focusing on the DAPC and Performance Profiles in this study. The DAPC (Dell Active Power Control) Profile relies on a BIOS-centric power control mechanism; energy efficient turbo, C States and C1E are enabled with the DAPC Profile. The Performance Profile disables power saving features such as C States, energy efficient turbo and C1E. Turbo Boost is enabled in both System Profiles.
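
    As a compact summary of the two profiles compared in this study (this simply restates the behavior described above; it is not an exhaustive list of every option a pre-configured profile sets):

        # Summary of the two System Profiles used in this study, expressed as data.
        # This only restates the text above, not the full set of preset options.
        system_profiles = {
            "Performance Per Watt (DAPC)": {
                "turbo_boost": True,
                "energy_efficient_turbo": True,
                "c_states": True,
                "c1e": True,
                "power_control": "BIOS-centric (Dell Active Power Control)",
            },
            "Performance": {
                "turbo_boost": True,
                "energy_efficient_turbo": False,  # power-saving features disabled
                "c_states": False,
                "c1e": False,
            },
        }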

    This blog discusses the performance of CAE applications with the DAPC and Performance profiles for each of the four snoop modes on five different Intel Xeon E5-2600 v4 series Broadwell processors. Table 1 shows the application and benchmark details, and Table 2 describes the server configuration used for the study.

    Table 1 - Applications and benchmarks

    Application: LS-DYNA®
    Version: 8.0.0
    Metric: Elapsed time
    MPI: Platform MPI 9.1.0
    Benchmark:
    • car2car with endtime=0.02

    Application: STAR-CCM+®
    Version: 10.04.011
    Metric: Average Elapsed time
    MPI: Platform MPI 9.1.3
    Benchmarks:
    • Civil_20m
    • EglinStoreSeparation
    • HlMach10
    • Kcs
    • Lemans_100m
    • Lemans_17m
    • Reactor9m
    • TurboCharger
    • Vtm

    Application: ANSYS® Fluent®
    Version: v16
    Metric: Solver rating
    MPI: Platform MPI 9.1.2.1
    Benchmarks:
    • truck_poly_14m
    • combustor_12m
    • combustor_71m
    • exhaust_system_33m
    • ice_2m

    Application: OpenFOAM
    Version: 2.4.0
    Metric: Clock time
    MPI: Open MPI 1.10.0
    Benchmarks:
    • Cavity 1M
    • Motorbike 11M

     

    Table 2 - Server configuration

    Server: PowerEdge R630
    Processor:
    • 2 x E5-2650 v4 12c, 2.2/1.8 GHz, 105W, 30MB cache
    • 2 x E5-2690 v4 14c, 2.6/2.1 GHz, 135W, 35MB cache
    • 2 x E5-2697A v4 16c, 2.6/2.2 GHz, 145W, 40MB cache
    • 2 x E5-2698 v4 20c, 2.2/1.8 GHz, 135W, 50MB cache
    • 2 x E5-2699 v4 22c, 2.2/1.8 GHz, 145W, 55MB cache
    Memory: 256GB - 16 x 16GB 2400 MHz DDR4 RDIMMs
    Hard drive: 6 x 300GB SAS 6Gbps 10K rpm
    RAID controller: PERC H330 mini
    Operating System: Red Hat Enterprise Linux 7.2 (3.10.0-327.el7.x86_64)
    BIOS options:
    • System Profile - Performance and Performance Per Watt (DAPC)
    • Logical Processor - Disabled
    • Power Supply Redundant Policy - Not Redundant
    • Power Supply Hot Spare Policy - Disabled
    • I/O Non-Posted Prefetch - Disabled
    • Snoop Mode - Opportunistic Snoop Broadcast (OSB), Early Snoop (ES), Home Snoop (HS), Cluster on Die (COD)
    • Node Interleaving - Disabled
    BIOS: 2.0.0
    iDRAC Firmware: 2.30.30.02

     

    LS-DYNA is a general-purpose finite element program from LSTC, capable of simulating complex real-world structural mechanics problems. We ran the car2car benchmark with endtime set to 0.02, using both the single precision avx2 and the single precision sse LS-DYNA binaries.


    Figure 1: Comparing snoop modes and BIOS Profiles for LS-DYNA

    The left graph in Figure 1 shows how much better or worse the different snoop modes perform compared to the default setting of snoop mode = OSB and BIOS profile = DAPC (which is set at 1, the red line on the graph). Just changing the snoop mode to COD increases performance by 1-3% with either BIOS profile across all the processor models. COD is closely followed by OSB, then by ES for the lower core counts and HS for the 16-, 20- and 22-core processors. With ES mode, the system starts paying the penalty of having fewer request tokens per core at higher core counts compared to the other snoop modes (e.g. 128/14 = 9 tokens per core for a 14-core part vs. 128/22 = 5 per core for a 22-core part). All the snoop modes with the System Profile set to Performance follow a pattern similar to DAPC. As shown in the graph on the right in Figure 1, changing the System Profile from DAPC to Performance can provide up to a 2% performance benefit. COD.Perf is the best option, about 2-4% better than OSB.DAPC across all processor models; this total 2-4% improvement comes partly from the change in snoop mode and partly from the change of the BIOS System Profile to Performance. We ran the car2car benchmark for all the combinations above with the sse LS-DYNA binary as well and noted similar behavior, with the Performance System Profile and COD snoop mode being 2-6% better than the default OSB.DAPC. The avx2 binaries performed 12-19% better than the sse binaries across all the processor models.
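
    The ES-mode penalty is simple arithmetic: a fixed pool of 128 request tokens is shared by the cores, so the more cores per socket, the fewer tokens each core gets. A minimal sketch reproducing the per-core numbers quoted above:

        # Request tokens per core in Early Snoop mode: a fixed pool of 128 tokens
        # shared by all cores, so tokens per core shrink as the core count grows.
        TOKEN_POOL = 128
        for cores in (12, 14, 16, 20, 22):
            print(f"{cores} cores: {TOKEN_POOL // cores} request tokens per core")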

    CD-adapco® STAR-CCM+ is another CFD application widely used by industry for solving problems involving fluid flows, heat transfer, and other phenomena. The STAR-CCM+ benchmark results show a pattern similar to LS-DYNA in terms of snoop mode and System Profile.


    Figure 2: Comparing snoop modes for STAR-CCM+

    Figure 2 compares the snoop modes for the Civil_20m and Lemans_17m benchmarks. For simplicity, only data for these two benchmarks are shown; the other benchmark datasets show results similar to the patterns in Figure 2. The BIOS profile in the graphs is set to DAPC, and the snoop modes are compared against the default OSB snoop mode (which is set at 1, the red line on the graph). COD is the best option for the Civil_20m benchmark; it is about 2-3% better with DAPC, and with the Performance System Profile COD is 4-6% better for Civil_20m (not shown in the graph). COD is followed by OSB, and then by ES for the smaller core counts. Performance with ES, though, starts dropping as the core count increases, similar to what was observed with the LS-DYNA car2car benchmark. The HlMach10 benchmark shows a pattern similar to Civil_20m; for HlMach10 the COD.Perf option is 2-7% better than the default OSB.DAPC.

    All the other benchmarks (EglinStoreSeparation, Kcs, Lemans_100m, Reactor9m, TurboCharger, Vtm) show a pattern similar to Lemans_17m. COD and OSB perform similarly; there is only ~1% difference between OSB and COD across the benchmark cases and across all processor models. After COD and OSB, the ES option is better for lower core counts and HS for the 16-, 20- and 22-core processors. As mentioned previously, the system in ES mode starts paying the penalty of having fewer request tokens per core at higher core counts compared to the other snoop modes.


    Figure 3: DAPC vs. Performance with COD snoop mode for STAR-CCM+

    The graph in Figure 3 compares the System Profile BIOS options DAPC and Performance. We are comparing the performance of COD.Perf with respect to COD.DAPC, which is the red baseline set at 1 in the graph. The Performance profile provides a 2-4% benefit over DAPC for the Civil_20m benchmark for all the processor models. Also, for the high-core-count E5-2699 v4, the Performance profile performs 2-5% better across all the benchmarks. For all the other processor models there is no significant gain (only about 1%) with the Performance profile for any of the benchmarks except Civil_20m.

    ANSYS Fluent is a computational fluid dynamics application. Fluent provides multiple benchmark cases. We picked four representative cases from the v16 benchmark suite (combustor_12m, combustor_71m, exhaust_system_33m and ice_2m) and one from the older v15 benchmark suite (truck_poly_14m) to allow us to compare our data with previous generation processor models. The Fluent benchmarks exhibit a pattern similar to the LS-DYNA and STAR-CCM+ benchmarks.
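
    Note that the Fluent metric in Table 1 is a solver rating rather than a raw time; the usual definition is the number of benchmark jobs that could be completed in a 24-hour day, so higher is better. A minimal sketch assuming that definition (the wall-clock time below is a hypothetical value, not a measured result):

        # Solver rating as commonly defined for Fluent benchmarks: benchmark jobs
        # per 24-hour day (higher is better). The example time is hypothetical.
        SECONDS_PER_DAY = 86400

        def solver_rating(solver_wall_clock_s):
            return SECONDS_PER_DAY / solver_wall_clock_s

        print(solver_rating(1200.0))  # a hypothetical 1200 s solve -> rating 72.0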


    Figure 4: Comparing snoop modes for ANSYS Fluent

    The graph in Figure 4 shows the performance of truck_poly_14m for all the snoop modes compared to the default OSB.DAPC, which is shown as the red baseline in the graphs. All the other benchmarks show a similar pattern. COD performs up to 2% better than OSB for truck_poly_14m, combustor_12m and ice_2m, about 5% better for combustor_71m, and 6% better for exhaust_system_33m. COD is followed by OSB, then by ES for lower core counts and HS for higher core count processors, for all the benchmarks.


    Figure 5: DAPC vs. Performance with COD snoop mode for ANSYS Fluent

    Figure 5 shows the performance of the Performance profile with respect to DAPC, with COD set as the snoop mode for both options. DAPC is shown as the red baseline in the graph. The Performance BIOS profile is about 4% better for all the processor models on the larger combustor_71m and exhaust_system_33m benchmark cases, and 1-3% better for the other benchmark cases.

    OpenFOAM (Open source Field Operation And Manipulation) is free, open-source software for computational fluid dynamics (CFD). OpenFOAM was compiled with the -march=native / Broadwell option. We used the cavity-1M and motorBike-11M datasets, which are modifications of the OpenFOAM tutorials/incompressible/icoFoam/cavity and tutorials/incompressible/simpleFoam/motorBike models respectively.
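
    The clock time metric in Table 1 comes from the solver output. As a minimal sketch, assuming the standard OpenFOAM solver log format (icoFoam and simpleFoam print an "ExecutionTime = ... s  ClockTime = ... s" line every time step; the log file name here is a hypothetical example), the final clock time can be pulled out like this:

        # Extract the last "ClockTime = N s" value from an OpenFOAM solver log.
        # Assumes the standard solver output format; the log path is hypothetical.
        import re

        def final_clock_time(log_path):
            clock = None
            with open(log_path) as log:
                for line in log:
                    m = re.search(r"ClockTime\s*=\s*([\d.]+)\s*s", line)
                    if m:
                        clock = float(m.group(1))  # keep the last occurrence
            return clock

        print(final_clock_time("log.simpleFoam"))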


    Figure 6: Comparing snoop modes and BIOS Profiles for OpenFOAM Cavity 1M benchmark

    As shown in the left graph of Figure 6, with the DAPC System Profile the benchmark performance increases by 3-6% in COD snoop mode compared to OSB. The ES and HS options perform up to 3% lower than OSB across all the processor models. The pattern is similar for the Performance System Profile, where COD is better by 3-7%, followed by OSB. HS is lower than OSB but better than ES for all the processor models except the 20-core E5-2698 v4, where ES is 1% better than HS with the DAPC profile and 7% better than HS with the Performance System Profile. There is not a lot of difference in performance between the DAPC and Performance profiles, especially for the higher-frequency processors, the 14-core E5-2690 v4 and the 16-core E5-2697A v4. For the other models the Performance profile shows up to a 4% benefit, as shown in the right graph of Figure 6.


    Figure 7: Comparing snoop modes and BIOS Profiles for OpenFOAM Motorbike 11M benchmark

    For the OpenFOAM Motorbike 11M benchmark the OSB, COD and HS snoop modes perform similarly, with about 1% variation. The performance with ES is low across all the processor models and keeps dropping as the number of cores increases, as shown in the left graph of Figure 7. The snoop modes with the BIOS System Profile set to Performance follow a very similar trend. As shown in the right graph of Figure 7, the DAPC and Performance profiles show similar performance, with Performance about 1% better in most cases except for the E5-2697A v4, where DAPC.COD was 2% better.

    Conclusion

    Most of the datasets used in this study show an advantage for COD mode, but COD benefits codes that are highly NUMA-optimized and whose dataset fits into the NUMA domain's memory (that is, half of each socket's memory capacity; on the 256GB configuration used here, that is 64GB per COD NUMA domain). OSB is a close second and a good option for codes with varying levels of NUMA optimization; OSB is also the default memory snoop BIOS option. HS and ES perform slightly lower than COD and OSB. ES is better than HS for lower core counts, but as the core count increases, ES starts paying the penalty of having fewer request tokens per core than the other snoop modes. In terms of System Profile, the Performance Profile performs slightly better than DAPC in most cases.

    Be sure to check back next week for the last blog in the series, which will compare the performance of HPC CAE applications across processor generations (Ivy Bridge vs. Haswell vs. Broadwell).