High Performance Computing Blogs

High Performance Computing
A discussion venue for all things high performance computing (HPC), supercomputing, and the technologies that enable scientific research and discovery.
  • Need for Speed: Comparing FDR and EDR InfiniBand (Part 2)

    By Olumide Olusanya and Munira Hussain

    This is the second part of this blog series. In the first post, we shared OSU Micro-Benchmark (latency and bandwidth) and HPL results comparing FDR and EDR InfiniBand. In this part, we further compare performance using real-world applications: ANSYS Fluent, WRF, and the NAS Parallel Benchmarks. For our cluster configuration, please refer to Part 1.

    Results

    Ansys Fluent

    Fluent is a Computational Fluid Dynamics (CFD) application used for engineering design and analysis. It can simulate fluid flow, heat transfer, turbulence and other phenomena involved in various transportation, industrial and manufacturing processes.

    For this test we ran Eddy_417k, one of the problem sets from the ANSYS Fluent benchmark suite. It is a reacting flow case based on the eddy dissipation model. It has around 417,000 hexahedral cells and is a small dataset with high communication overhead.

    Figure 1 - ANSYS Fluent 16.0 (Eddy_417k)

    From Figure 1 above, EDR shows a clear performance advantage over FDR as the number of cores increases to 80, and the gap continues to widen as the cluster scales. While FDR's performance gradually tapers off after 80 cores, EDR continues to scale with the number of cores and performs 85% better than FDR on 320 cores (16 nodes).

    WRF (Weather Research and Forecasting)

    WRF is a modelling system for numerical weather prediction, widely used in atmospheric research and operational forecasting. It contains two dynamical cores, a data assimilation system, and a software architecture that allows for parallel computation and system extensibility. For this test, we study the performance of a medium-size case, Conus 12km.

    Conus 12km is a 12-km resolution case over the continental US domain. The benchmark is run for 3 hours, after which we take the average time per time step.

    Figure 2 - WRF (Conus12km)

    Figure 2 shows both EDR and FDR scaling almost linearly and performing nearly equally until the cluster reaches 320 cores, where EDR performs 2.8% better than FDR. This difference, which may seem small, is significantly higher than our largest run-to-run variation of 0.005% across three successive EDR and FDR 320-core tests.

    The HPC Advisory Council's results here show a similar trend with the same benchmark. In their results, performance is neck and neck until the 8- and 16-node runs, where a small gap appears. The gap widens further in the 32-node run, where EDR posts a 28% better result than FDR. Both sets of results suggest that we could see an even larger EDR advantage as we scale beyond 320 cores.

     

    NAS Parallel Benchmarks

    NPB is a suite of benchmarks developed by the NASA Advanced Supercomputing Division. The benchmarks are designed to test the performance of highly parallel supercomputers and mimic the computation and data movement of large-scale, commonly used computational fluid dynamics applications. For our tests, we ran four of these benchmarks: CG, MG, FT, and IS. In the figures below, the performance difference is shown in an oval above the corresponding run.

    Figure 3 - CG

    Figure 4 - MG

    Figure 5 - FT

    Figure 6 - IS

    CG is a benchmark that computes an approximation of the smallest eigenvalue of a large, sparse, symmetric positive-definite matrix using a conjugate gradient method. It also tests irregular long-distance communication between cores. From Figure 3 above, EDR shows a 7.5% performance advantage at 256 cores.
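
    To make the method concrete, here is a minimal, self-contained conjugate gradient solver in C for a small dense symmetric positive-definite system. It is only a sketch of the iteration the benchmark is built around, not the NPB CG kernel itself, and the matrix and right-hand side are made-up example values. (Compile with cc -std=c99 cg.c -lm.)

    #include <stdio.h>
    #include <math.h>

    #define N 4

    /* y = A*x for a small dense matrix (NPB CG uses a large sparse matrix). */
    static void matvec(const double A[N][N], const double x[N], double y[N]) {
        for (int i = 0; i < N; i++) {
            y[i] = 0.0;
            for (int j = 0; j < N; j++) y[i] += A[i][j] * x[j];
        }
    }

    static double dot(const double a[N], const double b[N]) {
        double s = 0.0;
        for (int i = 0; i < N; i++) s += a[i] * b[i];
        return s;
    }

    int main(void) {
        /* Example symmetric positive-definite system A*x = b. */
        double A[N][N] = {{4,1,0,0},{1,3,1,0},{0,1,2,1},{0,0,1,2}};
        double b[N] = {1,2,0,1}, x[N] = {0,0,0,0};
        double r[N], p[N], Ap[N];

        matvec(A, x, Ap);
        for (int i = 0; i < N; i++) { r[i] = b[i] - Ap[i]; p[i] = r[i]; }
        double rr = dot(r, r);

        for (int it = 0; it < 100 && sqrt(rr) > 1e-10; it++) {
            matvec(A, p, Ap);                      /* Ap = A*p             */
            double alpha = rr / dot(p, Ap);        /* step length          */
            for (int i = 0; i < N; i++) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
            double rr_new = dot(r, r);
            double beta = rr_new / rr;             /* new search direction */
            for (int i = 0; i < N; i++) p[i] = r[i] + beta * p[i];
            rr = rr_new;
        }

        for (int i = 0; i < N; i++) printf("x[%d] = %f\n", i, x[i]);
        return 0;
    }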

    MG solves a 3-D Poisson partial differential equation. The problem in this benchmark is simplified, using constant instead of variable coefficients. It tests both short- and long-distance communication between cores, and unlike CG, its communication patterns are highly structured. From Figure 4, EDR performs 1.5% better than FDR on our 256-core cluster.

    FT solves a 3-D partial differential equation using FFTs. It also tests long-distance communication performance and shows a 7.5% gain with EDR on 256 cores, as seen in Figure 5 above.

    IS, a large integer sort application, shows a 16% performance difference between EDR and FDR on 256 cores. This application tests not only integer computation speed but also communication performance between cores. From Figure 6, we can see a 12% EDR advantage with 128 cores, which increases to 16% on 256 cores.

                            

    Conclusion 

    In both blogs, we have shown several micro-benchmark and real-world application results comparing FDR with EDR InfiniBand. From these results, EDR has shown higher performance and better scaling than FDR on our 16-node Dell PowerEdge C6320 cluster. Some applications show a wider performance margin between these interconnects than others; this is due to the nature of the applications being tested, as communication-intensive applications perform and scale better with a faster network than compute-intensive applications do. Furthermore, because of our cluster size, we could only test the scalability of the applications up to 16 servers (320 cores). In the future, we plan to repeat these tests on a larger cluster to further examine the performance difference between EDR and FDR.

     

  • Dell’s Jim Ganthier Recognized as an “HPCWire 2016 Person to Watch”

    Congratulations to Jim Ganthier, Dell’s vice president and general manager of Cloud, HPC and Engineered Solutions, who was recently selected by HPCWire as a “2016 Person to Watch.” In an interview as part of this recognition, Jim offered his insights, perspective and vision on the role of HPC, seeing it as a critical segment of focus driving Dell’s business. He also discussed initiatives Dell is employing to inspire greater adoption through innovation, as HPC becomes more mainstream.

    There has been a shift in the industry, with newfound appreciation of advanced-scale computing as a strategic business advantage. As it expands, organizations and enterprises of all sizes are becoming more aware of HPC’s value to increase economic competitiveness and drive market growth. However, Jim believes greater availability of HPC is still needed for the full benefits to be realized across all industries and verticals.

    As such, one of Dell’s goals for 2016 is to help more people in more industries to use HPC by offering more innovative products and discoveries than any other vendor. This includes developing domain-specific HPC solutions, extending HPC-optimized and enabled platforms, and enabling a broader base of HPC customers to deploy, manage and support HPC solutions. Further, Dell is investing in vertical expertise by bringing on HPC experts in specific areas including life sciences, manufacturing and oil and gas.

    Dell is also offering its own brand muscle to draw more attention to HPC at the C-suite level, and will thus accelerate mainstream adoption - this includes leveraging the company’s leading IT portfolio, services and expertise. Most importantly, the company is championing the democratization of HPC, meaning minimizing complexities and mitigating risk associated with traditional HPC while making data more accessible to an organization’s users.

    Here are a few of the trends Jim sees powering adoption for the year ahead:

    • HPC is evolving beyond the stereotype of being solely targeted at deeply technical government and academic audiences and is now moving toward the mainstream. Commercial companies within the manufacturing and financial services industries, for example, require high-powered computational ability to stay competitive and drive change for their customers.
    • The science behind HPC is no longer just about straight number crunching, but now includes development of actionable insights by extending HPC capabilities to Big Data analytics, creating value for new markets and increasing adoption.
    • This trend towards the mainstream does come with its challenges, however, including issues with complexity, standardization and interoperability among different vendors. That said, by joining forces with Intel and others to form the OpenHPC Collaborative Project in November 2015, Dell is helping to deliver a stable environment for its HPC customers. The OpenHPC Collaborative Project aims to enable all vendors to have a consistent, open source software stack, standard testing and validation methods, the ability to use heterogeneous components together and the capacity to reduce costs. OpenHPC levels the playing field, providing better control, insight and long-term value as HPC gains traction in new markets.

    A great example of HPC outside the world of government and academic research is aircraft and automotive design. HPC has long been used for structural mechanics and aerodynamics of vehicles, but now that the electronics content of aircraft and automobiles is increasing dramatically, HPC techniques are also being used to prevent electromagnetic interference from impacting the performance of those electronics. At the same time, HPC has enabled vehicles to be lighter, safer and more fuel efficient than ever before. Other examples of HPC applications include everything from oil exploration to personalized medicine, from weather forecasting to the creation of animated movies, and from predicting the stock market to assuring homeland security. HPC is also being used by the likes of FINRA to help detect and deter fraud, as well as helping stimulate emerging markets by enabling the growth of analytics applied to big data.

    Again, our sincerest congratulations to Jim Ganthier! To read the full Q&A, visit http://bit.ly/1PYFSv2.

     

  • Need for Speed: Comparing FDR and EDR InfiniBand (Part 1)

    By Olumide Olusanya and Munira Hussain

    The goal of this blog is to evaluate the performance of Mellanox Technologies' FDR (Fourteen Data Rate) InfiniBand and their latest EDR (Enhanced Data Rate) InfiniBand, with speeds of 56 Gb/s and 100 Gb/s respectively. This is the first of a two-part blog series in which we show how these interconnects perform on an HPC cluster using industry-standard micro-benchmarks and applications. In this part, we show latency, bandwidth and HPL results for FDR vs EDR; in Part 2 we will share results from additional applications, including ANSYS Fluent, WRF, and the NAS Parallel Benchmarks. Keep in mind that while some applications benefit from the higher bandwidth of EDR, applications with low communication overhead will show little performance improvement in comparison.

    General Overview

    Mellanox EDR adapters are based on a new-generation ASIC known as ConnectX-4, while the FDR adapters are based on ConnectX-3. The theoretical unidirectional bandwidth for EDR is 100 Gb/s versus 56 Gb/s for FDR. Another difference is that EDR adapters are x16 only, while FDR adapters are available in x8 and x16. Both adapters operate at a 4X link width. The messaging rate for EDR can reach up to 150 million messages per second, compared with more than 90 million messages per second for the FDR ConnectX-3 adapters.

    Table 1 below shows the differences between EDR and FDR, Table 2 describes the configuration of the cluster used in the test, and Table 3 lists the applications and benchmarks used.

    Table 1 - Difference between EDR and FDR

                        FDR                EDR
    Chipset             ConnectX-3         ConnectX-4
    Link                x8 and x16 Gen3    x16 Gen3
    Theoretical BW      56 Gb/s            100 Gb/s
    Messaging rate      90 MMS             150 MMS
    Port                QSFP               QSFP28
     

    Table 2 - Cluster configuration

    Components          Details
    Server              16 nodes x PowerEdge C6320 [4 chassis]
    Processor           Intel Xeon E5-2660 v3 @ 2.6/2.2 GHz, 10 cores, 105W
    BIOS                1.1.3
    Memory              128 GB (8 x 16 GB @ 2133 MHz)
    Operating System    Red Hat Enterprise Linux Server release 6.6.z (Santiago)
    Kernel              2.6.32-504.16.2.el6.x86_64
    MPI                 Intel MPI 5.0.3.048
    Drivers             MLNX_OFED_LINUX-3.0-1.0.1

    BIOS settings
    • System Profile: Performance Optimized
    • Turbo mode: Enabled
    • C-states: Disabled
    • Node interleave: Disabled
    • Hyper-threading: Disabled
    • Snoop mode: Early/Home/COD snoop

    Interconnect - EDR
    • Mellanox ConnectX-4 EDR 100Gbps
    • Mellanox Switch-IB SB7790
    • PCI-E x16 Gen3 riser slot
    • HCA firmware: 12.0012.1100
    • PSID: MT_2180110032

    Interconnect - FDR
    • Mellanox ConnectX-3 FDR 56Gbps
    • Mellanox SwitchX SX6025
    • PCI-E x8 Gen3 Mezz slot
    • HCA firmware: 2.30.8000
    • PSID: DEL0A30000019


    Table 3 - Applications and Benchmarks

    Application               Domain                              Version                  Benchmark
    OSU Micro-Benchmarks      Efficiency of MPI implementation    From Mellanox OFED 3.1   Latency, Bandwidth
    HPL                       Random dense linear system          From Intel MKL           Problem size 90% of total memory
    ANSYS Fluent              Computational Fluid Dynamics        V16.0                    Eddy_417k
    WRF                       Weather Research and Forecasting    V3.5.1                   Conus 12km
    NAS Parallel Benchmarks   Computational Fluid Dynamics        3.3.1                    CG, MG, IS, FT


    Results 

    OSU Micro-Benchmarks

    To measure latency and bandwidth, we used tests from the OSU Micro-Benchmarks suite. These tests use MPI message passing to gauge the quality of the network fabric. Using the same system configuration for the EDR and FDR fabrics, we obtained the latency results shown in Figure 1 below.

    Figure 1 - OSU Latency (using MPI from Mellanox HPC-X Toolkit)

    Figure 1 shows a simple OSU node-to-node latency result for EDR vs FDR. The headline latency number is typically taken at the smallest message size, and lower is better. In the OSU latency graph above, EDR shows a latency of 0.80us while FDR shows 0.81us. As the message size increases past 512 bytes, EDR maintains a lower latency of 2.75us compared with FDR's 2.84us at a 4KB message size. In a further latency study using RDMA, EDR measured 0.61us and FDR measured 0.65us.
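
    As a concrete illustration of what the latency test measures, below is a minimal MPI ping-pong sketch in C in the spirit of the OSU latency test (it is not the OSU source). Run it with two ranks placed on different nodes; the one-way latency is half the averaged round-trip time, and the message size and iteration count here are illustrative.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        const int iters = 1000, msg_size = 1;    /* 1-byte messages for the small-message latency */
        char *buf = malloc(msg_size);
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Barrier(MPI_COMM_WORLD);

        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {                     /* rank 0: send, then wait for the echo */
                MPI_Send(buf, msg_size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, msg_size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {              /* rank 1: echo every message back */
                MPI_Recv(buf, msg_size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, msg_size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();

        if (rank == 0)                           /* one-way latency = half the round trip */
            printf("latency: %.2f us\n", (t1 - t0) * 1e6 / (2.0 * iters));

        free(buf);
        MPI_Finalize();
        return 0;
    }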

    Figure 2 below plots the OSU unidirectional and bidirectional bandwidth achieved by both EDR and FDR at message sizes from 1 byte to 4 MB.

    Figure 2 - OSU Bandwidth (using MPI from Mellanox HPC-X Toolkit)

    The OSU unidirectional bandwidth test is a ping-pong style test in which the sender transmits a window of back-to-back messages of a fixed size to the receiver, and the receiver replies only after receiving them all. This measures the maximum one-way data rate of the network, or the unidirectional bandwidth. The result is taken at the maximum message size of 4 MB. In this test, EDR achieves a maximum unidirectional data rate of 12.4 GB/s (99.2 Gb/s) while FDR achieves 6.3 GB/s (50.4 Gb/s), a 97% improvement for EDR over FDR.
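
    The pattern just described can be sketched as follows: the sender posts a window of back-to-back non-blocking sends and then waits for a short acknowledgment from the receiver. This is a simplified stand-in for the OSU bandwidth test, with the window size and 4 MB message size chosen for illustration, and it should likewise be run with one rank on each of two nodes.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define WINDOW 64
    #define MSG    (4 * 1024 * 1024)             /* 4 MB, the largest size in Figure 2 */

    int main(int argc, char **argv) {
        char *buf = malloc(MSG), ack = 0;
        MPI_Request req[WINDOW];
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Barrier(MPI_COMM_WORLD);

        double t0 = MPI_Wtime();
        if (rank == 0) {
            /* Sender: WINDOW back-to-back non-blocking sends, then wait for one small reply. */
            for (int i = 0; i < WINDOW; i++)
                MPI_Isend(buf, MSG, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &req[i]);
            MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
            MPI_Recv(&ack, 1, MPI_CHAR, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            double t1 = MPI_Wtime();
            printf("unidirectional bandwidth: %.2f GB/s\n",
                   (double)WINDOW * MSG / 1e9 / (t1 - t0));
        } else if (rank == 1) {
            /* Receiver: post matching receives, then send the acknowledgment. */
            for (int i = 0; i < WINDOW; i++)
                MPI_Irecv(buf, MSG, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &req[i]);
            MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
            MPI_Send(&ack, 1, MPI_CHAR, 0, 1, MPI_COMM_WORLD);
        }

        free(buf);
        MPI_Finalize();
        return 0;
    }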

    OSU bidirectional bandwidth is very similar to the unidirectional test, but in this case, both nodes send messages to each other and await a reply. From the above graph, EDR achieves a bidirectional data rate of 24.2GB/s (193.6Gb/s) compared with FDR’s 10.8GB/s (86.4Gb/s) which gives us a 124% improvement with EDR over FDR.

    HPL

    Figure 3 below shows HPL performance for EDR and FDR using the COD (Cluster on Die) snoop mode. Previous studies have shown that COD gives better performance than the Home and Early snoop modes.

     

    Figure 3 - HPL Performance

    The HPL benchmark is a compute-intensive application; depending on tuning, it can spend more than 80% of its runtime on computation. During the bulk of its communication time it sends small messages across the cluster, which may not benefit from a faster network, so a large performance difference between EDR and FDR should not be expected. Although EDR appears to perform slightly better than FDR, by 0.33% in the 80-core run, this difference is within our run-to-run variation for successive tests with either EDR or FDR, so the gain cannot be attributed to an EDR advantage. This also makes it difficult to accurately measure the effect of one interconnect versus the other with HPL.

    Conclusion

    From our tests so far, EDR has shown a clear bandwidth advantage over FDR: 97% in unidirectional and 124% in bidirectional bandwidth. In the second part of this blog, we will share results from additional applications (ANSYS Fluent, WRF, and the NAS Parallel Benchmarks) to compare the performance of EDR and FDR.

      

  • IoT Close to Coming of Age

    CES 2016 proved that the Internet of Things (IoT) is a segment of the tech market that continues to grow – from connected homes, to connected cars, to connected cities and beyond.  Dell is working with customers on over 150 IoT projects that range from solving simple organizational issues, to leveraging technology as a critical competitive advantage.

    Compass Intelligence is a market analytics and consulting firm that specializes in metrics-driven market intelligence and insights for the mobile, IoT, and high-tech industries. For the past three years the company has recognized the year’s best in mobile devices and software, wireless communications, Internet of Things, wearables, green technology and connected products. Dell was thrilled to be one of the 2016 honorees, for its dedication to making the IoT market a reality. 

    A good example of the collaboration taking place in the market is the Thread Group, a global ecosystem of developers, retailers, and customers, who are working together to create a better way to connect products to the home, educate those who are not in the know, and simplify the process through the most innovative technology at hand.  For more, you can read my blog, “Pushing Through the Hype and Handwringing of IoT.”

  • A Look Back at SC15 and Looking Forward to 2016

    By Christine Fronczak

    It’s been a great year for our community, with the industry maturing sufficiently to push HPC firmly back into the limelight. We are seeing a large part of the market evolving beyond the traditional stereotype of HPC – that of being solely for the science and super-geeky technology audience. Big data is a prime example of a market that requires high-powered computational capability across a wide range of verticals – from retail to manufacturing and financial services.

    SC15 enabled us to bring some of the industry's best together so we could learn all the different ways HPC is being used to better our world, from finding cures for autism and rare childhood diseases, to rooting out fraud and plagiarism, to creating clean energy, to helping developing countries produce better food and foster entrepreneurship.

    We heard some very moving and intriguing use cases from the likes of General Atomics, The University of Florida, Virginia Tech, TGen, University of Maryland and Johns Hopkins, as well as Oxford University among others.  These have been posted for your viewing on our YouTube Channel. We also hosted Intersect360’s Addison Snell and Dell’s own Onur Celebioglu who walked us through trends they see in the market and what to expect in the coming year. Their talks are also accessible on the SC15 YouTube playlist. InsideHPC’s Rich Brueckner joined us to moderate a panel discussion on the convergence of HPC, Big Data and Cloud, followed by a discussion on the NSCI initiative. Created by President Obama in July 2015, the National Strategic Computing Initiative has a mission to ensure the United States continues leading high performance computing over the coming decades.  Both of these panels are available on the same playlist as the others, and can also be accessed via the InsideHPC website.

    We will continue to follow these stories throughout the year, as well as giving you insight into the people behind the scenes making the “magic” happen. Intersect360’s Addison Snell has said that HPC is “a critical pillar of innovation and advancement, whether you’re talking about general scientific research or throughout different industries.” In 2016 we will watch and explore the trends and foster discussion on the successes and failures of our community in order to propel the industry forward. In the meantime, we leave you with a glimpse and some insights into SC15 and wish you a very Happy New Year!

    Highlights from SC15:

  • Meeting the Demands of HPC and Big Data Applications by Leveraging Hybrid CPU/GPU Computing

    “Rack ‘em and stack ‘em” has been a winning approach for a long time, but not without its limitations. A generalized server solution works best when the applications running on those servers have generalized needs.

    Enter “Big Data.” Today's application and workload environments may be required to process massive amounts of granular data and thus often consist of applications that place high demands on different server hardware elements. Some applications are very compute intensive and place a high demand on the server's CPU, while others in the same environment have unique processing requirements best handled by specialized graphics processing units (GPUs).

    Whether it is customer, demographic, seismic or a whole host of other data, the number crunching and processing required across the suite of applications can result in demands that are radically different from those of prior years. Enter hybrid high performance computing. These systems are built to serve two masters, CPU-intensive applications and GPU-intensive applications, delivering a hybrid environment where workloads can be optimized and run times reduced through better resource utilization.

    The results of Hybrid CPU/GPU Computing adoption have been impressive. Just a few examples of how Hybrid CPU/GPU Computing is delivering real value include:

    • Optimizing workloads across CPU/GPU servers
    • Delivering the highest density and performance in a small footprint
    • Providing significant power, cooling and resource utilization benefits

    You can learn more about leveraging hybrid CPU/GPU computing in this whitepaper.

  • Counting Down to SC15

    - by Stephen Sofhauser

    The countdown to SC15 has started, and we at Dell are very excited for this year’s event. We have a lot to share with you all this year, and it’s particularly special to us because it’s right in our back yard.  Come visit us at booth #1009. Our aim is to show you the true meaning of “Texas Friendly” with great demos, two customer theaters that will feature a stellar lineup of speakers, panel discussions, new products and solutions we’re bringing to market, and of course, let’s not forget the awesome food and entertainment Austin has to offer!

    We have a great morning series for you, featuring our director of HPC Engineering, Onur Celebioglu, and Intersect360's Addison Snell. Every morning of the show (10:15 a.m. - 10:45 a.m.), they will discuss trends and technology in the HPC market.

    There will be two afternoon panel discussions, hosted by insideHPC's Richard Brueckner. I recently spoke with him about the upcoming conference; you can listen to the podcast here.

    The first panel is from 1:30 p.m. to 2:30 p.m. on Wednesday, November 18th: “All Together Now: The Convergence of Big Data, Cloud, and HPC.” This should prove to be an interesting discussion with Richard and our four panelists: Wojtek Goscinski, Ph.D. (Monash University), Niall Gaffney, Ph.D. (Texas Advanced Computing Center (TACC)), Craig Stewart, Ph.D. (Indiana University), and Andrew Rutherford (Microsoft). The concept of the panel was born from discussions with customers who, after years of siloed workloads, are trying to figure out how best to integrate their Big Data, Cloud, and HPC environments.

    Wednesday's second panel should be equally compelling as Richard and the panelists discuss how the NSCI is fostering more collaboration between government, academia, and industry. “More than Just Exascale: How the NSCI Will Make HPC More Accessible to All” (3:00 p.m. – 4:00 p.m). This all comes as a result of the White House initiative, seeking to keep the United States at the forefront of HPC capabilities. What’s nice is that most of these speakers know each other. We will be featuring Dan Stanzione, Ph.D. (Texas Advanced Computing Center TACC), Mike Norman, Ph.D. (San Diego Supercomputing Center SDSC), Dave Lifka, Ph.D. (Cornell University), and Merle Giles (National Center for Supercomputing Applications NCSA) – who has a book on HPC best practices, Industrial Applications of High-Performance Computing: Best Global Practices.

    We have so much more planned for you, including talks from customers like the San Diego Supercomputing Center, Virginia Bioinformatics Institute (VBI), and Cornell - just granted $5 million by the NSF to collaboratively develop a federated cloud. There's too much to cram into one blog post - but you can check out http://dell.to/1PFgbwJ for more information. We look forward to seeing you at booth #1009; have a great SC15, enjoy everything Dell has to offer, and welcome to Austin!

     

     

  • A Great Lineup for SC15

    Join Dell in Booth #1009!

    This year promises to be another great one at SC, chock full of great speakers and panels – you won’t be disappointed! 

    Join us at the Dell Sixth Street Theater Tuesday, Wednesday, and Thursday mornings from 10:15 a.m. to 10:45 a.m. for authentic Texas burritos and engaging conversation with Intersect360's CEO, Addison Snell, and Dell's own Onur Celebioglu. They will discuss why HPC is now important to a broader group of use cases and dig into overviews of HPC for research, life sciences and manufacturing. Participants will learn about application characterization, best practices and examples of engineered solutions appropriate for these specific verticals. Come learn more about why HPC, big data and cloud are converging, and how Dell solves challenges in our HPC engineering lab and through collaborative work with other leading technology partners and research institutions.

    We will also have two informative panels hosted by Rich Brueckner, president, insideHPC. Recently, Dell's North American Sales Director for High Performance Computing spoke with Brueckner about these panels - you can listen to the podcast here.

    All Together Now: The Convergence of Big Data, Cloud, and HPC

    Wednesday, Nov 18, 1:30 p.m. – 2:30 p.m.

    • Wojtek Goscinski, Ph.D. (Monash University)
    • Niall Gaffney, Ph.D. (Texas Advanced Computing Center (TACC))
    • Craig Stewart, Ph.D. (Indiana University)
    • Andrew Rutherford (Microsoft)

    Modeling and simulation have been the primary uses of high performance computing (HPC). But the world is changing: we now need rapid, accurate insights from large amounts of data, and HPC technology is being repurposed to deliver them. Likewise, the location where the work gets done is changing, with many workloads migrating to massive cloud data centers because of the speed of execution. In this panel, leaders in computing will share how they, and others, integrate tradition and innovation (HPC technologies, Big Data analytics, and Cloud Computing) to achieve more discoveries and drive business outcomes.

    More than Just Exascale: How the NSCI Will Make HPC More Accessible to All 

    Wednesday, Nov 18, 3:00 p.m. – 4:00 p.m.

    • Dan Stanzione, Ph.D. (Texas Advanced Computing Center TACC)
    • Mike Norman, Ph.D. (San Diego Supercomputing Center SDSC)
    • Dave Lifka, Ph.D. (Cornell University)
    • Merle Giles (National Center for Supercomputing Applications NCSA)

    The US - and the world - took notice this summer when President Obama issued an Executive Order establishing the National Strategic Computing Initiative (NSCI). While most headlines focus on the exascale goals of this initiative, the NSCI presents a comprehensive set of objectives, including advancing the usage, capabilities and impact of HPC for decades to come. In this panel, you will hear from HPC leaders who are improving HPC application developer productivity and making high performance computing more accessible to all.

    Lastly, from 1:00 p.m. to 1:30 p.m. on Tuesday, Wednesday, and Thursday afternoons, you can enjoy tasty treats and great dialogue from:

    • Jeff Kirk, Sr. Principal Engineer, HPC Technologies, Office of the CTO
    • Joe Sekel, HPC Server Architect, Dell
    • Adnan Khaleel, Director Global Enterprise Sales Strategy

    Each day, they will discuss different aspects of how HPC adoption, scaling, and scope continue to grow, driven by the need to solve more problems, larger problems, and new types of problems. Modeling and simulation remain important use cases; data analytics and machine learning are expanding the scope of HPC and the types of HPC systems; and cloud computing is making HPC more accessible and on-demand. Dell has been a long-time leader in HPC clusters for modeling and simulation, and is now embarking on a path toward leadership in this broader context of HPC.

     

     

  • Accelerating HPC applications using K80 GPUs

    - By Mayura Deshmukh

    Every year, graphics processing units (GPUs) become more powerful, achieving more teraflops and delivering a leap in performance for commonly used molecular dynamics and manufacturing codes, allowing researchers to use more efficient and denser high performance computing architectures. What is the performance difference between CPU and GPU? How much power do they consume? How well do K80 GPUs perform in the Dell PowerEdge C4130 server? Which configuration is best for my application? These are some of the questions that come to mind, and this blog aims to answer them.

    This blog presents work conducted to measure and analyze the performance, power consumption and performance per watt of a single Dell PowerEdge C4130 server with NVIDIA K80 GPUs. The PowerEdge C4130 is the latest high-density GPU design from Dell, offering up to four GPUs in a 1U form factor. What makes the PowerEdge C4130 unique is its configurable system design, potentially making it a better fit for a wider variety of extreme HPC applications.

    The HPC-focused Tesla K80 GPU provides 1.87 TFLOPS of double-precision compute capacity, about 31% more than the K40, the previous Tesla card. The K40's base clock is 745 MHz, though it can be boosted to 810 MHz or 875 MHz. The K80 has a base clock of 562 MHz but can climb up to 875 MHz in 13 MHz increments. Another new feature of the K80 is Autoboost, which provides additional performance if extra power and thermal headroom are available. The K80's internal GPUs are based on the GK210 architecture and have a total of 4,992 cores, a 73% improvement over the K40. The K80 has a total of 24 GB of memory divided equally between the two internal GPUs, 100% more memory capacity than the K40. Memory bandwidth is improved to 480 GB/s, and the rated power consumption of a single K80 is a maximum of 300 watts.

    Configuration

    The C4130 offers eight configurations, "A" through "H". Since the GPUs provide the bulk of the compute horsepower, the configurations can be divided into three groups based on expected performance: a first group of four configurations, "A", "B", "C" and "G", with four GPUs each; a second group containing the single configuration "H" with three GPUs; and a third group of three configurations, "D", "E" and "F", with two GPUs each. The quad-GPU configurations "A", "B" and "G" have an internal PCIe switch module. The details of the various configurations are shown in Table 1 and the block diagram (Figure 1) below:

    Table 1: C4130 Configurations

     

      Figure 1: C4130 Configuration Block Diagram

    Table 2 gives more information about the hardware configuration, profiles and firmware used for the benchmarking.

    Table 2: Hardware Configuration

     

    Bandwidth

    CUDA's heterogeneous programming model uses both the CPU and GPU, so data transfer between them greatly affects performance.

    Figure 2: Memory Bandwidth for C4130

    Figure 2 shows the host-to-device (CPU to GPU) and device-to-host (GPU to CPU) memory bandwidth for all the C4130 configurations. Bandwidth is in the range of 12,000 MB/s (the peak is 15,754 MB/s).
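
    As a rough illustration of how such a host-to-device number is obtained, the sketch below times a single pinned-memory cudaMemcpy with CUDA events. It is not the tool used to generate Figure 2, and the 256 MB transfer size is arbitrary; build with nvcc and reverse the copy direction for the device-to-host figure.

    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void) {
        const size_t bytes = 256UL * 1024 * 1024;   /* illustrative transfer size */
        void *h_buf, *d_buf;
        cudaEvent_t start, stop;
        float ms;

        cudaMallocHost(&h_buf, bytes);              /* pinned host memory for peak PCIe rates */
        cudaMalloc(&d_buf, bytes);
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start, 0);
        cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);   /* host-to-device copy */
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);
        cudaEventElapsedTime(&ms, start, stop);

        printf("host-to-device: %.0f MB/s\n",
               (bytes / (1024.0 * 1024.0)) / (ms / 1000.0));

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(d_buf);
        cudaFreeHost(h_buf);
        return 0;
    }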

    NVIDIA's GPUDirect Peer-to-Peer feature enables GPUs on the same PCIe root complex to transfer data directly between their memories, avoiding copies to system memory. This dramatically lowers CPU overhead and reduces latency, resulting in significant improvements in data transfer time for applications. Without the peer-to-peer feature, to get data from one GPU to another on the same host, one would use cudaMemcpy() first to move the data from the source GPU to system memory, then another cudaMemcpy() to move it onto the second GPU.
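
    The difference between the two paths can be sketched with the CUDA runtime API as shown below. This is a minimal illustration assuming two GPUs in one server; error checking is omitted, and it is not the benchmark used to produce Figure 3.

    #include <cuda_runtime.h>

    int main(void) {
        const size_t bytes = 64UL * 1024 * 1024;     /* illustrative buffer size */
        void *d0, *d1, *host;

        cudaSetDevice(0); cudaMalloc(&d0, bytes);
        cudaSetDevice(1); cudaMalloc(&d1, bytes);
        cudaMallocHost(&host, bytes);

        /* Without peer-to-peer: stage the data through system memory with two copies. */
        cudaSetDevice(0);
        cudaMemcpy(host, d0, bytes, cudaMemcpyDeviceToHost);
        cudaSetDevice(1);
        cudaMemcpy(d1, host, bytes, cudaMemcpyHostToDevice);

        /* With GPUDirect P2P: enable peer access and copy GPU-to-GPU directly. */
        int can_access = 0;
        cudaDeviceCanAccessPeer(&can_access, 1, 0);  /* can device 1 reach device 0? */
        if (can_access) {
            cudaSetDevice(1);
            cudaDeviceEnablePeerAccess(0, 0);        /* grant device 1 access to device 0 */
            cudaMemcpyPeer(d1, 1, d0, 0, bytes);     /* direct copy, no host staging */
        }

        cudaSetDevice(1); cudaFree(d1);
        cudaSetDevice(0); cudaFree(d0);
        cudaFreeHost(host);
        return 0;
    }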

    Figure 3: Peer-to-peer Bandwidth for C4130

    Figure 3 shows peer-to-peer communication between the GPUs for a C4130 with a switch module (Configuration B) versus a C4130 without the switch module (Configuration C: dual CPUs, balanced, with four GPUs).

    • For configuration B the bandwidth is constant at 24.6 GB/s across all GPUs.
    • For configuration C the bandwidth is:
      • 24.6 GB/s for transfers between GPUs on the same card (GPU1↔GPU2, GPU3↔GPU4, GPU5↔GPU6, GPU7↔GPU8)
      • 19.6 GB/s for transfers between GPUs connected to the same CPU (GPU1,2↔GPU3,4; GPU5,6↔GPU7,8)
      • 18.7 GB/s for transfers between GPUs connected to the other CPU (GPU1,2,3,4↔GPU5,6,7,8)

    Applications that require a lot of peer-to-peer communication can benefit from the higher bandwidth offered by the C4130 switch-module configurations (A, B, G).

    HPL

    HPL solves a random dense linear system in double-precision arithmetic on distributed-memory systems and is a very compute-intensive benchmark. NVIDIA's pre-compiled HPL, Intel MKL 2015 and Open MPI 1.6.5 were used for the benchmarking. The problem size (N) used was ~90% of the system memory.

    Figure 4: HPL performance and power consumption with C4130

       

     

    The blue bars on the left graph in Figure 4 show the HPL performance characterization of the PowerEdge C4130. The results are reported in GFLOPS on the Y-axis.

    • Performance for the four-GPU configurations "A", "B", "C" and "G" ranges from 6.5 to 7.3 TFLOPS. Configurations "C" and "G", with two GPUs per CPU, are the highest performing at 7.3 TFLOPS. The performance difference between "A" and "B" can be attributed to the additional CPU in configuration "B". The difference from "B" to "G" or "C" is due to different GPU-to-CPU ratios; all three have the same number of compute resources. Configurations "C" and "G" are balanced with two GPUs per CPU, while "B" has all four GPUs attached to a single CPU.
    • The only three-GPU configuration, "H", achieved 6.4 TFLOPS, which falls between the performance of the four-GPU and two-GPU configurations.
    • For the two-GPU configurations, "D" is highest with 3.8 TFLOPS, while "E" and "F" reach 3.6 TFLOPS. Configuration "E" has one less CPU, explaining its lower performance relative to "D".
    • Both "D" and "F" have two CPUs and two GPUs, but in configuration "F" both GPUs are connected to a single CPU, whereas in configuration "D" each GPU is connected to its own CPU (more cores per GPU).

    Compared to CPU-only performance on two E5-2690 v3 processors, an acceleration of ~9X is obtained with four K80s, ~7X with three, and ~4.7X with two. HPL efficiency is significantly higher on the K80 (low to upper 80s percent) than on the previous generation of GPUs.

    The red bars on the right graph in Figure 4 represent the power consumption of the HPL runs. The quad-GPU configurations "A", "B", "C" and "G" consume significantly more power than the CPU-only runs, which is expected for compute-intensive loads, but their energy efficiency (calculated as performance per watt) is over 4 GFLOPS/W compared to 1.6 GFLOPS/W for the CPU-only HPL runs. The power consumption of the three-GPU configuration "H" is 2.7X and its energy efficiency is 4.1 GFLOPS/W, which makes it an energy-efficient, lower-cost alternative to the quad-GPU configurations. The dual-GPU configurations "D", "E" and "F" consume less power (1.8X to 2.1X the CPU-only runs) and their energy efficiency is in the range of 3.5 to 3.9 GFLOPS/W, about 2.3X better than the CPU-only runs.

    NAMD

    NAMD is designed for high-performance simulation of large biomolecular systems. The ApoA1 benchmark (92,224 atoms) models a high-density lipoprotein found in plasma, which helps extract cholesterol from tissues to the liver. F1ATPase (327,506 atoms) models the enzyme responsible for synthesizing the molecule adenosine triphosphate. STMV (Satellite Tobacco Mosaic Virus) is a small icosahedral virus that worsens the symptoms of infection by tobacco mosaic virus; it is a large benchmark case with 1,066,628 atoms.

    Figure 5: NAMD performance and power consumption with C4130

        

       

    Figure 5 quantifies the performance and power consumption of NAMD for all the C4130 configurations compared to a CPU-only server (i.e., a server with two CPUs).

    • NAMD's acceleration is sensitive to the number of CPUs and the memory available in the system; for example, there is a significant difference in acceleration between "A" and "B" for the quad-GPU configurations and between "E" and "F" for the dual-GPU configurations. This difference becomes more apparent as the problem size increases. "B", with a configuration similar to "A" but with an additional CPU and memory, performs 43% better than "A", and "F", with an additional CPU and memory relative to "E", performs 26% better.
    • Among the four quad-GPU configurations, NAMD performs best on "C" and "G". The difference between these two highest-performing configurations and the others ("A" and "B") is the manner in which the GPUs are attached to the CPUs. The balanced configurations "G" (with switch) and "C" (without switch) have two GPUs attached to each of the two CPUs, resulting in 7.8X acceleration over the CPU-only case. The same four GPUs attached via a switch module to a single CPU, configuration "B", result in about 7.7X acceleration.
    • "H", the three-GPU configuration, falls between the four-GPU and two-GPU configurations, with 7.1X acceleration over the CPU-only configuration. "H", with an extra CPU and more memory, performs better than the four-GPU configuration "A".
    • "D" and "F", with two CPUs and two GPUs, perform better, with 5.9X acceleration compared to 4.4X for configuration "E" (one CPU and two GPUs).

    As shown in the right graph of Figure 5, the power consumption of the quad-GPU configurations is ~2.3X, resulting in accelerations from 4.4X to 7.8X and energy efficiency (performance per watt) ranging from 2.0X to 3.4X. Configurations "C" and "G", along with providing the best performance, also do well from an energy efficiency perspective among the quad-GPU configurations (7.8X acceleration for 2.3X more power). Configuration "H", with three GPUs, is more energy efficient than the quad-GPU configurations, with a performance per watt of 3.7X, providing 7.1X acceleration with only 1.9X more power. Configuration "F" is the most energy-efficient configuration, consuming only 1.5X more power with a performance per watt of 3.8X.

    ANSYS Fluent

    ANSYS Fluent is a computational fluid dynamics application used for fluid flow design and engineering analysis. The equation solvers that drive the simulation are computationally intensive, and approximately 3 GB of GPU memory is required for a 1M-cell simulation. The benchmarks run are the ANSYS pipes 1.2M and 9.6M steady-state, non-combustion cases.

    Figure 6: ANSYS Fluent performance and power consumption with C4130

        

     

    The left graph in Figure 6 shows the performance of ANSYS Fluent relative to 4 CPU cores. The code performs best in configurations with a 1:2 CPU-to-GPU ratio.

    • The quad-GPU configurations provide 3.9X-4.4X acceleration compared to tests run on 4 CPU cores. Configurations "C" and "G" provide the best performance among the four-GPU configurations.
    • The three-GPU configuration "H" provides 3.7X acceleration.
    • The dual-GPU configuration "E", with two GPUs connected to a single CPU, provides the best acceleration of the dual-GPU configurations at 2.8X.

    The right graph in Figure 6 shows the power consumption for all the configurations relative to running the benchmarks on 4 CPU cores. The numbers in yellow at the bottom of the bars indicate the relative performance per watt of each configuration. The quad-GPU configurations consume 3.7X-3.9X more power and provide 3%-20% more performance per watt. The three-GPU configuration "H" is the most energy-efficient configuration: it consumes 2.8X more power but provides the most performance per watt of all the configurations (32% more than the 4-core runs). The dual-GPU configurations consume 2.1X-2.5X more power, with 7%-28% better energy efficiency.

    Fluent scales well on CPU cores, so to understand the benefit of using GPUs we experimented with using the same number of licenses and running the benchmark on CPU cores only versus CPU + GPU.

    Figure 7: ANSYS Fluent optimizing licensing costs

      

     

    Figure 7 shows the data for the 1.2M and 9.6M Fluent benchmarks run on CPU cores only versus the quad-GPU configurations "A", "B" and "C". The benchmark output is the wall clock time, shown on the Y-axis (lower is better); the X-axis shows the number of CPU cores used for the test (that is, the number of Fluent licenses required). As shown in Figure 7, with 24 licenses the GPU approach (16 cores + 8 GPUs) provides 48% better performance than using 24 CPU cores for the 9.6M benchmark, and 25% better for the 1.2M benchmark. Similarly, Table 3 shows the performance benefit of the GPU approach versus the CPU approach for 24, 20, 16, 12 and 8 licenses for the 9.6M and 1.2M benchmark cases.

    Table 3: Fluent GPU vs CPU approach with same number of licenses

     

    Conclusion

    The C4130 server with NVIDIA Tesla K80 GPUs demonstrates exceptional performance and power-efficiency gains for compute-intensive workloads and applications like NAMD and Fluent. Fluent scaling is very impressive on CPU cores, but depending on your problem and licensing model there is a definite performance benefit to using GPUs. Applications that do a lot of GPU peer-to-peer communication can gain from the higher bandwidth offered by the C4130 switch configurations.

  • Application Performance Study on Intel Haswell EX Processors

    by Ashish Kumar Singh

    This blog describes, in detail, a performance study carried out on the Intel Xeon E7-8800 v3 family of processors (architecture codenamed Haswell-EX). The performance of the Intel Xeon E7-8800 v3 is compared to the Intel Xeon E7-4800 v2 to ascertain the generation-over-generation improvement. The applications used for this study are HPL, STREAM, WRF and ANSYS Fluent. The Intel Xeon E7-8890 v3 processor has 18 cores/36 threads with 45 MB of L3 cache (2.5 MB per slice). With AVX workloads the clock speed of the E7-8890 v3 is reduced from 2.5 GHz to 2.1 GHz. These processors support a QPI speed of 9.6 GT/s.

    Server Configuration

                                                 PowerEdge R920                               PowerEdge R930
    Processor                                    4 x Intel Xeon E7-4870 v2 @ 2.3 GHz,         4 x Intel Xeon E7-8890 v3 @ 2.5 GHz,
                                                 15 cores, 30 MB L3 cache, 130W               18 cores, 45 MB L3 cache, 165W
    Memory                                       512 GB = 32 x 16 GB DDR3 @ 1333 MHz RDIMMs   1024 GB = 64 x 16 GB DDR4 @ 1600 MHz RDIMMs

    BIOS Settings
    BIOS                                         Version 1.1.0                                Version 1.0.9
    Processor Settings > Logical Processors      Disabled                                     Disabled
    Processor Settings > QPI Speed               Maximum Data Rate                            Maximum Data Rate
    Processor Settings > System Profile          Performance                                  Performance

    Software and Firmware
    Operating System                             RHEL 6.5 x86_64                              RHEL 6.6 x86_64
    Intel Compiler                               Version 14.0.2                               Version 15.0.2
    Intel MKL                                    Version 11.1                                 Version 11.2
    Intel MPI                                    Version 4.1                                  Version 5.0

    Benchmark and Applications
    LINPACK                                      V2.1 from MKL 11.1                           V2.1 from MKL 11.2
    STREAM                                       v5.10, Array Size 1800000000,                v5.10, Array Size 1800000000,
                                                 Iterations 100                               Iterations 100
    WRF                                          v3.5.1, Input Data Conus12KM,                v3.6.1, Input Data Conus12KM,
                                                 Netcdf-4.3.1.1                               Netcdf-4.3.2
    ANSYS Fluent                                 v15, Input Data: eddy_417k,                  v15, Input Data: eddy_417k,
                                                 truck_poly_14m, sedan_4m, aircraft_2m        truck_poly_14m, sedan_4m, aircraft_2m

    Analysis

    The objective of this comparison is to show the generation-over-generation performance improvement in enterprise four-socket (4S) platforms. The performance differences between the two server generations are due to improvements in system architecture, a greater number of cores and higher-frequency memory; the software versions were not a significant factor.

    LINPACK

    High Performance LINPACK (HPL) is a benchmark that solves a random dense linear system in double-precision (64-bit) arithmetic on distributed-memory systems. The HPL benchmark was run on both the PowerEdge R930 and the PowerEdge R920 with a block size of NB=192 and a problem size of N set to use ~90% of total memory.
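
    For reference, a problem size chosen this way can be estimated with a few lines of C. This is only a sketch of the usual sizing rule (the HPL matrix holds N x N double-precision values, so N is roughly the square root of 90% of memory divided by 8 bytes, rounded down to a multiple of NB); it is not a script used in this study, and the 1 TB memory figure matches the R930 configuration above.

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        const double mem_bytes = 1024.0 * 1024 * 1024 * 1024;  /* 1 TB, as on the R930 */
        const double fraction  = 0.90;                          /* target ~90% of memory */
        const long   nb        = 192;                           /* block size used in the tests */

        long n = (long)sqrt(fraction * mem_bytes / 8.0);        /* 8 bytes per matrix element */
        n -= n % nb;                                            /* align N to the block size */

        printf("Suggested HPL problem size N = %ld\n", n);      /* ~351,552 for 1 TB */
        return 0;
    }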

      

    As shown in the graph above, LINPACK showed a 1.95X performance improvement with four Intel Xeon E7-8890 v3 processors on the R930 server in comparison to four Intel Xeon E7-4870 v2 processors on the R920 server. This was due to the substantial increase in the number of cores, memory speed and flops per cycle, along with improvements in processor architecture.
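
    As a rough sanity check on that ratio, the sketch below estimates the theoretical double-precision peak of each four-socket system. The per-core FLOPs-per-cycle figures (16 for Haswell with AVX2/FMA, 8 for Ivy Bridge with AVX) and the use of the 2.1 GHz AVX clock for the E7-8890 v3 are our own assumptions rather than numbers from the study, so treat the output as a ballpark rather than a specification.

    #include <stdio.h>

    int main(void) {
        /* sockets x cores x clock (Hz) x double-precision FLOPs per cycle */
        double r930_peak = 4 * 18 * 2.1e9 * 16 / 1e9;   /* E7-8890 v3, AVX clock (assumed) */
        double r920_peak = 4 * 15 * 2.3e9 *  8 / 1e9;   /* E7-4870 v2 (assumed)            */

        printf("R930 peak: %.0f GFLOPS\n", r930_peak);  /* ~2419 GFLOPS */
        printf("R920 peak: %.0f GFLOPS\n", r920_peak);  /* ~1104 GFLOPS */
        printf("ratio: %.2f\n", r930_peak / r920_peak); /* ~2.2x theoretical vs 1.95x measured */
        return 0;
    }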

    STREAM

    STREAM is a simple synthetic program that measures sustained memory bandwidth using the COPY, SCALE, SUM and TRIAD kernels.

    Operations of these programs are shown below:

    COPY:       a(i) = b(i)
    SCALE:      a(i) = q*b(i)
    SUM:        a(i) = b(i) + c(i)
    TRIAD:      a(i) = b(i) + q*c(i)
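
    As a concrete example of what these kernels look like in practice, below is a minimal OpenMP TRIAD sketch in C with a simple bandwidth estimate. It is not the official STREAM source, and the array size here is illustrative rather than the 1,800,000,000 elements used in the tests above; compile with -fopenmp.

    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    int main(void) {
        const size_t n = 80UL * 1000 * 1000;    /* illustrative array size */
        const double q = 3.0;
        double *a = malloc(n * sizeof(double));
        double *b = malloc(n * sizeof(double));
        double *c = malloc(n * sizeof(double));

        #pragma omp parallel for
        for (size_t i = 0; i < n; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

        double t0 = omp_get_wtime();
        #pragma omp parallel for
        for (size_t i = 0; i < n; i++)
            a[i] = b[i] + q * c[i];             /* TRIAD: a(i) = b(i) + q*c(i) */
        double t1 = omp_get_wtime();

        /* TRIAD touches three arrays per iteration: reads of b and c, a write of a. */
        double gbytes = 3.0 * n * sizeof(double) / 1e9;
        printf("TRIAD bandwidth: %.1f GB/s\n", gbytes / (t1 - t0));

        free(a); free(b); free(c);
        return 0;
    }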

    The chart above shows the comparison of sustained memory bandwidth between the PowerEdge R920 and PowerEdge R930 servers. STREAM measured 231 GB/s on the PowerEdge R920 and 260 GB/s on the PowerEdge R930, a 12% improvement in memory bandwidth. This increase is due to the higher DIMM speed available on the PowerEdge R930.

    WRF

    The WRF (Weather Research and Forecasting) model is a next-generation mesoscale numerical weather prediction system designed to serve both atmospheric research and operational forecasting needs. The model serves a wide range of meteorological applications across scales from tens of meters to thousands of kilometers. WRF can generate atmospheric simulations based on real data (observations, analyses) or idealized conditions.

    The WRF performance analysis was run with the Conus12KM dataset. Conus12KM is a single-domain, medium-size, 48-hour, 12-km resolution case over the continental US (CONUS) domain with a time step of 72 seconds.

     

    With the Conus12KM dataset, WRF showed an average time per step of 0.22 seconds on the PowerEdge R930 versus 0.26 seconds on the PowerEdge R920, an 18% improvement.

    ANSYS Fluent

    ANSYS Fluent contains broad physical modeling capabilities for modeling flow, turbulence, heat transfer and reactions for industrial applications ranging from air flow over an aircraft wing to combustion in a furnace, from bubble columns to oil platforms, from blood flow to semiconductor manufacturing, and from clean room design to wastewater treatment plants.

            

           

    We used four different datasets for Fluent and considered the 'Solver rating' (higher is better) as the performance metric. For all the test cases, the PowerEdge R930 showed a 24% to 29% performance improvement in comparison to the PowerEdge R920.

    Conclusion

    The PowerEdge R930 server outperforms the previous-generation PowerEdge R920 server in both the benchmark and application comparisons. Thanks to the latest processors with a higher number of cores, higher-frequency memory and CPU architecture improvements, the PowerEdge R930 delivers better performance than the PowerEdge R920. The PowerEdge R930 platform with four Intel Xeon EX processors is a very good choice for HPC applications that can scale up to a large number of cores and a large memory footprint.

    Reference

    http://en.community.dell.com/techcenter/high-performance-computing/b/general_hpc/archive/2014/05/21/hpc-application-performance-study-on-4s-srvers