High Performance Computing Blogs

High Performance Computing
A discussion venue for all things high performance computing (HPC), supercomputing, and the technologies that enable scientific research and discovery.
  • Need for Speed: Comparing FDR and EDR InfiniBand (Part 1)

    By Olumide Olusanya and Munira Hussain

    The goal of this blog is to evaluate the performance of Mellanox Technologies’ FDR (Fourteen Data Rate) InfiniBand and their latest EDR (Enhanced Data Rate) InfiniBand, with speeds of 56Gb/s and 100Gb/s respectively. This is the first part of a two-part blog series in which we show how these interconnects perform on an HPC cluster using industry-standard micro-benchmarks and applications. In this part, we show latency, bandwidth and HPL results for FDR vs EDR; in part 2 we will share more results from other applications, including ANSYS Fluent, WRF, and NAS Parallel Benchmarks. Keep in mind that while some applications benefit from the higher bandwidth of EDR, applications with low communication overhead will show little performance improvement.

    General Overview

    Mellanox EDR adapters are based on a new-generation ASIC known as ConnectX-4, while the FDR adapters are based on ConnectX-3. The theoretical unidirectional bandwidth is 100 Gb/s for EDR versus 56 Gb/s for FDR. Another difference is that EDR adapters are x16 adapters, while FDR adapters are available in x8 and x16. Both adapters operate at a 4X link width. The messaging rate for EDR can reach up to 150 million messages per second, compared with FDR ConnectX-3 adapters, which deliver more than 90 million messages per second.

    Table 1 below shows the difference between EDR and FDR and Table 2 describes the configuration of the cluster used in the test while Table 3 lists the applications and benchmarks used for this test.

                            Table 1 - Difference between EDR and FDR

                          FDR                  EDR
    PCIe interface        x8 and x16 Gen3      x16 Gen3
    Theoretical BW        56 Gb/s              100 Gb/s
    Messaging rate        90 MMS               150 MMS





                                           Table 2 - Cluster configuration

    Servers: 16 nodes x PowerEdge C6320 [4 chassis]
    Processors: Intel® Xeon® E5-2660 v3 @ 2.6/2.2 GHz, 10 cores, 105W
    Memory: 128 GB – 8 x 16 GB @ 2133MHz
    Operating System: Red Hat Enterprise Linux Server release 6.6.z (Santiago)
    MPI: Intel® MPI

    BIOS settings

    • System Profile: Performance Optimized
    • Turbomode: Enabled
    • Cstates: Disabled
    • Nodeinterleave: Disabled
    • Hyper threading: Disabled
    • Snoop mode: Early/Home/COD snoop

    EDR interconnect

    • Mellanox ConnectX-4 EDR 100Gbps
    • Mellanox Switch-IB SB7790
    • PCI-E x16 Gen3 riser slot
    • HCA firmware: 12.0012.1100
    • PSID: MT_2180110032

    FDR interconnect

    • Mellanox ConnectX-3 FDR 56Gbps
    • Mellanox SwitchX SX6025
    • PCI-E x8 Gen3 Mezz slot
    • HCA firmware: 2.30.8000
    • PSID: DEL0A30000019



    Table 3 - Applications and Benchmarks

    Application               Domain                              Version                   Benchmark
    OSU Micro-Benchmarks      Efficiency of MPI implementation    From Mellanox OFED 3.1    Latency, Bandwidth
    HPL                       Random dense linear system          From Intel MKL            Problem size 90% of total memory
    ANSYS Fluent              Computational Fluid Dynamics
    WRF                       Weather Research and Forecasting                              Conus 12km
    NAS Parallel Benchmarks   Computational Fluid Dynamics                                  CG, MG, IS, FT



    OSU Micro-Benchmarks

    To find the latency and bandwidth, we used the tests from the OSU Micro-Benchmark suite. These tests measure MPI message-passing performance to assess the quality of a network fabric. Using the same system configuration for the EDR and FDR fabrics, we got the latency results shown in Figure 1 below.

                                                     Figure 1 - OSU Latency (using MPI from Mellanox HPC-X Toolkit)

    Figure 1 shows a simple OSU node-to-node latency result for EDR vs FDR. Latency is typically quoted from the lowest data point (usually the smallest message size), so lower is better. In the OSU latency graph above, EDR shows a latency of 0.80us while FDR shows 0.81us. As the message size increases past 512 bytes, the gap widens in EDR’s favor: at a 4KB message size, EDR measures 2.75us compared with FDR’s 2.84us. In a further latency study using RDMA, EDR measured 0.61us and FDR measured 0.65us.

    Figure 2 below plots the OSU unidirectional and bidirectional bandwidth achieved by both EDR and FDR at message sizes from 1 byte to 4MB.

                                     Figure 2 - OSU Bandwidth (using MPI from Mellanox HPC-X Toolkit)

    OSU unidirectional bandwidth is a streaming test in which the sender sends a fixed number of fixed-size messages back-to-back to a receiver, and the receiver replies only after receiving all of them. This test measures the maximum one-way data rate of the network, or the unidirectional bandwidth. The result is taken from the bandwidth achieved at the maximum message size, which is 4MB. In the above test, EDR achieves a maximum unidirectional data rate of 12.4GB/s (99.2Gb/s) and FDR achieves 6.3GB/s (50.4Gb/s). This is a 97% performance improvement for EDR over FDR.

    OSU bidirectional bandwidth is very similar to the unidirectional test, but in this case, both nodes send messages to each other and await a reply. From the above graph, EDR achieves a bidirectional data rate of 24.2GB/s (193.6Gb/s) compared with FDR’s 10.8GB/s (86.4Gb/s) which gives us a 124% improvement with EDR over FDR.
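
    The unit conversions and percentage gains quoted above can be reproduced with a few lines of Python. This is a minimal sketch that uses only the figures reported in this post; the function names are ours, chosen for illustration.

```python
# Reproduce the bandwidth arithmetic quoted in the text.
# Function names are illustrative; the input figures come from the post.

def gbytes_to_gbits(gb_per_s: float) -> float:
    """Convert gigabytes per second to gigabits per second (8 bits per byte)."""
    return gb_per_s * 8.0


def percent_improvement(new: float, old: float) -> float:
    """Relative gain of `new` over `old`, in percent."""
    return (new / old - 1.0) * 100.0


# Measured OSU bandwidth in GB/s: (EDR, FDR)
results = {
    "unidirectional": (12.4, 6.3),
    "bidirectional": (24.2, 10.8),
}

for test, (edr, fdr) in results.items():
    print(
        f"{test}: EDR {gbytes_to_gbits(edr):.1f} Gb/s, "
        f"FDR {gbytes_to_gbits(fdr):.1f} Gb/s, "
        f"gain {percent_improvement(edr, fdr):.0f}%"
    )
```

    Rounded to whole percentages, the two gains come out at 97% and 124%, matching the figures in the text.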


    Figure 3 below shows the HPL performance between EDR and FDR using COD (Cluster on Die) snoop mode. Previous studies have shown that COD gives the best performance over Home and Early snoop.


                                                              Figure 3 - HPL Performance

    HPL is a compute-intensive benchmark. It can spend more than 80% of its runtime on computation, depending on how it is tuned. During the bulk of its communication time, it sends small messages across the cluster, which may not benefit from a higher speed network. Hence, you should not expect a large performance difference between EDR and FDR. Although EDR appears to perform slightly better than FDR, by 0.33% in the 80-core run, this difference is within our run-to-run variation for successive tests with either EDR or FDR. As a result, this performance gain cannot be attributed to an EDR advantage. This also makes it difficult to measure accurately the effect of one interconnect over the other with HPL.
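
    For readers who want to size a similar HPL run, the problem size N is usually derived from the memory footprint of the N x N double-precision matrix. The sketch below assumes 8 bytes per matrix element, 1 GB = 2^30 bytes, 128 GB of memory per node, and a block size NB of 192. The block size, the per-node reading of the memory figure, and the function name are our assumptions, not settings reported for these tests.

```python
import math


def hpl_problem_size(nodes: int, mem_gb_per_node: float,
                     fraction: float = 0.90, nb: int = 192) -> int:
    """Estimate the HPL problem size N for a given memory budget.

    The N x N matrix of 8-byte doubles should occupy about `fraction`
    of total memory; N is rounded down to a multiple of the block size NB.
    NB = 192 is an assumed example value, not a setting from the post.
    """
    total_bytes = nodes * mem_gb_per_node * 2**30
    n = math.isqrt(int(fraction * total_bytes / 8))
    return n - (n % nb)


# 128 GB per node (assumed per-node figure); node counts are example values.
for nodes in (1, 4, 16):
    print(nodes, "node(s): N =", hpl_problem_size(nodes, 128))
```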


    From our tests so far, EDR has shown a clear bandwidth advantage when compared with FDR – 97% in unidirectional and 124% in bidirectional bandwidth. In the second part of this blog, we will share more results from other applications (ANSYS Fluent, WRF, and NAS Parallel Benchmarks) to compare performance between EDR and FDR.


  • IoT Close to Becoming of Age

    CES 2016 proved that the Internet of Things (IoT) is a segment of the tech market that continues to grow – from connected homes, to connected cars, to connected cities and beyond.  Dell is working with customers on over 150 IoT projects that range from solving simple organizational issues, to leveraging technology as a critical competitive advantage.

    Compass Intelligence is a market analytics and consulting firm that specializes in metrics-driven market intelligence and insights for the mobile, IoT, and high-tech industries. For the past three years the company has recognized the year’s best in mobile devices and software, wireless communications, Internet of Things, wearables, green technology and connected products. Dell was thrilled to be one of the 2016 honorees, for its dedication to making the IoT market a reality. 

    A good example of the collaboration taking place in the market is the Thread Group, a global ecosystem of developers, retailers, and customers, who are working together to create a better way to connect products to the home, educate those who are not in the know, and simplify the process through the most innovative technology at hand.  For more, you can read my blog, “Pushing Through the Hype and Handwringing of IoT.”

  • A Look Back at SC15 and Looking Forward to 2016

    By Christine Fronczak

    It’s been a great year for our community, with the industry maturing sufficiently to push HPC firmly back into the limelight. We are seeing a large part of the market evolving beyond the traditional stereotype of HPC – that of being solely for the science and super-geeky technology audience. Big data is a prime example of a market that demands high-powered computation across a wide range of verticals, from retail to manufacturing and financial services.

    SC15 enabled us to bring some of the industry’s best together so we could learn all the different ways HPC is being used to better our world. From finding a cure for autism and rare childhood diseases to rooting out fraud and plagiarism, to creating clean energy, and helping third world countries develop better food and foster entrepreneurship.

    We heard some very moving and intriguing use cases from the likes of General Atomics, The University of Florida, Virginia Tech, TGen, University of Maryland and Johns Hopkins, as well as Oxford University among others.  These have been posted for your viewing on our YouTube Channel. We also hosted Intersect360’s Addison Snell and Dell’s own Onur Celebioglu who walked us through trends they see in the market and what to expect in the coming year. Their talks are also accessible on the SC15 YouTube playlist. InsideHPC’s Rich Brueckner joined us to moderate a panel discussion on the convergence of HPC, Big Data and Cloud, followed by a discussion on the NSCI initiative. Created by President Obama in July 2015, the National Strategic Computing Initiative has a mission to ensure the United States continues leading high performance computing over the coming decades.  Both of these panels are available on the same playlist as the others, and can also be accessed via the InsideHPC website.

    We will continue to follow these stories throughout the year, as well as giving you insight into the people behind the scenes making the “magic” happen. Intersect360’s Addison Snell has said that HPC is “a critical pillar of innovation and advancement, whether you’re talking about general scientific research or throughout different industries.” In 2016 we will watch and explore the trends and foster discussion on the successes and failures of our community in order to propel the industry forward. In the meantime, we leave you with a glimpse and some insights into SC15 and wish you a very Happy New Year!

    Highlights from SC15:

  • Meeting the Demands of HPC and Big Data Applications by Leveraging Hybrid CPU/GPU Computing

    “Rack ‘em and stack ‘em.”— a winning approach for a long time but not without its limitations. A generalized server solution works best when the applications running on those servers have generalized needs.

    Enter “Big Data.” Today’s application and workload environments can be required to process massive amounts of granular data and, thus, often consist of applications that place high demands on different server hardware elements. Some applications are very compute intensive and place a high demand on the server’s CPU where others in the same environment are tasked with unique processing requirements performed on specialized graphical processing units (GPUs).

    Whether it is customer, demographic, seismic data — or a whole host of other uses — the number crunching and processing required across the suite of applications can result in processing demands that are radically different from demands of prior years. Enter Hybrid High Performance Computing. These systems are built to serve two masters: CPU-intensive applications and GPU-intensive applications delivering a hybrid environment where workloads can be optimized and run-times reduced through ideal resource utilization.

    The results of Hybrid CPU/GPU Computing adoption have been impressive. Just a few examples of how Hybrid CPU/GPU Computing is delivering real value include:

    • Optimized workloads across CPU/GPU servers
    • The highest density and highest performance in a small footprint
    • Significant power, cooling and resource utilization benefits

    You can learn more about leveraging hybrid CPU/GPU computing in this whitepaper.

  • Counting Down to SC15

    - by Stephen Sofhauser

    The countdown to SC15 has started, and we at Dell are very excited for this year’s event. We have a lot to share with you all this year, and it’s particularly special to us because it’s right in our back yard.  Come visit us at booth #1009. Our aim is to show you the true meaning of “Texas Friendly” with great demos, two customer theaters that will feature a stellar lineup of speakers, panel discussions, new products and solutions we’re bringing to market, and of course, let’s not forget the awesome food and entertainment Austin has to offer!

    We have a great morning series for you, featuring our director of HPC Engineering, Onur Celebioglu, and Intersect360’s Addison Snell. Every morning of the show (10:15 a.m. – 10:45 a.m.), they will be discussing trends and tech in the HPC market.

    There will be two afternoon panel discussions, hosted by insideHPC’s Richard Brueckner. I recently spoke to him about the upcoming conference; you can listen to the podcast here.

    The first panel is from 1:30 p.m. to 2:30 p.m. on Wednesday, November 18th, “All Together Now: The Convergence of Big Data, Cloud, and HPC.” This should prove to be an interesting discussion with Richard and our four panelists: Wojtek Goscinski, Ph.D. (Monash University), Niall Gaffney, Ph.D. (Texas Advanced Computing Center (TACC)), Craig Stewart, Ph.D. (Indiana University), and Andrew Rutherford (Microsoft).  The concept of the panel was born from discussions with customers who after years of siloed workloads, are trying to figure out how to best integrate their Big Data, Cloud, and HPC.

    Wednesday's second panel should be equally compelling as Richard and the panelists discuss how the NSCI is fostering more collaboration between government, academia, and industry. “More than Just Exascale: How the NSCI Will Make HPC More Accessible to All” (3:00 p.m. – 4:00 p.m). This all comes as a result of the White House initiative, seeking to keep the United States at the forefront of HPC capabilities. What’s nice is that most of these speakers know each other. We will be featuring Dan Stanzione, Ph.D. (Texas Advanced Computing Center TACC), Mike Norman, Ph.D. (San Diego Supercomputing Center SDSC), Dave Lifka, Ph.D. (Cornell University), and Merle Giles (National Center for Supercomputing Applications NCSA) – who has a book on HPC best practices, Industrial Applications of High-Performance Computing: Best Global Practices.

    We have so much more planned for you, including talks from customers like the San Diego Supercomputing Center, Virginia Bioinformatics Institute (VBI), and Cornell – just granted $5 million by the NSF to collaboratively develop a federated cloud. There’s just too much to cram it in one blog post – but you can check out http://dell.to/1PFgbwJ for more information. We look forward to seeing you (booth #1009) and hope you have a great SC15, and enjoy everything Dell has to offer! And welcome to Austin!



  • A Great Lineup for SC15

    Join Dell in Booth #1009!

    This year promises to be another great one at SC, chock full of great speakers and panels – you won’t be disappointed! 

    Join us at the Dell Sixth Street Theater Tuesday, Wednesday, and Thursday morning from 10:15 a.m. – 10:45 a.m. for authentic Texas burritos and engaging conversation from Intersect360’s CEO, Addison Snell, and Dell’s own Onur Celebioglu. They will be discussing why HPC is now important to a broader group of use cases, and dig deep into overviews of HPC for research, life sciences and manufacturing. Participants will learn about types of application characterization, best practices and examples of engineered solutions that are appropriate for these specific verticals. Come learn more about why HPC, big data and cloud are converging, and how Dell solves challenges in our HPC engineering lab and through collaborative work with other leading technology partners and research institutions.

    We will also have two informative panels hosted by Rich Brueckner, president, insideHPC. Recently, Dell’s North American Sales Director for HPC spoke with Brueckner about these panels - you can listen to the podcast here.

    All Together Now: The Convergence of Big Data, Cloud, and HPC

    Wednesday, Nov 18, 1:30 p.m. – 2:30 p.m.

    • Wojtek Goscinski, Ph.D. (Monash University)
    • Niall Gaffney, Ph.D. (Texas Advanced Computing Center (TACC))
    • Craig Stewart, Ph.D. (Indiana University)
    • Andrew Rutherford (Microsoft)

    Modeling and simulation have been the primary usage of high performance computing (HPC). But the world is changing. We now see the need for rapid, accurate insights from large amounts of data. To accomplish this, HPC technology is repurposed. Likewise the location where the work gets done is not entirely the same either. Many workloads are migrating to massive cloud data centers because of the speed of execution. In this panel, leaders in computing will share how they, and others, integrate tradition and innovation (HPC technologies, Big Data analytics, and Cloud Computing) to achieve more discoveries and drive business outcomes.

    More than Just Exascale: How the NSCI Will Make HPC More Accessible to All 

    Wednesday, Nov 18, 3:00 p.m. – 4:00 p.m.

    • Dan Stanzione, Ph.D. (Texas Advanced Computing Center TACC)
    • Mike Norman, Ph.D. (San Diego Supercomputing Center SDSC)
    • Dave Lifka, Ph.D. (Cornell University)
    • Merle Giles (National Center for Supercomputing Applications NCSA)

    The US —and the world —took notice this summer when President Obama issued an Executive Order establishing the National Strategic Computing Initiative (NSCI). While most headlines focus on the exascale goals of this initiative, the NSCI presents a comprehensive set of objectives. Those objectives include advancing usage, capabilities and impact of HPC for decades to come. In this panel, you will hear from HPC leaders who are moving us forward in improving HPC application developer productivity and making high performance computing more accessible to all.

    Lastly, from 1:00 p.m. to 1:30 p.m. on Tuesday, Wednesday, and Thursday afternoons, you can enjoy tasty treats and great dialogue from:

    • Jeff Kirk, Sr. Principal Engineer, HPC Technologies, Office of the CTO
    • Joe Sekel, HPC Server Architect, Dell
    • Adnan Khaleel, Director Global Enterprise Sales Strategy

    Each day, they will discuss different aspects of how HPC adoption, scale, and scope continue to grow, driven by the need to solve more problems, larger problems, and new types of problems. Modeling and simulation remain important application cases; data analytics and machine learning are expanding the scope of HPC and the types of HPC systems; cloud computing is making HPC more accessible and on-demand. Dell has been a long-time leader in HPC clusters for modeling and simulation, and is now embarking on a path towards leadership in this broader context of HPC.



  • Accelerating HPC applications using K80 GPUs

    - By Mayura Deshmukh

    Every year, Graphics Processing Units (GPUs) become more powerful and achieve more teraflops, giving a quantum leap in performance for commonly used molecular dynamics and manufacturing codes and allowing researchers to use more efficient and denser high performance computing architectures. What is the performance difference between CPU and GPU? How much power is consumed? How well do K80 GPUs perform with the Dell C4130 server? Which configuration is best for my application? These are some of the questions that come to mind, and this blog aims to answer these and related questions.

    This blog presents the work conducted to measure and analyze the performance, power consumption and performance per watt of a single Dell PowerEdge C4130 server with NVIDIA K80 GPUs. The PowerEdge C4130 server is the latest high-density GPU design from Dell, offering up to four GPUs in a 1U form factor. What makes the PowerEdge C4130 unique is its configurable system design, which potentially makes it a better fit for a wider variety of extreme HPC applications.

    The HPC-focused Tesla K80 GPU provides 1.87 TFLOPS of double-precision compute capacity, about 31% more than the K40, the previous Tesla card. The K40’s base clock is 745MHz, though it can be boosted to 810MHz or 875MHz. The K80 has a base clock of 562MHz, but it can climb to 875MHz in 13MHz increments. Another new feature of the K80 is Autoboost, which provides additional performance when power and thermal headroom are available. The K80’s two internal GPUs are based on the GK210 architecture and have a total of 4,992 cores, a 73% increase over the K40. The K80 has a total of 24GB of memory, divided equally between the two internal GPUs; this is 100% more memory capacity than the K40. The memory bandwidth of the K80 is improved to 480 GB/s. The rated maximum power consumption of a single K80 is 300 watts.
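
    The percentage gains quoted above follow from the two cards’ headline specifications. The sketch below recomputes them. The K80 figures come from this post, while the K40 figures (1.43 TFLOPS double precision, 2,880 cores, 12 GB) are taken from NVIDIA’s public specifications rather than from this post and should be treated as assumptions.

```python
# K80 values are from the text; K40 values are assumed from public specs.
K80 = {"dp_tflops": 1.87, "cuda_cores": 4992, "memory_gb": 24}
K40 = {"dp_tflops": 1.43, "cuda_cores": 2880, "memory_gb": 12}


def gain_percent(new: float, old: float) -> float:
    """Percentage increase of `new` relative to `old`."""
    return (new - old) / old * 100.0


for key in K80:
    print(f"{key}: +{gain_percent(K80[key], K40[key]):.0f}%")
```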


    The C4130 offers eight configurations, “A” through “H”. Since GPUs provide the bulk of the compute horsepower, the configurations can be divided into three groups based on expected performance: a first group of four configurations (“A”, “B”, “C” and “G”) with four GPUs each, a second group consisting of the single configuration “H” with three GPUs, and a third group of three configurations (“D”, “E” and “F”) with two GPUs each. The quad GPU configurations “A”, “B” and “G” have an internal PCIe switch module. The details of the various configurations are shown in Table 1 and the block diagram (Figure 1) below:

    Table 1: C4130 Configurations


      Figure 1: C4130 Configuration Block Diagram

    Table 2 gives more information about the hardware configuration, profiles and firmware used for the benchmarking.

    Table 2: Hardware Configuration



    CUDA’s heterogeneous programming model uses both the CPU and GPU, so data transfer between CPUs and GPUs greatly affects performance.

    Figure 2: Memory Bandwidth for C4130 



    Figure 2 shows the host-to-device (CPU → GPU) and device-to-host (CPU ← GPU) memory bandwidth for all the C4130 configurations. The bandwidth is around 12000 MB/s (the theoretical peak is 15754 MB/s).
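
    The quoted peak of 15754 MB/s matches the theoretical throughput of a PCIe Gen3 x16 link. The sketch below derives it, assuming 8 GT/s per lane and 128b/130b encoding (standard PCIe Gen3 parameters; the post itself does not state the link type), and compares it with the measured value of about 12000 MB/s.

```python
def pcie_gen3_peak_mb_s(lanes: int = 16) -> float:
    """Theoretical one-way PCIe Gen3 throughput in MB/s (1 MB = 10**6 bytes)."""
    transfers_per_s = 8e9        # 8 GT/s per lane
    encoding = 128 / 130         # 128b/130b line encoding
    bits_per_s = transfers_per_s * encoding * lanes
    return bits_per_s / 8 / 1e6


peak = pcie_gen3_peak_mb_s()
measured = 12000.0               # approximate figure quoted in the text
print(f"peak = {peak:.0f} MB/s, efficiency = {measured / peak:.0%}")
```

    Under these assumptions the measured host-to-device bandwidth is roughly three quarters of the theoretical link rate.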

    Nvidia’s GPUDirect Peer to Peer feature enables GPUs on the same PCIe root complex to directly transfer data between their memories, avoiding any copies to system memory. This dramatically lowers CPU overhead, and reduces latency, resulting in significant performance improvements in data transfer time for applications. Without the peer to peer feature, to get data from one GPU to another on the same host, one would use cudaMemcpy() first to get the data from the GPU to system memory, then another cudaMemcpy() to get the same data onto the second GPU.

    Figure 3: Peer-to-peer Bandwidth for C4130

    Figure 3 shows the peer-to-peer communication between the GPUs for the C4130 with a switch module (Configuration B) vs the C4130 without a switch module (Configuration C - Dual CPUs, Balanced with four GPUs).

    • For configuration B the bandwidth is constant at 24.6 GB/s across all GPUs.
    • For configuration C the bandwidth is:
      • 24.6 GB/s for data transfers between GPUs on the same card (GPU1↔GPU2, GPU3↔GPU4, GPU5↔GPU6, GPU7↔GPU8)
      • 19.6 GB/s for data transfers between GPUs connected to the same CPU (GPU1,2↔GPU3,4; GPU5,6↔GPU7,8)
      • 18.7 GB/s for data transfers between GPUs connected to the other CPU (GPU1,2,3,4↔GPU5,6,7,8)

    Applications that require a lot of peer to peer communication can benefit from the high bandwidth offered by the C4130 switch module configurations (A, B, G).
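
    The three bandwidth tiers measured for configuration C can be expressed as a simple lookup on GPU position. This sketch assumes the numbering used in the list above (GPUs 1 to 8, two per K80 card, four per CPU); the function name is ours.

```python
def p2p_bandwidth_config_c(gpu_a: int, gpu_b: int) -> float:
    """Measured peer-to-peer bandwidth (GB/s) between two GPUs in configuration C.

    GPUs are numbered 1..8: consecutive pairs share a K80 card,
    and groups of four share a CPU.
    """
    if gpu_a == gpu_b or not all(1 <= g <= 8 for g in (gpu_a, gpu_b)):
        raise ValueError("expected two distinct GPU ids in 1..8")
    same_card = (gpu_a - 1) // 2 == (gpu_b - 1) // 2
    same_cpu = (gpu_a - 1) // 4 == (gpu_b - 1) // 4
    if same_card:
        return 24.6
    if same_cpu:
        return 19.6
    return 18.7


print(p2p_bandwidth_config_c(1, 2))  # same card
print(p2p_bandwidth_config_c(1, 3))  # same CPU, different card
print(p2p_bandwidth_config_c(2, 7))  # different CPUs
```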


    HPL solves a random dense linear system in double-precision arithmetic on distributed-memory systems and is a very compute-intensive benchmark. NVIDIA’s precompiled HPL, Intel MKL 2015 and OpenMPI 1.6.5 were used for the benchmarking. The problem size (N) used was ~90% of the system memory.

    Figure 4: HPL performance and power consumption with C4130



    The blue bars on the left graph in Figure 4 show the HPL performance characterization of the PowerEdge C4130. The results are reported in GFLOPS on the Y-axis of the graph.

    • Performance for the four GPU configurations “A”, “B”, “C” and “G” ranges from 6.5 to 7.3 TFLOPS. Configurations “C” and “G”, with two GPUs balanced per CPU, are the highest performing configurations at 7.3 TFLOPS. The performance difference between “A” and “B” can be attributed to the additional CPU in configuration “B”. The difference from “B” to “G” or “C” is due to different GPU to CPU ratios; all three have the same compute resources. Configurations “C” and “G” are balanced with two GPUs per CPU, while “B” has all four GPUs attached to a single CPU.
    • The only three GPU configuration, “H”, achieved 6.4 TFLOPS, which falls between the performance of the four GPU and two GPU configurations.
    • For the two GPU configurations, “D” is highest with 3.8 TFLOPS, followed by “E” and “F” with 3.6 TFLOPS. Configuration “E” has one fewer CPU than “D”, which explains the difference in performance.
    • Both “D” and “F” have two CPUs and two GPUs, but in configuration “F” both GPUs are connected to just one CPU, whereas in configuration “D” one GPU is connected to each CPU (more cores per GPU).

    Compared to CPU-only performance on two E5-2690 v3 processors, an acceleration of ~9X is obtained with four K80s, 7X with three K80s, and ~4.7X with two K80s. The HPL efficiency is significantly higher on the K80 (in the low to upper 80s) than on previous generations of GPUs.

    The red bars on the right graph in Figure 4 represent the power consumption for the HPL runs. The quad GPU configurations “A”, “B”, “C” and “G” consume significantly more power than the CPU-only runs, which is expected for compute-intensive loads. But the energy efficiency (calculated as performance per watt) of these configurations is 4+ GFLOPS/w, compared to 1.6 GFLOPS/w for the CPU-only HPL runs. The power consumption for the three GPU configuration “H” is 2.7X and its energy efficiency is 4.1 GFLOPS/w, which makes it an energy-efficient, lower-cost alternative to the quad GPU configurations. The dual GPU configurations “D”, “E” and “F” consume less power (1.8X to 2.1X compared to the CPU-only runs), and their energy efficiency is in the range of 3.5 GFLOPS/w to 3.9 GFLOPS/w, about 2.3X better than the CPU-only runs.


    NAMD is designed for high-performance simulation of large biomolecular systems. The ApoA1 benchmark (92224 atoms) models a high-density lipoprotein found in plasma, which helps extract cholesterol from tissues to the liver. F1ATPase (327506 atoms) is responsible for synthesizing the molecule adenosine triphosphate. STMV (Satellite Tobacco Mosaic Virus) is a small, icosahedral virus that worsens the symptoms of infection by tobacco mosaic virus. STMV is a large benchmark case with 1066628 atoms.

    Figure 5: NAMD performance and power consumption with C4130



    Figure 5 quantifies the performance and power consumption of NAMD for all the C4130 configurations compared to the CPU-only server (i.e. a server with two CPUs).

    • The acceleration on NAMD is sensitive to the number of CPUs and the memory available in the system. For example, there is a significant difference in acceleration between “A” and “B” for the quad GPU configurations and between “E” and “F” for the dual GPU configurations. This difference becomes more apparent as the problem size increases. “B”, which is similar to “A” but has an additional CPU and more memory, performs 43% better than “A”; “F”, which has an additional CPU and more memory compared to “E”, performs 26% better than “E”.
    • Among the four quad GPU configurations, NAMD performs best on configurations “C” and “G”. The difference between the two highest performing configurations and the others (“A” and “B”) is the manner in which the GPUs are attached to the CPUs. The balanced configurations “G” (with switch) and “C” (without switch) have two GPUs attached to each of the two CPUs, resulting in 7.8X acceleration over the CPU-only case. The same four GPUs attached via a switch module to a single CPU, configuration “B”, results in about 7.7X acceleration.
    • “H”, the three GPU configuration, falls in between the four GPU and two GPU configurations with 7.1X acceleration over the CPU-only configuration. “H”, with an extra CPU and more memory, performs better than the four GPU configuration “A”.
    • “D” and “F”, with 2 CPUs and 2 GPUs, perform better, with 5.9X acceleration compared to 4.4X for configuration “E” (1 CPU and 2 GPUs).

    As shown in the right graph of Figure 5, the power consumption for the quad GPU configurations is ~2.3X, with accelerations from 4.4X to 7.8X, so the energy efficiency (performance per watt) ranges from 2.0X to 3.4X. Configurations “C” and “G”, along with providing the best performance, also do well from an energy efficiency perspective among the quad GPU configurations (an acceleration of 7.8X for 2.3X more power). Configuration “H”, with three GPUs, is more energy-efficient than the quad GPU configurations, with performance per watt of 3.7X, providing 7.1X acceleration with only 1.9X more power. Configuration “F” is the most energy-efficient configuration, consuming only 1.5X more power with performance per watt of 3.8X.
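
    The relative performance-per-watt figures above are simply the acceleration divided by the relative power draw, both measured against the CPU-only server. A minimal sketch using the NAMD numbers quoted in this section (the function name is ours):

```python
def relative_perf_per_watt(acceleration: float, relative_power: float) -> float:
    """Energy-efficiency gain over the CPU-only baseline.

    Both inputs are ratios against the same CPU-only run.
    """
    return acceleration / relative_power


# (acceleration, relative power) as quoted in the text
namd = {
    "C/G (four GPUs)": (7.8, 2.3),
    "H (three GPUs)": (7.1, 1.9),
}

for config, (accel, power) in namd.items():
    print(f"{config}: {relative_perf_per_watt(accel, power):.1f}X")
```

    The two results round to 3.4X and 3.7X, in line with the values read from Figure 5.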

    ANSYS Fluent

    ANSYS Fluent is a computational fluid dynamics application used for fluid flow design and engineering analysis. The equation solvers used to drive the simulation are computationally intensive. Approximately 3 GB of GPU memory is required for a 1M cell simulation. The benchmarks run are the ANSYS pipes 1.2M and 9.6M steady state, non-combustive cases.

    Figure 6: ANSYS Fluent performance and power consumption with C4130



    The left graph in Figure 6 shows the performance of ANSYS Fluent compared to 4 CPU cores. The code performs best on configurations with a 1:2 CPU:GPU ratio.

    • The quad GPU configurations provide 3.9-4.4X acceleration compared to tests run on 4 CPU cores. Configurations “C” and “G” provide the best performance among the four GPU configurations.
    • The three GPU configuration “H” provides 3.7X acceleration.
    • The dual GPU configuration “E”, with two GPUs connected to a single CPU, provides the best acceleration (2.8X) among the dual GPU configurations.

    In Figure 6, the right graph shows the power consumption for all the configurations relative to the power consumed when the benchmarks were run on 4 CPU cores. The numbers in yellow at the bottom of the bars indicate the relative performance per watt for each configuration. The quad GPU configurations consume 3.7X-3.9X more power and provide 3%-20% more performance per watt. The three GPU configuration “H” is the most energy-efficient: it consumes 2.8X more power but provides the most performance per watt of all the configurations (32% more than the 4 core runs). The dual GPU configurations consume 2.1X-2.5X more power, and their energy efficiency is 7%-28% better.

    Fluent scales well on CPU cores, so to understand the benefit of using GPUs we experimented with the same number of licenses, running the benchmark on CPU cores only vs running it on CPU + GPU.

    Figure 7: ANSYS Fluent optimizing licensing costs



    Figure 7 shows the data for the 1.2M and 9.6M Fluent benchmarks run on CPU cores only versus the quad GPU configurations “A”, “B” and “C”. The Y-axis shows the benchmark output, which is the wall clock time (lower is better), and the X-axis shows the number of CPU cores used for the test (that is, the number of Fluent licenses required). As shown in Figure 7, with 24 licenses the GPU approach (16 cores + 8 GPUs) provides 48% better performance than using just 24 CPU cores for the 9.6M benchmark and 25% better performance for the 1.2M benchmark. Similarly, Table 3 shows the performance benefit of the GPU approach versus the CPU approach for 24, 20, 16, 12 and 8 licenses for the 9.6M and 1.2M benchmark cases.

    Table 3: Fluent GPU vs CPU approach with same number of licenses



    The C4130 server with NVIDIA Tesla K80 GPUs demonstrates exceptional performance and power-efficiency gains for compute intensive workloads and applications like NAMD and Fluent. Fluent scales very well on CPU cores, but depending on your problem and licensing model there is a definite performance benefit to using GPUs. Applications that do a lot of GPU peer-to-peer communication can gain from the higher bandwidth offered by the C4130 switch configurations.

  • Application Performance Study on Intel Haswell EX Processors

    by Ashish Kumar Singh

    This blog describes, in detail, the performance study carried out on the E7-8800 v3 family of processors (architecture codenamed Haswell-EX). The performance of the Intel Xeon E7-8800 v3 has been compared to that of the Intel Xeon E7-4800 v2 to ascertain the generation-over-generation performance improvement. The applications used for this study are HPL, STREAM, WRF and ANSYS Fluent. The Intel Xeon E7-8890 v3 processors have 18 cores/36 threads with 45MB of L3 cache (2.5MB/slice). With AVX workloads, the clock speed of the Intel Xeon E7-8890 v3 is reduced from 2.5GHz to 2.1GHz. These processors support a QPI speed of 9.6 GT/s.

    Server Configuration | PowerEdge R920 | PowerEdge R930
    Processor | 4 x Intel Xeon E7-4870v2 @ 2.3GHz (15 cores) 30MB L3 cache 130W | 4 x Intel Xeon E7-8890v3 @ 2.5GHz (18 cores) 45MB L3 cache 165W
    Memory | 512GB = 32 x 16GB DDR3 @ 1333MHz RDIMMs | 1024GB = 64 x 16GB DDR4 @ 1600MHz RDIMMs

    BIOS Settings | PowerEdge R920 | PowerEdge R930
    BIOS | Version 1.1.0 | Version 1.0.9
    Processor Settings > Logical Processors | |
    Processor Settings > QPI Speed | Maximum Data Rate | Maximum Data Rate
    Processor Settings > System Profile | |

    Software and Firmware | PowerEdge R920 | PowerEdge R930
    Operating System | RHEL 6.5 x86_64 | RHEL 6.6 x86_64
    Intel Compiler | Version 14.0.2 | Version 15.0.2
    Intel MKL | Version 11.1 | Version 11.2
    Intel MPI | Version 4.1 | Version 5.0

    Benchmark and Applications | PowerEdge R920 | PowerEdge R930
    HPL | V2.1 from MKL 11.1 | V2.1 from MKL 11.2
    STREAM | v5.10, Array Size 1800000000, Iterations 100 | v5.10, Array Size 1800000000, Iterations 100
    WRF | v3.5.1, Input Data Conus12KM, Netcdf- | v3.6.1, Input Data Conus12KM, Netcdf-4.3.2
    ANSYS Fluent | v15, Input Data: eddy_417k, truck_poly_14m, sedan_4m, aircraft_2m | v15, Input Data: eddy_417k, truck_poly_14m, sedan_4m, aircraft_2m

    The objective of this comparison was to show the generation-over-generation performance improvement in the enterprise 4S platforms. The performance differences between the two server generations come from improvements in system architecture, a greater number of cores and higher frequency memory. The software versions were not a significant factor.


    High Performance LINPACK (HPL) is a benchmark that solves a (random) dense linear system in double precision (64-bit) arithmetic on distributed memory systems. The HPL benchmark was run on both PowerEdge R930 and PowerEdge R920 with a block size of NB=192 and a problem size N chosen to use 90% of total memory.
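    As a rough illustration of how a problem size can be derived from a memory budget, the sketch below picks the largest N that is a multiple of NB and whose N x N double precision matrix fits in 90% of memory. This is a common rule of thumb, not necessarily the exact procedure used for these runs; the function name is ours, and treating 1GB as 2^30 bytes is an assumption.

```python
import math


def hpl_problem_size(memory_gib: int, nb: int = 192, fraction: float = 0.90) -> int:
    """Largest N (a multiple of nb) whose N x N matrix of doubles fits in
    the given fraction of memory. Assumes 1 GiB = 2**30 bytes."""
    usable_bytes = int(memory_gib * 2**30 * fraction)
    n_max = math.isqrt(usable_bytes // 8)  # 8 bytes per double precision element
    return (n_max // nb) * nb              # round down to a multiple of the block size


# Memory sizes taken from the server configuration table.
for server, mem in (("PowerEdge R920", 512), ("PowerEdge R930", 1024)):
    print(server, hpl_problem_size(mem))
```

    Rounding N down to a multiple of NB keeps every block full, which is why the block size appears in the calculation at all.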


    As shown in the graph above, LINPACK showed a 1.95X performance improvement with four Intel Xeon E7-8890 v3 processors on the R930 server in comparison to four Intel Xeon E7-4870 v2 processors on the R920 server. This was due to the substantial increase in the number of cores, memory speed and flops/cycle of the processor, together with the improved processor architecture.


    STREAM is a simple synthetic program that measures sustained memory bandwidth using the COPY, SCALE, SUM and TRIAD kernels.

    The operations of these kernels are shown below:

    COPY:       a(i) = b(i)
    SCALE:      a(i) = q*b(i)
    SUM:        a(i) = b(i) + c(i)
    TRIAD:      a(i) = b(i) + q*c(i)
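    To make the four operations concrete, here is a small pure-Python sketch of the kernels together with the bytes-per-element accounting usually applied to them (one 8-byte double per array touched). It is for illustration only: the real benchmark is a compiled, multi-threaded program, and the function names here are our own.

```python
def stream_kernels(b, c, q):
    """Apply the four STREAM operations element-wise and return each result."""
    n = len(b)
    return {
        "COPY": [b[i] for i in range(n)],
        "SCALE": [q * b[i] for i in range(n)],
        "SUM": [b[i] + c[i] for i in range(n)],
        "TRIAD": [b[i] + q * c[i] for i in range(n)],
    }


# Bytes moved per element with 8-byte doubles: arrays read plus the one array written.
BYTES_PER_ELEMENT = {"COPY": 16, "SCALE": 16, "SUM": 24, "TRIAD": 24}


def bandwidth_gb_per_s(kernel, n_elements, seconds):
    """Sustained bandwidth in GB/s for one pass of a kernel over n_elements."""
    return BYTES_PER_ELEMENT[kernel] * n_elements / seconds / 1e9


results = stream_kernels([1.0, 2.0], [3.0, 4.0], q=2.0)
print(results["TRIAD"])
```

    Because SUM and TRIAD touch three arrays instead of two, they move 24 bytes per element rather than 16, which is why the kernels are reported separately.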

    This chart shows the comparison of sustained memory bandwidth between the PowerEdge R920 and PowerEdge R930 servers. STREAM showed 231GB/s on PowerEdge R920 and 260GB/s on PowerEdge R930, which is roughly a 12% improvement in memory bandwidth. This increase is due to the higher DIMM speed available on PowerEdge R930.


    The WRF (Weather Research and Forecasting) model is a next-generation mesoscale numerical weather prediction system designed to serve both atmospheric research and operational forecasting needs. The model serves a wide range of meteorological applications across scales from tens of meters to thousands of kilometers. WRF can generate atmospheric simulations based on real data (observations, analysis) or idealized conditions.

    WRF performance analysis was run for the Conus12KM dataset. Conus12KM is a single domain, medium size, 48-hour, 12KM resolution case over the continental US (CONUS) domain with a time step of 72 seconds.


    With the Conus12KM dataset, WRF showed an average time of 0.22 seconds on the PowerEdge R930 server versus 0.26 seconds on the PowerEdge R920 server, which is an 18% improvement.
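    The generation-over-generation percentages can be checked with a few lines of arithmetic. The sketch below is our own helper, not part of any benchmark; it treats bandwidth as higher-is-better and average time as lower-is-better, and uses the rounded figures quoted in the text.

```python
def improvement_pct(old: float, new: float, lower_is_better: bool = False) -> float:
    """Percentage improvement of the newer result over the older one."""
    ratio = old / new if lower_is_better else new / old
    return (ratio - 1.0) * 100.0


stream = improvement_pct(231, 260)                      # GB/s, higher is better
wrf = improvement_pct(0.26, 0.22, lower_is_better=True)  # average time in seconds
print(f"STREAM: {stream:.1f}%  WRF: {wrf:.1f}%")
```

    Using the rounded inputs, the results land within a point of the quoted 12% and 18%; the small differences are what one would expect from rounding in the published numbers.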

    ANSYS Fluent

    ANSYS Fluent contains the broad physical modeling capabilities for model flow, turbulence, heat transfer, and reactions for industrial applications ranging from air flow over an aircraft wing to combustion in a furnace, from bubble columns to oil platforms, from blood flow to semiconductor manufacturing, and from clean room design to wastewater treatment plants.



    We used four different datasets for Fluent, with ‘Solver rating’ (higher is better) as the performance metric. For all the test cases, Fluent on PowerEdge R930 showed a 24% to 29% performance improvement in comparison to PowerEdge R920.


    The PowerEdge R930 server outperforms its previous generation, the PowerEdge R920 server, in both the benchmark and application comparisons. The latest processors with a higher number of cores, higher frequency memory and CPU architecture improvements give the PowerEdge R930 better performance than the PowerEdge R920. The PowerEdge R930 platform with four Intel Xeon EX processors is a very good choice for HPC applications that can scale up to a large number of cores and a large amount of memory.







  • The Right Mix for Today’s Data Environments

    Three takeaways from Dell’s John Whittaker on leveraging both big data analytics and traditional database management tools today...

    Delving into the results of a recent Unisphere Research survey of 300 database administrators (DBAs) and corporate data managers, Dell’s Executive Director of Information Management John Whittaker gives straightforward advice for tackling today’s complex database environments in an Industry Perspectives article on the Data Center Knowledge site.

    While organizations and DBAs have become hyper-focused on big data, analytics, and unstructured data tools, Whittaker gives a timely reminder that structured data still matters.

    Indeed, according to the Unisphere survey he references, structured data still accounts for 75 percent of the data stack at more than two-thirds of today’s enterprises. What’s more, nearly one-third of all organizations haven’t begun actively managing unstructured data at all to this point.

    That means paying attention to tools like Oracle and Microsoft SQL Server still needs to be a priority for DBAs, even as they try to incorporate Hadoop and NoSQL into their organizations.

    But that doesn’t mean Whittaker is turning a blind eye toward these more modern technologies. On the contrary, he makes a clear case for ramping up predictive analytics so that an organization can see not only where it’s been but also where it’s going, and stay a step ahead of the competition.

    The key to doing both is recognizing that even with the rise of big data, you need to leverage the right combination of both traditional and modern database tools today. Knowing which serves each situation best, and giving your team the tools it needs for each, is the balancing act DBAs and data managers must pull off today.

    Read the entire article on the Data Center Knowledge site.


  • Integrating hooks and tools for easier management of HPC Cluster

    Managing tens of thousands of local and remote server nodes in a cluster is always a challenge. To reduce the cluster-management overhead and simplify the setup of a cluster of nodes, admins seek the convenience of a single snapshot view. Rapid changes in technology make management, tuning, customization, and settings updates an ongoing necessity, one that needs to be performed easily as infrastructure is refactored and refreshed.

    To simplify some of these challenges, it’s important to fully integrate hardware management and the cluster management solution. The integration between server hardware and the cluster management solution detailed in this blog provides an example of some of the best practices achievable today.

    Critical to this integration and design is the Integrated Dell Remote Access Controller (iDRAC). Since iDRAC is embedded into the server motherboards for in-band and out-of-band system management, it can display and modify BIOS settings as well as perform firmware updates through the Lifecycle Controller and remote console. Collectively, each server’s in-depth system profile information is gathered using system tools and utilities and is available in a single graphical user interface for ease of administration, thus reducing the need to physically access the servers themselves.

    Figure 1. BIOS-level integration between Dell PowerEdge servers and cluster management solution (Bright 7.1)

    Figure 1 (above) depicts the configuration setup for a single node in the cluster. The fabric can be accessed via the dedicated iDRAC port or shared with the LAN-on-Motherboard capability. The cluster administration fabric is configured at the time of deployment with the help of built-in scripts in the software stack that help automate this. The system profile of the server is captured in an XML-based schema file that gets imported from the iDRAC using the racadm commands. Thus relevant data such as optimal system BIOS settings, boot order, console redirection and network configuration are parsed and displayed on the cluster dashboard of the graphical user interface.  By reversing this process, it is possible to change and apply other BIOS settings onto a server to tune and set system profiles from the graphical interface. These choices are then stored in an updated XML-based schema file on the head node, and pushed out to the appropriate nodes during reboots.
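    To give a feel for the parsing step described above, the sketch below reads BIOS attributes out of a small XML profile. The sample document is hypothetical and heavily trimmed: the element names, the FQDD value and the attribute names are assumptions modeled on the general shape of an exported profile, so check them against a real export before relying on them. It does not show how the cluster manager itself performs this step.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed profile used only for illustration.
SAMPLE_PROFILE = """
<SystemConfiguration Model="PowerEdge (example)" ServiceTag="EXAMPLE1">
  <Component FQDD="BIOS.Setup.1-1">
    <Attribute Name="SysProfile">PerfOptimized</Attribute>
    <Attribute Name="LogicalProc">Disabled</Attribute>
    <Attribute Name="ProcCStates">Disabled</Attribute>
  </Component>
</SystemConfiguration>
"""


def bios_settings(profile_xml: str) -> dict:
    """Return {attribute name: value} for every BIOS component in a profile."""
    root = ET.fromstring(profile_xml.strip())
    settings = {}
    for component in root.iter("Component"):
        if component.get("FQDD", "").startswith("BIOS."):
            for attr in component.iter("Attribute"):
                settings[attr.get("Name")] = (attr.text or "").strip()
    return settings


print(bios_settings(SAMPLE_PROFILE))
```

    Applying settings is the same idea in reverse: edit the attribute values, write the XML back out, and let the management stack push the file to the node.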

    Figure 2. Snapshot of the Cluster Node Configuration via cluster management solution.

    Figure 2 is a screenshot showing BIOS version and system profile information for a number of Dell PowerEdge servers of the same model. This is a particularly useful overview as inappropriate settings and versions can be easily and rapidly identified. 

    A typical use is when servers are added to or replaced in a cluster. The above integration helps to ensure that all servers have homogeneous performance, BIOS versions, firmware, system profiles and other tuning configurations.

    This integration is also helpful for users who need custom settings (that is, not the default settings) applied on their servers. For example, codes that are latency sensitive may require a custom profile with C-States disabled. These servers can be categorized into a node group, with specific BIOS parameters applied to that group.
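    One way to picture the homogeneity check is to compare each node's settings against the most common value across the cluster and flag anything that differs. The sketch below is a simplified illustration under that assumption; the node names, setting names and values are hypothetical, and this is not the cluster manager's own logic.

```python
from collections import Counter


def find_outliers(nodes: dict) -> dict:
    """Map node -> {setting: (found, expected)}, where expected is the
    most common value of that setting across all nodes."""
    keys = {key for settings in nodes.values() for key in settings}
    outliers = {}
    for key in sorted(keys):
        counts = Counter(settings.get(key) for settings in nodes.values())
        expected, _ = counts.most_common(1)[0]
        for node, settings in nodes.items():
            if settings.get(key) != expected:
                outliers.setdefault(node, {})[key] = (settings.get(key), expected)
    return outliers


# Hypothetical inventory: node003 was replaced and still has older settings.
cluster = {
    "node001": {"BiosVersion": "1.0.9", "SysProfile": "PerfOptimized"},
    "node002": {"BiosVersion": "1.0.9", "SysProfile": "PerfOptimized"},
    "node003": {"BiosVersion": "1.0.4", "SysProfile": "PerfPerWattOptimizedDapc"},
}
print(find_outliers(cluster))
```

    A majority vote is only a heuristic; in practice the expected values would more likely come from a known-good reference profile.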
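    The node group idea can be sketched as a set of cluster-wide defaults plus per-group overrides. Everything in the example below (group names, node names, setting names and values) is hypothetical and only meant to show the layering; the actual cluster manager exposes this through its own interface.

```python
# Hypothetical cluster-wide defaults.
DEFAULT_BIOS = {
    "SysProfile": "PerfOptimized",
    "ProcCStates": "Enabled",
    "LogicalProc": "Enabled",
}

# Hypothetical node groups; the low-latency group disables C-States.
NODE_GROUPS = {
    "default": {"nodes": ["node001", "node002"], "overrides": {}},
    "low-latency": {
        "nodes": ["node003", "node004"],
        "overrides": {"SysProfile": "Custom", "ProcCStates": "Disabled"},
    },
}


def effective_settings(node: str) -> dict:
    """Defaults merged with the overrides of the group the node belongs to."""
    for group in NODE_GROUPS.values():
        if node in group["nodes"]:
            return {**DEFAULT_BIOS, **group["overrides"]}
    raise KeyError(f"{node} is not assigned to a node group")


print(effective_settings("node003"))
```

    Keeping overrides separate from defaults means a change to the defaults reaches every group automatically, while each group's deliberate deviations stay visible in one place.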

    This tightly coupled BIOS-level integration significantly enhances HPC cluster maintenance by providing a single snapshot view for simplified updates and tuning. As a validated and tested solution on the given hardware, it provides seamless operation and administration of clusters at scale.


    1. http://www.brightcomputing.com/Bright-Cluster-Manager
    2. http://en.community.dell.com/techcenter/systems-management/w/wiki/3204.dell-remote-access-controller-drac-idrac
    3. http://www.brightcomputing.com/Linux-Cluster-Architecture
    4. http://en.community.dell.com/techcenter/high-performance-computing/b/general_hpc/archive/2014/09/23/bios-tuning-for-hpc-on-13th-generation-haswell-servers
    5. http://en.community.dell.com/techcenter/high-performance-computing/b/general_hpc/archive/2015/04/29/linpack-benchmarking-on-a-4-nodes-cluster-with-intel-xeon-phi-7120p-coprocessors