High Performance Computing Blogs

High Performance Computing
A discussion venue for all things high performance computing (HPC), supercomputing, and the technologies that enable scientific research and discovery.
  • Meeting the Demands of HPC and Big Data Applications by Leveraging Hybrid CPU/GPU Computing

    “Rack ‘em and stack ‘em” was a winning approach for a long time, but not without its limitations. A generalized server solution works best when the applications running on those servers have generalized needs.

    Enter “Big Data.” Today’s application and workload environments may be required to process massive amounts of granular data and thus often consist of applications that place high demands on different server hardware elements. Some applications are very compute intensive and place a high demand on the server’s CPU, while others in the same environment have unique processing requirements best served by specialized graphics processing units (GPUs).

    Whether it is customer, demographic, or seismic data, or a whole host of other uses, the number crunching and processing required across the suite of applications can result in processing demands that are radically different from those of prior years. Enter Hybrid High Performance Computing. These systems are built to serve two masters, CPU-intensive applications and GPU-intensive applications, delivering a hybrid environment where workloads can be optimized and run times reduced through ideal resource utilization.

    The results of Hybrid CPU/GPU Computing adoption have been impressive. Just a few examples of how Hybrid CPU/GPU Computing is delivering real value include:

    • Optimizing workloads across CPU/GPU servers
    • Delivering the highest density and performance in a small footprint
    • Providing significant power, cooling, and resource utilization benefits

    You can learn more about leveraging hybrid CPU/GPU computing in this whitepaper.

  • Counting Down to SC15

    - by Stephen Sofhauser

    The countdown to SC15 has started, and we at Dell are very excited for this year’s event. We have a lot to share with you all this year, and it’s particularly special to us because it’s right in our back yard.  Come visit us at booth #1009. Our aim is to show you the true meaning of “Texas Friendly” with great demos, two customer theaters that will feature a stellar lineup of speakers, panel discussions, new products and solutions we’re bringing to market, and of course, let’s not forget the awesome food and entertainment Austin has to offer!

    We have a great morning series for you, featuring our director of HPC Engineering, Onur Celebioglu, and Intersect360’s Addison Snell. Every morning of the show (10:15 a.m. - 10:45 a.m.), they will be discussing trends and tech in the HPC market.

    There will be two afternoon panel discussions, hosted by insideHPC’s Richard Brueckner. I recently spoke to him about the upcoming conference; you can listen to the podcast here.

    The first panel is from 1:30 p.m. to 2:30 p.m. on Wednesday, November 18th, “All Together Now: The Convergence of Big Data, Cloud, and HPC.” This should prove to be an interesting discussion with Richard and our four panelists: Wojtek Goscinski, Ph.D. (Monash University), Niall Gaffney, Ph.D. (Texas Advanced Computing Center (TACC)), Craig Stewart, Ph.D. (Indiana University), and Andrew Rutherford (Microsoft). The concept of the panel was born from discussions with customers who, after years of siloed workloads, are trying to figure out how best to integrate their Big Data, Cloud, and HPC.

    Wednesday's second panel, “More than Just Exascale: How the NSCI Will Make HPC More Accessible to All” (3:00 p.m. – 4:00 p.m.), should be equally compelling as Richard and the panelists discuss how the NSCI is fostering more collaboration between government, academia, and industry. This all comes as a result of the White House initiative seeking to keep the United States at the forefront of HPC capabilities. What’s nice is that most of these speakers know each other. We will be featuring Dan Stanzione, Ph.D. (Texas Advanced Computing Center, TACC), Mike Norman, Ph.D. (San Diego Supercomputing Center, SDSC), Dave Lifka, Ph.D. (Cornell University), and Merle Giles (National Center for Supercomputing Applications, NCSA), who has a book on HPC best practices, Industrial Applications of High-Performance Computing: Best Global Practices.

    We have so much more planned for you, including talks from customers like the San Diego Supercomputing Center, Virginia Bioinformatics Institute (VBI), and Cornell, which was just granted $5 million by the NSF to collaboratively develop a federated cloud. There’s just too much to cram into one blog post, but you can check out http://dell.to/1PFgbwJ for more information. We look forward to seeing you (booth #1009) and hope you have a great SC15 and enjoy everything Dell has to offer! And welcome to Austin!



  • A Great Lineup for SC15

    Join Dell in Booth #1009!

    This year promises to be another great one at SC, chock full of great speakers and panels – you won’t be disappointed! 

    Join us at the Dell Sixth Street Theater Tuesday, Wednesday, and Thursday morning from 10:15 a.m. – 10:45 a.m. for authentic Texas burritos and engaging conversation from Intersect360’s CEO, Addison Snell, and Dell’s own Onur Celebioglu. They will be discussing why HPC is now important to a broader group of use cases, and digging deep into overviews of HPC for research, life sciences, and manufacturing. Participants will learn about types of application characterization, best practices, and examples of engineered solutions that are appropriate for these specific verticals. Come learn more about why HPC, big data, and cloud are converging, and how Dell solves challenges in our HPC engineering lab and through collaborative work with other leading technology partners and research institutions.

    We will also have two informative panels hosted by Rich Brueckner, president, insideHPC. Recently, Dell’s North American Sales Director for HPC (High Performance Computing) spoke with Brueckner about these panels - you can listen to the podcast here.

    All Together Now: The Convergence of Big Data, Cloud, and HPC

    Wednesday, Nov 18, 1:30 p.m. – 2:30 p.m.

    • Wojtek Goscinski, Ph.D. (Monash University)
    • Niall Gaffney, Ph.D. (Texas Advanced Computing Center (TACC))
    • Craig Stewart, Ph.D. (Indiana University)
    • Andrew Rutherford (Microsoft)

    Modeling and simulation have been the primary uses of high performance computing (HPC). But the world is changing. We now see the need for rapid, accurate insights from large amounts of data, and HPC technology is being repurposed to deliver them. Likewise, where the work gets done is shifting: many workloads are migrating to massive cloud data centers because of the speed of execution they offer. In this panel, leaders in computing will share how they, and others, integrate tradition and innovation (HPC technologies, Big Data analytics, and Cloud Computing) to achieve more discoveries and drive business outcomes.

    More than Just Exascale: How the NSCI Will Make HPC More Accessible to All 

    Wednesday, Nov 18, 3:00 p.m. – 4:00 p.m.

    • Dan Stanzione, Ph.D. (Texas Advanced Computing Center TACC)
    • Mike Norman, Ph.D. (San Diego Supercomputing Center SDSC)
    • Dave Lifka, Ph.D. (Cornell University)
    • Merle Giles (National Center for Supercomputing Applications NCSA)

    The US, and the world, took notice this summer when President Obama issued an Executive Order establishing the National Strategic Computing Initiative (NSCI). While most headlines focus on the exascale goals of this initiative, the NSCI presents a comprehensive set of objectives, including advancing the usage, capabilities, and impact of HPC for decades to come. In this panel, you will hear from HPC leaders who are moving us forward in improving HPC application developer productivity and making high performance computing more accessible to all.

    Lastly, from 1:00 p.m. to 1:30 p.m. on Tuesday, Wednesday, and Thursday afternoons, you can enjoy tasty treats and great dialogue from:

    • Jeff Kirk, Sr. Principal Engineer, HPC Technologies, Office of the CTO
    • Joe Sekel, HPC Server Architect, Dell
    • Adnan Khaleel, Director Global Enterprise Sales Strategy

    Each day, they will discuss different aspects of how HPC adoption, scaling, and scope continue to grow, driven by the need to solve more problems, larger problems, and new types of problems. Modeling & simulation remain important application cases; data analytics and machine learning are expanding the scope of HPC and the types of HPC systems; and cloud computing is making HPC more accessible and on-demand. Dell has been a long-time leader in HPC clusters for modeling and simulation, and is now embarking on a path toward leadership in this broader context of HPC.



  • Accelerating HPC applications using K80 GPUs

    - By Mayura Deshmukh

    Every year, graphics processing units (GPUs) become more powerful, achieving more teraflops and delivering a quantum leap in performance for commonly used molecular dynamics and manufacturing codes, allowing researchers to use more efficient and denser high performance computing architectures. What is the performance difference between CPU and GPU? How much power is consumed? How well do K80 GPUs perform in the Dell C4130 server? Which configuration is best for my application? These are some of the questions that come to mind, and this blog aims to answer these and related questions.

    This blog presents the work conducted to measure and analyze the performance, power consumption, and performance per watt of a single Dell PowerEdge C4130 server with NVIDIA K80 GPUs. The PowerEdge C4130 server is the latest high-density GPU design from Dell, offering up to four GPUs in a 1U form factor. The uniqueness of the PowerEdge C4130 is that it presents a configurable system design, potentially making it a better fit for a wider variety of extreme HPC applications.

    The HPC-focused Tesla series K80 GPU provides 1.87 TFLOPS of double-precision compute capacity, which is about 31% more than the K40, the previous Tesla card. The K40’s base clock is 745MHz, though it can be boosted up to 810MHz or 875MHz. The K80 has a base clock of 562MHz, but it can climb up to 875MHz in 13MHz increments. Another new feature of the K80 is Autoboost, which provides additional performance if additional power and thermal headroom is available. In the K80, the internal GPUs are based on the GK210 architecture and have a total of 4,992 cores, which represents a 73% improvement over the K40. The K80 has a total memory of 24GB, divided equally between the two internal GPUs; this is 100% more memory capacity than the K40. The memory bandwidth in the K80 is improved to 480 GB/s. The rated power consumption of a single K80 is a maximum of 300 watts.
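    Since each K80 card exposes its two internal GK210 GPUs as separate CUDA devices, a quick way to see how a C4130 configuration presents itself to software is to enumerate the devices through the CUDA runtime. The sketch below is a minimal, illustrative example (not part of the original study); it simply prints the name, memory size, SM count, and clock of every device CUDA can see.

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        int count = 0;
        cudaGetDeviceCount(&count);            /* each K80 card appears as two devices */
        printf("CUDA devices visible: %d\n", count);

        for (int i = 0; i < count; ++i) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, i);
            printf("Device %d: %s, %.1f GB, %d SMs, clock %.0f MHz\n",
                   i, prop.name,
                   prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0),
                   prop.multiProcessorCount,
                   prop.clockRate / 1000.0);    /* clockRate is reported in kHz */
        }
        return 0;
    }

    On a server with four K80 cards, this would list eight devices, each reporting roughly half of a card's 24GB of memory.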


    The C4130 offers eight configurations, “A” through “H”. Since GPUs provide the bulk of the compute horsepower, the configurations can be divided into three groups based on expected performance: the first group of four configurations, “A”, “B”, “C”, and “G”, with four GPUs each; the second group, a single configuration “H”, with three GPUs; and the third group of three configurations, “D”, “E”, and “F”, with two GPUs each. The quad-GPU configurations “A”, “B”, and “G” have an internal PCIe switch module. The details of the various configurations are shown in Table 1 and the block diagram (Figure 1) below:

    Table 1: C4130 Configurations


      Figure 1: C4130 Configuration Block Diagram

    Table 2 gives more information about the hardware configuration, profiles and firmware used for the benchmarking.

    Table 2: Hardware Configuration



    CUDA’s heterogeneous programming model uses both the CPU and GPU, so data transfer between CPUs and GPUs greatly affects performance.

    Figure 2: Memory Bandwidth for C4130

    Figure 2 shows the host-to-device (CPU → GPU) and device-to-host (GPU → CPU) memory bandwidth for all the C4130 configurations. Bandwidth is in the range of 12000 MB/s (the peak is 15754 MB/s).
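    For reference, this kind of host-to-device bandwidth is typically measured with a timed cudaMemcpy() loop over pinned host memory (essentially what NVIDIA's bandwidthTest sample does). The sketch below is an illustrative approximation rather than the exact tool used for Figure 2, and the transfer size is an arbitrary choice.

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        const size_t bytes = 256UL * 1024 * 1024;   /* 256 MB per transfer (arbitrary) */
        const int    reps  = 20;

        void *h_buf, *d_buf;
        cudaMallocHost(&h_buf, bytes);              /* pinned host memory for full PCIe speed */
        cudaMalloc(&d_buf, bytes);

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start, 0);
        for (int i = 0; i < reps; ++i)
            cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        double mbytes = (double)bytes * reps / (1024.0 * 1024.0);
        printf("Host-to-device bandwidth: %.0f MB/s\n", mbytes / (ms / 1000.0));

        cudaFree(d_buf);
        cudaFreeHost(h_buf);
        return 0;
    }

    Reversing the copy direction (cudaMemcpyDeviceToHost) gives the device-to-host number.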

    NVIDIA’s GPUDirect Peer-to-Peer feature enables GPUs on the same PCIe root complex to directly transfer data between their memories, avoiding any copies to system memory. This dramatically lowers CPU overhead and reduces latency, resulting in significant improvements in data transfer time for applications. Without the peer-to-peer feature, to get data from one GPU to another on the same host, one would use cudaMemcpy() first to copy the data from the source GPU to system memory, then another cudaMemcpy() to copy the same data onto the second GPU.
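    A minimal sketch of the two paths described above is shown here, assuming two GPUs (devices 0 and 1) that may or may not share a PCIe root complex; the buffer pointers are placeholders supplied by the caller. When peer access is available, cudaMemcpyPeer() moves the data directly; otherwise the copy is staged through a host bounce buffer.

    #include <cuda_runtime.h>

    /* Copy 'bytes' from a buffer on GPU 0 to a buffer on GPU 1: directly if
       peer-to-peer access is possible, otherwise staged through system memory. */
    void copy_gpu0_to_gpu1(void *dst1, const void *src0, size_t bytes)
    {
        int can_p2p = 0;
        cudaDeviceCanAccessPeer(&can_p2p, 1, 0);   /* can device 1 access device 0's memory? */

        if (can_p2p) {
            cudaSetDevice(1);
            cudaDeviceEnablePeerAccess(0, 0);      /* enable once per device pair */
            cudaMemcpyPeer(dst1, 1, src0, 0, bytes);
        } else {
            void *h_tmp;
            cudaMallocHost(&h_tmp, bytes);         /* bounce buffer in system memory */
            cudaSetDevice(0);
            cudaMemcpy(h_tmp, src0, bytes, cudaMemcpyDeviceToHost);
            cudaSetDevice(1);
            cudaMemcpy(dst1, h_tmp, bytes, cudaMemcpyHostToDevice);
            cudaFreeHost(h_tmp);
        }
    }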

    Figure 3: Peer-to-peer Bandwidth for C4130

    Figure 3 shows the peer-to-peer communication between the GPUs for the C4130 with a switch module (Configuration B) vs. the C4130 without a switch module (Configuration C - dual CPUs, balanced, with four GPUs).

    • For configuration B the bandwidth is constant at 24.6 GB/s across all GPUs.
    • For configuration C the bandwidth is:
      • 24.6 GB/s for data transfers between GPUs on the same card (GPU1↔GPU2, GPU3↔GPU4, GPU5↔GPU6, GPU7↔GPU8)
      • 19.6 GB/s for data transfers between GPUs connected to the same CPU (GPU1,2↔GPU3,4; GPU5,6↔GPU7,8)
      • 18.7 GB/s for data transfers between GPUs connected to the other CPU (GPU1,2,3,4↔GPU5,6,7,8)

    Applications that require a lot of peer to peer communication can benefit from the high bandwidth offered by the C4130 switch module configurations (A, B, G).


    HPL solves a random dense linear system in double-precision arithmetic on distributed-memory systems and is a very compute-intensive benchmark. NVIDIA's precompiled HPL, Intel MKL 2015, and OpenMPI 1.6.5 were used for the benchmarking. The problem size (N) used was ~90% of the system memory.
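    As a rough illustration of how that problem size is chosen: the HPL matrix occupies N x N x 8 bytes, so N is approximately sqrt(0.9 x memory / 8), rounded down to a multiple of the block size NB. The sketch below assumes 256GB of system memory and NB=192 (the block size used in the Haswell-EX study later in this collection); both values are illustrative assumptions, not the exact inputs of this K80 study.

    #include <stdio.h>
    #include <math.h>

    /* Rough HPL problem-size calculation: target ~90% of system memory for the
       N x N double-precision matrix and align N to the block size NB.
       The memory size and NB below are illustrative assumptions. */
    int main(void)
    {
        const double mem_bytes = 256.0 * 1024 * 1024 * 1024;   /* assumed 256 GB of RAM */
        const double fraction  = 0.90;                          /* ~90% of memory, as in the text */
        const long   nb        = 192;                           /* assumed HPL block size */

        long n = (long)sqrt(fraction * mem_bytes / sizeof(double));
        n -= n % nb;                                            /* align N to the block size */

        printf("Suggested HPL N = %ld (~%.1f GB of matrix data)\n",
               n, (double)n * n * sizeof(double) / (1024.0 * 1024.0 * 1024.0));
        return 0;
    }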

    Figure 4: HPL performance and power consumption with C4130



    The blue bars on the left graph in Figure 4 show the HPL performance characterization of the PowerEdge C4130. The results are reported in GFLOPS on the Y-axis of the graph.

    • Performance for the four-GPU configurations “A”, “B”, “C”, and “G” ranges from 6.5 to 7.3 TFLOPS. Configurations “C” and “G”, with two GPUs balanced per CPU, are the highest performing configurations at 7.3 TFLOPS. The performance difference between “A” and “B” can be attributed to the additional CPU in configuration “B”. The difference from “B” to “G” or “C” is due to different GPU-to-CPU ratios; all three have the same number of compute resources. Configurations “C” and “G” are balanced with two GPUs per CPU, while “B” has all four GPUs attached to a single CPU.
    • The only three-GPU configuration, “H”, achieved 6.4 TFLOPS, which falls between the performance of the four-GPU and two-GPU configurations.
    • For the two-GPU configurations, “D” is highest with 3.8 TFLOPS, with “E” and “F” at 3.6 TFLOPS. Configuration “E” has one fewer CPU, which explains its performance difference from “D”.
    • Both “D” and “F” have two CPUs and two GPUs, but in configuration “F” both GPUs are connected to just one CPU, whereas in configuration “D” each GPU is connected to its own CPU (more cores per GPU).

    Compared to CPU-only performance, run on two E5-2690 v3 processors, an acceleration of ~9X is obtained by using four K80s, ~7X by using three GPUs, and ~4.7X with two K80 GPUs. The HPL efficiency is significantly higher on the K80 (low to upper 80s in percent) compared to the previous generation of GPUs.

    The red bars on the right graph in Figure 4 represent the power consumption for the HPL runs. The quad-GPU configurations “A”, “B”, “C”, and “G” consume significantly more power than the CPU-only runs, which is expected for compute-intensive loads. But the energy efficiency (calculated as performance per watt) of these configurations is 4+ GFLOPS/W, compared to 1.6 GFLOPS/W for the CPU-only HPL runs. The power consumption of the three-GPU configuration “H” is 2.7X and its energy efficiency is 4.1 GFLOPS/W, which makes it an energy-efficient, lower-cost alternative to the quad-GPU configurations. The dual-GPU configurations “D”, “E”, and “F” consume less power (1.8X to 2.1X compared to the CPU-only runs) and their energy efficiency is in the range of 3.5 GFLOPS/W to 3.9 GFLOPS/W, which is about 2.3X better than the CPU-only runs.


    NAMD is designed for high-performance simulation of large biomolecular systems. The ApoA1 benchmark (92,224 atoms) models a high-density lipoprotein found in plasma, which helps extract cholesterol from tissues to the liver. F1ATPase (327,506 atoms) models the enzyme responsible for synthesizing the molecule adenosine triphosphate. STMV (Satellite Tobacco Mosaic Virus) is a small, icosahedral virus which worsens the symptoms of infection by tobacco mosaic virus; STMV is a large benchmark case with 1,066,628 atoms.

    Figure 5: NAMD performance and power consumption with C4130



    Figure 5 quantifies the performance and power consumption of NAMD for all the C4130 configurations compared to the CPU-only server (i.e., a server with two CPUs).

    • The acceleration on NAMD is sensitive to the number of CPUs and the memory available in the system. For example, there is a significant difference in acceleration between “A” and “B” for the quad-GPU configurations and between “E” and “F” for the dual-GPU configurations. This difference becomes more apparent as the problem size increases. “B”, with a configuration similar to “A” but with an additional CPU and memory, performs 43% better than “A”; “F”, with an additional CPU and memory over “E”, performs 26% better.
    • Among the four quad-GPU configurations, NAMD performs best on configurations “C” and “G”. The difference between the two highest performing configurations and the other configurations (“A” and “B”) is the manner in which the GPUs are attached to the CPUs. The balanced configurations “G” (with switch) and “C” (without switch) have two GPUs attached to each of two CPUs, resulting in 7.8X acceleration over the CPU-only case. The same four GPUs attached via a switch module to a single CPU, configuration “B”, result in about 7.7X acceleration.
    • “H”, the three-GPU configuration, falls in between the four-GPU and two-GPU configurations with respect to performance, with 7.1X acceleration over the CPU-only configuration. “H”, with an extra CPU and more memory, performs better than the four-GPU configuration “A”.
    • “D” and “F” with 2 CPUs and 2 GPUs perform better with 5.9X acceleration compared to 4.4X in configuration “E” (1 CPU and 2 GPUs).  

    As shown in the right graph of Figure 5, the power consumption for the quad-GPU configurations is ~2.3X, resulting in accelerations from 4.4X to 7.8X, and the energy efficiency (performance per watt) ranges from 2.0X to 3.4X. Configurations “C” and “G”, along with providing the best performance, also do well from an energy efficiency perspective (an acceleration of 7.8X for 2.3X more power) amongst the quad-GPU configurations. Configuration “H”, with three GPUs, is more energy efficient than the quad-GPU configurations, with a performance per watt of 3.7X, providing 7.1X acceleration with only 1.9X more power. Configuration “F” is the most energy-efficient configuration, consuming only 1.5X more power with a performance per watt of 3.8X.

    ANSYS Fluent

    ANSYS Fluent is a computational fluid dynamics application used for fluid flow design engineering analysis. The equation solvers used to drive the simulation are computationally intensive. Approximately 3 GB of GPU memory is required for a 1M cell simulation. The benchmarks run were the ANSYS pipes 1.2M and 9.6M steady-state, non-combustive cases.

    Figure 6: ANSYS Fluent performance and power consumption with C4130



    The left graph in Figure 6 shows the performance of ANSYS Fluent compared to 4 CPU cores. The code performs best on configurations with a 1:2 CPU-to-GPU ratio.

    • The quad GPU configurations provide 3.9-4.4X acceleration compared to tests run on 4 CPU cores. Configuration “C” and “G” provide the best performance amongst the four GPU configurations
    • The three GPU configuration “H” provides 3.7X acceleration
    • The dual-GPU configuration “E”, with two GPUs connected to a single CPU, provides the best acceleration, 2.8X, amongst all the dual-GPU configurations

    In Figure 6, the right graph shows the power consumption data for all the configurations compared to the power consumed when the benchmarks were run on 4 CPU cores. The numbers in yellow at the bottom of the bars indicate the relative performance per watt for the configurations. The quad-GPU configurations consume 3.7X-3.9X more power and provide 3%-20% more performance per watt. The three-GPU configuration “H” is the most energy-efficient configuration: it consumes 2.8X more power but provides the most performance per watt (32% more than the 4-core runs) of all the configurations. The dual-GPU configurations consume 2.1X-2.5X more power and their energy efficiency is 7%-28% better.

    Fluent scales well on CPU cores, so to understand the benefit of using GPUs we experimented by using the same number of licenses and running the benchmark on CPU cores only vs. on CPU + GPU.

    Figure 7: ANSYS Fluent optimizing licensing costs



    Figure 7 shows the data for the 1.2M and 9.6M Fluent benchmarks run on CPU cores only vs. the quad-GPU configurations “A”, “B”, and “C”. The benchmark output is the wall clock time, which is the Y-axis (lower is better); the X-axis shows the number of CPU cores used for the test (that is, the number of Fluent licenses required). As shown in Figure 7, with 24 licenses the GPU approach, that is, using 16 cores + 8 GPUs, provides 48% better performance than using 24 CPU cores alone for the 9.6M benchmark, and is 25% better for the 1.2M benchmark. Similarly, Table 3 shows the performance benefit of the GPU approach vs. the CPU approach for 24, 20, 16, 12, and 8 licenses for the 9.6M and 1.2M benchmark cases.

    Table 3: Fluent GPU vs CPU approach with same number of licenses



    The C4130 server with NVIDIA Tesla K80 GPUs demonstrates exceptional performance and power-efficiency gains for compute-intensive workloads and applications like NAMD and Fluent. Fluent scaling is very impressive on CPU cores, but depending on your problem and licensing model there is a definitive performance benefit to using GPUs. Applications that do a lot of GPU peer-to-peer communication can gain from the higher bandwidth offered by the C4130 switch configurations.

  • Application Performance Study on Intel Haswell EX Processors

    by Ashish Kumar Singh

    This blog describes, in detail, a performance study carried out on the E7-8800 v3 family of processors (architecture codenamed Haswell-EX). The performance of the Intel Xeon E7-8800 v3 has been compared to that of the Intel Xeon E7-4800 v2 to ascertain the generation-over-generation performance improvement. The applications used for this study are HPL, STREAM, WRF, and ANSYS Fluent. The Intel Xeon E7-8890 v3 processor has 18 cores/36 threads with 45MB of L3 cache (2.5MB/slice). With AVX workloads, the clock speed of the Intel E7-8890 v3 reduces from 2.5GHz to 2.1GHz. These processors support a QPI speed of 9.6 GT/s.

    Server Configuration

                                              | PowerEdge R920                                                      | PowerEdge R930
    Processors                                | 4 x Intel Xeon E7-4870 v2 @ 2.3GHz (15 cores), 30MB L3 cache, 130W | 4 x Intel Xeon E7-8890 v3 @ 2.5GHz (18 cores), 45MB L3 cache, 165W
    Memory                                    | 512GB = 32 x 16GB DDR3 @ 1333MHz RDIMMs                            | 1024GB = 64 x 16GB DDR4 @ 1600MHz RDIMMs
    BIOS Settings                             | Version 1.1.0                                                       | Version 1.0.9
    Processor Settings > Logical Processors   |                                                                     |
    Processor Settings > QPI Speed            | Maximum Data Rate                                                   | Maximum Data Rate
    Processor Settings > System Profile       |                                                                     |

    Software and Firmware

    Operating System                          | RHEL 6.5 x86_64                                                     | RHEL 6.6 x86_64
    Intel Compiler                            | Version 14.0.2                                                      | Version 15.0.2
    Intel MKL                                 | Version 11.1                                                        | Version 11.2
    Intel MPI                                 | Version 4.1                                                         | Version 5.0

    Benchmark and Applications

    HPL                                       | V2.1 from MKL 11.1                                                  | V2.1 from MKL 11.2
    STREAM                                    | v5.10, Array Size 1800000000, Iterations 100                        | v5.10, Array Size 1800000000, Iterations 100
    WRF                                       | v3.5.1, Input Data Conus12KM, Netcdf-                               | v3.6.1, Input Data Conus12KM, Netcdf-4.3.2
    ANSYS Fluent                              | v15, Input Data: eddy_417k, truck_poly_14m, sedan_4m, aircraft_2m   | v15, Input Data: eddy_417k, truck_poly_14m, sedan_4m, aircraft_2m


    The objective of this comparison was to show the generation-over-generation performance improvement in enterprise 4S platforms. The performance differences between the two server generations were due to the improved system architecture, the greater number of cores, and the higher-frequency memory. The software versions were not a significant factor.


    High Performance Linpack (HPL) is a benchmark that solves a (random) dense linear system in double-precision (64-bit) arithmetic on distributed-memory systems. The HPL benchmark was run on both the PowerEdge R930 and the PowerEdge R920 with a block size of NB=192 and a problem size N sized to ~90% of total memory.


    As shown in the graph above, LINPACK showed a 1.95X performance improvement with four Intel Xeon E7-8890 v3 processors on the R930 server in comparison to four Intel Xeon E7-4870 v2 processors on the R920 server. This was due to the substantial increase in the number of cores, memory speed, and flops per cycle, along with processor architecture improvements.
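    To see why core count, clock, and flops per cycle dominate, it helps to sketch the theoretical double-precision peak of each 4-socket system. The per-cycle figures below are the standard ones for these microarchitectures (8 DP flops/cycle for Ivy Bridge-EX, 16 DP flops/cycle for Haswell-EX with FMA), and the Haswell entry uses the 2.1GHz AVX clock mentioned above; treat this as a back-of-the-envelope estimate rather than part of the original study.

    #include <stdio.h>

    /* Back-of-the-envelope theoretical DP peak for the two 4-socket systems.
       The Haswell-EX entry uses its reduced AVX clock of 2.1 GHz. */
    int main(void)
    {
        double r920_peak = 4 * 15 * 2.3 * 8;    /* E7-4870 v2: 4 sockets x 15 cores x 2.3 GHz x 8 flops/cycle  */
        double r930_peak = 4 * 18 * 2.1 * 16;   /* E7-8890 v3: 4 sockets x 18 cores x 2.1 GHz x 16 flops/cycle */

        printf("R920 peak: %.0f GFLOPS\n", r920_peak);                /* ~1104 GFLOPS */
        printf("R930 peak: %.0f GFLOPS\n", r930_peak);                /* ~2419 GFLOPS */
        printf("Theoretical ratio: %.2fx\n", r930_peak / r920_peak);  /* ~2.2x */
        return 0;
    }

    The measured 1.95X falls a little below this ~2.2X theoretical ratio, which is consistent with ordinary HPL efficiency differences between the two platforms.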


    STREAM is a simple synthetic benchmark that measures sustained memory bandwidth using the COPY, SCALE, SUM, and TRIAD kernels.

    Operations of these programs are shown below:

    COPY:       a(i) = b(i)
    SCALE:      a(i) = q*b(i)
    SUM:        a(i) = b(i) + c(i)
    TRIAD:      a(i) = b(i) + q*c(i)
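    For reference, a minimal OpenMP rendition of the TRIAD kernel and its bandwidth calculation is sketched below. This is an illustrative reimplementation, not the official STREAM source, and the array size is deliberately smaller than the 1,800,000,000 elements used in the study.

    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    /* Minimal TRIAD-style bandwidth measurement (illustrative, not official STREAM).
       TRIAD reads b and c and writes a: three 8-byte doubles moved per element. */
    int main(void)
    {
        const size_t n = 80 * 1000 * 1000;   /* far smaller than STREAM's 1.8e9 elements */
        const double q = 3.0;
        double *a = malloc(n * sizeof *a);
        double *b = malloc(n * sizeof *b);
        double *c = malloc(n * sizeof *c);

        #pragma omp parallel for
        for (size_t i = 0; i < n; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

        double t0 = omp_get_wtime();
        #pragma omp parallel for
        for (size_t i = 0; i < n; i++)
            a[i] = b[i] + q * c[i];          /* TRIAD: a(i) = b(i) + q*c(i) */
        double t1 = omp_get_wtime();

        double gbytes = 3.0 * n * sizeof(double) / 1.0e9;
        printf("TRIAD bandwidth: %.1f GB/s\n", gbytes / (t1 - t0));

        free(a); free(b); free(c);
        return 0;
    }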

    This chart shows the comparison of sustained memory bandwidth between the PowerEdge R920 and PowerEdge R930 servers. STREAM measured 231GB/s on the PowerEdge R920 and 260GB/s on the PowerEdge R930, which is a 12% improvement in memory bandwidth. This increase is because of the higher DIMM speed available on the PowerEdge R930.


    The WRF (Weather Research and Forecasting) model is a next-generation mesoscale numerical weather prediction system designed to serve both atmospheric research and operational forecasting needs. The model serves a wide range of meteorological applications across scales from tens of meters to thousands of kilometers. WRF can generate atmospheric simulations based on real data (observations, analyses) or idealized conditions.

    The WRF performance analysis was run with the Conus12KM dataset. Conus12KM is a single-domain, medium-size, 48-hour, 12KM-resolution case over the continental US (CONUS) domain with a time step of 72 seconds.


    With the Conus12KM dataset, WRF showed an average time of 0.22 seconds on the PowerEdge R930 server versus 0.26 seconds on the PowerEdge R920 server, which is an 18% improvement.

    ANSYS Fluent

    ANSYS Fluent contains broad physical modeling capabilities for modeling flow, turbulence, heat transfer, and reactions for industrial applications ranging from air flow over an aircraft wing to combustion in a furnace, from bubble columns to oil platforms, from blood flow to semiconductor manufacturing, and from clean room design to wastewater treatment plants.



    We used four different datasets for Fluent and considered ‘Solver rating’ (higher is better) as the performance metric. For all the test cases, Fluent on the PowerEdge R930 showed a 24% to 29% performance improvement in comparison to the PowerEdge R920.


    The PowerEdge R930 server outperforms its previous-generation counterpart, the PowerEdge R920, in both the benchmark and application comparisons. With its latest processors, higher core count, higher-frequency memory, and CPU architecture improvements, the PowerEdge R930 delivers better performance than the PowerEdge R920. The PowerEdge R930 platform with four Intel Xeon EX processors is a very good choice for HPC applications that can scale up to a large number of cores and a large amount of memory.







  • The Right Mix for Today’s Data Environments

    Three takeaways from Dell’s John Whittaker on leveraging both big data analytics and traditional database management tools today...

    Delving into the results of a recent Unisphere Research survey of 300 database administrators (DBAs) and corporate data managers, Dell’s Executive Director of Information Management John Whittaker gives straightforward advice for tackling today’s complex database environments via an Industry Perspectives article on the Data Center Knowledge site.

    While organizations and DBAs have become hyper-focused on big data, analytics, and unstructured data tools, Whittaker gives a timely reminder that structured data still matters.

    Indeed, according to the Unisphere survey he references, structured data still accounts for 75 percent of the data stack at more than two-thirds of today’s enterprises. What’s more, nearly one-third of all organizations haven’t begun actively managing unstructured data at all to this point.

    That means paying attention to tools like Oracle and Microsoft SQL Server still needs to be a priority for DBAs, even as they try to incorporate Hadoop and NoSQL into their organizations.

    But that doesn’t mean Whittaker is turning a blind eye toward these more modern technologies. On the contrary, he makes a clear case for ramping up predictive analytics to allow an organization to see not only where it’s been, but also where it’s going, so it can stay a step ahead of the competition.

    The key to doing both is recognizing that even with the rise of big data, you need to leverage the right combination of both traditional and modern database tools today. Knowing which serves each situation best, and giving your team the tools it needs for each, is the balancing act DBAs and data managers must pull off today.

    Read the entire article on the Data Center Knowledge site here.


  • Integrating hooks and tools for easier management of HPC Cluster

    Managing tens of thousands of local and remote server nodes in a cluster is always a challenge. To reduce the cluster-management overhead and simplify the setup of a cluster of nodes, admins seek the convenience of a single snapshot view. Rapid changes in technology make management, tuning, customization, and settings updates an ongoing necessity, one that needs to be performed quickly and easily as infrastructure is refactored and refreshed.

    To simplify some of these challenges, it’s important to fully integrate hardware management with the cluster management solution. The integration between server hardware and the cluster management solution detailed in this blog provides an example of some of the best practices achievable today.

    Critical to this integration and design is the Integrated Dell Remote Access Controller (iDRAC). Since the iDRAC is embedded in the server motherboard for in-band and out-of-band system management, it can display and modify BIOS settings as well as perform firmware updates through the Lifecycle Controller and remote console. Collectively, each server’s in-depth system profile information is gathered using system tools and utilities and is available in a single graphical user interface for ease of administration, thus reducing the need to physically access the servers themselves.

    Figure 1. BIOS-level integration between Dell PowerEdge servers and cluster management solution (Bright 7.1)

    Figure 1 (above) depicts the configuration setup for a single node in the cluster. The fabric can be accessed via the dedicated iDRAC port or shared with the LAN-on-Motherboard capability. The cluster administration fabric is configured at the time of deployment with the help of built-in scripts in the software stack that help automate this. The system profile of the server is captured in an XML-based schema file that gets imported from the iDRAC using the racadm commands. Thus relevant data such as optimal system BIOS settings, boot order, console redirection and network configuration are parsed and displayed on the cluster dashboard of the graphical user interface.  By reversing this process, it is possible to change and apply other BIOS settings onto a server to tune and set system profiles from the graphical interface. These choices are then stored in an updated XML-based schema file on the head node, and pushed out to the appropriate nodes during reboots.

    Figure 2. Snapshot of the Cluster Node Configuration via cluster management solution.

    Figure 2 is a screenshot showing BIOS version and system profile information for a number of Dell PowerEdge servers of the same model. This is a particularly useful overview as inappropriate settings and versions can be easily and rapidly identified. 

    Typical use would be when new servers are added or replaced in a cluster. The above integration will help to ensure that all servers have similar homogenous performance, BIOS versions, firmware, system profile and other tuning configurations.

    This integration is also helpful for users who need custom settings - i.e., not the default settings - applied on their servers. For example, codes that are latency sensitive may require a custom profile with C-States disabled. These servers can be categorized into a node group, with specific BIOS parameters applied to that group.

    This tightly coupled BIOS-level integration delivers a significantly enhanced solution for HPC cluster maintenance, providing a single snapshot view for simplified updates and tuning. As a validated and tested solution on the given hardware, it provides seamless operation and administration of clusters at scale.


    1. http://www.brightcomputing.com/Bright-Cluster-Manager
    2. http://en.community.dell.com/techcenter/systems-management/w/wiki/3204.dell-remote-access-controller-drac-idrac
    3. http://www.brightcomputing.com/Linux-Cluster-Architecture
    4. http://en.community.dell.com/techcenter/high-performance-computing/b/general_hpc/archive/2014/09/23/bios-tuning-for-hpc-on-13th-generation-haswell-servers
    5. http://en.community.dell.com/techcenter/high-performance-computing/b/general_hpc/archive/2015/04/29/linpack-benchmarking-on-a-4-nodes-cluster-with-intel-xeon-phi-7120p-coprocessors
  • Congratulations to Team South Africa

    by David Detweiler

    Congratulations to Team South Africa on their second-place finish in the Student Cluster Competition at the International Supercomputing Conference (ISC) in Frankfurt, Germany earlier this month. The students, hailing from the University of the Witwatersrand, narrowly missed three-peating as champions.

    The team was comprised of Ari Croock, James Allingham, Sasha Naidoo, Robert Clucas, Paul Osei Sekyere, and Jenalea Miller, with reserve team members Vyacheslav Schevchenko and Nabeel Rajab. Together, they represented the Centre for High Performance Computing (CHPC) at the competition.


    The South African students competed against teams from seven other nations over a sleep-depriving three days. During the competition, the teams were tasked with designing and building their own small cluster computers and running a series of HPC benchmarks and applications. In addition, the students were assigned to optimize four science applications, three of which were announced before the competition, with the fourth introduced during the event.

    The competition was sponsored by ISC and the HPC Advisory Council. Each team was scored based on three criteria:

    • Performance on the HPCC benchmark run (10%)
    • Performance on a suite of test applications (80%)
    • An explanation before a panel of experts of the strategy used and the results achieved (10%)

    With young people like team South Africa entering the field, the future of HPC looks brighter than ever. Congratulations on a job well done!

  • Big Data is Unlocking New Opportunities in Scientific Research

    Scientific research has reaped the rewards offered by big data technologies. New insights have been discovered in a wide range of disciplines thanks to the collection, analysis and visualization of large data sets. In a recent series of articles, insideBigData examined some of the noteworthy benefits researchers are realizing when adopting big data technologies.

    The results of big data adoption have been impressive. Just a few examples of how big data is being employed across a variety of disciplines include:

    • Health sciences – Data around genes, proteins, small molecules and other important indicators can now be stored in single repositories. This data can then be shared with researchers around the world.
    • Neurosciences – The activities of neurons in the human brain are currently being mapped. Researchers hope to discover not only how the brain functions and develops, but also new ways to treat disease and trauma.
    • Climate sciences – Massive amounts of information about climate and weather are now being collected and analyzed. Climate and environmental scientists are gaining heretofore unimaginable insights into everything from warming oceans to severe weather.

    Data has afforded researchers tremendous opportunities. Researchers are collaborating across companies, disciplines and even continents in manners never before available to them. As with any rapid adoption of technology, some difficulties are to be expected. However, the benefits offered by big data far outweigh any potential issues.

    Big data has already ushered in exciting scientific advancements across a myriad of disciplines. The benefits for researchers – and society – are just beginning.

    You can learn more about Big Data and scientific research in this white paper.

  • HPC and Big Data are Growing More Aligned

    The alignment between high performance computing (HPC) and big data has steadily gained traction over the past few years. As analytics and big data continue to be top of mind for organizations of all sizes and industries, traditional IT departments are considering HPC solutions to help provide rapid and reliable information to business owners so they can make more informed decisions.

    This alignment is clearly seen in increasing sales of hyper-converged systems. IDC predicts sales of these systems will increase 116% this year compared to 2014, reaching an impressive $807 million. This significant growth is expected to continue over the next few years. Indeed, the market is expected to experience a nearly 60% compound annual growth rate (CAGR) from 2014 to 2019, at which time it will generate more than $3.9 billion in total sales.

    To meet this growing customer demand, more hyper-converged systems are being offered. For example, the latest offering in the 13th-generation Dell PowerEdge server portfolio, the PowerEdge C6320, is now available. These types of solutions help organizations meet their increasingly demanding workloads by offering improved performance, better power efficiency, and cost-efficient compute and storage. This allows customers to optimize application performance and productivity while conserving energy use and saving traditional datacenter space.

    Among the top research organizations and enterprises utilizing the marriage between HPC and big data is the San Diego Supercomputer Center (SDSC). Comet, its new, recently deployed petascale supercomputer, leverages 27 racks of PowerEdge C6320 servers, totaling 1,944 nodes and 46,656 cores. This represents a five-fold increase in compute capacity versus their previous system, affording SDSC the ability to provide HPC to a much larger research community. You can read more about Comet and how it is being used in this Q&A with Rick Wagner, SDSC’s high-performance computing systems manager. (LINK)

    Learn more about the PowerEdge C6320 here.