High Performance Computing Blogs

High Performance Computing
A discussion venue for all things high performance computing (HPC), supercomputing, and the technologies that enable scientific research and discovery.
  • A Billion Here. A Billion There. Next Thing You Know, You’ve Got 4.3 Billion. And It Turns Out That’s Not Nearly Enough!

    A #Dell colleague, Dave Keller (@DaveKatDell), alerted me to a YouTube video featuring Vint Cerf, a founding father of the Internet and current Chief Internet Evangelist at Google. 


    In that video, Vint Cerf explains that devices connected to the Internet are given Internet addresses, much as phones are given phone numbers.  That address, known as an IP address, is usually represented as a group of 4 numbers separated by dots, such as   This is the default IP address of many Netgear routers such as might be used in your home.

    Each of those numbers in Version 4 of the Internet Protocol (IPv4) can be a number from 0 to 255.  This means there are 256 choices for each of the 4 numbers. 

    256 * 256 * 256 * 256 =  4,294,967,296

    So, there are about 4.3 billion addresses available for devices to connect to the Internet.  In 1980, there were only about 280 million people in the entire United States.  4.3 billion sounds like plenty!
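The arithmetic above can be checked with a short Python sketch. It uses the standard-library ipaddress module; the variable names are illustrative and do not come from the original post.

```python
import ipaddress

# Four fields, each with 256 possible values (0 through 255).
by_hand = 256 ** 4

# The same count from the standard library: 0.0.0.0/0 spans all of IPv4.
whole_ipv4 = ipaddress.ip_network("0.0.0.0/0")

print(by_hand)                   # 4294967296
print(whole_ipv4.num_addresses)  # 4294967296
```

Both lines print the same number because an IPv4 address is simply a 32-bit value, and 256 to the 4th power equals 2 to the 32nd.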

    But how many do you use today?  Cell phone?  Laptop or Tablet?   Home computer?  Work computer?   Home Internet router?  TV?

    I just named 6 possible ones.  Without going into private networks, etc., I think it is safe to say that when you are connected to the Internet, you are using an IP address. 

    OK.  So what’s the big deal?  China has over a billion people.  India has over a billion people.  And according to Vint Cerf in that same video, there are over 5 billion mobile devices in the world today.    According to Government Technology (http://del.ly/6046XoZ8), in 2020 there will be 50 billion Internet-enabled devices in the world. To put that number in perspective, that equates to more than 6 connected devices per person.  Oops!

    But don’t worry.  Internet Protocol Version 6 (IPv6) is rolling out.   China is actually taking a lead in this.  Imagine why.



    4.3 billion sounded big.  Just how big is IPv6?  Almost too big to explain or to even comprehend.  It is well over one trillion times as large as IPv4.   Or, with IPv6 those 4.3 billion addresses available from IPv4 are available to each and every person alive.   I can almost understand and appreciate that.  But in fact, it’s much larger: over a trillion-trillion-trillion total addresses.  Or for the nerds out there, about 3.4 x 10^38 addresses (2 to the 128th power).
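For readers who want to see the scale for themselves, here is a minimal Python sketch using the standard-library ipaddress module. IPv6 addresses are 128 bits wide and IPv4 addresses are 32 bits wide; the variable names are only illustrative.

```python
import ipaddress

ipv4_total = ipaddress.ip_network("0.0.0.0/0").num_addresses  # 2**32
ipv6_total = ipaddress.ip_network("::/0").num_addresses       # 2**128

# How many complete IPv4 Internets fit inside the IPv6 address space.
ratio = ipv6_total // ipv4_total  # exactly 2**96

print(f"IPv6 addresses: {ipv6_total:.2e}")  # 3.40e+38
print(f"IPv6 / IPv4:    {ratio:.2e}")       # 7.92e+28
```

The ratio shows why "a trillion times larger" is itself an understatement: a trillion is 10^12, while the IPv6 space is roughly 10^28 times the size of IPv4.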

    And according to Paul Gil over at About.com “These trillions of new IPv6 addresses will meet the internet demand for the foreseeable future.”  I certainly hope that is an understatement!


    If you have comments or can contribute additional information, please feel free to do so.  Thanks.  --Mark R. Fernandez, Ph.D.


    Follow me on Twitter @MarkFatDell

  • Insights from 26th Annual HPCC Conference

    The 26th annual HPCC Conference had the theme “Supercomputing: A Global Perspective,” and was held in Newport, RI at the end of March. The conference pulled together a variety of industry experts, including High Performance Computing (HPC) users and vendors. This blog includes some of my observations from the event.

    There were three main themes throughout the event:

    1. The Missing Middle;
    2. Manufacturing; and
    3. Exascale.

    John West, the Director of the Department of Defense’s (DOD) High Performance Computing Modernization Program, kicked off the event by discussing “The Missing Middle.” He asked, “given this unalloyed good that is HPC, how come everybody isn’t using it?” You can watch his entire message here.

    A New Focus on Manufacturing

    New to this year’s Newport HPCC show was a focus on Manufacturing. Per the Conference Leadership team:

    “Bringing HPC to manufacturing is an important initiative, in the U.S. and the rest of the world. Competitiveness is an elusive goal that requires continual refinement and adoption of new technologies. HPCC 2012 will highlight this critical area with discussions on the application of HPC to modern manufacturing to address what many refer to as the ‘missing middle’ – referring to the thousands of small and mid-size businesses not currently taking advantage of high performance computing in areas such as design, manufacturing, logistics, transportation, etc.”

    Speakers in this area included:

    • Gardner Carrick, President of The Manufacturing Institute, who discussed “HPC in Manufacturing and Competitiveness”
    • Dawn White of Accio Energy, who presented “Wind Energy from a Revolutionary Concept”
    • Suzy Tichenor from Oak Ridge National Laboratory, who discussed the ORNL Industrial Partnership Program

     Interesting Debates About Achieving Exascale

    Panelists also discussed at length the quest for Exascale computing. How do we get there? What are the obstacles to achieving Exascale? What are the drivers that will get HPC in the United States to Exascale? What does Exascale computing provide the HPC market that current systems can’t achieve today?

    Our very own Dr. Mark Fernandez was on a panel discussing this. In fact, this last-day round table discussion included a “Lightning round” that really did a fantastic job of encapsulating the entire event’s worth of content into one 30-minute session.

    HPC Analyst Crossfire – Live from the National HPCC Conference 2012

    Other Interesting notables:

    • According to Addison Snell of Intersect 360, there is a $26 billion market for HPC
    • In 1976, a Cray delivering 80 MFLOPS cost $8.8 million; in 2012, an Apple iPad delivering 170 MFLOPS costs $499
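To put the Cray and iPad figures on a common footing, the following Python sketch works out the price per MFLOPS for each. It uses only the numbers quoted above, makes no inflation adjustment, and the helper name is made up for illustration.

```python
# Figures as quoted in the bullet above (nominal dollars, not inflation-adjusted).
cray_1976 = {"mflops": 80, "cost_usd": 8_800_000}
ipad_2012 = {"mflops": 170, "cost_usd": 499}

def usd_per_mflops(system):
    """Purchase price divided by delivered MFLOPS."""
    return system["cost_usd"] / system["mflops"]

cray = usd_per_mflops(cray_1976)
ipad = usd_per_mflops(ipad_2012)

print(f"Cray (1976): ${cray:,.0f} per MFLOPS")
print(f"iPad (2012): ${ipad:,.2f} per MFLOPS")
print(f"Improvement: roughly {cray / ipad:,.0f}x cheaper per MFLOPS")
```

On these numbers the cost of a MFLOPS fell by more than four orders of magnitude in 36 years.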

    Please leave a comment, or add any additional insights from the event.

    Troy Testa

  • NVIDIA's Tesla K40 GPU Accelerator Gaining Impressive Results

    A recent NVIDIA blog highlighted the impressive results clients have realized using NVIDIA's Tesla K40 GPU accelerator.

    The blog focuses on three very divergent applications: weather forecasting, Twitter trends, and financial risk analysis. Each of the applications has seen impressive improvement since using the accelerator.

    You can read the NVIDIA blog and learn more about the Tesla K40 GPU results here.

  • HPC Application Performance Study on 4S Servers

    by Ranga Balimidi, Ashish K. Singh, and Ishan Singh

    What can you do with a big bad 4-socket machine with 60 cores with up to 6TB memory in HPC? To help answer that question, we conducted a performance study using several benchmark suites such as HPL, STREAM, WRF and Fluent. This blog describes some of our results that help illustrate the possibilities. The server that we used for this study is the Dell PowerEdge R920. This server supports the family of processors in the Intel architecture code named Ivy Bridge EX. 

    The server configuration table outlines the configuration details used for this study as well as the configurations from a previous study performed in June 2010 with the previous generation of technology. We use these two systems to compare performance across technology refresh.

    Server Configuration

    PowerEdge R920 Hardware
    Processors: 4 x Intel Xeon E7-4870 v2 @ 2.3GHz (15 cores), 30M cache, 130W
    Memory: 512GB = 32 x 16GB 1333MHz RDIMMs

    PowerEdge R910 Hardware
    Processors: 4 x Intel Xeon X7550 @ 2.00GHz (8 cores), 18M cache, 130W
    Memory: 128GB = 32 x 4GB 1066MHz RDIMMs

    Software and Firmware for PowerEdge R920
    Operating System: Red Hat Enterprise Linux 6.5 (kernel version 2.6.32-431.el6 x86_64)
    Intel Compiler: Version 14.0.2
    Intel MKL: Version 11.1
    Intel MPI: Version 4.1


    Version 1.1.0

    BIOS Settings: System Profile set to Performance (Logical Processor disabled, Node Interleave disabled)

    Benchmarks & Applications for PowerEdge R920
    HPL: v2.1, from Intel MKL v11.1, problem size 90% of total memory
    STREAM: v5.10, array size 1800000000, iterations 100
    WRF: v3.5.1, input data Conus 12K, Netcdf-
    Fluent: v15, input data: eddy_417k, truck_poly_14m, sedan_4m, aircraft_2m
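The Linpack problem size listed above is given as 90% of total memory. As a rough sketch of how such a size is commonly derived (not necessarily how the authors did it): HPL factors an N x N matrix of 8-byte double-precision values, so N is about the square root of the usable bytes divided by 8. The block size of 192 and the function name below are assumptions for illustration only.

```python
import math

def hpl_problem_size(mem_bytes, fraction=0.90, block=192):
    """Largest N, rounded down to a multiple of `block`, whose N x N
    double-precision matrix fits within `fraction` of memory.

    `block` stands in for the HPL blocking factor NB; 192 is an assumed
    value, not one reported in this study.
    """
    n = int(math.sqrt(fraction * mem_bytes / 8))  # 8 bytes per double
    return n - n % block

GiB = 2 ** 30
n_r920 = hpl_problem_size(512 * GiB)  # PowerEdge R920, 512GB
n_r910 = hpl_problem_size(128 * GiB)  # PowerEdge R910, 128GB

print("R920 N:", n_r920)
print("R910 N:", n_r910)
```

Because memory grows by 4x between the two servers, N grows by only about 2x: the matrix has N squared entries.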

    Results and Analysis

    For this study, we compared the two servers across the four benchmarks described below.

    The aim of this comparison is to show the generation-over-generation changes in this four-socket platform. Each server was configured with the optimal software and BIOS configurations at the time of the measurements. The biggest differences in performance between the two server generations come from the improved system architecture, the greater number of cores, and the faster memory. The software versions are not a significant factor.


    The STREAM benchmark is a simple synthetic program that measures sustained memory bandwidth in MB/s. It uses four kernels, COPY, SCALE, SUM and TRIAD, to evaluate the memory bandwidth. The operations performed by these kernels are shown below:

    COPY:       a(i) = b(i)
    SCALE:     a(i) = q*b(i)
    SUM:         a(i) = b(i) + c(i)
    TRIAD:      a(i) = b(i) + q*c(i)
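The four kernels above can be written out in a few lines of Python to show what is being timed and how the bandwidth figure is derived. This is a toy sketch, not the STREAM benchmark itself: interpreter overhead dominates the timing, so the MB/s values it reports say nothing about real memory bandwidth. The byte counts follow STREAM's convention for 8-byte doubles (16 bytes per element for COPY and SCALE, 24 for SUM and TRIAD); all function names are illustrative.

```python
import time

def copy(a, b):
    for i in range(len(a)):
        a[i] = b[i]

def scale(a, b, q):
    for i in range(len(a)):
        a[i] = q * b[i]

def add(a, b, c):  # the SUM kernel
    for i in range(len(a)):
        a[i] = b[i] + c[i]

def triad(a, b, c, q):
    for i in range(len(a)):
        a[i] = b[i] + q * c[i]

# Bytes moved per element, counting each read and write of an 8-byte double.
BYTES_PER_ELEMENT = {"COPY": 16, "SCALE": 16, "SUM": 24, "TRIAD": 24}

def measure(n=200_000, q=3.0):
    """Time each kernel once and return a dict of nominal MB/s values."""
    a, b, c = [0.0] * n, [1.0] * n, [2.0] * n
    kernels = {
        "COPY": lambda: copy(a, b),
        "SCALE": lambda: scale(a, b, q),
        "SUM": lambda: add(a, b, c),
        "TRIAD": lambda: triad(a, b, c, q),
    }
    results = {}
    for name, kernel in kernels.items():
        start = time.perf_counter()
        kernel()
        elapsed = time.perf_counter() - start
        results[name] = BYTES_PER_ELEMENT[name] * n / elapsed / 1e6
    return results

if __name__ == "__main__":
    for name, mbps in measure().items():
        print(f"{name:6s} {mbps:10.1f} MB/s (toy figure)")
```

The real benchmark runs the same loops in compiled C or Fortran over arrays far larger than the caches, repeats them many times, and reports the best time for each kernel.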

    The chart below compares STREAM performance results from this study with results from the previous generation. In this study, STREAM yields 231GB/s memory bandwidth, which is twice the memory bandwidth measured in the previous study. This increase is due to the greater number of memory channels and the higher DIMM speed.

    The graph also plots the local and remote memory bandwidth. Local memory bandwidth is measured by binding processes to a socket and accessing only memory local to that socket (NUMA enabled, same NUMA node). Remote memory bandwidth is measured by binding processes to one socket and accessing only memory that is remote to that socket (remote NUMA node), which means going through the QPI link to reach that memory. The remote memory bandwidth is 72% lower than the local memory bandwidth due to the limited bandwidth of the QPI link.


    The Linpack benchmark measures how fast a computer solves linear equations and measures a system's floating-point computing power. It requires a software library for performing numerical linear algebra on digital computers; for this study we used Intel’s Math Kernel Library. The following chart illustrates results from a single server HPL performance benchmark.


    HPL yielded 4.67x sustained performance improvement in this study. This is primarily due to the substantial increase in the number of cores, increase in the FLOP/cycle of the processor and the overall improvement in the processor architecture.
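A back-of-the-envelope check helps put the 4.67x figure in context. The sketch below computes theoretical peak double-precision FLOPS for both servers from the core counts and clock speeds in the configuration table. The FLOP/cycle values are assumptions not stated in this study: 8 per core with AVX on the E7-4870 v2 and 4 per core with SSE on the X7550. Base clock speeds are used and turbo is ignored.

```python
def peak_gflops(sockets, cores_per_socket, ghz, flops_per_cycle):
    """Theoretical peak = sockets x cores x clock (GHz) x FLOP/cycle."""
    return sockets * cores_per_socket * ghz * flops_per_cycle

# Assumed double-precision FLOP/cycle per core: 8 (AVX) and 4 (SSE).
r920 = peak_gflops(sockets=4, cores_per_socket=15, ghz=2.3, flops_per_cycle=8)
r910 = peak_gflops(sockets=4, cores_per_socket=8, ghz=2.0, flops_per_cycle=4)

print(f"R920 peak: {r920:.0f} GFLOPS")
print(f"R910 peak: {r910:.0f} GFLOPS")
print(f"Peak ratio: {r920 / r910:.2f}x")
```

Under these assumptions the peak ratio comes out a little below the measured 4.67x, which would be consistent with the newer platform also sustaining a higher fraction of its peak.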


    The Weather Research and Forecasting (WRF) Model is a numerical weather prediction system designed to serve atmospheric research and weather forecasting needs. It features two dynamical cores, a data assimilation system, and a software architecture allowing for parallel computation and system extensibility. 


    We have taken the average time step as the metric to measure WRF performance. We used Conus 12km data set for this application.

    In the graph above we've plotted the WRF performance results from this study relative to results from the previous generation. Because the Intel E7-4870 v2 processor has more cores, we scaled WRF up to 60 cores and observed a significant performance increase while scaling. Matching the number of cores used on both platforms at 32 cores, we observed a significant performance improvement (2.9x) over the previous generation platform. When using the full capability of the server at 60 cores, there is an additional 35% improvement. When it comes to server-to-server comparison, the PowerEdge R920 performs ~4x better than the PowerEdge R910. This is due to the overall architecture improvements, including processor and memory technology.
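The ~4x server-to-server figure follows from the two numbers quoted above, assuming the additional 35% is multiplicative on top of the 32-core result. The sketch below shows that arithmetic, along with how a speedup is obtained from average time step values; the time step numbers in the usage line are hypothetical placeholders, not measurements from this study.

```python
def speedup(old_avg_step_s, new_avg_step_s):
    """Speedup from average time step: lower time per step is better."""
    return old_avg_step_s / new_avg_step_s

core_matched = 2.9   # R920 vs. R910, both at 32 cores (from the text)
full_server = 1.35   # further 35% gain going from 32 to 60 cores on the R920

server_to_server = core_matched * full_server
print(f"Server-to-server: {server_to_server:.1f}x")  # 3.9x

# Hypothetical time steps purely to show the helper in use.
print(f"Example: {speedup(4.0, 1.0):.1f}x")  # 4.0x
```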


    Ansys Fluent contains the broad physical modeling capabilities needed to model flow, turbulence, heat transfer, and reactions for industrial applications ranging from air flow over an aircraft wing to combustion in a furnace, from bubble columns to oil platforms, from blood flow to semiconductor manufacturing, and from clean room design to wastewater treatment plants.

    In the charts below, we have plotted the performance results from this study relative to results from the previous generation platform.

    We've used four input data sets for Fluent. We've considered “Solver Rating” (higher is better) as the performance metric for these test cases.

    For all the test cases, Fluent scaled very well with 100% CPU utilization. Comparing generation-to-generation at 32 cores on each platform, the R920 results are approximately 2x better than the previous generation in all the test cases. In the server-to-server comparison using all available cores, the R920 performs 3-3.5x better.

    These results were gathered by explicitly setting processor affinity at the MPI level. To do this, the following two configuration options were used:

    $ export HPMPI_MPIRUN_FLAGS="-aff=automatic"

    $ cat ~/.fluent
    (define (set-affinity argv) (display "set-affinity disabled"))


    The PowerEdge R920 server outperforms its previous generation server in both the benchmark and application comparisons studied in this exercise. The platform has the advantage over the previous generation in terms of latest processor support, increased memory speed and capacity, and overall system architecture improvements. This platform is a good choice for HPC applications that can scale up with the high processor core count (up to 60 cores) and large shared memory (up to 6TB). It is also a great choice for memory-intensive applications, given the large memory support.

  • My Sandy Bridge Processor System Is Running Slow!

    This week I was asked if I knew any reason that a Sandy Bridge system would run slower than an approximately equivalent Westmere system.
    [I would not normally blog about such a thing, but this is the third time in the last two weeks that this type of question has surfaced!]
    The Intel Sandy Bridge processor contains new instructions collectively called Advanced Vector Extensions, or AVX.  AVX provides up to double the peak FLOPS performance when compared to previous processor generations such as the Westmere.  To take advantage of these AVX instructions, the application *must* be re-compiled with a compiler version that supports AVX.  For the Intel Compiler Suite, that means version 11.1 or later; for GCC, version 4.6 or later.
    If an application has not been re-compiled with an AVX-aware compiler, the application will not be able to take advantage of these Sandy Bridge instructions.   And it will probably run slower than previously seen on older processors, even including Westmere processors with higher frequencies.
    Let me say this another way:  A Westmere executable will run fine on a Sandy Bridge system due to Intel’s commitment and extensive work to maintain backwards compatibility, but it will probably run slower with no errors or any indications of why.
    Furthermore, re-compiling “on” the Sandy Bridge processor, but using an older compiler (pre icc 11.1 or pre gcc 4.6) does not help.  Remember, use the latest compiler on those shiny new platforms!
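Before re-compiling, it can be worth confirming that the processor actually advertises AVX. The Python sketch below looks for the avx flag in the flags line that Linux on x86 exposes in /proc/cpuinfo. It only tells you what the hardware supports; it does not tell you whether a given executable was built to use AVX. The function names are illustrative.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of feature flags from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            return set(line.split(":", 1)[1].split())
    return set()

def supports_avx(cpuinfo_text):
    """True if the 'avx' feature flag is present."""
    return "avx" in cpu_flags(cpuinfo_text)

if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            print("AVX supported:", supports_avx(f.read()))
    except OSError:
        print("No /proc/cpuinfo here; this check is Linux-only.")
```

A Westmere system would report False here and a Sandy Bridge system True, yet the same old binary runs on both, which is exactly why the slowdown described above produces no error.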
    For just one example of Westmere vs. Sandy Bridge performance improvements that are possible, please see our blog at:
    HPC performance on the 12th Generation (12G) PowerEdge Servers:  http://dell.to/zozohn
    I know there are some codes for legal, certification or other reasons that cannot be “changed.”  But I certainly hope that this policy has not bled over into not even being able to re-compile apps to take advantage of new technologies.
    For additional information on AVX and re-compiling applications, see:
    Intel Advanced Vector Extensions:  http://software.intel.com/en-us/avx/
    How To Compile For Intel AVX:  http://software.intel.com/en-us/articles/how-to-compile-for-intel-avx/
    Optimizing for AVX Using MKL BLAS: http://software.intel.com/en-us/articles/optimize-for-intel-avx-using-intel-math-kernel-librarys-basic-linear-algebra-subprograms-blas-with-dgemm-routine/
    If you have comments or can contribute additional information, please feel free to do so.  Thanks.  --Mark R. Fernandez, Ph.D.



  • TACC's Stampede Rumbles the Ground: Operational Jan. 7, 2013

    It seems we've been sending out quite a lot of accolades and giving a lot of high fives to the folks at the University of Texas and the Texas Advanced Computing Center (TACC) lately - and with good reason! TACC's latest achievement came when its showcase supercomputer, Stampede, became operational on Jan. 7, 2013. By any account, within high performance computing (HPC) or scientific research, Stampede is a trailblazing system with many innovative features. In particular, Stampede was designed to support open science across all domains of science and engineering. So, while this machine has a peak performance of nearly 10 petaFLOPS, it may be even more impressive that Stampede serves as a general-purpose HPC resource designed to support a wide range of applications.

    At the SC12 event in Salt Lake City, TACC Director Jay Boisseau delivered an excellent presentation discussing Stampede, many of its technical capabilities, and the vision for the system. You can view the presentation in its entirety here:

    Below is a brief interview with TACC's Director of Scientific Applications Karl Schulz, who provides a more personal perspective on Stampede.

    Additionally, we are all still very excited about the University of Texas student team and its first-place finish at the SC12 Student Cluster Competition. Below is a video in which Dell's Steve Sofhauser interviews some of the students on that winning team. It is well worth the view:

    We have more great customer videos and other content from SC12 that you can view here http://dell.to/XwggcN.

  • University of Florida Supports Collaborative Science With Its World-class Supercomputer HiPerGator

    Imagine a researcher writing out complex equations and theories in a lab book, as he or she works on the world’s most challenging scientific problems. It’s a great image – and while I’m sure this still occurs – the nature of scientific discovery has evolved into a more collaborative effort, where data can be shared and worked on by scientists regardless of their location. Scientific research has become a collaborative discipline.

    Albert Einstein once said, “Imagination is more important than knowledge.” The idea that imagination is so vital to discovery, especially coming from a scientist as brilliant as Einstein, has always fascinated me.

    The University of Florida recently unveiled its powerful new supercomputer called HiPerGator that represents the key tool towards scientific collaboration, while enabling researchers to explore their imaginations like never before. HiPerGator truly represents the future of scientific research.
    In the official news announcement of HiPerGator, University of Florida President Bernie Machen summed up the role high performance computing (HPC) plays as a tool to enable imagination:

    “The [HiPerGator] computer removes the physical limitations on what scientists and engineers can discover. It frees them to follow their imaginations wherever they lead.”

    How exciting! Congratulations to the University of Florida, in advancing this bold project where the planning began back in 2004. HiPerGator delivers 150 teraFLOPS of performance and will be working on projects like *** design, fighting cancer and HIV, as well as climate change, and beyond. HiPerGator now sits proudly as the fastest computing system in the State of Florida.

    Tim Carroll of Dell with UF VP/CIO Elias Eldayrie - Credit University of Florida

    Another exciting aspect of this project is that HiPerGator represents only the start of a much bolder vision involving high performance computing for the University of Florida. The facility that houses HiPerGator has a lot of room for more racks of computing systems, and the capacity to provide the energy and cooling for future supercomputing systems for the next ten years.

    It’s that type of vision that has led to the fastest computer in the State of Florida, and sets the stage for the University of Florida to maintain its leadership as one of the top research institutions in the world for years to come.

    You can read more about HiPerGator by following some of the links below.

    University of Florida News Announcement: UF launches HiPerGator, the state’s most powerful supercomputer

    Direct2Dell blog: Congratulations to University of Florida for launching HiPerGator

    HPCWire: University of Florida Opens HiPerGator Jaws

    insideHPC: HiPerGator Debuts as Florida’s Most Powerful Supercomputer

    The Gainesville Sun: UF supercomputer supersizing research

  • University of Maryland Unveils Deepthought2 Supercomputer

    Recently, the University of Maryland unveiled one of the nation's fastest university-owned supercomputers. Deepthought2 is 10 times faster than its predecessor, the original Deepthought. 

    With a processing speed of about 300 teraflops, the new supercomputer has a petabyte of storage and is connected by an InfiniBand network. Deepthought2 was developed with high-performance computing solutions from Dell.

    Deepthought2 will allow the university's researchers to run simulations at a scale that previously required a national-level supercomputer. Based on current standings, Deepthought2 is expected to rank among the top 500 clusters in the world and among the top 10 at universities in the United States.

    The new system will allow the university to conduct advanced research activities in a multitude of disciplines: from studying the formation of the first galaxies to simulating fire and combustion. 

    Deepthought2 is a significant asset of the University of Maryland's new 9,000-square-foot Cyberinfrastructure Center, which  will provide power and the necessary space for Deepthought2. 

    You can read more about Deepthought2 at insideHPC, the Information Technology Newsletter for the University of Maryland, the University's independent student newspaper The Diamondback, or University Business.

  • Hadoop Summit Kicks Off

    The Hadoop Summit kicks off tomorrow, June 3. Running through Thursday, June 5, this is your opportunity to learn more about building, managing and operating relevant real-world applications and to find out about the latest in Big Data.

     With the Summit comes some exciting news, including:

    • The announcement of Dell PowerEdge servers' support of the Cloudera Enterprise 5 Reference Architecture
    • The introduction of Apache YARN (Yet Another Resource Negotiator) for Hadoop 2.0
    • The unveiling of Apache Spark, enabling applications in Hadoop clusters to run 100x faster in memory, and 10x faster even when running on disk

    You can read more about these innovative announcements and the Hadoop Summit in a blog authored by my colleague Armando Acosta.

    Stop by Dell's booth (G14) to talk with us about all the exciting changes. We'd love to see you!

  • It Takes a Spark to be Faster

    The Apache Software Foundation recently announced the first production-ready release of Spark for the Hadoop data-processing platform. Originally developed five years ago at the University of California, Berkeley's AMPLab (Algorithms, Machines and People Lab), Spark is an impressive open source in-memory processing engine built around speed, ease of use, and advanced analytics. It can be deployed with Hadoop or independently.

    Spark is an alternative to MapReduce – instead of running jobs in long batch modes, it runs jobs in bursts of short batches. Its key benefits come from reliable caching of intermediate data in memory as opposed to writing to disk every time. Through its ecosystem components, Spark can enable:

    -          SQL Queries (Shark)

    -          Streaming Analytics (Spark Streaming)

    -          Machine Learning Library (MLLib)

    -          Graph computation jobs (GraphX)

    So how are we seeing it deployed amongst industry leaders?  Some examples of its real-time uses include:

    • Machine-generated data collection and analysis, especially where data has to be joined from multiple sources
    • Stream Processing such as log analysis of live streams of alerts. 
    • Social data analysis
    • Recommendation engines

    With Spark, applications running in Hadoop clusters can run as much as 100x faster in memory, and up to 10x faster even when running on disk.  Applications can be written in Java, Scala, or Python.  You can read more about Apache Spark at Cloudera or Databricks.
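The idea behind the in-memory speedup can be illustrated without Spark at all. The toy Python sketch below runs the same multi-step job two ways: one writes every intermediate result to disk and reads it back before the next step, and the other keeps intermediate data in memory between steps. It uses only the standard library, is not Spark or MapReduce code, and all names in it are made up for illustration.

```python
import json
import os
import tempfile

def run_with_disk(data, steps):
    """Persist the intermediate result to disk after every step."""
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "stage.json")
        with open(path, "w") as f:
            json.dump(list(data), f)
        for step in steps:
            with open(path) as f:           # reload the previous stage
                current = json.load(f)
            current = [step(x) for x in current]
            with open(path, "w") as f:      # write this stage back out
                json.dump(current, f)
        with open(path) as f:
            return json.load(f)

def run_in_memory(data, steps):
    """Keep the intermediate result in memory between steps."""
    current = list(data)
    for step in steps:
        current = [step(x) for x in current]
    return current

if __name__ == "__main__":
    steps = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
    data = range(10)
    assert run_with_disk(data, steps) == run_in_memory(data, steps)
    print(run_in_memory(data, steps))
```

Both versions produce the same answer; the second simply skips a write and a read per step, and that saving grows with the number of steps and the size of the data.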

    Join us at Hadoop Summit 2014 this week to learn about some of the cool things Dell is doing! We're in Booth G14.  And attend the Dell, Cloudera and Intel Fireside Chat, Thursday, June 5th, 12:35-1:20 to discuss Hadoop tuning and real world benchmarking.