Dr. Jeff Layton
Dell Enterprise Technologist - HPC

In Part 2 of this article series we examined what current PetaFLOPS systems look like, focusing on the only two systems that have achieved a PetaFLOPS on the Top500 benchmark. One of these systems, Roadrunner at Los Alamos National Laboratory, is what is called a hybrid system. In short, it uses conventional CPUs as well as “accelerators” to improve performance. In the case of Roadrunner, the accelerators are Cell processors like those in your Sony PS3.

Then in the last article we explored what PetaFLOPS systems could look like in the future using only CPUs.

In this article I want to focus a bit more on hybrid systems. In particular, I want to look at hybrid PetaFLOPS systems that utilize GPUs (Graphics Processing Units), the same devices you find on video cards and in game consoles.


China Leads the Way in Hybrids

While we didn’t talk about the system in Part 2 of this series, the current #5 system on the Top500 is in China and uses a unique architecture. It combines ATI GPUs with Intel processors in a blade architecture to achieve 563.10 TFLOPS of sustained performance. It is well worth examining the system in a little more detail to understand what a hybrid CPU/GPU system looks like at a larger scale.

The Chinese National University of Defense Technology (NUDT) announced the system, which is called Tianhe-1 (TH-1). Figure 1 below shows the system as a whole (image taken from http://wccftech.com/forum/news-around-the-web/28232-china-unveils-its-fastest-supercomputer.html).

National University of Defense Technology (NUDT) - TIANHE-1 (TH-1)

Figure 1 – TH-1 from above



The system consists of 2,560 compute nodes, which are blades. Each blade is a dual-socket board with two Intel E5540 processors (Nehalem at 2.53 GHz) along with two ATI Radeon 4870 GPUs that each have a PCIe x16 connection to the motherboard. Figure 2 below illustrates the board layout. (http://wccftech.com/forum/news-around-the-web/28232-china-unveils-its-fastest-supercomputer.html)


TH-1 board layout

Figure 2 – TH-1 board layout



The two objects circled in red are the ATI GPUs. (You can see the Radeon label on the one in the lower right-hand corner).

According to the report on the Top500 list (http://www.top500.org/blog/2009/11/13/tianhe_1_chinas_first_petaflop_s_scale_supercomputer), TH-1 has the following configuration:

  • 80 racks
  • 2,560 compute nodes
    • Two sockets with Intel E5540 processors
    • Two ATI Radeon 4870 cards each with x16 connectors
    • 32GB of memory
  • 512 operation nodes
  • InfiniBand is used (specific speed is not specified)
    • Two tiers
    • The first tier has 9 switches for the 2,560 compute nodes (284-285 ports per switch)
    • The first-tier switches connect to the second tier via 18 uplinks. There are 4 second-tier switches.


One interesting thing in the article is that they down-clocked the GPUs to achieve greater stability. They decreased the core frequency from 750 MHz to 575 MHz and down-clocked the GPU memory frequency from 900 MHz to 650 MHz. This suggests that they either needed to reduce performance because of heat issues, or perhaps that consumer-grade GPU manufacturers like to clock their cards aggressively to get better video performance. That is fine for shooting zombies, because if a pixel is wrong and the zombie’s eyes are blue instead of black, it’s not a big deal. But if you get memory errors in a computational algorithm because of over-clocking, then you have a problem. (You get wrong answers.)

The ATI Radeon 4870 is a member of what you might call the first generation of real GP-GPUs (General Purpose GPUs). It is capable of 1.2 TeraFLOPS in single precision or 240 GFLOPS in double precision. It has the best double precision performance of any first generation GP-GPU.

To determine how much compute power the GPUs are providing, let’s work backwards from the total performance – 563.10 TeraFLOPS. There are 2,560 compute nodes each with 2 sockets and Intel Nehalem processors at 2.53 GHz. The theoretical peak performance of each node is:

2.53 GHz * 4 ops/clock * 4 cores * 2 sockets = 80.96 GFLOPS


Let’s assume a somewhat conservative efficiency of 72% in estimating the “real” performance. Then the estimated real HPL performance of the CPUs is the following:

(80.96 * 0.72 * 2560) / (1000 GFLOPS/TFLOPS) = 149.225 TFLOPS


This means that the CPUs provided about 26.5% of the total HPL performance of the system. This also means that the GPUs contributed about 413.875 TFLOPS in performance (73.5% of the total).

The theoretical performance of each GPU is 240 GFLOPS in Double Precision. (I’m assuming that double precision was used throughout the HPL run.) Therefore the theoretical performance of the GPUs is the following:

240 GFLOPS * 2 GPUs per node * 2,560 nodes = 1,228,800 GFLOPS = 1228.8 TFLOPS (1.2 PFLOPS!)


Therefore the efficiency of the GPUs in contributing to the overall performance is:

413.875 / 1228.8 = 33.68%
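
For readers who want to reproduce the arithmetic, here is a minimal Python sketch of the whole back-calculation above. The node count, clock rate, 72% CPU efficiency, and 240 GFLOPS double precision peak per GPU are the assumptions stated in the text, not measured values.

# Back-of-envelope split of TH-1's 563.10 TFLOPS HPL result between CPUs and GPUs.

NODES = 2560
HPL_TOTAL = 563.10                                  # TFLOPS (reported Top500 result)

cpu_peak_per_node = 2.53 * 4 * 4 * 2                # GHz * ops/clock * cores * sockets = 80.96 GFLOPS
cpu_hpl = cpu_peak_per_node * 0.72 * NODES / 1000   # ~149.225 TFLOPS estimated CPU contribution

gpu_contribution = HPL_TOTAL - cpu_hpl              # ~413.875 TFLOPS attributed to the GPUs
gpu_peak = 240 * 2 * NODES / 1000                   # 1228.8 TFLOPS theoretical DP peak for all GPUs
gpu_efficiency = gpu_contribution / gpu_peak        # ~33.7%

print(f"CPU share: {cpu_hpl / HPL_TOTAL:.1%}, GPU share: {gpu_contribution / HPL_TOTAL:.1%}")
print(f"GPU HPL efficiency: {gpu_efficiency:.2%}")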


While the efficiency seems small, you must remember that this is the first large-scale hybrid system with GPUs. Moreover, all of the software was new as well. Consequently, the number is actually pretty good in my opinion (plus it will only get better).

As an aside, AMD (ATI) is saying (http://blogs.amd.com/nigeldessau/2009/11/20/dreaming-of-dumplings-2/) that 60% of the real performance comes from the GPUs.


Nvidia Testing
Nvidia has been working very diligently to develop the software ecosystem around their GPUs. As part of this effort, there is a presentation by Dr. Fatica (http://www.ece.neu.edu/groups/nucar/GPGPU/GPGPU-2/Fatica.pdf) that discusses running Linpack (HPL) on Nvidia GPUs.

In the study Dr. Fatica used an 8-node cluster where each node had two sockets, each with a quad-core Intel E5462 processor (2.8 GHz). Every two nodes shared an Nvidia Tesla S1070-500 (which contains four GPUs), giving two Tesla GPUs to each node. The nodes were all connected with SDR InfiniBand. He was able to achieve 1.258 TFLOPS (1,258 GFLOPS) with this configuration. To determine how efficient the Tesla GPUs were, we can again work backwards from the final performance.

Assuming 72% efficiency for SDR InfiniBand, each node is capable of producing the following HPL performance:

2.8 GHz * 4 ops/clock * 4 cores/socket * 2 sockets * 0.72 = 64.512 GFLOPS


Then if we take the performance from each node (1/8 of 1258 GFLOPS) and subtract the CPU performance we obtain the following:

(1258 / 8) – 64.512 = 92.738 GFLOPS


Since there are two GPUs per node, this means that each GPU contributes 46.369 GFLOPS.
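
The same style of back-calculation can be written as a short Python sketch. The 1,258 GFLOPS total is the measured result; the 72% CPU efficiency is the assumption used in the text.

# Per-GPU HPL contribution in the 8-node Tesla cluster described above.

NODES = 8
hpl_total = 1258.0                                   # GFLOPS, measured HPL result
cpu_per_node = 2.8 * 4 * 4 * 2 * 0.72                # ~64.512 GFLOPS per dual-socket E5462 node
gpu_per_node = hpl_total / NODES - cpu_per_node      # ~92.738 GFLOPS left for the two GPUs
gpu_each = gpu_per_node / 2                          # ~46.369 GFLOPS per Tesla GPU

print(f"Per GPU: {gpu_each:.3f} GFLOPS")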


Hybrid GPU Systems – The Next Generation


This article is all about projecting what the next generation of hybrid GPU systems will look like and how they can meet the 1 PFLOPS performance target. So let’s take a look at the building blocks for these next-generation hybrid systems.

In the previous article we examined what the next CPUs could look like and what PFLOPS class systems would look like based on these processors. We need to do the same thing for the next generation GPUs.

Nvidia’s next-generation GPU, which is due out sometime in 2010, is code-named Fermi. It has a number of new features including ECC memory, additional and larger caches, and increased double precision performance (key to HPL performance). According to this article (http://www.pcper.com/article.php?aid=789), it offers anywhere from a 4x to 8x improvement in performance. So let’s be optimistic and choose 8x as the performance boost for Fermi. That means that a Fermi GPU should be capable of:

Estimated Fermi HPL Performance = 46.369 * 8 = 370.952 GFLOPS


This is for “real” HPL performance and not peak performance.

ATI (AMD) is also introducing a new GPU for computing, code-named Evergreen (http://en.wikipedia.org/wiki/Evergreen_%28GPU_family%29). They have already introduced consumer video cards using the GPU. According to the same article, the peak single precision performance is 2.72 TFLOPS. The peak double precision performance is obtained by dividing the single precision performance by 5:

Peak Double Precision performance = 2720 GFLOPS / 5 = 544 GFLOPS


If we assume the GPU efficiency observed on the Chinese TH-1 system (33.68%), then the estimated HPL performance is the following:


Actual HPL performance = 544 * 0.3368 = 183.2 GFLOPS
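
Both estimates can be captured in one small Python sketch. These are speculative projections, not measurements; the 8x Fermi improvement and the 33.68% TH-1 GPU efficiency are the assumptions carried over from the text.

# Speculative HPL estimates for the next-generation GPUs discussed above.

tesla_hpl_per_gpu = 46.369                       # GFLOPS, derived for the Tesla S1070 above

fermi_hpl_est = tesla_hpl_per_gpu * 8            # ~370.95 GFLOPS, optimistic 8x case

evergreen_dp_peak = 2720.0 / 5                   # 544 GFLOPS (peak SP divided by 5)
evergreen_hpl_est = evergreen_dp_peak * 0.3368   # ~183.2 GFLOPS at TH-1 GPU efficiency

print(f"Fermi (estimated):     {fermi_hpl_est:.2f} GFLOPS")
print(f"Evergreen (estimated): {evergreen_hpl_est:.1f} GFLOPS")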



Projections – System 1: The year 2010

As a generic approach, let’s assume that we use the Fermi card since it is faster for now. Also, based on the previous article about using only CPUs, the highest performing configuration is a 4-socket AMD Magny-Cours system, which should be able to achieve about 441.6 GFLOPS of peak performance per node. Assuming it is about 82% efficient with QDR InfiniBand, the real performance per node is about 362.11 GFLOPS.

Let’s also assume we can put 4 Fermi cards in a 4U box with the 4-socket server. The following table summarizes the configuration of the system:

AMD Magny-Cours 2010
CPU Performance per Node (GFLOPS): 362.11
Total GPU Performance per Node (GFLOPS): 1483.808
Total Performance per Node (GFLOPS): 1845.92
Target Performance (GFLOPS): 1,000,000
Number of Nodes Needed: 542
Number of Racks (assuming 4U nodes): 54.2



So the fastest system for the Top500 benchmark that can be built in 2010 will have 4U nodes with four AMD Magny-Cours sockets and 4 Nvidia Fermi cards. The result is that you will need 542 nodes to reach 1 PFLOPS. Compare this to using only CPUs, where you would need 2,665 nodes.

If you also look closely at the table you will see that the GPUs are providing 80.4% of the computational power of the nodes. This illustrates the great potential computing power of GPUs.
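
Here is a minimal Python sketch of the node-count arithmetic for this configuration. The per-node CPU figure (362.11 GFLOPS) and per-GPU figure (370.952 GFLOPS) are the assumed values from the table above, and nodes_needed is a hypothetical helper that simply divides the 1 PFLOPS target by the per-node performance; the rack count assumes 10 4U nodes per rack, as the tables above imply.

# Node-count projection for the 2010 AMD Magny-Cours + Fermi configuration.
import math

def nodes_needed(cpu_gflops, gpu_gflops, gpus_per_node, target_gflops=1_000_000):
    """Return (node count, racks at 10 x 4U nodes per rack, GPU share of per-node performance)."""
    per_node = cpu_gflops + gpu_gflops * gpus_per_node   # GFLOPS per node
    nodes = target_gflops / per_node
    return math.ceil(nodes), nodes / 10.0, gpu_gflops * gpus_per_node / per_node

nodes, racks, gpu_frac = nodes_needed(362.11, 370.952, 4)
print(f"{nodes} nodes, {racks:.1f} racks, GPUs provide {gpu_frac:.1%}")
# -> 542 nodes, 54.2 racks, GPUs provide 80.4%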


Projections – System 2: The year 2011

The next jump in CPU performance will occur in 2011 when Intel launches its next-generation processor, code-named Sandy Bridge. The table below illustrates what a Sandy Bridge system with 4 Nvidia Fermi cards would look like to reach 1 PFLOPS.

Intel Sandy Bridge 2011
CPU Performance per Node (GFLOPS): 307.53
Total GPU Performance per Node (GFLOPS): 1483.808
Total Performance per Node (GFLOPS): 1791.34
Target Performance (GFLOPS): 1,000,000
Number of Nodes Needed: 559
Number of Racks (assuming 4U nodes): 55.8



Notice that in this case the GPUs are providing about 82.8% of the computational power (a little bit more than in the previous system).

It is interesting that in 2011, the Intel Sandy Bridge system is a bit larger than the 2010 Magny-Cours system. For Intel you need 559 nodes while with AMD you only need 542 nodes (a little bit fewer).


Projections – System 3: The year 2013

If you refer to the previous article you will see that I projected two CPUs for 2013 – one from Intel and one from AMD. Let’s use these processors as the basis for systems with 4 GPUs in them, but let’s increase the GPU performance modestly to 600 GFLOPS per GPU on the assumption that Nvidia and ATI will have new GPUs by that time.

The resulting systems are in the tables below:

Intel 2013
CPU Performance per Node (GFLOPS): 615.07
Total GPU Performance per Node (GFLOPS): 2400.00
Total Performance per Node (GFLOPS): 3015.07
Target Performance (GFLOPS): 1,000,000
Number of Nodes Needed: 332
Number of Racks (assuming 4U nodes): 33.2



Notice that about 79.6% of the performance comes from the GPUs. More importantly, this system has only 332 nodes, which easily fits into a single InfiniBand switch, greatly simplifying the configuration!

The AMD system is summarized in the following table:

AMD 2013
CPU Performance per Node (GFLOPS): 1511.42
Total GPU Performance per Node (GFLOPS): 2400.00
Total Performance per Node (GFLOPS): 3911.42
Target Performance (GFLOPS): 1,000,000
Number of Nodes Needed: 256
Number of Racks (assuming 4U nodes): 25.6



In this system only 61.4% of the computational performance comes from the GPUs. But it also shows that you only need 256 nodes to reach 1 PFLOPS! Many systems today are already this size (in fact, a very large number of systems are this size or larger).
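
Applying the same node-count sketch to the 2011 and 2013 configurations reproduces the numbers in the tables above; all per-node figures are the assumed (projected) values from the text, not measurements, and nodes_needed is the same hypothetical helper repeated so the snippet stands alone.

# Node-count projections for the 2011 and 2013 configurations.
import math

def nodes_needed(cpu_gflops, gpu_gflops, gpus_per_node, target_gflops=1_000_000):
    """Return (node count, racks at 10 x 4U nodes per rack, GPU share of per-node performance)."""
    per_node = cpu_gflops + gpu_gflops * gpus_per_node
    nodes = target_gflops / per_node
    return math.ceil(nodes), nodes / 10.0, gpu_gflops * gpus_per_node / per_node

configs = {
    "Intel Sandy Bridge 2011 + 4x Fermi": (307.53, 370.952),
    "Intel 2013 + 4x future GPU":         (615.07, 600.0),
    "AMD 2013 + 4x future GPU":           (1511.42, 600.0),
}

for name, (cpu_gflops, gpu_gflops) in configs.items():
    nodes, racks, gpu_frac = nodes_needed(cpu_gflops, gpu_gflops, 4)
    print(f"{name}: {nodes} nodes, {racks:.1f} racks, GPUs provide {gpu_frac:.1%}")
# -> 559 nodes / 55.8 racks / 82.8%, 332 nodes / 33.2 racks / 79.6%, 256 nodes / 25.6 racks / 61.4%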



Summary

Hybrid systems that include GPUs as part of the computational power of the system are probably the future of HPC. These “accelerators” can easily provide a big jump in performance in a very dense package. The TH-1 system in China has proven that it is very possible to build a hybrid system from commodity components and achieve near-PFLOPS performance. With the new GPUs from Nvidia and ATI and the new processors from Intel and AMD, achieving 1 PFLOPS isn’t difficult at all.

The new AMD Magny-Cours systems with 4 GPUs can easily result in a system that is very “buildable” today. If you want to reach 1 PFLOPS today, this is obviously the easy way to do it. If you can wait two years, you will be able to build a system that fits into fewer than 26 racks and takes only 256 nodes!

However, if everything were this simple why aren’t people building 1 PFLOPS systems routinely? The answer is that things are not always as simple as they seem. The next article in this series will talk about problems with large systems.

-- Dr. Jeff Layton