Jeff Layton, Ph.D.
Dell Enterprise Technologist - HPC
As the introductory article mentioned, PetaFLOPS may be here faster than you may realize. You can construct PetaFLOPS systems today, as discussed in Part 1 of this series, because Jaguar at Oak Ridge and Roadrunner at Los Alamos have already hit the PetaFLOPS level. However, it’s definitely not cheap or easy to reach this level. This article examines what PetaFLOPS systems might look like today and in the near future using CPUs.

Measuring Performance:
To determine when we reach 1 PFLOPS we need an objective measure. While there are arguments for and against the Top500 benchmark, this article will assume that the performance of systems is measured by the HPL (Top500) application. An idealized model estimates performance from the number of double-precision operations per clock cycle, the clock speed, the total number of cores, and the “efficiency” of the system on HPL. Basically:

Performance = Number of operations per clock * clock speed * number of cores * efficiency

These simple numbers, operations per clock (abbreviated “ops per clock”), clock speed, number of cores, and efficiency, will allow us to estimate performance. However, making educated estimates of them is not easy and always leaves room for debate. In addition, these numbers have implications for configurations, particularly the network configuration.
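
As a minimal sketch of this model, the Python below computes a theoretical peak and an estimated HPL number. It assumes the clock speed enters the product, which is how the per-node peak figures later in this article work out. The function names, the example node, and the 0.85 efficiency are illustrative choices, not part of any library.

```python
def peak_gflops(ops_per_clock, clock_ghz, cores):
    """Theoretical peak in GFLOPS: operations per clock * clock (GHz) * cores."""
    return ops_per_clock * clock_ghz * cores


def hpl_gflops(ops_per_clock, clock_ghz, cores, efficiency):
    """Estimated HPL performance: peak scaled by the HPL efficiency (0 to 1)."""
    return peak_gflops(ops_per_clock, clock_ghz, cores) * efficiency


# Hypothetical example: one 2-socket node with 4 cores per socket at 2.93 GHz,
# 4 double-precision operations per clock, and an assumed efficiency of 0.85.
node_peak = peak_gflops(4, 2.93, 8)
node_hpl = hpl_gflops(4, 2.93, 8, 0.85)
print(f"peak: {node_peak:.2f} GFLOPS, estimated HPL: {node_hpl:.2f} GFLOPS")
```

Multiplying the per-node result by the number of nodes gives the system estimate, so the same two functions work for a node or a whole cluster if you pass in the total core count.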

Let’s take a look at current CPUs from Intel and AMD to help us develop these three numbers.


Intel Systems:
Intel is using what they call a “tick-tock” model of processor development. The “tick” is a die shrink and the “tock” is a change to the architecture. The current generation of Nehalem (Xeon 5500) processors is in the “tock” phase of the process, and the upcoming processors, code-named Westmere, are a “tick”, shrinking the die to 32 nm. So the Nehalem and Westmere processors have the following key parameters (key to our estimates of the performance):

· Nehalem:
o Clock speed is limited to 2.93 GHz at 95W
o Capable of four operations per clock (double precision)
o Four cores per socket
o Restricted to 2 sockets per node

· Westmere:
o It is assumed that clock speeds will remain about the same for the same power. So the top speed is likely to be approximately 2.93 GHz or 3.0 GHz (Note: this is an assumption since Intel has not yet announced the processor)
o 6 cores per socket (http://www.theregister.co.uk/2010/02/03/intel_westmere_ep_preview/)
o Capable of four operations per clock (again, this is an assumption since this processor is a “tick” and is only a die shrink).
o Restricted to 2 sockets per node

· Nehalem-EX:
o Maximum speed looks to be 2.26 GHz (http://en.wikipedia.org/wiki/Xeon#Beckton)
o 8 cores per socket (http://en.wikipedia.org/wiki/Xeon#Beckton)
o Four operations per clock (this is an assumption, since this processor is based on the same Nehalem architecture).
o 4-socket and larger systems are allowed (http://en.wikipedia.org/wiki/Xeon#Beckton)


The next generation of processors after Westmere involves an architecture change. This generation, called Sandy Bridge, has some of the following characteristics:

· Next generation (Sandy Bridge):
o 2.8 GHz to 3.4 GHz without Turbo (http://en.wikipedia.org/wiki/Intel_Sandy_Bridge_%28microarchitecture%29)
o 8 flops per clock (http://en.wikipedia.org/wiki/Intel_Sandy_Bridge_%28microarchitecture%29)
o Sandy Bridge B2: 8-core and 6-core variants (http://en.wikipedia.org/wiki/Intel_Sandy_Bridge_%28microarchitecture%29)
o I have kept the number of sockets per system at 2 (just conjecture)

AMD:
AMD is taking a slightly different approach compared to Intel by putting more cores in their CPUs with lower clock speeds. The list of processors starts with the current processor, the 6-core “Istanbul” processor. Then it continues with the next-generation processors code-named Magny-Cours (named after a Formula 1 race track).

· Istanbul
o This is AMD’s current generation CPU
o 2.8 GHz is the maximum clock speed at this time (that I know of)
o Four operations per clock

· Magny-Cours (G34):
o Next generation AMD processor
o 8-core and 12-core per socket versions (http://blogs.amd.com/work/tag/magny-cours/)
o 2.4 GHz appears to be the fastest clock speed (http://www.techspot.com/news/37972-amd-begins-shipping-magnycours-opteron.html)
o Four operations per clock (This is an assumption since the processor hasn’t been announced).
o Up to 4 sockets per system

Future Intel and AMD Processors:
Intel and AMD have both hinted at processors beyond the next generation, with AMD providing more detail on a relatively near-term processor code-named Interlagos. However, for these processors I have taken some “poetic” license and played something of a “what-if” game to project performance. While these projections are fun to a certain extent, they also help in determining what processors will have to deliver to reach 1 PetaFLOPS for certain system configurations. I have no knowledge of these processors beyond what I read in the open news, like everyone else.

Interlagos uses the same basic socket as the upcoming Magny-Cours processor but increases the number of cores per socket to 16 (http://www.cio.com/article/490450/Inside_AMD_s_16_Core_Interlagos_Server_Chip). For the purposes of this article I lowered the clock speed to 2.3 GHz. However, I have left the number of operations per clock at 4 compared to Intel’s Sandy Bridge processor that has 8 operations per clock (they are both to be released about the same time).

Beyond Interlagos, I have taken guesses at what configurations might look like. These configurations help in determining what processor performance has to be to achieve 1 PetaFLOPS for a given system configuration. These processors are purely conjecture at this point.

Future Intel Processor:
· Increased the number of cores per socket to 16 (doubling the number of cores from Nehalem-EX).
· Kept clock speed constant at 2.93 GHz
· Kept number of operations per clock at 8
· 2-socket systems

Future AMD Processor:
· Increased the number of cores per socket to 24 (doubling the number of cores of Magny-Cours)
· Kept clock speed at 2.4 GHz (conjecture)
· Increased number of operations per clock to 8
· Allowed for 4-socket systems

Below is a table of results for some real systems and some conjectured systems. I have left the decimals in just so you can reproduce the results if you like. We all know that you can’t have 0.2 of a node or 0.16 of an IB port. :-)


Results for Pure CPU Driven Systems:

Pure CPU Driven Systems

Year                                            2009          2009          2010          2010          2010          2011          2011         2013+         2013+
Vendor                                           AMD         Intel         Intel           AMD         Intel           AMD         Intel         Intel           AMD
Processor                                   Istanbul       Nehalem      Westmere   Magny-Cours    Nehalem-EX    Interlagos  Sandy Bridge  Future Intel    Future AMD
FLOPS per Clock                                    4             4             4             4             4             4             8             8             8
Clock Speed (GHz)                                2.8          2.93          2.93           2.4          2.26           2.3          2.93          2.93           2.4
Number of Cores per Socket                         6             4             6            12             8            16             8            16            24
Number of Sockets per Node                         2             2             2             4             4             4             2             2             4
Per Node GFLOPS (peak)                        134.40         93.76        140.64        441.60        289.28        588.80        375.04        750.08      1,843.20
Number of Nodes                             7,440.48     10,665.53      7,110.35      2,264.49      3,456.86      1,698.37      2,666.38      1,333.19        542.53
Peak GFLOPS                                1,000,000     1,000,000     1,000,000     1,000,000     1,000,000     1,000,000     1,000,000     1,000,000     1,000,000
QDR Efficiency                                  0.85          0.85          0.85          0.85          0.85          0.85          0.85          0.85          0.85
Number of Nodes for real performance        8,753.50     12,547.68      8,365.12      2,664.11      4,066.89      1,998.08      3,136.92      1,568.46        638.28
Number of Racks for 1U                        208.42        298.75        199.17         63.43         96.83         47.57         74.69         37.34         15.20
Number of Racks for 2U                        416.83        597.51        398.34        486.31        193.66         99.90        149.38         74.69         31.91
Number of Racks for Blades                    136.77        196.06        130.71         83.25        127.09         62.44         49.01         24.51         19.95
Number of IB ports                          8,753.50     12,547.68      8,365.12      2,664.11      4,066.89      1,998.08      3,136.92      1,568.46        638.28
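
If you want to reproduce the node counts, the arithmetic is short. The sketch below is an illustration and not the spreadsheet used for the table: the function and parameter names are made up for this example, and it uses the inputs from the Intel Nehalem column.

```python
def nodes_for_target(flops_per_clock, clock_ghz, cores_per_socket,
                     sockets_per_node, target_gflops=1_000_000.0,
                     efficiency=0.85):
    """Return (per-node peak GFLOPS, nodes for peak target, nodes for HPL target)."""
    node_peak = flops_per_clock * clock_ghz * cores_per_socket * sockets_per_node
    nodes_peak = target_gflops / node_peak
    # Dividing by the efficiency gives the nodes needed to *measure* the target.
    nodes_real = nodes_peak / efficiency
    return node_peak, nodes_peak, nodes_real


# Inputs taken from the Intel Nehalem column of the table.
peak, n_peak, n_real = nodes_for_target(4, 2.93, 4, 2)
print(f"{peak:.2f} GFLOPS/node, {n_peak:,.2f} nodes (peak), {n_real:,.2f} nodes (HPL)")
```

With one IB port per node, which is the assumption behind the last row of the table, the port count is simply the HPL node count.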




Observations:
The table has some interesting results. As you can see, it takes a large number of nodes today to reach 1 PFLOPS. But at the same time, the number of cores per socket from both Intel and AMD is increasing rapidly. If the “Future AMD” processor is close to correct, you will be able to get a four-socket node with 96 cores and about 1.8 TFLOPS of peak performance in a single node!

There is a lot of data so let’s look at two trends. The first trend is the number of nodes required to reach 1 PFLOPS using QDR InfiniBand for both Intel and AMD:

Number of nodes required to reach PetaFLOPS



Currently (2009) you need anywhere from about 9,000 to 12,500 two-socket nodes to reach 1 PFLOPS on the HPL benchmark.

When the new processors from Intel and AMD are introduced this year, you can see a big drop in the number of nodes to about 2,600 to 4,000. The fewest nodes from Intel come from a four-socket Nehalem-EX system. Remember that Nehalem-EX will have 8 cores per socket, so a four-socket system will have 32 cores! AMD will have four-socket systems built around Magny-Cours that will have up to 48 cores per node, resulting in about 2,600 nodes to reach 1 PFLOPS.

Then in 2013 with my projected (really, desired) processors from AMD and Intel, the number of nodes drops to about 640 to 1,600. With this number of nodes it is very easy to create a simple, cost-effective fat-tree topology with InfiniBand. However, these nodes have a massive number of cores (32 to 96), so whether a single QDR InfiniBand link is enough for each node is an interesting question.
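
One rough way to frame that question is to look at how much of a single link each core gets. The sketch below is only a back-of-the-envelope illustration: it assumes one 4x QDR InfiniBand link per node with a 32 Gbit/s data rate (40 Gbit/s signaling with 8b/10b encoding) shared evenly across cores, and the function name is made up for this example. Real HPL traffic is not spread this evenly.

```python
QDR_4X_DATA_GBITS = 32.0  # assumed: 4 lanes * 10 Gbit/s signaling * 8/10 encoding


def link_share_per_core(cores_per_node, link_gbits=QDR_4X_DATA_GBITS):
    """Even share of one network link per core, in Gbit/s."""
    return link_gbits / cores_per_node


for label, cores in [("Nehalem, 2 sockets", 8),
                     ("Future Intel, 2 sockets", 32),
                     ("Future AMD, 4 sockets", 96)]:
    print(f"{label}: {cores} cores, {link_share_per_core(cores):.2f} Gbit/s per core")
```

Going from 8 cores to 96 cores per node cuts each core's share of the link by a factor of 12, which is why the single-link assumption deserves a second look for the largest nodes.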

Another way to look at the results is to plot the amount of floor space or number of racks the systems will take. This is an interesting question because the four-socket nodes will likely have to be put into a 2U chassis or larger to handle the heat.


Number of racks to reach PetaFLOPS


In this plot, to even out the comparison, I assume a 1U chassis for two-socket systems and a 2U chassis for four-socket systems. While denser packaging is available (e.g. blades), using rack-mount nodes keeps things simple. The numbers also take into consideration only the nodes themselves, so switches, storage, management nodes, etc., are not included.
Notice that today we need 200-300 racks to reach 1 PFLOPS on the Top500 benchmark (wow, that’s a lot of racks). With the new processors from AMD and Intel this drops to about 130 to 200 racks. This is still a very large number of racks.

In 2013 we get to about 31-75 racks. While still a very large number of racks, this is much more manageable.
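
The rack counts follow directly from the node counts once you pick a chassis size. The sketch below assumes a 42U rack filled only with compute nodes (42 1U nodes or 21 2U nodes per rack) and applies the 1U-for-two-socket, 2U-for-four-socket rule described above. The constant and function names are illustrative.

```python
RACK_UNITS = 42  # assumed usable rack height in U, filled with compute nodes only


def racks_needed(num_nodes, sockets_per_node):
    """Racks needed with 1U nodes for 2-socket and 2U nodes for 4-socket systems."""
    chassis_u = 1 if sockets_per_node <= 2 else 2
    nodes_per_rack = RACK_UNITS // chassis_u
    return num_nodes / nodes_per_rack


# Node counts from the "Number of Nodes for real performance" row of the table.
print(f"AMD Istanbul (2-socket, 1U):     {racks_needed(8753.50, 2):.2f} racks")
print(f"Intel Nehalem-EX (4-socket, 2U): {racks_needed(4066.89, 4):.2f} racks")
```

Adding switches, storage, and management nodes would only increase these counts, so treat them as a lower bound on floor space.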

Notice that there are two trends driving the reduction in the number of nodes:
1. Increasing the number of cores
2. Increasing the number of operations per clock

For the first driver, the number of cores per socket is doubling every 2-3 years! The second driver is increasing the amount of work the processor can do per clock cycle. It looks like Intel’s Sandy Bridge processor will double the number of operations per clock to 8. All else being equal, this will effectively halve the number of nodes required to reach 1 PFLOPS.
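
To see why either driver halves the node count, it helps to write the node count explicitly. The notation here is my own shorthand for this sketch: N is the number of nodes, P_target the target performance, o the operations per clock, f the clock speed, c the cores per socket, s the sockets per node, and η the HPL efficiency.

```latex
N = \frac{P_{\text{target}}}{o \cdot f \cdot c \cdot s \cdot \eta}
\qquad\Longrightarrow\qquad
N\big|_{o \to 2o} = \frac{N}{2},
\qquad
N\big|_{c \to 2c} = \frac{N}{2}
```

Doubling both at once, with clock speed, socket count, and efficiency held fixed, cuts the node count to a quarter.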

Final Comments:
In this article I’ve looked at what I call “pure CPU” systems for reaching 1 PFLOPS. This means the systems only use CPUs for computing. Moreover, the performance of the systems is measured solely by the Top500 benchmark (HPL). We all know that the Top500 is a single application and that real performance is determined by running your particular workload, but the Top500 benchmark gives us an arbitrary benchmark to compare systems.

These “what-if” scenarios can be fun to play periodically but they serve a very useful purpose. They can tell us if current processors and system packaging are up to the task of reaching 1 PFLOPS. Perhaps more importantly, they can also tell us what kind of processor it will take to result in a reasonable system for reaching 1 PFLOPS.
The results are very interesting because we’re seeing that the new processors from Intel and AMD are likely to really help make PFLOPS-scale systems more realistic for HPC in just a few years (less than 3).

In the next article I will take a look at hybrid systems for reaching 1 PFLOPS and contrast them with the CPU-only systems discussed here.

-- Dr. Jeff Layton