Results for Pure CPU Driven Systems:
Observations:

The table has some interesting results. As you can see, it takes a large number of nodes today to reach 1 PFLOPS. But at the same time, the number of cores from both Intel and AMD is increasing rapidly. If the “Future AMD” processor is close to correct, you will be able to get a four-socket node with 96 cores and about 1.8 TFLOPS of performance in a single node!

There is a lot of data so let’s look at two trends. The first trend is the number of nodes required to reach 1 PFLOPS using QDR InfiniBand for both Intel and AMD:
Currently (2009) you need anywhere from about 9,000 to 12,500 two-socket nodes to reach 1 PFLOPS on the HPL benchmark.

When the new processors from Intel and AMD are introduced this year, you can see a big drop in the number of nodes to between 2,600 and about 4,000. The lowest node count for Intel comes from a four-socket Nehalem-EX system. Remember that Nehalem-EX will have 8 cores, so a four-socket system will have 32 cores! AMD will have four-socket systems built around Magny-Cours that will have up to 48 cores per node, resulting in about 2,600 nodes to reach 1 PFLOPS.

Then in 2013, with my projected (really, desired) processors from AMD and Intel, the number of nodes drops to about 640 to 1,600. With this number of nodes it is very easy to create a simple fat-tree topology with InfiniBand that is cost-effective. However, these nodes have a massive number of cores (32 to 96), so whether a single QDR InfiniBand link is enough for each node is an interesting question.

Another way to look at the results is to plot the amount of floor space, or number of racks, the systems will take. This is an interesting question because the four-socket nodes will likely have to be put into a 2U chassis or larger to handle the heat.
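The node and rack counts discussed here come down to simple arithmetic, which the following minimal Python sketch tries to make explicit. It is not the spreadsheet behind the article's table: the function names, the 2.0 GHz clock, the default HPL efficiency of 1.0 (pure peak), and the 42U rack height are all assumptions made for illustration. Only the core counts, the operations-per-clock figures, and the roughly 1.8 TFLOPS "Future AMD" node come from the text.

```python
import math

PFLOPS = 1e15  # target: 1 PFLOPS


def node_peak_flops(sockets, cores_per_socket, clock_ghz, flops_per_clock):
    """Theoretical peak FLOPS of one node (cores x clock x ops per clock)."""
    return sockets * cores_per_socket * clock_ghz * 1e9 * flops_per_clock


def nodes_for_target(target_flops, per_node_flops, hpl_efficiency=1.0):
    """Nodes needed to reach target_flops.

    hpl_efficiency is the assumed fraction of peak that HPL achieves;
    1.0 means pure peak, which understates the real node count.
    """
    return math.ceil(target_flops / (per_node_flops * hpl_efficiency))


def racks_for_nodes(nodes, chassis_u, rack_u=42):
    """Racks needed for compute nodes only (no switches or storage).

    rack_u=42 is an assumed rack height, not a figure from the article.
    """
    nodes_per_rack = rack_u // chassis_u
    return math.ceil(nodes / nodes_per_rack)


if __name__ == "__main__":
    # Four-socket, 8-core node at a hypothetical 2.0 GHz, 4 ops per clock.
    nehalem_ex_like = node_peak_flops(4, 8, 2.0, 4)
    print(nodes_for_target(PFLOPS, nehalem_ex_like))

    # Doubling ops per clock to 8 halves the node count at the same clock.
    print(nodes_for_target(PFLOPS, node_peak_flops(4, 8, 2.0, 8)))

    # A ~1.8 TFLOPS node at pure peak, then 640 such nodes in 2U chassis.
    print(nodes_for_target(PFLOPS, 1.8e12))  # 556
    print(racks_for_nodes(640, chassis_u=2))  # 31
```

At pure peak the 1.8 TFLOPS node gives 556 nodes, somewhat below the roughly 640 quoted above, which is what you would expect once HPL efficiency and interconnect overhead are factored in.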
In this plot, to even out the comparison, I assume a 1U chassis for two-socket systems and a 2U chassis for four-socket systems. While denser packaging is available (e.g. blades), using rack-mount nodes keeps things simple. The numbers also only take into consideration the nodes themselves, so switches, storage, management nodes, etc., are not included.

Notice that today we need 200-300 racks to reach 1 PFLOPS on the Top500 benchmark (wow – that’s a lot of racks). With the new processors from AMD and Intel this drops to about 130 to 200 racks. This is still a very large number of racks.

In 2013 we get to about 31-75 racks. While still a very large number of racks, this is much more manageable.

Notice that there are two trends driving the reduction in the number of nodes:
1. Increasing the number of cores
2. The number of operations per clock

The first driver is seeing a doubling in the number of cores every 2-3 years! The second driver is increasing the amount of work the processor can do per clock cycle. It looks like Intel’s Sandy Bridge processor will double the number of operations per clock to 8. This will effectively halve the number of nodes required to reach 1 PFLOPS.

Final Comments:

In this article I’ve looked at what I call “pure CPU” systems for reaching 1 PFLOPS. This means the systems only use CPUs for computing. Moreover, the performance of the systems is measured solely by the Top500 benchmark (HPL). We all know that the Top500 is a single application and that real performance is determined by running your particular workload, but the Top500 benchmark gives us a common, if arbitrary, yardstick for comparing systems.

These “what-if” scenarios can be fun to play with periodically, but they also serve a very useful purpose. They can tell us if current processors and system packaging are up to the task of reaching 1 PFLOPS.
Perhaps more importantly, they can also tell us what kind of processor it will take to result in a reasonable system for reaching 1 PFLOPS.

The results are very interesting because we’re seeing that the new processors from Intel and AMD are likely to really help make PFLOPS-scale systems more realistic for HPC in just a few years (less than 3).

In the next article I will take a look at hybrid systems for reaching 1 PFLOPS and contrast them with the CPU-only systems discussed here.

-- Dr. Jeff Layton