By Dan Stanzione and Tommy Minyard, Texas Advanced Computing Center
For a long time now, we've known that sustained performance on clusters is a complicated thing to measure, especially when talking about parallel jobs. In columns like this one, and others too numerous to count, it has been stressed that sustained performance for real applications is as much about balance (processor, interconnect, filesystem, memory) as it is about clock frequency. But despite that, the speed of the processor has still been the driving force for many when making decisions about clusters. Look no further than the Top 500 list to see that this is true.
The Top 500 uses the High Performance Linpack (HPL) benchmark, which, while somewhat sensitive to interconnect performance, generally delivers a pretty high fraction of the peak performance of the processor. In the last few years, that has meant 60-70% on poorly balanced clusters and 75-85% on well balanced ones; you can see this yourself by working out the ratio of Rmax to Rpeak on the reported numbers in the list, particularly for the older systems. Given that HPL delivers a pretty good fraction of peak performance on most processors, it is not surprising that higher clock rate has meant higher HPL, a higher Top 500 number, and the impression that your new cluster is "faster."
The big gotcha here is the not-so-well-kept secret in the HPC community that *peak* performance and *real* application performance don't really have that much to do with one another, and that the performance of HPL does not reflect the ability of a cluster to get work done. This has become especially true with the last generation of quad-core processors. The fact of the matter is, while processors have been following the Moore's law curve, most of our real applications have been increasingly starved for memory bandwidth (i.e., the ability to get data from main memory into those increasingly fast processors).
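The Rmax-to-Rpeak ratio mentioned above is easy to work out for any entry on the list. The sketch below is only an illustration: the function name `hpl_efficiency` and the two example systems are hypothetical, with numbers chosen to fall inside the 60-70% and 75-85% ranges quoted in the text rather than taken from any real Top 500 entry.

```python
def hpl_efficiency(rmax_tflops: float, rpeak_tflops: float) -> float:
    """Return the fraction of theoretical peak (Rpeak) that HPL achieved (Rmax)."""
    if rpeak_tflops <= 0:
        raise ValueError("Rpeak must be positive")
    return rmax_tflops / rpeak_tflops


# Hypothetical systems, NOT real Top 500 entries: (Rmax, Rpeak) in TeraFLOPS.
example_systems = {
    "poorly_balanced_cluster": (65.0, 100.0),
    "well_balanced_cluster": (82.0, 100.0),
}

for name, (rmax, rpeak) in example_systems.items():
    print(f"{name}: HPL efficiency = {hpl_efficiency(rmax, rpeak):.0%}")
```

To try this on real data, substitute the Rmax and Rpeak columns from the published list for the made-up values.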
HPL doesn't really suffer too much from inadequate memory bandwidth, so the magnitude of the problem hasn't been quite as obvious. Intel has been well aware of this, however, and took a quantum leap forward in memory bandwidth with the Intel Xeon processor 5500 sequence “Nehalem” series of processors (continued in the current Intel Xeon processor 5600 sequence “Westmere” processors and beyond).
If you've been out shopping for cluster processors, it might appear on the surface that things have been pretty stagnant. Two, three or even four years ago, you could get a quad-core processor, issuing four floating point instructions per cycle, running somewhere between 2 and 3GHz. If you looked recently, you could get a quad-core Nehalem processor, issuing four floating point instructions per cycle, running between 2 and 3GHz. So what happened to Moore's Law performance doubling, you may ask? Well, it happened, but it's primarily in memory bandwidth.
If you look at peak performance, things look about the same. Let's say two to three years ago you were looking at some Intel Harpertown processors, and let's assume for the sake of round numbers they ran at exactly 2.5GHz. The peak performance of one of these chips would be 2.5GHz * (4 instructions per cycle) * (4 cores per chip) = 40 GigaFLOPS. If you looked at the quad-core versions of the Nehalem at 2.5GHz, the math would be the same.
But that's the clean theory world of peak performance. Let's look at some real performance instead. The figure below shows the performance of the Weather Research and Forecasting (WRF) V3.1.1 application, typical of climate and weather models, running on 3.0GHz Intel Xeon processor E5450 “Harpertown” processors, compared to 2.66GHz Intel Xeon processor X5550 “Nehalem” processors in a 2-socket Dell blade configuration.
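The peak-performance arithmetic above can be written as a small calculation. This is a minimal sketch: the function name `peak_gflops` is ours, and the figure of four floating point instructions per cycle is the one stated in the text, applied to the clock rates of the two processors used in the WRF comparison.

```python
def peak_gflops(clock_ghz: float, flops_per_cycle: int, cores: int) -> float:
    """Theoretical peak of one chip: clock rate x instructions per cycle x cores."""
    return clock_ghz * flops_per_cycle * cores


# The round-number example from the text: 2.5 GHz, 4 per cycle, 4 cores.
print(peak_gflops(2.5, 4, 4))  # 40.0

# The two processors in the WRF comparison, using their stated clock rates.
harpertown_e5450 = peak_gflops(3.0, 4, 4)
nehalem_x5550 = peak_gflops(2.66, 4, 4)
print(f"Harpertown E5450 peak per chip: {harpertown_e5450:.2f} GFLOPS")
print(f"Nehalem X5550 peak per chip:    {nehalem_x5550:.2f} GFLOPS")
```

By this measure the Nehalem part has the lower peak of the two, which is why peak numbers alone say so little about the application results that follow.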
On the Harpertown processors, the performance flattens out with just four cores on a node in use, stays constant up to eight cores and then scales almost linearly when going from one to two nodes, as expected. In contrast, the Nehalem processor continues to increase in performance up to eight cores, with single core performance better than Harpertown by 40%. However, with all eight cores in use on a node, the Nehalem beats the Harpertown by almost 4 to 1! The reason for this is clear when looking at the memory bandwidth available on a node (see the second figure below). In the case of the Harpertown, the memory bandwidth flattens out to a maximum at just two cores, while for the Nehalem, memory bandwidth continues increasing out to all eight cores.
As you can see, on a real application when using all cores on a node, the Nehalem outpaces the older architectures by better than 2:1. So, your performance did double... just not if you judge by HPL numbers. A cluster with a peak performance of 100 TeraFLOPS today is a whole lot more productive than a peak 100 TeraFLOPS cluster of a couple of years ago. In fact, a modern 50TF cluster may be faster for many workloads than a 100TF cluster of 2008, but the 100TF one will still rank higher on the Top 500 list.
The boost in memory bandwidth that Intel introduced with the Nehalem architecture has been a real game changer in overall system performance, but it has really thrown a wrench in the way we look at the HPL benchmark and things like peak performance. These were never perfect measures, but they were the best we had. Our grain of salt has gotten a whole lot bigger lately. And while the Nehalem architecture gave us a quantum leap in memory bandwidth, increasing core counts and clock speeds creeping up again, well beyond 3GHz, will make it difficult to maintain this fantastic bandwidth per core in future products.
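One way to see why two nodes with the same peak can deliver very different application performance is a simple roofline-style bound: delivered performance is capped by the smaller of the floating point peak and the memory bandwidth multiplied by how many floating point operations the code performs per byte moved. This is a standard back-of-the-envelope model, not the method behind the figures above, and every number in the sketch is hypothetical rather than a measurement of any Harpertown or Nehalem system.

```python
def attainable_gflops(peak_gflops: float,
                      bandwidth_gb_per_s: float,
                      flops_per_byte: float) -> float:
    """Roofline-style upper bound: min(compute peak, bandwidth x arithmetic intensity)."""
    return min(peak_gflops, bandwidth_gb_per_s * flops_per_byte)


# Hypothetical nodes with identical peak but different memory bandwidth.
PEAK = 80.0            # GFLOPS per node (assumed)
OLD_BANDWIDTH = 10.0   # GB/s (assumed)
NEW_BANDWIDTH = 30.0   # GB/s (assumed)

# Hypothetical arithmetic intensities, in floating point operations per byte.
workloads = {
    "HPL-like (compute bound)": 16.0,
    "bandwidth-hungry application": 0.5,
}

for name, intensity in workloads.items():
    old = attainable_gflops(PEAK, OLD_BANDWIDTH, intensity)
    new = attainable_gflops(PEAK, NEW_BANDWIDTH, intensity)
    print(f"{name}: old node <= {old} GFLOPS, new node <= {new} GFLOPS")
```

With these made-up inputs the compute-bound case hits the same ceiling on both nodes, while the bandwidth-hungry case scales with memory bandwidth, which is the pattern the text describes.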
Further muddying the waters, we're seeing lots of other products that will claim high peak numbers, including the introduction of GPU-based systems. All of these new products will first claim new heights in peak performance, and sometime later claim fantastic HPL performance. Keep in mind, though, that the real cost-benefit analysis should focus on your particular application workload, and no matter what architecture you have, speedy floating point units aren't useful if you can't keep them fed with data. So, keep an eye on benchmarks for your workloads, not just eye-popping peak numbers.
The good news, however, is that recent architectures have made significant improvements in the memory bandwidth delivered to each socket. These improvements have done a great deal to close the gap between peak performance and delivered performance, giving us huge productivity gains at the same clock rate. Just keep in mind this won't always show up on the Top 500.