Before everyone got knee-deep into the HPC extravaganza that is SC, I wanted to provide a little perspective. That is, the show isn't about the largest systems in the world; it's about being the most productive - solving problems, making discoveries, generating understanding.

Anyone care to guess how many "HPC" systems there are in the world? I can promise you that it's at least a couple of orders of magnitude more than the Top500. There are, arguably, tens of thousands of HPC systems out there. They are running applications, solving problems, allowing people to make new discoveries, creating new understanding, and above all, making people more productive by being another tool in their arsenal. The fastest systems, the ones in the rarified air at the top of the Top500, are allowing people to tackle problems that never could have been tackled before. But these systems are used by only a relative handful of people, perhaps in the low hundreds. What they are doing is terribly important, but there are tens of thousands more people who are running HPC applications as we speak.

So is SC09 about the rarified high-end or about everyone else? The answer is fairly obvious - it's about both. However, many people forget that the vast majority of HPC is not in the rarified levels of Jaguar at ORNL or similar systems; it's in the small to medium systems that many companies, universities, and individual researchers use. We need the high-end systems to blaze the trails for larger systems for everyone, to understand how to build and maintain applications that scale to amazing levels, and to demonstrate to everyone what can be accomplished (sort of an HPC "moon shot"). But at the same time, everyone else can't afford systems that take up thousands of square feet and cost upward of $1M per year in electricity. I think that we as an industry lose sight of that.

Back to the Future - Again?


If you follow the Top500, you will also see that clusters dominate the list: the majority of systems on it are clusters. To understand why, let's look back a little.

Prior to clusters, the dominant HPC system was a large proprietary system built by one of a few vendors. These systems were costly, so they became centralized resources for everyone to share. But the number of people using HPC grew very quickly, as did their needs. At the same time, system performance was not growing at the same rate. The result was that the amount of HPC time that each person received was actually shrinking.

The system vendors were creating some wonderful hardware, leading the way in new computational techniques (e.g., vector processing). But every time they came out with a new technology, users had to port and tune their applications for the new hardware. This took time, effort, and $$ to accomplish. So to take advantage of new hardware and achieve better performance, you had to invest.

While all of this was happening, the lowly x86 processor was getting faster and faster. In addition, more and more people were buying PCs, game consoles, etc., driving down the cost of x86 processors. Consequently, the price/performance of PC-class processors was becoming very attractive.

These two trends, (1) people getting less time on HPC systems and (2) x86 processors getting much faster and much cheaper, collided and out popped the cluster. Tom Sterling and Don Becker popularized the concept of a cluster because they needed to get more HPC time but didn't want to break the bank. So they created the commodity cluster (Beowulf), using off-the-shelf hardware coupled with Linux and other open-source tools, to put HPC power back into the hands of the researchers that needed and wanted it. If you look back at the history of Beowulf, you will see that some early systems were really more HPC workstations than large centralized systems. Remember that the desire was to get more HPC power into the hands of the individual. What better way to do that than to give them a very good price/performance HPC system that they controlled?

However, lately there has been a trend of taking multiple smaller clusters and combining them into a single, larger centralized cluster. The concept relies on the argument that doing this will save money (by centralizing aspects of the cluster), and that when the system isn't fully utilized, individuals will have the opportunity to run on more processors than they would otherwise, tackling larger problems than they could before. The theory is great, but how does it work in reality? Are we repeating the same problem we had with the previous generation of HPC systems?

Jeff