In 1994, Wulf and McKee predicted that computer performance improvement would soon cease. They reasoned that CPUs were becoming too fast for their own good: CPUs would soon be so much faster than memory that program execution time would depend entirely on how fast the memory could feed the CPU data. They dubbed this problem "hitting the memory wall," a natural consequence of Moore's Law improving CPUs far faster than memory.
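Their reasoning reduces to a simple average-access-time argument (the numbers below are mine, purely for illustration): if a fraction p of accesses hit in cache with latency t_c and the rest go to main memory with latency t_m, the average access time is p*t_c + (1-p)*t_m. With a 99% hit rate, t_c = 1 cycle, and t_m = 100 cycles, the average is about 2 cycles; let t_m grow to 400 cycles and the average climbs to about 5 cycles even though the cache never got any worse. As the gap widens, the memory term swamps everything else.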

John McCalpin (the inventor of the STREAM memory bandwidth benchmark) plotted the projected divergence on his web page:

[Figure: McCalpin's projection of the CPU vs. memory performance divergence]
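(For readers who haven't used it, STREAM estimates sustainable memory bandwidth by timing a few simple vector kernels over arrays far larger than cache. Below is a stripped-down sketch of the Triad kernel in C; it is only meant to show the idea and is not McCalpin's actual benchmark code.)

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 20000000L   /* ~160 MB per array: much larger than any cache */

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    const double scalar = 3.0;

    /* Initialize (and page in) all three arrays before timing. */
    for (long i = 0; i < N; i++) {
        a[i] = 0.0;
        b[i] = 1.0;
        c[i] = 2.0;
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    /* The Triad kernel: one write and two reads per element. */
    for (long i = 0; i < N; i++)
        a[i] = b[i] + scalar * c[i];

    clock_gettime(CLOCK_MONOTONIC, &t1);

    double seconds = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double bytes   = 3.0 * sizeof(double) * N;   /* a, b, and c each touched once */
    printf("Triad: %.2f GB/s\n", bytes / seconds / 1e9);

    free(a); free(b); free(c);
    return 0;
}
```

The real benchmark repeats each kernel several times, reports the best run, and uses OpenMP so that every core (and every memory channel) stays busy; a single-threaded run like this one will usually report well below a machine's peak bandwidth.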


Of course, dire predictions like this are not unusual in the computing industry, and luckily they rarely come true. Even so, I decided to plot CPU versus memory performance data gathered in our lab over the past five years and compare it with McCalpin's projection.

Here’s how we’re doing 15 years after Wulf and McKee’s prediction:

[Figure: HPC CPU vs. RAM (1): CPU and memory performance measured in our lab over the past five years]


It turns out our measurements track McCalpin's projections very closely. CPU performance doubled three times over the past five years (roughly 8x overall); memory performance doubled only once (2x).

(It is interesting to note that CPU clock rates stayed essentially constant over that period. The speed improvements of the last five years came primarily from other processor features, such as multi-core designs and wider SSE registers.)

So what does this have to do with Nehalem?

In my previous post I stated that Nehalem's architecture is fundamentally different from Intel's previous x86_64 systems. Most of Nehalem's enhancements involved the memory subsystem. Specifically, Intel replaced the Front Side Bus (FSB) architecture with memory controllers integrated directly into the processor. This drastically improved Nehalem's memory bandwidth and reduced its latency.

Look what happens when we add 11G servers to our CPU versus RAM comparison:

[Figure: HPC CPU vs. RAM (2): the comparison updated with 11G servers]



There are two important things to notice in the updated graph. First, Nehalem's memory bandwidth quadruples, jumping from ~9.5 GB/s to ~38 GB/s. This enormous increase comes from the DDR3 memory, which supports faster DIMMs across three channels, and from the integrated controller, which removes the contention inherent in a shared bus.
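As a rough sanity check on that ~38 GB/s number (assuming a two-socket box populated with DDR3-1333, which is my assumption and not something stated above): each channel can move 1333 MT/s x 8 bytes, or about 10.7 GB/s; three channels per socket gives roughly 32 GB/s, and two sockets gives roughly 64 GB/s of theoretical peak. Sustaining ~38 GB/s in a micro-benchmark is therefore about 60% of peak, which is in the range one would expect from STREAM-style measurements.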

Second, CPU frequency stays constant from 10G to 11G, but CPU performance jumps 10%. This means that Nehalem operates more efficiently than previous architectures. We'll explore the reasons for this jump in later posts.

So what does this mean for Wulf and McKee's predictions? Did these changes avert the impending disaster? In my opinion, we're not out of the woods yet. Massive parallelism through multi-core processors and accelerators is driving an even deeper divergence between CPU and memory speed. But these micro-benchmark results confirm that Nehalem pushed back the memory wall, at least in the short term.

In an upcoming post we’ll examine whether these micro-benchmark improvements translate into real-world application performance gains.