I think you’ve heard by now that Intel has released a new processor, the Xeon 5500 (code-named Nehalem-EP). While the consumer version of this processor has been released as the Core i7, this is the server version. It is a sea change for Intel processors, for a number of reasons that I want to walk through.


Intel’s Processor Development Clock

I’m not sure if you’re familiar with Intel’s approach to processor development, but it is called tick-tock. Every year there is a tick or a tock. A tick is a shrink of the chip to a smaller process. The 2007 change from the Clovertown processor to the Harpertown processor was a “tick”: a shrink from 65 nm to 45 nm. A “tock” is a new processor architecture, which in this case is Nehalem-EP. Next year, 2010, there will be a shrink of Nehalem-EP to a smaller process (the “tick”). This will be followed in 2011 by a new architecture (the “tock”). It’s a very aggressive cadence, but Intel is keeping pace.

Just remember that behind this tick-tock strategy is Moore’s Law. While it’s doubtful we’ll see much increase in clock speed (if any), Moore’s Law is about the number of transistors, not speed. With a chip shrink (a “tick”) at a given thermal envelope, the number of transistors will increase. So what will processor manufacturers do with the extra transistors? There are a number of things they could do:

  • Create more cores, moving from 4-core or 6-core processors to higher core counts.
  • Add more cache (L1, L2, or L3).
  • Do something else creative with the transistors.

This last bullet is the “potpourri” for processor design and could involve all kinds of things. With Nehalem-EP, one of the key features falls into this last category.


Introducing Mr. Nehalem-EP

I love the old Sidney Poitier line, “They call me MISTER Tibbs!”, a line so good it became the title of the movie They Call Me MISTER Tibbs. I think Nehalem-EP is something we should call “MISTER Nehalem-EP” because it is the big daddy on the market from Intel and deserves some serious respect.

There are a bunch of new features in this processor that I want to present in the context of HPC. From many perspectives it is a processor that answers the desires of the HPC market. The new features include:

  • An integrated memory controller (remember the third category I mentioned? Intel has used extra transistors to put the memory controller on the chip to improve memory performance).
  • Nehalem-EP uses DDR3 DIMMs instead of FB-DIMMs, which reduces power and increases bandwidth.
  • Intel has created QPI (QuickPath Interconnect), which connects the sockets to each other and, through the chipset, to the various devices on the board (PCIe slots, NICs, drives, USB ports, etc.). The exact QPI speed depends upon the processor model, because there are several “levels” of processor available from Intel.
  • Intel has introduced something called Turbo Boost that allows idle cores to drop to a low-power state while the remaining cores increase their clock speed to use the extra available thermal headroom (built-in overclocking!).
  • With Nehalem-EP, Intel has also brought back Hyperthreading. While, to be honest, the original Hyperthreading was widely reviled within HPC, the new Hyperthreading has actually been shown to help performance for certain HPC applications.
  • The chipset used with Nehalem-EP supports PCIe Gen 2, which doubles the throughput per lane over Gen 1. This means more performance; for example, you can plug a QDR InfiniBand HCA into the node and experience the wonder of 32 Gb/s of data transfer (see the quick arithmetic after this list).
  • Nehalem-EP keeps the same thermal envelope as Harpertown. While this may not sound important, it means that vendors (such as Dell) don’t have to add cooling to their nodes to handle extra power, which in turn means new nodes can be designed much faster.
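
To make that 32 Gb/s number in the PCIe bullet a little more concrete, here is the back-of-the-envelope arithmetic (a sketch using the commonly quoted QDR InfiniBand 4X link parameters; check your HCA documentation for the exact figures):

```c
/* Quick back-of-the-envelope arithmetic for a QDR InfiniBand 4X link.
 * Assumes the commonly quoted parameters: 4 lanes at 10 Gb/s signaling
 * with 8b/10b encoding (8 data bits carried per 10 signaled bits). */
#include <stdio.h>

int main(void)
{
    const double lanes         = 4.0;        /* QDR 4X link width         */
    const double gbps_per_lane = 10.0;       /* signaling rate per lane   */
    const double encoding      = 8.0 / 10.0; /* 8b/10b encoding efficiency */

    double raw  = lanes * gbps_per_lane;     /* 40 Gb/s on the wire       */
    double data = raw * encoding;            /* 32 Gb/s of user data      */

    printf("QDR 4X raw signaling rate: %.0f Gb/s\n", raw);
    printf("QDR 4X data rate         : %.0f Gb/s (%.0f GB/s)\n",
           data, data / 8.0);
    return 0;
}
```

A PCIe Gen 2 x8 slot uses the same 8b/10b encoding at 5 GT/s per lane, so it also works out to roughly 32 Gb/s (4 GB/s) in each direction, which is why a QDR HCA and a Gen 2 x8 slot are such a good match.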

Let’s talk about some of these features in a little more depth (I hope to cover each feature in more detail in future blogs).


Memory Improvements

Probably the most important new feature in Nehalem-EP revolves around the memory enhancements. Nehalem-EP has an integrated memory controller, which improves memory bandwidth, and each socket has three memory channels with up to three DIMMs per channel. (Harpertown-based systems, by contrast, had the memory controller in the chipset, shared by both sockets over the front-side bus.)


Nehalem Memory Layout Schematic

Figure 1 – Schematic of dual-socket Nehalem-EP memory layout


In the above figure, the numbers on the left label the memory channels and the numbers on the right label the number of DIMMs per channel (you can call them “banks” if you like). At this time, the Xeon 5500 is limited to two sockets and a maximum of 18 DIMM slots.

Intel has switched from FB-DIMMs to DDR3 memory. The two big impacts of this switch are that DDR3 has better bandwidth and uses less power than FB-DIMMs. A DDR3 DIMM uses 0.5-1 W less power than the equivalent FB-DIMM.
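
To put that per-DIMM figure in perspective, here is a rough sketch of the savings at the node and rack level (it uses the 0.5-1 W number above; the fully populated node and the 42-node rack are hypothetical examples, so plug in your own configuration):

```c
/* Rough sketch of the DDR3 vs. FB-DIMM power savings.
 * Uses the 0.5-1 W per-DIMM figure from the text; the node and rack
 * counts below are hypothetical examples, not measured values. */
#include <stdio.h>

int main(void)
{
    const int    dimms_per_node = 18;   /* 2 sockets x 3 channels x 3 DIMMs */
    const int    nodes_per_rack = 42;   /* hypothetical rack of 1U nodes    */
    const double savings_low    = 0.5;  /* W saved per DIMM (low end)       */
    const double savings_high   = 1.0;  /* W saved per DIMM (high end)      */

    printf("Per node: %.1f - %.1f W saved\n",
           dimms_per_node * savings_low, dimms_per_node * savings_high);
    printf("Per rack: %.0f - %.0f W saved\n",
           (double)dimms_per_node * nodes_per_rack * savings_low,
           (double)dimms_per_node * nodes_per_rack * savings_high);
    return 0;
}
```

That is 9-18 W per fully populated node before you even count the cooling power needed to remove the extra heat, and it adds up quickly across a large cluster.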

What does all of this mean for HPC? A dual-socket Harpertown node has a memory bandwidth, as measured by the STREAM benchmark, in the 9-11 GB/s range. With the optimal memory configuration and the fastest processors, a dual-socket Nehalem-EP node has a STREAM bandwidth a little over 35 GB/s! So Intel has blessed us with about 3 times the memory bandwidth.
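
Where do numbers like these come from? STREAM is basically a set of simple loops over arrays far too large to fit in cache. Below is a minimal triad-style sketch in C with OpenMP (an illustration only, not the official benchmark; grab the real STREAM source if you want numbers you can compare with published results):

```c
/* Minimal STREAM-triad-style bandwidth sketch (illustration only;
 * use the official STREAM benchmark for real measurements).
 * Build with, e.g.: gcc -O2 -fopenmp triad.c -o triad */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N      (20 * 1000 * 1000)  /* large enough to blow out the caches */
#define NTIMES 10

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    const double scalar = 3.0;

    /* First-touch initialization in parallel so pages land on the
     * NUMA node of the thread that will use them. */
    #pragma omp parallel for
    for (long i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

    double best = 0.0;
    for (int k = 0; k < NTIMES; k++) {
        double t = omp_get_wtime();
        #pragma omp parallel for
        for (long i = 0; i < N; i++)
            a[i] = b[i] + scalar * c[i];          /* the triad kernel */
        t = omp_get_wtime() - t;

        /* The triad touches 3 arrays of 8-byte doubles per iteration. */
        double gbs = 3.0 * N * sizeof(double) / t / 1.0e9;
        if (gbs > best) best = gbs;
    }
    printf("Best triad bandwidth: %.1f GB/s (a[0] = %g)\n", best, a[0]);

    free(a); free(b); free(c);
    return 0;
}
```

On a loop like this the cores spend almost all of their time waiting on memory, which is exactly why the integrated memory controller and the extra DDR3 channels show up so dramatically in the STREAM results.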


QPI

As I mentioned earlier, Intel has also created something called QPI, which connects the processors to each other and also connects them to the chipset (the I/O hub and, behind it, the South Bridge) that drives the rest of the devices on the board. The figure below illustrates how everything is connected on the board:


Nehalem Dual-Socket Layout Schematic

Figure 2 – Dual-socket Nehalem-EP Schematic

Notice that QPI gives a connection speed of up to 25.6 GB/s between the processors. It’s a NUMA (Non-Uniform Memory Access) architecture: each core can address all of the memory, but memory attached to the other socket has to be reached across QPI. Also notice that QPI connects the processors to the chipset that provides the PCIe Gen 2 slots.
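
Because of that NUMA layout, where your data physically lands matters. Here is a minimal sketch using libnuma (assuming a Linux node with the libnuma development package installed; link with -lnuma) that reports the number of NUMA nodes and places a buffer on a specific one:

```c
/* Minimal libnuma sketch: place a buffer on a chosen NUMA node.
 * Assumes Linux with libnuma installed; build with: gcc numa_demo.c -lnuma */
#include <stdio.h>
#include <numa.h>

int main(void)
{
    if (numa_available() < 0) {
        printf("NUMA is not available on this system.\n");
        return 1;
    }

    int nodes = numa_max_node() + 1;
    printf("This system has %d NUMA node(s).\n", nodes);

    /* Allocate 64 MB directly on node 0 (the memory attached to socket 0).
     * Threads running on the other socket will reach it across QPI. */
    size_t bytes = 64UL * 1024 * 1024;
    void *buf = numa_alloc_onnode(bytes, 0);
    if (buf == NULL) {
        printf("Allocation on node 0 failed.\n");
        return 1;
    }

    /* ... use the buffer ... */

    numa_free(buf, bytes);
    return 0;
}
```

In practice, the simplest way to stay NUMA-friendly is to pin MPI ranks or OpenMP threads to cores (for example with numactl or your MPI library’s binding options) so that first-touch allocation puts each process’s data in the memory attached to its own socket.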


How About Application Performance?

At this point you may be asking yourself, “These improvements sound good, but how do they impact real application performance?” I’m glad you asked. How these improvements affect performance is, of course, dependent upon the specific application. In particular, applications that are memory bandwidth-limited will see a big jump in performance, while applications that aren’t as sensitive to memory bandwidth will see much smaller improvements, if any at all.

I tested a major CFD (Computational Fluid Dynamics) application from NASA called USM3D. It’s in use by a number of companies, primarily for aerodynamic analysis of aircraft. The application was run on a 9.6 million-cell problem on 8 cores (a single node), once on a preproduction dual-socket Nehalem-EP running at 2.66 GHz and once on a dual-socket 3.0 GHz Intel Harpertown system with a 1,333 MHz FSB (front-side bus). With Nehalem-EP we saw a slightly better than 3X improvement in wall-clock time relative to Harpertown (not too shabby). This example may be slightly pathological because the application is so dependent upon memory bandwidth, but it does show you the potential of the Xeon 5500.

So the best guidance I can give at this point is that your performance improvement on Nehalem-EP relative to Harpertown is dependent upon the specific application but could range from 0 to 3X.


Summary

Nehalem-EP (Xeon 5500) is a big jump forward for Intel. It has a number of new features, including a 3X+ improvement in memory bandwidth, PCIe Gen 2, Turbo Boost, QPI, and Hyperthreading, all within the same thermal envelope as the Harpertown processors. I only talked about two of the major features, the memory improvements and QPI, in this blog, but I did show what these improvements can do for performance: a memory bandwidth-sensitive application saw better than a 3X improvement in wall-clock time.

In future blogs I’ll discuss other important aspects of Nehalem-EP in more depth, so stay tuned to this bat-channel, er, HPC blog. In particular, in the next blog I will discuss some aspects of memory configuration and their impact on performance (warning: the configuration options are vast).