Systems based on the Intel® Xeon® processor 5500 series architecture (code-named "Nehalem") bring a whole new level of memory bandwidth to the high-performance computing (HPC) party. But the new architecture also brings options that let you trade off memory price, memory performance, and memory capacity. I want to spend a little time reviewing the various memory configuration options and their impact on memory bandwidth.


Nehalem Architecture

Nehalem brings a new memory architecture to the Intel processor lineup. Rather than having the memory controller located off the CPU—remember the frontside bus (FSB) architecture?—the memory controller is now on the CPU! In addition, each socket has up to three memory channels, with up to three dual in-line memory modules (DIMMs) per memory channel. This figure illustrates the architecture:


Nehalem Memory Layout Schematic


The numbering on the left indicates the memory channels, and the numbers on the right indicate the DIMMs within each memory channel (that is, which “bank” a DIMM sits in). Notice that you can have at most three DIMMs per memory channel (this is a limitation of the architecture, not a Dell-imposed one). For a two-socket system you can have a maximum of 18 DIMM slots.
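
If it helps to see that layout written out, here is a minimal Python sketch of the slot arrangement described above. The socket/channel/bank naming is purely my own illustration, not anything defined by Intel or Dell:

```python
# A minimal model of the Nehalem DIMM slot layout: 2 sockets,
# 3 memory channels per socket, up to 3 DIMMs ("banks") per channel.
SOCKETS = 2
CHANNELS_PER_SOCKET = 3
DIMMS_PER_CHANNEL = 3

slots = [
    (socket, channel, bank)
    for socket in range(SOCKETS)
    for channel in range(CHANNELS_PER_SOCKET)
    for bank in range(DIMMS_PER_CHANNEL)
]

print(len(slots))  # 18 DIMM slots for a two-socket system
```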

Memory Options

Nehalem offers a wide variety of memory options. These include Unregistered DIMMs (UDIMMs) and Registered DIMMs (RDIMMs) at various speeds and capacities. At this time, Intel allows DIMMs to run at 800 MHz, 1066 MHz, and 1333 MHz. If you include the various DIMM sizes and ranks, you can see there is a huge range of memory options for Nehalem. So this raises the question: which one is best for me? Since I'm in HPC, I want to focus in this blog on the best options for HPC and the implications of certain configuration decisions.

UDIMM Versus RDIMM

There are some differences between UDIMMs and RDIMMs that are important in choosing the best option for memory performance. First, let's review what those differences are.

RDIMMs have a register on the DIMM itself (hence the name “registered” DIMM). The register/phase-locked loop (PLL) buffers only the address and control lines and the clocks; none of the data goes through the register/PLL on an RDIMM. In the prior generation, double data rate 2 (DDR2), the register (which buffers the address and control lines) and the PLL (which generates extra copies of the clock) were separate parts, but for DDR3 they are combined in a single part. There is about a one-clock-cycle delay through the register, which means that with only one DIMM per channel, UDIMMs will have slightly lower latency (and slightly better bandwidth). But when you go to two DIMMs per memory channel, because of the high electrical loading on the address and control lines, the memory controller uses something called “2T” or “2N” timing for UDIMMs. Consequently, every command that normally takes a single clock cycle is stretched to two clock cycles to allow for settling time. Therefore, for two or more DIMMs per channel, RDIMMs will have lower latency and better bandwidth than UDIMMs.

Based on guidance from Intel and internal testing, RDIMMs have better bandwidth when using more than one DIMM per memory channel (recall that Nehalem has up to three memory channels per socket). But, based on results from Intel, for a single DIMM per channel, UDIMMs produce approximately 0.5 percent better memory bandwidth than RDIMMs for the same processor frequency and memory frequency (and rank). For two DIMMs per channel, RDIMMs are about 8.7 percent faster than UDIMMs.

For the same capacity, RDIMMs will require about 0.5 to 1.0 W more power per DIMM because of the register/PLL. The reduction in memory-controller power needed to drive the DIMMs on the channel is small in comparison to the register/PLL power adder.

RDIMMs also provide an extra measure of reliability, availability, and serviceability (RAS). They provide address/control parity detection at the Register/PLL such that if an address or control signal has an issue, the RDIMM will detect it and send a parity error signal back to the memory controller. It does not prevent data corruption on a write, but the system will know that it has occurred; whereas on UDIMMs, the same address/control issue would not be caught (at least not when the corruption occurs).

Another difference is that server UDIMMs support only x8-wide DRAMs, whereas RDIMMs can use x8- or x4-wide DRAMs. Using x4 DRAMs allows the system to correct all possible single-DRAM-device errors (Single Device Data Correction, or SDDC, sometimes called “Chip Kill”), which is not possible with x8 DRAMs unless the channels are run in lockstep mode (a huge loss in bandwidth and capacity on Nehalem). So if SDDC is important to you, x4 RDIMMs are the way to go.

Dell currently has 1 GB and 2 GB UDIMMs for Nehalem. Consequently, we can support up to 6 GB per socket with UDIMMs at optimal performance (one DIMM per memory channel). For capacities greater than 12 GB per socket, RDIMMs are the only DIMM type supported at this time.

In addition, please note that UDIMMs are limited to two DIMMs per channel, so RDIMMs must be used if you need more than two DIMMs per channel (some Dell servers support three DIMMs per channel).

In summary, here is how UDIMMs and RDIMMs compare (a rough rule-of-thumb sketch follows the list):

  • Typically UDIMMs are a bit cheaper than RDIMMs
  • For one DIMM per memory channel UDIMMs have slightly better memory bandwidth than RDIMMs (0.5 percent)
  • For two DIMMs per memory channel RDIMMs have better memory bandwidth (8.7 percent) than UDIMMs
  • For the same capacity, RDIMMs will require about 0.5 to 1.0 W more per DIMM than UDIMMs
  • RDIMMs also provide an extra measure of RAS
    • Address/control signal parity detection
    • RDIMMs can use x4 DRAMs, so SDDC can correct all DRAM device errors even in independent channel mode
  • UDIMMs are currently limited to 1 GB and 2 GB DIMM sizes from Dell
  • UDIMMs are limited to two DIMMs per memory channel
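
To make those trade-offs concrete, here is a rough Python rule-of-thumb that distills the list above into a single function. The function name, arguments, and thresholds are my own shorthand for the bullets, not a Dell or Intel sizing tool:

```python
def suggest_dimm_type(dimms_per_channel, capacity_per_socket_gb, need_sddc=False):
    """Rough rule of thumb distilled from the UDIMM vs. RDIMM comparison above."""
    if need_sddc:
        return "RDIMM (x4)"   # SDDC in independent channel mode needs x4 RDIMMs
    if dimms_per_channel > 2:
        return "RDIMM"        # UDIMMs are limited to two DIMMs per channel
    if capacity_per_socket_gb > 12:
        return "RDIMM"        # beyond 12 GB per socket, only RDIMMs are offered today
    if dimms_per_channel == 1:
        return "UDIMM"        # ~0.5 percent better bandwidth, cheaper, slightly lower power
    return "RDIMM"            # at two DIMMs per channel, RDIMMs are ~8.7 percent faster

print(suggest_dimm_type(dimms_per_channel=1, capacity_per_socket_gb=6))    # UDIMM
print(suggest_dimm_type(dimms_per_channel=2, capacity_per_socket_gb=12))   # RDIMM
```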

DIMM Count and Memory Configurations

Recall that you are allowed up to three DIMMs per memory channel (i.e., three banks) per socket, for a total of nine DIMMs per socket. With Nehalem, the actual memory speed depends on the speed of the DIMMs themselves, the number of DIMMs in each channel, and the CPU speed. Here are some simple rules for determining the memory speed (a short sketch encoding them follows the list):

  • If you put only one DIMM in each memory channel you can run the DIMMs at 1333 MHz (maximum speed). This assumes that the processor supports 1333 MHz (currently, the 2.66 GHz, 2.80 GHz, and 2.93 GHz processors support 1333 MHz memory) and the memory is capable of 1333 MHz.
  • As soon as you put one more DIMM in any memory channel (two DIMMs in that memory channel) on any socket, the speed of the memory drops to 1066 MHz (basically the memory runs at the fastest common speed for all DIMMs).
  • As soon as you put more than two DIMMs in any one memory channel, the speed of all the memory drops to 800 MHz.
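
As a quick sanity check, here is a small Python sketch that encodes these three rules. The function name and arguments are mine; it captures only the bullets above, not every BIOS nuance:

```python
def memory_speed_mhz(max_dimms_in_any_channel, cpu_max_mhz=1333, dimm_rated_mhz=1333):
    """All DIMMs in the system run at the slowest speed implied by the rules above."""
    if max_dimms_in_any_channel >= 3:
        speed = 800      # three DIMMs in any one channel forces 800 MHz
    elif max_dimms_in_any_channel == 2:
        speed = 1066     # a second DIMM in any channel caps the system at 1066 MHz
    else:
        speed = 1333     # one DIMM per channel can run at the maximum speed
    return min(speed, cpu_max_mhz, dimm_rated_mhz)

print(memory_speed_mhz(1))                       # 1333 (if CPU and DIMMs support it)
print(memory_speed_mhz(2))                       # 1066
print(memory_speed_mhz(1, dimm_rated_mhz=1066))  # 1066
print(memory_speed_mhz(3))                       # 800
```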

So as you add more DIMMs to any memory channel, the memory speed drops because the electrical loading of the DRAMs reduces the timing margin; it is not a power constraint.

If you don’t completely fill all memory channels there is a reduction in the memory bandwidth performance. Think of these configurations as “unbalanced” configurations from a memory perspective.


Unbalanced Memory Configurations

Recall that Nehalem has three memory channels per socket and up to three DIMM slots per memory channel. Ideally, Nehalem wants all three memory channels filled because it can then interleave memory access to get better performance. It is certainly valid to not fill up all of the memory channels, but this causes an unbalanced memory configuration that impacts memory bandwidth.

Let's look at a simple example. Assume I have all three memory channels populated in the first bank (the first DIMM slot on each channel). Then I add a single DIMM to the first memory channel in the second bank (the second DIMM slot), as shown here:
Nehalem - Unbalanced memory configuration - One DIMM in second bank

In this case, the first bank (the first DIMM slot on each channel) has all three memory channels occupied (1, 2, and 3). In the second bank (the second DIMM slot), only the first memory channel has a DIMM, for a total of four DIMMs per socket. Nehalem currently interleaves the memory for this configuration across memory channels 1 and 2 of Bank 0, and then across the Bank 0 DIMM on channel 3 and the Bank 1 DIMM on channel 1 (basically two 2-way interleaves instead of the single 3-way interleave you get when only Bank 0 is populated). This change in interleaving reduces memory bandwidth by about 23 percent for the case of 1066 MHz memory.

Let’s look at the next logical step, putting a DIMM in Bank 1 for the second memory channel:


Nehalem - Unbalanced memory configuration - Two DIMMs in second bank


In this case, interleaving happens across Bank 0 (all three memory channels) and across the two occupied memory channels in Bank 1. While the second interleave doesn't span all three channels, it still interleaves across two memory channels. So in this case performance should be a bit better than the previous example, but not as good as if all three memory channels were full.

The final case is not unbalanced; it is when all three memory channels in Bank 1 are also filled:


Nehalem - Balanced Memory configuration - Second Bank fully populated

In this case, interleaving happens across all three memory channels for Bank 0 and all three memory channels for Bank 1. This recovers the peak performance and is about 23 percent faster than the case of Bank 1 only having a single memory channel occupied.

The basic takeaway from this discussion of memory configuration is that you want to populate all three memory channels in a given bank if at all possible. If only two of the memory channels in a bank are occupied, memory bandwidth decreases (the relative drop is not known at this time). If only one memory channel is occupied, memory bandwidth drops by about 23 percent!


Memory Bandwidth Performance

I want to give you some guidance on memory bandwidth using the STREAM benchmark, since memory bandwidth is one of the fantastic new features of Nehalem. The numbers I'm going to provide are for guidance only; the exact performance depends on the DIMM types, frequencies, processor speed, compiler choice and compiler options, and BIOS settings. So don't take these numbers as gospel, but they can be used to understand the impact of DIMM configurations on memory bandwidth performance. For the following guidance I'm assuming that the DIMMs are all the same size (a quick theoretical-peak calculation follows the list for reference):

  1. For Nehalem processors with a QPI speed of 6.4 GT/s and using three 1333 MHz DIMMs (one per memory channel) per socket, you can expect to see memory bandwidths of about 35 GB/sec.
  2. For 1066 MHz memory for either one DIMM per memory channel (three total DIMMs per socket) or two DIMMs per memory channel (six total DIMMs per socket), you can see a little over 32 GB/sec (about an 8.5 percent reduction from number 1).
  3. For 800 MHz memory for either one DIMM per memory channel (three total DIMMs per socket), two DIMMs per memory channel (six total DIMMs per socket), or three DIMMs per memory channel (nine total DIMMs per socket), you could see up to about 25 GB/sec (about a 28.5 percent reduction relative to number 1 or 22 percent reduction relative to number 2).
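
For reference, the theoretical peak of a DDR3 memory channel is its transfer rate times 8 bytes, since the data bus is 64 bits wide. Here is a quick Python sketch of that arithmetic; it is only a back-of-the-envelope upper bound to put the sustained STREAM guidance above in context, not a substitute for it:

```python
BYTES_PER_TRANSFER = 8    # DDR3 data bus is 64 bits (8 bytes) wide
CHANNELS_PER_SOCKET = 3

def peak_gb_per_s(mt_per_s, channels=CHANNELS_PER_SOCKET, sockets=1):
    """Theoretical peak memory bandwidth in (decimal) GB/s."""
    return mt_per_s * 1e6 * BYTES_PER_TRANSFER * channels * sockets / 1e9

for speed in (1333, 1066, 800):
    print(f"{speed} MHz: {peak_gb_per_s(speed):.1f} GB/s per socket, "
          f"{peak_gb_per_s(speed, sockets=2):.1f} GB/s for two sockets")
# 1333 MHz: 32.0 GB/s per socket, 64.0 GB/s for two sockets
# 1066 MHz: 25.6 GB/s per socket, 51.2 GB/s for two sockets
# 800 MHz: 19.2 GB/s per socket, 38.4 GB/s for two sockets
```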

One of the key factors in these numbers is that the DIMM sizes are all the same. For example, use all 2 GB DIMMs or all 1 GB DIMMs, and either all UDIMMs or all RDIMMs. If you use DIMMs of different sizes anywhere on the node—for example, mixing 2 GB DIMMs and 1 GB DIMMs—you can lose up to 5–10 percent in memory bandwidth performance.

Also, remember the guidance from the unbalanced configurations. If at all possible you want to fill all three memory channels for a specific bank.


Recommended Memory Configurations

With all these choices (UDIMMs versus RDIMMs, various capacities and speeds, mixed DIMM sizes, and processor speeds) there are a HUGE number of options, and it's not always clear which configuration gives you the best memory bandwidth. So I want to summarize my recommendations, starting with the configuration that has the best memory bandwidth and moving down in performance.

  • The best memory bandwidth performance comes from a single UDIMM per memory channel
    • At this time you are limited to 1 GB UDIMMs or 2 GB UDIMMs
      • This means you can only have 3 GB–6 GB per socket.
      • Be careful of selecting DIMM sizes in case you want to upgrade in the future.
    • Remember that UDIMMs don’t have all the RAS features of RDIMMs.
    • UDIMMs use less power than RDIMMs (0.5–1W per DIMM).
    • The 95 W processors (2.66 GHz, 2.80 GHz, and 2.93 GHz) allow the memory to run at 1333 MHz.
    • At this time UDIMMs should be cheaper than RDIMMs.
  • Switching to RDIMMs reduces memory bandwidth, but applications are unlikely to notice the difference (less than 0.5 percent reduction in memory bandwidth).
  • As soon as you add a second DIMM to any memory channel, the speed drops to 1066 MHz for all DIMMs (approximately an 8.5 percent reduction in memory bandwidth). This assumes that your processor is capable of supporting 1066 MHz memory.
    • If you need to go this route, either for capacity or cost, then I recommend populating two DIMM slots for each memory channel for both sockets, since the DIMM speed is at 1066 MHz anyway and it gives you the best performance.
      • Recall that if you populate only one memory channel on the second bank, you will lose about 23 percent of your memory bandwidth performance.
    • UDIMMs can still be used, but two per channel is the UDIMM limit, so you have reached the maximum number of UDIMMs per socket (six).
    • RDIMMs have better memory bandwidth performance.
      • About 8.7 percent better when Banks 0 and 1 are occupied.
    • Use 1066 MHz DIMMs, since 1333 MHz DIMMs don’t result in any more memory bandwidth.
  • As soon as you add a third DIMM to any memory channel, the speed drops to 800 MHz for all DIMMs (28.5 percent drop in memory bandwidth relative to 1333 MHz memory).
    • If you need three DIMMs in any memory channel, I recommend populating all the DIMM slots, since the memory speed will be 800 MHz anyway and populating only a subset of the memory channels can greatly reduce memory bandwidth performance.
    • RDIMMs have to be used for this case.
  • Use the same size DIMMs for all configurations if possible (5–10 percent loss in memory bandwidth if different size DIMMs are used).

This guidance is not as simple as one would hope, but for the big boost in memory bandwidth, we have to learn to live with a little more complexity. If you want something even simpler, here are my recommendations for memory configurations that give you the best possible memory bandwidth performance:

  • Use RDIMMs.
  • Keep the DIMM sizes the same (otherwise lose 5–10 percent memory bandwidth).
  • Think in 3s:
    • 1x3 per socket is 1333 MHz memory (if the processor supports that speed)
    • 2x3 per socket is 1066 MHz memory (lose 8.5 percent memory bandwidth)
    • 3x3 per socket is 800 MHz memory (lose about 22 percent more memory bandwidth)
  • Select the DIMM size that gets you the capacity you want and the price you want (1 GB DIMMs, 2 GB DIMMs, and 4 GB DIMMs)
    • Note: You may not get the exact memory capacity you want.

I hope this last set of bullets is a little easier to use. Personally, I'm more than willing to put up with a little complexity in memory configuration in exchange for the memory bandwidth that Nehalem is capable of providing.


Examples of Simple Rules

If we follow these simple rules, we end up with configurations like these:

  • Start with a processor that supports 1333 MHz memory.
    • Populate all three memory channels and a single bank:
      • Choosing 1 GB RDIMMs gives you 6 GB total memory for a two socket system.
      • Choosing 2 GB RDIMMs gives you 12 GB total memory for a two socket system.
      • Choosing 4 GB RDIMMs gives you 24 GB total memory for a two socket system.
    • Populate the three memory channels and two banks:
      • Choosing 1 GB RDIMMs gives you 12 GB total memory for a two socket system.
      • Choosing 2 GB RDIMMs gives you 24 GB total memory for a two socket system.
      • Choosing 4 GB RDIMMs gives you 48 GB total memory for a two socket system.
    • Populate the three memory channels and three banks:
      • Choosing 1 GB RDIMMs gives you 18 GB total memory for a two socket system.
      • Choosing 2 GB RDIMMs gives you 36 GB total memory for a two socket system.
      • Choosing 4 GB RDIMMs gives you 72 GB total memory for a two socket system.

The rules are fairly easy to follow, but you may not get the exact memory capacity you want while keeping memory bandwidth performance as high as you want. Still, from these configurations you can have memory capacities of 6 GB, 12 GB, 18 GB, 24 GB, 36 GB, 48 GB, and 72 GB using 1 GB, 2 GB, or 4 GB DIMMs (the short sketch below enumerates these combinations).
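
If you want to enumerate these options programmatically, here is a short Python sketch that reproduces the capacities above for a two-socket system using the "think in 3s" rules. The speed values simply restate the rules from earlier in the post (and 1333 MHz still requires a processor that supports it):

```python
CHANNELS_PER_SOCKET = 3
SOCKETS = 2
DIMM_SIZES_GB = (1, 2, 4)                      # RDIMM sizes discussed above
SPEED_BY_BANKS = {1: 1333, 2: 1066, 3: 800}    # banks populated -> memory speed (MHz)

for banks, speed in SPEED_BY_BANKS.items():
    for size_gb in DIMM_SIZES_GB:
        total_gb = size_gb * CHANNELS_PER_SOCKET * banks * SOCKETS
        print(f"{banks} bank(s) of {size_gb} GB RDIMMs per channel: "
              f"{total_gb} GB total at {speed} MHz")
```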


Summary

I hope this blog has helped. It is long and complex, but I wanted to explain the options you have and the impact they have on memory bandwidth. I’ve also tried to give you some simple guidelines to help configure the memory you want with the performance you want.


Acknowledgements

I want to acknowledge and thank Onur Celebioglu and Stuart Berke of Dell for helping me review this blog (many times) for factual errors and the usual grammar mistakes I'm prone to commit. In particular, Stuart helped a great deal with the details of how and why in regard to memory with Nehalem.

I also want to thank Intel—in particular, Lance Shuler, Chad Martin, and Ed Kurtzer—for answering questions about the inner workings of Nehalem's memory and how it is manifested in memory bandwidth.

Jeff