Nehalem and Memory Configurations

The Xeon 5500 (Nehalem) based systems bring a whole new level of memory bandwidth to the HPC party. But the new architecture brings with it some options that allow you to trade memory price, memory performance, and memory capacity. I want to spend a little bit of time reviewing various memory configuration options and their impact on memory bandwidth.

Nehalem Architecture


Nehalem brings a new memory architecture to the Intel processor lineup. Rather than having the memory controller located off the CPU (remember the Front-Side Bus, FSB, architecture?), the memory controller is now on the CPU! In addition, there are up to three memory channels connected to each socket and up to three DIMMs per memory channel. The figure below illustrates this:

Nehalem Memory Layout Schematic

The numbering on the left indicates the memory channels and the numbers on the right indicate the DIMM position within each memory channel (i.e., which “bank” the DIMM sits in). Notice that you can have only 3 DIMMs per memory channel (this is a limitation of the architecture, not a Dell-imposed limitation). For a two-socket system you can have a maximum of 18 DIMM slots.

Memory Options

Nehalem offers a wide variety of memory options. These include Unregistered DIMMs (UDIMMs) and Registered DIMMs (RDIMMs) at various speeds and capacities. At this time, Intel allows DIMMs to run at 800 MHz, 1066 MHz, and 1333 MHz. If you include various DIMM sizes and ranks, you can see there is a huge range of options for memory with Nehalem. So this raises the question: which one is best for me? Since I’m in HPC, in this blog I want to focus on the best options for HPC and the implications of certain configuration decisions.

UDIMM vs. RDIMM:

There are some differences between UDIMMs and RDIMMs that are important in choosing the best options for memory performance. First, let’s talk about the differences between them.
RDIMMs have a register on board the DIMM (hence the name “registered” DIMM). The register/PLL is used to buffer the address and control lines and the clocks only; none of the data goes through the register/PLL on an RDIMM. (PLL is Phase Locked Loop. On the prior generation, DDR2, the register for buffering the address and control lines and the PLL for generating extra copies of the clock were separate parts, but for DDR3 they are combined in a single part.) There is about a one clock cycle delay through the register, which means that with only one DIMM per channel, UDIMMs will have slightly lower latency (better bandwidth). But when you go to two DIMMs per memory channel, due to the high electrical loading on the address and control lines, the memory controller uses something called “2T” or “2N” timing for UDIMMs. Consequently, every command that normally takes a single clock cycle is stretched to two clock cycles to allow for settling time. Therefore, for two or more DIMMs per channel, RDIMMs will have lower latency and better bandwidth than UDIMMs.

Based on guidance from Intel and internal testing, RDIMMs have better bandwidth when using more than one DIMM per memory channel (recall that Nehalem has up to 3 memory channels per socket). But, based on results from Intel, for a single DIMM per channel, UDIMMs produce approximately 0.5% better memory bandwidth than RDIMMs for the same processor frequency and memory frequency (and rank). For two DIMMs per channel, RDIMMs are about 8.7% faster than UDIMMs.

For the same capacity, RDIMMs will require about 0.5 to 1.0 W more power per DIMM due to the register/PLL. The reduction in memory controller power needed to drive the DIMMs on the channel is small in comparison to the RDIMM register/PLL power adder.

RDIMMs also provide an extra measure of RAS. They provide address/control parity detection at the Register/PLL such that if an address or control signal has an issue, the RDIMM will detect it and send a parity error signal back to the memory controller. It does not prevent data corruption on a write, but the system will know that it has occurred, whereas on UDIMMs, the same address/control issue would not be caught (at least not when the corruption occurs).

Another difference is that server UDIMMs support only x8 wide DRAMs, whereas RDIMMs can use x8 or x4 wide DRAMs. Using x4 DRAMs allows the system to correct all possible DRAM device errors (SDDC, or “Chip Kill”), which is not possible with x8 DRAMs unless channels are run in Lockstep mode (huge loss in bandwidth and capacity on Nehalem). So if SDDC is important, x4 RDIMMs are the way to go.

Dell currently has 1GB and 2GB UDIMMs for Nehalem. Consequently we can support up to 6 GB per socket with UDIMMs at optimal performance (one DIMM per memory channel). For capacities greater than 12GB per socket, RDIMMs are the only DIMM type supported at this time.

In addition, please note that UDIMMs are limited to 2 DIMMs per channel, so RDIMMs must be used for any configuration with more than 2 DIMMs per channel (some of Dell’s servers will have 3-DIMM-per-channel capability).

In summary, the comparison between UDIMMs and RDIMMs is as follows (the short sketch after the list turns these rules of thumb into code):

  • Typically UDIMMs are a bit cheaper than RDIMMs
  • For one DIMM per memory channel UDIMMs have slightly better memory bandwidth than RDIMMs (0.5%)
  • For two DIMMs per memory channel RDIMMs have better memory bandwidth (8.7%) than UDIMMs
  • For the same capacity, RDIMMs require about 0.5 to 1.0 W more power per DIMM than UDIMMs
  • RDIMMs also provide an extra measure of RAS
    • Address / control signal parity detection
    • RDIMMs can use x4 DRAMs so SDDC can correct all DRAM device errors even in independent channel mode
  • UDIMMs are currently limited to 1GB and 2GB DIMM sizes from Dell
  • UDIMMs are limited to two DIMMs per memory channel
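
If it helps to see these rules of thumb as code, here is a minimal Python sketch; the pick_dimm_type helper, its arguments, and its thresholds are my own illustrative choices, built only from the limits and percentages quoted above.

    def pick_dimm_type(dimms_per_channel, need_sddc=False, capacity_per_socket_gb=0):
        """Rough UDIMM-vs-RDIMM guidance based on the rules of thumb above.

        Assumes the limits quoted in this post: UDIMMs top out at 2 DIMMs per
        channel and 2GB per DIMM, while RDIMMs are needed for SDDC (x4 DRAMs)
        and for larger capacities.
        """
        if dimms_per_channel > 2:
            return "RDIMM"   # UDIMMs are limited to two DIMMs per memory channel
        if need_sddc:
            return "RDIMM"   # x4 RDIMMs allow SDDC even in independent channel mode
        if capacity_per_socket_gb > 2 * dimms_per_channel * 3:
            return "RDIMM"   # beyond what 2GB UDIMMs across 3 channels can reach
        if dimms_per_channel == 1:
            return "UDIMM"   # ~0.5% better bandwidth, slightly cheaper, a bit less power
        return "RDIMM"       # at two DIMMs per channel, RDIMMs are ~8.7% faster
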
DIMM Count and Memory Configurations

Recall that you are allowed up to 3 DIMMs per memory channel (i.e., 3 banks) per socket, for a total of 9 DIMMs per socket. With Nehalem, the actual memory speed depends upon the speed of the DIMMs themselves, the number of DIMMs in each channel, and the speed of the CPU. Here are some simple rules for determining DIMM speed (the short sketch after the list expresses the same rules in code).

  • If you put only 1 DIMM in each memory channel you can run the DIMMs at 1333 MHz (maximum speed). This assumes that the processor supports 1333 MHz (currently, the 2.66 GHz, 2.80 GHz, and 2.93 GHz processors support 1333 MHz memory) and the memory is capable of 1333 MHz
  • As soon as you put one more DIMM in any memory channel (two DIMMs in that memory channel) on any socket, the speed of the memory drops to 1066 MHz (basically the memory runs at the fastest common speed for all DIMMs)
  • As soon as you put more than two DIMMs in any one memory channel, the speed of all the memory drops to 800 MHz
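
Here is a small Python sketch of these rules; the function name and arguments are mine, and the speed steps are the ones listed above.

    def effective_memory_speed_mhz(max_dimms_in_any_channel, cpu_max_mhz=1333, dimm_mhz=1333):
        """Effective DDR3 speed for the system, per the rules above.

        max_dimms_in_any_channel: largest number of DIMMs in any one memory channel
        cpu_max_mhz: fastest memory speed the processor supports (1333 MHz for the
                     2.66, 2.80, and 2.93 GHz parts mentioned above)
        dimm_mhz:    rated speed of the slowest DIMM installed
        """
        if max_dimms_in_any_channel >= 3:
            channel_limit = 800    # three DIMMs in any channel drops everything to 800 MHz
        elif max_dimms_in_any_channel == 2:
            channel_limit = 1066   # two DIMMs in any channel drops everything to 1066 MHz
        else:
            channel_limit = 1333   # one DIMM per channel can run at the maximum speed
        return min(channel_limit, cpu_max_mhz, dimm_mhz)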

So as you add more DIMMs to any memory channel, the memory speed drops. This is due to the electrical loading of the DRAMs, which reduces timing margin; it is not a power constraint.

If you don’t completely fill all memory channels, there is a reduction in memory bandwidth performance. Think of these configurations as “unbalanced” configurations from a memory perspective.

Unbalanced Memory Configurations

Recall that Nehalem has 3 memory channels per socket and up to three DIMM slots per memory channel. Ideally Nehalem wants all three memory channels filled because it can then interleave memory access to get better performance. It is certainly valid to not fill up all of the memory channels but this causes an unbalanced memory configuration that impacts memory bandwidth.

Let’s look at a simple example. Let’s assume the first bank (the first DIMM slot) is full, that is, all three memory channels have a DIMM in their first slot. Then I add a single DIMM to the first memory channel in the second DIMM slot (the second bank). This is shown below:

Nehalem - Unbalanced memory configuration - One DIMM in second bank

In this case, the first bank (first DIMM slot) has all three memory channels occupied (1, 2, and 3). For the second bank (the second DIMM slot), only the first memory channel has a DIMM. This is a total of 4 DIMMs per socket. Nehalem currently interleaves the memory for this configuration across memory channels 1 and 2 of Bank 0, and then interleaves across the Bank 0 DIMM on channel 3 and the Bank 1 DIMM on channel 1 (basically two 2-way interleaves instead of the single 3-way interleave you get when only Bank 0 is filled). This change in interleaving reduces memory bandwidth by about 23% for the case of 1066 MHz memory.

Let’s look at the next logical step – putting a DIMM in Bank 1 for the second memory channel:

Nehalem - Unbalanced memory configuration - Two DIMMs in second bank

In this case, interleaving happens across Bank 0 (all three memory channels) and across the two occupied memory channels in Bank 1. While the second interleave isn’t across a full bank, it still interleaves across two memory channels. So in this case performance should be a bit better than in the previous example, but not as good as if all three memory channels were full.

The final case is not an unbalanced case, but is when all three memory channels are filled for Bank 1:

Nehalem - Balanced Memory configuration - Second Bank fully populated

In this case interleaving happens across all three memory channels for Bank 0 and all three memory channels for Bank 1. This recovers the peak performance and is about 23% faster than the case of Bank 1 only having a single memory channel occupied.

The basic summary from this discussion of memory configurations is that you want to populate all three memory channels in a particular bank if at all possible. If only two of the memory channels are occupied, then the memory bandwidth performance decreases (the relative drop is not known at this time). If only one memory channel is occupied, then the memory bandwidth performance drops by about 23%!
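
To put the interleaving discussion in one place, here is a tiny Python lookup of the relative bandwidth figures quoted above; the table name is mine, and the two-channel entry is deliberately left open because the exact drop is not known at this time.

    # Relative memory bandwidth when the second bank of a socket is only partially
    # populated (first bank full), per the figures quoted above for 1066 MHz memory.
    RELATIVE_BANDWIDTH_BY_CHANNELS_IN_PARTIAL_BANK = {
        3: 1.00,   # balanced: 3-way interleave across both banks
        2: None,   # some loss, but smaller than the one-channel case (figure unknown)
        1: 0.77,   # ~23% drop: two 2-way interleaves instead of a 3-way interleave
        0: 1.00,   # leaving the second bank empty is also a balanced configuration
    }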

Memory Bandwidth Performance

I wanted to give you some guidance on memory bandwidth using the STREAM benchmark since memory bandwidth is one of the fantastic new features with Nehalem. The numbers I’m going to provide are for guidance and the exact performance depends upon the DIMM types, frequencies, processor speed, the compiler choice and compiler options, as well as BIOS settings. So don’t take these numbers as gospel, but they can be used to understand the impact of DIMM configurations on memory bandwidth performance. For the following guidance I’m assuming that the DIMMs are the same size.

  1. For Nehalem processors with a QPI speed of 6.4 GT/s and using three 1333 MHz DIMMs (one per memory channel) per socket, you can expect to see memory bandwidths of about 35 GB/s.
  2. For 1066 MHz memory with either one DIMM per memory channel (3 total DIMMs per socket) or two DIMMs per memory channel (6 total DIMMs per socket), you can see a little over 32 GB/s (about an 8.5% reduction from number 1)
  3. For 800 MHz memory for either one DIMM per memory channel (3 total DIMMs per socket), two DIMMs per memory channel (6 total DIMMs per socket), or three DIMMs per memory channel (9 total DIMMs per socket), you could see up to about 25 GB/s (about a 28.5% reduction relative to number 1 or 22% reduction relative to number 2)

One of the key factors in these numbers is that the DIMM sizes are all the same. For example, you use all 2GB DIMMs or all 1GB DIMMs, either all UDIMMs or all RDIMMs. If you use DIMMs of different sizes anywhere on the node, for example mixing 2GB DIMMs and 1GB DIMMs, you can lose up to 5-10% in memory bandwidth performance.
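
For convenience, here is a rough Python sketch that pulls these ballpark numbers together; the function name is mine, and using a flat 10% mixed-size penalty (the top of the quoted 5-10% range) is my simplification.

    # Ballpark STREAM figures from the guidance above (Nehalem, QPI at 6.4 GT/s,
    # same-size DIMMs, all three channels of each populated bank filled). Actual
    # results depend on DIMM type, processor speed, compiler, and BIOS settings.
    STREAM_GBPS_BY_MEMORY_MHZ = {1333: 35.0, 1066: 32.0, 800: 25.0}

    def estimate_stream_bandwidth_gbps(memory_mhz, mixed_dimm_sizes=False):
        """Very rough STREAM estimate for the configurations described above."""
        gbps = STREAM_GBPS_BY_MEMORY_MHZ[memory_mhz]
        if mixed_dimm_sizes:
            gbps *= 0.90   # mixing DIMM sizes can cost roughly 5-10% of bandwidth
        return gbps

    # Example: estimate_stream_bandwidth_gbps(1066, mixed_dimm_sizes=True) -> ~29 GB/s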

Also remember the guidance from the unbalanced configurations. If at all possible you want to fill all three memory channels for a specific bank.

Recommended Memory Configurations

With all of these options, UDIMMs vs. RDIMMs, various capacities, speeds, mixing DIMM sizes, and processor speeds, there is a HUGE number of options and it’s not always clear which configuration gives you the best memory bandwidth. So, I wanted to summarize my recommendations. I will start with the configuration with the best memory bandwidth and then move down in performance.

  • The best memory bandwidth performance is a single UDIMM per memory channel
    • At this time you are limited to 1GB UDIMMs or 2GB UDIMMs
      • This means you can only have 3GB-6GB per socket
      • Be careful of selecting DIMM sizes in case you want to upgrade in the future
    • Remember that UDIMMs don’t have all of the RAS features of RDIMMs
    • UDIMMs use less power than RDIMMs (0.5-1W per DIMM)
    • 95W processors (2.66 GHz, 2.80 GHz, and 2.93 GHz) allow the memory to run at 1333 MHz
    • At this time UDIMMs should be cheaper than RDIMMS
  • Switching to RDIMMs will reduce memory bandwidth but applications are unlikely to notice the difference (less than 0.5% reduction in memory bandwidth)
  • As soon as you add a second DIMM to any memory channel the speed drops to 1066 MHz for all DIMMs (approximately an 8.5% reduction in memory bandwidth). This assumes that your processor is capable of supporting 1066 MHz memory
    • If you need to go this route, either for capacity or cost, then I recommend populating two DIMM slots for each memory channel on both sockets, since the DIMM speed is 1066 MHz anyway and it gives you the best performance
      • Recall that if you only populate one memory channel on the second bank, you will lose about 23% of your memory bandwidth performance
    • UDIMMs can still be used but you have reached the maximum number of DIMMs per socket (6)
    • RDIMMs have better memory bandwidth performance
      • 8.7% when Bank 0 and Bank 1 are occupied
    • Use 1066 MHz DIMMs since 1333 MHz DIMMs don’t result in any more memory bandwidth
  • As soon as you add a third DIMM to any memory channel, the speed drops to 800 MHz for all DIMMs (28.5% drop in memory bandwidth relative to 1333 MHz memory).
    • I recommend if you need three DIMMs for any memory channel, that you populate all DIMM slots since the memory speed will be 800 MHz and populating a subset of memory channels can greatly impact memory bandwidth performance
    • RDIMMs have to be used for this case
  • Use the same size DIMMs for all configurations if possible (5-10% loss in memory bandwidth if different size DIMMs are used)


This guidance is not as simple as one would hope, but for the big boost in memory bandwidth, we have to learn to live with a little more complexity. If you want something even simpler, here are my recommendations for memory configurations that give you the best possible memory bandwidth performance:

  • Use RDIMMs
  • Keep the DIMM sizes the same (otherwise you lose 5-10% memory bandwidth)
  • Think in 3’s
    • 1x3 per socket is 1333 MHz memory (if the processor supports that speed)
    • 2x3 per socket is 1066 MHz memory (lose 8.5% memory bandwidth)
    • 3x3 per socket is 800 MHz memory (lose about 22% more memory bandwidth)
  • Select the DIMM size that gets you the capacity you want and the price you want (1GB DIMMs, 2GB DIMMs, and 4GB DIMMs)
    • Note: You may not get the exact memory capacity you want

I hope this last set of bullets is a little easier to use. Personally, I’m more than willing to put up with a little extra complexity in memory configurations for the memory bandwidth that Nehalem is capable of providing.

Examples of Simple Rules:

If we follow these simple rules, then we end up with something like the configurations below (the short sketch after the list reproduces these capacities):

  • Start with a processor that supports 1333 MHz memory
    • Populate all three memory channels and a single bank
      • Choosing 1GB RDIMMs gives you 6GB total memory for a two socket system
      • Choosing 2GB RDIMMs gives you 12GB total memory for a two socket system
      • Choosing 4GB RDIMMs gives you 24GB total memory for a two socket system
    • Populate the three memory channels and two banks
      • Choosing 1GB RDIMMs gives you 12GB total memory for a two socket system
      • Choosing 2GB RDIMMs gives you 24GB total memory for a two socket system
      • Choosing 4GB RDIMMs gives you 48GB total memory for a two socket system
    • Populate the three memory channels and three banks
      • Choosing 1GB RDIMMs gives you 18GB total memory for a two socket system
      • Choosing 2GB RDIMMs gives you 36GB total memory for a two socket system
      • Choosing 4GB RDIMMs gives you 72GB total memory for a two socket system
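
Here is a short Python sketch (the helper name and defaults are mine) that reproduces the capacities listed above for a two-socket system:

    def total_memory_gb(dimm_size_gb, banks_populated, sockets=2, channels_per_socket=3):
        """Total capacity when every channel of every populated bank holds one DIMM."""
        return dimm_size_gb * banks_populated * channels_per_socket * sockets

    # Reproduces the list above, e.g.:
    #   total_memory_gb(2, 2) -> 24   (2GB RDIMMs, two banks, two sockets)
    #   total_memory_gb(4, 3) -> 72   (4GB RDIMMs, three banks, two sockets)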

The rules are fairly easy to follow, but you may not get the exact memory capacity you want while keeping the memory bandwidth performance as high as you want. But from these configurations you can have memory capacities of 6GB, 12GB, 18GB, 24GB, 36GB, 48GB, or 72GB using 1GB, 2GB, or 4GB DIMMs.

Summary

I hope this blog has helped. It is long and complex but I wanted to explain the options you have and the impact they have on memory bandwidth. I’ve also tried to give you some simple guidelines to help configure the memory you want with the performance you want.

Acknowledgements

I want to acknowledge and thank Onur Celebioglu and Stuart Berke of Dell for helping me review this blog (many times) for factual errors and the usual grammar mistakes I'm prone to commit. In particular Stuart helped a great deal with the details of how and why in regard to memory with Nehalem. I also want to thank Intel - in particular Lance Shuler, Chad Martin, and Ed Kurtzer for answering questions about the inner workings of Nehalem's memory and how it is manifested in memory bandwidth.

Comments

  • How does SDDC really work in Nehalem with DDR3 DIMMs?  Are there any rules (pairing, etc.)?

  • What do you think should happen when we have like 4GB and 8GB for example?  If it is 4GB, do you recommend doing 1GB dimms with 2 per processor socket or just a single 2GB dual rank dimm per processor socket? 

  • Thanks for such a thorough post on the memory options - it's a bit clearer to me now, though still tough to figure out the best options. I'm seeing on an R510 that the "optimized" memory configs are mostly 3x the number of modules, but they list the speed as 1066 MHz. I'll assume that is the rated speed of the module, not the effective speed, which you are saying is only 800 MHz when using 3x.

  • Sorry, I was confused. I'd edit my post if possible, but I see the 3x configs refer to 3 modules per channel, not populating all 3 channels in 1x or 2x configs.