The Xeon 5500 (Nehalem) based systems bring a whole new level of memory bandwidth to the HPC party. But the new architecture brings with it some options that allow you to trade memory price, memory performance, and memory capacity. I want to spend a little bit of time reviewing various memory configuration options and their impact on memory bandwidth.
Nehalem brings a new memory architecture to the Intel processor lineup. Rather than having the memory controller located off the CPU (remember the Front-Side Bus, FSB, architecture?), the memory controller is now on the CPU! In addition, there are up to three memory channels connected to each socket and up to three DIMMs per memory channel. The figure below illustrates this.
The numbering on the left indicates the memory channels and the numbers on the right indicate the number of DIMMs per memory channel (which I'll refer to as a “bank”). Notice that you can have only 3 DIMMs per memory channel (this is a limitation of the architecture, not a Dell-imposed limitation). For a two-socket system you can have a maximum of 18 DIMM slots.
Nehalem offers a wide variety of memory options. These include Unregistered DIMMs (UDIMMs) and Registered DIMMs (RDIMMs) at various speeds and capacities. At this time, Intel allows DIMMs to run at 800 MHz, 1066 MHz, and 1333 MHz. If you include the various DIMM sizes and ranks, you can see there is a huge range of memory options for Nehalem. So this raises the question: which one is best for me? Since I'm in HPC, in this blog I want to focus on the best options for HPC and the implications of certain configuration decisions.
There are some differences between UDIMMs and RDIMMs that are important in choosing the best options for memory performance. First, let's talk about the differences between them. RDIMMs have a register on-board the DIMM (hence the name “registered” DIMM). The register/PLL buffers only the address, control, and clock lines; none of the data goes through it. (PLL stands for Phase-Locked Loop. On the prior generation, DDR2, the register, which buffers the address and control lines, and the PLL, which generates extra copies of the clock, were separate parts, but for DDR3 they are combined into a single part.) There is about a one clock cycle delay through the register, which means that with only one DIMM per channel, UDIMMs have slightly lower latency (and slightly better bandwidth). But when you go to two DIMMs per memory channel, the high electrical loading on the address and control lines forces the memory controller to use something called “2T” or “2N” timing for UDIMMs. Consequently, every command that normally takes a single clock cycle is stretched to two clock cycles to allow for settling time. Therefore, with two or more DIMMs per channel, RDIMMs have lower latency and better bandwidth than UDIMMs.
Based on guidance from Intel and internal testing, RDIMMs have better bandwidth when using more than one DIMM per memory channel (recall that Nehalem has up to 3 memory channels per socket). But, based on results from Intel, for a single DIMM per channel, UDIMMs produce approximately 0.5% better memory bandwidth than RDIMMs for the same processor frequency and memory frequency (and rank). For two DIMMs per channel, RDIMMs are about 8.7% faster than UDIMMs.
For the same capacity, RDIMMs require about 0.5 to 1.0 W more power per DIMM due to the register/PLL. The reduction in memory controller power needed to drive the DIMMs on the channel is small compared to the RDIMM register/PLL power adder.
RDIMMs also provide an extra measure of RAS. They provide address/control parity detection at the Register/PLL such that if an address or control signal has an issue, the RDIMM will detect it and send a parity error signal back to the memory controller. It does not prevent data corruption on a write, but the system will know that it has occurred, whereas on UDIMMs, the same address/control issue would not be caught (at least not when the corruption occurs).
Another difference is that server UDIMMs support only x8 wide DRAMs, whereas RDIMMs can use x8 or x4 wide DRAMs. Using x4 DRAMs allows the system to correct all possible DRAM device errors (SDDC, or “Chip Kill”), which is not possible with x8 DRAMs unless channels are run in Lockstep mode (huge loss in bandwidth and capacity on Nehalem). So if SDDC is important, x4 RDIMMs are the way to go.
Dell currently has 1GB and 2GB UDIMMs for Nehalem. Consequently we can support up to 6 GB per socket with UDIMMs at optimal performance (one DIMM per memory channel). For capacities greater than 12GB per socket, RDIMMs are the only DIMM type supported at this time.
In addition, please note that UDIMMs are limited to two DIMMs per channel, so RDIMMs must be used if you want more than two DIMMs per channel (some of Dell's servers will have 3-DIMMs-per-channel capability). In summary, the comparison between UDIMMs and RDIMMs is:
Recall that you are allowed up to 3 DIMMs per memory channel (i.e. 3 banks) per socket, for a total of 9 DIMMs per socket. With Nehalem, the actual memory speed depends on the speed of the DIMM itself, the number of DIMMs in each channel, and the CPU speed. Here are some simple rules for determining DIMM speed.
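As a sketch of how those factors combine, here is a small Python helper. The per-channel speed caps (1333/1066/800 MHz for 1/2/3 DIMMs per channel) are my reading of the guidance in this post and should be treated as assumptions, not a spec; exact behavior also depends on DIMM type and BIOS.

```python
def effective_dimm_speed(dimm_speed, cpu_max_speed, dimms_per_channel):
    """Return the effective memory speed (MHz) for one channel.

    A simplification of the rules discussed in this post:
      - 1 DIMM per channel can run up to 1333 MHz
      - 2 DIMMs per channel drop to at most 1066 MHz
      - 3 DIMMs per channel drop to at most 800 MHz
    The channel runs at the minimum of the DIMM's rated speed, the
    CPU's supported memory speed, and the population cap above.
    """
    population_cap = {1: 1333, 2: 1066, 3: 800}[dimms_per_channel]
    return min(dimm_speed, cpu_max_speed, population_cap)
```

For example, `effective_dimm_speed(1333, 1333, 3)` comes out to 800 MHz: even 1333 MHz DIMMs on a fast CPU fall back to 800 MHz when every channel carries three DIMMs.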
So as you add more DIMMs to any memory channel, the memory speed drops. This is due to the electrical loading of the DRAMs, which reduces timing margin; it is not a power constraint. If you don't completely fill all memory channels, there is also a reduction in memory bandwidth performance. Think of these configurations as “unbalanced” configurations from a memory perspective.
Recall that Nehalem has 3 memory channels per socket and up to three DIMM slots per memory channel. Ideally Nehalem wants all three memory channels filled because it can then interleave memory accesses to get better performance. It is certainly valid not to fill all of the memory channels, but this creates an unbalanced memory configuration that impacts memory bandwidth. Let's look at a simple example. Assume all three memory channels have a DIMM in the first slot (Bank 0). Then I add a single DIMM in the second slot (Bank 1) on the first memory channel. This is shown below.
In this case, the first bank (first DIMM slot) has all three memory channels occupied (1, 2, and 3). In the second bank (the second DIMM slot), only the first memory channel has a DIMM, for a total of 4 DIMMs per socket. Nehalem currently interleaves the memory for this configuration across memory channels 1 and 2 of Bank 0, and then across channel 3 of Bank 0 and channel 1 of Bank 1 (basically two 2-way interleaves instead of the single 3-way interleave you get when only Bank 0 is filled). This change in interleaving reduces memory bandwidth by about 23% for the case of 1066 MHz memory.
Let’s look at the next logical step – putting a DIMM in Bank 1 for the second memory channel:
In this case, interleaving happens across Bank 0 (all three memory channels) and across the two occupied memory channels in Bank 1. While the second interleave isn't across all three channels, it still interleaves across two memory channels. So in this case, performance should be a bit better than in the previous example, but not as good as if all three memory channels were full.
The final case is not an unbalanced case, but is when all three memory channels are filled for Bank 1:
In this case, interleaving happens across all three memory channels for Bank 0 and all three memory channels for Bank 1. This recovers the peak performance and is about 23% faster than the case of Bank 1 having only a single memory channel occupied. The basic summary of this discussion is that you want to populate all three memory channels in a particular bank if at all possible. If only two of the memory channels are occupied, then the memory bandwidth performance decreases (the relative drop is not known at this time). If only one memory channel is occupied, then the memory bandwidth performance drops by about 23%!
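To collect the examples above into one place, here is a toy model of the relative bandwidth for a socket with Bank 0 full and some number of channels populated in Bank 1. The 23% figure comes from the 1066 MHz example in this post; the two-channel case is deliberately left unquantified because the drop isn't known, and none of this is a general formula.

```python
def relative_bandwidth(channels_in_bank1):
    """Rough relative memory bandwidth vs. a fully balanced socket,
    assuming Bank 0 has all three channels populated.

    Numbers are taken from the examples in this post, not a general formula.
    """
    if channels_in_bank1 in (0, 3):
        return 1.00            # balanced: full 3-way interleave
    if channels_in_bank1 == 1:
        return 1.00 - 0.23     # two 2-way interleaves, ~23% drop
    if channels_in_bank1 == 2:
        return None            # drop not quantified at this time
    raise ValueError("channels_in_bank1 must be 0-3")
```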
I wanted to give you some guidance on memory bandwidth using the STREAM benchmark since memory bandwidth is one of the fantastic new features with Nehalem. The numbers I’m going to provide are for guidance and the exact performance depends upon the DIMM types, frequencies, processor speed, the compiler choice and compiler options, as well as BIOS settings. So don’t take these numbers as gospel, but they can be used to understand the impact of DIMM configurations on memory bandwidth performance. For the following guidance I’m assuming that the DIMMs are the same size.
One of the key factors in these numbers is that the DIMM sizes are all the same. For example you use all 2GB DIMMs or all 1GB DIMMs that are either all UDIMMs or RDIMMs. If you use DIMMs of different sizes anywhere on the node, for example using 2GB DIMMs and 1GB DIMMs, you can lose up to 5-10% in memory bandwidth performance.
Also remember the guidance from the unbalanced configurations. If at all possible you want to fill all three memory channels for a specific bank.
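The numbers in this post come from the actual STREAM benchmark (written in C/Fortran), which is what you should run on real hardware. If you just want a quick sanity check, a NumPy sketch of the triad kernel looks something like this; it will read lower than tuned C, so use it to compare configurations, not for absolute numbers.

```python
# A minimal STREAM-like "triad" sketch in Python/NumPy -- a rough stand-in
# for the real STREAM benchmark, useful only for relative comparisons.
import time
import numpy as np

def triad_bandwidth(n=20_000_000, repeats=5):
    """Estimate memory bandwidth in GB/s using a = b + scalar * c."""
    b = np.random.rand(n)
    c = np.random.rand(n)
    scalar = 3.0
    best = 0.0
    for _ in range(repeats):
        t0 = time.perf_counter()
        a = b + scalar * c        # the STREAM triad kernel
        dt = time.perf_counter() - t0
        # three arrays of 8-byte doubles cross the memory bus
        # (read b, read c, write a), hence the factor of 3 * n * 8
        best = max(best, 3 * n * 8 / dt / 1e9)
    return best

if __name__ == "__main__":
    print(f"Triad bandwidth: {triad_bandwidth():.1f} GB/s")
```

As with the real benchmark, take the best of several repeats so that a cold cache or an OS hiccup doesn't drag the number down.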
With all of these options, UDIMMs vs. RDIMMs, various capacities, speeds, mixing DIMM sizes, and processor speeds, there is a HUGE number of options and it’s not always clear which configuration gives you the best memory bandwidth. So, I wanted to summarize my recommendations. I will start with the configuration with the best memory bandwidth and then move down in performance.
This guidance is not as simple as one would hope, but for the big boost in memory bandwidth, we have to learn to live with a little more complexity. If you want something even simpler, here are my recommendations for memory configurations that give you the best possible memory bandwidth performance:
I hope this last set of bullets is a little easier to use. Personally, I'm more than willing to accept a little extra complexity in memory configurations for the memory bandwidths that Nehalem is capable of providing.
If we follow these simple rules then we end up with something like these configurations:
The rules are fairly easy to follow, but you may not get the exact memory capacity you want while keeping the memory bandwidth performance as high as you want. Still, from these configurations you can have memory capacities of 6GB, 12GB, 18GB, 24GB, 36GB, 48GB, and 72GB using 1GB, 2GB, or 4GB DIMMs.
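Those capacities fall out of simple arithmetic for a two-socket system: three channels per socket, every populated bank full, and all DIMMs the same size. A quick check (the 1/2/4 GB sizes are the DIMM options discussed above):

```python
# Balanced two-socket capacities: every populated bank fills all three
# channels and every DIMM is the same size.
sockets, channels = 2, 3
capacities = sorted({
    sockets * channels * banks * size_gb
    for banks in (1, 2, 3)      # banks populated on every channel
    for size_gb in (1, 2, 4)    # GB per DIMM
})
print(capacities)  # [6, 12, 18, 24, 36, 48, 72]
```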
I hope this blog has helped. It is long and complex but I wanted to explain the options you have and the impact they have on memory bandwidth. I’ve also tried to give you some simple guidelines to help configure the memory you want with the performance you want.
I want to acknowledge and thank Onur Celebioglu and Stuart Berke of Dell for helping me review this blog (many times) for factual errors and the usual grammar mistakes I'm prone to commit. In particular Stuart helped a great deal with the details of how and why in regard to memory with Nehalem. I also want to thank Intel - in particular Lance Shuler, Chad Martin, and Ed Kurtzer for answering questions about the inner workings of Nehalem's memory and how it is manifested in memory bandwidth.
How does SDDC really work in Nehalem with DDR3 DIMMs? Are there any rules (pairing, etc.)?
What do you think should happen when we have, say, 4GB or 8GB? If it is 4GB, do you recommend doing two 1GB DIMMs per processor socket or just a single 2GB dual-rank DIMM per processor socket?
Thanks for such a thorough post on the memory options - it's a bit clearer to me now, though still tough to figure out the best options. I'm seeing on a R510 the "optimized" memory configs are mostly 3x the number of modules, but they list the speed as 1066MHz. I'll assume that is the rated speed on the module, not the effective speed that you are saying is only 800MHz when using 3x.
Sorry, I was confused. I'd edit my post if possible, but I see the 3x configs refer to 3 modules per channel, not populating all 3 channels in 1x or 2x configs.