Author: Bruce Wagner, September 2016 (Solutions Performance Analysis Lab)
The goal of this blog is to illustrate the performance impact of DDR4 memory selection. Measurements were made on a Broadwell-EP system using the industry-standard benchmarks listed in Table 1.
Table 1: Detail of Server and Applications used with Intel Broadwell processor
Server: Dell PowerEdge R630
Processor: 2 x E5-2699 v4 @2.2GHz, 22 core, 145W, 55M L3 Cache
Memory: DDR4 product offerings including:
    8GB 1Rx8 2400MT/s RDIMM (DPN 888JG)
    32GB 2Rx8 2400MT/s RDIMM (DPN CPC7G)
    64GB 4Rx8 2400MT/s LR-DIMM (DPN 29GM8)
Power supply: 1 x 750W
Operating system: Red Hat Enterprise Linux 7.2 (3.10.0-327.el7.x86_64)
BIOS options:
    Memory Operating Mode – Optimizer
    Node Interleaving – Disabled
    Snoop mode – Opportunistic Snoop Broadcast
    Logical Processor – Enabled
    System profile – Performance
SPEC CPU2006: Intel optimized 126.96.36.199 linux64 binaries (http://www.spec.org/cpu2006)
STREAM: v5.10 source from https://www.cs.virginia.edu/stream/
    Intel Parallel Studio 2016 update2 compilation
As Table 2 and Figure 1 show, the memory subsystem of the 13G PowerEdge R630 contains 24 DIMM sockets split into two sets of 12, one set per processor. Each 12-socket set is organized into four channels with three DIMM sockets per channel.
Table 2: Memory channels
Processor     Channel 0 DIMM Slots   Channel 1 DIMM Slots   Channel 2 DIMM Slots   Channel 3 DIMM Slots
Processor 1   A1, A5, A9             A2, A6, A10            A3, A7, A11            A4, A8, A12
Processor 2   B1, B5, B9             B2, B6, B10            B3, B7, B11            B4, B8, B12
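The slot-to-channel layout in Table 2 follows a regular pattern: within each 12-socket set, slot number n sits on channel (n - 1) mod 4. A minimal Python sketch of that lookup follows. The function name `slot_to_channel` is hypothetical, and the formula is simply read off Table 2 rather than taken from any Dell tool.

```python
import re


def slot_to_channel(slot: str) -> tuple[str, int]:
    """Return (socket set letter, channel index) for a slot label such as 'A7'.

    Assumption: the layout is exactly the one in Table 2, i.e. sets A and B,
    slots 1 to 12, and channel = (slot number - 1) mod 4.
    """
    match = re.fullmatch(r"([AB])(\d{1,2})", slot.strip().upper())
    if match is None:
        raise ValueError(f"not a valid slot label: {slot!r}")
    number = int(match.group(2))
    if not 1 <= number <= 12:
        raise ValueError(f"slot number out of range 1-12: {slot!r}")
    return match.group(1), (number - 1) % 4


if __name__ == "__main__":
    for label in ("A1", "A5", "A9", "B7", "B12"):
        print(label, slot_to_channel(label))
```

For example, A1, A5 and A9 all map to channel 0 of set A, which matches the first column of Table 2.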
Figure 1: Memory socket locations
Figure 2: Performance Impact of Memory Type
From Figure 2 we see that a memory configuration based on Registered DIMMs (RDIMMs) provides an overall 3.1% performance advantage over an equivalently sized one composed of Load-Reduced DIMMs (LR-DIMMs), despite both running at 2400 MT/s. LR-DIMMs make larger-capacity memory configurations possible, but their inherently higher access latency results in reduced application performance. LR-DIMMs also impose a nearly 30% power consumption penalty over the equivalent size/speed RDIMM. LR-DIMMs should be used only when the total system memory capacity requirement dictates a 3DPC (three DIMMs per channel) configuration.
Table 3: Memory speed limits for 13G PowerEdge Models
Figure 3: Performance Impact of DIMM Rank Organization
From Figure 3 we see that a 1DPC memory configuration composed of dual-rank DIMMs outperforms one composed of single-rank DIMMs by 14%. This is because DRAM incurs a large inherent delay when switching between read and write accesses on a given rank, which significantly reduces throughput on the memory channel. Given dual-rank DIMMs or multiple DIMMs per channel, the CPU's integrated memory controller can overlap reads and writes to different ranks on the memory channel, minimizing the impact of read/write turnaround time.
Figure 4: Performance Impact of Memory Speed
Figure 4 shows that a 2400 MT/s memory configuration provides 14% higher overall application performance than a 2133 MT/s one, all other factors being the same. Modern 8Gbit 1.2V DDR4 DIMM technology is such that the higher speed incurs only a nominal increase in power consumption and thermal dissipation. Pricing and availability of 2400 MT/s DIMMs are also rapidly trending toward the commodity sweet spot.
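For context on the two data rates, the theoretical peak memory bandwidth per processor can be worked out from the transfer rate. The sketch below assumes the standard 64-bit (8-byte) DDR4 data bus per channel and the four channels per processor described earlier. The function name is hypothetical, and the result is a theoretical ceiling, not a measured STREAM figure.

```python
BYTES_PER_TRANSFER = 8    # assumption: standard 64-bit DDR4 channel data bus
CHANNELS_PER_SOCKET = 4   # four channels per processor, as described above


def peak_bandwidth_gbs(mt_per_s: int, channels: int = CHANNELS_PER_SOCKET) -> float:
    """Theoretical peak bandwidth in GB/s (1 GB = 1e9 bytes) for one processor."""
    return mt_per_s * 1e6 * BYTES_PER_TRANSFER * channels / 1e9


if __name__ == "__main__":
    slow = peak_bandwidth_gbs(2133)
    fast = peak_bandwidth_gbs(2400)
    print(f"2133 MT/s: {slow:.1f} GB/s per processor")
    print(f"2400 MT/s: {fast:.1f} GB/s per processor")
    print(f"theoretical increase: {(fast / slow - 1) * 100:.1f}%")
```

Under these assumptions the peak works out to 76.8 GB/s per processor at 2400 MT/s versus about 68.3 GB/s at 2133 MT/s, a theoretical increase of roughly 12.5%. Measured application results depend on more than peak bandwidth.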
Figure 5: Performance Impact of DIMM Slot Population
Figure 5 shows that a 2DPC population yields a slight 0.9% workload performance uplift over a 1DPC one, attributed to the same memory controller data-transfer overlap efficiencies discussed for Figure 3. The 3DPC result is included to highlight the marked performance degradation caused by having to down-clock the memory subsystem from 2400 MT/s to 1866 MT/s.
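The trade-off behind a 3DPC population can be made concrete with a little arithmetic: capacity per channel grows with the DIMM count, while the channel data rate drops. The sketch below uses the data rates stated in the text for this test system (2400 MT/s at 1DPC and 2DPC, 1866 MT/s at 3DPC). It assumes identical DIMM sizes in every slot, and the names are hypothetical.

```python
# Data rates in MT/s by DIMMs per channel, as reported in the text for this
# test system. Other servers or DIMM types may down-clock differently.
DATA_RATE_BY_DPC = {1: 2400, 2: 2400, 3: 1866}


def relative_to(baseline_dpc: int, dpc: int) -> tuple[float, float]:
    """Return (capacity ratio, peak data-rate ratio) of `dpc` versus `baseline_dpc`.

    Assumption: every slot holds a DIMM of the same size, so capacity scales
    linearly with the number of DIMMs per channel.
    """
    capacity_ratio = dpc / baseline_dpc
    rate_ratio = DATA_RATE_BY_DPC[dpc] / DATA_RATE_BY_DPC[baseline_dpc]
    return capacity_ratio, rate_ratio


if __name__ == "__main__":
    capacity, rate = relative_to(baseline_dpc=2, dpc=3)
    print(f"3DPC vs 2DPC: {capacity:.2f}x capacity, {rate:.4f}x peak data rate")
```

Going from 2DPC to 3DPC gives 1.5 times the capacity at about 78% of the peak data rate, which is consistent with treating 3DPC as a capacity-driven choice.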
Figure 6: Performance Impact of DIMM Population Balance
In Figure 6 we see a wide disparity in overall system memory bandwidth as a result of DIMM population balance.
Although the default Optimizer (aka Independent Channel) Memory Operating Mode supports odd numbers of DIMMs per CPU, doing so incurs a severe performance penalty.
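A quick way to sanity-check a planned population is to count DIMMs per channel for one processor. The sketch below is illustrative only: it treats a population as balanced when every channel holds the same number of DIMMs, which is an assumption stricter than the odd/even rule stated above. The slot-to-channel mapping is the one in Table 2, and the function names are hypothetical.

```python
def dimms_per_channel(slots: list[str]) -> dict[int, int]:
    """Count populated slots per channel for ONE processor's slot labels.

    Assumption: channel = (slot number - 1) mod 4, as laid out in Table 2.
    """
    counts = {channel: 0 for channel in range(4)}
    for slot in slots:
        number = int(slot.strip()[1:])  # 'A7' -> 7
        if not 1 <= number <= 12:
            raise ValueError(f"slot number out of range 1-12: {slot!r}")
        counts[(number - 1) % 4] += 1
    return counts


def is_balanced(slots: list[str]) -> bool:
    """True when all four channels carry the same number of DIMMs."""
    return len(set(dimms_per_channel(slots).values())) == 1


if __name__ == "__main__":
    print(is_balanced(["A1", "A2", "A3", "A4"]))  # one DIMM on every channel
    print(is_balanced(["A1", "A2", "A3"]))        # channel 3 left empty
```

With this check, a four-DIMM population in A1 to A4 counts as balanced, while a three-DIMM population leaves one channel empty and does not.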
The full list of memory module installation guidelines can be found in the product owner's manual, available through www.dell.com.
In summary, to maximize workload performance, the recommendation for 13G two-socket servers is to populate all available channels with two dual-rank, registered, 2400 MT/s DIMMs per channel.