· With two PowerEdge C6145s attached to a PowerEdge C410x (Full Sandwich configuration), the best performance achieved with HPL is 2891 GFLOPS (31% of theoretical peak), and the solution consumes 5030 watts.

· With a single PowerEdge C6145 attached to a C410x (Half Sandwich configuration), the best performance with HPL is 1697 GFLOPS (19% of theoretical peak), and the solution consumes 3802 watts.

· The measured GFLOPS per watt show that the C6145 and C410x solution converts power to FLOPS up to 1.7X more efficiently than a CPU-only configuration.

· GPGPUs offer great potential for improving the performance of suitable HPC applications.

Introduction

There is a lot of interest in the High Performance Computing (HPC) community in using General Purpose Graphics Processing Units (GPGPUs) to accelerate compute-intensive simulations. The Dell HPC Engineering team has configured and evaluated GPU-based solutions to help customers select the right solution for their specific needs. Dell has introduced the PowerEdge C410x as the primary workhorse for GPU-based number crunching, and our solutions are built around it. The current offering pairs the C410x with one or two AMD-based PowerEdge C6145 host servers.

Figure 1: Two PowerEdge C6145 host servers "sandwich" a PowerEdge C410x.

As shown in Figure 1, a PowerEdge C410x is used with two PowerEdge C6145 hosts. The PowerEdge C410x is an external 3U PCIe expansion chassis with space for 16 GPUs. Compute nodes connect to the C410x via a Host Interface Card (HIC) and an iPASS cable. All connected nodes are mapped to the available GPUs according to a user-defined configuration. The allocation of the 16 GPUs can be reconfigured dynamically through a web GUI, which makes the operation easier and faster. Currently, the available GPU-to-host ratios are 2:1, 4:1 and 8:1, so a single compute node can access up to 8 GPUs. The design of the C410x allows for a high GPU density solution with efficient power utilization characteristics.

Each C6145 has 2 compute nodes, giving a total of 4 compute nodes, all in 7U of rack space. Each compute node is configured with four AMD Opteron 6132 HE processors, 128 GB of 1333 MHz DDR3 memory, 4 PCIe connectors (SLOT 1, 2, 3 and MEZZ) and 1 external PCIe connector (iPASS). The total bandwidth (IOH 1 and IOH 2) between the node and external instruments is up to 10.4 GT/s. SLOT 3 and MEZZ are connected to IOH 1; the rest are attached to IOH 2.

Figure 2: iPASS cable and InfiniBand connection diagram

Configuration

As shown in Figure 2, each compute node is connected to the C410x using two iPASS cables (red) and to the InfiniBand switch (blue) for internode communication. It is critical that the compute nodes are connected to the C410x exactly as shown in Figure 2, since any other configuration may result in performance degradation. The details of the components used are given below:

Table 1: Test configuration

| Component | Parameter | Value |
|---|---|---|
| PowerEdge C410x | GPGPU model | NVIDIA Tesla M2070 |
| | Number of GPGPUs | 16 |
| | iPASS cables | 8 |
| | Mapping | 2:1, 4:1, 8:1 |
| PowerEdge C6145: compute node | Processor | 4 Opteron 6132 HE @ 2.2 GHz |
| | Memory | 128 GB 1333 MHz |
| | BIOS | 1.7.0 (4/13/11) |
| | BMC FW | 1.02 |
| | PIC FW | [0116] |
| | OS | RHEL 5.5 (2.6.18-194.el5) |
| | CUDA | 4.0 |
| M2070 GPGPU | Number of cores | 448 |
| | Memory | 6 GB |
| | Memory bandwidth | 150 GB/s |
| | Peak performance: single precision | 1030 GFLOPS |
| | Peak performance: double precision | 515 GFLOPS |
| Benchmark | GPU-enabled HPL from NVIDIA | Version 11 |

Best Practices for System Configuration

· The inter-node connection through the InfiniBand switch should use the MEZZ card, which is installed on IOH 1 and shares bandwidth with the GPUs connected through SLOT 3.

· Based on measured bandwidth tests, the best bandwidth utilization is achieved when a single HIC connects to a maximum of two GPUs.

· Using two HICs per compute node is highly recommended with the C6145 and C410x solution.

· Because of the NUMA architecture of the C6145, special attention should be given to process-to-memory mapping. In general, using memory near the GPGPUs gives better performance.

· A single compute node cannot work with more than 12 GPUs because of a system limitation.
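The HIC guidance above can be checked against the available mapping ratios with a little arithmetic. The following is a minimal sketch, not a Dell tool: the function names are hypothetical, and it assumes that a node's GPUs are split evenly across its HICs, which the text does not state.

```python
# Sketch: relate GPU-to-host mapping ratios to the HIC guidance above.
# Assumption (not from the source): GPUs are split evenly across a node's HICs.

TOTAL_GPUS = 16          # GPUs in a fully populated C410x
HICS_PER_NODE = 2        # recommended number of HICs per compute node
MAX_GPUS_PER_HIC = 2     # best measured bandwidth: at most two GPUs per HIC
MAPPINGS = (2, 4, 8)     # available GPU-to-host ratios (2:1, 4:1, 8:1)


def gpus_in_use(nodes, gpus_per_node):
    """GPUs actually attached when `nodes` hosts each map `gpus_per_node` GPUs."""
    return min(TOTAL_GPUS, nodes * gpus_per_node)


def describe_mapping(gpus_per_node, hics_per_node=HICS_PER_NODE):
    """Summarize one mapping ratio against the two-GPUs-per-HIC guideline."""
    gpus_per_hic = gpus_per_node / hics_per_node
    return {
        "gpus_per_node": gpus_per_node,
        "gpus_per_hic": gpus_per_hic,
        "within_bandwidth_guideline": gpus_per_hic <= MAX_GPUS_PER_HIC,
    }


if __name__ == "__main__":
    for ratio in MAPPINGS:
        info = describe_mapping(ratio)
        print(f"{ratio}:1 with 4 nodes -> {gpus_in_use(4, ratio)} GPUs in use, "
              f"{info['gpus_per_hic']:.0f} GPU(s) per HIC, "
              f"within guideline: {info['within_bandwidth_guideline']}")
```

Under these assumptions, the 2:1 and 4:1 mappings stay within the two-GPUs-per-HIC guideline when each node has two HICs, while 8:1 places four GPUs behind each HIC.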

Performance

Figure 3: Performance improvement due to GPGPUs.

As shown in Table 1, each M2070 GPGPU has a double-precision peak performance of 515 GFLOPS, giving a fully populated C410x with 16 GPUs a peak capacity of 8240 GFLOPS. Similarly, the peak compute capacity of a single C6145 compute node is 281.6 GFLOPS, so all four nodes together are rated at 1126.4 GFLOPS. The total double-precision peak performance of the GPGPU solution shown in Figure 1 is therefore about 9366 GFLOPS.

Figure 3 shows the improvement in HPL performance due to GPGPU acceleration. As a reference, the blue bars show the measured performance with CPUs only. The red bars show the performance when a total of 16 GPGPUs are used for acceleration. Two C6145s are attached to the C410x, and the mapping per compute node is set to either 4:1 or 2:1. When all four compute nodes of the two C6145s are used with no GPGPUs attached, HPL runs at an efficiency of 72.1%, which corresponds to roughly 812 GFLOPS of the 1126.4 GFLOPS CPU peak. With 4 GPUs per node, the performance increases to 2891.0 GFLOPS, which is 3.6X the CPU-only performance. For HPL, using the maximum number of 16 GPGPUs is beneficial in both cases. However, keeping the mapping ratio at 2:1 for HPL gives 1.6X more performance compared to a mapping ratio of 4:1.
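The peak figures in this section are simple products and sums, and can be reproduced as follows. This is a sketch under stated assumptions: the core count (8 per Opteron 6132 HE) and the 4 double-precision FLOPs per core per cycle are not given in the text and are assumed here because they reproduce the quoted 281.6 GFLOPS per node.

```python
# Sketch: theoretical double-precision peak of the Full Sandwich configuration.

GPU_PEAK_DP_GFLOPS = 515.0   # Tesla M2070 double-precision peak (from the text)
NUM_GPUS = 16                # fully populated C410x

SOCKETS_PER_NODE = 4         # four Opteron 6132 HE processors per node
CORES_PER_SOCKET = 8         # assumption: not stated in the text
CLOCK_GHZ = 2.2              # processor clock (from the text)
DP_FLOPS_PER_CYCLE = 4       # assumption: per core, not stated in the text
NUM_NODES = 4                # two C6145s with two compute nodes each


def node_peak_gflops():
    """Peak GFLOPS of one compute node: sockets x cores x GHz x FLOPs/cycle."""
    return SOCKETS_PER_NODE * CORES_PER_SOCKET * CLOCK_GHZ * DP_FLOPS_PER_CYCLE


def solution_peak_gflops():
    """Peak GFLOPS of all GPUs plus all compute nodes."""
    return NUM_GPUS * GPU_PEAK_DP_GFLOPS + NUM_NODES * node_peak_gflops()


def hpl_efficiency(measured_gflops, peak_gflops):
    """Fraction of theoretical peak achieved by a measured HPL result."""
    return measured_gflops / peak_gflops


if __name__ == "__main__":
    peak = solution_peak_gflops()
    print(f"Node peak:      {node_peak_gflops():.1f} GFLOPS")
    print(f"GPU peak:       {NUM_GPUS * GPU_PEAK_DP_GFLOPS:.1f} GFLOPS")
    print(f"Solution peak:  {peak:.1f} GFLOPS")
    print(f"HPL efficiency: {hpl_efficiency(2891.0, peak):.1%}")
```

With these inputs, the component peaks sum to 9366.4 GFLOPS, and the measured 2891.0 GFLOPS is about 31% of that peak.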

Power Consumption and Efficient Power Utilization

Compute-intensive benchmarks like HPL typically consume a large amount of power because they stress the processor and memory subsystems. Accurate power consumption values are therefore of interest for datacenter design. Figure 4 shows the power consumption of the GPGPU solution. When all four nodes are used with 16 GPGPUs, the total power consumption is 5030.5 watts, which is 2.1X the power consumed by the compute nodes without GPGPUs. The GFLOPS/watt metric measures how efficiently the consumed power is converted into useful performance. Figure 5 shows the GFLOPS/watt of the GPGPU solutions. When all four nodes are used, the solution delivers 0.575 GFLOPS/watt, which is about 1.7X the GFLOPS/watt of a CPU-only solution.
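The GFLOPS/watt metric is the measured HPL result divided by the measured power draw. As a small illustration, the sketch below applies it to the two GPGPU configurations reported in this article; the function name is hypothetical.

```python
# Sketch: performance per watt for the two reported GPGPU configurations.

def gflops_per_watt(gflops, watts):
    """Measured HPL performance divided by measured power draw."""
    return gflops / watts


# (HPL GFLOPS, watts) as reported in this article
CONFIGS = {
    "Full Sandwich (two C6145s + C410x)": (2891.0, 5030.5),
    "Half Sandwich (one C6145 + C410x)": (1697.0, 3802.0),
}

if __name__ == "__main__":
    for name, (gflops, watts) in CONFIGS.items():
        print(f"{name}: {gflops_per_watt(gflops, watts):.3f} GFLOPS/watt")
```

The Full Sandwich inputs reproduce the 0.575 GFLOPS/watt quoted above.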

Figure 4: Power Consumption of the C410x and C6145 compute nodes

Figure 5: Performance per Watt of the C410x and C6145 compute nodes