Executive Summary

· With two PowerEdge C6100s attached to a C410x (the "Full Sandwich" configuration), the best HPL performance achieved is 4215 GFLOPS (46% of theoretical peak) at a power consumption of 5525 watts.
· With a single PowerEdge C6100 attached to a C410x (the "Half Sandwich" configuration), the best HPL performance achieved is 2281 GFLOPS (26% of theoretical peak) at a power consumption of 4118 watts.
· The measured GFLOPS/watt show that the C6100 and C410x solution converts power to useful FLOPS up to 2.1X more efficiently than a CPU-only configuration.
· In general, the C6100 and C410x solution can achieve better performance by using faster CPUs and/or adding more memory in the C6100 compute nodes.

There is a lot of interest in the High Performance Computing (HPC) community in using General Purpose Graphics Processing Units (GPGPUs) to accelerate compute-intensive simulations. The Dell HPC Engineering team has configured and evaluated GPU-based solutions to help customers select the correct solution for their specific needs. Dell has introduced the C410x as the primary workhorse for GPU-based number crunching, and our solutions are built around it. The current offering combines one or two Intel-based PowerEdge C6100 servers as host servers for the C410x.

Figure 1: Two PowerEdge C6100 host servers “sandwich” a C410x.

As shown in Figure 1, a C410x is used with two PowerEdge C6100 hosts. The C410x is an external 3U PCIe expansion chassis with space for 16 GPUs. Compute nodes connect to the C410x via a Host Interface Card (HIC) and an iPASS cable. All connected nodes are mapped to the available GPUs according to a user-defined configuration. The way the 16 GPUs are allocated can be reconfigured dynamically through a web GUI, making the operation easy and fast. Currently, the available GPU-to-host ratios are 2:1, 4:1 and 8:1, so a single compute node can access up to 8 GPUs! The design of the C410x allows for a high-density GPU solution with efficient power utilization characteristics. Each C6100 has four compute nodes, giving a total of eight compute nodes, all in 7U of rack space. Each compute node is configured with two Intel Xeon X5650 2.67 GHz processors and 48 GB of DDR3 1,333 MHz memory.

Figure 2: iPASS cable and InfiniBand connection diagram.

As shown in Figure 2, each compute node is connected to the PowerEdge C410x via an iPASS cable (red) and to the InfiniBand switch (blue) for internode communication. The components used are detailed in Table 1:

Table 1: Configuration details of the test solution.

PowerEdge C410x
  GPGPU model                  NVIDIA Tesla M2070
  Number of GPGPUs             16
  iPASS cables                 8
  Mapping                      2:1, 4:1

PowerEdge C6100 compute node
  Processors                   Two Intel Xeon X5650 @ 2.67 GHz
  Memory                       48 GB, 1333 MHz
  BIOS                         1.54.92 (2/10/11)
  BMC firmware                 1.11
  PIC firmware                 0114
  OS                           RHEL 5.5 (kernel 2.6.18-194.el5)
  CUDA                         4.0

Tesla M2070 GPGPU
  Number of cores              448
  Memory                       6 GB
  Memory bandwidth             150 GB/s
  Peak performance, single precision   1030 GFLOPS
  Peak performance, double precision   515 GFLOPS

Benchmark
  GPU-enabled HPL from NVIDIA, version 11

As shown in Table 1, each M2070 GPGPU has a peak double-precision performance of 515 GFLOPS, giving a fully populated C410x with 16 GPUs a peak capacity of 8240 GFLOPS. Similarly, the peak compute capacity of a single C6100 compute node is 128.1 GFLOPS, so all eight nodes together are rated at 1024.8 GFLOPS. The total peak performance of the GPGPU solution shown in Figure 1 is therefore 9264.8 GFLOPS (double precision).
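The peak figures above follow directly from the hardware specifications in Table 1. A minimal sketch of the arithmetic, assuming the standard figure of 4 double-precision FLOPs per cycle per core for a Westmere-EP Xeon such as the X5650:

```python
# Theoretical double-precision peak of the full-sandwich solution.
GHZ = 2.67                # Xeon X5650 clock
CORES_PER_SOCKET = 6
SOCKETS_PER_NODE = 2
DP_FLOPS_PER_CYCLE = 4    # SSE: 2 adds + 2 multiplies per cycle per core

node_peak = SOCKETS_PER_NODE * CORES_PER_SOCKET * GHZ * DP_FLOPS_PER_CYCLE
cpu_peak = 8 * node_peak            # eight compute nodes in two C6100s
gpu_peak = 16 * 515.0               # sixteen M2070s, 515 GFLOPS each
total_peak = cpu_peak + gpu_peak

print(round(node_peak, 2))   # 128.16 (the text rounds this to 128.1,
                             # hence its 1024.8 and 9264.8 GFLOPS totals)
print(round(total_peak, 1))
```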

Figure 3a: Performance improvement due to GPGPUs.

Figure 3a shows the improvement in HPL performance due to GPGPU acceleration. As a reference, the blue bars show the measured performance with CPUs only. The green (red) bars show the performance improvement when a total of 16 GPGPUs (8 GPGPUs) are used for acceleration. There are two sets of readings. In the first set, only one C6100 is attached to the C410x, and the mapping per compute node is set to either 4:1 or 8:1, depending on how many GPGPUs are evaluated. In the second set, two C6100s are attached to the C410x, and the mapping per compute node is set to either 2:1 or 4:1. When all eight compute nodes of the two C6100s are used with no GPGPUs attached, the performance is 934 GFLOPS, an efficiency of 91.1%. Using one GPU per node to accelerate, the performance increases to 2410 GFLOPS (46.8% efficiency), which is 2.4X the CPU-only performance. Similarly, using two GPUs per node, the performance further increases to 4215 GFLOPS (45.5% efficiency), which is 4.5X the CPU-only performance. For HPL, using the maximum of 16 GPGPUs is beneficial in both cases. However, keeping the mapping ratio at 2:1 gives 1.9X more performance than a mapping ratio of 4:1.
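The efficiency figures quoted above are simply the measured HPL GFLOPS divided by the combined theoretical peak of the CPUs and GPUs in use. A small sketch, using the per-node CPU peak of 128.1 GFLOPS and the per-GPU peak of 515 GFLOPS from the text:

```python
CPU_PEAK_PER_NODE = 128.1   # GFLOPS, dual Xeon X5650, double precision
GPU_PEAK = 515.0            # GFLOPS, one M2070, double precision

def hpl_efficiency(measured_gflops, nodes, gpus):
    """HPL efficiency = measured performance / theoretical peak of the
    CPUs and GPUs participating in the run."""
    peak = nodes * CPU_PEAK_PER_NODE + gpus * GPU_PEAK
    return measured_gflops / peak

# Measured values from Figure 3a: 8 nodes with 0, 8, and 16 GPGPUs.
print(hpl_efficiency(934, 8, 0))     # ~0.911
print(hpl_efficiency(2410, 8, 8))    # ~0.468
print(hpl_efficiency(4215, 8, 16))   # ~0.455
```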

Power Consumption and Efficient Power Utilization
Compute-intensive benchmarks like HPL typically consume a large amount of power because they stress the processor and memory subsystems. From a datacenter design point of view, accurate power consumption values are therefore of particular interest. Figure 3b shows the power consumption of the GPGPU solution. When all eight nodes are used with 16 GPGPUs, the total power consumption is 5525 watts, which is 2.2X the power consumed by the compute nodes without GPGPUs. The GFLOPS/watt metric measures how efficiently the power consumed is converted to useful performance. Figure 4 shows the GFLOPS/watt of the GPGPU solutions. When all eight nodes are used, the solution delivers 0.763 GFLOPS/watt, which is more than twice the GFLOPS/watt of the CPU-only solution.
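The GFLOPS/watt metric is just measured HPL performance divided by measured power. A quick sketch of the numbers above; note that the CPU-only power draw is not stated directly in the text, so it is inferred here from the quoted 2.2X ratio and should be treated as an estimate:

```python
def gflops_per_watt(gflops, watts):
    """Power efficiency: useful performance per watt consumed."""
    return gflops / watts

# Full sandwich: 8 nodes + 16 GPGPUs (measured values from the text).
full_sandwich = gflops_per_watt(4215, 5525)
print(round(full_sandwich, 3))               # 0.763

# CPU-only power estimated from the "2.2X" figure quoted above.
cpu_only = gflops_per_watt(934, 5525 / 2.2)
print(round(full_sandwich / cpu_only, 2))    # ~2.05, i.e. "more than twice"
```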

Figure 3b: Power Consumption of the C410x and C6100 compute nodes.

Figure 4: GFLOPS/watt of the C410x and C6100 compute nodes.