by Shawn Gao and Saeed Iqbal

The new generation of NVIDIA general-purpose graphics processing units (GPUs), code name “Kepler,” brings significant performance improvements, especially in accelerating compute-intensive applications. High Performance Linpack (HPL) stresses the compute and memory subsystems of the systems under test and is widely accepted as a reference benchmark in the HPC community. Investigating GPU-accelerated HPL not only provides deeper insight into system performance but also helps researchers reach results faster. Our study compares the HPL performance of two GPU models, K20 and K40, in the PowerEdge R720.

GPUs are installed inside the servers to provide the extra compute horsepower required for application acceleration. Dell now offers a full-featured solution (up to 2 GPUs per server) based on the PowerEdge R720 server. In this blog, we compare the performance and power consumption of GPU-accelerated HPL on a single-node PowerEdge R720 with the two latest GPU models: K20 and K40.

Figure 1: HPL performance and efficiency on a single node. Results are presented for different GPUs.

Figure 1 shows the performance of HPL on a single-node R720 with different GPUs. Compared to a CPU-only configuration, the K40 GPUs deliver an acceleration of 5X and the K20 GPUs an acceleration of 4.5X. The HPL efficiency is marginally higher on the K40 than on the K20. Figure 2 shows the power consumption while running HPL. As shown, the power efficiency, i.e. the useful work delivered (GFLOPS in the case of HPL) for every watt of power consumed, improves when GPUs are added. With two Kepler GPUs, the power efficiency is up to 3X that of the CPU-only configuration.

Figure 2: Total power and power efficiency on a single node.

First and foremost, using GPUs can substantially accelerate HPL. As shown in Figure 1, with CPUs only, each compute node delivers about 490 GFLOPS of sustained performance; adding GPUs improves the sustained performance to about 2600 GFLOPS. Second, using GPUs improves the performance/watt ratio as well. Power consumption increases when GPUs are added, but not by as much as the corresponding performance. As shown in Figure 2, a CPU-only node consumes about 537 watts and operates at 0.92 GFLOPS/watt; with GPUs added, power consumption rises to about 938 watts but the node operates at 2.78 GFLOPS/watt, an increase of about 200% in performance/watt.
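The arithmetic behind these figures can be checked with a short sketch. It uses only the rounded values quoted above (about 490 and 2600 GFLOPS, about 537 and 938 watts); the function and variable names are illustrative, not part of any benchmark tool. Because the inputs are rounded, the computed GFLOPS/watt values come out slightly below the 0.92 and 2.78 quoted from the measured data.

```python
def perf_per_watt(gflops: float, watts: float) -> float:
    """Power efficiency: sustained GFLOPS delivered per watt consumed."""
    return gflops / watts


# Rounded values quoted in the text (approximate, not raw measurements).
cpu_gflops, cpu_watts = 490.0, 537.0   # CPU-only node
gpu_gflops, gpu_watts = 2600.0, 938.0  # same node with GPUs added

cpu_eff = perf_per_watt(cpu_gflops, cpu_watts)
gpu_eff = perf_per_watt(gpu_gflops, gpu_watts)

speedup = gpu_gflops / cpu_gflops                 # performance gain
power_ratio = gpu_watts / cpu_watts               # power increase
eff_gain_pct = (gpu_eff / cpu_eff - 1.0) * 100.0  # gain in GFLOPS/watt

print(f"CPU only : {cpu_eff:.2f} GFLOPS/watt")
print(f"With GPUs: {gpu_eff:.2f} GFLOPS/watt")
print(f"Speedup {speedup:.1f}X for {power_ratio:.2f}X the power")
print(f"Performance/watt gain: about {eff_gain_pct:.0f}%")
```

The point the sketch makes is that performance grows by roughly 5X while power grows by less than 2X, which is why performance/watt roughly triples.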

One of the significant changes from K20 to K40 is the upgrade from PCIe Gen 2 to PCIe Gen 3. Comparing the two, we found that PCIe Gen 3 (K40) improves host-to-device bandwidth by 77% and device-to-host bandwidth by 58% over PCIe Gen 2 (K20). The device bandwidth (GPU internal memory bandwidth) improves by 25% (181 GB/s for K40 versus 145 GB/s for K20). Figure 3 below shows a more detailed comparison.
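As a quick check on how these percentages are derived, the sketch below recomputes the device bandwidth improvement from the two values quoted above and converts the quoted PCIe gains into ratios. The helper names are hypothetical, and the absolute host-to-device and device-to-host bandwidths are not given in the text, so only the percentages are used for those.

```python
def percent_improvement(new: float, old: float) -> float:
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


def ratio_from_percent(percent: float) -> float:
    """Convert a percentage improvement into a multiplicative ratio."""
    return 1.0 + percent / 100.0


# Device (GPU internal memory) bandwidth quoted in the text, in GB/s.
k20_device_bw = 145.0
k40_device_bw = 181.0

device_gain = percent_improvement(k40_device_bw, k20_device_bw)
print(f"Device bandwidth gain, K40 over K20: {device_gain:.1f}%")

# PCIe Gen 3 (K40) gains over PCIe Gen 2 (K20), as quoted in the text.
for label, pct in (("host-to-device", 77.0), ("device-to-host", 58.0)):
    print(f"{label}: {ratio_from_percent(pct):.2f}X the Gen 2 bandwidth")
```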

Figure 3: K20/K40 GPUs host-to-device (H2D) and device-to-host (D2H) bandwidth comparison.


Compute Node: PowerEdge R720
Number of Compute Nodes:
Compute Node processor: Two Intel Xeon E5-2697 v2 @ 2.7 GHz
Memory: 128 GB 1600 MHz
GPUs: NVIDIA Tesla K20 and K40
Number of GPUs:

GPU specifications (K20 / K40)
Number of cores:
Memory: 5 GB / ~11 GB
Memory bandwidth: 208 GB/s / 288 GB/s
Peak Performance (SP, single precision):
Peak Performance (DP, double precision):
PCIe Gen: Gen 2 / Gen 3
Power Capping:

Benchmark: GPU-accelerated HPL, Version 2
Operating system: RHEL 6.4