by Mayura Deshmukh and Saeed Iqbal

The new generation of NVIDIA general-purpose graphics processing units (GPUs), code-named “Kepler,” delivers significant performance improvements, especially for compute-intensive applications. High Performance Linpack (HPL) stresses the compute and memory subsystems of the system under test and is widely accepted as a reference benchmark in the HPC community. Investigating how GPUs accelerate HPL therefore gives deeper insight into system performance and helps researchers reach results faster. This study compares HPL performance on two “Kepler” GPUs, the K20 and the K40.

Dell now offers a full-featured GPU solution based on PowerEdge R720 servers. In this solution, GPUs are installed inside the servers to provide the extra compute horsepower required for application acceleration. Two of the latest Tesla K20 or K40 GPUs can be added to each PowerEdge R720 server. In this blog, we present and compare the performance and power results of GPU-accelerated HPL on a single PowerEdge R720 node with K20 and K40 GPUs.

Figure 1: HPL performance and efficiency on a single node. Results are presented for different GPUs.

Figure 1 shows HPL performance on a single R720 node with different GPUs. Compared to the CPU-only configuration, the K40 GPUs deliver a 6.2X speedup and the K20 GPUs a 5.1X speedup. HPL efficiency is also slightly higher on K40 (79.9%) than on K20 (79.1%). Figure 2 shows the power consumption measured while running HPL. Power efficiency, i.e. the useful work delivered for every watt of power consumed, improves when GPUs are added: with two Kepler GPUs it is up to 3.7X that of the CPU-only configuration.

Figure 2: Total Power and Power Efficiency of the eight node cluster. 

In conclusion, GPUs substantially accelerate HPL. As shown in Figure 1, a CPU-only compute node delivers about 419 GFLOPS of sustained performance; adding K40m GPUs raises this to about 2600 GFLOPS. GPUs also improve the performance/watt ratio: power consumption rises, but by less than the corresponding performance gain. As shown in Figure 2, a CPU-only node consumes about 505 watts and operates at 0.83 GFLOPS/watt. With K40m GPUs, power consumption increases to about 850 watts but the node operates at 3.07 GFLOPS/watt, about 3.7X the CPU-only figure (an increase of about 270% in performance/watt).
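The speedup and performance/watt figures above can be rechecked with a few lines of arithmetic. The sketch below uses only the four rounded values quoted in the text (419 and 2600 GFLOPS, 505 and 850 watts); the variable and function names are illustrative, and because the inputs are rounded, the results may differ from the quoted figures in the last digit.

```python
# Rounded figures quoted in the text ("about ..."), not raw measurements.
cpu_gflops = 419.0    # CPU-only sustained HPL performance
cpu_watts = 505.0     # CPU-only power draw
gpu_gflops = 2600.0   # sustained HPL performance with K40m GPUs
gpu_watts = 850.0     # power draw with K40m GPUs


def perf_per_watt(gflops, watts):
    """Power efficiency: useful work delivered per watt consumed."""
    return gflops / watts


cpu_eff = perf_per_watt(cpu_gflops, cpu_watts)
gpu_eff = perf_per_watt(gpu_gflops, gpu_watts)

speedup = gpu_gflops / cpu_gflops          # performance ratio
power_ratio = gpu_watts / cpu_watts        # power ratio
eff_ratio = gpu_eff / cpu_eff              # performance/watt ratio
eff_increase_pct = (eff_ratio - 1.0) * 100.0

print(f"CPU-only: {cpu_eff:.2f} GFLOPS/watt")
print(f"With GPUs: {gpu_eff:.2f} GFLOPS/watt")
print(f"Speedup: {speedup:.1f}X, power: {power_ratio:.2f}X")
print(f"Performance/watt: {eff_ratio:.1f}X ({eff_increase_pct:.0f}% increase)")
```

Performance grows by roughly 6.2X while power grows by well under 2X, which is why the performance/watt ratio improves even though the node draws more power.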

One of the significant changes from K20 to K40 is the bandwidth upgrade from PCIe Gen 2 to PCIe Gen 3. Relative to the K20, the K40 improves host-to-device bandwidth by 74% and device-to-host bandwidth by 56%. Device-to-device bandwidth improves by 40% (203 GB/s for K40 versus 145 GB/s for K20). Figure 3 below shows a more detailed comparison.

Figure 3: K20/K40 GPUs host-to-device (H2D) and device-to-host (D2H) bandwidth comparison.
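As a quick check on how the bandwidth percentages are derived, the sketch below recomputes the device-to-device improvement from the two quoted values (203 GB/s and 145 GB/s). The helper name `percent_improvement` is hypothetical; the absolute host-to-device and device-to-host values are not given in the text, so only the device-to-device case is shown.

```python
def percent_improvement(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


# Device-to-device bandwidth values quoted in the text, in GB/s.
k40_d2d = 203.0
k20_d2d = 145.0

gain = percent_improvement(k40_d2d, k20_d2d)
print(f"K40 vs K20 device-to-device bandwidth: {gain:.0f}% higher")
```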