by Saeed Iqbal and Shawn Gao 

The NVIDIA Tesla K20 GPU has demonstrated strong performance and power efficiency across many HPC applications. The K20 is based on NVIDIA's Kepler GK110 architecture and incorporates several innovative features and micro-architectural enhancements of the Kepler design. Since the K20 release, NVIDIA has launched an upgraded model, the K20X, which has more processing cores, more memory, and higher memory bandwidth. This blog quantifies the performance and power efficiency improvements of the K20X over the K20, to help readers make an informed decision between the two GPU options.

High Performance LINPACK (HPL) is an industry-standard, compute-intensive benchmark traditionally used to stress the compute and memory subsystems. With GPUs increasingly common in HPC, NVIDIA develops and maintains a GPU-enabled version of HPL that uses both the CPUs and the GPU accelerators for computation. We used the Kepler GPU-enabled HPL version 2.0 for this study.

We use the Dell PowerEdge R720 for the performance comparisons. The PowerEdge R720 is a versatile, full-featured dual-socket server with a large memory capacity, and it can hold up to two internal GPUs. Our standard test configuration is two GPUs per server.

Hardware Configuration and Results

The server and GPU configuration details are given in the tables below.

Table 1:  Server Configuration

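| Category | Component | Details |
|----------|-----------|---------|
| Server | Model | PowerEdge R720 |
| Server | Processor | Two Intel Xeon E5-2670 @ 2.6 GHz |
| Server | Memory | 128 GB (16 x 8 GB), 1600 MHz, 2 DPC |
| Server | GPUs | NVIDIA Tesla K20 and K20X |
| Server | Number of GPUs installed | 2 |
| Server | BIOS | 1.6 |
| Software | Benchmark: GPU-accelerated HPL | Version 2.1 |
| Software | CUDA, Driver | 5.0, 304.54 |
| Software | OS | RHEL 6.4 |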

Table 2: K20 and K20X: Relevant parameter comparison

| GPU Model | K20X | K20 | Improvement (K20X) |
|-----------|------|-----|--------------------|
| Number of cores | 2,688 | 2,496 | 7.7% |
| Memory (VRAM) | 6 GB | 5 GB | 20.0% |
| Memory bandwidth | 250 GB/s | 208 GB/s | 20.2% |
| Peak performance (SP) | 3.95 TFLOPS | 3.52 TFLOPS | 12.2% |
| Peak performance (DP) | 1.31 TFLOPS | 1.17 TFLOPS | 11.9% |
| TDP | 235 W | 225 W | 4.4% |
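The improvement column in Table 2 is a simple relative gain of each K20X value over the corresponding K20 value. A minimal Python sketch of that arithmetic follows; the dictionary and the helper name `improvement_pct` are illustrative and not part of the original study. Because the TFLOPS figures in the table are themselves rounded, a recomputed percentage may differ from the table in the last digit.

```python
# Values copied from Table 2 as (K20X, K20) pairs.
# The structure and the helper name are illustrative assumptions.
specs = {
    "Number of cores": (2688, 2496),
    "Memory (GB)": (6, 5),
    "Memory bandwidth (GB/s)": (250, 208),
    "Peak SP (TFLOPS)": (3.95, 3.52),
    "Peak DP (TFLOPS)": (1.31, 1.17),
    "TDP (W)": (235, 225),
}


def improvement_pct(k20x, k20):
    """Relative gain of the K20X value over the K20 value, in percent."""
    return (k20x / k20 - 1.0) * 100.0


for name, (k20x, k20) in specs.items():
    print(f"{name:<26} {improvement_pct(k20x, k20):5.1f}%")
```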

Figure 1: HPL performance and efficiency on R720 for K20X and K20 GPUs. 

Figure 1 illustrates the HPL performance on the PowerEdge R720, with the CPU-only performance shown for reference. The K20X improves HPL GFLOPS by about 11.2% compared to the K20. Relative to the CPU-only configuration, the HPL acceleration is 7.7X with K20X GPUs and 6.9X with K20 GPUs. In addition to improved performance, the compute efficiency of the K20X is slightly better than that of the K20: as shown in Figure 1, the K20X configuration reaches 82.6% and the K20 configuration 82.1%. It is typical for CPU-only configurations to have higher efficiency than heterogeneous CPU+GPU configurations, and Figure 1 bears this out: the CPU-only configuration reaches 94.6% efficiency, while the CPU+GPU configurations are in the low 80s.
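To see how the efficiency and speedup figures relate, the sketch below estimates HPL throughput as efficiency times theoretical peak, where peak is taken as the CPU peak plus the peak of both GPUs. It rests on assumptions that the post does not state: 8 cores per Xeon E5-2670, 8 double-precision FLOPs per core per cycle (AVX), and efficiency defined against the combined CPU+GPU peak. The measured GFLOPS values appear only in Figure 1, so the numbers produced here are estimates rather than the reported results.

```python
# Assumptions (not stated in the post): 8 cores per Xeon E5-2670,
# 8 double-precision FLOPs per core per cycle (AVX), and HPL efficiency
# defined as measured GFLOPS / (CPU peak + GPU peak).
SOCKETS = 2
CORES_PER_SOCKET = 8
CLOCK_GHZ = 2.6
DP_FLOPS_PER_CYCLE = 8
NUM_GPUS = 2

# Theoretical CPU peak in GFLOPS.
cpu_peak = SOCKETS * CORES_PER_SOCKET * CLOCK_GHZ * DP_FLOPS_PER_CYCLE

# name: (peak DP GFLOPS per GPU from Table 2, HPL efficiency from Figure 1)
configs = {
    "CPU-only": (0.0, 0.946),
    "K20": (1170.0, 0.821),
    "K20X": (1310.0, 0.826),
}


def estimated_hpl_gflops(gpu_peak, efficiency):
    """Estimate HPL GFLOPS as efficiency times the combined theoretical peak."""
    return efficiency * (cpu_peak + NUM_GPUS * gpu_peak)


est = {name: estimated_hpl_gflops(*v) for name, v in configs.items()}

for name, gflops in est.items():
    speedup = gflops / est["CPU-only"]
    print(f"{name:<9} {gflops:7.0f} GFLOPS  {speedup:.1f}x vs CPU-only")

gain = (est["K20X"] / est["K20"] - 1.0) * 100.0
print(f"K20X over K20: {gain:.1f}%")
```

Under these assumptions the estimates land close to the reported 7.7X, 6.9X, and 11.2% figures, which suggests the efficiency and speedup numbers are mutually consistent.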

Figure 2: Total Power and Power Efficiency on PowerEdge R720 for K20 and K20X GPUs. 

Figure 2 illustrates the total system power consumption of the different PowerEdge R720 configurations. The first thing to note is that GPUs consume substantial power. The CPU-only configuration draws about 450W, which rises to above 800W when K20 or K20X GPUs are installed in the server, an increase of roughly 80%. This should be taken into account when budgeting power and sizing the power supply system for large installations. However, once the power is delivered, the GPUs are much better than CPUs alone at converting energy into useful work, as the performance-per-watt numbers in Figure 2 show. The K20X configuration delivers 2.79 GFLOPS/W, which is about 4X better than the CPU-only configuration. Similarly, the K20 configuration delivers 2.68 GFLOPS/W, which is about 3.8X better than the CPU-only configuration. It is interesting to note that the K20X improves performance per watt by about 4% over its predecessor, the K20, while drawing about 7% more power.
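The power figures can be cross-checked with a few lines of arithmetic. Since power equals performance divided by performance per watt, the K20X's extra power draw follows from its HPL gain and its performance-per-watt gain. The sketch below uses only numbers quoted in the text; the variable names are illustrative, and the 800W value is treated as a lower bound because the post says "above 800W".

```python
# Figures quoted in the post; names are illustrative.
PERF_PER_WATT = {"K20": 2.68, "K20X": 2.79}  # GFLOPS/W
HPL_GAIN_K20X = 0.112        # K20X HPL gain over K20 (about 11.2%)
CPU_ONLY_WATTS = 450.0       # approximate CPU-only system power
GPU_SYSTEM_WATTS = 800.0     # "above 800W", used here as a lower bound


def relative_gain(new, old):
    """Fractional increase of new over old."""
    return new / old - 1.0


# Performance-per-watt gain of K20X over K20.
ppw_gain = relative_gain(PERF_PER_WATT["K20X"], PERF_PER_WATT["K20"])

# power = performance / (performance per watt), so the ratios divide.
power_gain = (1.0 + HPL_GAIN_K20X) / (1.0 + ppw_gain) - 1.0

# Increase in system power when GPUs are added to the CPU-only server.
system_power_gain = relative_gain(GPU_SYSTEM_WATTS, CPU_ONLY_WATTS)

print(f"Perf/W gain, K20X over K20:  {ppw_gain * 100:.1f}%")
print(f"Implied power increase:      {power_gain * 100:.1f}%")
print(f"GPU system vs CPU-only power: +{system_power_gain * 100:.0f}% or more")
```

The performance-per-watt gain works out to about 4% and the implied power increase to about 7%, which matches the power figure given in the Summary.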

Summary

The K20X delivers about 11% higher performance and consumes about 7% more power than the K20 on the HPL benchmark. These results are in line with the increases expected from the theoretical parameters compared in Table 2.