- By Mayura Deshmukh

Every year, Graphics Processing Units (GPUs) become more powerful, delivering more teraflops and a quantum leap in performance for commonly used molecular dynamics and manufacturing codes, allowing researchers to use more efficient and denser high performance computing architectures. What is the performance difference between CPUs and GPUs? How much power do they consume? How well do K80 GPUs perform in the Dell PowerEdge C4130 server? Which configuration is the best for my application? These are some of the questions that come to mind, and this blog aims to answer them.

This blog presents work conducted to measure and analyze the performance, power consumption and performance per watt of a single Dell PowerEdge C4130 server with NVIDIA K80 GPUs. The PowerEdge C4130 is the latest GPU high density design from Dell, offering up to four GPUs in a 1U form factor. What makes the PowerEdge C4130 unique is its configurable system design, potentially making it a better fit for a wider variety of extreme HPC applications.

The HPC-focused Tesla K80 GPU provides 1.87/2.91 TFLOPS (double precision, base/boost) of compute capacity, which is about 31%-75% more than the K40, the previous Tesla card. The K40's base clock is 745 MHz, and it can be boosted up to 810 MHz or 875 MHz. The K80 has a lower base clock of 562 MHz, but it can climb up to 875 MHz in 13 MHz increments. Another new feature of the K80 is Autoboost, which provides additional performance if additional power and thermal headroom is available. The K80 contains two internal GPUs based on the GK210 architecture with a total of 4,992 cores, a 73% improvement over the K40. The K80 has a total memory of 24 GB, divided equally between the two internal GPUs; this is 100% more memory capacity than the K40. The aggregate memory bandwidth of the K80 is improved to 480 GB/s. The rated power consumption of a single K80 is a maximum of 300 watts.

Configuration

The C4130 offers eight configurations, “A” through “H”. Since the GPUs provide the bulk of the compute horsepower, the configurations can be divided into three groups based on expected performance: the first group of four configurations, “A”, “B”, “C” and “G”, with four GPUs each; the second group of a single configuration, “H”, with three GPUs; and the third group of three configurations, “D”, “E” and “F”, with two GPUs each. The quad GPU configurations “A”, “B” and “G” have an internal PCIe switch module. The details of the various configurations are shown in Table 1 and the block diagram (Figure 1) below:

Table 1: C4130 Configurations

 

  Figure 1: C4130 Configuration Block Diagram

Table 2 gives more information about the hardware configuration, profiles and firmware used for the benchmarking.

Table 2: Hardware Configuration

 

Bandwidth

CUDA’s heterogeneous programming model uses both the CPU and GPU, so data transfer between CPUs and GPUs greatly affects performance.

Figure 2: Memory Bandwidth for C4130 

     


Figure 2 shows the host-to-device (CPU --> GPU) and device-to-host (GPU --> CPU) memory bandwidth for all the C4130 configurations. The measured bandwidth is around 12,000 MB/s (the theoretical peak is 15,754 MB/s).
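For reference, host-to-device bandwidth of this kind is typically measured with a timed cudaMemcpy() loop over pinned host memory. The sketch below is a minimal illustrative example, similar in spirit to NVIDIA's bandwidthTest sample, not the exact tool used to produce Figure 2; the transfer size and iteration count are arbitrary choices.

```cuda
// Minimal host-to-device bandwidth measurement with pinned memory.
// Transfer size and iteration count are arbitrary illustrative choices.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 256 << 20;   // 256 MB per transfer
    const int    iters = 20;

    void *h_buf, *d_buf;
    cudaMallocHost(&h_buf, bytes);    // pinned host memory for full PCIe throughput
    cudaMalloc(&d_buf, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Host-to-device bandwidth: %.0f MB/s\n",
           (double)bytes * iters / (ms / 1000.0) / (1 << 20));

    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}
```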

Nvidia’s GPUDirect Peer to Peer feature enables GPUs on the same PCIe root complex to directly transfer data between their memories, avoiding any copies to system memory. This dramatically lowers CPU overhead, and reduces latency, resulting in significant performance improvements in data transfer time for applications. Without the peer to peer feature, to get data from one GPU to another on the same host, one would use cudaMemcpy() first to get the data from the GPU to system memory, then another cudaMemcpy() to get the same data onto the second GPU.
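The sketch below illustrates the difference. It is a minimal example, not the benchmark used for Figure 3, and the device IDs 0 and 1 are placeholders for two GPUs that share a PCIe root complex (or sit behind the same switch module). With peer access enabled, cudaMemcpyPeer() moves the data directly between the GPUs; without it, the runtime falls back to staging the copy through system memory.

```cuda
// Minimal sketch of a direct GPU-to-GPU copy using GPUDirect peer-to-peer.
// Device IDs 0 and 1 are placeholders for two GPUs on the same PCIe root
// complex (or behind the same C4130 switch module).
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 64 << 20;    // 64 MB
    void *d0 = nullptr, *d1 = nullptr;

    cudaSetDevice(0);
    cudaMalloc(&d0, bytes);
    cudaSetDevice(1);
    cudaMalloc(&d1, bytes);

    // Check whether device 0 can address device 1's memory, and enable
    // peer access in both directions if it can.
    int can_access = 0;
    cudaDeviceCanAccessPeer(&can_access, 0, 1);
    if (can_access) {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);   // flags argument must be 0
        cudaSetDevice(1);
        cudaDeviceEnablePeerAccess(0, 0);
    }

    // With peer access enabled this is a single direct PCIe transfer;
    // without it, the runtime stages the copy through system memory.
    cudaMemcpyPeer(d1, 1, d0, 0, bytes);
    cudaDeviceSynchronize();

    printf("Peer access %s\n", can_access ? "enabled" : "not available");

    cudaSetDevice(0); cudaFree(d0);
    cudaSetDevice(1); cudaFree(d1);
    return 0;
}
```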

Figure 3: Peer-to-peer Bandwidth for C4130

Figure 3 shows the peer-to-peer bandwidth between GPUs for the C4130 with a switch module (Configuration B), without a switch module (Configuration C - dual CPUs, balanced with four GPUs), and with two virtual switches (Configuration G).

  • For configuration B the bandwidth is constant at 24.6 GB/s across all GPUs.
  • For configuration C the bandwidth is:
    • 24.6 GB/s for data transfers between GPUs on the same card (GPU1<-->GPU2, GPU3<-->GPU4, GPU5<-->GPU6, GPU7<-->GPU8)
    • 19.6 GB/s for data transfers between GPUs connected to the same CPU (GPU1,2<-->GPU3,4; GPU5,6<-->GPU7,8)
    • 18.7 GB/s for data transfers between GPUs connected to the other CPU (GPU1,2,3,4<-->GPU5,6,7,8)
  • For configuration G, which has two virtual switches, the bandwidth is:
    • 24.6 GB/s for data transfers between GPUs connected to the same CPU via a virtual switch (GPU1,2<-->GPU3,4; GPU5,6<-->GPU7,8)
    • 17.5 GB/s for data transfers between GPUs connected to the other CPU (GPU1,2,3,4<-->GPU5,6,7,8)

Applications that require a lot of peer-to-peer communication can benefit from the high bandwidth offered by the C4130 switch module configurations (A, B, G).

HPL

HPL solves a random dense linear system in double-precision arithmetic on distributed-memory systems and is a very compute intensive benchmark. NVIDIA’s pre-compiled HPL, Intel MKL 2015 and OpenMPI 1.6.5 were used for the benchmarking. The problem size (N) was chosen so that the matrix occupied roughly 90% of the system memory.
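As a rough illustration of how such a problem size can be derived: the HPL matrix holds N x N double-precision values (8*N^2 bytes), so N is approximately the square root of 90% of the available memory divided by 8, rounded down to a multiple of the block size NB. The snippet below is only a sketch; the 256 GB memory size and the NB value are placeholder assumptions, not the exact values from Table 2.

```cuda
// Rough estimate of an HPL problem size (N) that targets ~90% of system
// memory. The memory size and NB block size here are assumptions.
#include <cmath>
#include <cstdio>

int main() {
    const double mem_bytes = 256.0 * 1024 * 1024 * 1024;  // assumed total system memory
    const long   nb        = 768;                          // example HPL block size

    // The HPL matrix is N x N double-precision values, i.e. 8 * N^2 bytes.
    long n = (long)std::sqrt(0.90 * mem_bytes / 8.0);
    n = (n / nb) * nb;   // round down to a multiple of NB

    printf("Suggested HPL problem size N: %ld\n", n);
    return 0;
}
```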

Figure 4: HPL performance and power consumption with C4130

   

 

The blue bars on the left graph in Figure 4 show the HPL performance characterization of the PowerEdge C4130. The results are reported in GFLOPS on the Y-axis of the graph.

  • Performance for the four GPU configurations, “A”, “B”, “C” and “G”, ranges from 6.5 to 7.3 TFLOPS. Configurations “C” and “G”, balanced with two GPUs per CPU, are the highest performing configurations at 7.3 TFLOPS. The performance difference between “A” and “B” can be attributed to the additional CPU in configuration “B”. The difference from “B” to “G” or “C” is due to the different GPU-to-CPU ratios; all three have the same number of compute resources. Configurations “C” and “G” are balanced with two GPUs per CPU, while “B” has all four GPUs attached to a single CPU.
  • The only three GPU configuration, “H”, achieved 6.4 TFLOPS, which falls between the performance of the four GPU and two GPU configurations.
  • For the two GPU configurations, “D” is highest with 3.8 TFLOPS, followed by “E” and “F” with 3.6 TFLOPS. Configuration “E” has one fewer CPU, which explains its lower performance compared to “D”.
  • Both “D” and “F” have two CPUs and two GPUs, but in configuration “F” both GPUs are connected to a single CPU, whereas in configuration “D” each GPU is connected to its own CPU (more CPU cores per GPU).

Compared to CPU-only performance, run on two E5-2690 v3 processors, an acceleration of ~9X is obtained by using four K80s, ~7X by using three K80s, and ~4.7X with two K80 GPUs. The HPL efficiency is significantly higher on the K80 (low to upper 80 percent range) compared to the previous generation of GPUs.

The red bars on the right graph in Figure 4 represent the power consumption for the HPL runs. The quad GPU configurations “A”, “B”, “C” and “G” consume significantly more power than the CPU-only runs, which is expected for compute intensive workloads. But the energy efficiency (calculated as performance per watt) of these configurations is 4+ GFLOPS/W, compared to 1.6 GFLOPS/W for the CPU-only HPL runs. The power consumption for the three GPU configuration “H” is 2.7X that of the CPU-only run and its energy efficiency is 4.1 GFLOPS/W, which makes it an energy efficient, lower cost alternative to the quad GPU configurations. The dual GPU configurations “D”, “E” and “F” consume less power (1.8X to 2.1X compared to CPU-only runs) and their energy efficiency is in the range of 3.5 GFLOPS/W to 3.9 GFLOPS/W, about 2.3X better than the CPU-only runs.
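As a quick sanity check on these performance-per-watt figures, relative energy efficiency is simply the acceleration divided by the relative power consumption. The short snippet below works through configuration “H” as an example (~7X HPL acceleration at 2.7X power), which reproduces the quoted 4.1 GFLOPS/W when scaled from the 1.6 GFLOPS/W CPU-only baseline.

```cuda
// Back-of-the-envelope check of the performance-per-watt figures above,
// using the three GPU configuration "H" as the example.
#include <cstdio>

int main() {
    const double cpu_only_eff = 1.6;  // GFLOPS/W measured for the CPU-only HPL run
    const double acceleration = 7.0;  // "H" HPL speedup over the CPU-only run
    const double power_ratio  = 2.7;  // "H" power draw relative to the CPU-only run

    const double rel_perf_per_watt = acceleration / power_ratio;        // ~2.6X
    const double est_efficiency    = cpu_only_eff * rel_perf_per_watt;  // ~4.1 GFLOPS/W

    printf("Relative perf/W: %.1fX, estimated efficiency: %.1f GFLOPS/W\n",
           rel_perf_per_watt, est_efficiency);
    return 0;
}
```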

NAMD

NAMD is designed for high-performance simulation of large biomolecular systems. The ApoA1 benchmark (92,224 atoms) models a high density lipoprotein found in plasma, which helps extract cholesterol from tissues and carry it to the liver. F1-ATPase (327,506 atoms) is responsible for synthesizing the molecule adenosine triphosphate. STMV (Satellite Tobacco Mosaic Virus) is a small, icosahedral virus which worsens the symptoms of infection by tobacco mosaic virus; it is a large benchmark case with 1,066,628 atoms.

Figure 5: NAMD performance and power consumption with C4130

    

   

Figure 5 quantifies the performance and power consumption of NAMD for all the C4130 configurations compared to the CPU-only server (i.e. a server with two CPUs).

  • The acceleration on NAMD is sensitive to the number of CPUs and the memory available in the system. For example, there is a significant difference in acceleration between “A” and “B” for the quad GPU configurations and between “E” and “F” for the dual GPU configurations, and the difference becomes more apparent as the problem size increases. “B”, similar to “A” but with an additional CPU and more memory, performs 43% better than “A”; likewise “F”, with an additional CPU and more memory compared to “E”, performs 26% better.
  • Among the four quad GPU configurations, NAMD performs best on configurations “C” and “G”. The difference between these two highest performing configurations and the others (“A” and “B”) is the manner in which the GPUs are attached to the CPUs. The balanced configurations “G” (with switch) and “C” (without switch) have two GPUs attached to each CPU, resulting in 7.8X acceleration over the CPU-only case. The same four GPUs attached via a switch module to a single CPU, configuration “B”, result in about 7.7X acceleration.
  • “H”, the three GPU configuration, falls between the four GPU and two GPU configurations with respect to performance, with 7.1X acceleration over the CPU-only configuration. “H”, with an extra CPU and more memory, performs better than the four GPU configuration “A”.
  • “D” and “F”, with two CPUs and two GPUs, perform better with 5.9X acceleration compared to 4.4X for configuration “E” (one CPU and two GPUs).

As shown in the right graph of Figure 5, the power consumption for the quad GPU configurations is ~2.3X that of the CPU-only run, the accelerations range from 4.4X to 7.8X, and the energy efficiency (performance per watt) ranges from 2.0X to 3.4X. Configurations “C” and “G”, along with providing the best performance, also do well from an energy efficiency perspective amongst the quad GPU configurations (an acceleration of 7.8X for 2.3X more power). Configuration “H” with three GPUs is more energy efficient than the quad GPU configurations, with a performance per watt of 3.7X, providing 7.1X acceleration with only 1.9X more power. Configuration “F” is the most energy efficient configuration, consuming only 1.5X more power with a performance per watt of 3.8X.

ANSYS Fluent

ANSYS Fluent is a computational fluid dynamics application used for fluid flow design engineering analysis. The equation solvers used to drive the simulation are computationally intensive, and approximately 3 GB of GPU memory is required for a 1M cell simulation. The benchmarks run are the ANSYS pipes 1.2M and 9.6M cell steady state, non-combustive cases.

Figure 6: ANSYS Fluent performance and power consumption with C4130

    

 

The left graph in Figure 6 shows the performance of ANSYS Fluent compared to 4 CPU cores. The code performs best on configurations with a 1:2 CPU-to-GPU ratio.

  • The quad GPU configurations provide 3.9X-4.4X acceleration compared to tests run on 4 CPU cores. Configurations “C” and “G” provide the best performance amongst the four GPU configurations.
  • The three GPU configuration “H” provides 3.7X acceleration.
  • The dual GPU configuration “E”, with two GPUs connected to a single CPU, provides the best acceleration, 2.8X, amongst all the dual GPU configurations.

The right graph in Figure 6 shows the power consumption data for all the configurations compared to the power consumed when the benchmarks were run on 4 CPU cores. The numbers in yellow at the bottom of the bars indicate the relative performance per watt for each configuration. The quad GPU configurations consume 3.7X-3.9X more power and provide 3%-20% more performance per watt. The three GPU configuration “H” is the most energy efficient: it consumes 2.8X more power but provides the highest performance per watt of all the configurations (32% more than the 4-core runs). The dual GPU configurations consume 2.1X-2.5X more power and their energy efficiency is 7%-28% better.

Fluent scales well on CPU cores, so to understand the benefit of using GPUs we experimented by using the same number of licenses and running the benchmark on CPU cores only vs. running it on CPUs + GPUs.

Figure 7: ANSYS Fluent optimizing licensing costs

  

 

Figure 7 shows the data for the 1.2M and 9.6M Fluent benchmarks run on CPU cores only vs. the quad GPU configurations “A”, “B” and “C”. The benchmark output is the wall clock time, shown on the Y-axis (lower is better); the X-axis shows the number of CPU cores used for the test (that is, the number of Fluent licenses required). As shown in Figure 7, with 24 licenses the GPU approach, that is 16 cores + 8 GPUs, provides 48% better performance than using 24 CPU cores for the 9.6M benchmark and 25% better performance for the 1.2M benchmark. Similarly, Table 3 shows the performance benefit of the GPU approach vs. the CPU approach for 24, 20, 16, 12 and 8 licenses for the 9.6M and 1.2M benchmark cases.

Table 3: Fluent GPU vs CPU approach with same number of licenses

 

Conclusion

The C4130 server with NVIDIA Tesla K80 GPUs demonstrates exceptional performance and power-efficiency gains for compute intensive workloads and applications like NAMD and Fluent. Fluent scaling is very impressive on CPU cores, but depending on your problem and licensing model there is a definitive performance benefit to using GPUs. Applications that do a lot of GPU peer-to-peer communication can gain from the higher bandwidth offered by the C4130 switch configurations.