Authors:  Saeed Iqbal and Kevin Tubbs

The OpenFOAM CFD Toolbox is a free, open source CFD software package produced by OpenCFD Ltd. Its user base spans a wide range of engineering and science disciplines in both commercial and academic organizations. OpenFOAM has an extensive set of features for solving a broad variety of fluid flows and physical phenomena. It provides tools for all three stages of CFD: preprocessing, solving, and post-processing. Almost all of these tools can run in parallel as standard, making OpenFOAM an important resource for scientists and engineers using HPC for CFD.

General-purpose graphics processing unit (GPU) technology is increasingly being used to accelerate compute-intensive HPC applications across various disciplines. OpenFOAM CFD simulations are computationally intensive and can take a significant amount of time, so comparing alternatives that enable faster research and discovery with CFD is of key importance. SpeedIT libraries from Vratis provide GPU-accelerated iterative solvers that replace the iterative solvers in OpenFOAM.

To investigate GPU acceleration of OpenFOAM, we simulate the three-dimensional lid-driven cavity problem, based on the tutorial provided with OpenFOAM. The 3D lid-driven cavity problem is an incompressible flow problem solved using the OpenFOAM icoFoam solver. Most of the computationally intensive work in the solver is in the pressure equation, so in the accelerated case only the pressure calculation is offloaded to the GPUs. On the CPUs, the PCG solver with the DIC preconditioner is used. In the GPU-accelerated case, the SpeedIT 2.1 algebraic multigrid (AMG) preconditioner with smoothed aggregation is used in combination with the SpeedIT Plugin to OpenFOAM.
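To give a feel for the kind of iterative solve that dominates the run time, here is a toy sketch of a preconditioned conjugate gradient (PCG) loop in plain Python. It is not OpenFOAM or SpeedIT code: the function names are hypothetical, the matrix is a small 1D Poisson operator standing in for the pressure equation, and a simple Jacobi (diagonal) preconditioner is swapped in for the DIC and AMG preconditioners used in the actual runs.

```python
def pcg(apply_a, b, inv_diag, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive-definite A using CG
    with a Jacobi (diagonal) preconditioner. Illustrative only."""
    n = len(b)
    x = [0.0] * n
    r = list(b)                                   # residual b - A x for x = 0
    z = [inv_diag[i] * r[i] for i in range(n)]    # preconditioned residual
    p = list(z)                                   # first search direction
    rz = sum(r[i] * z[i] for i in range(n))
    b_norm = sum(v * v for v in b) ** 0.5 or 1.0
    for iteration in range(1, max_iter + 1):
        ap = apply_a(p)
        alpha = rz / sum(p[i] * ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * ap[i] for i in range(n)]
        # Stop once the residual norm is small relative to the right-hand side.
        if sum(v * v for v in r) ** 0.5 <= tol * b_norm:
            return x, iteration
        z = [inv_diag[i] * r[i] for i in range(n)]
        rz_new = sum(r[i] * z[i] for i in range(n))
        p = [z[i] + (rz_new / rz) * p[i] for i in range(n)]
        rz = rz_new
    return x, max_iter


def laplacian_1d(v):
    """Apply the 1D Poisson matrix tridiag(-1, 2, -1) to the vector v."""
    n = len(v)
    return [
        2.0 * v[i]
        - (v[i - 1] if i > 0 else 0.0)
        - (v[i + 1] if i < n - 1 else 0.0)
        for i in range(n)
    ]


n = 50
b = [1.0] * n
# The diagonal of the matrix is 2 everywhere, so its inverse is 0.5.
x, iterations = pcg(laplacian_1d, b, inv_diag=[0.5] * n)
residual = max(abs(bi - ai) for bi, ai in zip(b, laplacian_1d(x)))
print(f"converged in {iterations} iterations, max residual {residual:.2e}")
```

In the benchmark itself the matrix has millions of rows, and nearly all of the time in a loop like this goes into the sparse matrix-vector product and the preconditioner, which is why the pressure solve is the part offloaded to the GPUs.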

Figure 1: OpenFOAM performance of 3D cavity case using 4 million cells on a single node.

Figure 1 shows the performance of OpenFOAM’s 3D lid-driven cavity case using approximately 4 million cells on a single R720 node. Results are presented for CPU only, CPU + 1 M2090 GPU, and CPU + 2 M2090 GPUs. The R720 CPU-only results use the maximum number of cores available in this configuration (16 cores). The software maps one CPU core to one GPU, which limits the number of CPU cores used in the accelerated runs: the R720 + 1 M2090 and R720 + 2 M2090 results use 1 core + 1 GPU and 2 cores + 2 GPUs, respectively. Compared to the CPU-only configuration, one GPU delivers no acceleration, while two GPUs deliver an acceleration of 1.5X. Figure 2 shows the power consumption results for the 4 million cell simulation. In all cases, the power consumption is measured. As shown, the power efficiency, i.e. the useful work delivered for every watt of power consumed, improves when GPUs are added. Power efficiency is defined as performance (simulations/day) divided by measured power consumption (Watts). Compared to the CPU-only configuration, the power efficiency is approximately 1.3X with one M2090 and almost 1.4X with two M2090 GPUs.
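The power efficiency metric can be sketched in a few lines of Python. The function names and the input values below are hypothetical and chosen only to make the units clear; in particular, deriving simulations/day from the wall-clock time of one run is an assumption about how the performance figure is obtained.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def simulations_per_day(wall_clock_seconds):
    """Performance: how many runs of this length fit into one day."""
    return SECONDS_PER_DAY / wall_clock_seconds


def power_efficiency(sims_per_day, average_watts):
    """Power efficiency in (simulations/day)/Watt."""
    return sims_per_day / average_watts


# Hypothetical inputs, not measurements from this study.
runtime_s = 1800.0     # wall-clock time of one simulation, in seconds
avg_power_w = 300.0    # average measured node power during the run, in Watts

perf = simulations_per_day(runtime_s)
eff = power_efficiency(perf, avg_power_w)
print(f"{perf:.1f} simulations/day, {eff:.3f} (simulations/day)/Watt")
```

Comparing two configurations then comes down to dividing their efficiencies, which is how ratios such as 1.3X are read off the charts.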

Figure 2: Total Power and Power Efficiency of 3D cavity case on 4 million cells on a single node.

Figure 3 shows the performance of OpenFOAM’s 3D lid-driven cavity case using approximately 8 million cells on a single R720 node. The size of the problem required the use of both GPUs. Compared to the CPU-only configuration, an acceleration of 1.5X was achieved with two GPUs. Figure 4 shows the power consumption results for the 8 million cell simulation. As shown, the power efficiency also improves for the larger simulation: with two M2090 GPUs, it is approximately 1.3X that of the CPU-only configuration.

Figure 3: OpenFOAM performance of 3D cavity case using 8 million cells on a single node.

Figure 4: Total Power and Power Efficiency of 3D cavity case on 8 million cells on a single node.

 

In conclusion, first, using GPUs can accelerate the OpenFOAM icoFoam solver for incompressible fluid flow. As shown in Figure 1, using CPUs only, a single node delivers about 24 simulations/day of sustained performance for a problem size of 4 million cells. Adding 1 GPU delivers about the same sustained performance but increases the performance/Watt ratio, while adding 2 GPUs improves the sustained performance to about 36 simulations/day.

Second, using GPUs improves the performance/Watt ratio as well. Where power consumption increases with GPUs, it increases by less than the corresponding performance improvement. As shown in Figure 2, the CPU-only simulation consumes about 400 Watts and operates at 0.061 (simulations/day)/Watt. Adding 1 GPU but using only one core of the CPU, the power consumption decreases to about 300 Watts and the node operates at 0.078 (simulations/day)/Watt, an increase of about 28% in performance/Watt. Adding 2 GPUs and using only two cores of the CPU, the power consumption increases to about 445 Watts and the node operates at 0.083 (simulations/day)/Watt, an increase of about 36% in performance/Watt.

Similar trends are shown in Figures 3 and 4 for the problem size of 8 million cells. On the larger problem size, the performance increased from about 15 simulations/day for CPU only to about 24 simulations/day for 2 GPUs. The power consumption increased from about 391 Watts, operating at 0.039 (simulations/day)/Watt, for CPU only to about 462 Watts, operating at 0.051 (simulations/day)/Watt, for 2 GPUs. This represents an increase of about 32% in performance/Watt.
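The relative gains quoted above can be recomputed from the approximate figures in the text. The short Python sketch below does only that arithmetic; the helper name `percent_gain` is hypothetical, and the inputs are the rounded values from the conclusions rather than raw measurements.

```python
def percent_gain(baseline, accelerated):
    """Relative improvement of accelerated over baseline, in percent."""
    return (accelerated / baseline - 1.0) * 100.0


# Approximate values quoted in the conclusions.
performance = {  # simulations/day: (CPU only, 2 GPUs)
    "4M cells": (24.0, 36.0),
    "8M cells": (15.0, 24.0),
}
efficiency = {  # (simulations/day)/Watt: (CPU only, accelerated)
    "4M cells, 1 GPU": (0.061, 0.078),
    # 0.083 is the value consistent with the stated gain of about 36%.
    "4M cells, 2 GPUs": (0.061, 0.083),
    "8M cells, 2 GPUs": (0.039, 0.051),
}

for case, (cpu, gpu) in performance.items():
    print(f"{case}: speedup {gpu / cpu:.2f}x with 2 GPUs")
for case, (cpu, gpu) in efficiency.items():
    print(f"{case}: performance/Watt gain {percent_gain(cpu, gpu):.0f}%")
```

Because the inputs are rounded, the recomputed speedups and gains can differ slightly from the percentages quoted in the text.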

Configuration and Installation

Each PowerEdge R720 has two Intel Xeon E5-2600 series processors. Please note that installing two NVIDIA Tesla M2090 GPUs requires a GPU enablement kit, the x16 option on the 3rd riser, and dual, redundant 1100W power supplies, as shown in Figure 5. The details of the hardware and software components are given below:

Figure 5: Two M2090 GPUs can be attached inside the R720 using a riser and associated power cables.

Compute Node
  Model: PowerEdge R720
  Processors: Two Intel Xeon E5-2660 @ 2.2 GHz, 95 W
  Memory: 64 GB, 1333 MHz
GPUs
  Model: NVIDIA Tesla M2090
  Number of GPUs: 2
M2090 GPU
  Number of cores: 512
  Memory: 6 GB
  Memory bandwidth: 177 GB/s
  Peak performance, single precision: 1,331 GFLOPS
  Peak performance, double precision: 665 GFLOPS
  Power capping: 250 W
Software
  OpenFOAM: Version 1.7.1
  SpeedIT from Vratis: Version 2.1
  CUDA: 4.0 (285.05.23)
  OS: RHEL 6.2