by Shawn Gao and Saeed Iqbal

General-purpose Graphics Processing Units (GPUs) have proven suitable for accelerating compute-intensive applications such as Computational Fluid Dynamics (CFD), Molecular Dynamics (MD), Quantum Chemistry (QC), Computational Finance (CF), and Oil & Gas applications. Among these areas, MD and N-body simulation have benefited tremendously from GPU acceleration and, equally important, sophisticated GPU-enabled simulators are freely available. NAMD and NBODY are two such GPU-accelerated simulators. More detailed information about NAMD, NBODY, and GPUs can be found in the references [1, 2].

In this blog we evaluate how much GPU-accelerated compute nodes improve NAMD and NBODY performance. For NAMD we chose the STMV (satellite tobacco mosaic virus) benchmark, which consists of 1066K atoms, because of its relatively large problem size. The performance measure is “days/ns”, the number of days of wall-clock time required to simulate 1 nanosecond of simulated time, so lower values are better.
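Because “days/ns” is a cost rather than a rate, a relative performance figure is the baseline value divided by the accelerated value. The following is a minimal Python sketch of that arithmetic. The function names and the sample timings are hypothetical and chosen only for illustration; they are not measurements from this study.

```python
def days_per_ns(wall_seconds, simulated_ns):
    """Convert a measured run into days of wall-clock time per simulated nanosecond."""
    if simulated_ns <= 0:
        raise ValueError("simulated_ns must be positive")
    return (wall_seconds / 86400.0) / simulated_ns


def relative_performance(baseline_days_per_ns, accelerated_days_per_ns):
    """Speedup of the accelerated run over the baseline (lower days/ns is faster)."""
    return baseline_days_per_ns / accelerated_days_per_ns


# Hypothetical timings, used only to illustrate the arithmetic.
cpu_only = days_per_ns(wall_seconds=8640.0, simulated_ns=0.05)
with_gpu = days_per_ns(wall_seconds=4320.0, simulated_ns=0.05)
print(f"speedup: {relative_performance(cpu_only, with_gpu):.1f}X")
```

Halving the wall-clock time for the same simulated interval halves days/ns and therefore doubles the relative performance.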

Figure 1: Relative performance of NAMD benchmarks on single-node R720.

Figure 1 shows the relative performance of the NAMD benchmarks on the single-node R720. STMV is accelerated about 3.4X with Tesla K40 GPUs and about 2.9X with Tesla K20 GPUs. Figure 2 shows the additional power required by the GPUs: total power consumption rises to about 1.5X. From a power-efficiency point of view, running STMV with dual internal GPUs is beneficial, since the K40 configuration delivers a 3.4X performance gain for about 1.5X the power. We also observe that problem size is a key factor in determining how much a particular simulation is accelerated. For a more detailed analysis, please refer to our previous studies [3, 4].
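The power-efficiency argument can be made explicit by dividing the performance gain by the power ratio. The sketch below assumes that "1.5X" is the ratio of total node power with GPUs to total node power without them; the function name is hypothetical.

```python
def performance_per_watt_gain(speedup, power_ratio):
    """Relative energy efficiency: performance gain divided by power increase."""
    if power_ratio <= 0:
        raise ValueError("power_ratio must be positive")
    return speedup / power_ratio


# Figures quoted in the text for STMV with two Tesla K40 GPUs.
k40_gain = performance_per_watt_gain(speedup=3.4, power_ratio=1.5)

# Energy for the same simulation, relative to the CPU-only run (assumption:
# power stays roughly constant over the run, so energy = power * time).
k40_energy_ratio = 1.5 / 3.4

print(f"K40 performance per watt: {k40_gain:.2f}X the CPU-only run")
print(f"K40 energy to solution:   {k40_energy_ratio:.2f}X the CPU-only run")
```

Under these assumptions the K40 configuration delivers roughly 2.3X the performance per watt and completes the same simulation with a bit under half the energy.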

Figure 2: Relative Power Consumption of NAMD benchmarks on single-node R720.

The NBODY simulation comes from the CUDA Samples package. To minimize the effect of software overhead, we ran the benchmark multiple times and set the number of bodies to 1 million. Figure 3 shows that the average performance improvement from K20 to K40 is about 20% in both cases (1 GPU per node or 2 GPUs per node). In addition, the dual-GPU configuration can significantly increase total performance, by up to 4X.
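One way to see why a large body count reduces the influence of software overhead: assuming an all-pairs method (which is how the CUDA nbody sample is generally described), the work per time step grows with the square of the number of bodies, while fixed per-step costs stay constant. The Python sketch below illustrates this; the per-interaction cost and the fixed overhead are hypothetical values, not measurements.

```python
def pairwise_interactions(num_bodies):
    """Interactions evaluated per time step by an all-pairs N-body method."""
    return num_bodies * num_bodies


def overhead_fraction(num_bodies, seconds_per_interaction, fixed_overhead_seconds):
    """Share of a time step spent on fixed overhead rather than on interactions."""
    compute = pairwise_interactions(num_bodies) * seconds_per_interaction
    return fixed_overhead_seconds / (fixed_overhead_seconds + compute)


# Hypothetical costs, chosen only to show the trend.
COST_PER_INTERACTION = 1e-11   # seconds (assumed)
FIXED_OVERHEAD = 1e-4          # seconds per step (assumed)

for n in (1024, 1_000_000):
    frac = overhead_fraction(n, COST_PER_INTERACTION, FIXED_OVERHEAD)
    print(f"{n:>9} bodies: {pairwise_interactions(n):.2e} interactions, "
          f"overhead share {frac:.4%}")
```

With these assumed costs, overhead dominates a step at about a thousand bodies and becomes negligible at 1 million bodies, where each step evaluates 10^12 interactions.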

Figure 3: Relative Performance of NBODY benchmarks (double precision) on single-node R720.

Cluster Configuration

The PowerEdge R720 has dual Intel Xeon E5-2600 series processors. The hardware and software components are detailed below:

| Compute Node | Specification |
|---|---|
| Model | PowerEdge R720 |
| Number of compute nodes | 1 |
| Compute node processor | Two Intel @ 2.7 GHz, 95W (Xeon E5-2697 v2) |
| Memory | 128 GB 1600 MHz |
| GPUs | NVIDIA Tesla K20/K40 |
| Number of GPUs | 2 |

| GPUs | K20 | K40 |
|---|---|---|
| Number of cores | 2496 | 2880 |
| Memory | 5 GB | ~11 GB |
| Memory bandwidth | 208 GB/s | 288 GB/s |
| Peak performance, single precision (SP) | 3.5 TFLOPS | ~4.3 TFLOPS |
| Peak performance, double precision (DP) | 1.1 TFLOPS | ~1.4 TFLOPS |
| PCIe generation | Gen 2 | Gen 3 |
| Power capping | 235W | 235W |
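The K20 and K40 specifications listed above can be turned into K40-to-K20 ratios for a rough comparison with the measured gains. The sketch below only does that division. The K40 peak figures are the approximate values from the table, and the dictionary keys are names chosen here for illustration.

```python
# Specifications as listed in the GPU table (K40 peak values are approximate).
k20 = {"cores": 2496, "memory_bandwidth_gb_s": 208,
       "peak_sp_tflops": 3.5, "peak_dp_tflops": 1.1}
k40 = {"cores": 2880, "memory_bandwidth_gb_s": 288,
       "peak_sp_tflops": 4.3, "peak_dp_tflops": 1.4}


def spec_ratios(newer, older):
    """Ratio of each specification of the newer GPU to the older one."""
    return {name: newer[name] / older[name] for name in older}


ratios = spec_ratios(k40, k20)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}X")
```

The ratios fall between roughly 1.15X (core count) and 1.38X (memory bandwidth), which is the same range as the K20-to-K40 improvements reported for NAMD and NBODY. This is a consistency check on the numbers, not an explanation of where the gains come from.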

 

 

 

| Software | NAMD | NBODY |
|---|---|---|
| Version | 2.9 | SDK |
| CUDA | 4.2 | 5.5 |
| OS | RHEL 6.4 | RHEL 6.4 |

References

  1. NAMD: www.ks.uiuc.edu/Research/namd/
  2. GPGPU: http://www.nvidia.com/TeslaApps
  3. NAMD Performance on PE C6100 and C410X: http://en.community.dell.com/techcenter/high-performance-computing/w/wiki/namd-performance-on-pe-c6100-and-c410x.aspx
  4. Faster Molecular Dynamics with GPUs: http://en.community.dell.com/techcenter/high-performance-computing/b/hpc_gpu_computing/archive/2012/08/07/faster-molecular-dynamics-with-gpus.aspx