Executive Summary

  • Given a single fully populated PowerEdge C410x with 16 M2070 GPUs, the recommended solution for NAMD is the 8 node configuration shown below. On STMV, a standard large NAMD benchmark, it delivers 3.5X the performance of an equivalent CPU-only cluster.
  • It is recommended to use X5650 processors on the compute nodes for NAMD.

Figure 1: The Dell GPGPU solution based on a PowerEdge C410x and two PowerEdge C6100s
Introduction
General Purpose Graphics Processing Units (GPUs) are very well suited to accelerating molecular dynamics (MD) simulations. GPUs can give a quantum leap in performance for commonly used MD codes, making it possible for researchers to use more efficient and dense high performance computing architectures. NAMD is a well-known and widely used MD simulator. It is a parallel molecular dynamics code designed for high-performance simulation of large bio-molecular systems, developed through a joint collaboration of the Theoretical and Computational Biophysics Group (TCB) and the Parallel Programming Laboratory (PPL) at the University of Illinois at Urbana-Champaign. NAMD is distributed free of charge with source code. NAMD has four benchmarks of varying problem size; the table below lists each benchmark and its problem size:
NAMD Benchmark    Problem Size (Number of Atoms)
ER-GRE            36K
APOA1             92K
F1ATPASE          327K
STMV              1066K
The performance of these benchmarks is measured in “days/ns”. On a given compute system, “days/ns” is the number of compute days required to simulate 1 nanosecond of simulation time, so the lower the days/ns on a given architecture, the better (a short conversion sketch follows the parameter table below). The Dell HPC Engineering team has configured and evaluated GPU based solutions to help customers select solutions according to their specific needs. As shown in Figure 1, the current offering combines one or two PowerEdge C6100 servers as host servers to the PowerEdge C410x, resulting in a 4 node or 8 node compute cluster. The GPU solution uses 16 NVIDIA™ Tesla M2070 GPUs and the CUDA 4.0 software stack. The NAMD code is run without any optimization; however, to get good scaling during parallel runs, the following parameter values are changed (in accordance with the guidelines for running the benchmarks on parallel machines):
NAMD Parameter     Value
Time Steps         500
Output Energies    100
Output Timing      100
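As a minimal sketch of the “days/ns” metric described above, the snippet below converts NAMD's reported wall-clock seconds per step into days/ns. It assumes a 1 fs integration timestep, which is an illustrative value rather than one taken from this study; adjust it to match the benchmark input actually used.

```python
# Minimal sketch: convert NAMD's reported wall-clock seconds/step into the
# days/ns metric used throughout this paper. The 1 fs integration timestep
# is an assumed, illustrative value; change timestep_fs if your benchmark
# configuration differs.

SECONDS_PER_DAY = 86400.0
FS_PER_NS = 1.0e6

def days_per_ns(seconds_per_step: float, timestep_fs: float = 1.0) -> float:
    """Days of wall-clock time needed to simulate 1 ns of simulation time."""
    steps_per_ns = FS_PER_NS / timestep_fs   # e.g. 1,000,000 steps at 1 fs
    return seconds_per_step * steps_per_ns / SECONDS_PER_DAY

# Example: 0.05 s/step works out to about 0.58 days/ns (lower is better)
if __name__ == "__main__":
    print(f"{days_per_ns(0.05):.2f} days/ns")
```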
Hardware and Software Configuration
Figure 2 shows the hardware configuration used. Each compute node (PowerEdge C6100) is connected to the PowerEdge C410x using an iPASS cable (red) and to the InfiniBand switch (blue) for inter-node communication. The details of the hardware and software components used for the 4 and 8 node NAMD configurations are given below:
Figure 2: The PCIe Gen2 x16 iPASS cables and InfiniBand connection diagram. Eight compute nodes are connected to the C410x using iPASS cables
PowerEdge C410x
  GPGPU model                           NVIDIA Tesla M2070
  Number of GPGPUs                      16
  iPASS cables                          8
  Mapping                               2:1, 4:1
1x (2x) PowerEdge C6100 (4 or 8 compute nodes)
  Processors per node                   Two Intel Xeon X5650 @ 2.66 GHz
  Memory                                48 GB, 1333 MHz
  OS                                    RHEL 5.5 (kernel 2.6.18-194.el5)
  CUDA                                  4.0
Tesla M2070 GPGPU
  Number of cores                       448
  Memory                                6 GB
  Memory bandwidth                      150 GB/s
  Peak performance (single precision)   1030 GFLOPS
  Peak performance (double precision)   515 GFLOPS
Benchmark
  NAMD v2.8b1                           www.ks.uiuc.edu/Research/namd
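As a rough cross-check of the M2070 peak figures listed above, the sketch below derives them from the core count. The 1.15 GHz shader clock is an assumption (it is not given in the table), as are the Fermi-generation rates of 2 single-precision FLOPs per core per cycle with double precision at half that rate on Tesla parts.

```python
# Hypothetical back-of-the-envelope check of the M2070 peak numbers above.
# The 1.15 GHz shader clock is an assumption, not a value from this paper.

cuda_cores = 448          # from the configuration table
shader_clock_ghz = 1.15   # assumed M2070 shader clock

peak_sp_gflops = cuda_cores * shader_clock_ghz * 2   # fused multiply-add = 2 FLOPs/cycle -> ~1030 GFLOPS
peak_dp_gflops = peak_sp_gflops / 2                  # double precision at half rate -> ~515 GFLOPS

print(f"Peak SP: {peak_sp_gflops:.0f} GFLOPS, Peak DP: {peak_dp_gflops:.0f} GFLOPS")
```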
Sensitivity to Problem Size and Host Processor
Figure 3 shows the performance of the four NAMD benchmarks on an 8 node cluster. The comparison is between a CPU-only cluster and the same cluster with 2 GPUs per node; the host processor is also varied to gauge its impact on overall performance. As shown in Figure 3, for the small problems the performance is better on the CPU-only cluster, while for the two larger problems performance is better on the GPU-attached cluster. There appears to be a threshold between 100K and 300K atoms where the advantage shifts from CPUs only to a cluster with GPUs. For larger problems there is clearly an advantage in using GPUs; the largest benchmark, STMV, shows a 3.5X speedup compared to the CPU-only cluster (with the X5670 processors).
Figure 3: Performance of NAMD benchmarks on the 8 node cluster. The performance is expressed in “days/ns” (lower is better)
Using the faster X5670 2.93 GHz processor improves performance in all cases. However, the impact of the faster processor is more pronounced on the CPU-only cluster. On the GPU-attached cluster, the most compute intensive work is offloaded to the GPUs, so the X5670 delivers very similar performance to the X5650. Considering the two largest problems, the faster processors give on average 6.7% more performance at the cost of 7.0% more power and a higher price. Considering the largest problem only, the faster processors gain 0.08% performance at the cost of 9.6% more power and a higher price. Based on these results, we recommend the X5650 for the compute nodes, because for larger problem sizes (1 million atoms or more) the difference between the slower and faster processors is minimal. From here onwards, this study focuses only on the two larger NAMD benchmarks.
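The processor trade-off can be summarized as performance per watt. The sketch below normalizes both processors to the X5650 and uses only the averaged percentages quoted above for the two largest benchmarks; absolute performance and power numbers are not taken from this study.

```python
# Minimal sketch: relative performance-per-watt of the two host processors
# on the GPU-attached cluster, using the averaged figures quoted above.
# The X5650 is normalized to 1.0.

x5650_perf, x5650_power = 1.0, 1.0        # normalized baseline
x5670_perf, x5670_power = 1.067, 1.070    # +6.7% performance, +7.0% power

print(f"X5650: {x5650_perf / x5650_power:.3f} perf/W (normalized)")
print(f"X5670: {x5670_perf / x5670_power:.3f} perf/W (normalized)")  # ~0.997, slightly worse
```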


Selecting the Cluster Size
Figure 4 compares the performance of the two large NAMD benchmarks on 4 node and 8 node clusters, while keeping the number of GPUs fixed at 16. As shown in Figure 4, F1ATPASE, at about 327K atoms, does only slightly better on 8 nodes, but STMV, at about 1066K atoms, runs about 35% faster on 8 nodes than on 4 nodes.
Figure 4: Comparing performance of two NAMD benchmarks on 4 node and 8 node clusters. The performance is expressed in “days/ns” (lower is better)
Figure 5: Comparing power consumption of two NAMD benchmarks on 4 node and 8 node clusters
Figure 5 compares the power consumption of the two benchmarks on the 4 node and 8 node clusters. For F1ATPASE, the 8 node cluster consumes about 26% more power for about a 2% gain in performance. For STMV, it consumes about 32% more power for about 35% more performance. The choice between a 4 node and an 8 node cluster therefore depends on the problem size in number of atoms: for problem sizes of about 325K atoms and below, the 4 node cluster may provide the best value, as it is less expensive and consumes less power while performing similarly to the 8 node cluster. For problem sizes of 1000K atoms or larger, the 8 node cluster may provide the best value.
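One way to frame this choice is energy to solution, i.e. power multiplied by the time needed to simulate 1 ns. The sketch below uses only the relative percentages quoted above, normalized to the 4 node cluster; no absolute power or timing numbers from this study are used.

```python
# Minimal sketch: relative energy to simulate 1 ns (power x time) on the
# 8 node cluster versus the 4 node cluster, using the relative figures
# quoted above. Values are normalized to the 4 node cluster.

def energy_ratio(power_gain: float, perf_gain: float) -> float:
    """Energy-to-solution ratio (8 node / 4 node): more power, shorter runtime."""
    return (1.0 + power_gain) / (1.0 + perf_gain)

print(f"F1ATPASE: {energy_ratio(0.26, 0.02):.2f}x energy per ns on 8 nodes")  # ~1.24x
print(f"STMV:     {energy_ratio(0.32, 0.35):.2f}x energy per ns on 8 nodes")  # ~0.98x
```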