Dell Community
High Performance Computing Blogs

## High Performance Computing

A discussion venue for all things high performance computing (HPC), supercomputing, and the technologies that enable scientific research and discovery.

### HPC GPU Computing

• #### GPUs…, Nodes, Sockets, Cores and FLOPS, Oh, My!

In a previous post, I described how to compute the peak theoretical floating point performance of a potential system.

In that post, I alluded to GPUs coming into the mix: “When might you need MHz these days, you ask? Think GPU speeds.” Well, that time has come!  The nVidia GTC conference is soon (www.gputechconf.com) and systems are now regularly shipping with GPUs such as the nVidia K20 and K20x which operate at MHz frequencies.

There are several references available that indicate that the new nVidia K20 contains 2,496 cores.  And the operating frequency is also available.  Do not attempt to use these 2 pieces of data to compute a peak theoretical floating point performance number as described in the previous blog.

The K20 does indeed contain 2,496 cores, but not all are available for double precision floating point math.  These cores are arranged into what are called Streaming Multiprocessor (SM) units.  SM units in a GPGPU on an nVidia card are analogous to CPUs in sockets on a motherboard.  Each SM does indeed contain 192 cores, all of which are available for single precision floating point math.  But unlike most CPUs, not all GPU cores are available for double precision floating point math.  On the nVidia K20 SM, 64 cores can perform double precision floating point math at a rate of 2 flops/clock.

There are 13 SM units in the K20, operating at a frequency of 706 MHz.  Here is the use of MHz referenced in the previous blog: 706 MHz is 0.706 GHz.  Note that 13 SMs * 192 cores per SM is the quoted 2,496 cores total.  Also note in the math below that the 64 double precision core count is used, not the quoted 192 (single precision) core count.

Here’s the peak theoretical floating point math for a K20:

GFLOPS  =  13 SM/K20 * 64 cores/SM  *  0.706 GHz/core  * 2 GFLOPs/GHz

GFLOPS  =  1,174.784

I have seen this appear as 1.17 TFLOPS or 1,175 GFLOPS.

Additionally, the nVidia K20x contains an additional SM unit for a total of 14 SM units and it operates at a slightly higher frequency of 732 MHz or 0.732 GHz.

Here’s the peak theoretical floating point math for a K20x:

GFLOPS  =  14 SM/K20x * 64 cores/SM  *  0.732 GHz/core  * 2 GFLOPs/GHz

GFLOPS  =  1,311.744

I have seen this appear as 1.31 TFLOPS or 1,312 GFLOPS.

Hope that helps.  Compute the CPU performance as described in the previous blog.  Compute the GPU performance as described here.  The total system performance is the sum of these.
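The arithmetic above is easy to wrap in a small helper. This is just a sketch of the formula from the post (the function name is my own):

```python
def peak_gflops(sms, dp_cores_per_sm, clock_ghz, flops_per_clock=2):
    """Peak theoretical double precision GFLOPS:
    SMs * DP cores per SM * clock (GHz) * FLOPs per clock."""
    return sms * dp_cores_per_sm * clock_ghz * flops_per_clock

k20 = peak_gflops(13, 64, 0.706)    # 1174.784 GFLOPS
k20x = peak_gflops(14, 64, 0.732)   # 1311.744 GFLOPS

# Total system peak = CPU peak (previous blog) + GPU peak (this one)
```

The same function works for the CPU side too, if you substitute the core count and FLOPs/clock of your processor.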

Remember that this is the peak theoretical floating point performance.  Since it is theoretical, it is the performance you are guaranteed to never see!  But we also already have a few blogs posted about real-world performance using GPUs:

Comparing GPU-Direct Enabled Communication Patterns for Oil and Gas Simulations: http://dell.to/JsWqWT

ANSYS Mechanical Simulations with the M2090 GPU on the Dell R720:  http://dell.to/JT79KF

Accelerating High Performance Linpack (HPL) with GPUs:  http://en.community.dell.com/techcenter/high-performance-computing/b/hpc_gpu_computing/archive/2012/08/07/accelerating-high-performance-linpack-hpl-with-gpus.aspx

If you have comments or can contribute additional information, please feel free to do so.  Thanks.  --Mark R. Fernandez, Ph.D.

@MarkFatDell

#Iwork4Dell

• #### Accelerating CFD using OpenFOAM with GPUs

Authors:  Saeed Iqbal and Kevin Tubbs

The OpenFOAM CFD Toolbox is a free, open source CFD software package produced by OpenCFD Ltd. Its user base represents a wide range of engineering and science disciplines in both commercial and academic organizations. OpenFOAM has an extensive range of features to solve a wide range of fluid flows and physical phenomena. OpenFOAM provides tools for all three stages of CFD: preprocessing, solving, and postprocessing. Almost all can run in parallel as standard, making it an important resource for a wide range of scientists and engineers using HPC for CFD.

General purpose Graphics Processing Unit (GPU) technology is increasingly being used to accelerate compute-intensive HPC applications across various disciplines in the HPC community.  OpenFOAM CFD simulations are computationally intensive and can take a significant amount of time. Comparing various alternatives for enabling faster research and discovery using CFD is of key importance. SpeedIT libraries from Vratis provide GPU-accelerated iterative solvers that replace the iterative solvers in OpenFOAM.

In order to investigate the GPU-acceleration of OpenFOAM, we simulate the three dimensional lid driven cavity problem based on the tutorial provided with OpenFOAM. The 3D lid driven cavity problem is an incompressible flow problem solved using the OpenFOAM icoFoam solver. The most computationally intensive portion of the solver is the pressure equation, and in the accelerated case only the pressure calculation is offloaded to the GPUs. On the CPUs, the PCG solver with DIC preconditioner is used.  In the GPU-accelerated case, the SpeedIT 2.1 algebraic multigrid preconditioner with smoothed aggregation (AMG), in combination with the SpeedIT Plugin to OpenFOAM, is used.

Figure 1: OpenFOAM performance of 3D cavity case using 4 million cells on a single node.

Figure 1 shows the performance of OpenFOAM's 3D lid driven cavity case using approximately 4 million cells on a single R720 node. The results are presented for CPU only, CPU + 1 M2090 GPU, and CPU + 2 M2090 GPUs. The R720 CPU-only results reflect the maximum number of cores available on this configuration (16 cores). The software limits the number of CPU cores used for GPU acceleration, mapping one CPU core to one GPU. The R720 + 1 M2090 and R720 + 2 M2090 results reflect the use of 1 core + 1 GPU and 2 cores + 2 GPUs respectively. Compared to a CPU-only configuration, no acceleration is obtained with one GPU, while an acceleration of 1.5X is obtained with two GPUs.  Figure 2 shows the power consumption results for the 4 million cell simulation; the power consumption is measured in all cases. As shown, the power efficiency, i.e. the useful work delivered for every watt of power consumed, improves by adding GPUs.  The power efficiency is defined as the performance (simulations/day) per measured power consumption (Watt). With one M2090 the power efficiency is approximately 1.3X, and with two M2090 GPUs it is almost 1.3X, compared to the CPU-only configuration.

Figure 2: Total Power and Power Efficiency of 3D cavity case on 4 million cells on a single node.
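The power efficiency metric defined above (simulations/day per measured Watt) is simple to compute; a minimal sketch, with illustrative rather than measured numbers in the usage line:

```python
def power_efficiency(sims_per_day, watts):
    """Useful work per watt: performance (simulations/day) / measured power (W)."""
    return sims_per_day / watts

def relative_efficiency(perf, watts, base_perf, base_watts):
    """Efficiency of a GPU-accelerated run relative to the CPU-only baseline."""
    return power_efficiency(perf, watts) / power_efficiency(base_perf, base_watts)

# illustrative: 1.5X the performance for 1.2X the power -> 1.25X efficiency
gain = relative_efficiency(1.5, 1.2, 1.0, 1.0)
```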

Figure 3 shows the performance of OpenFOAM's 3D lid driven cavity case using approximately 8 million cells on a single R720 node. The size of the problem required the use of both GPUs. Compared to a CPU-only configuration, an acceleration of 1.5X was achieved with two GPUs.  Figure 4 shows the power consumption results for the 8 million cell simulation.  As shown, the power efficiency also improves for the larger simulation.  With two M2090 GPUs the power efficiency is almost 1.3X compared to the CPU-only configuration.

Figure 3: OpenFOAM performance 3D cavity case using 8 million cells on a single node.

Figure 4: Total Power and Power Efficiency of 3D cavity case on 8 million cells on a single node.

## Configuration and Installation

Each PowerEdge R720 has dual Intel Xeon E5-2600 series processors. Please note that installing two NVIDIA Tesla M2090 GPUs requires the use of a GPU enablement kit, the x16 option on the 3rd riser, and dual, redundant 1100W power supplies, shown in Figure 5. The details of the hardware and software components are given below:

Figure 5: Two M2090 GPUs can be attached inside the R720 using a riser and associated power cables.

| Component | Details |
| --- | --- |
| Compute node model | PowerEdge R720 |
| Processors | Two Intel Xeon E5-2660 @ 2.2 GHz, 95 W |
| Memory | 64 GB, 1333 MHz |
| GPU model | NVIDIA Tesla M2090 |
| Number of GPUs | 2 |
| GPU cores (per M2090) | 512 |
| GPU memory | 6 GB |
| GPU memory bandwidth | 177 GB/s |
| Peak performance, single precision | 1,331 GFLOPS |
| Peak performance, double precision | 665 GFLOPS |
| Power capping | 250 W |
| OpenFOAM | Version 1.7.1 |
| SpeedIT (Vratis) | Version 2.1 |
| CUDA | 4.0 (285.05.23) |
| OS | RHEL 6.2 |

#### Authors:  Saeed Iqbal and Shawn Gao

General purpose Graphics Processing Units (GPUs) have proven their acceleration capacity across several HPC application classes; in general, they are very suitable for accelerating compute-intensive applications, e.g., Computational Fluid Dynamics (CFD), Molecular Dynamics (MD), Quantum Chemistry (QC), Computational Finance (CF), and Oil & Gas applications. Among these areas, Molecular Dynamics (MD) has benefitted tremendously from GPU acceleration.  This is in part due to the nature of its core algorithms being suitable for the hybrid CPU-GPU computing model and, equally important, to the free availability of sophisticated GPU-enabled molecular dynamics simulators. NAMD is such a GPU-enabled simulator. For more detailed information about NAMD and GPUs, please visit http://www.ks.uiuc.edu/Research/namd/ and http://www.nvidia.com/TeslaApps.

In this blog we evaluate the improved NAMD performance due to GPU-accelerated compute nodes. Two proteins, F1ATPASE and STMV, which consist of 327K and 1066K atoms respectively, are chosen due to their relatively large problem sizes. The performance measure is “days/ns”, which shows the number of days required to simulate 1 nanosecond of real time.

Figure 1: Relative performance of two NAMD benchmarks on the 8 node R720 cluster. F1ATPASE is accelerated about 1.1X and STMV about 2.8X.

Figure 1 illustrates the relative performance of the two NAMD benchmarks on the 8 node R720 cluster, keeping the number of GPUs fixed at 16.  In both cases the benchmarks run faster due to GPUs; however, the acceleration is very sensitive to problem size.  In the case of F1ATPASE we see a modest 1.1X acceleration, and for STMV we observe 2.8X acceleration.  As expected, the acceleration improves with problem size.  There seems to be a minimum threshold of around 300K atoms to make GPUs feasible, as shown with the F1ATPASE model.  Figure 2 shows the additional power required for GPUs; there is a 1.6X increase in total power consumption. From the power efficiency point of view, running STMV with dual internal GPUs is beneficial, as the performance gain is 2.8X for an additional 1.6X power.

In summary, GPUs can accelerate NAMD simulations.  Problem size is a key factor in determining how much a particular simulation gets accelerated.  In a previous study (http://en.community.dell.com/techcenter/high-performance-computing/w/wiki/namd-performance-on-pe-c6100-and-c410x.aspx) we found a similar sensitivity to problem/simulation size (number of atoms).  STMV, the largest simulation we have at about 1 million atoms, accelerates much better than smaller simulations; it is expected that even larger simulations can be accelerated even more.

Figure 2: Relative Power Consumption of two NAMD benchmarks on the 8 node R720 cluster.  In both cases there is about 1.6X increase in power consumption due to GPUs.
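Because days/ns is an inverse measure (lower is faster), the speedup is the baseline value divided by the accelerated value. A small sketch using the STMV figures quoted above (function names are my own):

```python
def speedup(cpu_days_per_ns, gpu_days_per_ns):
    """Acceleration from the days/ns metric (lower days/ns is faster)."""
    return cpu_days_per_ns / gpu_days_per_ns

def efficiency_gain(accel, power_increase):
    """Net performance-per-watt gain: performance ratio over power ratio."""
    return accel / power_increase

# STMV from the post: 2.8X faster for 1.6X the power -> ~1.75X better perf/Watt
stmv_gain = efficiency_gain(2.8, 1.6)
```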

## Cluster Configuration

The cluster consists of one master node and eight PowerEdge R720 compute nodes, as shown in Figure 3. The compute nodes can be configured with one or two internal Tesla M2090 GPUs; each node in our cluster has two M2090 GPUs for acceleration. The details of the hardware, software and NAMD parameter setup are given below:

Figure 3: The 8 node PowerEdge R720 cluster. Each compute node has 2 internal GPUs, for a total of 16 Tesla M2090 GPUs.

• #### Accelerating High Performance Linpack (HPL) with GPUs

Authors:  Saeed Iqbal and Shawn Gao

High Performance Linpack (HPL) is a commonly used reference benchmark for HPC systems. HPL stresses the compute and memory subsystems of the test systems and provides insights into the performance of those systems. Nowadays, general purpose Graphics Processing Units (GPUs) are widely used to accelerate such compute-intensive HPC applications across various disciplines in the HPC community.  Several research centers around the world are investigating GPUs for accelerating compute-intensive applications, enabling faster research and discovery.  To compare the various alternatives, HPL performance is of key importance.

GPUs are attached inside the servers to provide the extra compute horsepower required for application acceleration. Dell now offers a full-featured GPU solution based on the PowerEdge R720 servers (shown in Figure 1). Two of the latest Tesla M2090 GPUs can be added to each PowerEdge R720 server.  In this blog, we will present the performance and power results of a GPU-accelerated HPL on an 8-node PowerEdge R720 Cluster.

Figure 1: HPL performance and efficiency on an eight node cluster. Results are presented for different numbers of GPUs per node.

Figure 1 shows the performance of HPL on an eight node R720 cluster with different numbers of GPUs per node.  Compared to a CPU-only configuration, an acceleration of 2X is obtained by using one GPU per node and an acceleration of 3.5X with two GPUs per node.  Figure 2 shows the power consumption results.  As shown, the power efficiency, i.e. the useful work delivered for every watt of power consumed, improves by adding GPUs.  With two M2090 GPUs the power efficiency is almost 1.5X compared to the CPU-only configuration.

Figure 2: Total Power and Power Efficiency of the eight node cluster.

In conclusion, first, using GPUs can substantially accelerate HPL. As shown in Figure 1, using CPUs only, each compute node delivers about 250 GFLOPS of sustained performance; by adding GPUs, the sustained performance improves to about 875 GFLOPS per node.  Second, using GPUs improves the performance/watt ratio as well. The power consumption due to GPUs increases, but not as much as the corresponding performance improvement.  As shown in Figure 2, a CPU-only cluster consumes about 3,000 Watts and operates at 0.72 GFLOPS/Watt; after adding GPUs the power consumption increases to about 6,600 Watts, but the cluster now operates at 1.07 GFLOPS/Watt, which represents an increase of about 48% in performance/Watt.
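The performance-per-watt comparison above reduces to two one-line helpers; a sketch using the measured GFLOPS/Watt figures from this post:

```python
def gflops_per_watt(sustained_gflops, watts):
    """Sustained cluster performance per watt."""
    return sustained_gflops / watts

def percent_gain(new, old):
    """Percentage improvement of `new` over `old`."""
    return (new / old - 1.0) * 100.0

# measured: 0.72 GFLOPS/Watt (CPU only) vs 1.07 GFLOPS/Watt (with GPUs)
gain = percent_gain(1.07, 0.72)  # ~48% better performance/Watt
```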

## Configuration and Installation

Each PowerEdge R720 has a dual Intel Xeon E5-2600 series processor. Please note installing two NVIDIA Tesla M2090 GPUs requires the use of a GPU enablement kit, the x16 option on the 3rd riser, and dual, redundant 1100W power supplies, shown in Figure 3. The details of the hardware and software components are given below:

Figure 3: Two M2090 GPUs can be attached inside the R720 using a riser and associated power cables.

• #### ANSYS Mechanical Simulations with the M2090 GPU on the Dell R720

Authors:  Saeed Iqbal, Shawn Gao

Computational Structural Mechanics is commonly used by scientists and engineers to reduce the product development cycle time across various industries ranging from aerospace to structural biology.  One of the most successful techniques that lends itself to computational methods in structural analysis is the finite element method.  The finite element method is used to solve the resulting partial differential equations, inevitably making it a compute and memory intensive task.

ANSYS Mechanical is a well-known and widely used software package for computational structural mechanics. It can perform comprehensive static and dynamic analysis on structures.  It uses the finite element method to model the associated structure or process and offers various built-in solvers to solve the resulting linear system. In addition, it has a library of material models, making it easy to use and to perform coupled-physics simulations.

Typically, available processing power limits the size and number of ANSYS Mechanical simulations. Traditionally, parallel processing is used to reduce simulation runtimes. Recently, the popularity of using Graphics Processing Units (GPUs) to accelerate simulations has generated interest in the ANSYS community, because GPUs coupled with parallel processing can further reduce the simulation runtime significantly. ANSYS Mechanical has supported GPU acceleration since version 13.  In this study we evaluate the acceleration with a single M2090 GPU on seven standard ANSYS Mechanical benchmarks.

Table 1 lists the benchmarks along with their problem sizes and the solvers they use.

Table 1: Benchmarks

| ANSYS Mechanical Benchmark | Problem Size (DOFs) | Solver |
| --- | --- | --- |
| CG-1 | 1100K | JCG |
| SP-1 | 400K | Sparse |
| SP-2 | 1000K | Sparse |
| SP-3 | 2300K | Sparse |
| SP-4 | 1000K | Sparse |
| SP-5 | 2100K | Sparse |
| SP-6 | 4900K | Sparse |

## Configuration

The Dell PowerEdge R720 is used for running the ANSYS benchmarks. The R720 is a feature-rich dual-socket 2U server that can be configured with two internal GPUs or act as a host for external GPUs.  We have used the R720 in both internal and external configurations. For external GPUs we have used the Dell PowerEdge C410X. The C410X provides a unique, flexible and powerful 3U PCIe expansion chassis for housing up to 16 external GPUs.  The PE C410X can connect up to eight hosts simultaneously and share the GPUs among them by mapping 2, 4 or 8 GPUs per host. Table 2 shows the software and hardware configuration that was used for this study.

Table 2: Hardware and Software Configuration

| Component | Details |
| --- | --- |
| Server | PowerEdge R720 |
| Processors | Two Intel Xeon E5-2660 @ 2.2 GHz, 95 W |
| Memory | 128 GB @ 1333 MHz |
| OS | RHEL 6.2 |
| CUDA | 4.0 |
| GPU model | NVIDIA Tesla M2090 |
| GPU cores | 512 |
| GPU memory | 6 GB |
| GPU memory bandwidth | 177 GB/s |
| Theoretical peak performance, single precision | 1,331 GFLOPS |
| Theoretical peak performance, double precision | 665 GFLOPS |
| Power capping | 225 W |
| Benchmark suite | ANSYS Mechanical Version 14 |
| External GPU chassis | PowerEdge C410X, 3U, sixteen GPUs |

ANSYS has several license models which limit the number of CPU cores usable for ANSYS runs. The two core license is common; we use a two core license for this study.

## Conclusion and Results

We measure the acceleration due to a single M2090 GPU on the benchmarks listed in Table 1.  The results are shown in Figure 1. The total runtime, including I/O, is selected as the performance metric in each case; a lower runtime is better. From the graph, it is observed that the mean (geometric) acceleration from using a single GPU, across the seven benchmarks, is 79.1% for the internal GPU configuration and 77.9% for the external GPU configuration. The slight difference is assumed to be due to the improved CPU to GPU bandwidth in the internal GPU configuration.

Figure 1: Runtimes of the seven ANSYS benchmarks.
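The mean quoted above is a geometric mean, the usual way to average speedups across benchmarks. A sketch of the computation (the per-benchmark speedups below are placeholders, since only the mean is reported here):

```python
import math

def geo_mean_acceleration(speedups):
    """Geometric mean of per-benchmark speedups, expressed as a percent gain."""
    gm = math.prod(speedups) ** (1.0 / len(speedups))
    return (gm - 1.0) * 100.0

# placeholder speedups for seven benchmarks (illustrative only)
example = geo_mean_acceleration([1.9, 1.6, 1.8, 1.7, 1.9, 1.8, 1.8])
```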

• #### Comparing GPU-Direct Enabled Communication Patterns for Oil and Gas Simulations

Authors:  Saeed Iqbal, Shawn Gao, Ty Mckercher

Seismic processing algorithms are an important part of Oil and Gas workflows due to the ever increasing demand for high resolution subsurface images that help delineate hydrocarbon reserves. In general, seismic processing algorithms are very compute intensive.  Typically, these algorithms require clusters of powerful servers to process vast amounts of data and produce results in a timely fashion.  GPUs look very promising for further accelerating such Oil and Gas workflows by reducing the time required to nominate wells for a given prospect.  One or more GPUs can be attached to the compute nodes depending on the compute node architecture and the ability of the application to utilize them.

One of the key factors in realizing the acceleration is the manner in which communication between GPUs takes place; it is desirable to have high bandwidth between GPUs.  GPUDirect 2.0 is a very useful feature in NVIDIA’s CUDA 4.1 toolkit, which supports peer-to-peer communication among GPUs effectively improving the GPU to GPU bandwidth (by reducing or eliminating extra host memory copies).

The Dell PowerEdge C410X provides a unique, flexible and powerful solution to take advantage of GPUDirect 2.0 technology. First, it enables eight GPUs to be connected to a single compute node; second, its architecture is ideally suited for enabling GPUDirect among them. The Dell PowerEdge C6145 is a powerful compute node used to attach to the PowerEdge C410X.  It offers a choice of using multiple HIC cards to connect to the C410X (see Figure 1). Depending on the bandwidth requirements, users can select the number of HIC cards; for most current applications, one or two HIC cards per compute node are sufficient.

Figure 1: The C6145 connected to C410X using two HIC cards in the compute node. Up to eight GPUs can be connected to a single HIC.

The configuration of the C6145 and GPU used in the study is shown in Table 1.

Table 1: Hardware and Software Configuration

| Component | Details |
| --- | --- |
| Server | PowerEdge C6145 |
| Processors | Four AMD Opteron 6282 SE @ 2.6 GHz |
| Memory | 128 GB, 1333 MHz |
| OS | RHEL 6.1 |
| CUDA | 4.1 |
| GPU model | NVIDIA Tesla M2090 |
| GPU cores | 512 |
| GPU memory | 6 GB |
| GPU memory bandwidth | 177 GB/s |
| Peak performance, single precision | 1,331 GFLOPS |
| Peak performance, double precision | 665 GFLOPS |
| Power capping | 225 W |
| External GPU chassis | PowerEdge C410X, 3U, sixteen GPUs |

## GPU Communication Patterns

GPUs are assumed to be connected in a chain-like fashion.  Most seismic shot records are quite large and require frequent communication between GPUs, since a record may not fit in a single GPU's memory space.  The shot record is streamed across multiple GPUs using boundary calculations on halo regions during processing.  The goal of the communication is that each GPU exchanges data with its neighbors (GPUs at the start and end of the chain have only one neighbor) so that computation can occur simultaneously while halo data is transferred between GPUs.   There can be several different communication patterns to achieve this data exchange.  We evaluate the performance of two such common communication patterns in this study.

The first approach (Pattern A) uses a communication pattern composed of two phases.  During the first phase all GPUs simultaneously send a message to their neighbor on the right, and during the second phase all GPUs send messages to their neighbor on the left.  This pattern is shown in Figure 2 below. This pattern is also known as a Left/Right Exchange.

Figure 2: Communication Pattern A

The second approach (Pattern B) also occurs in two phases, as shown in Figure 3 below, and requires an even number of GPUs.  In the first phase, pairs of GPUs are selected as shown; all GPUs receive and send messages among the pairs.  In the second phase GPUs are paired differently, as shown in Figure 3, to send and receive messages among the pairs again. This pattern is also called a Pairwise Exchange.

Figure 3: Communication Pattern B

It is interesting to note that at the completion of Pattern A and Pattern B, the final result as far as data exchange is concerned is the same. The difference is in the timing of particular message exchanges among GPUs.
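The two patterns can be sketched as a toy simulation (my own naming, tracking only which neighbor halos each GPU receives) that confirms both deliver the same data in the end:

```python
def left_right_exchange(n):
    """Pattern A: phase 1, everyone sends right; phase 2, everyone sends left.
    Returns, for each GPU in the chain, the set of neighbor halos it received."""
    received = [set() for _ in range(n)]
    for i in range(n - 1):          # phase 1: GPU i -> GPU i+1
        received[i + 1].add(i)
    for i in range(1, n):           # phase 2: GPU i -> GPU i-1
        received[i - 1].add(i)
    return received

def pairwise_exchange(n):
    """Pattern B: phase 1 pairs (0,1),(2,3),...; phase 2 pairs (1,2),(3,4),...
    Requires an even number of GPUs."""
    assert n % 2 == 0, "Pairwise Exchange needs an even GPU count"
    received = [set() for _ in range(n)]
    for start in (0, 1):            # the two pairing phases
        for i in range(start, n - 1, 2):
            received[i].add(i + 1)  # each pair swaps halos
            received[i + 1].add(i)
    return received

# same final data exchange, different message timing
assert left_right_exchange(8) == pairwise_exchange(8)
```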

## Results

The performance of the above communication patterns with different numbers of GPUs is shown in Figure 4.  The bandwidth is also compared with a single IOH (one HIC card) and dual IOHs (two HIC cards) connected to the C6145.

1. The communication pattern B (pairwise exchange) results in better bandwidth among GPUs compared to pattern A (left/right exchange) in the vast majority of cases.  The advantage of pattern B is more pronounced when dual IOH configuration is used.

2. The single IOH configuration results in higher bandwidth among GPUs, by about 50–70%. In the single HIC configuration, more communication occurs within the C410X and less data needs to move between the C410X and the C6145 (host).

The choice between a single HIC and dual HIC configuration depends on the associated applications and how much time is spent on compute compared to communication. GPUDirect can automatically find the best performance paths in both configurations at runtime.

Figure 4: The bandwidth of Communication Patterns A and B for different numbers of GPUs in the chain and numbers of IOHs.


• #### Dell Mainstream Servers Get A Boost with NVIDIA GPUs

Guest blog post by Sumit Gupta, Senior Director, Tesla GPU Computing, NVIDIA

Every server manufacturer announced support last week for the new Intel Sandy Bridge CPUs in their new models. That includes PC giant Dell, which announced, for the first time, that it is supporting Tesla GPUs in its mainstream Dell PowerEdge R720 servers. And, our new benchmarks demonstrate why.

The PowerEdge R720 is, by far, one of the most popular servers in the Dell server portfolio, one of the highest volume servers in the world, and often a top choice for IT organizations. The plethora of enterprise-ready peripheral options and highly flexible configurations make the server an easy purchase decision.

By including Tesla GPUs in the top-selling Dell server, GPU computing is now truly available to the mass market.  And, the mass market can now take advantage of GPU acceleration for a broad range of applications.

We benchmarked the new Dell PowerEdge R720 with two Tesla M2090 GPUs using the popular computational bio-chemistry applications NAMD, AMBER, and LAMMPS.  Results are below.

In all benchmarks, the Dell systems that included two GPUs were considerably faster than CPU-only configurations – anywhere from three to six times faster.

So the most popular Dell servers just got better…and much, much faster.

The Dell R720 configuration we benchmarked is:

Dual socket Intel Xeon® E5-2650L 1.80GHz (Sandy Bridge), 16 cores total
64 GB, 1066 MHz DDR3 (32 GB per CPU)
Two Tesla M2090 GPUs
Redhat Enterprise Linux (RHEL) 6.2
NVIDIA driver version 295.20
Detailed data below:

Sumit Gupta joined NVIDIA in 2007 and serves as a senior manager in the Tesla GPU Computing HPC business unit, responsible for marketing and business development of CUDA-based GPU computing products. Previously, Sumit served in a variety of positions, including product manager at Tensilica, entrepreneur-in-residence at Tallwood Venture Capital, and chip designer at S3 Graphics. He also served as a postdoctoral researcher at the University of California, San Diego and Irvine, and as a software engineer at IBM and at IMEC in Belgium. Sumit has a Ph.D. in Computer Science from the University of California, Irvine, and a B.Tech. in Electrical Engineering from the Indian Institute of Technology, Delhi. He has authored one book, one patent, several book chapters and more than 20 peer-reviewed conference and journal publications.

• #### GPUDirect Improves Communication Bandwidth Between GPUs on the C410X

Authored by: Saeed Iqbal, and Shawn Gao

NVIDIA has supported GPUDirect v2.0 technology since CUDA 4.0. GPUDirect enables peer-to-peer communication among GPUs: peer-to-peer directives in CUDA allow GPUs to exchange messages directly with each other. The effective communication bandwidth attained in peer-to-peer mode depends on how the GPUs are connected to the system. Given an application with a certain communication requirement and the available bandwidth, developers can decide how many GPUs to use and which GPUs are most suitable for their particular case. Let us look at each of the cases below.

The PowerEdge C410x is a 3U enclosure which can hold 16 GPUs, for example the M2090. Up to eight host servers, such as the PE C6145 or PE C6100, can connect to the C410x via a Host Interface Card (HIC) in the host and an iPASS cable. The C410X has two layers of switches to connect iPASS cables to GPUs. The connected hosts (ports) are mapped to the 16 GPUs via an easy-to-use web interface.  It is worth mentioning the relative ease with which the GPUs attached to a server can be changed using the web interface, without requiring any alterations in cable connections. Currently, the available GPU to host ratios are 2:1, 4:1 and 8:1.  So, a single HIC can access up to 8 GPUs.

The figures below show the process of communication between GPUs conceptually. The C410X has 16 GPUs, each attached to one of the eight ports on the C410x through two layers of switches (these layers are shown as a single switch to simplify the diagram). The black lines represent the connections between the IOHs in the hosts and the switches in the C410X; the red lines show the two communicating GPUs.

Figure 1: GPU to GPU communication via the host memory. The total bandwidth attained in this case is about 3 GB/s.

Case 1: Figure 1 shows a scenario where the GPUs are connected to the system via separate IOHs. In this case peer-to-peer transfer is not supported and the messages have to go through host memory.  The entire operation needs a device-to-host transfer followed by a host-to-device transfer.  The message is also stored in host memory during the transmission.

Figure 2: GPU to GPU communication via a shared IOH on the host compute node. The total bandwidth attained is about 5 GB/s.

Case 2: Now consider two GPUs that share an IOH.  As shown in Figure 2, the GPUs are connected to the same IOH but on independent PCIe x16 links.  GPUDirect is beneficial here because the message avoids the copy to host memory; instead, it is routed directly through the IOH to the receiving GPU's memory.

Figure 3: GPU to GPU communication via a switch on the C410X. The total bandwidth attained is about 6 GB/s.

Case 3: Figure 3 shows peer-to-peer (P2P) communication among GPUs interconnected via a PCIe switch, such as the case on a C410X. This is the shortest path between two GPUs.  GPUDirect is very beneficial in this case because the message does not need to be routed through IOHs and host memory.

Figure 4: The measured bandwidths for the three cases show the advantage of P2P communication.

Results are shown in Figure 4: as the GPUs move “closer” to each other, GPUDirect allows a faster mode of peer-to-peer communication between them. Peer-to-peer communication via an IOH improves the bandwidth by about 53%, and via a switch it improves by about 93%.  In addition, the cudaMemcpy() function can automatically select the best peer-to-peer method available between a given pair of GPUs. This feature allows the developer to use cudaMemcpy() independent of the underlying system architecture.

As shown above, the PowerEdge C410X is ideally suited to utilizing GPUDirect technology. The PowerEdge C410x is unique for the compute power, density and flexibility it offers for designing GPU-accelerated compute systems.

• #### HPC GPU Flexibility: The Evolution of GPU Applications and Systems Part 4

 Dell's Dr. Jeff Layton

Part 4 – GPU Configurations Using Dell Systems
GPUs are arguably one of the hottest trends in HPC. They can greatly improve performance while reducing the power consumption of systems. However, because the area of GPU Computing is still evolving both for application development and for tool set creation, GPU systems need to be as flexible as possible.

In the last blog I talked about the basic building blocks Dell has created for GPU systems. In this article I want to talk about possible GPU configurations and how they work well for the rapidly evolving GPU ecosystem.

Dell GPU Configurations
In the last article, I talked about the various Dell hardware components that can be used to create GPU computing solutions. These components were designed around a solution strategy that can be summed into a single word - flexibility. Solutions need to be created so that they can be easily adapted to evolving applications and to evolving GPU tools.

The PowerEdge C410x that I mentioned in the last blog is an external chassis that holds GPUs or really any PCIe device that meets the specific requirements of the chassis. To refresh your memory, it can accommodate up to 16 GPUs in PCIe G2 x16 slots in a 3U chassis. It also has 8 PCIe G2 x16 connections that connect the chassis to the host nodes. The current version of the chassis allows you to reduce the number of external connections to 4 so that each x16 connection into the chassis connects to 4 GPUs. It can also reduce the number of connections to 2 so that each x16 connection connects to 8 GPUs.

One approach that Dell has taken in the creation of GPU solutions is to develop solution “bricks” or solution “sandwiches” that combine the various components and form GPU compute solutions. For example, you could begin with the Dell PowerEdge C6100 that has four two-socket system boards in a 2U package, and connect two of these to a single Dell PowerEdge C410x. An example of this is shown below in Figure 1.

Figure 1 – Two Dell PowerEdge C6100 (top and bottom) and one Dell PowerEdge C410x (center)

In the middle of this group (or sandwich) is the C410x and you can see the 10 front sleds for the GPUs in Figure 1. Then above and below the C410x are the C6100 units (they each have 24 2.5” drives in this configuration but 3.5” drive configurations are also available). In total there are eight dual-socket Intel Westmere based systems, each with a PCIe x8 QDR Infiniband card and a PCIe x16 HIC card to connect to the C410x which has up to 16 GPUs.

Quick diversion – probably the best way to talk about GPU configuration is to use the nomenclature of,

(Number of GPUs) : (Number of x16 slots)

This is a better nomenclature than the number of GPUs per node or GPUs per socket, since it describes how many GPUs share a PCIe G2 x16 slot. This matters because applications that are early in their development cycle and depend heavily on CPU-to-GPU data transfer efficiency typically need lots of host-GPU bandwidth.

Using the configuration in Figure 1, we can create a 1:1 configuration (1 GPU per x16 HIC). This means we have a total of 8 GPUs in the C410x since we have 8 system boards in the two C6100s. This also gives each GPU the full x16 bandwidth if that is needed.

You could start with this configuration and add 8 more GPUs at a later date when your application(s) need them. If you do this, you create a 2:1 configuration (2 GPUs per x16 slot). In the best case, each GPU gets full access to the entire PCIe x16 bandwidth. In the worst case, both GPUs communicate with the host over the PCIe bus at the same time, completely saturating it, so each GPU effectively gets only PCIe x8 bandwidth. However, in our experience, except for a small number of applications, this rarely happens.
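The worst-case arithmetic above can be sketched as a small helper. The 8 GB/s figure for a PCIe Gen2 x16 link is the commonly quoted theoretical peak; treat it as an assumption rather than a measured number:

```python
# Worst-case per-GPU host bandwidth when several GPUs share one
# PCIe Gen2 x16 connection (the "N:1" nomenclature in the text).

PCIE_G2_X16_GBPS = 8.0  # theoretical peak of one Gen2 x16 link (assumption)

def worst_case_per_gpu(gpus_per_slot, link_gbps=PCIE_G2_X16_GBPS):
    """If all GPUs on the slot transfer at once, the link splits evenly."""
    return link_gbps / gpus_per_slot

# 1:1 keeps the full x16; 2:1 degrades to x8-equivalent in the worst case:
# worst_case_per_gpu(1) -> 8.0
# worst_case_per_gpu(2) -> 4.0
```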

Recall that the C410x only has a maximum of 8 incoming HIC connections but that you don’t have to use all eight of them. So we can remove one of the C6100 units in Figure 1 and get Figure 2 so we have one C6100 that has four system boards to a single C410x.

Figure 2: Single C6100 (bottom) connected to a single C410x (top)

If Figure 1 is sometimes called a “sandwich” you can think of Figure 2 as an “open-face sandwich”. With this configuration we can increase the number of GPUs assigned to each system board.

Using Figure 2, you could create a 3:1 configuration (3 GPUs to a single x16 HIC) by only populating 12 of the 16 slots in the C410x. This configuration also gives you some flexibility to add four more GPUs if you need them at a later date. This configuration can also be extended to 4:1 (4 GPUs per single x16 HIC) by using all 16 slots in the C410x where each system will have 4 GPUs in addition to the two Intel Westmere processors, a QDR IB card, and up to 6 2.5” drives per system board.

With the updated C410x system you can go to 8:1 (8 GPUs per x16 HIC), but this means you would need two C410x systems connected to a single C6100, as shown below in Figure 3.

Figure 3: Single C6100 (middle) connected to two C410x units (top and bottom)

In keeping with the sandwich theme, you can call this a “reverse sandwich” where the bread has become the filling and the filling the bread (although I don’t think I want to make a reverse peanut butter sandwich). In Figure 3, each x16 HIC connection connects to 8 GPUs. While this may sound extreme, there are applications that can scale with the number of GPUs even with a single HIC. Nvidia’s new version of CUDA, Version 4.0, allows direct GPU-to-GPU communication so that the host-to-GPU bus isn’t such a bottleneck in overall application performance (look for upcoming blogs from Dell that discuss this).

One other comment I wanted to make about the configuration in Figure 3 – this is the largest number of GPUs you can currently have (8) on a single system. This is a limitation in the current CUDA driver but I’m betting that someday Nvidia will increase this to a larger number.

We can extend this sandwich idea of host node and GPUs to the new Dell PowerEdge C6145. If you remember from my previous article, this is a 2U chassis with two four-socket system boards based on AMD Opteron 6100 series processors and four PCIe G2 x16 slots (electrical and mechanical), as well as a PCIe G2 x8 slot for InfiniBand. However, we’re still limited to a total of 8 GPUs for a single C6145 board, so we can connect either two GPUs to a single HIC or four GPUs to two HICs in the configuration in Figure 3. This demonstrates the flexibility of these configurations.
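To summarize the constraints discussed above, here is a minimal sketch that checks a proposed configuration against the C410x slot and uplink counts and the 8-GPU-per-host CUDA driver limit. The function and parameter names are my own, for illustration, not a Dell tool:

```python
# Sanity-check a host/GPU "sandwich" configuration against the limits
# described above: 16 slots and 8 HIC uplinks per C410x, and at most
# 8 GPUs visible to a single host (current CUDA driver limit).

C410X_SLOTS = 16
C410X_MAX_HICS = 8
MAX_GPUS_PER_HOST = 8

def check_config(hosts, gpus_per_hic, hics_per_host=1, chassis=1):
    """Return True if the configuration fits within all three limits."""
    gpus_per_host = gpus_per_hic * hics_per_host
    total_gpus = gpus_per_host * hosts
    total_hics = hics_per_host * hosts
    if gpus_per_host > MAX_GPUS_PER_HOST:
        return False            # exceeds the per-host driver limit
    if total_gpus > chassis * C410X_SLOTS:
        return False            # not enough GPU slots
    if total_hics > chassis * C410X_MAX_HICS:
        return False            # not enough HIC uplinks
    return True

# 8 hosts (two C6100s) at 2:1 exactly fill one C410x: 16 GPUs, 8 HICs.
# check_config(hosts=8, gpus_per_hic=2) -> True
# A 16:1 ratio would exceed the 8-GPU-per-host driver limit:
# check_config(hosts=1, gpus_per_hic=16) -> False
```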

Contrast this approach with that of using internal GPUs. If the GPUs are internal, you are stuck with the number of GPUs the system ships with. At best you could start with one GPU and add a second one at some point in the future. But to do that you have to open up the case, pop the second one in, hope you don’t cause any problems, and reboot. With the approach presented here, you just add more GPUs via a sled, assign them to a node, and reboot the node. You don’t have to open up a case. If you have experienced the joy of popping open a case, closing it up, and then finding it won’t boot, you know what I’m talking about.

Summary
GPU applications and tools are evolving quickly. During this evolutionary process, the hardware requirements that allow an application to run most efficiently will also change. To be the most cost effective, your GPU configurations need to be flexible. This primarily means being able to vary the number of GPUs per node and how they are connected, which translates into separating the GPUs from the CPUs.

Dell has developed a unique external PCIe chassis that allows you to put GPUs in a separate chassis that is optimized for the high power and cooling requirements of GPUs. This chassis, called the C410x, can be connected to host nodes via HIC cards and PCIe cables according to your requirements. This means that you can upgrade/change the host nodes or the GPUs independent of one another.

In this article I’ve tried to illustrate some GPU configurations that use Dell host nodes and the C410x. This combination allows a great deal of flexibility which is precisely what you need for the rapidly evolving world of GPU computing.

In the next article I will present some benchmarks of common GPU applications.

-- Dr. Jeffrey Layton

• #### NAMD Performance on PE C6100 and C410X

Executive Summary

• Given a single fully populated C410x with 16 M2070 GPUs, the recommended solution for NAMD is the 8 node configuration shown below. On STMV, a standard large NAMD benchmark, it delivers a 3.5X speedup compared to an equivalent CPU-only cluster.
• We recommend using X5650 processors in the compute nodes for NAMD.

Figure 1: The Dell GPGPU Solution based on a C410x and two PE C6100s
Introduction
General Purpose Graphics Processing Units (GPUs) are very well suited to accelerating molecular dynamics (MD) simulations. GPUs can give a quantum leap in performance for commonly used MD codes, making it possible for researchers to use more efficient and dense high performance computing architectures. NAMD is a well-known and commonly used MD simulator. It is a parallel molecular dynamics code designed for high-performance simulation of large bio-molecular systems, developed through the joint collaboration of the Theoretical and Computational Biophysics Group (TCB) and the Parallel Programming Laboratory (PPL) at the University of Illinois at Urbana-Champaign. NAMD is distributed free of charge with source code. NAMD has four benchmarks of varying problem size; the table below gives each benchmark and its problem size:
| NAMD Benchmark | Problem Size in Number of Atoms |
| --- | --- |
| ER-GRE | 36K |
| APOA1 | 92K |
| F1ATPASE | 327K |
| STMV | 1066K |
The performance of these benchmarks is measured in “days/ns”. On a given compute system, “days/ns” is the number of compute days required to simulate 1 nanosecond of real time, so lower is better. The Dell HPC Engineering team has configured and evaluated GPU based solutions to help customers select solutions according to their specific needs. As shown in Figure 1, the current offering combines one or two PowerEdge C6100 servers as hosts to the PowerEdge C410x, resulting in 4 node or 8 node compute clusters. The GPU solution uses 16 NVIDIA Tesla M2070 GPUs and the CUDA 4.0 software stack. The NAMD code is run without any optimization; however, to get good scaling during parallel runs, the following parameter values are changed (in accordance with the guidelines for running the benchmarks on parallel machines):
| NAMD Parameter | Value |
| --- | --- |
| Time Steps | 500 |
| Output Energies | 100 |
| Output Timing | 100 |
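Since “days/ns” is a lower-is-better metric, comparing two systems means taking a ratio. Here is a minimal sketch of the conversion and speedup arithmetic; the sample values in the comments are made up for illustration:

```python
# "days/ns" = compute days needed per simulated nanosecond (lower is better).

def to_ns_per_day(days_per_ns):
    """Convert days/ns to the throughput form ns/day (higher is better)."""
    return 1.0 / days_per_ns

def speedup(baseline_days_per_ns, accelerated_days_per_ns):
    """Speedup of the accelerated system relative to the baseline."""
    return baseline_days_per_ns / accelerated_days_per_ns

# Illustrative values only: a system at 1.0 days/ns vs a baseline at
# 3.5 days/ns is a 3.5X speedup, and delivers 1.0 ns of simulation per day.
```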
Hardware and Software Configuration
Figure 2 shows the hardware configuration used. Each compute node (PE C6100) is connected to the PE C410x using an iPASS cable (red) and to the InfiniBand switch (blue) for internode communication. The details of the hardware and software components used for the 4 and 8 node NAMD configurations are given below:
Figure 2: The PCIe Gen2 x16 iPASS cables and InfiniBand connection diagram. 8 compute nodes are connected to the C410x using an iPASS cable
| Component | Attribute | Value |
| --- | --- | --- |
| PowerEdge C410x | GPGPU model | NVIDIA Tesla M2070 |
| | Number of GPGPUs | 16 |
| | iPASS cables | 8 |
| | Mapping | 2:1, 4:1 |
| 1x (2x) PowerEdge C6100 | Compute nodes | 4 (8) |
| | Node processor | Two X5650 @ 2.66 GHz |
| | Memory | 48 GB 1333 MHz |
| | OS | RHEL 5.5 (2.6.18-194.el5) |
| | CUDA | 4.0 |
| M2070 GPGPU | Number of cores | 448 |
| | Memory | 6 GB |
| | Memory bandwidth | 150 GB/s |
| | Peak performance, single precision | 1030 GFLOPS |
| | Peak performance, double precision | 515 GFLOPS |
| Benchmark | NAMD v2.8b1 | www.ks.uiuc.edu/Research/namd |
Sensitivity to Problem Size and Host Processor
Figure 3 shows the performance of the four NAMD benchmarks on an 8 node cluster. The comparison is between a CPU-only cluster and the same cluster with 2 GPUs per node; the host processor is also varied to find its impact on overall performance. As shown in Figure 3, for small problems the performance is better on the CPU-only cluster, while for the two larger problems performance is better on the GPU-attached cluster. There appears to be a threshold between 100K and 300K atoms where the advantage shifts from CPUs only to a cluster with GPUs. For larger problems there is clearly an advantage in using GPUs; the largest, STMV, shows a 3.5X speedup compared to the CPU-only cluster (with the X5670 processors).
Figure 3: Performance of NAMD benchmarks on the 8 node cluster. The performance is expressed in “days/ns” (lower is better)
Using the faster X5670 2.93 GHz processor improves the performance in all cases. However, the impact of the faster processor is more pronounced on the CPU-only clusters. On the GPU-attached cluster, the most compute-intensive tasks are offloaded to the GPUs, so the X5670 delivers very similar performance to the X5650. If we consider the two largest problems, on average the faster processors give 6.7% more performance at the cost of 7.0% more power and a higher price. Considering the largest problem only, the faster processors gain 0.08% performance at the cost of 9.6% more power and a higher price. Based on these facts we recommend using X5650 processors in the compute nodes, because for larger problem sizes (1 million atoms or more) the difference between the slower and faster processors is minimal. From here onwards, this study focuses only on the larger NAMD benchmarks.
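The processor recommendation above boils down to a performance-per-watt ratio. A small sketch of that arithmetic, using the percentage deltas quoted in the text:

```python
# Relative performance-per-watt of an upgrade that adds `perf_gain`
# fractional performance for `power_cost` fractional extra power.

def relative_perf_per_watt(perf_gain, power_cost):
    """> 1.0 means the upgrade improves efficiency; < 1.0 means it hurts."""
    return (1.0 + perf_gain) / (1.0 + power_cost)

# X5670 vs X5650, two largest problems: +6.7% perf for +7.0% power,
# so efficiency lands slightly below 1.0.
x5670_avg = relative_perf_per_watt(0.067, 0.070)
# STMV only: +0.08% perf for +9.6% power -- clearly below 1.0.
x5670_stmv = relative_perf_per_watt(0.0008, 0.096)
```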

Selecting the Cluster Size
Figure 4 compares the performance of the two large NAMD benchmarks on 4 node and 8 node clusters, keeping the number of GPUs fixed at 16. As shown in Figure 4, F1ATPASE, at about 327K atoms, does only slightly better on 8 nodes, but STMV, at about 1066K atoms, runs about 35% faster on 8 nodes than on 4 nodes.
Figure 4: Comparing performance of two NAMD benchmarks on 4 node and 8 node clusters. The performance is expressed in “days/ns” (lower is better)
Figure 5: Comparing power consumption of two NAMD benchmarks on 4 node and 8 node clusters
Figure 5 compares the relative power consumption of the two benchmarks on 4 node and 8 node clusters. F1ATPASE consumes about 26% more power for about a 2% gain in performance. STMV consumes about 32% more power for about 35% more performance. The choice between a 4 or 8 node cluster depends on the problem size in number of atoms. For problem sizes of about 325K atoms and below, the 4 node cluster may provide the best value, as it is less expensive and consumes less power while performing similarly to the 8 node cluster. For problem sizes of 1000K atoms or larger, the 8 node cluster may provide the best value.