High Performance Computing Blogs

High Performance Computing
A discussion venue for all things high performance computing (HPC), supercomputing, and the technologies that enable scientific research and discovery.
  • Accelerating CFD using OpenFOAM with GPUs

    Authors:  Saeed Iqbal and Kevin Tubbs

    The OpenFOAM CFD Toolbox is a free, open source CFD software package produced by OpenCFD Ltd. Its user base spans a wide range of engineering and science disciplines in both commercial and academic organizations. OpenFOAM has an extensive set of features for solving a wide range of fluid flows and physical phenomena, and it provides tools for all three stages of CFD: preprocessing, solving, and post-processing. Almost all of these tools run in parallel as standard, making OpenFOAM an important resource for scientists and engineers using HPC for CFD.

    General-purpose graphics processing unit (GPU) technology is increasingly used to accelerate compute-intensive HPC applications across various disciplines in the HPC community. OpenFOAM CFD simulations can take a significant amount of time and are computationally intensive, so comparing the various alternatives for enabling faster research and discovery using CFD is of key importance. SpeedIT libraries from Vratis provide GPU-accelerated iterative solvers that replace the iterative solvers in OpenFOAM.

    In order to investigate the GPU acceleration of OpenFOAM, we simulate the three-dimensional lid-driven cavity problem based on the tutorial provided with OpenFOAM. The 3D lid-driven cavity problem is an incompressible flow problem solved using OpenFOAM's icoFoam solver. The most computationally intensive portion of the solver is the pressure equation, and in the accelerated case only the pressure calculation is offloaded to the GPUs. On the CPUs, the PCG solver with a DIC preconditioner is used. In the GPU-accelerated case, the SpeedIT 2.1 algebraic multigrid preconditioner with smoothed aggregation (AMG) is used, in combination with the SpeedIT Plugin to OpenFOAM.

    Figure 1: OpenFOAM performance of 3D cavity case using 4 million cells on a single node.

    Figure 1 shows the performance of OpenFOAM's 3D lid-driven cavity case using approximately 4 million cells on a single R720 node. The results are presented for CPU only, CPU + 1 M2090 GPU, and CPU + 2 M2090 GPUs. The R720 CPU-only results reflect the maximum number of cores available on this configuration (16 cores). The software limits the number of CPU cores used for GPU acceleration, mapping one CPU core to one GPU, so the R720 + 1 M2090 and R720 + 2 M2090 results reflect the use of 1 core + 1 GPU and 2 cores + 2 GPUs respectively. Compared to the CPU-only configuration, one GPU yields no acceleration, while two GPUs yield an acceleration of 1.5X. Figure 2 shows the measured power consumption for the 4 million cell simulation. As shown, the power efficiency, i.e. the useful work delivered for every watt of power consumed, improves by adding GPUs. The power efficiency is defined as the performance (simulations/day) per measured power consumption (Watt). With one M2090 the power efficiency is approximately 1.3X, and with two M2090 GPUs it is almost 1.4X, compared to the CPU-only configuration.

    Figure 2: Total Power and Power Efficiency of 3D cavity case on 4 million cells on a single node.

    Figure 3 shows the performance of OpenFOAM’s 3D lid driven cavity case using approximately 8 million cells on a single R720 node. The size of the problem required the use of both GPUs. Compared to a CPU only configuration, an acceleration of 1.5X was achieved with two GPUs.  Figure 4 shows the power consumption results for the 8 million cell simulation.  As shown, the power efficiency also improves for the larger simulation.  With two M2090 GPUs the power efficiency is almost 1.3X compared to the CPU only configuration.

      Figure 3: OpenFOAM performance 3D cavity case using 8 million cells on a single node.

    Figure 4: Total Power and Power Efficiency of 3D cavity case on 8 million cells on a single node.

     

    In conclusion, first, using GPUs can accelerate the OpenFOAM icoFoam solver for incompressible fluid flow. As shown in Figure 1, using CPUs only, a single node delivers about 24 simulations/day of sustained performance for a problem size of 4 million cells. Adding one GPU delivers about the same sustained performance but increases the performance/watt ratio, while adding two GPUs improves the sustained performance to about 36 simulations/day.

    Second, using GPUs improves the performance/watt ratio as well. The power consumption due to GPUs increases, but not as much as the corresponding performance improvement. As shown in Figure 2, the CPU-only simulation consumes about 400 Watts and operates at 0.061 (simulations/day)/Watt. Adding one GPU but using only one core of the CPU, the power consumption decreases to about 300 Watts and operation improves to 0.078 (simulations/day)/Watt, an increase of about 28% in performance/Watt. Adding two GPUs and using only two cores of the CPU, the power consumption increases to about 445 Watts and operation improves to 0.083 (simulations/day)/Watt, an increase of about 36% in performance/Watt. Similar trends are shown in Figures 3 and 4 for the problem size of 8 million cells. On the larger problem size, the performance increased from about 15 simulations/day for CPUs only to about 24 simulations/day with two GPUs. The power consumption increased from about 391 Watts at 0.039 (simulations/day)/Watt for CPUs only to about 462 Watts at 0.051 (simulations/day)/Watt with two GPUs, an increase of about 32% in performance/Watt.
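    For readers who want to check these numbers, the short C++ sketch below recomputes the performance-per-watt figures from the approximate simulations/day and power values quoted above. The inputs are read off Figures 1 and 2, so treat them as approximate:

        #include <cstdio>

        // Recompute (simulations/day)/Watt for the 4 million cell case.
        // Inputs are the approximate values quoted in the text above.
        int main() {
            struct Config { const char* name; double simsPerDay; double watts; };
            const Config runs[] = {
                {"16 CPU cores     ", 24.0, 400.0},
                {"1 core + 1 M2090 ", 24.0, 300.0},
                {"2 cores + 2 M2090", 36.0, 445.0},
            };
            const double base = runs[0].simsPerDay / runs[0].watts; // ~0.060
            for (const Config& r : runs) {
                double eff = r.simsPerDay / r.watts;
                printf("%s  %.3f (sims/day)/W  %.2fX vs CPU only\n",
                       r.name, eff, eff / base);
            }
            return 0;
        }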

    Configuration and Installation

    Each PowerEdge R720 has dual Intel Xeon E5-2600 series processors. Please note that installing two NVIDIA Tesla M2090 GPUs requires a GPU enablement kit, the x16 option on the third riser, and dual redundant 1100W power supplies, as shown in Figure 5. The details of the hardware and software components are given below:

    Figure 5: Two M2090 GPUs can be attached inside the R720 using a riser and associated power cables.

    Compute Node Model: PowerEdge R720
    Compute Node Processor: Two Intel Xeon E5-2660 @ 2.2 GHz, 95 W
    Memory: 64 GB 1333 MHz
    GPUs: NVIDIA Tesla M2090
    Number of GPUs: 2
    M2090 GPU Number of cores: 512
    Memory: 6 GB
    Memory bandwidth: 177 GB/s
    Peak Performance, Single Precision: 1,331 GFLOPS
    Peak Performance, Double Precision: 665 GFLOPS
    Power Capping: 225 W
    Software: OpenFOAM Version 1.7.1
    SpeedIT from Vratis: Version 2.1
    CUDA: 4.0 (285.05.23)
    OS: RHEL 6.2
  • Accelerating High Performance Linpack (HPL) with GPUs

    Authors:  Saeed Iqbal and Shawn Gao

    High Performance Linpack (HPL) is a commonly used reference benchmark for HPC systems. HPL stresses the compute and memory subsystems of the systems under test and provides insight into their performance. Nowadays, general-purpose graphics processing units (GPUs) are widely used to accelerate such compute-intensive HPC applications across various disciplines, and several research centers around the world are investigating GPUs for accelerating compute-intensive applications to enable faster research and discovery. HPL performance is a key metric for comparing the various alternatives.

    GPUs are attached inside the servers to provide the extra compute horsepower required for application acceleration. Dell now offers a full-featured GPU solution based on the PowerEdge R720 server. Two of the latest Tesla M2090 GPUs can be added to each PowerEdge R720 server. In this blog, we present the performance and power results of GPU-accelerated HPL on an 8-node PowerEdge R720 cluster.

    Figure 1: HPL performance and efficiency on an eight node cluster. Results are presented for different number of GPUs per node.

    Figure 1 shows the performance of HPL on an eight node R720 cluster with different numbers of GPUs per node. Compared to a CPU-only configuration, an acceleration of 2X is obtained by using one GPU per node, and an acceleration of 3.5X with two GPUs per node. Figure 2 shows the power consumption results. As shown, the power efficiency, i.e. the useful work delivered for every watt of power consumed, improves by adding GPUs. With two M2090 GPUs per node the power efficiency is almost 1.5X that of the CPU-only configuration.

    Figure 2: Total Power and Power Efficiency of the eight node cluster.

    In conclusion, first, using GPUs can substantially accelerate HPL. As shown in Figure 1, using CPUs only, each compute node delivers about 250 GFLOPS of sustained performance; adding two GPUs per node improves the sustained performance to about 875 GFLOPS per node. Second, using GPUs improves the performance/watt ratio as well. The power consumption due to GPUs increases, but not as much as the corresponding performance improvement. As shown in Figure 2, the CPU-only cluster consumes about 3000 Watts and operates at 0.72 GFLOPS/Watt; with GPUs added, the power consumption increases to about 6600 Watts but the cluster operates at 1.07 GFLOPS/Watt, an increase of about 48% in performance/Watt.

    Configuration and Installation

    Each PowerEdge R720 has dual Intel Xeon E5-2600 series processors. Please note that installing two NVIDIA Tesla M2090 GPUs requires a GPU enablement kit, the x16 option on the third riser, and dual redundant 1100W power supplies, as shown in Figure 3. The details of the hardware and software components are given below:

    Figure 3: Two M2090 GPUs can be attached inside the R720 using a riser and associated power cables.

  • GPUDirect Improves Communication Bandwidth Between GPUs on the C410X

    Authored by: Saeed Iqbal and Shawn Gao

    NVIDIA has supported GPUDirect v2.0 technology since CUDA 4.0. GPUDirect enables peer-to-peer communication among GPUs: peer-to-peer communication directives in CUDA allow GPUs to exchange messages directly with each other. The effective communication bandwidth attained in peer-to-peer mode depends on how the GPUs are connected to the system. Given an application with a certain communication requirement and the available bandwidth, developers can decide how many GPUs to use and which GPUs are most suitable for their particular case. Let us look at each of the cases below.

    The PowerEdge C410x is a 3U enclosure that can hold 16 GPUs, such as the M2090. Up to eight host servers, such as the PE C6145 or PE C6100, can connect to the C410x via a Host Interface Card (HIC) in each host and an iPASS cable. The C410x has two layers of switches to connect the iPASS cables to GPUs. The connected hosts (ports) are mapped to the 16 GPUs via an easy-to-use web interface. It is worth mentioning the relative ease with which the GPUs attached to a server can be changed using the web interface, without requiring any alterations to cable connections. Currently, the available GPU-to-host ratios are 2:1, 4:1 and 8:1, so a single HIC can access up to 8 GPUs.

    The figures below show the process of communication between GPUs conceptually. The C410x has 16 GPUs; each GPU is attached to one of the eight ports on the C410x through two layers of switches (shown as a single switch to simplify the diagrams). The black lines represent the connections from the IOHs in the hosts to the switches in the C410x; the red lines show the two communicating GPUs.

    Figure 1: GPU to GPU communication via the host memory. The total bandwidth attained in this case is about 3 GB/s.

    Case 1: Figure 1 shows a scenario where the GPUs are connected to the system via separate IOHs. In this case peer-to-peer transfer is not supported and the messages have to go through host memory. The entire operation needs a device-to-host transfer followed by a host-to-device transfer, and the message is also staged in host memory during the transfer.

    Figure 2: GPU to GPU communication via a shared IOH on the host compute node. The total bandwidth attained is about 5 GB/s.

    Case 2: Now consider two GPUs that share an IOH. As shown in Figure 2, the GPUs are connected to the same IOH but on independent PCIe x16 links. GPUDirect is beneficial here because the message avoids the copy to host memory; instead it is routed directly through the IOH to the receiving GPU's memory.


    Figure 3: GPU to GPU communication via a switch on the C410x. The total bandwidth attained is about 6 GB/s.

    Case 3: Figure 3 shows peer-to-peer (P2P) communication among GPUs interconnected via a PCIe switch, as is the case on a C410x. This is the shortest path between two GPUs. GPUDirect is very beneficial in this case because the message does not need to be routed through IOHs or host memory.

    Figure 4: The measured bandwidths for the three cases show the advantage of P2P communication.

    Results are shown in Figure 4: as the GPUs move “closer” to each other, GPUDirect allows a faster mode of peer-to-peer communication between them. Peer-to-peer communication via an IOH improves the bandwidth by about 53%, and via a switch by about 93%. In addition, the cudaMemcpy() function call can automatically select the best peer-to-peer method available between a given pair of GPUs. This feature allows the developer to use cudaMemcpy() independent of the underlying system architecture.
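    To make this concrete, below is a minimal CUDA sketch of how an application enables and uses peer-to-peer copies. The device IDs (0 and 1) and the buffer size are placeholders; note that cudaMemcpyPeer() falls back to staging the transfer through host memory (Case 1) when peer access is not available:

        #include <cuda_runtime.h>
        #include <cstdio>

        int main() {
            const size_t bytes = 64 << 20; // 64 MB test message (placeholder)
            int canAccess01 = 0, canAccess10 = 0;
            cudaDeviceCanAccessPeer(&canAccess01, 0, 1);
            cudaDeviceCanAccessPeer(&canAccess10, 1, 0);

            float *src, *dst;
            cudaSetDevice(0);
            cudaMalloc((void**)&src, bytes);
            cudaSetDevice(1);
            cudaMalloc((void**)&dst, bytes);

            if (canAccess01 && canAccess10) {
                // Cases 2 and 3: GPUs share an IOH or a PCIe switch, so the
                // copy can bypass host memory entirely.
                cudaSetDevice(0);
                cudaDeviceEnablePeerAccess(1, 0);
                cudaSetDevice(1);
                cudaDeviceEnablePeerAccess(0, 0);
            }
            // cudaMemcpyPeer() uses the best available path; without peer
            // access (Case 1) it stages the transfer through host memory.
            cudaMemcpyPeer(dst, 1, src, 0, bytes);
            cudaDeviceSynchronize();
            printf("peer access 0->1: %d, 1->0: %d\n", canAccess01, canAccess10);
            return 0;
        }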

    As shown above, the PowerEdge C410x is ideally suited to utilizing GPUDirect technology. The PowerEdge C410x is unique for the compute power, density and flexibility it offers for designing GPU-accelerated compute systems.


  • Dell Mainstream Servers Get A Boost with NVIDIA GPUs

    Guest blog post by Sumit Gupta, Senior Director, Tesla GPU Computing, NVIDIA

    Every server manufacturer announced support last week for the new Intel Sandy Bridge CPUs in their new models. That includes PC giant Dell, which announced, for the first time, that it is supporting Tesla GPUs in its mainstream Dell PowerEdge R720 servers. And, our new benchmarks demonstrate why.

    The PowerEdge R720 is, by far, one of the most popular servers in the Dell server portfolio, one of the highest volume servers in the world, and often a top choice for IT organizations. The plethora of enterprise-ready peripheral options and highly flexible configurations make the server an easy purchase decision.

    By including Tesla GPUs in the top-selling Dell server, GPU computing is now truly available to the mass market.  And, the mass market can now take advantage of GPU acceleration for a broad range of applications.

    We benchmarked the new Dell PowerEdge R720 with two Tesla M2090 GPUs using the popular computational biochemistry applications NAMD, AMBER, and LAMMPS. Results are below.



    In all benchmarks, the Dell systems that included two GPUs were considerably faster than CPU-only configurations – anywhere from three to six times faster.

    So the most popular Dell servers just got better…and much, much faster.

    The Dell R720 configuration we benchmarked is:

    Dual socket Intel Xeon® E5-2650L 1.80GHz (Sandy Bridge), 16 cores total
    64 GB, 1066 MHz DDR3 (32 GB per CPU)
    Two Tesla M2090 GPUs
    Red Hat Enterprise Linux (RHEL) 6.2
    NVIDIA driver version 295.20
    Detailed data below:

    About Sumit Gupta

    Sumit Gupta joined NVIDIA in 2007 and serves as a senior manager in the Tesla GPU Computing HPC business unit, responsible for marketing and business development of CUDA-based GPU computing products. Previously, Sumit served in a variety of positions, including product manager at Tensilica, entrepreneur-in-residence at Tallwood Venture Capital, and chip designer at S3 Graphics. He also served as a postdoctoral researcher at the University of California, San Diego and Irvine, and as a software engineer at IBM and at IMEC in Belgium.

    Sumit has a Ph.D. in Computer Science from the University of California, Irvine, and a B.Tech. in Electrical Engineering from the Indian Institute of Technology, Delhi. He has authored one book, one patent, several book chapters and more than 20 peer-reviewed conference and journal publications.



  • HPC: Accelerating ANSYS Mechanical Simulations with M2090 GPU on the R720

    Authors:  Saeed Iqbal, Shawn Gao

    Computational structural mechanics is commonly used by scientists and engineers to reduce the product development cycle time across various industries ranging from aerospace to structural biology. One of the most successful techniques that lends itself to computational methods in structural analysis is the finite element method, which is used to solve the resulting partial differential equations; this inevitably makes it a compute- and memory-intensive task.

    ANSYS Mechanical is a well-known and widely used software package for computational structural mechanics. It can perform comprehensive static and dynamic analysis on structures. It uses the finite element method to model the associated structure or process and offers various built-in solvers to solve the resulting linear system. In addition, it has a library of material models, making it easy to use and able to perform coupled-physics simulations.

    Typically, available processing power limits the size and number of ANSYS Mechanical simulations. Traditionally, parallel processing is used to reduce simulation runtimes. Recently, using graphics processing units (GPUs) to accelerate these simulations has generated interest in the ANSYS community, because GPUs coupled with parallel processing can further reduce simulation runtimes significantly. ANSYS Mechanical has supported GPU acceleration since version 13. In this study we evaluate the acceleration from a single M2090 GPU on seven standard ANSYS Mechanical benchmarks.

    Table 1 lists the benchmarks along with their problem sizes and the solvers they use.

    Table 1: Benchmarks

    ANSYS Mechanical Benchmark: Problem Size in Degrees of Freedom (DOFs), Solver
    CG-1: 1100K, JCG solver
    SP-1: 400K, Sparse solver
    SP-2: 1000K, Sparse solver
    SP-3: 2300K, Sparse solver
    SP-4: 1000K, Sparse solver
    SP-5: 2100K, Sparse solver
    SP-6: 4900K, Sparse solver

    Configuration

    The Dell PowerEdge R720 is used for running the ANSYS benchmarks. The R720 is a feature-rich dual-socket 2U server that can be configured with two internal GPUs or act as a host for external GPUs. We have used the R720 in both internal and external configurations. For external GPUs we have used the Dell PowerEdge C410x, a unique, flexible and powerful 3U PCIe expansion chassis housing up to 16 external GPUs. The PE C410x can connect up to eight hosts simultaneously and share the GPUs among them by mapping 2, 4 or 8 GPUs per host. Table 2 shows the hardware and software configuration used for this study.

    Table 2: Hardware and Software Configuration

    PowerEdge R720
    Processor: Two Intel Xeon E5-2660 @ 2.2 GHz, 95 W
    Memory: 128 GB @ 1333 MHz
    OS: RHEL 6.2
    CUDA: 4.0

    GPU
    Model: NVIDIA Tesla M2090
    GPU cores: 512
    GPU Memory: 6 GB
    GPU Memory bandwidth: 177 GB/s
    Theoretical Peak Performance, Single Precision: 1331 GFLOPS
    Theoretical Peak Performance, Double Precision: 665 GFLOPS
    Power Capping: 225 W

    Benchmark Suite: ANSYS Mechanical, Version 14
    External GPU Chassis: PowerEdge C410x, 3U, sixteen GPUs

     

    ANSYS offers several license models, which limit the number of CPU cores usable for ANSYS runs. The two-core license is common; we use a two-core license for this study.

    Conclusion and Results

    We measure the acceleration delivered by a single M2090 GPU on the benchmarks listed in Table 1. The results are shown in Figure 1. The total runtime, including I/O, is used as the performance metric in each case; a lower runtime is better. From the graph, the mean (geometric) acceleration from a single GPU across the seven benchmarks is 79.1% for the internal GPU configuration and 77.9% for the external GPU configuration. The slight difference is likely due to the improved CPU-to-GPU bandwidth in the internal GPU configuration.

    Figure 1: Runtimes of the seven ANSYS benchmarks.
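    For reference, the geometric mean used above is the n-th root of the product of the per-benchmark speedups. The sketch below shows the computation; the seven speedup values are placeholders, since the actual values come from the runtimes in Figure 1:

        #include <cmath>
        #include <cstdio>

        // Geometric mean of per-benchmark speedups: exp(mean of logs).
        double geometricMean(const double* x, int n) {
            double logSum = 0.0;
            for (int i = 0; i < n; ++i) logSum += log(x[i]);
            return exp(logSum / n);
        }

        int main() {
            // Placeholder speedups for the seven benchmarks (see Figure 1).
            const double speedups[7] = {1.6, 1.9, 1.7, 1.8, 2.0, 1.7, 1.9};
            printf("geometric mean speedup: %.2fX\n", geometricMean(speedups, 7));
            return 0;
        }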


  • GPUs…, Nodes, Sockets, Cores and FLOPS, Oh, My!

    In a previous post, I described how to compute the peak theoretical floating point performance of a potential system.

    http://www.delltechcenter.com/page/Nodes,+Sockets,+Cores+and+FLOPS,+Oh,+My#fbid=TkQxC6Vb2Bi  

    In that post, I alluded to GPUs coming into the mix: “When might you need MHz these days, you ask? Think GPU speeds.” Well, that time has come! The nVidia GTC conference is soon (www.gputechconf.com) and systems are now regularly shipping with GPUs such as the nVidia K20 and K20x, which operate at MHz frequencies.

    There are several references available indicating that the new nVidia K20 contains 2,496 cores, and the operating frequency is also available. Do not attempt to use these two pieces of data to compute a peak theoretical floating point performance number as described in the previous blog.

    The K20 does indeed contain 2,496 cores, but not all of them are available for double precision floating point math. These cores are arranged into what are called Streaming Multiprocessor (SM) units. SM units in a GPGPU on an nVidia card are analogous to CPUs in sockets on a motherboard. Each SM does indeed contain 192 cores, all of which are available for single precision floating point math. But unlike most CPUs, not all GPU cores are available for double precision floating point math. On the nVidia K20 SM, 64 cores can perform double precision floating point math at a rate of 2 flops/clock.

    There are 13 SM units in the K20, operating at a 706 MHz frequency. Here is the use of MHz referenced in the previous blog: 706 MHz is 0.706 GHz. Note that 13 SMs * 192 cores per SM gives the quoted 2,496 cores total. Also note in the math below that the 64 double precision core count is used, not the 192 (single precision) core count quoted.

    Here’s the peak theoretical floating point math for a K20:

    GFLOPS  =  13 SM/K20 * 64 cores/SM  *  0.706 GHz/core  * 2 GFLOPs/GHz

    GFLOPS  =  1,174.784

    I have seen this appear as 1.17 TFLOPS or 1,175 GFLOPS.

    Additionally, the nVidia K20x contains an additional SM unit for a total of 14 SM units and it operates at a slightly higher frequency of 732 MHz or 0.732 GHz.

    Here’s the peak theoretical floating point math for a K20x:

    GFLOPS  =  14 SM/K20x * 64 cores/SM  *  0.732 GHz/core  * 2 GFLOPs/GHz

    GFLOPS  =  1,311.744

    I have seen this appear as 1.31 TFLOPS or 1,312 GFLOPS.

    Hope that helps.  Compute the CPU performance as described in the previous blog.  Compute the GPU performance as described here.  The total system performance is the sum of these.    
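    If you prefer to let the computer do the arithmetic, the counting rule above wraps up into a small function; this C++ sketch reproduces the K20 and K20x numbers exactly:

        #include <cstdio>

        // Peak theoretical double precision GFLOPS, using the counting rule
        // above: only the 64 double precision cores per SM participate, at
        // 2 flops per clock.
        double peakDpGflops(int sms, int dpCoresPerSm, double ghz) {
            return sms * dpCoresPerSm * ghz * 2.0;
        }

        int main() {
            printf("K20 : %.3f GFLOPS\n", peakDpGflops(13, 64, 0.706)); // 1174.784
            printf("K20x: %.3f GFLOPS\n", peakDpGflops(14, 64, 0.732)); // 1311.744
            return 0;
        }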

    Remember that this is the peak theoretical floating point performance.  Since it is theoretical, it is the performance you are guaranteed to never see!  But we also already have a few blogs posted about real-world performance using GPUs:

    Comparing GPU-Direct Enabled Communication Patterns for Oil and Gas Simulations: http://dell.to/JsWqWT 

    ANSYS Mechanical Simulations with the M2090 GPU on the Dell R720:  http://dell.to/JT79KF

    Faster Molecular Dynamics with GPUs: http://en.community.dell.com/techcenter/high-performance-computing/b/hpc_gpu_computing/archive/2012/08/07/faster-molecular-dynamics-with-gpus.aspx

    Accelerating High Performance Linpack (HPL) with GPUs:  http://en.community.dell.com/techcenter/high-performance-computing/b/hpc_gpu_computing/archive/2012/08/07/accelerating-high-performance-linpack-hpl-with-gpus.aspx

    If you have comments or can contribute additional information, please feel free to do so.  Thanks.  --Mark R. Fernandez, Ph.D.

    @MarkFatDell

    #Iwork4Dell

  • Faster Molecular Dynamics with GPUs

    Authors:  Saeed Iqbal and Shawn Gao

     

    General-purpose graphics processing units (GPUs) have proven their acceleration capacity across several HPC application classes; in general, they are very suitable for accelerating compute-intensive applications, e.g. Computational Fluid Dynamics (CFD), Molecular Dynamics (MD), Quantum Chemistry (QC), Computational Finance (CF) and Oil & Gas applications. Among these areas, Molecular Dynamics has benefitted tremendously from GPU acceleration. This is in part due to the nature of its core algorithms, which suit the hybrid CPU-GPU computing model, and, equally important, to the freely available, sophisticated GPU-enabled molecular dynamics simulators. NAMD is one such GPU-enabled simulator. For more detailed information about NAMD and GPUs, please visit http://www.ks.uiuc.edu/Research/namd/ and http://www.nvidia.com/TeslaApps.

    In this blog we evaluate the improved NAMD performance due to GPU-accelerated compute nodes. Two proteins, F1ATPASE and STMV, which consist of 327K and 1066K atoms respectively, are chosen for their relatively large problem sizes. The performance measure is “days/ns”, the number of days required to simulate 1 nanosecond of simulated time.
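    As an aside, days/ns follows directly from the measured time per integration step. The small helper below assumes the common 1 fs timestep, i.e. 1,000,000 steps per simulated nanosecond; the 0.25 s/step input is purely hypothetical:

        #include <cstdio>

        // Convert measured seconds/step into the "days/ns" metric.
        // Assumes a 1 fs timestep, i.e. 1,000,000 steps per simulated ns.
        double daysPerNs(double secondsPerStep, double stepsPerNs = 1.0e6) {
            return secondsPerStep * stepsPerNs / 86400.0; // 86400 s per day
        }

        int main() {
            // Hypothetical example: 0.25 s/step is about 2.9 days/ns.
            printf("%.2f days/ns\n", daysPerNs(0.25));
            return 0;
        }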

    Figure 1: Relative performance of two NAMD benchmarks on the 8 node R720 cluster. F1ATPASE is accelerated about 1.1X and STMV about 2.8X.

    Figure 1 illustrates the relative performance of the two NAMD benchmarks on the 8 node R720 cluster, keeping the number of GPUs fixed at 16. In both cases the benchmarks run faster with GPUs; however, the acceleration is very sensitive to problem size. For F1ATPASE we see a modest 1.1X acceleration, and for STMV we observe a 2.8X acceleration. As expected, the acceleration improves with problem size; there seems to be a minimum threshold of around 300K atoms to make GPUs feasible, as shown by the F1ATPASE model. Figure 2 shows the additional power required for the GPUs; there is a 1.6X increase in total power consumption. From the power efficiency point of view, running STMV with dual internal GPUs is beneficial, as the performance gain is 2.8X for an additional 1.6X power.

    In summary, GPUs can accelerate NAMD simulations, and problem size is a key factor in determining how much a particular simulation is accelerated. In a previous study (http://en.community.dell.com/techcenter/high-performance-computing/w/wiki/namd-performance-on-pe-c6100-and-c410x.aspx) we found a similar sensitivity to problem/simulation size (number of atoms). STMV, the largest simulation we have at about 1 million atoms, accelerates much better than smaller simulations; it is expected that even larger simulations can be accelerated even more.

    Figure 2: Relative Power Consumption of two NAMD benchmarks on the 8 node R720 cluster.  In both cases there is about 1.6X increase in power consumption due to GPUs.

    Cluster Configuration

    The cluster consists of one master node and eight PowerEdge R720 compute nodes, as shown in Figure 3. The compute nodes can be configured with one or two internal Tesla M2090 GPUs; each node in our cluster has two M2090 GPUs for acceleration. The details of the hardware, software and NAMD parameter setup are given below:

    Figure 3: The 8 node PowerEdge R720 cluster. Each compute node has two internal GPUs, for a total of 16 Tesla M2090 GPUs.

  • Comparing GPU-Direct Enabled Communication Patterns for Oil and Gas Simulations

    Authors:  Saeed Iqbal, Shawn Gao, Ty Mckercher

    Seismic processing algorithms are an important part of Oil and Gas workflows due to the ever-increasing demand for high resolution subsurface images that help delineate hydrocarbon reserves. In general, seismic processing algorithms are very compute-intensive. Typically, these algorithms require clusters of powerful servers to process vast amounts of data and produce results in a timely fashion. GPUs look very promising for further accelerating such Oil and Gas workflows by reducing the time required to nominate wells for a given prospect. One or more GPUs can be attached to the compute nodes, depending on the compute node architecture and the ability of the application to utilize them.

    One of the key factors in realizing the acceleration is the manner in which communication between GPUs takes place; it is desirable to have high bandwidth between GPUs. GPUDirect 2.0 is a very useful feature of NVIDIA's CUDA 4.1 toolkit that supports peer-to-peer communication among GPUs, effectively improving the GPU-to-GPU bandwidth by reducing or eliminating extra host memory copies.

    The Dell PowerEdge C410x provides a unique, flexible and powerful solution for taking advantage of GPUDirect 2.0 technology. First, it enables eight GPUs to be connected to a single compute node, and second, its architecture is ideally suited to enabling GPUDirect among them. The Dell PowerEdge C6145 is a powerful compute node used to attach to the PowerEdge C410x. It offers a choice of using multiple HIC cards to connect to the C410x (see Figure 1). Depending on the bandwidth requirements, users can select the number of HIC cards; for most current applications a single or dual HIC card per compute node is sufficient.

    Figure 1: The C6145 connected to C410X using two HIC cards in the compute node. Up to eight GPUs can be connected to a single HIC.

    The configuration of the C6145 and GPU used in the study is shown in Table 1.

    Table 1: Hardware and Software Configuration

    PowerEdge C6145
    Processor: Four AMD Opteron 6282 SE @ 2.6 GHz
    Memory: 128 GB 1333 MHz
    OS: RHEL 6.1
    CUDA: 4.1

    GPU
    Model: NVIDIA Tesla M2090
    GPU cores: 512
    GPU Memory: 6 GB
    GPU Memory bandwidth: 177 GB/s
    Peak Performance, Single Precision: 1331 GFLOPS
    Peak Performance, Double Precision: 665 GFLOPS
    Power Capping: 225 W

    External GPU Chassis: PowerEdge C410x, 3U, sixteen GPUs

    GPU Communication Patterns

    GPUs are assumed to be connected in a chain-like fashion. Most seismic shot records are quite large and require frequent communication between GPUs, since a record may not fit in a single GPU's memory space. The shot record is streamed across multiple GPUs using boundary calculations on halo regions during processing. The goal of the communication is for each GPU to exchange data with its neighbors (GPUs at the start and end of the chain have only one neighbor) so that computation can proceed while halo data is transferred between GPUs. There can be several different communication patterns that achieve this data exchange. We evaluate the performance of two such common patterns in this study.

    The first approach (Pattern A) uses a communication pattern composed of two phases. During the first phase all GPUs simultaneously send a message to their neighbor on the right, and during the second phase all GPUs send messages to their neighbor on the left. This pattern is shown in Figure 2 below. It is also known as a Left/Right Exchange.

    Figure 2: Communication Pattern A

    The second approach (Pattern B) also proceeds in two phases, as shown in Figure 3 below, and requires an even number of GPUs. In the first phase, pairs of GPUs are selected as shown, and all GPUs send and receive messages within their pairs. In the second phase the GPUs are paired differently, as shown in Figure 3, and again send and receive messages within the pairs. This pattern is also called a Pairwise Exchange.

    Figure 3: Communication Pattern B

    It is interesting to note that at the completion of Pattern A and Pattern B, the final result as far as data exchange is concerned is the same; the difference is in the timing of particular message exchanges among GPUs. A minimal sketch of both schedules is given below.
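    The CUDA sketch below illustrates the two schedules using peer-to-peer copies. The send/receive buffers and streams are hypothetical (streams[i] is assumed to have been created on device i), peer access is assumed to be enabled beforehand, and nGpus is assumed even for the pairwise exchange:

        #include <cuda_runtime.h>

        // Wait for outstanding copies on every device before the next phase.
        static void syncAll(int nGpus) {
            for (int d = 0; d < nGpus; ++d) {
                cudaSetDevice(d);
                cudaDeviceSynchronize();
            }
        }

        // Pattern A (Left/Right Exchange): all GPUs send right, then all send left.
        void leftRightExchange(float** sendBuf, float** recvBuf, int nGpus,
                               size_t bytes, cudaStream_t* streams) {
            for (int i = 0; i + 1 < nGpus; ++i)  // Phase 1: send to right neighbor
                cudaMemcpyPeerAsync(recvBuf[i + 1], i + 1, sendBuf[i], i,
                                    bytes, streams[i]);
            syncAll(nGpus);
            for (int i = 1; i < nGpus; ++i)      // Phase 2: send to left neighbor
                cudaMemcpyPeerAsync(recvBuf[i - 1], i - 1, sendBuf[i], i,
                                    bytes, streams[i]);
            syncAll(nGpus);
        }

        // Pattern B (Pairwise Exchange): pairs (0,1),(2,3),... exchange both
        // ways, then the pairing shifts by one GPU: (1,2),(3,4),...
        void pairwiseExchange(float** sendBuf, float** recvBuf, int nGpus,
                              size_t bytes, cudaStream_t* streams) {
            for (int i = 0; i + 1 < nGpus; i += 2) {   // Phase 1
                cudaMemcpyPeerAsync(recvBuf[i + 1], i + 1, sendBuf[i], i,
                                    bytes, streams[i]);
                cudaMemcpyPeerAsync(recvBuf[i], i, sendBuf[i + 1], i + 1,
                                    bytes, streams[i + 1]);
            }
            syncAll(nGpus);
            for (int i = 1; i + 1 < nGpus; i += 2) {   // Phase 2 (shifted pairs)
                cudaMemcpyPeerAsync(recvBuf[i + 1], i + 1, sendBuf[i], i,
                                    bytes, streams[i]);
                cudaMemcpyPeerAsync(recvBuf[i], i, sendBuf[i + 1], i + 1,
                                    bytes, streams[i + 1]);
            }
            syncAll(nGpus);
        }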

    Results

    The performance of the above communication patterns with different numbers of GPUs is shown in Figure 4. The bandwidth is also compared between a single IOH (one HIC card) and dual IOHs (two HIC cards) on the C6145.

    1. Communication pattern B (pairwise exchange) results in better bandwidth among GPUs than pattern A (left/right exchange) in the vast majority of cases. The advantage of pattern B is more pronounced when the dual IOH configuration is used.

    2. The single IOH configuration results in higher bandwidth among GPUs, by about 50~70%. In the single HIC configuration, more communication occurs within the C410x and less data needs to move between the C410x and the C6145 (host).

    The choice between a single HIC and a dual HIC configuration depends on the associated applications and how much time is spent on computation compared to communication. GPUDirect can automatically find the best-performing paths in both configurations at runtime.

    Figure 4: The bandwidth for communication patterns A and B for different numbers of GPUs in the chain and numbers of IOHs.

    For more information:

    NVIDIA GPUDirect Technology

  • NAMD Performance on PE C6100 and C410X

    Executive Summary

    • Given a single fully populated C410x with 16 M2070 GPUs, the recommended solution for NAMD is the 8 node configuration shown below. On STMV, a standard large NAMD benchmark, it shows 3.5X the performance of an equivalent CPU-only cluster.
    • It is recommended to use X5650 processors on the compute nodes for NAMD.

    Figure 1: The Dell GPGPU solution based on a C410x and two PE C6100s
    Introduction
    General Purpose Graphics Processing Units (GPUs) are very suitable for accelerating molecular dynamics (MD) simulations. GPUs can give a quantum leap in performance for commonly used MD codes, making it possible for researchers to use more efficient and dense high performance computing architectures. NAMD is a very well known and commonly used MD simulator. It is a parallel molecular dynamics code designed for high-performance simulation of large bio-molecular systems, developed jointly by the Theoretical and Computational Biophysics Group (TCB) and the Parallel Programming Laboratory (PPL) at the University of Illinois at Urbana-Champaign. NAMD is distributed free of charge with source code. NAMD has four benchmarks of varying problem size; the table below gives each benchmark and its problem size:
    NAMD Benchmark Problem Size in Number of Atoms
    ER-GRE 36K
    APOA1 92K
    F1ATPASE 327K
    STMV 1066K
    The performance of these benchmarks is measured in “days/ns”: on a given compute system, “days/ns” is the number of compute days required to simulate 1 nanosecond of simulated time, so lower is better. The Dell HPC Engineering team has configured and evaluated GPU based solutions to help customers select solutions according to their specific needs. As shown in Figure 1, the current offering combines one or two PowerEdge C6100 servers as hosts to the PowerEdge C410x, resulting in 4 node or 8 node compute clusters. The GPU solution uses 16 NVIDIA Tesla M2070 GPUs and the CUDA 4.0 software stack. The NAMD code is run without any optimization; however, to get good scaling during parallel runs, the following parameter values are changed (in accordance with the guidelines for running the benchmarks on parallel machines):
    NAMD Parameter Value
    Time Steps 500
    Output Energies 100
    Output Timing 100
    Hardware and Software Configuration
    Figure 2 shows the hardware configuration used: each compute node (PE C6100) is connected to the PE C410x using an iPASS cable (red) and to the InfiniBand switch (blue) for internode communication. The details of the hardware and software components used for the 4 and 8 node NAMD configurations are given below:
    Figure 2: The PCIe Gen2 x16 iPASS cables and InfiniBand connection diagram. 8 compute nodes are connected to the C410x using an iPASS cable
    PowerEdge C410x
    GPGPU Model: NVIDIA Tesla M2070
    Number of GPGPUs: 16
    iPASS Cables: 8
    Mapping: 2:1, 4:1

    1x (2x) PowerEdge C6100, 4 (8) compute nodes
    Node Processor: Two X5650 @ 2.66 GHz
    Memory: 48 GB 1333 MHz
    OS: RHEL 5.5 (2.6.18-194.el5)
    CUDA: 4.0

    M2070 GPGPU
    Number of cores: 448
    Memory: 6 GB
    Memory bandwidth: 150 GB/s
    Peak Performance, Single Precision: 1030 GFLOPS
    Peak Performance, Double Precision: 515 GFLOPS

    Benchmark: NAMD v2.8b1 (www.ks.uiuc.edu/Research/namd)
    Sensitivity to Problem Size and Host Processor
    Figure 3 shows the performance of the four NAMD benchmarks on an 8 node cluster. The comparison is between a CPU-only cluster and the same cluster with 2 GPUs/node; the host processor is also varied to find its impact on overall performance. As shown in Figure 3, for small problems the performance is better on the CPU-only cluster, while for the two larger problems performance is better on the GPU-attached cluster. There seems to be a threshold between 100K and 300K atoms where the advantage shifts from CPUs only to a cluster with GPUs. For larger problems there is clearly an advantage to using GPUs; the largest benchmark, STMV, shows a 3.5X speedup compared to the CPU-only cluster (with the X5670 processors).
    Figure 3: Performance of NAMD benchmarks on the 8 node cluster. The performance is expressed in “day/ns” (lower is better)
    Using the faster X5670 2.93 GHz processor improves the performance in all cases. However, the impact of the faster processor is more pronounced on the CPU-only clusters. On the GPU-attached cluster, the most compute-intensive tasks are transferred to the GPUs, so the X5670 delivers very similar performance to the X5650. If we consider the two largest problems, on average the faster processors give 6.7% more performance at the cost of 7.0% more power and a higher price. Considering the largest problem only, the faster processors gain 0.08% performance at the cost of 9.6% more power and a higher price. Based on these facts we recommend using X5650 processors in compute nodes, because for larger problem sizes (1 million atoms or more) the difference between the slower and faster processors is minimal. From here onwards in this study, we focus only on the larger NAMD benchmarks.


    Selecting the Cluster Size
    Figure 4 compares the performance of the two large NAMD benchmarks on 4 node and 8 node clusters, keeping the number of GPUs fixed at 16. As shown in Figure 4, F1ATPASE, at about 327K atoms, does only slightly better on 8 nodes, but STMV, at about 1066K atoms, runs about 35% faster on 8 nodes than on 4 nodes.
    Figure 4: Comparing performance of two NAMD benchmarks on a 4 node and 8 node clusters. The performance is expressed in “day/ns” (lower is better)
    Figure 5: Comparing power consumption of two NAMD benchmarks on a 4 node and 8 node clusters
    Figure 5 compares the power consumption of the two benchmarks on the 4 node and 8 node clusters. F1ATPASE consumes about 26% more power for about a 2% gain in performance. STMV consumes about 32% more power for about 35% more performance. The choice between a 4 or 8 node cluster depends on the problem size in number of atoms. For problem sizes of about 325K atoms and below, the 4 node cluster might provide the best value, as it is less expensive and consumes less power while performing similarly to the 8 node cluster. However, for problem sizes of 1000K atoms or larger, the 8 node cluster may provide the best value.
  • HPL Performance on PE C6145 and C410x

    Executive Summary

    · With two PowerEdge C6145s attached to a PowerEdge C410x (Full Sandwich configuration), the best performance achieved with HPL is 2891 GFLOPS (31% of theoretical peak), consuming 5030 watts.
    · With a single PowerEdge C6145 attached to a C410x (Half Sandwich configuration), the best performance with HPL is 1697 GFLOPS (19% of theoretical peak), consuming 3802 watts.
    · The measured GFLOPS per watt show that the C6145 and C410x solution converts power to FLOPS up to 1.7X more efficiently compared to a CPU-only configuration.
    · GPGPUs offer great potential for improving the performance of suitable HPC applications.

    Introduction
    There is a lot of interest in the High Performance Computing (HPC) community in using General Purpose Graphics Processing Units (GPGPUs) to accelerate compute-intensive simulations. The Dell HPC Engineering team has configured and evaluated GPU based solutions to help customers select the correct solutions for their specific needs. Dell has introduced the PowerEdge C410x as a primary workhorse for GPU based number crunching, and our solutions are built around it. The current offering combines one or two AMD-based PowerEdge C6145 servers as host servers to the C410x.

    Figure 1: Two PowerEdge C6145 host servers “sandwich” a PowerEdge C410x.

    As shown in Figure 1, a PowerEdge C410x is used with two PowerEdge C6145 hosts. The PowerEdge C410x is an external 3U PCIe expansion chassis with space for 16 GPUs. Compute nodes connect to the C410x via a Host Interface Card (HIC) and an iPASS cable. All connected nodes are mapped to the available GPUs according to a user defined configuration. The way the 16 GPUs are allocated can be dynamically reconfigured using a web GUI, making the operation easier and faster. Currently, the available GPU-to-host ratios are 2:1, 4:1 and 8:1. So, a single compute node can access up to 8 GPUs! The design of the C410x allows for a high GPU density solution with efficient power utilization characteristics. Each C6145 has 2 compute nodes, giving a total of 4 compute nodes in a 7U rack space. Each compute node is configured with four AMD Opteron 6132 HE processors, 128 GB of DDR3 1,333 MHz memory, 4 PCIe connectors (SLOT 1, 2, 3 and MEZZ) and 1 external PCIe connector (iPASS). The total bandwidth (IOH 1 and IOH 2) between the node and external instruments is up to 10.4 GT/s. SLOT 3 and MEZZ are connected to IOH 1; the rest are attached to IOH 2.

    Figure 2: iPASS cable and InfiniBand connection diagram

    Configuration
    As shown in Figure 2, each compute node is connected to the C410x using two iPASS cables (red) and to the InfiniBand switch (blue) for internode communication. It is critical that compute nodes are connected to the C410x exactly as shown above, since any other configuration may result in performance degradation. The details of the components used are given below:


    PowerEdge C410x
    GPGPU Model: NVIDIA Tesla M2070
    Number of GPGPUs: 16
    iPASS Cables: 8
    Mapping: 2:1, 4:1, 8:1

    PowerEdge C6145 Compute Node
    Processor: Four Opteron 6132 HE @ 2.2 GHz
    Memory: 128 GB 1333 MHz
    BIOS: 1.7.0 (4/13/11)
    BMC FW: 1.02
    PIC FW: [0116]
    OS: RHEL 5.5 (2.6.18-194.el5)
    CUDA: 4.0

    M2070 GPGPU
    Number of cores: 448
    Memory: 6 GB
    Memory bandwidth: 150 GB/s
    Peak Performance, Single Precision: 1030 GFLOPS
    Peak Performance, Double Precision: 515 GFLOPS

    Benchmark: GPU-enabled HPL from NVIDIA, Version 11


    Best Practices for System Configuration
    · The inter-node connections through the InfiniBand switch should use the MEZZ card, which is installed on IOH 1 and shares bandwidth with GPUs connected on SLOT 3.
    · Based on measured bandwidth tests, the best bandwidth utilization is achieved when a single HIC connects to a maximum of two GPUs.
    · Using two HIC cards per compute node is highly recommended with the C6145 and C410x solution.
    · Due to the NUMA architecture of the C6145, special attention should be given to process-to-memory mapping. In general, using memory near the GPGPUs gives better performance.
    · A single compute node cannot work with more than 12 GPUs due to a system limitation.

    Performance

    Figure 3: Performance improvement due to GPGPUs.

    As listed in the configuration above, each M2070 GPGPU has a peak performance of 515 GFLOPS, giving a fully populated C410x with 16 GPUs a peak capacity of 8240 GFLOPS. Similarly, the peak compute capacity of a single C6145 compute node is 281.6 GFLOPS; all four nodes are rated at 1126.4 GFLOPS. The total peak performance of the GPGPU solution shown in Figure 1 is therefore about 9366 GFLOPS (double precision). Figure 3 shows the improvement in HPL performance due to GPGPU acceleration. As a reference, the blue bars show the measured performance with CPUs only. The red bars show the performance improvement when a total of 16 GPGPUs are used for acceleration. Two C6145s are attached to the C410x, and the mapping per compute node is set to either 4:1 or 2:1. When all four compute nodes of the C6145s are used with no GPGPUs attached, the performance is about 812 GFLOPS, giving an efficiency of 72.1%. By using 4 GPUs/node, the performance increases to 2891.0 GFLOPS, which is 3.6X the performance with only CPUs. For HPL, using the maximum of 16 GPGPUs is beneficial in both cases. However, a 2:1 mapping ratio gives 1.6X the performance of a 4:1 mapping ratio.

    Power Consumption and Efficient Power Utilization
    Compute-intensive benchmarks like HPL typically consume a large amount of power because they stress the processor and memory subsystems. Accurate power consumption values are therefore of interest from a datacenter design point of view. Figure 4 shows the power consumption of the GPGPU solution. When all four nodes are used with 16 GPGPUs, the total power consumption is 5030.5 watts, which is 2.1X the power consumed by the compute nodes without GPGPUs. The GFLOPS/watt metric is a measure of how efficiently the consumed power is converted to useful performance. Figure 5 shows the GFLOPS/watt of the GPGPU solution. When all four nodes are used, the solution delivers 0.575 GFLOPS/watt, which is about 1.7X that of the CPU-only solution.
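    As a back-of-the-envelope check, the short sketch below recomputes the peak capacity, HPL efficiency and GFLOPS/watt from the figures quoted in this post:

        #include <cstdio>

        // Recompute peak capacity, HPL efficiency and GFLOPS/watt from the
        // values quoted in this post.
        int main() {
            const double gpuPeakDp  = 515.0;   // M2070 double precision, GFLOPS
            const double nodePeakDp = 281.6;   // one C6145 compute node, GFLOPS
            const double peak = 16 * gpuPeakDp + 4 * nodePeakDp; // 9366.4 GFLOPS
            const double hplBest = 2891.0;     // measured, full sandwich
            const double watts   = 5030.5;     // measured at full load
            printf("system peak    : %.1f GFLOPS\n", peak);
            printf("HPL efficiency : %.1f%%\n", 100.0 * hplBest / peak); // ~31%
            printf("GFLOPS/watt    : %.3f\n", hplBest / watts);          // 0.575
            return 0;
        }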

    Figure 4: Power Consumption of the C410x and C6145 compute nodes

    Figure 5: Performance per Watt of the C410x and C6145 compute nodes