Authors:  Saeed Iqbal, Shawn Gao, Ty Mckercher

Seismic processing algorithms are an important part of Oil and Gas workflows due to the ever-increasing demand for high-resolution subsurface images that help delineate hydrocarbon reserves. These algorithms are, in general, very compute intensive: they typically require clusters of powerful servers to process vast amounts of data and produce results in a timely fashion. GPUs look very promising for further accelerating such Oil and Gas workflows by reducing the time required to nominate wells for a given prospect. One or more GPUs can be attached to each compute node, depending on the node architecture and the application's ability to utilize them.

One of the key factors in realizing this acceleration is the manner in which communication between GPUs takes place; high bandwidth between GPUs is desirable. GPUDirect 2.0, a feature of NVIDIA’s CUDA 4.1 toolkit, supports peer-to-peer communication among GPUs, effectively improving GPU-to-GPU bandwidth by reducing or eliminating extra host memory copies.
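
As a concrete illustration of the mechanism, below is a minimal CUDA sketch that enables peer access between two GPUs and issues a direct device-to-device copy. The device IDs and the 64 MB buffer size are arbitrary assumptions, and the code presumes both GPUs sit behind the same IOH so that peer access is reported as available.

    // Minimal sketch (hypothetical device IDs and buffer size): enable CUDA 4.x
    // peer-to-peer access between GPU 0 and GPU 1 and copy a buffer directly,
    // avoiding the extra staging copy through host memory when P2P is available.
    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void)
    {
        const size_t bytes = 64 << 20;        // 64 MB test buffer (arbitrary size)
        float *buf0 = NULL, *buf1 = NULL;
        int can01 = 0, can10 = 0;

        cudaDeviceCanAccessPeer(&can01, 0, 1);
        cudaDeviceCanAccessPeer(&can10, 1, 0);

        cudaSetDevice(0);
        cudaMalloc((void **)&buf0, bytes);
        if (can01) cudaDeviceEnablePeerAccess(1, 0);   // GPU 0 may access GPU 1

        cudaSetDevice(1);
        cudaMalloc((void **)&buf1, bytes);
        if (can10) cudaDeviceEnablePeerAccess(0, 0);   // GPU 1 may access GPU 0

        // Direct GPU-to-GPU copy; CUDA falls back to a host-staged copy if
        // peer access is not available between the two devices.
        cudaMemcpyPeer(buf1, 1, buf0, 0, bytes);
        cudaDeviceSynchronize();

        printf("P2P 0->1: %s\n", can01 ? "enabled" : "not available");

        cudaSetDevice(0); cudaFree(buf0);
        cudaSetDevice(1); cudaFree(buf1);
        return 0;
    }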

The Dell PowerEdge C410X provides a unique, flexible and powerful solution for taking advantage of GPUDirect 2.0 technology. First, it enables up to eight GPUs to be connected to a single compute node; second, its architecture is ideally suited for enabling GPUDirect among them. The Dell PowerEdge C6145 is a powerful compute node that attaches to the PowerEdge C410X. It offers a choice of using multiple HIC cards to connect to the C410X (see Figure 1). Users can select the number of HIC cards depending on their bandwidth requirements; for most current applications, one or two HIC cards per compute node are sufficient.

Figure 1: The C6145 connected to the C410X using two HIC cards in the compute node. Up to eight GPUs can be connected to a single HIC.

The configuration of the C6145 and GPU used in the study is shown in Table 1.

Table 1: Hardware and Software Configuration

PowerEdge C6145
  Processor: 4x AMD Opteron 6282 SE @ 2.6 GHz
  Memory: 128 GB, 1333 MHz
  OS: RHEL 6.1
  CUDA: 4.1

GPU
  Model: NVIDIA Tesla M2090
  GPU cores: 512
  GPU memory: 6 GB
  GPU memory bandwidth: 177 GB/s
  Peak performance, single precision: 1331 GFLOPS
  Peak performance, double precision: 665 GFLOPS
  Power capping: 225 W

External GPU chassis
  Model: PowerEdge C410X, 3U, sixteen GPUs

GPU Communication Patterns

The GPUs are assumed to be connected in a chain-like fashion. Most seismic shot records are quite large and may not fit in a single GPU’s memory space, so a record is streamed across multiple GPUs; during processing this requires frequent communication between GPUs for the boundary calculations on the halo regions. The goal of the communication is for each GPU to exchange data with its neighbors (the GPUs at the start and end of the chain have only one neighbor) so that computation can continue while halo data is transferred between GPUs. Several different communication patterns can achieve this data exchange; we evaluate the performance of two such common patterns in this study.
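
The sketch below illustrates this overlap idea for a single GPU in the chain; the kernel, the halo buffers and the two-stream layout are illustrative assumptions rather than the actual code used in the study.

    // Minimal sketch of overlapping the interior update with the halo transfer
    // to the right-hand neighbour in the chain. The kernel, buffer names and
    // the two streams (both assumed to be created on device "dev") are
    // illustrative, not the code used in the study.
    #include <cuda_runtime.h>

    __global__ void interior_kernel(float *field, size_t n)
    {
        size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
        if (i < n) field[i] += 1.0f;   // placeholder for the real stencil update
    }

    void step_on_gpu(int dev, int right_dev, float *field, size_t n,
                     const float *halo_src, float *halo_dst, size_t halo_bytes,
                     cudaStream_t compute, cudaStream_t comm)
    {
        cudaSetDevice(dev);

        // Launch the interior update on the compute stream ...
        interior_kernel<<<(unsigned)((n + 255) / 256), 256, 0, compute>>>(field, n);

        // ... while the halo region is copied peer-to-peer on the comm stream.
        if (right_dev >= 0)
            cudaMemcpyPeerAsync(halo_dst, right_dev, halo_src, dev,
                                halo_bytes, comm);

        cudaStreamSynchronize(compute);
        cudaStreamSynchronize(comm);
    }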

The first approach (Pattern A) consists of two phases. During the first phase all GPUs simultaneously send a message to their neighbor on the right, and during the second phase all GPUs send a message to their neighbor on the left. This pattern, also known as a Left/Right Exchange, is shown in Figure 2 below.

Figure 2: Communication Pattern A
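
A rough CUDA sketch of Pattern A is shown below; it assumes peer access has already been enabled between neighbours, that stream[i] was created on device i, and it uses one send buffer plus separate left/right receive buffers per GPU purely for brevity. All names are illustrative.

    // Rough sketch of Pattern A (left/right exchange) for a chain of nGPU
    // devices. Assumes peer access is already enabled between neighbours and
    // stream[i] was created on device i; buffer names are illustrative.
    #include <cuda_runtime.h>

    void left_right_exchange(int nGPU, float **sendBuf,
                             float **recvFromLeft, float **recvFromRight,
                             size_t halo_bytes, cudaStream_t *stream)
    {
        // Phase 1: every GPU i sends its right-edge halo to neighbour i+1.
        for (int i = 0; i < nGPU - 1; ++i) {
            cudaSetDevice(i);
            cudaMemcpyPeerAsync(recvFromLeft[i + 1], i + 1, sendBuf[i], i,
                                halo_bytes, stream[i]);
        }
        for (int i = 0; i < nGPU; ++i) {
            cudaSetDevice(i);
            cudaStreamSynchronize(stream[i]);
        }

        // Phase 2: every GPU i sends its left-edge halo to neighbour i-1.
        for (int i = 1; i < nGPU; ++i) {
            cudaSetDevice(i);
            cudaMemcpyPeerAsync(recvFromRight[i - 1], i - 1, sendBuf[i], i,
                                halo_bytes, stream[i]);
        }
        for (int i = 0; i < nGPU; ++i) {
            cudaSetDevice(i);
            cudaStreamSynchronize(stream[i]);
        }
    }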

The second approach (Pattern B) also occurs in two phases, as shown in Figure 3 below, and requires an even number of GPUs. In the first phase, GPUs are grouped into pairs as shown, and each pair exchanges messages in both directions. In the second phase the GPUs are paired differently, as shown in Figure 3, and each pair again exchanges messages. This pattern is also called a Pairwise Exchange.

Figure 3: Communication Pattern B
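
For comparison, here is a corresponding sketch of Pattern B under the same assumptions as the Pattern A sketch; within each phase only non-overlapping pairs are active.

    // Corresponding sketch of Pattern B (pairwise exchange) for an even number
    // of chained GPUs, under the same assumptions as the Pattern A sketch.
    #include <cuda_runtime.h>

    void pairwise_exchange(int nGPU, float **sendBuf,
                           float **recvFromLeft, float **recvFromRight,
                           size_t halo_bytes, cudaStream_t *stream)
    {
        // Phase 1: pairs (0,1), (2,3), ... exchange halos in both directions.
        for (int i = 0; i + 1 < nGPU; i += 2) {
            cudaSetDevice(i);
            cudaMemcpyPeerAsync(recvFromLeft[i + 1], i + 1, sendBuf[i], i,
                                halo_bytes, stream[i]);
            cudaSetDevice(i + 1);
            cudaMemcpyPeerAsync(recvFromRight[i], i, sendBuf[i + 1], i + 1,
                                halo_bytes, stream[i + 1]);
        }
        for (int i = 0; i < nGPU; ++i) {
            cudaSetDevice(i);
            cudaStreamSynchronize(stream[i]);
        }

        // Phase 2: pairs (1,2), (3,4), ... exchange; the two chain ends sit out.
        for (int i = 1; i + 1 < nGPU; i += 2) {
            cudaSetDevice(i);
            cudaMemcpyPeerAsync(recvFromLeft[i + 1], i + 1, sendBuf[i], i,
                                halo_bytes, stream[i]);
            cudaSetDevice(i + 1);
            cudaMemcpyPeerAsync(recvFromRight[i], i, sendBuf[i + 1], i + 1,
                                halo_bytes, stream[i + 1]);
        }
        for (int i = 0; i < nGPU; ++i) {
            cudaSetDevice(i);
            cudaStreamSynchronize(stream[i]);
        }
    }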

It is interesting to note that Pattern A and Pattern B produce the same final result as far as the data exchange is concerned; the difference lies in the timing of the individual message exchanges among GPUs.

Results

The performance of the above communication patterns with different numbers of GPUs is shown in Figure 4. The bandwidth is also compared between a single IOH (one HIC card) and dual IOHs (two HIC cards) connected to the C6145.

  1. Communication pattern B (pairwise exchange) results in better bandwidth among GPUs than pattern A (left/right exchange) in the vast majority of cases. The advantage of pattern B is more pronounced when the dual IOH configuration is used.

  2. The single IOH configuration results in higher bandwidth among GPUs, by about 50% to 70%. In the single HIC configuration, more communication occurs within the C410X and less data needs to move between the C410X and the C6145 (host).

The choice between a single HIC and a dual HIC configuration depends on the application and how much time it spends on computation compared to communication. GPUDirect can automatically find the best-performing paths in both configurations at runtime.
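
One simple way to see which paths are available at runtime is to query peer access for every device pair, as in the hypothetical check below; pairs behind different IOHs typically report no peer access, and CUDA then stages those copies through host memory.

    // Hypothetical topology check: report which device pairs support direct
    // peer access. Pairs that sit behind different IOHs typically report 0,
    // in which case CUDA stages those copies through host memory instead.
    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void)
    {
        int n = 0;
        cudaGetDeviceCount(&n);

        for (int src = 0; src < n; ++src) {
            for (int dst = 0; dst < n; ++dst) {
                if (src == dst) continue;
                int can = 0;
                cudaDeviceCanAccessPeer(&can, src, dst);
                printf("GPU %d -> GPU %d : peer access %s\n",
                       src, dst, can ? "yes" : "no");
            }
        }
        return 0;
    }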

Figure 4: The bandwidth for communication patterns A and B for different numbers of GPUs in the chain and different numbers of IOHs.

For more information:

NVIDIA GPUDirect Technology