Authors: Saeed Iqbal, Shawn Gao, Ty Mckercher
Seismic processing algorithms are an important part of Oil and Gas workflows because of the ever-increasing demand for high-resolution subsurface images that help delineate hydrocarbon reserves. These algorithms are compute intensive and typically require clusters of powerful servers to process vast amounts of data and produce results in a timely fashion. GPUs look very promising for further accelerating such workflows by reducing the time required to nominate wells for a given prospect. One or more GPUs can be attached to each compute node, depending on the compute node architecture and the ability of the application to use them.
One of the key factors in realizing this acceleration is the manner in which communication between GPUs takes place; high bandwidth between GPUs is desirable. GPUDirect 2.0, a feature of NVIDIA’s CUDA 4.1 toolkit, supports peer-to-peer communication among GPUs, which improves GPU-to-GPU bandwidth by reducing or eliminating extra host memory copies.
The Dell PowerEdge C410X provides a unique, flexible and powerful solution for taking advantage of GPUDirect 2.0 technology. First, it enables eight GPUs to be connected to a single compute node; second, its architecture is well suited to enabling GPUDirect among them. The Dell PowerEdge C6145 is a powerful compute node that attaches to the PowerEdge C410X through one or more HIC cards (see Figure 1). Users can select the number of HIC cards according to their bandwidth requirements; for most current applications, one or two HIC cards per compute node are sufficient.
Figure 1: The C6145 connected to C410X using two HIC cards in the compute node. Up to eight GPUs can be connected to a single HIC.
The configuration of the C6145 and GPU used in the study is shown in Table 1.
Table 1: Hardware and Software Configuration
Processors: 4 AMD Opteron 6282 SE @ 2.6GHz
Memory: 128GB 1333 MHz
GPU: NVIDIA Tesla M2090
GPU Memory bandwidth
Peak Performance: Single Precision
Peak Performance: Double Precision
External GPU Chassis: PowerEdge C410X (3U, sixteen GPUs)
GPUs are assumed to be connected in a chain-like fashion. Most seismic shot records are quite large and may not fit in the memory of a single GPU, so processing requires frequent communication between GPUs. The shot record is streamed across multiple GPUs, with boundary calculations performed on halo regions during processing. The goal of the communication is for each GPU to exchange halo data with its neighbors (the GPUs at the start and end of the chain have only one neighbor) so that computation can proceed while halo data is transferred between GPUs. Several communication patterns can achieve this data exchange; we evaluate the performance of two common ones in this study.
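The chain layout and its halo regions can be illustrated with a small host-side sketch. It is not taken from the study: the function name `partition_with_halos`, the one-dimensional split, and the fixed halo width are all assumptions made for illustration, and no GPU calls are involved.

```python
def partition_with_halos(n_points, n_gpus, halo):
    """Split n_points samples across a chain of n_gpus devices (illustrative only).

    Each entry records the index range a GPU owns, its neighbors in the chain,
    and the index ranges it needs to receive from those neighbors as halo data.
    """
    base, extra = divmod(n_points, n_gpus)
    parts = []
    start = 0
    for rank in range(n_gpus):
        # Spread any remainder over the first `extra` GPUs.
        size = base + (1 if rank < extra else 0)
        end = start + size
        has_left = rank > 0
        has_right = rank < n_gpus - 1
        parts.append({
            "rank": rank,
            "owned": (start, end),
            # GPUs at the ends of the chain have only one neighbor.
            "left_neighbor": rank - 1 if has_left else None,
            "right_neighbor": rank + 1 if has_right else None,
            "left_halo": (start - halo, start) if has_left else None,
            "right_halo": (end, end + halo) if has_right else None,
        })
        start = end
    return parts
```

For example, `partition_with_halos(1000, 4, 8)` gives the second GPU the owned range `(250, 500)` and a left halo of `(242, 250)`, which it must receive from the first GPU.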
The first approach (Pattern A) consists of two phases. During the first phase, all GPUs simultaneously send a message to their neighbor on the right; during the second phase, all GPUs send a message to their neighbor on the left. This pattern, also known as Left/Right Exchange, is shown in Figure 2 below.
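As a rough illustration, the two phases of Pattern A can be written as lists of (source, destination) pairs. The function name `left_right_exchange` and the 0-based GPU numbering are assumptions for this sketch, which models only the schedule and not the actual transfers.

```python
def left_right_exchange(n_gpus):
    """Return the two phases of Pattern A as lists of (source, destination) pairs."""
    # Phase 1: every GPU that has a right neighbor sends to it.
    phase_right = [(gpu, gpu + 1) for gpu in range(n_gpus - 1)]
    # Phase 2: every GPU that has a left neighbor sends to it.
    phase_left = [(gpu, gpu - 1) for gpu in range(1, n_gpus)]
    return [phase_right, phase_left]


for number, phase in enumerate(left_right_exchange(4), start=1):
    print(f"phase {number}: {phase}")
```

With four GPUs, phase 1 is `[(0, 1), (1, 2), (2, 3)]` and phase 2 is `[(1, 0), (2, 1), (3, 2)]`, so within a phase all traffic flows in one direction along the chain.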
Figure 2: Communication Pattern A
The second approach (Pattern B) also occurs in two phases, as shown in Figure 3 below, and requires an even number of GPUs. In the first phase, the GPUs are grouped into pairs as shown, and the two GPUs in each pair send messages to and receive messages from each other. In the second phase, the GPUs are paired differently, as shown in Figure 3, and again exchange messages within each pair. This pattern is also called Pairwise Exchange.
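A possible schedule for Pattern B is sketched here. Since the figure is not reproduced in the text, the pairing used (GPUs 0-1, 2-3, ... in the first phase and 1-2, 3-4, ... in the second) is an assumption, as are the function name `pairwise_exchange` and the 0-based numbering.

```python
def pairwise_exchange(n_gpus):
    """Return the two phases of Pattern B as lists of (source, destination) pairs."""
    if n_gpus % 2 != 0:
        raise ValueError("Pattern B requires an even number of GPUs")

    def both_ways(pairs):
        # Each pair exchanges in both directions during the same phase.
        return [msg for a, b in pairs for msg in ((a, b), (b, a))]

    # Phase 1 (assumed pairing): (0, 1), (2, 3), ...
    first_pairs = [(gpu, gpu + 1) for gpu in range(0, n_gpus - 1, 2)]
    # Phase 2 (assumed pairing): (1, 2), (3, 4), ...; the end GPUs sit idle.
    second_pairs = [(gpu, gpu + 1) for gpu in range(1, n_gpus - 1, 2)]
    return [both_ways(first_pairs), both_ways(second_pairs)]
```

With four GPUs this yields `[(0, 1), (1, 0), (2, 3), (3, 2)]` for phase 1 and `[(1, 2), (2, 1)]` for phase 2, so each link carries traffic in both directions within a single phase.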
Figure 3: Communication Pattern B
It is interesting to note that, once Pattern A and Pattern B complete, the final result as far as data exchange is concerned is the same. The difference is in the timing of particular message exchanges among GPUs.
The performance of the above communication patterns with different numbers of GPUs is shown in Figure 4. The bandwidth is also compared between a single IOH (one HIC card) and dual IOHs (two HIC cards) connected to the C6145.
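This equivalence can be checked with a small simulation that delivers every message of each schedule and compares what each GPU ends up holding. The schedules are written out inline so the sketch stands alone; the Pattern B pairing (0-1, 2-3, ... then 1-2, 3-4, ...) and the helper name `run_schedule` are assumptions, and the payloads are placeholder strings rather than real halo data.

```python
def run_schedule(phases, n_gpus):
    """Deliver every message in order; return {gpu: {sender: payload}}."""
    received = {gpu: {} for gpu in range(n_gpus)}
    for phase in phases:
        for src, dst in phase:
            received[dst][src] = f"halo from GPU {src}"
    return received


n = 6  # any even GPU count works for both patterns

# Pattern A: all send right, then all send left.
pattern_a = [
    [(g, g + 1) for g in range(n - 1)],
    [(g, g - 1) for g in range(1, n)],
]
# Pattern B (assumed pairing): pairs exchange both ways in each phase.
pattern_b = [
    [m for g in range(0, n - 1, 2) for m in ((g, g + 1), (g + 1, g))],
    [m for g in range(1, n - 1, 2) for m in ((g, g + 1), (g + 1, g))],
]

result_a = run_schedule(pattern_a, n)
result_b = run_schedule(pattern_b, n)
assert result_a == result_b  # same data ends up on every GPU

# Only the distribution of messages over the phases differs.
messages_per_phase_a = [len(phase) for phase in pattern_a]
messages_per_phase_b = [len(phase) for phase in pattern_b]
```

For six GPUs, Pattern A moves five messages in each phase, while the assumed Pattern B pairing moves six in the first phase and four in the second.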
The choice between a single-HIC and a dual-HIC configuration depends on the application and on how much time it spends on compute compared to communication. GPUDirect can automatically find the best-performing paths in both configurations at runtime.
Figure 4: Bandwidth for Communication Patterns A and B for different numbers of GPUs in the chain and different numbers of IOHs.
For more information:
NVIDIA GPUDirect Technology