GPUs Can Talk to Each Other Now!
10/24/2007

Figure 1: Overview of the GPUDirect technology in CUDA 4.0 (courtesy of NVIDIA Corporation).

NVIDIA is about to release its CUDA 4.0 toolkit, and what a difference it is likely to make for scientific computing applications that use multiple GPUs! Until now, GPUs have been unable to communicate with each other directly. Among other enhancements, CUDA 4.0 adds support for GPUDirect v2.0 technology, which enables peer-to-peer communication among GPUs. As shown in the figure above, all GPU-to-GPU messages previously had to pass through host server memory; GPUDirect eliminates these expensive and redundant memory copies by letting GPUs communicate directly. In this study we evaluate the performance improvement of GPUDirect v2.0 over a pure MPI implementation. The host is a single node of a PowerEdge C6100, equipped with two Intel Xeon X5650 2.67 GHz processors and 48 GB of 1,333 MHz DDR3 memory, attached to a PowerEdge C410x PCI-e expansion chassis.

The PowerEdge C410x has a unique design among external PCI-e expansion chassis for GPUs. It is a 3U enclosure with room for 16 GPUs. Up to eight host servers, such as the PE C6100, can connect to the C410x via a Host Interface Card (HIC) in the C6100 and an iPASS cable. The 16 GPUs are divided among the connected hosts according to a user-defined configuration, and the allocation can be changed dynamically through a web GUI. Currently, the available GPU-to-host ratios are 2:1, 4:1 and 8:1, so a single HIC can access up to 8 GPUs! The compute capacity of the C410x depends on the GPUs installed. For example, with 16 NVIDIA Tesla M2050 GPUs the theoretical peak compute capacity is 16.5 TFLOPS single precision and 8.25 TFLOPS double precision. The design of the C410x allows for a high GPU density solution with efficient power utilization.
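
The capacity and mapping numbers above are simple arithmetic, and a short sketch makes them easy to check. The per-GPU peak figures below (1.03 TFLOPS single precision, 0.515 TFLOPS double precision for the Tesla M2050) are assumptions taken from NVIDIA's published specifications, not from this study, and the function names are purely illustrative.

```python
# Rough sketch of the C410x capacity arithmetic.
# Assumption: per-GPU peak values come from NVIDIA's published Tesla M2050 specs.
M2050_SP_TFLOPS = 1.03    # single precision peak per GPU (assumed)
M2050_DP_TFLOPS = 0.515   # double precision peak per GPU (assumed)

CHASSIS_GPU_SLOTS = 16            # GPUs in a fully populated C410x
SUPPORTED_GPUS_PER_HOST = (2, 4, 8)  # the 2:1, 4:1 and 8:1 ratios


def chassis_peak_tflops(gpus=CHASSIS_GPU_SLOTS):
    """Return (single, double) precision theoretical peak for `gpus` M2050s."""
    return gpus * M2050_SP_TFLOPS, gpus * M2050_DP_TFLOPS


def hosts_served(gpus_per_host, total_gpus=CHASSIS_GPU_SLOTS):
    """Return how many hosts a full chassis can serve at a given GPU-to-host ratio."""
    if gpus_per_host not in SUPPORTED_GPUS_PER_HOST:
        raise ValueError("supported ratios are 2:1, 4:1 and 8:1")
    return total_gpus // gpus_per_host


if __name__ == "__main__":
    sp, dp = chassis_peak_tflops()
    print(f"Full chassis peak: {sp:.2f} TFLOPS SP, {dp:.2f} TFLOPS DP")
    for ratio in SUPPORTED_GPUS_PER_HOST:
        print(f"{ratio}:1 mapping -> {hosts_served(ratio)} hosts")
```

Under these assumptions a full chassis comes out at roughly the quoted 16.5 and 8.25 TFLOPS, and the 2:1 mapping is the one that corresponds to the maximum of eight connected hosts.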

The C410x is suitable for HPC applications that need multiple GPUs. One such application uses GPUs to simulate a Heisenberg Spin Glass (HSG) model and is specifically designed to take advantage of GPU peer-to-peer communication. This particular model was developed by researchers at the Istituto Applicazioni Calcolo, CNR and the Department of Physics, University of Rome. In general, spin glass modeling is a technique used in statistical mechanics to simulate and predict the behavior of various physical phenomena.

Results below show a 15-30% performance improvement due to GPUDirect. The runtimes of two problem sizes are compared. The 256x256x256 problem is run at different host-to-GPU mappings; as shown in the figure below, performance is better (faster) with GPUDirect. Similarly, a larger problem size of 512x512x512 is evaluated; it requires much more memory and can only run with 8 GPUs. It is worth noting that GPUDirect currently requires all GPUs to be connected to the same I/O hub in the host server, so the C410x provides the ideal platform for taking advantage of GPUDirect at this time.

Figure 2: Comparison of HSG model for two problem sizes of 256^3 and 512^3 (Lower is better).
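
For readers who want to reproduce this kind of comparison on their own runs, one common way to turn two runtimes into a percentage improvement is sketched below. The post does not state exactly how its percentages were computed, so the formula is an assumption, and the runtimes in the example are placeholders rather than measurements from this study.

```python
def runtime_improvement_pct(baseline_seconds, improved_seconds):
    """Percentage reduction in runtime relative to the baseline run.

    Assumption: improvement is measured as (baseline - improved) / baseline.
    """
    if baseline_seconds <= 0:
        raise ValueError("baseline runtime must be positive")
    if improved_seconds < 0:
        raise ValueError("improved runtime cannot be negative")
    return 100.0 * (baseline_seconds - improved_seconds) / baseline_seconds


if __name__ == "__main__":
    # Placeholder runtimes for illustration only, not results from the study.
    mpi_only_seconds = 10.0
    gpudirect_seconds = 8.0
    pct = runtime_improvement_pct(mpi_only_seconds, gpudirect_seconds)
    print(f"Runtime reduced by {pct:.1f}%")
```

Because the figure plots runtime, where lower is better, a positive value here means the GPUDirect run finished sooner than the MPI-only run.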

Dr. Saeed Iqbal, Ph.D. | Dell Senior Systems Engineer, HPC Lead Engineer GPU Solutions Engineering
Saeed Iqbal is a senior systems engineer in the Global Solutions Engineering Group at Dell Inc. Currently, he is the lead engineer on the integration of GPGPUs in the HPC clustering solution. He is also the lead engineer of the HPC Advisor online tool, which HPC customers use to design HPC clusters and associated high-performance parallel storage clusters. Previously, he led the Virtualization Solution Advisor project, Beowulf HPC clustering software projects and the Dell Grid Computing pilot project. His interests include performance modeling and analysis of parallel and distributed architectures, economic and power-efficient system design of HPC clusters, high-dimension optimization algorithms, DSP, neuro-computing, scheduling and resource management, and load-balancing algorithms. He holds a Ph.D. in HPC from the University of Texas at Austin.