GPUDirect Improves Communication Bandwidth Between GPUs on the C410X

High Performance Computing

A discussion venue for all things high performance computing (HPC), supercomputing, and the technologies that enable scientific research and discovery.

Authored by: Saeed Iqbal and Shawn Gao

NVIDIA has supported GPUDirect v2.0 technology since CUDA 4.0. GPUDirect enables peer-to-peer communication among GPUs: the peer-to-peer calls in CUDA allow GPUs to exchange messages directly with each other. The effective communication bandwidth attained in peer-to-peer mode depends on how the GPUs are connected to the system. Given an application's communication requirements and the available bandwidth, developers can decide how many GPUs to use and which GPUs are most suitable for their particular case. The three connection cases are examined below.

The PowerEdge C410x is a 3U enclosure that can hold 16 GPUs, for example the M2090. Up to eight host servers, such as the PE C6145 or PE C6100, can connect to the C410x via a Host Interface Card (HIC) in each host and an iPASS cable. The C410X has two layers of switches that connect the iPASS cables to the GPUs. The connected hosts (ports) are mapped to the 16 GPUs via an easy-to-use web interface, so the GPUs attached to a server can be changed without altering any cable connections. Currently, the available GPU-to-host ratios are 2:1, 4:1 and 8:1, so a single HIC can access up to 8 GPUs.
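
The ratios above determine how many host connections a fully populated enclosure can serve. As a quick arithmetic sketch in Python (the function name `hosts_served` is purely illustrative and not part of any Dell tool), assuming all 16 slots are populated and every host connection uses the same ratio:

```python
TOTAL_GPUS = 16      # GPU slots in one C410x
MAX_HOST_PORTS = 8   # host connections available on one C410x


def hosts_served(gpus_per_host: int) -> int:
    """Host connections needed to use all 16 GPUs at a given GPU-to-host ratio."""
    if gpus_per_host not in (2, 4, 8):
        raise ValueError("supported ratios are 2:1, 4:1 and 8:1")
    hosts = TOTAL_GPUS // gpus_per_host
    # Sanity check: never more connections than the enclosure has ports.
    assert hosts <= MAX_HOST_PORTS
    return hosts


for ratio in (2, 4, 8):
    print(f"{ratio}:1 -> {hosts_served(ratio)} host connections")
```

Under these assumptions, 2:1 uses all eight ports, 4:1 uses four, and 8:1 uses two.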

The figures below show the process of communication between GPUs conceptually. The C410X has 16 GPUs; each GPU is attached to one of the eight ports on the C410x through two layers of switches (shown as a single switch to simplify the diagrams). The black lines represent the connections between the IOHs (I/O hubs) in the hosts and the switches in the C410X; the red lines show the two communicating GPUs.

Figure 1: GPU-to-GPU communication via the host memory. The total bandwidth attained in this case is about 3 GB/s.

Case 1: Figure 1 shows a scenario in which the GPUs are connected to the system via separate IOHs. In this case peer-to-peer transfer is not supported, and the messages have to go through the host memory. The entire operation needs a device-to-host transfer followed by a host-to-device transfer, and the message is held in host memory during the transmission.

Figure 2: GPU-to-GPU communication via a shared IOH on the host compute node. The total bandwidth attained is about 5 GB/s.

Case 2: Now consider two GPUs that share an IOH. As shown in Figure 2, the GPUs are connected to the same IOH but on independent PCIe x16 links. GPUDirect is beneficial here because the message avoids the copy to host memory; instead it is routed directly through the IOH to the receiving GPU's memory.

Figure 3: GPU-to-GPU communication via a switch on the C410X. The total bandwidth attained is about 6 GB/s.

Case 3: Figure 3 shows peer-to-peer (P2P) communication among GPUs interconnected via a PCIe switch, as is the case on a C410X. This is the shortest path between two GPUs. GPUDirect is very beneficial in this case because the message does not need to be routed through the IOHs or host memory.

Figure 4: The measured bandwidths for the three cases show the advantage of P2P communication.

Results are shown in Figure 4. As the GPUs move “closer” to each other, GPUDirect allows a faster mode of peer-to-peer communication between them. Relative to the transfer through host memory, peer-to-peer communication via an IOH improves the bandwidth by about 53%, and via a switch by about 93%. In addition, the cudaMemcpy() function call can automatically select the best peer-to-peer method available between a given pair of GPUs. This feature allows the developer to use cudaMemcpy() independent of the underlying system architecture.
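
To make the three cases concrete, the following Python sketch models them as a simple lookup and estimates how long a message would take on each path. It is a back-of-the-envelope model, not a measurement tool: the `GpuLocation` class, the ID scheme and both function names are hypothetical, the bandwidths are the approximate values quoted in the captions of Figures 1 to 3, and latency and protocol overhead are ignored.

```python
from dataclasses import dataclass

# Approximate bandwidths in GB/s, taken from the captions of Figures 1-3.
APPROX_BANDWIDTH_GBS = {
    "host memory": 3.0,   # Case 1: GPUs reached through separate IOHs
    "shared IOH": 5.0,    # Case 2: same IOH, independent PCIe links
    "PCIe switch": 6.0,   # Case 3: same switch inside the C410x
}


@dataclass(frozen=True)
class GpuLocation:
    """Hypothetical description of where a GPU sits in the topology."""
    ioh: int      # ID of the host IOH the GPU is reached through
    switch: int   # ID of the C410x switch the GPU sits behind


def transfer_path(a: GpuLocation, b: GpuLocation) -> str:
    """Classify a GPU pair into one of the three cases described above."""
    if a.ioh != b.ioh:
        return "host memory"
    if a.switch != b.switch:
        return "shared IOH"
    return "PCIe switch"


def transfer_seconds(a: GpuLocation, b: GpuLocation, message_gb: float) -> float:
    """Estimated transfer time: message size divided by path bandwidth."""
    return message_gb / APPROX_BANDWIDTH_GBS[transfer_path(a, b)]


gpu0 = GpuLocation(ioh=0, switch=0)
gpu1 = GpuLocation(ioh=0, switch=0)   # same switch as gpu0
gpu2 = GpuLocation(ioh=0, switch=1)   # same IOH, different switch
gpu3 = GpuLocation(ioh=1, switch=2)   # different IOH

for peer in (gpu1, gpu2, gpu3):
    path = transfer_path(gpu0, peer)
    seconds = transfer_seconds(gpu0, peer, 1.5)
    print(f"{path}: {seconds:.2f} s for a 1.5 GB message")
```

With these assumed figures, a 1.5 GB message takes roughly 0.25 s through the switch, 0.30 s through a shared IOH, and 0.50 s through host memory, which is the kind of estimate a developer can use when choosing which GPUs to pair.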

As shown above, the PowerEdge C410X is ideally suited to utilize GPUDirect technology. The PowerEdge C410x is unique in the compute power, density and flexibility it offers for designing GPU-accelerated compute systems.

  • Good morning,

    I have a few questions, as we are going to try this soon. What does "CPU [12]" represent in the above diagram: one multi-core processor, or one full compute node? If the answer is a full compute node, how are these nodes connected such that you are getting 3.3GB/s throughput?

    If the C410x is set to 8:1 mode, can all 8 GPUs of one CPU exchange data via P2P without the data leaving the C410x box?

    Thanks for your time,


  • Hi Matt,

    Thanks for your interest in our GPU solutions.

    CPU 1 and CPU 2 each represent one multi-core processor.

    Yes. If the C410X is set to 8:1 mode, all 8 attached GPUs can exchange data via P2P without the data leaving the C410X.