Dell's Dr. Jeff Layton

Part 4 – GPU Configurations Using Dell Systems
GPUs are arguably one of the hottest trends in HPC. They can greatly improve performance while reducing system power consumption. However, because GPU computing is still evolving, both in application development and in tool set creation, GPU systems need to be as flexible as possible.

In the last blog I talked about the basic building blocks Dell has created for GPU systems. In this article I want to talk about possible GPU configurations and how they work well for the rapidly evolving GPU ecosystem.

Dell GPU Configurations
In the last article, I talked about the various Dell hardware components that can be used to create GPU computing solutions. These components were designed around a solution strategy that can be summed up in a single word – flexibility. Solutions need to be created so that they can be easily adapted to evolving applications and evolving GPU tools.

The PowerEdge C410x that I mentioned in the last blog is an external chassis that holds GPUs, or really any PCIe device that meets the chassis' specific requirements. To refresh your memory, it can accommodate up to 16 GPUs in PCIe G2 x16 slots in a 3U chassis. It also has 8 PCIe G2 x16 connections that connect the chassis to the host nodes. The current version of the chassis lets you reduce the number of external connections to 4, so that each x16 connection into the chassis serves 4 GPUs, or to 2, so that each x16 connection serves 8 GPUs.

One approach that Dell has taken in the creation of GPU solutions is to develop solution “bricks” or solution “sandwiches” that combine the various components and form GPU compute solutions. For example, you could begin with the Dell PowerEdge C6100 that has four two-socket system boards in a 2U package, and connect two of these to a single Dell PowerEdge C410x. An example of this is shown below in Figure 1.

Figure 1: Two Dell PowerEdge C6100 (top and bottom) and one Dell PowerEdge C410x (center)

In the middle of this group (or sandwich) is the C410x, and you can see its 10 front GPU sleds in Figure 1. Above and below the C410x are the C6100 units (each has 24 2.5” drives in this configuration, but 3.5” drive configurations are also available). In total there are eight dual-socket Intel Westmere-based systems, each with a PCIe x8 QDR InfiniBand card and a PCIe x16 HIC card connecting to the C410x, which holds up to 16 GPUs.

A quick diversion – probably the best way to talk about GPU configurations is to use the nomenclature

(Number of GPUs) : (Number of x16 slots)

This nomenclature is better than the number of GPUs per node or per socket because it describes how many GPUs are sharing a single PCIe G2 x16 slot. That matters because applications that are early in their development cycle and heavily dependent on CPU-to-GPU data transfer efficiency typically need lots of host-to-GPU bandwidth.

Using the configuration in Figure 1, we can create a 1:1 configuration (1 GPU per x16 HIC). This means we have a total of 8 GPUs in the C410x since we have 8 system boards in the two C6100s. This also gives each GPU the full x16 bandwidth if that is needed.

You could start with this configuration and add 8 more GPUs at a later date when your application(s) need them. If you do, you create a 2:1 configuration (2 GPUs per x16 slot). In the best case, each GPU gets full access to the entire PCIe x16 bandwidth. In the worst case, both GPUs communicate with the host over the PCIe bus at the same time, completely saturating it, so that each GPU effectively gets only x8 bandwidth. In our experience, however, except for a small number of applications, this rarely happens.

Recall that the C410x has a maximum of 8 incoming HIC connections, but you don't have to use all eight of them. So we can remove one of the C6100 units in Figure 1, giving us Figure 2: one C6100 with four system boards connected to a single C410x.

Figure 2: Single C6100 (bottom) connected to a single C410x (top)

If Figure 1 is sometimes called a “sandwich,” you can think of Figure 2 as an “open-face sandwich.” With this configuration we can increase the number of GPUs assigned to each system board.

Using Figure 2, you could create a 3:1 configuration (3 GPUs per x16 HIC) by populating only 12 of the 16 slots in the C410x. This also gives you the flexibility to add four more GPUs at a later date if you need them. The configuration can be extended to 4:1 (4 GPUs per x16 HIC) by using all 16 slots in the C410x; each system board then has 4 GPUs in addition to its two Intel Westmere processors, a QDR IB card, and up to six 2.5” drives.

With the updated C410x system you can go to 8:1 (8 GPUs per x16 HIC), but this means you would need two C410x systems connected to a single C6100, as shown below in Figure 3.

Figure 3: Single C6100 (middle) connected to two C410x units (top and bottom)

In keeping with the sandwich theme, you can call this a “reverse sandwich,” where the bread has become the filling and the filling the bread (although I don't think I want to try a reverse peanut butter sandwich). In Figure 3, each x16 HIC connection connects to 8 GPUs. While this may sound extreme, there are applications that can scale with the number of GPUs even over a single HIC. Nvidia's new version of CUDA, version 4.0, allows direct GPU-to-GPU communication, so the host-to-GPU bus isn't as much of a bottleneck in overall application performance (look for upcoming blogs from Dell that discuss this).
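To give a feel for what that looks like in code, here is a minimal sketch of enabling peer-to-peer access with the CUDA 4.0 runtime API. This is an illustration, not Dell's benchmark code: error checking is omitted for brevity, and it needs a CUDA 4.0+ toolkit and peer-capable GPUs to build and run.

```c
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);   /* the current CUDA driver caps this at 8 */

    /* Check whether device 0 can directly access device 1's memory. */
    int can_access = 0;
    cudaDeviceCanAccessPeer(&can_access, 0, 1);

    if (count >= 2 && can_access) {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);  /* device 0 may now touch device 1 */

        /* Copy GPU-to-GPU without staging the data through host memory,
           so the host-to-GPU PCIe link is not involved in the transfer. */
        size_t bytes = 1024 * sizeof(float);
        float *src = NULL, *dst = NULL;
        cudaSetDevice(1);
        cudaMalloc((void **)&src, bytes);
        cudaSetDevice(0);
        cudaMalloc((void **)&dst, bytes);
        cudaMemcpyPeer(dst, 0, src, 1, bytes);
        printf("peer copy from GPU 1 to GPU 0 complete\n");
    }
    return 0;
}
```

The key point for the configurations above is that once peer access is enabled, GPU-to-GPU traffic can stay within the C410x rather than crossing the shared HIC link to the host.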

One other comment about the configuration in Figure 3 – eight is the largest number of GPUs you can currently attach to a single system. This is a limitation of the current CUDA driver, but I'm betting that someday Nvidia will raise it.

We can extend this sandwich idea of host nodes and GPUs to the new Dell PowerEdge C6145. If you remember from my previous article, this is a 2U chassis with two four-socket AMD 61xx series system boards, with four PCIe G2 x16 slots (electrical and mechanical) as well as a PCIe G2 x8 slot for InfiniBand. However, we're still limited to a total of 8 GPUs per C6145 board, so in the configuration in Figure 3 we can connect either two GPUs to each of the four HICs or four GPUs to each of two HICs. This demonstrates the flexibility of these configurations.

Contrast this approach with using internal GPUs. If the GPUs are internal, you are stuck with the number of GPUs the system comes with, and that's it. At best you could start with one GPU and add a second at some point in the future. But to do that you have to open up the case, pop the second one in, hope you don't cause any problems, and reboot. With the approach presented here, you just add more GPUs via a sled, assign them to a node, and reboot the node. You never have to open a case. If you have ever experienced the joy of popping open a case and closing it up only to find the system won't boot, you know what I'm talking about.

GPU applications and tools are evolving quickly. During this evolutionary process, the hardware requirements that allow an application to run most efficiently will also change. To be the most cost effective, your GPU configurations need to be flexible. This primarily means being able to vary the number of GPUs per node and how they are connected, which translates into separating the GPUs from the CPUs.

Dell has developed a unique external PCIe chassis that allows you to put GPUs in a separate chassis that is optimized for the high power and cooling requirements of GPUs. This chassis, called the C410x, can be connected to host nodes via HIC cards and PCIe cables according to your requirements. This means that you can upgrade/change the host nodes or the GPUs independent of one another.

In this article I’ve tried to illustrate some GPU configurations that use Dell host nodes and the C410x. This combination allows a great deal of flexibility which is precisely what you need for the rapidly evolving world of GPU computing.

In the next article I will present some benchmarks of common GPU applications.

-- Dr. Jeffrey Layton