Dr. Jeff Layton
Dell's Dr. Jeff Layton

GPUs are arguably one of the hottest trends in HPC. They can greatly improve performance while reducing the power consumption of systems. However, because the techniques of GPU computing are still evolving both for application development and for tool-sets, GPU systems need to be as flexible as possible.

In the last two articles I talked about why flexibility is so important. In this article I want to talk about what Dell is doing to make GPU Computing more flexible and the building blocks Dell has created for building GPU systems.

Dell’s Overall Approach
I think the mantra that has driven Dell’s overall approach to GPU Compute solutions is flexibility. The idea is to develop flexible configurations that allow components to be mixed and matched to meet application and user requirements, so that you don’t need a forklift to remove hardware when something changes (we commonly call this the “forklift upgrade”).

A key approach that Dell has pursued in creating flexible GPU Compute solutions is to separate the GPUs from the host server. With the advent of well-performing PCI-Express (PCIe) networks (for example, see this article: http://www.hpcwire.com/features/A-Case-for-PCI-Express-as-a-High-Performance-Cluster-Interconnect-114511164.html?viewAll=y), PCIe switches (PLX is a good example), and low switch latency, there is very little, if any, performance impact when using PCIe networks. These recent developments have allowed us to separate the GPUs from the host nodes, using a PCIe switch to communicate between the two. This separation immediately gives you great flexibility and adaptability.

For example, you can now upgrade or change your compute nodes without having to change the GPUs. If Intel comes out with a new CPU that requires you to upgrade your compute nodes, you don’t have to throw out your GPUs or pull GPUs from existing nodes and try to cram them into something new. It also allows you to buy more compute nodes and share GPU resources among them (more on how you do that later).

You can also do the reverse – upgrade, change, or add GPUs if the technology changes, all without having to change your compute nodes.

Perhaps the most important thing it allows you to do is add or change the number of GPUs attached to each compute node as you need them. Remember that GPU applications and tools evolve, and more mature applications can often use more than 1 GPU per PCIe x16 connection. Separating the GPUs from the host nodes means that when an application is first written you can give certain compute nodes a single GPU per PCIe x16 slot, because the application is likely to really stress the PCIe bus for data transfers between the host and GPU. Once the application is less dependent upon the PCIe bus, you could then give the compute node two GPUs and run two copies of the application on the node (one on each CPU). And finally, once the application runs on multiple GPUs, you could give the compute nodes 4 or more GPUs.

This scenario of adding GPUs to compute nodes as the application evolves is almost impossible with a coupled CPU/GPU configuration such as those with internal GPUs. With this type of configuration you have a fixed number of GPUs, and the only way to change that number is to buy more, open up the case, and put them in (if you have room and can power and cool them in the chassis). With a fixed configuration you are stuck with the number of GPUs it can handle, and that’s it. By separating the host node and the GPUs we can adjust the number of GPUs assigned to specific compute nodes, resulting in the best possible configuration for the application when it runs.

Another benefit of the separation of host node and GPUs is power and cooling. When nodes are designed, engineering teams spend a great deal of time designing the node for the best possible airflow and temperature distribution. Power supplies are likewise designed for typical power usage models, which almost always means the node under load (having idle HPC nodes is really not HPC anymore). But if you have to design a node that may or may not have GPUs inside it, your power draw, thermal loads, and airflow have a much wider range of possible values. This usually means that the node is not as efficient as it could be from a power and cooling perspective. Separating the CPUs and GPUs allows you to create optimized power and cooling designs for the host nodes (CPUs) and for the GPUs.

A final benefit that separating CPUs and GPUs creates is the ability to provide redundant power to both the host node and the GPUs. GPUs have tremendous computational ability but they also need a fair amount of power – typically 225W per card. The largest practical power supplies are about 1400W (there are some 1600W power supplies but they are relatively new). A typical two-socket host node can use anywhere from 200W-600W (or even more) so that leaves about 800W for GPUs. That means at best we could use four GPUs inside a host node before exceeding the power limitations of a single power supply. Anything above this leaves you without any power redundancy. To gain back any redundancy you need to add more power supplies which means you likely aren’t at the optimal point in the power curve plus you need more room for the power supplies as well as more airflow which limits the size of the chassis and adds cost.
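To make the arithmetic concrete, here is a minimal sketch of that power budget. It uses only the round numbers quoted above (a 1400W supply, 225W per GPU, 200W to 600W for the host); the function name is mine, and real cards and hosts will draw different amounts.

```python
# Back-of-the-envelope power budget for GPUs inside a single host node.
# All wattages are the round figures from the text, not measurements.

def max_internal_gpus(psu_watts, host_watts, gpu_watts=225):
    """Return how many whole GPUs fit in the power left after the host."""
    headroom = psu_watts - host_watts
    if headroom <= 0:
        return 0
    return headroom // gpu_watts

# Sweep the host draw across the range quoted in the text.
for host_watts in (200, 400, 600):
    count = max_internal_gpus(1400, host_watts)
    print(f"host at {host_watts}W -> {count} GPUs on one 1400W supply")
```

Depending on the host draw, the count moves between three and five, which brackets the four GPUs mentioned above. Either way, a handful of GPUs uses up a single supply and leaves nothing for redundancy.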

Cramming all of that hardware into a single node is very difficult, and you see that many manufacturers don’t have redundant power supplies in host systems with internal GPUs precisely because they don’t have room or don’t have the airflow. Combining expensive GPUs and host nodes in a single chassis without redundant power supplies truly handicaps the performance potential because of possible downtime.

So separating the GPUs from the host nodes allows the host node to retain the redundant power supplies that are so typical of enterprise-class host nodes, and it also lets you create a redundant power source for the GPUs.

If you like, one way to look at the benefit of separating the GPUs from the host node is investment protection. You don’t want to invest in a system that requires you to change out your hardware when applications or tools evolve. Moreover, if CPU or GPU technology changes, and it always does, you are once again stuck with a “forklift” upgrade. Separating the GPUs from the host nodes allows you to adapt to evolving applications and protect your hardware investment from very costly forklift upgrades.

So how did Dell achieve the separation of GPUs and CPUs? What systems does Dell have that allow me to protect my GPU investment?

Dell Systems
Dell GPU configurations have been designed for a great deal of flexibility while giving users a choice of host nodes. To achieve this, Dell created the PowerEdge C410x PCI-Express Expansion chassis:

Dell PowerEdge C410x

It is a 3U high chassis that can hold up to 16 PCIe G2 hot-swap cards, which we call “sleds”, that each have a PCIe G2 x16 interface. You can connect up to 8 external hosts through PCIe x16 connectors on the chassis. Alternatively, you can assign up to 8 of the sleds to a single x16 connector allowing two hosts to access 8 slots each.

The chassis has 4 x 1400W hot-plug power supplies giving a maximum redundancy of N+1 and a total power draw of 3,600W. It has an on-board BMC (Baseboard Management Controller), IPMI 2.0 capability, and a dedicated management port.

There are no CPUs in this chassis – just PCIe slots. A friend of mine refers to it as “room and board” for GPUs. Currently it supports the Nvidia M2050 and M2070, but you aren’t limited to just GPUs. Since the C410x has general PCIe slots, many other cards could be supported, including InfiniBand HCAs, Fibre Channel cards, and other PCIe devices. Future GPUs will be evaluated for inclusion as well. Each card is first put into a “sled”, shown below, which is then inserted into a PCIe G2 x16 slot in the chassis:

Sled for C410x

The sleds slide into the chassis with 10 sleds in the front and 6 in the back, as shown in the two figures below.

Dell PowerEdge C410x Front view with 10 sleds

Back view of the Dell PowerEdge C410x with 6 sleds

In the back view you can see the four power supplies on the left-hand side of the unit. Just below those are the eight HIC connectors that connect the host nodes to the sleds (GPUs).

Recently, Dell has also developed a “Common Carrier” for the C410x. The goal is to make something fairly generic so that custom sleds don’t have to be designed for every PCIe card. The common carrier is shown below:

Dell PowerEdge C410x Common Carrier

The sled supports low-profile, half-length, single-width PCI-Express cards with standard full-height brackets. You can also see that the sled allows external cabling so you can connect network or video cables to the cards. Currently, the sled is qualified with Mellanox InfiniBand cards, but more cards will follow. Empty sleds are also available, but it is advisable to talk to Dell before using any PCIe cards to make sure they are cooled properly.

To connect a host node to the C410x you need to use a HIC (Host Interface Card) inside the host node, just like you might use with the Nvidia S1070 or S2050 units. You then connect the HIC(s) in the host node via a PCIe cable to a HIC port on the C410x. When you reboot the host node, it will recognize the card(s) that it is connected to as if they were actually plugged into a slot in the host node.

There are many options for connecting the GPUs in the C410x to host ports (HICs). The first thing you can do is put 8 GPUs in the C410x, one for each of the outgoing HIC ports in the back of the unit. Each outgoing HIC port is connected to a HIC in a host node. This gives you one GPU per PCIe x16 slot, which Dell calls 1:1.

The second obvious option is to put 16 GPUs in the C410x so that each of the eight outgoing HIC ports, and therefore each of the eight host nodes, has 2 GPUs. We call this ratio 2:1 (2 GPUs per PCIe x16 slot).

These two solutions are easy to set up because you just plug in the sleds that contain the GPUs and cable the corresponding HIC port on the back of the chassis to the correct host node. You can use the management tool to log into the chassis and check that the correct GPUs are mapped to the correct HIC ports. But the tool also gives you the opportunity to do more. With the tool you can assign up to 8 of the GPUs to a specific HIC port on the C410x. This means you can give 1, 2, 3, 4, 5, 6, 7, or 8 GPUs to a specific HIC port, which then runs to a host node. The ability to connect more than 4 GPUs to a specific HIC port is a new feature and requires an updated C410x chassis. Older chassis are limited to 4 GPUs per HIC port.
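If you plan these mappings on paper (or in a script) before touching the chassis, a small checker can catch mistakes early. The sketch below is not the Dell management tool and does not talk to the chassis. The dictionary layout and the sled and port numbering are hypothetical, and it pretends any sled can be given to any port, which the real chassis may not allow. It only encodes the limits described above: 8 HIC ports, 16 sleds, and at most 8 GPUs per port (4 on an older chassis).

```python
NUM_HIC_PORTS = 8   # host connectors on the back of the chassis
NUM_SLEDS = 16      # 10 sleds in the front, 6 in the back

def validate_mapping(mapping, max_per_port=8):
    """Check a {hic_port: [sled, ...]} plan against the chassis limits.

    Use max_per_port=4 for an older chassis. Returns a list of problems;
    an empty list means the plan respects the limits checked here.
    """
    problems = []
    owner = {}  # sled -> first port that claimed it
    for port, sleds in sorted(mapping.items()):
        if not 1 <= port <= NUM_HIC_PORTS:
            problems.append(f"port {port} does not exist")
        if len(sleds) > max_per_port:
            problems.append(
                f"port {port} has {len(sleds)} sleds, limit is {max_per_port}")
        for sled in sleds:
            if not 1 <= sled <= NUM_SLEDS:
                problems.append(f"sled {sled} does not exist")
            elif sled in owner:
                problems.append(
                    f"sled {sled} is on ports {owner[sled]} and {port}")
            else:
                owner[sled] = port
    return problems

# The 1:1 and 2:1 layouts from the text, with made-up sled numbering.
one_to_one = {port: [port] for port in range(1, 9)}
two_to_one = {port: [2 * port - 1, 2 * port] for port in range(1, 9)}

print(validate_mapping(one_to_one))
print(validate_mapping(two_to_one))
```

Both layouts should come back with an empty problem list. Giving the same sled to two ports, or five sleds to one port with max_per_port=4, should not.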

A common question about these GPU configurations is the bandwidth from the GPU to the host. The best way to think about this is to “bound” the problem (i.e., best performance and lowest performance). Each GPU in the C410x is connected to a PCIe x16 slot (electrical and physical). This means that even with 2 or more GPUs per HIC port to the host, each GPU is capable of reaching the maximum throughput of the x16 link if no other GPUs are communicating at the same time (it all depends upon the application). The slowest performance is reached when all of the GPUs are communicating at the same time with large amounts of data. In the case of 8 GPUs per HIC port, each GPU will get 1/8 of the x16 bandwidth. In the case of 4 GPUs, each will get 1/4 of the x16 bandwidth.

In our experience to this point, we rarely see applications that use multiple GPUs where each GPU is communicating large amounts of data at the same time. Consequently, real application performance falls between the two bounds mentioned previously.
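The two bounds are easy to put numbers on. This is a rough sketch that assumes the nominal one-direction rate of a PCIe G2 x16 link, about 8 GB/s (5 GT/s per lane, 16 lanes, 8b/10b encoding). That figure is a theoretical ceiling that I am supplying, not a measurement from the C410x, and real transfers will come in below it.

```python
# Nominal one-direction bandwidth of a PCIe G2 x16 link in GB/s:
# 5 GT/s per lane * 16 lanes * 8/10 encoding = 64 Gbit/s = 8 GB/s.
# This is an assumed theoretical ceiling, not a measured value.
PCIE_G2_X16_GBPS = 8.0

def per_gpu_bandwidth_bounds(gpus_per_port, link_gbps=PCIE_G2_X16_GBPS):
    """Return (worst_case, best_case) GB/s for one GPU behind a HIC port."""
    if not 1 <= gpus_per_port <= 8:
        raise ValueError("a HIC port takes between 1 and 8 GPUs")
    best = link_gbps                   # only this GPU is transferring
    worst = link_gbps / gpus_per_port  # every GPU transfers at once
    return worst, best

for n in (1, 2, 4, 8):
    worst, best = per_gpu_bandwidth_bounds(n)
    print(f"{n} GPU(s) per port: {worst:.1f} to {best:.1f} GB/s per GPU")
```

Where a real application lands between the two numbers depends on how often its GPUs move data at the same time.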

Configuring the number of GPUs to a host node can be scripted by simple commands to the management tool so you could easily make this part of a job submitted to a job scheduler. However, there are two things you must pay attention to when configuring a system in this fashion:

1. You have to make sure the GPUs assigned to the host node are not in use with other host nodes. Otherwise the host node that they are originally connected to thinks that it has suffered a hardware failure (basically you have removed the GPU from the unit).
2. When the number of GPUs is changed you need to reboot the node because there are new PCIe devices, namely GPUs, and the node needs to go through the BIOS reboot cycle so the hardware is recognized.
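Here is one way those two checks might look in a script that runs before a job starts. It is only a planning sketch. The function, the host names, and the ownership table are hypothetical, and it does not issue any commands to the chassis management tool or reboot anything. It refuses a request that would take a GPU away from another host (caveat 1) and reports whether the host needs a reboot because its set of GPUs changed (caveat 2).

```python
def plan_reassignment(owner_by_gpu, host, wanted_gpus):
    """Plan giving exactly `wanted_gpus` to `host`.

    owner_by_gpu maps a GPU (sled) id to the host that currently has it,
    or to None if the GPU is free.
    Returns (to_attach, to_detach, needs_reboot).
    Raises ValueError if a wanted GPU is unknown or belongs to another host.
    """
    wanted = set(wanted_gpus)
    for gpu in sorted(wanted):
        if gpu not in owner_by_gpu:
            raise ValueError(f"GPU {gpu} is not in the chassis")
        current = owner_by_gpu[gpu]
        # Caveat 1: never pull a GPU out from under a different host.
        if current is not None and current != host:
            raise ValueError(f"GPU {gpu} is still attached to {current}")

    already = {g for g, h in owner_by_gpu.items() if h == host}
    to_attach = sorted(wanted - already)
    to_detach = sorted(already - wanted)
    # Caveat 2: any change in the host's GPUs means a reboot.
    needs_reboot = bool(to_attach or to_detach)
    return to_attach, to_detach, needs_reboot

# Made-up ownership table: GPU id -> host name (None means free).
owners = {1: "node01", 2: None, 3: "node02", 4: None}

print(plan_reassignment(owners, "node01", [1, 2]))  # adds GPU 2
print(plan_reassignment(owners, "node01", [1]))     # nothing changes
```

A job prolog could call something like this first and only then issue the real management commands and the reboot.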

This gives you the ability to also accommodate applications and users that are at various stages in application development. You might have a very seasoned and developed application used by several users that is multi-GPU capable using maybe 4-8 GPUs per PCIe x16 interface. But you also might have users with GPU applications that can only use one GPU per PCIe x16 slot. Using the C410x you can have the user request the number of GPUs they need per host node and the resource manager will create those resources for the user and reboot the nodes given the above two caveats.

It’s pretty obvious that the C410x gives you a great deal of flexibility as your applications and development tools evolve. Contrast this approach that allows you to grow and adapt your GPU compute configuration to one where the compute nodes simply have some internal GPUs slapped together in some sort of configuration. You are stuck with whatever configuration the node provides and cannot adapt to evolving applications and development tools.

Dell also has several host node options to connect to the C410x allowing many GPU application scenarios to be addressed. There are three primary host nodes to attach to the C410x:

The C6100 is a 2U chassis that contains four independent two-socket Intel Xeon 5600 systems as shown below.

HPC GPU Flexibility: The Evolution of GPU Applications and Systems  Part 3 – Dell’s Approach - The Dell TechCenter
Dell PowerEdge C6100 Rear View

The chassis used in the C6100 has dual redundant power supplies that power all of the system boards. The chassis has two hot-plug drive configurations: (1) 24 2.5” drives, (2) 12 3.5” drives. The drives are split evenly between the four systems.

Each of the four systems is a 2-socket Intel Xeon 5600 system with 12 DIMM slots, two GigE ports, a BMC that is IPMI 2.0 compliant, a PCIe x8 slot that can be used for RAID controllers or network cards, and a PCIe G2 x16 slot. There are 10GigE and InfiniBand cards for the x8 slot, while the PCIe x16 slot can be used for a HIC card to connect to the C410x chassis. This allows each system board in the C6100 to connect to 1-8 GPUs in the C410x.

The PowerEdge C6105 is similar to the C6100 but uses the AMD Opteron 4000 series of processors on the system boards.

The recently released PowerEdge C6145 is also a shared infrastructure system but with two four-socket AMD Opteron 6100 nodes in a 2U chassis.

Dell PowerEdge C6145 – Rear View

The chassis is basically the same chassis as the C6100 and C6105 but has two 4-socket systems that use the AMD Opteron 6100 processors. Each system has 32 DIMM slots, two GigE ports, a PCIe G2 x8 mezzanine card, three PCIe G2 x16 slots, and a PCIe x16 HIC port (iPASS). The image below shows the rear of the C6145.
Dell PowerEdge C6145 – Rear View Showing Ports

The two redundant power supplies are on the left and the two boards are on the right, one above the other. You can see that a single board has a row of external connectors at the bottom: Ethernet ports on the left, BMC port on the right, video ports in the middle, USB toward the right, and on the far right the fixed x16 HIC port. Above this row of external connectors are three PCIe G2 x16 slots. On the far right, in the same row as the x16 slots, is the slot for the PCIe x8 mezz card (typically either a RAID controller or a network card such as 10GigE or InfiniBand).

The C6145 chassis has the same drive options as the C6100 and C6105: (1) up to 24 2.5” drives or (2) up to 12 3.5” drives in the front. The drives are split evenly between the two system boards.

With these three host nodes and the C410x you can create many different GPU configurations, including some very flexible systems where GPUs are assigned to host nodes in response to users’ jobs. In the next article I will talk about some of the possible configurations. In the meantime, please let me know how you’re using GPUs in HPC and how you would like to use them.