GPUs are arguably one of the hottest trends in HPC. They can greatly improve performance while reducing the power consumption of systems. However, because the techniques of GPU computing are still evolving, both for application development and for tool-sets, GPU systems need to be as flexible as possible.

In the last two articles I talked about why flexibility is so important. In this article I want to talk about what Dell is doing to make GPU computing more flexible and the building blocks Dell has created for building GPU systems.

Dell’s Overall Approach

I think the mantra that has driven Dell’s overall approach to GPU compute solutions is flexibility. The idea is to develop flexible configurations that allow components to be mixed and matched to meet application and user requirements, so that you don’t have to use a forklift to remove hardware when something changes (we commonly call this the “forklift upgrade”).

A key approach Dell has pursued in creating flexible GPU compute solutions is to separate the GPUs from the host server. With the advent of well-performing PCI-Express (PCIe) networks (for example, read this article: http://www.hpcwire.com/features/A-Case-for-PCI-Express-as-a-High-Performance-Cluster-Interconnect-114511164.html?viewAll=y) and low-latency PCIe switches (PLX is a good example), there is very little, if any, performance impact when using PCIe networks. These developments have allowed us to separate the GPUs from the host nodes, using a PCIe switch to communicate between the two.

This separation immediately gives you great flexibility and adaptability. You can now upgrade or change your compute nodes without having to change the GPUs. For example, if Intel comes out with a new CPU that requires you to upgrade your compute nodes, you don’t have to throw out your GPUs or pull GPUs from existing nodes and try to cram them into something new. It also allows you to buy more compute nodes and share GPU resources among them (more on how you do that later). You can also do the reverse: upgrade, change, or add GPUs as the technology changes, all without having to touch your compute nodes.

Perhaps the most important thing the separation allows you to do is change the number of GPUs attached to each compute node as you need them. Remember that GPU applications and tools evolve, and more mature applications can often use more than one GPU per PCIe x16 connection. Separating the GPUs from the host nodes means that when an application is first written, you can give certain compute nodes a single GPU per PCIe x16 slot, because at that stage the application is likely to really stress the PCIe bus with data transfers between the host and the GPU. Once the application is less dependent upon the PCIe bus, you can give the compute node two GPUs and run two copies of the application on the node (one on each CPU). And finally, once the application runs on multiple GPUs, you can give the compute nodes four or more GPUs.

This scenario of adding GPUs to compute nodes as the application evolves is almost impossible with a coupled CPU/GPU configuration such as those with internal GPUs. With that type of configuration you have a fixed number of GPUs, and the only way to change the number is to buy more, open up the case, and put them in (if you have room and can power and cool them in the chassis). With a fixed configuration you are stuck with the number of GPUs the chassis can handle, and that’s it.
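If you want to know where a particular application sits on that curve, one concrete test is to measure how hard it drives the PCIe link. The short program below is my own illustration, not a Dell tool; it assumes a host with NVIDIA’s CUDA toolkit installed (the file name pcie_check.cu is just my example, built with something like nvcc -O2 pcie_check.cu). It enumerates whatever GPUs the node can see and times a pinned-memory host-to-device copy on each one. Sustained rates near the roughly 8 GB/s theoretical peak of a PCIe G2 x16 link suggest the workload is still transfer-bound and probably deserves a full x16 link per GPU.

// pcie_check.cu: list visible GPUs and measure host-to-device bandwidth on each.
#include <cstdio>
#include <cuda_runtime.h>

int main(void) {
    int ndev = 0;
    if (cudaGetDeviceCount(&ndev) != cudaSuccess || ndev == 0) {
        std::fprintf(stderr, "no CUDA-capable GPUs visible on this host\n");
        return 1;
    }
    const size_t bytes = 256UL << 20;   // 256 MB test buffer

    for (int d = 0; d < ndev; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        cudaSetDevice(d);

        void *host = NULL, *dev = NULL;
        cudaMallocHost(&host, bytes);   // pinned memory, as a tuned app would use
        cudaMalloc(&dev, bytes);

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start, 0);
        cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        // PCIe G2 x16 tops out at roughly 8 GB/s per direction in theory.
        std::printf("GPU %d (%s): %.2f GB/s host-to-device\n",
                    d, prop.name, (bytes / 1.0e9) / (ms / 1000.0));

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(dev);
        cudaFreeHost(host);
    }
    return 0;
}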
By separating the host node and the GPUs, we can adjust the number of GPUs assigned to specific compute nodes, resulting in the best possible configuration for the application when it runs.

Another benefit of separating the host node and the GPUs is power and cooling. When nodes are designed, engineering teams spend a great deal of time designing for the best possible airflow and temperature distribution. This also means that power supplies are designed for typical power-usage models, which almost always means the node under load (having idle HPC nodes is really not HPC anymore). But having to design a node that may or may not have GPUs inside means your power draw, thermal loads, and airflow have a much wider range of possible values, so the node is usually not as efficient as it could be from a power and cooling perspective. Separating the CPUs and GPUs allows you to create optimized power and cooling designs for the host nodes (CPUs) and for the GPUs.

A final benefit of separating CPUs and GPUs is the ability to provide redundant power to both the host node and the GPUs. GPUs have tremendous computational ability, but they also need a fair amount of power: typically 225W per card. The largest practical power supplies are about 1,400W (there are some 1,600W power supplies, but they are relatively new). A typical two-socket host node can draw anywhere from 200W to 600W (or even more), which leaves roughly 800W to 1,200W for GPUs. That means at best we could power about four GPUs inside a host node before exceeding the limits of a single power supply, and anything above that leaves you without any power redundancy. To gain back redundancy you need to add more power supplies, which means you likely aren’t at the optimal point in the power curve; you also need more room for the power supplies as well as more airflow, which constrains the size of the chassis and adds cost.

Cramming all of that hardware into a single node is very difficult, and you’ll notice that many manufacturers don’t put redundant power supplies in host systems with internal GPUs precisely because they don’t have the room or the airflow. Combining expensive GPUs and host nodes in a single chassis without redundant power supplies truly handicaps the performance potential because of possible downtime. Separating the GPUs from the host nodes allows the host node to retain the redundant power supplies that are so typical of enterprise-class host nodes, and also creates a redundant power source for the GPUs.
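To make the arithmetic concrete, here is a back-of-the-envelope sketch that simply re-runs the numbers above (the wattages are this article’s nominal figures, not measurements):

// power_budget.cpp: how many 225W GPUs fit under one power supply?
#include <cstdio>

int main(void) {
    const double psu_watts = 1400.0;   // largest practical single supply
    const double gpu_watts = 225.0;    // typical GPU card under load
    const double host_watts[] = { 200.0, 400.0, 600.0 };  // two-socket host range

    for (int i = 0; i < 3; ++i) {
        double headroom = psu_watts - host_watts[i];
        int gpus = static_cast<int>(headroom / gpu_watts);
        std::printf("host at %4.0fW: %4.0fW of headroom, at most %d GPUs on one supply\n",
                    host_watts[i], headroom, gpus);
    }
    // The raw division ignores slot count, airflow, and safety margins, so a
    // real internal-GPU design tops out around four cards; anything beyond
    // that puts load on a second supply, which is exactly the loss of
    // redundancy described above.
    return 0;
}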
If you like, one way to look at the benefit of separating the GPUs from the host node is investment protection. You don’t want to invest in a system that requires you to change out your hardware when applications or tools evolve. Moreover, if CPU or GPU technology changes, and it always does, you are once again stuck with a forklift upgrade. Separating the GPUs from the host nodes allows you to adapt to evolving applications and protects your hardware investment from very costly forklift upgrades.

So how did Dell achieve the separation of GPUs and CPUs? What systems does Dell have that allow me to protect my GPU investment?

Dell Systems

Dell’s GPU configurations have been designed for a great deal of flexibility while giving users a choice of host nodes. To achieve this, Dell created the PowerEdge C410x PCI-Express expansion chassis:

It is a 3U chassis that can hold up to 16 hot-swap PCIe G2 cards, which we call “sleds,” each with a PCIe G2 x16 interface. You can connect up to 8 external hosts through PCIe x16 connectors on the chassis. Alternatively, you can assign up to 8 of the sleds to a single x16 connector, allowing two hosts to access 8 slots each (I’ll show a quick way to verify the mapping from the host side, just after the sled description below). The chassis has 4 x 1,400W hot-plug power supplies, giving a maximum redundancy of N+1 and a total power draw of 3,600W. It also has an on-board BMC (Baseboard Management Controller), IPMI 2.0 capability, and a dedicated management port.

There are no CPUs in this chassis, just PCIe slots; a friend of mine refers to it as “room and board” for GPUs. Currently it supports the NVIDIA M2050 and M2070, but you aren’t limited to GPUs: because the C410x has general-purpose PCIe slots, many other cards could be supported, including InfiniBand HCAs, Fibre Channel cards, and other PCIe devices. Future GPUs will be evaluated for inclusion as well. The cards are inserted into the chassis’ PCIe G2 x16 slots by first being mounted in a “sled,” shown below:
The sled slides into the chassis, with 10 sleds in the front and 6 in the back, as shown in the two figures below.
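One practical note before moving on to the host nodes: because the C410x presents the sleds mapped to a host as ordinary PCIe devices, the easiest sanity check after changing a mapping is to enumerate the devices from that host. Here is a minimal sketch, again my own illustration; it assumes CUDA-capable cards in the sleds and the CUDA toolkit on the host, and the fields it prints come from the standard cudaDeviceProp structure:

// gpu_inventory.cu: show which GPUs this host currently sees.
#include <cstdio>
#include <cuda_runtime.h>

int main(void) {
    int n = 0;
    cudaGetDeviceCount(&n);
    std::printf("%d GPU(s) currently mapped to this host\n", n);

    for (int d = 0; d < n; ++d) {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, d);
        // The PCI IDs show where each sled lands on this host's PCIe fabric.
        std::printf("  GPU %d: %s (PCI %02x:%02x), %.1f GB of memory\n",
                    d, p.name, (unsigned)p.pciBusID, (unsigned)p.pciDeviceID,
                    p.totalGlobalMem / 1.0e9);
    }
    return 0;
}

Re-assign sleds in the chassis, run it again, and the count changes with no hardware changes on the host itself.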
On the host side, one option is the PowerEdge C6100, a 2U chassis that contains four independent two-socket Intel Xeon 5600 systems, as shown below.