Dell's Dr. Jeff Layton

GPUs are arguably one of the hottest trends in HPC. They can greatly improve performance while reducing the power consumption of systems. However, because GPU Computing is still evolving, both in application development and in tool sets, GPU systems need to be as flexible as possible.

This blog presents benchmarks of the configurations described in Part 4 of this series. In some cases the results are compared to a Supermicro system with 2 GPUs in 1U (the first GPU Computing server released), and in other cases they illustrate application scalability that only the Dell GPU configurations can deliver.


First things first – Bandwidth Testing
When you develop or buy a new GPU Computing solution, the first benchmark you are likely to run is a simple bandwidth test between the host and the GPU (and vice versa). You should know how the host node and the GPU are connected so you have an idea of the maximum bandwidth.
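If you want to run a quick check of your own, the sketch below (an illustrative example only, not the tool used to generate Figure 1) uses the CUDA runtime to time one large host-to-device and device-to-host copy:

```cuda
// Minimal host<->device bandwidth sketch using the CUDA runtime.
// Illustrative only; the Figure 1 numbers were produced with Dell/Nvidia benchmark tooling.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 256 << 20;   // 256 MiB: large enough to hide per-call overhead
    float *h_buf = NULL, *d_buf = NULL;

    // Pinned (page-locked) host memory is what peak PCIe numbers assume;
    // pageable host buffers will report noticeably lower bandwidth.
    cudaMallocHost((void **)&h_buf, bytes);
    cudaMalloc((void **)&d_buf, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float ms = 0.0f;

    // Host -> Device (the "H to D" direction in Figure 1)
    cudaEventRecord(start);
    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("H->D: %.2f GB/s\n", (bytes / 1.0e9) / (ms / 1000.0));

    // Device -> Host (the "D to H" direction in Figure 1)
    cudaEventRecord(start);
    cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("D->H: %.2f GB/s\n", (bytes / 1.0e9) / (ms / 1000.0));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}
```

Note that pinned host memory matters here: pageable buffers typically fall well short of the peak figures discussed below.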

For example, a PCIe Generation 2 x16 connection has a theoretical bandwidth of 8 GB/s in one direction (16 lanes at roughly 500 MB/s of usable bandwidth each; see http://en.wikipedia.org/wiki/PCI_Express), which is exactly what we’re testing in the following bandwidth tests. Figure 1 below is a plot from a presentation by Dell’s Dr. Mark Fernandez at the 2010 Nvidia GTC Conference. It shows the bandwidth as a function of the number of PCIe lanes for a Dell C6100 with Intel processors, using a single HIC connected to a single GPU in a Dell C410x.

Figure 1 – Host-GPU and GPU-Host Bandwidth Tests

In Figure 1, “H” refers to the host and “D” refers to the Device or GPU. At peak, the C6100 and C410x combination reaches about 5.6 GB/s. This is about 70% of the theoretical 8 GB/s with 16 lanes (x16) and compares favorably to other systems with the GPU inside the chassis (so-called “direct connected”). There is some slight asymmetry in the results due to the PCIe switch configuration inside the Dell C410x.

Benchmarks
Dell has run a large number of benchmarks on GPU configurations, but only a subset will be presented here. For more benchmarks, please read Dr. Mark Fernandez’ Nvidia GTC presentation from 2010. Also, watch the Dell HPC blogs at www.HPCatDell.com for up-to-date GPU results.

The first benchmark presented in this blog is NAMD (http://www.ks.uiuc.edu/Research/namd/). This is one of the most advanced MD (Molecular Dynamics) applications for GPUs. In this particular benchmark, the STMV data set is used. NAMD is run on a Dell C6100 connected to a Dell C410x that has two Nvidia M2050s sharing one x16 connection to the node. This is compared to a Supermicro 1U box with 2 GPUs (M2050s) with each GPU connected via a x16 slot. The details of the comparison are in Dr. Fernandez’ presentation but the CPUs, amount and type of memory, and software are identical between the two units.

Figure 2 – NAMD, STMV Benchmark

For this particular benchmark, you can see that NAMD performs better when each GPU has a dedicated PCIe x16 slot rather than sharing a single PCIe x16 slot.

However, this simple benchmark does not give you the whole picture. Figure 3 below shows the same test, but for the Dell C6100/C410x combination we varied the number of GPUs from 1 to 4 over the single x16 connection between the PCIe chassis (C410x) and the host node (C6100). One of the advantages of the C410x is its flexibility: the number of GPUs assigned to a node can be changed as needed, even via scripting, with no physical changes to the hardware (a quick way to check what a node actually sees after such a change is sketched below). You cannot do this with a node that has a fixed number of GPUs inside the chassis.
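The reassignment itself is done through the C410x management interface and is not shown here, but once the node comes back up you can confirm what it sees with a few lines against the CUDA runtime API (a minimal sketch):

```cuda
// Quick check of how many GPUs the host node sees after a C410x reassignment.
// This only queries the CUDA runtime; the assignment itself is done through the
// C410x management interface and is not shown here.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        printf("No CUDA-capable devices visible to this node.\n");
        return 1;
    }
    printf("GPUs visible to this node: %d\n", count);

    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        // The PCI bus/device IDs help confirm which C410x slots were mapped to this node.
        printf("  GPU %d: %s (PCI %02x:%02x)\n",
               i, prop.name, prop.pciBusID, prop.pciDeviceID);
    }
    return 0;
}
```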

Figure 3 compares the performance in steps/s for the Dell configurations and the Supermicro configuration (2 GPUs), as well as using just the CPUs on the node (all 12 cores).

Figure 3 – NAMD, STMV benchmarks for various numbers of GPUs

The key thing to notice is that while the Supermicro configuration is faster than the Dell combination for 2 GPUs, due to the dedicated x16 slots in the Supermicro chassis, you can get even more performance by adding two more GPUs to the Dell C6100/C410x combination – something that you cannot do with the Supermicro configuration. When the two additional GPUs are added to the Dell configuration (for a total of 4 GPUs), NAMD captures almost 80% of their additional performance. With the Dell configuration this can be done simply by putting two more GPUs in the C410x, assigning them to the host node (via software, typically over the management network), and rebooting the C6100 host node. With the Supermicro configuration you have to buy a second unit to get 4 GPUs, which costs more, and you still cannot share the GPUs between nodes. This is a perfect example of why flexibility is key for GPU computing.

Figure 2 illustrated that, in some cases, applications like a dedicated x16 slot for the best possible performance, but there are applications where this is not necessarily the case. Figure 4 plots the performance of the CUDASW++ application (http://sourceforge.net/projects/cudasw/) over a range of query lengths. Just as with NAMD, the application is run on a Dell C6100 host node that has a single x16 PCIe connection to the C410x, which holds two Nvidia M2050s. This effectively gives you 8 PCIe lanes per GPU in the worst case, when both GPUs are communicating at the same time and using all of the bandwidth (a simple way to probe this sharing effect yourself is sketched below). This is compared to the Supermicro unit that has the same CPUs, the same type and amount of memory, and the same software, but with each GPU having a dedicated x16 slot (a total of two GPUs).
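To see the link sharing directly, you can time concurrent host-to-device copies to both GPUs and compare the aggregate number against the single-GPU result from the earlier sketch. This is an illustrative example under the same CUDA-runtime assumptions as before, not the CUDASW++ benchmark itself:

```cuda
// Rough test of link sharing: copy to two GPUs at the same time and report the
// aggregate host-to-device bandwidth. With two GPUs behind one x16 connection
// (as in the C6100/C410x setup described above), the aggregate stays near the
// single-GPU peak; with dedicated x16 slots it should scale close to 2x.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 256 << 20;   // 256 MiB per GPU
    float *h[2], *d[2];
    cudaStream_t s[2];

    for (int g = 0; g < 2; ++g) {
        cudaSetDevice(g);
        cudaMallocHost((void **)&h[g], bytes);   // pinned host buffer, one per GPU
        cudaMalloc((void **)&d[g], bytes);
        cudaStreamCreate(&s[g]);
    }

    cudaEvent_t start, stop;
    cudaSetDevice(0);
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Both GPUs pull from the host concurrently.
    cudaEventRecord(start);
    for (int g = 0; g < 2; ++g) {
        cudaSetDevice(g);
        cudaMemcpyAsync(d[g], h[g], bytes, cudaMemcpyHostToDevice, s[g]);
    }
    for (int g = 0; g < 2; ++g) {
        cudaSetDevice(g);
        cudaStreamSynchronize(s[g]);
    }
    cudaSetDevice(0);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Aggregate H->D with both GPUs active: %.2f GB/s\n",
           (2.0 * bytes / 1.0e9) / (ms / 1000.0));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    for (int g = 0; g < 2; ++g) {
        cudaSetDevice(g);
        cudaStreamDestroy(s[g]);
        cudaFree(d[g]);
        cudaFreeHost(h[g]);
    }
    return 0;
}
```

Whether that raw-transfer difference matters to an application is a separate question, as the CUDASW++ results below show.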

Figure 4 – CUDASW++ Benchmark as a function of Query Length

Notice that the performance of the two configurations is almost identical, and for a couple of query lengths the C6100/C410x combination is actually faster. This is true even though the two GPUs share a single x16 connection to the host node while the Supermicro GPUs have dedicated x16 slots. This illustrates that dedicated x16 slots are not necessarily needed for performance – it all depends upon the application.

Again, this illustrates the importance of flexibility in GPU configurations. For CUDASW++, having dedicated x16 slots does not give you any better performance than two GPUs sharing a single x16 connection. Furthermore, with a flexible configuration such as the C6100/C410x, you can easily adjust the number of GPUs for a host node without having to purchase additional host nodes or dedicated configurations.

Summary
These two benchmarks illustrate that performance is dependent upon the specific application. In the case of NAMD, you get somewhat better performance when each GPU has a dedicated x16 slot. However, in the case of CUDASW++, a dedicated x16 slot for each GPU is not necessary, and you can still get good performance when two GPUs share a single x16 connection.

However, as was shown with the NAMD benchmarks, the Dell combination of the C6100 and C410x allows you to add more GPUs to the host node simply by assigning them to the node via the C410x interface and then rebooting the node. In the case of NAMD, when two more GPUs were added to the host node (for a total of 4 GPUs), NAMD captured about 80% of the additional performance. If you are using nodes with internal GPUs, the only way to add GPUs is to buy another node. This doesn’t allow nodes to share GPUs and severely limits the scalability of the solution.

This illustrates the flexibility of the Dell GPU configurations, which let you easily add GPUs to a system without having to buy completely new nodes. This is exactly what is needed as GPU applications are created, evolve, and proliferate. Some applications may need 1 or 2 GPUs with as much PCIe bandwidth as possible, while other, more mature applications may just need more GPUs. Having a GPU Computing configuration that is flexible enough to meet both requirements with the same set of hardware gives you the most ROI and the maximum use of your resources.

Dr. Jeff Layton