
Dr. Jeff Layton

Note: In a previous blog, Dell's Dr. Jeff Layton began this series exploring the past and future of GPU computing in HPC. You can read that blog here: HPC GPU Flexibility: The Evolution of GPU Applications and Systems

Further Evolution: Think Multiple GPUs + MPI
At this point you have an application that performs well on a single GPU. Since we're talking about HPC, you will probably want to evolve the code even further: first to use multiple GPUs, and then to combine multiple GPUs with MPI. Learning to write efficient code for multiple GPUs, and then layering MPI on top, takes time and effort. Some of the leading GPU application developers have successfully evolved their applications to this level. It took work and experimentation, but the rewards have been remarkable.
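
To give a feel for what "multiple GPUs + MPI" looks like in practice, here is a minimal sketch (not from the original post) of the usual starting point: one MPI rank per GPU, with each rank binding itself to a device at startup. The naive rank-modulo mapping shown is an assumption for illustration; real applications often use a scheduler- or topology-aware mapping.

    // Minimal sketch: one MPI rank per GPU.
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank, ngpus;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        cudaGetDeviceCount(&ngpus);

        // Bind this rank to one of the node's GPUs (naive mapping;
        // assumes ranks are spread evenly across nodes).
        cudaSetDevice(rank % ngpus);

        /* ... each rank launches kernels on its own GPU and exchanges
           boundary data with neighboring ranks via MPI_Send/MPI_Recv ... */

        MPI_Finalize();
        return 0;
    }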

The Tools: CUDA, OpenCL & Compilers
Applications evolve regardless of the tools you use, but tool choice becomes an especially important issue for GPU applications. You can write your code using the CUDA tools, the OpenCL tools, or the PGI compiler. Regardless of the tool set, there is a feedback loop between the person or persons writing the code and the performance of the code. These tool sets help by providing good feedback information, but it is up to the code writers to modify the code. In other words, there is no magic "-gpu" compiler flag or tool that takes an existing code and converts it to GPU code.
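
To make that concrete, here is a hypothetical example (not from the original post) of the kind of restructuring a programmer has to do by hand: the same SAXPY operation written as an ordinary CPU loop and as a hand-written CUDA kernel. The function names are made up; the point is that the thread decomposition, index arithmetic, and launch configuration all come from the programmer, not from a compiler flag.

    // CPU version: an ordinary loop.
    void saxpy_cpu(int n, float a, const float *x, float *y) {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

    // Hand-written CUDA kernel: the programmer chooses the thread
    // decomposition and adds the bounds check.
    __global__ void saxpy_gpu(int n, float a, const float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];
    }

    // Launched as, e.g.: saxpy_gpu<<<(n + 255) / 256, 256>>>(n, a, d_x, d_y);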

[Image: Dell's C410x]


However, I would be remiss if I didn't mention that these tools are also evolving. Nvidia just released version 4.0 of its CUDA tool set, the Portland Group just updated its Accelerator compiler, and OpenCL is evolving so rapidly I can't seem to keep up. As of the writing of this blog, Nvidia has version 3.2 of its tools for OpenCL and version 1.1 of the OpenCL drivers, and AMD (ATI) has version 2 of its OpenCL SDK (http://developer.amd.com/zones/OpenCLZone/pages/default.aspx).

An example of this tool-set evolution is that the latest version of CUDA (4.0) now allows multiple GPUs to communicate directly over the PCIe bus. Prior to this version, if one GPU needed to send data to another GPU, it first had to copy the data to host memory, where it was typically copied again to a buffer owned by another CPU core before being sent to the target GPU. A great deal of data therefore crosses the PCIe bus, which takes time and slows the application. In addition, the host-side memory copy involves at least two cores on the host system, which also takes time and consumes host memory. All of this data movement and copying results in reduced performance and increased host resource requirements (i.e., you have to buy more memory).
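
As a rough sketch of this older staging pattern (the buffer names are made up; before CUDA 4.0 the two devices would typically be driven from two separate host threads, which is elided here for brevity):

    // Pre-CUDA-4.0 staging: GPU 0 -> host memory -> GPU 1.
    // d_src lives on device 0, d_dst on device 1, h_buf is a host buffer.
    cudaMemcpy(h_buf, d_src, nbytes, cudaMemcpyDeviceToHost); // PCIe crossing 1
    /* hand h_buf (or a copy of it) to the thread driving device 1 */
    cudaMemcpy(d_dst, h_buf, nbytes, cudaMemcpyHostToDevice); // PCIe crossing 2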

Now with CUDA 4.0 it is possible to have one GPU send data to another GPU within the same host over the PCIe bus without the involvement of any CPUs. This cuts the data movement across the PCIe bus in half and eliminates the memory copy on the host system. There are other far-reaching consequences of this capability that we are likely to see in the coming years as GPU design evolves.
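
With CUDA 4.0, the same transfer can be expressed as a single direct copy using the peer-to-peer calls in the runtime API. A minimal sketch, with the same illustrative pointer names as above:

    // CUDA 4.0 peer-to-peer: one direct GPU-to-GPU copy over PCIe.
    int can_access = 0;
    cudaDeviceCanAccessPeer(&can_access, 0, 1);     // can device 0 reach device 1?
    if (can_access) {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);           // enable direct access to device 1
        cudaMemcpyPeer(d_dst, 1, d_src, 0, nbytes); // no host buffer, one transfer
    }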

GPU System Configuration Design
Since both GPU applications and development tool sets evolve, it is very probable that some applications will need one set of system characteristics (memory, processor, network, GPU configuration) while other applications will need a different set. And just for good measure, the CPU, network, and system bus hardware are all evolving independently as well. How does one design a system for a moving target?

The importance of this question is amplified by the fact that the users on a system may be at different points in their GPU application evolution and may be using different tool kits. For example, one may be using CUDA and another OpenCL. Or one user may be using Nvidia GPUs and another AMD GPUs. It's also possible that one user's algorithm works best with an Intel CPU while another user's algorithm works better with an AMD CPU. How do you design the best system possible, or at least a good enough system, for these combinations? The question applies even to systems running only one or two applications, because those applications, and their tool sets, may be evolving too. The answer is the title of this blog: maintain flexibility.

Maintaining flexibility is easier said than done, of course. The basic assumption you have to make is that all applications and tool sets will evolve, so you need to be able to adapt to that evolution (hence the flexibility).

In the next blog I want to talk about Dell's approach to GPU system design, which improves flexibility while lowering overall costs.