I am a man of fixed and unbending principles, the first of which is to be flexible at all times.
Everett Dirksen
Stay committed to your decisions, but stay flexible in your approach.
Tony Robbins
Technology makes things faster and more cost-effective, but it's not perfect. It requires you to be as flexible as you can be.
John Phillips

GPUs (Graphics Processing Units) are arguably one of the hottest trends (if not the hottest) in HPC. I've been following the trend for a while and writing about it for at least 4 years (http://www.linux-mag.com/id/4543/). One can even safely argue that the "Era of GPU Computing in HPC" is here and we are just starting to reap its benefits. If you aren't experimenting with GPU computing, you should be, because applications that take advantage of GPUs are coming very quickly. These applications have shown performance improvements ranging from 2x (twice as fast) to over 100x (100 times faster). From these numbers it's pretty obvious why people are so excited about GPUs.

If this is the case, then how should we build systems that utilize them effectively, are cost-effective, and not point solutions that satisfy the needs of today’s GPU applications when these applications are evolving so rapidly?

Evolution of GPU Applications:
GPUs require a different mindset and a different approach to writing algorithms. GPUs use an approach called SIMD (Single Instruction, Multiple Data), which basically requires that the same instruction be applied to all of the data. This is a bit different from the usual approach and forces many people to rethink their algorithms.
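To make the SIMD idea concrete, here's a toy sketch in plain Python (illustrative only; real GPU code would be a CUDA or OpenCL kernel, and the function names here are my own invention). The point is the shape of the computation: one instruction stream applied uniformly across all the data.

```python
def saxpy_scalar(a, x, y):
    """Traditional serial mindset: walk the data one element at a time."""
    out = []
    for i in range(len(x)):
        out.append(a * x[i] + y[i])
    return out

def saxpy_simd_style(a, x, y):
    """SIMD mindset: express the work as ONE operation over the whole data set.
    On a GPU, every thread would execute this same 'a*x + y' instruction,
    each on its own element."""
    return [a * xi + yi for xi, yi in zip(x, y)]

x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]
print(saxpy_simd_style(2.0, x, y))  # [12.0, 24.0, 36.0, 48.0]
```

Both versions compute the same thing; the difference is that the second is phrased as "apply this instruction to everything," which is exactly the form that maps onto thousands of GPU threads.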

For new applications this may not be such a problem because you haven't written any code yet. This allows you to rethink how you write your application to take advantage of SIMD computations on GPUs. If your algorithm is amenable to GPUs, and the resulting code is efficient, then you can get huge performance boosts from running on GPUs (use this link http://www.nvidia.com/object/cuda_apps_flash_new.html# for a list of applications, their associated speed improvements, and papers discussing GPU computing).

If your code already exists then you have to go back and rewrite fairly large parts of your application with a SIMD mindset. You need to reexamine how your code works and what portions can be re-coded in a SIMD manner. This may result in what appears to be “larger” or more inefficient code, but running this code on GPUs can give you some really nice performance increases.
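Here's one small, hedged example (in illustrative Python, with made-up function names) of the kind of rewrite I mean. Per-element branches are natural on a CPU but cause divergence on SIMD hardware, so a common step is to re-express them so every element runs the same instructions, even though the result can look redundant:

```python
def clamp_branchy(vals, lo, hi):
    """CPU-style: data-dependent branches; on SIMD hardware, elements taking
    different branches serialize (divergence)."""
    out = []
    for v in vals:
        if v < lo:
            out.append(lo)
        elif v > hi:
            out.append(hi)
        else:
            out.append(v)
    return out

def clamp_branch_free(vals, lo, hi):
    """SIMD-style: the same min/max arithmetic for every element, no per-element
    branching. It may look 'larger' or wasteful, but it maps cleanly onto GPU lanes."""
    return [min(max(v, lo), hi) for v in vals]

print(clamp_branch_free([-1.0, 0.5, 2.0], 0.0, 1.0))  # [0.0, 0.5, 1.0]
```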

In my interactions with a great number of HPC customers, I've found one thing to be true: writing code for GPUs is an evolutionary process. Actually, writing any code follows an evolutionary process, but it is more pronounced with GPU applications because small changes can have a large impact on performance. In addition, most people are not used to thinking in SIMD terms about their applications. Typically, application development begins with only parts of the code running on the GPU and with a fairly high dependence upon host-GPU data transfer performance (i.e., how quickly you can transfer data to the GPU over the PCIe bus and back). At this stage in the evolution of the application, the overall performance is not very good and is dominated by the data transfer performance between the host and the GPU, but this is only a start.
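A quick back-of-the-envelope model shows why that early stage is transfer-bound. All the numbers below are illustrative assumptions I picked for the sketch, not measurements: roughly PCIe-class bandwidth and a notional GPU compute rate.

```python
def gpu_time(bytes_moved, pcie_gbps, flops_on_gpu, gpu_gflops):
    """Crude model: total time = host<->GPU transfer time + GPU compute time.
    (Assumed numbers only; real systems also have latency, overlap, etc.)"""
    transfer_s = bytes_moved / (pcie_gbps * 1e9)
    compute_s = flops_on_gpu / (gpu_gflops * 1e9)
    return transfer_s + compute_s

# Early version: ship ~2 GB total across PCIe to do relatively little compute.
early = gpu_time(bytes_moved=2e9, pcie_gbps=5.0, flops_on_gpu=1e9, gpu_gflops=500.0)
# Later version: data mostly stays resident on the GPU; far more compute per byte moved.
later = gpu_time(bytes_moved=2e8, pcie_gbps=5.0, flops_on_gpu=1e11, gpu_gflops=500.0)
print(early, later)  # early ~0.402 s (almost all transfer), later ~0.24 s
```

In the "early" scenario, the transfer accounts for about 99% of the total time, so the GPU's raw speed barely matters; in the "later" scenario, transfer is a small fraction and the GPU's compute rate dominates.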

As the code evolves it is updated to move more parts of the algorithm to the GPU, decreasing the amount of data that is transferred between the CPUs and the GPUs. This typically happens over several versions of the code, both to gain confidence in the accuracy (output) of the application and to measure the impact of code changes on performance. After several code revisions it reaches a point where the code performs well and is much faster than the first version. The amount of time it takes to go from the first version to this "final" version varies, but some of the early GPU applications took about a month or two and about 8 code revisions. However, the performance impact from the first version to this 8th version was quite dramatic. This is really expected, since the more work or "compute intensity" you can put on the GPU, the more dramatic the performance gains. Just realize that it may take you more or fewer than 8 revisions; the number 8 is just a typical figure I've found from some customers.
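An Amdahl's-law view makes the revision-by-revision payoff concrete. Suppose (and these fractions are purely illustrative, not data from any customer) that each revision moves a larger fraction of the runtime onto a GPU that runs that portion 50x faster:

```python
def overall_speedup(gpu_fraction, gpu_speedup):
    """Amdahl's law applied to porting: gpu_fraction of the original runtime is
    accelerated by gpu_speedup; the remainder still runs at CPU speed."""
    return 1.0 / ((1.0 - gpu_fraction) + gpu_fraction / gpu_speedup)

# Hypothetical revision history: each revision offloads more of the algorithm.
for rev, frac in enumerate([0.2, 0.5, 0.8, 0.95], start=1):
    print(f"revision {rev}: {frac:.0%} on GPU -> {overall_speedup(frac, 50.0):.1f}x overall")
```

Even with a 50x-faster GPU portion, offloading 20% of the runtime yields only about a 1.2x overall speedup; at 95% offloaded it's roughly 14.5x. That's why the later revisions, which push more of the algorithm onto the GPU, are where the dramatic gains show up.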

This is all for now. In my next GPU blog, I’ll address equally important trends and insights regarding GPUs in HPC, including multiple GPUs + MPI, as well as vital tools like CUDA, OpenCL and compilers.

Dr. Jeff Layton