In a previous post, I described how to compute the peak theoretical floating point performance of a potential system.

http://www.delltechcenter.com/page/Nodes,+Sockets,+Cores+and+FLOPS,+Oh,+My#fbid=TkQxC6Vb2Bi  

In that post, I alluded to GPUs coming into the mix: “When might you need MHz these days, you ask? Think GPU speeds.” Well, that time has come!  The nVidia GTC conference is soon (www.gputechconf.com) and systems are now regularly shipping with GPUs such as the nVidia K20 and K20x which operate at MHz frequencies.

There are several references available that indicate that the new nVidia K20 contains 2,496 cores.  And the operating frequency is also available.  Do not attempt to use these 2 pieces of data to compute a peak theoretical floating point performance number as described in the previous blog.

The K20 does indeed contain 2,496 cores, but not all are available for double precision floating point math.  These cores are arranged into what are called Streaming Multiprocessor (SM) units.  SM units in a GPGPU on an nVidia card are analogous to CPUs in sockets on a motherboard.  Each SM contains 192 cores, all of which are available for single precision floating point math.  But unlike most CPUs, not all GPU cores are available for double precision floating point math.  On the nVidia K20 SM, 64 cores can perform double precision floating point math at a rate of 2 flops/clock.

There are 13 SM units in the K20, operating at a frequency of 706 MHz.  This is the use of MHz that I referred to in the previous blog.  706 MHz is 0.706 GHz.  Note that 13 SMs * 192 cores per SM is the quoted 2,496 cores total.  Also note in the math below that the 64 double precision core count is used and not the quoted 192 (single precision) core count.

Here’s the peak theoretical floating point math for a K20:

GFLOPS  =  13 SM/K20 * 64 cores/SM  *  0.706 GHz/core  * 2 GFLOPs/GHz

GFLOPS  =  1,174.784

I have seen this appear as 1.17 TFLOPS or 1,175 GFLOPS.

Additionally, the nVidia K20x contains an additional SM unit for a total of 14 SM units and it operates at a slightly higher frequency of 732 MHz or 0.732 GHz.

Here’s the peak theoretical floating point math for a K20x:

GFLOPS  =  14 SM/K20x * 64 cores/SM  *  0.732 GHz/core  * 2 GFLOPs/GHz

GFLOPS  =  1,311.744

I have seen this appear as 1.31 TFLOPS or 1,312 GFLOPS.
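
For anyone who would rather script this than do it by hand, here is a minimal Python sketch of the same arithmetic for both cards.  The function name peak_dp_gflops and its parameter names are my own, chosen for illustration; the numbers are the ones quoted above.

```python
def peak_dp_gflops(sm_units, dp_cores_per_sm, clock_ghz, flops_per_clock=2):
    """Peak theoretical double precision GFLOPS for one GPU.

    Uses the double precision core count per SM (64 on the K20/K20x),
    not the 192 single precision cores per SM.
    """
    return sm_units * dp_cores_per_sm * clock_ghz * flops_per_clock


# K20: 13 SM units at 706 MHz (0.706 GHz)
k20 = peak_dp_gflops(sm_units=13, dp_cores_per_sm=64, clock_ghz=0.706)

# K20x: 14 SM units at 732 MHz (0.732 GHz)
k20x = peak_dp_gflops(sm_units=14, dp_cores_per_sm=64, clock_ghz=0.732)

print(f"K20:  {k20:,.3f} GFLOPS")
print(f"K20x: {k20x:,.3f} GFLOPS")
```

Swapping 64 for 192 in the call would give the number you get by naively using the quoted total core count, which is exactly the mistake to avoid for double precision.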

Hope that helps.  Compute the CPU performance as described in the previous blog.  Compute the GPU performance as described here.  The total system performance is the sum of these.    
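
Here is a rough Python sketch of that sum for a single node.  Please note the assumptions: the CPU values (2 sockets, 8 cores per socket, 2.6 GHz, 8 flops per clock) are hypothetical placeholders that do not come from this post, and the function names are mine.  Only the K20 numbers are taken from above.

```python
def cpu_peak_gflops(sockets, cores_per_socket, clock_ghz, flops_per_clock):
    """Peak theoretical GFLOPS for the CPUs in one node."""
    return sockets * cores_per_socket * clock_ghz * flops_per_clock


def gpu_peak_dp_gflops(sm_units, dp_cores_per_sm, clock_ghz, flops_per_clock=2):
    """Peak theoretical double precision GFLOPS for one GPU."""
    return sm_units * dp_cores_per_sm * clock_ghz * flops_per_clock


# Hypothetical CPU configuration, for illustration only.
cpu_gflops = cpu_peak_gflops(sockets=2, cores_per_socket=8,
                             clock_ghz=2.6, flops_per_clock=8)

# One K20 per node, using the figures quoted in this post.
gpu_gflops = gpu_peak_dp_gflops(sm_units=13, dp_cores_per_sm=64, clock_ghz=0.706)

# Total peak theoretical performance is the sum of the CPU and GPU parts.
total_gflops = cpu_gflops + gpu_gflops
print(f"CPU {cpu_gflops:,.1f} + GPU {gpu_gflops:,.1f} = {total_gflops:,.1f} GFLOPS")
```

For a node with more than one GPU, add the GPU figure once per card.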

Remember that this is the peak theoretical floating point performance.  Since it is theoretical, it is the performance you are guaranteed to never see!  But we also already have a few blogs posted about real-world performance using GPUs:

Comparing GPU-Direct Enabled Communication Patterns for Oil and Gas Simulations: http://dell.to/JsWqWT 

ANSYS Mechanical Simulations with the M2090 GPU on the Dell R720:  http://dell.to/JT79KF

Faster Molecular Dynamics with GPUs: http://en.community.dell.com/techcenter/high-performance-computing/b/hpc_gpu_computing/archive/2012/08/07/faster-molecular-dynamics-with-gpus.aspx

Accelerating High Performance Linpack (HPL) with GPUs:  http://en.community.dell.com/techcenter/high-performance-computing/b/hpc_gpu_computing/archive/2012/08/07/accelerating-high-performance-linpack-hpl-with-gpus.aspx

If you have comments or can contribute additional information, please feel free to do so.  Thanks.  --Mark R. Fernandez, Ph.D.

@MarkFatDell

#Iwork4Dell