Mark Fernandes

Recently, a fellow blogger here at HPCatDell, Dr. Jeff Layton, has been running a series on PetaFLOPS for the Common Man. In that series, he writes that in the November 2009 Top500 list there are actually two systems that achieve more than one PetaFLOPS in sustained performance on the Top500 benchmark. However, there are five systems that have a theoretical performance above one PetaFLOPS.

Working in HPC, I am often asked how to compute the theoretical performance of a potential system.

Although this may be elementary for many of you, I thought I would take the time to document this and, maybe, in a subsequent writing explain the sustained performance numbers that go into the Top500 rankings - and why there is such a difference between theoretical and sustained.

To properly compute the theoretical performance of a system, we need to agree upon some common terms, or a taxonomy, if you will, of HPC compute components. Then, we simply do a dimensional analysis as we did in high school.

In the past, a chassis contained a single node. This chassis was a desktop computer or a tower version or a deskside unit or a rack-mounted pizza box server, etc. Within that thing you bought was a single node. A single node contained a single processor. A processor contained a single (CPU) core and fit into a single socket. But times change...

With recent "systems," we can have a single chassis containing multiple nodes. And those nodes contain multiple sockets. And the processors in those sockets contain multiple (CPU) cores.

Therefore, let’s define a few terms.

1. A "chassis" houses one or more nodes.

2. A node contains one or more sockets.

3. A socket holds one processor.

4. A processor contains one or more (CPU) cores.

5. The cores perform FLOPS.
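For readers who think in code, the five terms above can be sketched as a small Python data structure. This is only an illustration: the class name `Chassis`, its field names, and the example counts are my own hypothetical choices, not a description of any particular product.

```python
from dataclasses import dataclass


@dataclass
class Chassis:
    """One chassis in the taxonomy: nodes -> sockets -> processors -> cores."""
    nodes: int                # nodes housed in this chassis
    sockets_per_node: int     # sockets on each node
    cores_per_processor: int  # each socket holds exactly one processor

    def total_cores(self) -> int:
        # Multiply down the hierarchy to count the cores that perform FLOPS.
        return self.nodes * self.sockets_per_node * self.cores_per_processor


# Hypothetical blade enclosure: 16 nodes, 2 sockets each, quad-core processors.
blade_enclosure = Chassis(nodes=16, sockets_per_node=2, cores_per_processor=4)
print(blade_enclosure.total_cores())  # 128
```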

The "chassis" is that thing that houses one or more compute nodes. Note that the chassis may be a rack-mounted pizza box, or a blade enclosure or entire rack computer, which accepts plug-in compute nodes. One must buy one or more of these in order to have a computer system. Nonetheless, I call the piece of hardware that is a unit that houses compute nodes a chassis.

Nodes, usually one or more printed circuit boards of some type, are manufactured with (empty) sockets. There is not, in general, a node board for each available processor. The node boards are built to accommodate a family of processors. Depending upon your needs, your desires, or your budget, you select a specific processor to go into that socket. Today, within the same processor family, you can select between differing core counts, a wide range of frequencies and vastly differing memory cache structures.

Also note that the "thing" that Intel and AMD and other microprocessor companies sell is a processor. One cannot buy anything smaller than a processor. And they call it a processor with preceding adjectives, e.g., the ABC dual-core processor, or the XYZ quad-core processor.

Finally, the cores within the processor perform the actual mathematical computations. A mathematical operation performed on floating-point numbers is called a FLOP, or FLoating-point OPeration. The plural of FLOP is FLOPs, with a small “s,” like many things when made plural.

In general, a core can do a certain number of FLOPs or FLoating-point OPerations every time its internal clock ticks. These clock ticks are called cycles, and the rate at which they occur, the clock frequency, is measured in Hertz (Hz), or cycles per second. Most microprocessors today can do four (4) FLOPs per clock cycle. Thus, depending upon the frequency of the processor’s internal clock, the floating point operations per second or FLOPS can be calculated. Note the large “S” in FLOPS.

The internal clock speed of the core is known. It’s that GHz rating typical of today’s processors. For example, a 2.5-GHz processor ticks 2.5 billion times per second (Giga ~ billion). Therefore, each core of a 2.5-GHz processor, ticking 2.5 billion times per second and capable of performing 4 FLOPs each tick, is rated with a theoretical performance of 10 billion FLOPs per second or 10 GFLOPS.
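The arithmetic in that example is small enough to check in a couple of lines of Python. The function name `core_peak_gflops` is just an illustrative name of mine; the 2.5 GHz and 4 FLOPs per cycle are the figures from the paragraph above.

```python
def core_peak_gflops(clock_ghz: float, flops_per_cycle: int) -> float:
    """Theoretical peak of one core: (billions of cycles/second) * (FLOPs/cycle)."""
    return clock_ghz * flops_per_cycle


# 2.5 billion ticks per second, 4 FLOPs per tick.
print(core_peak_gflops(2.5, 4))  # 10.0
```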

That’s probably more than anyone needs to know about the details of counting mathematical operations done by microprocessors. Fortunately, the final formula for computing theoretical performance of a system is quite simple and straightforward.

Here is a full and complete sample formula using dimensional analysis:

GFLOPS = #chassis * #nodes/chassis * #sockets/node * #cores/socket * GHz/core * FLOPs/cycle
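The same formula, written as a minimal Python sketch. The function and parameter names are mine, and the cluster in the example (16 chassis of 4 dual-socket, quad-core nodes) is a made-up configuration chosen only to show the units cancelling.

```python
def theoretical_gflops(
    chassis: int,
    nodes_per_chassis: int,
    sockets_per_node: int,
    cores_per_socket: int,
    clock_ghz: float,
    flops_per_cycle: int,
) -> float:
    """Theoretical peak in GFLOPS, term by term as in the formula above."""
    # chassis * nodes/chassis * sockets/node * cores/socket leaves a core count.
    total_cores = chassis * nodes_per_chassis * sockets_per_node * cores_per_socket
    # cores * GHz/core * FLOPs/cycle leaves billions of FLOPs per second.
    return total_cores * clock_ghz * flops_per_cycle


# Hypothetical system: 16 chassis, 4 nodes each, 2 sockets per node,
# quad-core 2.5-GHz processors doing 4 FLOPs per cycle.
print(theoretical_gflops(16, 4, 2, 4, 2.5, 4))  # 5120.0
```

That hypothetical system has 512 cores at 10 GFLOPS each, for 5120 GFLOPS of theoretical performance.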

Note that the use of a GHz processor yields GFLOPS of theoretical performance. Divide GFLOPS by 1000 to get TeraFLOPS or TFLOPS.

Likewise, MHz clocks used in the formula will yield MFLOPS, if you need that number. Similarly divide MFLOPS by 1000 to get GFLOPS. When might you need MHz these days, you ask? Think GPU speeds.

Note that for multi-rack systems, the formula may be extended by counting racks first, that is, by replacing #chassis with #racks * #chassis/rack.

Hope this helps.

-- Dr. Mark R. Fernandez, Ph.D.