Texas Advanced Computing Center (TACC)

By Tommy Minyard and Dan Stanzione, Texas Advanced Computing Center


In our last installment, we introduced the idea that performance is not all about clock rate and core count, and that access speed to main memory, in particular, often dominates performance. We also pointed out how the architectural changes Intel introduced with the Nehalem line of processors dramatically increased the available bandwidth per core, and what a leap forward that meant for the performance of typical applications.

Intel has continued to increase core count with the introduction of the Westmere processor, and upcoming processors will feature both more bandwidth and more cores (as will processors in competing lines). These changes will mean an overall improvement, of course, but they will also require some deeper thinking about how to configure your applications for maximum performance per socket.

"Conventional Wisdom" has always held that as you get more cores, you run more tasks. In fact, you run exactly as many tasks (or threads) as you have cores. Leaving a core idle is considered "wasteful". This is not surprising, but upon careful reflection doesn't make that much sense... No one considers it a "waste" if while running a job on every core of your machine, half your memory is empty, or half your network is unused, or you are only using half the available IOPS or bandwidth to your disk drive.

No one considers these resources "underutilized" in most situations, partially because they aren't as visible, but mostly because they aren't what you think of as your measure of productivity. In fact, you should measure how productive your cluster (or even workstation) is by, well, how much work you produce. Productivity, like performance, is seldom as straightforward as a single benchmark, so there are several measures you can use to think about productivity. It might be how many jobs your cluster does in a given time (hour, day, year). It might be how fast *your* job runs on your cluster. For the sake of argument in this post, let's assume you use a cluster because you want to run big jobs that require a cluster... jobs that use multiple cores on a node, and may use multiple cores on multiple nodes. So, let's further assume that the metric we want to maximize for productivity is making these multi-core jobs run as fast and efficiently as they can on your hardware.

At this point, it's probably not going to be a surprise to you that what we'd like to propose in this post is that there are plenty of situations nowadays where leaving some cores idle is the smartest thing to do to maximize performance. This requires a little bit of a shift in mindset... You need to think of the cores on your processors as dynamic resources which you max out when you need them (like the network or the memory) rather than as something you need to utilize all the time to be productive. Hey, you spent money on a vacuum cleaner, and if you ran it every hour every day, that wouldn't be better utilization, it would simply be annoying. If you use it enough to keep your house clean, that's appropriate utilization. The important thing is it has the power you need when you turn it on. So, let's start thinking about all those cores on your processors the same way. They provide the power you need in some situations, but you don't need to use all of it all the time.

The processor (all the cores in one socket) gives you a certain amount of compute performance, a certain amount of cache space, and a certain amount of bandwidth to main memory. The trick is to use enough of each without overloading any one of them. You might think that maxing out every dimension is the right thing to do, but creating bottlenecks is a problem... you don't want to give the compute elements so much load that you overload the memory bandwidth and the whole system slows down. Think of it like (automotive) traffic. You can build a great highway, and you can build really fast cars... but if you put too many cars on the highway, everybody ends up slowing down. In our previous blog post, we showed how memory performance had leapt forward on the Nehalem processors. In the newer Westmere generation, we get this improved bandwidth, but we also get a lot more processing power in the form of the extra cores.
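If you want to see that traffic jam on your own hardware, a minimal STREAM-triad-style loop like the sketch below (ours, not the official STREAM benchmark, and certainly not WRF) will show it: it sweeps the OpenMP thread count and reports the memory bandwidth it achieves, and on most multi-core sockets the GB/s figure stops climbing well before you run out of cores. The array size is just an assumption; make it comfortably larger than your caches.

    /* triad.c -- a rough memory-bandwidth sketch, not the official STREAM benchmark */
    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    #define N (50UL * 1000 * 1000)              /* ~400 MB per array in doubles */

    int main(void)
    {
        double *a = malloc(N * sizeof *a);
        double *b = malloc(N * sizeof *b);
        double *c = malloc(N * sizeof *c);
        if (!a || !b || !c) return 1;

        /* first-touch the pages in parallel so they spread across both sockets */
        #pragma omp parallel for
        for (size_t i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

        int max_threads = omp_get_max_threads();
        for (int t = 1; t <= max_threads; t++) {
            omp_set_num_threads(t);
            double t0 = omp_get_wtime();
            #pragma omp parallel for
            for (size_t i = 0; i < N; i++)
                a[i] = b[i] + 3.0 * c[i];       /* triad: two loads, one store */
            double dt = omp_get_wtime() - t0;
            printf("%2d threads: %6.1f GB/s\n", t,
                   3.0 * N * sizeof(double) / dt / 1e9);
        }

        free(a); free(b); free(c);
        return 0;
    }

Build it with something like "gcc -O2 -fopenmp triad.c" and watch where the curve flattens; that plateau is the "too many cars on the highway" point.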

For some of your codes, you can take advantage of all this extra compute power. For instance, if you run something that looks like the Linpack benchmark, you can go from running 4 tasks per socket to 6, and your code will run about 50% faster. But for codes that rely heavily on memory bandwidth, this may not make as much sense. We've included some data from the WRF code below. WRF is the Weather Research and Forecasting code, and is the gold standard among climate codes. It's also known as a code that doesn't deliver anywhere near theoretical floating-point performance on any platform, because it is much more sensitive to memory bandwidth. In the table below, we use same-speed 2.66GHz Nehalem and Westmere processors and increase the number of cores (on a two-socket node, so 8 cores in the Nehalem case and 12 in the Westmere case).

WRF 3.2 performance on a two-socket node (Gflop/s and speedup over one core)

              2.66GHz Westmere           2.66GHz Nehalem
  Cores      Gflop/s    Speedup         Gflop/s    Speedup
      1        2.238      1.00            2.239      1.00
      2        4.478      2.00            4.471      2.00
      4        8.210      3.67            8.176      3.65
      8       13.315      5.95           13.324      5.95
     10       14.918      6.67               --        --
     12       15.336      6.85          19.110*     8.54*

  * The 12-core Nehalem run spans two nodes (see below).



If you look at the performance and speedup, you see that the Nehalem keeps getting faster up to 8 cores per node, but on the Westmere we see diminishing returns beyond 8 cores: a 12% performance improvement for 25% more cores going from 8 to 10, and an even sharper drop-off from 10 to 12, where the code gains only about 2.7% from 20% more cores! For reference, the 12-core Nehalem result, which uses two nodes, is included to show what the performance could have been had the Westmere cores not hit a bottleneck getting data from memory. In other words, for this code the Nehalem's compute power never exceeded the bandwidth available, but the Westmere has so much more compute power that you can create a traffic jam if you run WRF on all cores.
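If you want to see the whole progression, the short program below just reruns that arithmetic on the measured Westmere numbers from the table, computing each gain relative to the previous row (which puts the last step at roughly 2.7-2.8%, depending on which end you divide by).

    /* percent.c -- the arithmetic behind the scaling argument, using the
     * measured Westmere Gflop/s from the table above (no new data) */
    #include <stdio.h>

    int main(void)
    {
        int    cores[]  = {1, 2, 4, 8, 10, 12};
        double gflops[] = {2.238, 4.478, 8.210, 13.315, 14.918, 15.336};
        int n = sizeof cores / sizeof cores[0];

        for (int i = 1; i < n; i++)
            printf("%2d -> %2d cores: %3.0f%% more cores, %5.1f%% more Gflop/s\n",
                   cores[i - 1], cores[i],
                   100.0 * cores[i] / cores[i - 1] - 100.0,
                   100.0 * gflops[i] / gflops[i - 1] - 100.0);
        return 0;
    }

Perfect scaling would keep those two percentages equal; the widening gap after 8 cores is the memory wall showing up in plain numbers.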

So, what does this mean? Well, maybe it tells the WRF developers that they can do a whole lot more computation between memory accesses essentially for free on the new processors. Maybe it says you can run some not-so-memory-intensive jobs alongside your WRF jobs on those extra cores at almost no cost. But perhaps the most important thing it says is that to get maximum throughput nowadays, you shouldn't assume that the best and most efficient configuration is to use every core in every socket for your job. For some kinds of programs you will, for some kinds of programs you won't... but isn't it nice to have all that extra compute power lying around for the times that you need it? Remember, a new Westmere costs about the same as the Nehalem... so you effectively get 50% more compute power for free... if you are judicious about how you use it and don't create traffic jams!

The future of processors seems almost inevitably to be heading from multi-core to many-core. As we move to more and more cores, we need to shift our thinking about what using a core means. Perhaps we'll save some cores to run just the operating system, or just the user interface, and only use some of them for number crunching. As always, it will take the software a while to catch up with the new capabilities of the hardware. While we're waiting for that to happen, just remember that you might have to balance things on your own. So, if you want to maximize your single-job performance, take a little time to test the optimal number of cores per socket for your workload!
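For MPI codes like WRF, one practical way to run that test is a small per-node probe along the lines of the sketch below (again ours, with the triad loop standing in for your real kernel). Launch it on a single node with different task counts, for example mpirun -np 8 and then -np 12, and note where the aggregate bandwidth stops climbing; a memory-bound code is unlikely to profit from cores beyond that point.

    /* probe.c -- a rough per-node bandwidth probe for MPI layouts; the triad
     * loop is a stand-in for whatever your real code does between messages */
    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    #define N (20UL * 1000 * 1000)              /* ~160 MB per array per task */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, ntasks;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

        double *a = malloc(N * sizeof *a);
        double *b = malloc(N * sizeof *b);
        double *c = malloc(N * sizeof *c);
        if (!a || !b || !c) MPI_Abort(MPI_COMM_WORLD, 1);
        for (size_t i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

        MPI_Barrier(MPI_COMM_WORLD);            /* start everyone together */
        double t0 = MPI_Wtime();
        for (size_t i = 0; i < N; i++)
            a[i] = b[i] + 3.0 * c[i];           /* each task streams its own arrays */
        double dt = MPI_Wtime() - t0;

        double slowest;                         /* a job runs at the pace of its slowest task */
        MPI_Reduce(&dt, &slowest, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("%d tasks: %.1f GB/s aggregate\n", ntasks,
                   3.0 * N * sizeof(double) * ntasks / slowest / 1e9);

        free(a); free(b); free(c);
        MPI_Finalize();
        return 0;
    }

Build it with mpicc; on a Westmere node like the one in the table, we would expect the jump from 8 to 12 tasks to buy very little extra aggregate bandwidth, which is exactly the pattern WRF shows above.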

Dan Stanzione, Ph.D.
Deputy Director
Texas Advanced Computing Center
The University of Texas at Austin

Dr. Stanzione is the deputy director of the Texas Advanced Computing Center (TACC) at The University of Texas at Austin. He is the principal investigator (PI) for several projects including “World Class Science through World Leadership in High Performance Computing;” “Digital Scanning and Archive of Apollo Metric, Panoramic, and Handheld Photography;” “CLUE: Cloud Computing vs. Supercomputing—A Systematic Evaluation for Health Informatics Applications;” and “GDBase: An Engine for Scalable Offline Debugging.”

In addition, Dr. Stanzione serves as Co-PI for “The iPlant Collaborative: A Cyberinfrastructure-Centered Community for a New Plant Biology,” an ambitious endeavor to build a multidisciplinary community of scientists, teachers and students who will develop cyberinfrastructure and apply computational approaches to make significant advances in plant science. He is also a Co-PI for TACC’s Ranger supercomputer, the first of the “Path to Petascale” systems supported by the National Science Foundation (NSF), deployed in February 2008.

Prior to joining TACC, Dr. Stanzione was the founding director of the Fulton High Performance Computing Institute (HPCI) at Arizona State University (ASU). Before ASU, he served as an AAAS Science Policy Fellow in the Division of Graduate Education at NSF. Dr. Stanzione began his career at Clemson University, his alma mater, where he directed the supercomputing laboratory and served as an assistant research professor of electrical and computer engineering.

Dr. Stanzione's research focuses on such diverse topics as parallel programming, scientific computing, Beowulf clusters, scheduling in computational grids, alternative architectures for computational grids, reconfigurable/adaptive computing, and algorithms for high performance bioinformatics. He is a strong advocate of engineering education, facilitates student research, and teaches specialized computation engineering courses.

Education
Ph.D., Computer Engineering, 2000; M.S., Computer Engineering, 1993; B.S., Electrical Engineering, 1991, Clemson University.
Tommy Minyard, Ph.D.
Director of Advanced Computing Systems
Texas Advanced Computing Center
The University of Texas at Austin

Dr. Minyard is the director of the Advanced Computing Systems group at the Texas Advanced Computing Center (TACC) at The University of Texas at Austin. His group is responsible for operating and maintaining the center’s production systems and infrastructure; ensuring world-class science through HPC leadership; enhancing HPC research using clusters; performing fault tolerance for large-scale cluster environments; and conducting system performance measurement and benchmarking.

Dr. Minyard holds a doctorate in Aerospace Engineering from The University of Texas at Austin, where he specialized in developing parallel algorithms for simulating high-speed turbulent flows with adaptive, unstructured meshes. While completing his doctoral research in aerospace engineering, Dr. Minyard worked at the NASA Ames Research Center and the Institute for Computer Applications in Science and Engineering. After continuing his research at UT Austin as a postdoctoral research assistant, he joined CD-Adapco as a software development specialist to continue his career in computational fluid dynamics. Dr. Minyard returned to UT Austin in 2003 to join the Texas Advanced Computing Center.

Education

Ph.D., Aerospace Engineering, 1997; M.S., Aerospace Engineering, 1993; B.S., Aerospace Engineering, 1991, The University of Texas at Austin.