I think it’s fairly safe to say that PetaFLOPS class systems will be fairly straightforward to build in the next three years or so. The processing part of the problem, whether it comes from CPUs or a hybrid of CPUs and GPUs, is gaining a fair amount of momentum, and in 3 years we should have enough compute horsepower to build PFLOPS class systems that are not considered “hero” systems. In other words, they will become commonplace. However, just like anything, the picture is perhaps not as rosy as one would expect. There are some unique challenges to reaching PFLOPS that we need to discuss.

In this article I’m going to discuss three topics surrounding PetaFLOPS:

1. Power
2. Programming
3. Failure Rate

These three topics encompass not all, but a good portion, of the issues surrounding PFLOPS class systems today.

Power

One of the biggest impediments to achieving PetaFLOPS is power. In part 2 of this article series (http://www.delltechcenter.com/page/02-24-2010+-PetaFLOPS+for+the+Common+Man+Pt+2%E2%80%93+What+do+current+PetaFLOPS+systems+look+like%3F) you saw how the two current systems require a great deal of power. Jaguar, the reigning #1 system on the Nov. 2009 Top500, needs about 6.95 MW of power. Roadrunner, the #2 system, requires 2.35 MW of power. This is an enormous amount of electrical power for the system alone, and it results in an enormous cooling load. One safe assumption is that it requires 1 W of cooling for every watt of system power. That means Jaguar needs about 13.9 MW of total power and Roadrunner needs about 4.7 MW of total power. I don’t know about you, but to me that is a great deal of power.

Hybrid systems don’t make it too much easier today. The TH-1 system discussed in the last article (http://www.delltechcenter.com/page/PetaFLOPS+for+the+Common+Man-+Pt+4+What+could+Hybrid+PetaFLOPS+systems+look+like) uses about 1.48 MW of power to reach about ½ PFLOPS on the Top500 benchmark. This, too, is a great deal of power.
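
The total power figures quoted for Jaguar and Roadrunner follow from adding the cooling load to the system power. Here is a minimal Python sketch of that arithmetic, assuming the rule of thumb of 1 W of cooling per watt of system power. The function and variable names are made up for illustration, and the TH-1 total is not stated in the article; it simply applies the same assumption.

```python
# Rule of thumb used in the article: 1 W of cooling for every 1 W of system power.
# This ratio is an assumption; real facilities will differ.
COOLING_WATTS_PER_SYSTEM_WATT = 1.0


def total_power_mw(system_power_mw, cooling_ratio=COOLING_WATTS_PER_SYSTEM_WATT):
    """Return system power plus cooling power, in megawatts."""
    return system_power_mw * (1.0 + cooling_ratio)


# System power figures (MW) quoted in the article.
systems_mw = {"Jaguar": 6.95, "Roadrunner": 2.35, "TH-1": 1.48}

for name, power_mw in systems_mw.items():
    total = total_power_mw(power_mw)
    print(f"{name}: {power_mw:.2f} MW system, {total:.2f} MW with cooling")
```

Changing `cooling_ratio` shows how sensitive the facility total is to cooling efficiency.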
But let’s look at perhaps a better measure: MFLOPS/W, the amount of computational performance (MFLOPS) delivered per watt of electrical power.

Jack Dongarra, one of the fathers of the Top500, gave a presentation in Hong Kong (http://www6.cityu.edu.hk/cityu25/events/engineering/pdf/profdongarra.pdf) where he examined the top 10 systems on the Top500. A quick summary is here:
Carefully examine the last column. The performance per watt is lower for CPU only systems than for Hybrid systems, with the exception of IBM BlueGene systems (one could make the argument that BlueGene systems are not normal CPU only systems using commodity processors but specialized systems focused on HPC).

There is a list of the most power efficient systems called the Green500 (http://www.green500.org). If you look at the latest list (http://www.green500.org/lists/2009/11/top/list.php) you will see that the more efficient systems are hybrid systems until you reach #9, which is a BlueGene system. The #1 system is about twice as energy efficient as the #9 system (722.98 MFLOPS/W versus 378.77 MFLOPS/W). The most power efficient CPU based system that is not a BlueGene is #24 at 341.32 MFLOPS/W.

I think this speaks volumes about the appeal of Hybrid systems for reaching higher levels of performance. It also illustrates IBM’s approach of using a system specifically designed for HPC (BlueGene) and what that can do for power efficiency.

As an experiment, let’s take the #252 system on the Top500 list. This system uses Intel Harpertown class processors at 2.0 GHz and Infiniband (the system is at the University of Oklahoma). It has 4,176 processors, or 522 nodes. That system currently uses 361.61 kW of power and achieves about 28.03 TFLOPS. The energy efficiency equates to 80.87 MFLOPS/W.

Now let’s assume we are running at the power efficiency of TH-1 (380 MFLOPS/W). With this power efficiency, and the same power budget, we could achieve the following performance.

Performance = 361,610 W * 380 MFLOPS/W * (1 TFLOPS / 1,000,000 MFLOPS) = 137.41 TFLOPS

So this is about 14% of 1 PFLOPS (0.137 PFLOPS).
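
The projection above is just power multiplied by efficiency, with a unit conversion. A small Python sketch of that calculation follows; the function and constant names are illustrative, and the inputs are the University of Oklahoma power draw and the TH-1 efficiency quoted in the text.

```python
MFLOPS_PER_TFLOPS = 1_000_000
TFLOPS_PER_PFLOPS = 1_000


def projected_tflops(power_watts, mflops_per_watt):
    """Performance (TFLOPS) reachable within a power budget at a given efficiency."""
    return power_watts * mflops_per_watt / MFLOPS_PER_TFLOPS


OKLAHOMA_POWER_W = 361_610   # power draw of the #252 system, from the article
TH1_MFLOPS_PER_W = 380       # TH-1 efficiency, from the article

tflops = projected_tflops(OKLAHOMA_POWER_W, TH1_MFLOPS_PER_W)
fraction_of_pflops = tflops / TFLOPS_PER_PFLOPS
print(f"{tflops:.2f} TFLOPS ({fraction_of_pflops:.1%} of 1 PFLOPS)")
# prints: 137.41 TFLOPS (13.7% of 1 PFLOPS)
```

Swapping in a different MFLOPS/W value from the Green500 list shows how far a given efficiency gets you within the same power budget.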
Obviously we have to get better power efficiencies to reach 1 PFLOPS.

Now let’s take the #1 system on the Green500 and determine what performance we could achieve.

Performance = 361,610 W * 722.98 MFLOPS/W * (1 TFLOPS / 1,000,000 MFLOPS) = 261.44 TFLOPS

We’re getting closer, 0.261 PFLOPS, but we are still short. Now, let’s reverse the problem and determine what energy efficiency we need to achieve 1 PFLOPS.

Efficiency = 1 PFLOPS * 1,000,000,000 (MFLOPS/PFLOPS) / 361,610 W = 2,765.41 MFLOPS/W

So efficiency has to improve by a factor of about 3.8 over the #1 system on the Green500 (2,765.41 / 722.98) to reach 1 PetaFLOPS. So what does this mean?

What this means is that the average data center today, as represented by the University of Oklahoma, does not have enough power to reach 1 PFLOPS. While it appears to be easy to actually construct a PFLOPS system, as the evidence in the last articles points out, it’s another thing to actually power one in today’s data center. To achieve PetaFLOPS in “average” data centers we will need to improve our energy efficiency by about a factor of 3.8.

Programming

The second challenging aspect of PetaFLOPS systems is programming for them. I’m sure if you are reading this then you have either written or used HPC applications. One of the difficult problems many applications face is scaling, that is, getting performance to keep improving as you increase the number of cores the application uses. Not all applications face this problem, but if you look at the information that IDC has collected, users are telling them that 82.1% of the applications only scale to 32 cores (http://www.google.com/url?sa=t&source=web&ct=res&cd=2&ved=0CAkQFjAB&url=http%3A%2F%2Fwww.hpcuserforum.com%2Fpresentations%2FLondon%2FSTEVE%2520London%2520HPC%2520UF%25202008%2520slides%252010.6.2008.ppt&rct=j&q=IDC%2C+HPC+Application+scaling&ei=LhaxS-jxDMP7lwe244WRAQ&usg=AFQjCNHsol-jX3noMo9tgV5AkCQFv4-vlQ).
How do we take applications that only scale to 32 cores and expand them to thousands of cores?

On the other hand, hybrid systems look to be a great way to achieve huge performance and save power and floor space at the same time. However, programming GPUs is not as easy as it seems. In this article (http://www.linux-mag.com/id/4543) some of the difficulties in programming GPUs are discussed. Fundamentally they require a different programming model than normal CPUs.

Companies such as Nvidia and AMD are doing a good job in providing tools for programming GPUs, particularly Nvidia, which has been providing CUDA for quite some time. However, you still have to rewrite or “port” parts of your code to use the GPUs.

The end result of both approaches (CPU only PetaFLOPS systems and Hybrid PetaFLOPS systems) is that applications will have to be rewritten to take advantage of systems of this size. A quick summary would be:

· Easy to program CPU based PetaFLOPS systems, but hard to scale applications
· Harder to program Hybrid PetaFLOPS systems

For either approach, it will take work.

Failure Rate

The last topic I want to cover is perhaps one that people don’t always consider: the failure rate in large systems. “Failure” can mean many things depending upon the situation. It can mean that a node in the system has gone unresponsive, forcing a reboot. It can mean that a hardware component within a node has failed and needs to be replaced or fixed. It can also mean a software failure. For any of these failures, the result is that any application that was utilizing the failed node(s) will also fail. The rate at which components or applications fail is a very important measure to consider.

For example, let’s look at Jaguar, the current #1 system on the Top500 (as of Nov. 2009). According to Jeff Vetter at Oak Ridge National Laboratory, Jaguar has the following failure statistics:
The mean time between interruptions on the system is only 32 hours. This includes both hardware and software failures; basically, it is the amount of time an application can expect to run without a failure. Moreover, there is a hardware failure somewhere on the system every 56 hours (a little more than 2 days).

It has been known for some time that failure rates are a very important aspect of system design. These failure rates can be used to extrapolate or understand what kind of failure rates future PetaFLOPS systems could have. The best source of data and methods for failure rates is the Petascale Data Storage Institute (PDSI) at Carnegie Mellon University (http://www.pdsi-scidac.org/).

The PDSI was established with one aspect of its charter being to collect failure rate data from very large HPC and non-HPC sites. To date the Institute has collected failure data from 30 systems at 6 sites. Some of these sites are HPC sites and some are Internet Service Providers (ISPs). The data from the 30 systems listed the failure history of various components in the systems. In total, the data covered over 100,000 disk drives from at least 4 different vendors, for time periods ranging from 1 month to 5 years, and for SATA, SCSI, and FC drives.

One of the challenges in analyzing the data was the definition of what constitutes a "failure" versus a "replacement." In reality, if the user considered the drive to be "failed" for whatever reason, they pulled the drive out and replaced it with a new one. This impacts the usability of the systems, so it really constitutes a "failure."

Garth Gibson and Bianca Schroeder at PDSI have examined the data that has been collected and made some interesting observations (http://www.cse.ohio-state.edu/~lai/icpp-2007/Gibson-PDSI-ICPP07-keynote.pdf).
Figure 1 below, courtesy of Garth Gibson at PDSI, shows a descending list of replaced components for 3 of the systems (HPC1 is a 765 node HPC cluster with 4-way SMP nodes and 5 years worth of data on 10k rpm SCSI drives, COM1 is an ISP with 26,734 SCSI drives, and COM2 is an ISP with up to 9,232 servers and 39,039 SCSI drives).

Figure 1 - Top Ten Replaced Components

Notice that disk drives are among the most frequently replaced hardware components, reaching almost 50% of hardware failures in the COM2 data (recall that COM2 used SCSI drives). They also examined all of the data for disk replacement rates. Figure 2, again courtesy of Garth Gibson at PDSI, is a plot of the Annual Disk Replacement Rate (ARR) for 9 of the systems that span SCSI, SATA, and FC drives.

Figure 2 - Annual Disk Replacement Rate

Notice that the average ARR for the data is 3%. But the ARR derived from the disk manufacturers’ data is between 0.58% and 0.88% (the drive manufacturers state the Mean Time To Failure (MTTF) is between 1,000,000 and 1,500,000 hours). The ARR from the data indicates an MTTF that is 2 to 10 times lower than the manufacturers’ number. In this study, there is little evidence for the commonly held belief that SATA failure rates are higher than SCSI or FC. If anything, the SATA drives had a significantly lower failure rate than some SCSI (HPC6) or FC (COM3) drives.

The researchers then looked at a set of 23 clusters at Los Alamos that have over 5,000 nodes, covering 9 years with 23,000 events resulting in application interruption. Los Alamos has a wide variety of clusters, some with the more typical 2/4 socket SMP boxes in commodity clusters with 100's to 1000's of nodes, and some with NUMA systems that have 128-256 processors per node and only 10's of nodes. This distribution of machines allowed the researchers to see if the application interruption rate followed some sort of pattern.
At first blush, there wasn't much of a pattern until the number of failures was normalized by the number of processors. Figure 3, from Garth Gibson at PDSI, is a chart of the failures per year per processor for the various machines.

Figure 3 - Failures per Year per Processor

The failures lie between a rate of 0.1 and 0.25 per year per processor regardless of the configuration of the machine. Since the failure rate of the CPU itself is very small, the failures can be assumed to be tied to sockets rather than processors.

These failure rates can be used to project how much time there will be between application interruptions, since the failure rate is a function of the number of sockets. This is particularly important because the number of sockets is growing quickly to reach the elusive PetaFLOPS level of performance. The first step is to plot the general trend of the number of sockets over time. Figure 4, below, courtesy of Garth Gibson, plots the number of sockets as a function of the year for three different assumptions: that the number of sockets doubles every 18, 24, or 30 months (this model tracks the general trend in the Top500, where the peak FLOPS doubles annually).

Figure 4 - Trend of Number of Sockets

Assuming a very optimistic failure rate of 0.1 per year per socket (compared to the historic 0.25), the Mean Time to Interruption (MTTI) can then be computed. Figure 5 below, courtesy of Garth Gibson at PDSI, plots the MTTI for the three models of the growth of the number of sockets (18, 24, and 30 months).

Figure 5 - MTTI (min.) Trend

Even in the year 2006, the MTTI was only about 480 minutes (8 hours). In 2008, the models vary from 260 minutes (4.3 hours) to about 325 minutes (5.42 hours). In 2010 (now), the MTTI ranges from 110 minutes (1.83 hours) to 210 minutes (3.5 hours).
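
The failure-rate arithmetic in this section is simple enough to check directly. The Python sketch below does two things: it converts a vendor MTTF into an approximate annual replacement rate, and it estimates MTTI from a socket count and a per-socket failure rate. The function names are illustrative, and the socket counts are hypothetical values chosen only to show the trend. The article does not give the socket counts behind Figure 5, so this does not reproduce the figure's numbers.

```python
HOURS_PER_YEAR = 365 * 24          # 8,760
MINUTES_PER_YEAR = 365 * 24 * 60   # 525,600


def annual_replacement_rate(mttf_hours):
    """Approximate fraction of drives replaced per year for a given MTTF."""
    return HOURS_PER_YEAR / mttf_hours


def mtti_minutes(sockets, failures_per_socket_year):
    """Mean time to interruption for a whole system, in minutes.

    Assumes each socket contributes interruptions at the same constant
    rate, so the system rate is simply sockets * per-socket rate.
    """
    interruptions_per_year = sockets * failures_per_socket_year
    return MINUTES_PER_YEAR / interruptions_per_year


# Vendor MTTF range quoted in the article.
for mttf in (1_000_000, 1_500_000):
    print(f"MTTF {mttf:,} h -> ARR {annual_replacement_rate(mttf):.2%}")

# Hypothetical socket counts (NOT from the article), at the optimistic
# and historic per-socket failure rates quoted in the text.
for sockets in (10_000, 20_000, 40_000):
    for rate in (0.1, 0.25):
        minutes = mtti_minutes(sockets, rate)
        print(f"{sockets:,} sockets at {rate}/yr -> MTTI {minutes:.0f} min")
```

Under this model, doubling the socket count halves the MTTI, which is why the projected curves fall so quickly.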
The fundamental reason for the decrease in MTTI is the increase in the number of sockets.

Comparing the projections to the Jaguar numbers, the model is a bit pessimistic, but it does point out that as systems grow larger the MTTI decreases fairly dramatically. For users of PetaFLOPS systems, the failure rate really translates into “how long can I expect to run before my application crashes.” This discussion points out that as systems grow in terms of number of processors (really number of sockets), the MTTI decreases rapidly. But does this spell doom for PetaFLOPS systems?

The secret weapon in the “fight” against MTTI is the number of cores per socket. We now have 12-core processors and quad-socket motherboards, allowing us to easily fit 48 cores in a simple 4-socket server. So for the same number of sockets as when the MTTI study was completed (2007), we now have about 3-6 times as many cores! This means our computational power has increased by a factor of 3-6 but the MTTI has stayed the same.

Summary

This article discusses some of the challenges in reaching PetaFLOPS for the Common Man (or common data center). The three concepts presented are:

1. Powering a PetaFLOPS system
2. Application scaling/programming a PetaFLOPS system
3. Failure rates in PetaFLOPS systems

The first challenge, powering a PetaFLOPS class system, points out that it fundamentally takes a great deal of electrical power to run one. The current number one and two systems on the Top500 require megawatts of power just to run the system. The average data center does not have this kind of power available. We took a typical data center in the middle of the Nov. 2009 Top500 list, from the University of Oklahoma, and used it to help us define how much more energy efficient systems would have to become to achieve one PFLOPS.
The answer came back that future computational systems would have to be roughly 3.8 times more energy efficient than the current most energy efficient system (2,765.41 MFLOPS/W versus 722.98 MFLOPS/W) to make PetaFLOPS a reality for the average data center. This is going to be a very difficult target to hit, but the general trend toward Hybrid systems could get us close to it.

The second challenge, programming a PetaFLOPS beast, is also a very difficult mountain to climb. IDC data indicates that average users have applications that only scale to 32 cores. Yet CPU only PetaFLOPS scale systems will have tens of thousands or even hundreds of thousands of cores. Writing or adapting (rewriting) applications to scale to such a large number of cores is definitely a challenge.

Hybrid systems also present a challenge because GPUs are fundamentally “different” from normal CPUs. Consequently you will have to rewrite your application to use them effectively. So again, we are faced with having to rewrite our applications to scale to PetaFLOPS.

The third challenge, failure rates, illustrates that the amount of time between interruptions on a PetaFLOPS system (the amount of time between some sort of system problem causing an application to fail) is not as large as one would think. The current #1 system in the world only has an MTTI (Mean Time to Interruption) of about 32 hours.

A simple model of failure rates, which is a function of the number of sockets in a system, illustrated that as systems grow, the MTTI is going to constantly decrease to the point where MTTI is measured in minutes. However, the secret weapon we can use to combat MTTI is increasing the number of cores per socket. This allows us to grow systems without really decreasing our MTTI.

The challenges are great, but if you watch system development over the last few years, there is definitely hope for PetaFLOPS scale systems in the common data center.

-- Dr. Jeff Layton