A #Dell colleague, Dave Keller (@DaveKatDell), alerted me to a YouTube video featuring Vint Cerf, a founding father of the Internet and current Chief Internet Evangelist at Google.
In that video, Vint Cerf explains that devices connected to the Internet are given Internet addresses like phones are given phone numbers. That address, known as an IP address, is usually represented as a grouping of 4 numbers separated by dots, such as 192.168.0.1. This is the default IP address of many Netgear routers such as might be used in your home.
Each of those numbers in Version 4 of the Internet Protocol (IPv4) can be a number from 0 to 255. This means there are 256 choices for each of the 4 numbers.
256 * 256 * 256 * 256 = 4,294,967,296
So, there are about 4.3 billion addresses available for devices to connect to the Internet. In 1980, there were only about 227 million people in the entire United States. 4.3 billion sounds like plenty!
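The arithmetic above is easy to check. Here is a minimal sketch using Python's standard `ipaddress` module; the sample address is the router default mentioned earlier, and the rest is just the multiplication written out.

```python
import ipaddress

# Four fields, each with 256 possible values (0 to 255).
total_ipv4 = 256 ** 4
print(total_ipv4)                      # 4294967296

# The same count, expressed as a 32-bit address space.
assert total_ipv4 == 2 ** 32

# The standard library reports the same size for the whole IPv4 range.
assert ipaddress.ip_network("0.0.0.0/0").num_addresses == total_ipv4

# A dotted-quad address is a 32-bit integer written as four 8-bit pieces.
router = ipaddress.IPv4Address("192.168.0.1")
print(int(router))                     # 3232235521
print(router.is_private)               # True (a private-network address)
```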
But how many do you use today? Cell phone? Laptop or Tablet? Home computer? Work computer? Home Internet router? TV?
I just named 6 possible ones. Without going into private networks, etc., I think it is safe to say that when you are connected to the Internet, you are using an IP address.
OK. So what’s the big deal? China has over a billion people. India has over a billion people. And according to Vint Cerf in that same video, there are over 5 billion mobile devices in the world today. According to Government Technology (http://del.ly/6046XoZ8), in 2020 there will be 50 billion Internet-enabled devices in the world. To put that number in perspective, that equates to more than 6 connected devices per person. Oops!
But don’t worry. Internet Protocol Version 6 (IPv6) is rolling out. China is actually taking a lead in this. Imagine why.
4.3 billion sounded big. Just how big is IPv6? Almost too big to explain or even to comprehend. It is well over one trillion times as large as IPv4. Or, with IPv6 those 4.3 billion addresses available from IPv4 are available to each and every person alive. I can almost understand and appreciate that. But in fact, it’s much larger: over a trillion-trillion-trillion total addresses. Or for the nerds out there, 2^128, which is about 3.4 x 10^38 addresses.
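For anyone who wants to see those sizes side by side, here is a small sketch. IPv4 addresses are 32 bits and IPv6 addresses are 128 bits; the world population figure is an assumption used only for illustration.

```python
ipv4_total = 2 ** 32      # 32-bit addresses
ipv6_total = 2 ** 128     # 128-bit addresses

# How many times larger the IPv6 space is than the IPv4 space.
ratio = ipv6_total // ipv4_total
assert ratio == 2 ** 96

print(f"IPv6 addresses: {ipv6_total:.3e}")
print(f"IPv6 / IPv4:    {ratio:.3e}")

# "A trillion-trillion-trillion" is 10**36; IPv6 is comfortably larger.
assert ipv6_total > 10 ** 36

# Assumed world population of 8 billion, for illustration only.
assumed_population = 8 * 10 ** 9
per_person = ipv6_total // assumed_population
assert per_person > ipv4_total   # far more than one whole IPv4 Internet each
```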
And according to Paul Gil over at About.com “These trillions of new IPv6 addresses will meet the internet demand for the foreseeable future.” I certainly hope that is an understatement!
If you have comments or can contribute additional information, please feel free to do so. Thanks. --Mark R. Fernandez, Ph.D.
Follow me on Twitter @MarkFatDell
by Garima Kochhar
It’s been an exciting week – Intel Haswell processors for two-socket servers, DDR4 memory and new Dell servers were just released. We’ve had a busy few months leading up to this announcement – our team had access to early server units for the HPC lab and we spent time kicking the tires, running benchmarks, and measuring performance. This blog describes our study and initial results and is part one of a three part series. The next blog will discuss the performance implications of some BIOS tuning options available on the new servers, and a third blog will compare performance and energy efficiency across different Haswell processor models.
Focusing on HPC applications, we ran two benchmarks and four applications on our server. Our interest was in seeing how the server performed and specifically how it compared to the previous generations.
The server in question is part of Dell’s PowerEdge 13th generation (13G) server line-up. These servers support DDR4 memory at up to 2133 MT/s and Intel’s latest Xeon® E5-2600 v3 Product Family processors (based on the architecture code-named Haswell). Haswell (HSW) is a net new micro-architecture when compared to the previous generation - Sandy Bridge/Ivy Bridge. HSW processors use a 22nm process technology, so there’s no process-shrink this time around. Note the “v3” in the Intel product name – that is what distinguishes a processor as one based on Haswell micro-architecture. You’ll recall that “E5-2600 v2” processors are based on the Ivy Bridge micro-architecture and plain E5-2600 series with no explicit version are Sandy Bridge based processors. Haswell based processors require a new server/new motherboard and DDR4 memory. The platform we used is a standard dual-socket rack server with two Haswell-EP based processors. Each socket has four memory channels and can support up to 3 DIMMs per channel (DPC). For our study we used 1 DPC for a total of eight DDR4 DIMMs in the server.
From an HPC point of view, one of the most interesting aspects is the Intel® AVX2 technology that allows the processor to execute 16 FLOP per cycle. The processor supports 256-bit registers, allows three-operand non-destructive operations (i.e. A = B+C vs. A = A+B), and a Fused Multiply-Add (FMA) instruction (A = A*B+C). The processor has two FMA units, each of which can execute 4 double precision calculations per cycle. With two floating point operations per FMA instruction, HSW can execute 16 FLOP/cycle. This value is double what was possible with Sandy Bridge/Ivy Bridge (SB/IVB)! There are many more instructions introduced with HSW and Intel® AVX2 and these are described in detail in this Intel programming reference or on other blogs.
Double the FLOP/cycle - does this mean that HSW will have 2x the theoretical performance of an equivalent IVB processor? Close but not quite - read on. In past generations, we've looked at the rated base frequency of a processor and the available Turbo bins/max Turbo frequency. For example, the Intel® Xeon® E5-2680 v2 has a base frequency of 2.8 GHz and a maximum of 300 MHz of turbo available when all cores are active. HSW processors will consume more power when running the new Intel® AVX2 instructions than when running non-AVX instructions. And so, starting with Haswell product family there will be two rated base frequencies provided. The first is the traditional base frequency, which is the frequency one could expect to run non-AVX workloads. The second frequency is the base frequency for workloads that are running AVX code, the “AVX base frequency”. For example, the HSW Xeon® E5-2697 v3 has a base frequency of 2.6 GHz and an AVX base of 2.2 GHz. Compare that with the Xeon® E5-2680 v2 IVB processor running at 2.6 GHz. For the 2.6 GHz IVB processor, we would calculate HPL theoretical maximum performance as (2.6 GHz * 8 FLOP/cycle * total number of cores). But for a HSW processor with the same rated base frequency of 2.6 GHz and an AVX base of 2.2 GHz, we now calculate HPL theoretical maximum using the AVX base as (2.2 GHz * 16 FLOP/cycle * total number of cores) since HPL is an AVX-enabled workload. In terms of FLOPs an HSW processor will perform much better than an IVB, close to 2x but not exactly 2x due to the lower AVX base frequency. Of course, enabling Turbo mode can allow higher core frequencies when there is power/thermal headroom. But due to the extra power consumption of AVX instructions, non-AVX codes/portions of the code may run at higher Turbo bins than the AVX portions.
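The calculation described above can be sketched in a few lines. The function names are ours, the frequencies and FLOP/cycle figures are the ones quoted in the text, and the 28-core total assumes a dual-socket server with two 14-core E5-2697 v3 processors.

```python
def flop_per_cycle(fma_units, doubles_per_unit, flop_per_fma=2):
    """Double-precision FLOP per cycle per core."""
    return fma_units * doubles_per_unit * flop_per_fma

def peak_gflops(freq_ghz, flop_cycle, cores):
    """Theoretical peak in GFLOPS: frequency * FLOP/cycle * cores."""
    return freq_ghz * flop_cycle * cores

hsw_flop = flop_per_cycle(fma_units=2, doubles_per_unit=4)   # 16, as in the text
ivb_flop = 8                                                 # SB/IVB figure from the text

# Per-core comparison at the frequencies quoted above.
hsw_core = peak_gflops(2.2, hsw_flop, cores=1)   # AVX base frequency
ivb_core = peak_gflops(2.6, ivb_flop, cores=1)   # rated base frequency
print(f"HSW/IVB per-core ratio: {hsw_core / ivb_core:.2f}")

# Dual-socket server, 14 cores per socket (assumed configuration).
print(f"HSW server peak: {peak_gflops(2.2, hsw_flop, cores=28):.1f} GFLOPS")
```

Using the AVX base rather than the rated base is what keeps the per-core ratio below a clean 2x.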
The goal of communicating a separate “AVX base frequency” is two-fold – AVX codes will run “hotter” (consume more power) than non-AVX, so lower frequencies on those codes are expected and are by design. Secondly, by providing this secondary “AVX base frequency” Intel is providing a baseline expected frequency for highly optimized AVX workloads.
There are many other significant new aspects in this release – DDR4 memory technology, improvements in Dell server design and energy efficiency, and new features in systems management to name a few. All of these aspects are also factors in improvements in server performance in this generation. These factors are not discussed in this blog.
Now, getting down to the fun part - the results. Table 1 below details the applications we used. (Click on images to enlarge.)
Table 1 - Applications and benchmarks
For reference, Table 2 describes the test configuration on the new 13G server. Data for some of the tests on the previous generation systems was gathered with the most current versions available at that time. The performance improvements noted here are mainly due to architectural improvements generation-over-generation; the software versions are not a significant factor.
Table 2 - Server configuration
All the results shown here are based on single-server performance. The following metrics were used to compare performance:
Figure 1 shows the measured memory bandwidth as reported by Stream Triad on the 13G server when compared to previous generations. In the graphs below, “11G – WSM” denotes the Dell 11th generation (11G) servers that support Intel Westmere (WSM), i.e. Intel® Xeon® X5600 series processors. “12G – SB” and “12G – IVB” are the Dell 12th generation servers (12G) that support Intel Sandy Bridge and Ivy Bridge processors. Full system memory bandwidth is ~ 116 GB/s on the 2133 MT/s HSW system. This is an 18% improvement over the 12G-IVB system that could support memory speeds of up to 1866 MT/s. Even with the increased number of cores on HSW, memory bandwidth per core has remained mostly constant from the previous generation and is ~4.2 GB/s per core.
Figure 1 - Memory bandwidth
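To put the measured Triad number in context, here is a rough sketch of the theoretical peak for this memory configuration. It assumes standard 64-bit (8 bytes per transfer) DDR channels, all eight channels running at 2133 MT/s, and 28 cores in the server; the helper name is ours.

```python
def ddr_peak_gbs(mt_per_s, channels_per_socket, sockets, bytes_per_transfer=8):
    """Theoretical peak bandwidth in GB/s for 64-bit (8-byte) DDR channels."""
    return mt_per_s * 1e6 * bytes_per_transfer * channels_per_socket * sockets / 1e9

peak = ddr_peak_gbs(2133, channels_per_socket=4, sockets=2)
measured = 116.0          # GB/s, Stream Triad figure quoted above
cores = 28                # assumes two 14-core processors

print(f"theoretical peak: {peak:.1f} GB/s")
print(f"Triad / peak:     {measured / peak:.0%}")
print(f"per core:         {measured / cores:.2f} GB/s")
```

With the rounded 116 GB/s figure the per-core value comes out slightly below the ~4.2 GB/s quoted above, which was presumably computed from the unrounded measurement.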
Figure 2 shows the HPL performance generation-over-generation. With the HSW system we measured close to 1 TFLOPS on a single server! This improvement in HPL performance is due to the increase in floating point capability. Note that the HPL efficiency of 93% for the HSW system is computed using the AVX base frequency and not the rated frequency as discussed above. For the processor used, the AVX base is 2.2 GHz.
Figure 2 - HPL performance
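The choice of frequency matters when quoting HPL efficiency. The sketch below shows how the same result looks against the AVX base and against the rated base. The sustained value is implied from the 93% figure rather than taken from a measurement, and the 28-core count is an assumption about the test server.

```python
def hpl_efficiency(measured_gflops, freq_ghz, flop_per_cycle, cores):
    """Fraction of theoretical peak achieved by a measured HPL result."""
    return measured_gflops / (freq_ghz * flop_per_cycle * cores)

cores = 28                      # two 14-core processors (assumed configuration)
avx_peak = 2.2 * 16 * cores     # GFLOPS at the AVX base frequency

# Sustained performance implied by 93% of the AVX-base peak (not a measured value).
implied = 0.93 * avx_peak

print(f"AVX-base peak:  {avx_peak:.1f} GFLOPS")
print(f"implied result: {implied:.1f} GFLOPS")
print(f"vs AVX base:    {hpl_efficiency(implied, 2.2, 16, cores):.1%}")
print(f"vs rated base:  {hpl_efficiency(implied, 2.6, 16, cores):.1%}")
```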
Figure 3 shows Ansys Fluent performance on 13G HSW when compared to 12G IVB. The Solver Rating as reported by Fluent is the metric used for performance. The 12G system used was a PowerEdge C6220 II with dual Intel® Xeon® E5-2680 v2 @ 2.8 GHz (10 cores each), 128 GB (8 x 16GB 1866MHz) memory. The 13G system was configured as described in Table 2. Note that the all-core Turbo on both the 12G IVB and the 13G HSW system is 3.1 GHz, so this is a good comparison across the two generations. There is a significant performance improvement with HSW – 33% to 48% depending on the data set used. Since Ansys Fluent has a per-core license, we wanted to weight this performance with the increase in number of cores from 12G IVB to 13G HSW (20 cores to 28 cores, a 40% increase in cores). The per core performance improvement depends on the data set used, with truck_poly_14m demonstrating a 6% performance improvement with 13G even when accounting for the increased number of cores.
Figure 3 - Ansys Fluent performance
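The per-core weighting for Fluent can be sketched as follows. The helper is ours; it simply divides the overall speedup by the ratio of core counts. The 33% and 48% inputs are the two ends of the range quoted above, and the upper end appears to line up with the roughly 6% per-core gain reported for truck_poly_14m.

```python
def per_core_gain(speedup_pct, old_cores, new_cores):
    """Percent change in per-core performance given an overall percent speedup."""
    overall = 1 + speedup_pct / 100
    core_ratio = new_cores / old_cores
    return (overall / core_ratio - 1) * 100

for speedup in (33, 48):
    gain = per_core_gain(speedup, old_cores=20, new_cores=28)
    print(f"{speedup}% overall -> {gain:+.1f}% per core")
```

A data set at the low end of the range ends up slightly slower per core, which is why the per-core result depends on the data set.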
Figure 4 shows WRF performance generation-over-generation for the Conus 12km data set. The average time step computed over the last 149 intervals for Conus 12km is the metric used for performance. There is a 40% improvement from 12G-IVB and a 3.2x improvement over the Westmere platform. The improvement with HSW is likely due to the better memory throughput and micro-architecture enhancements.
Figure 4 - WRF Conus 12km performance
MILC performance is shown in Figure 5. Total time as reported by the MILC application is the metric used for performance. We measured a 10% improvement in performance with HSW when compared to a 12G IVB system and a 30% improvement when compared to a 12G SB platform.
Figure 5 - MILC performance
LS DYNA with car2car and WRF with Conus 2.5km performance comparisons are shown in Figure 6. Elapsed time as reported by LS DYNA is used to compare performance, and average time step computed over the last 719 intervals for Conus 2.5km is the WRF metric used for performance. Both applications perform 37-38% better with 13G-HSW when compared to 12G-IVB.
Figure 6 - LS DYNA, WRF 2.5km performance
In conclusion, the new 13G servers show performance improvements across all the applications studied here; this study provides some early comparisons and quantifies these performance improvements.
Look out for the next two blogs in this series. Blog 2 will discuss the performance and energy efficiency implications of some BIOS tuning options available on the 13G servers, and the third blog will compare different Haswell processor models.
This blog discusses the performance and energy efficiency implications of BIOS tuning options available on the new Haswell-based servers for HPC workloads. Specifically we looked at memory snoop modes, performance profiles and Intel’s Hyper-Threading technology and their impact on HPC applications. This blog is part two of a three part series. Blog one provided some initial results on HPC applications and performance comparisons on these new servers and previous generations. The third blog in this series will compare performance and energy efficiency across different Haswell processor models.
We’re familiar with performance profiles including power management, Turbo Boost and C-states. Hyper-Threading or Logical Processor is a known feature as well. The new servers introduce three different memory snoop modes – Early Snoop, Home Snoop and Cluster On Die. Our interest was in quantifying the performance and power consumed across these different BIOS options.
The “System Profile Settings” category in the BIOS combines several performance and power related options into a “meta” option. Turbo Boost, C-states, C1E, CPU Power Management, Memory Frequency, Memory Patrol Scrub, Memory Refresh Rate, Uncore Frequency are some of the sub-options that are pre-set by this “meta” option. There are four pre-configured profiles, Performance Per Watt (DAPC), Performance Per Watt (OS), Performance and Dense Configuration, that can be used. The DAPC and OS profiles balance performance and energy efficiency options aiming for good performance while controlling the power consumption. With DAPC, the Power Management is handled by the Dell iDRAC and system level components. With the OS profile, the operating system controls the power management. In Linux this would be the cpuspeed service and cpufreq governors. The Performance profile optimizes for only performance – most power management options are turned off here. The Dense Configuration profile is aimed at dense memory configurations, memory patrol scrub is more frequent and the memory refresh rate is higher and Turbo Boost is disabled. Additionally if the four pre-set profiles do not meet the requirement, there is a fifth option “Custom” that allows each of the sub-options to be tuned individually. In this study we focus only on the DAPC and Performance profiles. Past studies have shown us that DAPC and OS perform similarly, and Dense Configuration performs lower for HPC workloads.
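With the OS profile, the governor the operating system is actually using can be checked from Linux. This is a minimal sketch that assumes the usual cpufreq sysfs layout; the helper name is ours, and it returns None on systems where cpufreq is not exposed (for example when power management is handled outside the OS).

```python
from pathlib import Path

def read_governor(cpu=0, sysfs_root="/sys/devices/system/cpu"):
    """Return the cpufreq governor for one CPU, or None if it is not exposed."""
    path = Path(sysfs_root) / f"cpu{cpu}" / "cpufreq" / "scaling_governor"
    try:
        return path.read_text().strip()
    except OSError:
        return None

governor = read_governor()
print(governor if governor is not None else "cpufreq not available on this system")
```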
The Logical Processor feature is based on Intel® Hyper-Threading (HT) technology. HT enabled systems appear to the operating system as having twice as many processor cores as they actually do by ascribing two “logical” cores to each physical core. HT can improve performance by assigning threads to each logical core; logical cores execute their threads by sharing the physical cores’ resources.
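One way to see the logical versus physical distinction on Linux is through the per-CPU `topology/thread_siblings_list` files in sysfs, which list the logical CPUs that share a physical core. The sketch below works on strings in that format so it can be tried anywhere; the sample system is made up.

```python
def count_cores(sibling_lists):
    """Count logical CPUs and physical cores.

    sibling_lists holds one string per logical CPU, in the format of the Linux
    sysfs file topology/thread_siblings_list (for example "0,28" or "0-1").
    """
    def expand(text):
        cpus = set()
        for part in text.strip().split(","):
            if "-" in part:
                lo, hi = part.split("-")
                cpus.update(range(int(lo), int(hi) + 1))
            else:
                cpus.add(int(part))
        return frozenset(cpus)

    logical = len(sibling_lists)
    # Logical CPUs on the same physical core report the same sibling set.
    physical = len({expand(s) for s in sibling_lists})
    return logical, physical

# Hypothetical 4-core system with HT enabled: CPU n shares a core with CPU n+4.
sample = [f"{n % 4},{n % 4 + 4}" for n in range(8)]
print(count_cores(sample))    # (8, 4)
```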
Snoop Mode is a new category under Memory Setting. Coherence between sockets is maintained by way of “snooping” the other sockets. There are two mechanisms for maintaining coherence between sockets: snoop broadcast (Snoopy) modes, where the sockets are snooped for every memory transaction, and directory support, where some information is maintained in memory that gives guidance on whether there is a need to snoop.
The Intel® Xeon® Processor E5-2600 v3 Product Family (Haswell) supports three snoop modes in dual socket systems - Early Snoop, Home Snoop and Cluster On Die. Two of these modes are snoop broadcast modes.
In Early Snoop (ES) mode, the distributed cache ring stops can send a snoop probe or a request to another caching agent directly. Since the snoop is initiated by the distributed cache ring stops themselves, this mode has lower latency. It is best for workloads that have shared data sets across threads and can benefit from a cache-to-cache transfer, or for workloads that are not NUMA optimized. This is the default mode on the servers.
With Home Snoop (HS) mode, the snoop is always spawned by the home agent (centralized ring stop) for the memory controller. Since every snoop request has to come to the home agent, this mode has higher local latencies than ES. HS mode supports additional features that provide extra resources for larger number of outstanding transactions. As a result, HS mode has slightly better memory bandwidth than ES - in ES mode there are a fixed number of credits for local and remote caching agents. HS mode is targeted at workloads that are bandwidth sensitive.
Cluster On Die (COD) mode is available only on processor models that have 10 cores or more. These processors are sourced from different dies compared to the 8 core and 6 core parts and have two home agents in a single CPU/socket. COD mode logically splits the socket into two NUMA domains that are exposed to the operating system. Each NUMA domain has half of the total number of cores, half of the distributed last level cache and one home agent, with an equal number of cores and cache slices in each NUMA domain. Each NUMA domain (cores plus home agent) is called a cluster. In COD mode, the operating system will see two NUMA nodes per socket. COD has the best local latency. Each home agent sees requests from fewer threads, potentially offering higher memory bandwidth. COD mode has in-memory directory bit support. This mode is best for highly NUMA optimized workloads.
With Haswell processors, the uncore frequency can now be controlled independent of the core frequency and C-states. This option is available under the System Profile options and is set as part of the pre-configured profiles.
There are several other BIOS options available; we first picked the ones that would be most interesting to HPC. Collaborative CPU Performance Control is an option that allows CPU power management to be controlled by the iDRAC along with hints from the operating system, a kind of hybrid between DAPC and OS. This is a feature we plan to look at in the future. Configurable TDP is an option under the Processor Settings section and allows the processor TDP to be set to a value lower than the maximum rated TDP. This is another feature to examine in our future work.
Focusing on HPC applications, we ran two benchmarks and four applications on our server. The server in question is part of Dell’s PowerEdge 13th generation (13G) server line-up. These servers support DDR4 memory at up to 2133 MT/s and Intel’s latest Xeon® E5-2600 v3 series processors (architecture code-named Haswell). Haswell is a net new micro-architecture when compared to the previous generation Sandy Bridge/Ivy Bridge. Haswell processors use a 22nm process technology, so there’s no process-shrink this time around. Note the “v3” in the Intel product name – that is what distinguishes a processor as one based on Haswell micro-architecture. You’ll recall that “E5-2600 v2” processors are based on the Ivy Bridge micro-architecture and plain E5-2600 series with no explicit version are Sandy Bridge based processors. Haswell processors require a new server/new motherboard and DDR4 memory. The platform we used is a standard dual-socket rack server with two Haswell-EP based processors. Each socket has four memory channels and can support up to 3 DIMMs per channel (DPC).
Table 1 below details the applications we used and Table 2 describes the test configuration on the new 13G server. (Click on images to enlarge.)
Table 1 - Applications and Benchmarks
Table 2 - Server Configuration
All the results shown here are based on single-server performance. The following metrics were used to compare performance.
Power was measured by using a power meter attached to the server and recording the power draw during the tests. The average steady state power is used as the power metric for each benchmark.
Energy efficiency (EE) is computed as Performance per Watt (performance/power).
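As a quick illustration of how the relative energy efficiency values in the graphs relate to performance and power, here is a minimal sketch. It assumes performance is expressed so that higher is better (for a time-based metric that would be 1/time), and the numbers are hypothetical, not measurements from this study.

```python
def relative_ee(perf, power, base_perf, base_power):
    """Energy efficiency (performance/power) relative to a baseline run."""
    return (perf / power) / (base_perf / base_power)

# Hypothetical numbers for illustration only, not measurements from this study:
# a run that is 3% faster than the baseline while drawing 2% less power.
ee = relative_ee(perf=1.03, power=0.98, base_perf=1.0, base_power=1.0)
print(f"relative energy efficiency: {ee:.3f}")
```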
As described above, with Cluster On Die as the Memory Snoop mode the operating system sees two NUMA nodes per socket for a total of four NUMA nodes in the system. Each NUMA node has three remote nodes, one on the same socket and two on the other socket. When using a 14-core E5-2697 v3 processor, each NUMA node has 7 cores and one fourth of the total memory.
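The node counts above follow directly from the socket and core counts. The sketch below is a simple model of that bookkeeping, not a query of real hardware; the function and field names are ours.

```python
def numa_layout(sockets, cores_per_socket, cluster_on_die):
    """Describe the NUMA nodes the operating system would see."""
    nodes_per_socket = 2 if cluster_on_die else 1
    nodes = sockets * nodes_per_socket
    return {
        "numa_nodes": nodes,
        "cores_per_node": cores_per_socket // nodes_per_socket,
        "memory_fraction_per_node": 1 / nodes,
        "remote_nodes_same_socket": nodes_per_socket - 1,
        "remote_nodes_other_sockets": nodes - nodes_per_socket,
    }

cod = numa_layout(sockets=2, cores_per_socket=14, cluster_on_die=True)
no_cod = numa_layout(sockets=2, cores_per_socket=14, cluster_on_die=False)
print(cod)
print(no_cod)
```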
Figure 1 plots the Stream Triad memory bandwidth score in such a configuration. The full system memory bandwidth is ~116 GB/s. When 14 cores on a local socket access local memory, the memory bandwidth is ~ 58GB/s - half of the full system bandwidth. Half of this, ~29 GB/s, is the memory bandwidth of 7 threads on the same NUMA node accessing their local memory.
When 7 threads on one NUMA node access memory belonging to the other NUMA node on the same socket there is a 47% drop in memory bandwidth to ~15GB/s. This bandwidth drops a further 11% to ~14GB/s when the threads access remote memory across the QPI link on the remote socket. This tells us there is significant bandwidth penalty in COD mode when data is not local.
Figure 1 - Memory Bandwidth with COD Mode
Figure 2 and Figure 3 compare the three different snoop modes on two processor models across the different applications. The system profile was set to DAPC, HT disabled. All other options were at BIOS defaults.
The graphs plot relative performance of the three modes in the height of the bar. This is plotted on the y-axis on the left. The relative power consumed is plotted on the secondary y-axis on right and is noted by a marker. The text value noted in each bar is the energy efficiency, higher is better. The baseline used for comparison is HS mode.
For both the processor models, the performance difference between ES and HS is slight – within a couple of percentage points for most applications for these single-server tests. This difference is expected to be even smaller at the cluster-level. COD performs better than ES/HS for all the applications, up to 4% better in the best case.
In terms of power consumption, COD consumes less power than ES and HS in most cases. This combined with better performance gives COD the best energy efficiency of the three modes, again by a few percentage points. It will be interesting to see how this scales at the cluster level (more future work!).
Figure 2 - Snoop Modes - E5-2697 v3
Figure 3 - Snoop Modes - E5-2660 v3
Figures 4 and 5 compare the System Profile across the two processor models for the HS and COD modes. HT is disabled. All other options were at BIOS defaults.
The graphs plot relative performance in the height of the bar. This is plotted on the y-axis on the left. The relative power consumed is plotted on the secondary y-axis on right and is noted by a marker. The text value noted in each bar is the energy efficiency, higher is better. The baseline used for comparison is HS mode with DAPC profile.
For both the processor models the two profiles DAPC and Performance (Perf) show similar performance, within a couple of percentage points. From the graphs, HS.DAPC is similar to HS.Perf, and COD.DAPC is similar to COD.Perf. The bigger differentiator in performance is HS vs. COD; going from DAPC to Perf improves performance by a smaller factor. WRF is the only application that shows better performance with DAPC when compared to Perf.
The Performance profile consumes more power than DAPC by design since many power management features are turned off. This is shown in markers plotted on the secondary-y-axis. As expected, energy efficiency is better with the DAPC profile since the performance improvement with the Perf profile is less than the additional power consumed.
Figure 4 - System Profiles - E5-2697 v3
Figure 5 - System Profiles - E5-2660 v3
Figures 6 and 7 evaluate the impact of Hyper-Threading. These tests were conducted with the HS mode and DAPC System Profile. All other options were at BIOS defaults.
The graphs plot relative performance in the height of the bar. The relative power consumed is plotted on the secondary y-axis on right and is noted by a marker. The text value noted in each bar is the energy efficiency, higher is better. The baseline used for comparison is HS mode, DAPC profile and HT off. Where used in the graph, “HT” implies Hyper-Threading is enabled.
For all applications except HPL, the HT enabled tests used all the available cores during the benchmark. For HPL, only the physical number of cores was used irrespective of HT enabled or disabled. This is because HPL is used as a system benchmark for stress tests and is known to have significantly lower performance when using all HT cores. HT enabled is not a typical use-case for HPL.
HT enabled benefits MILC and Fluent. Fluent is licensed per core, and the ~12% improvement in performance with hyper-threading enabled probably does not justify doubling the license cost for the logical cores.
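The licensing trade-off can be made concrete with a little arithmetic. This sketch assumes, as the paragraph above implies, that every logical core needs its own license when HT is enabled, and that the server has 28 physical cores; the helper name is ours.

```python
def perf_per_licensed_core(relative_perf, licensed_cores):
    """Relative performance delivered per licensed core."""
    return relative_perf / licensed_cores

physical_cores = 28                      # two 14-core processors (assumed)
ht_off = perf_per_licensed_core(1.00, physical_cores)
ht_on = perf_per_licensed_core(1.12, 2 * physical_cores)   # ~12% faster, twice the licenses

print(f"HT on delivers {ht_on / ht_off:.0%} of the per-license performance of HT off")
```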
We measured a 3% improvement in LS-DYNA with HT enabled on the 10c E5-2660 v3. Again, the extra cost for the HT cores probably does not justify this small improvement. The 14c E5-2697 v3 does not show any performance improvement for LS-DYNA with HT. The total memory bandwidth for both these processor models is similar; both support 2133 MT/s memory. With the 14c model, we’ve added 40% more cores when compared to the 10c model. It’s likely that the memory bandwidth per core with HT enabled on the 14c processor is smaller than LS-DYNA’s requirement and that is why there is no performance improvement with HT enabled on the 14c processor.
The power consumption with HT enabled was higher for all cases when compared to HT disabled. The EE therefore depended on whether the additional performance with HT enabled was on par with the higher power consumption and is noted as text values in the bars in the graph.
These application trends for HT are similar to what we have measured in the past on Dell’s 12th generation Sandy-Bridge based servers.
Figure 6 - Hyper-Threading - E5-2697 v3
Figure 7 - Hyper-Threading - E5-2660 v3
Figures 8 and 9 plot the idle and peak power consumption across the different BIOS settings. Note this data was gathered on an early prototype unit running beta firmware. The power measurements shown here are for comparative purposes across profiles and not an absolute indicator of the server’s power requirements. Where used in the graph, “HT” implies Hyper-Threading is enabled.
The idle power consumption across different snoop modes is similar. The Performance profile adds 60-70 Watts over the DAPC profile for the configuration used in these tests.
The peak power consumption (during HPL initialization) is similar for the 14c E5-2697 v3 across the different BIOS configurations. On the 10c E5-2660 v3 the Performance profile consumes ~5% more power than DAPC.
Figure 8 - Idle and Peak Power - E5-2697 v3
Figure 9 - Idle and Peak Power - E5-2660 v3
We expect the ES and HS snoop modes to perform similarly for most HPC applications. Note we have not studied latency sensitive applications here and the benefits of ES mode might be more applicable in that domain. The data sets used in this study show the advantage of COD mode, but the benefit of COD depends greatly on data locality (as shown in Figure 1). We look forward to hearing about COD with real-world use cases.
In terms of System Profile, DAPC appears to be a good choice providing performance similar to Performance profile but with some energy efficiency benefits. Note that the profile DAPC enables C-states and C1E, and will not be a good fit for latency sensitive workloads.
It is recommended that Hyper-Threading be turned off for general-purpose HPC clusters. Depending on the applications used, the benefit of this feature should be tested and enabled as appropriate.
We’ve evaluated a couple of Haswell processor models in this blog; look out for the third blog in this series that will compare performance and energy efficiency across four different Haswell processor models.
This week I was asked if I knew any reason that a Sandy Bridge system would run slower than an approximately equivalent Westmere system. [I would not normally blog about such a thing, but this is the third time in the last two weeks that this type of question has surfaced!]
The Intel Sandy Bridge processor contains new instructions collectively called Advanced Vector Extensions, or AVX. AVX provides up to double the peak FLOPS performance when compared to previous processor generations such as Westmere. To take advantage of these AVX instructions, the application *must* be re-compiled with a compiler version that supports AVX. For Intel, that means the Intel Compiler Suite starting with version 11.1; for GCC, it means version 4.6 or later.
If an application has not been re-compiled with an AVX-aware compiler, the application will not be able to take advantage of these Sandy Bridge instructions. And it will probably run slower than previously seen on older processors, even including Westmere processors with higher frequencies. Let me say this another way: A Westmere executable will run fine on a Sandy Bridge system due to Intel’s commitment and extensive work to maintain backwards compatibility, but it will probably run slower, with no errors or any indications of why. Furthermore, re-compiling “on” the Sandy Bridge processor, but using an older compiler (pre icc 11.1 or pre gcc 4.6), does not help. Remember, use the latest compiler on those shiny new platforms!
For just one example of Westmere vs. Sandy Bridge performance improvements that are possible, please see our blog at: HPC performance on the 12th Generation (12G) PowerEdge Servers: http://dell.to/zozohn
I know there are some codes that, for legal, certification or other reasons, cannot be “changed.” But I certainly hope that this policy has not bled over into not even being able to re-compile apps to take advantage of new technologies.
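Before chasing compiler versions, it can help to confirm that the processor itself advertises AVX. On Linux the CPU feature flags are listed in /proc/cpuinfo; the sketch below parses text in that format. The two sample strings are abbreviated and made up, and the helper name is ours.

```python
def cpu_supports_avx(cpuinfo_text):
    """Return True if any 'flags' line in /proc/cpuinfo-style text lists avx."""
    for line in cpuinfo_text.splitlines():
        name, _, value = line.partition(":")
        if name.strip() == "flags" and "avx" in value.split():
            return True
    return False

# Abbreviated, made-up samples in the /proc/cpuinfo format.
westmere_like = "model name : example\nflags : fpu sse sse2 ssse3 sse4_1 sse4_2"
sandy_bridge_like = "model name : example\nflags : fpu sse sse2 ssse3 sse4_1 sse4_2 avx"

print(cpu_supports_avx(westmere_like))       # False
print(cpu_supports_avx(sandy_bridge_like))   # True
```

On a real system the input would be the contents of /proc/cpuinfo. Note that this only shows what the hardware supports, not whether a given executable uses AVX; the application still has to be built with AVX enabled, for example with `-mavx` for GCC or `-xAVX` for the Intel compilers.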
For additional information on AVX and re-compiling applications, see:
Intel Advanced Vector Extensions: http://software.intel.com/en-us/avx/
How To Compile For Intel AVX: http://software.intel.com/en-us/articles/how-to-compile-for-intel-avx/
Optimizing for AVX Using MKL BLAS: http://software.intel.com/en-us/articles/optimize-for-intel-avx-using-intel-math-kernel-librarys-basic-linear-algebra-subprograms-blas-with-dgemm-routine/
If you have comments or can contribute additional information, please feel free to do so. Thanks. --Mark R. Fernandez, Ph.D.
by Ranga Balimidi, Ashish K. Singh, and Ishan Singh
What can you do with a big bad 4-socket machine with 60 cores with up to 6TB memory in HPC? To help answer that question, we conducted a performance study using several benchmark suites such as HPL, STREAM, WRF and Fluent. This blog describes some of our results that help illustrate the possibilities. The server that we used for this study is the Dell PowerEdge R920. This server supports the family of processors in the Intel architecture code named Ivy Bridge EX.
The server configuration table outlines the configuration details used for this study as well as the configuration from a previous study performed in June 2010 with the previous generation of technology. We use these two systems to compare performance across the technology refresh.
PowerEdge R920 Hardware
Processors: 4 x Intel Xeon E7-4870 v2 @ 2.3GHz (15 cores) 30M cache 130W
Memory: 512 GB = 32 * 16GB 1333MHz RDIMMs
PowerEdge R910 Hardware
Processors: 4 x Intel Xeon X7550 @ 2.00GHz (8 cores) 18M cache 130W
Memory: 128GB = 32 * 4GB 1066MHz RDIMMs
Software and Firmware for PowerEdge R920
Operating system: Red Hat Enterprise Linux 6.5 (kernel version 2.6.32-431.el6 x86_64)
BIOS settings: System Profile set to Performance (Logical Processor disabled, Node Interleave disabled)
Benchmarks & Applications for PowerEdge R920
HPL: v2.1, From Intel MKL v11.1, Problem size 90% of total memory.
STREAM: v5.10, Array Size 1800000000, Iterations 100
WRF: v3.5.1, Input Data Conus 12K, Netcdf-220.127.116.11
Fluent: v15, Input Data: eddy_417k, truck_poly_14m, sedan_4m, aircraft_2m
For this study, we compared the two servers across the four benchmarks described below.
The aim of this comparison is to show the generation-over-generation changes in this four-socket platform. Each server was configured with the optimal software and BIOS configurations at the time of the measurements. The biggest differences in performance between the two server generations come from the improved system architecture, the greater number of cores, and the faster memory. The software versions are not a significant factor.
The STREAM benchmark is a simple synthetic program that measures sustained memory bandwidth in MB/s. It uses four kernels, COPY, SCALE, SUM and TRIAD, to evaluate the memory bandwidth. The operation performed by each kernel is shown below:
COPY: a(i) = b(i)
SCALE: a(i) = q*b(i)
SUM: a(i) = b(i) + c(i)
TRIAD: a(i) = b(i) + q*c(i)
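As a rough illustration of the four operations above, here is a minimal Python sketch. It is not the STREAM source code, and the function names (copy, scale, sum_, triad, measure) are made up for this example. It assumes 8-byte double-precision elements and counts one read plus one write per element for COPY and SCALE, and two reads plus one write for SUM and TRIAD. The numbers it reports will be dominated by interpreter overhead, so treat it as a way to see the bandwidth accounting, not as a measurement of the hardware.

```python
import time
from array import array

def copy(a, b, c, q):
    for i in range(len(a)):
        a[i] = b[i]

def scale(a, b, c, q):
    for i in range(len(a)):
        a[i] = q * b[i]

def sum_(a, b, c, q):
    for i in range(len(a)):
        a[i] = b[i] + c[i]

def triad(a, b, c, q):
    for i in range(len(a)):
        a[i] = b[i] + q * c[i]

# Kernel name -> (function, number of arrays touched per element).
KERNELS = {
    "COPY": (copy, 2),
    "SCALE": (scale, 2),
    "SUM": (sum_, 3),
    "TRIAD": (triad, 3),
}

BYTES_PER_ELEMENT = 8  # assumption: double-precision values

def measure(n=200_000, q=3.0):
    """Run each kernel once and return an MB/s figure per kernel."""
    results = {}
    for name, (kernel, arrays_touched) in KERNELS.items():
        a = array("d", [0.0]) * n
        b = array("d", [1.0]) * n
        c = array("d", [2.0]) * n
        start = time.perf_counter()
        kernel(a, b, c, q)
        elapsed = time.perf_counter() - start
        bytes_moved = arrays_touched * BYTES_PER_ELEMENT * n
        results[name] = bytes_moved / elapsed / 1e6
    return results

if __name__ == "__main__":
    for name, mb_per_s in measure().items():
        print(f"{name:6s} {mb_per_s:10.1f} MB/s")
```

The real benchmark repeats each kernel many times (100 iterations in this study) over arrays far larger than the processor caches and reports the best rate.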
The chart below compares STREAM performance results from this study with results from the previous generation. In this study, STREAM yields 231GB/s of memory bandwidth, which is twice the memory bandwidth measured in the previous study. This increase comes from the greater number of memory channels and the higher DIMM speed.
The graph also plots the local and remote memory bandwidth. Local memory bandwidth is measured by binding processes to a socket and accessing only memory local to that socket (NUMA enabled, same NUMA node). Remote memory bandwidth is measured by binding processes to one socket and accessing only memory that is remote to that socket (remote NUMA node), so every access has to go through the QPI link. The remote memory bandwidth is 72% lower than the local memory bandwidth because of the limited bandwidth of the QPI link.
The Linpack benchmark measures how fast a computer solves linear equations and measures a system's floating-point computing power. It requires a software library for performing numerical linear algebra on digital computers; for this study we used Intel’s Math Kernel Library. The following chart illustrates results from a single server HPL performance benchmark.
HPL yielded a 4.67x sustained performance improvement in this study. This is primarily due to the substantial increase in the number of cores, the increase in the FLOP/cycle of the processor, and the overall improvement in the processor architecture.
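To put the HPL result in context, the following Python sketch estimates the theoretical peak of each server from the configuration listed earlier, and the HPL problem size that corresponds to 90% of total memory. The helper names are hypothetical. The FLOP/cycle values are assumptions not stated in this post: 8 double-precision FLOP/cycle per core for the E7-4870 v2 (AVX) and 4 for the X7550 (SSE). The sketch also assumes 8-byte matrix elements and treats the listed memory size as GiB.

```python
import math

def peak_gflops(sockets, cores_per_socket, ghz, flops_per_cycle):
    """Theoretical double-precision peak in GFLOPS."""
    return sockets * cores_per_socket * ghz * flops_per_cycle

def hpl_problem_size(mem_gib, fraction=0.90, block=1):
    """Largest N such that an N x N matrix of doubles fits in `fraction`
    of memory, rounded down to a multiple of `block`."""
    elements = fraction * mem_gib * 2**30 / 8  # 8 bytes per double
    n = int(math.sqrt(elements))
    return n - n % block

# Assumed FLOP/cycle values: 8 (AVX) and 4 (SSE); see the note above.
r920_peak = peak_gflops(sockets=4, cores_per_socket=15, ghz=2.3, flops_per_cycle=8)
r910_peak = peak_gflops(sockets=4, cores_per_socket=8, ghz=2.0, flops_per_cycle=4)

if __name__ == "__main__":
    print(f"R920 peak: {r920_peak:.0f} GFLOPS")
    print(f"R910 peak: {r910_peak:.0f} GFLOPS")
    print(f"Peak ratio: {r920_peak / r910_peak:.2f}x")
    print(f"HPL N for 512 GiB at 90%: {hpl_problem_size(512)}")
```

Under these assumptions the ratio of theoretical peaks is a little over 4x, in the same range as the measured 4.67x sustained improvement.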
The Weather Research and Forecasting (WRF) Model is a numerical weather prediction system designed to serve atmospheric research and weather forecasting needs. It features two dynamical cores, a data assimilation system, and a software architecture allowing for parallel computation and system extensibility.
We have taken the average time step as the metric to measure WRF performance. We used the Conus 12km data set for this application.
In the graph above, we've plotted the WRF performance results from this study relative to results from the previous generation. Because the Intel E7-4870 v2 processor has more cores, we scaled WRF up to 60 cores and observed a significant performance increase while scaling. With the core count matched at 32 cores on both platforms, we observed a 2.9x performance improvement over the previous-generation platform. Using the full capability of the server at 60 cores yields an additional 35% improvement. In a server-to-server comparison, the PowerEdge R920 performs ~4x better than the PowerEdge R910. This is due to the overall architecture improvements, including processor and memory technology.
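The ~4x server-to-server figure follows from compounding the two numbers quoted above. The short Python sketch below shows the arithmetic. It assumes the additional 35% at 60 cores is measured relative to the 32-core R920 result, and the function names are made up for this example.

```python
def speedup_from_time_steps(old_avg_step_s, new_avg_step_s):
    """Speedup when the metric is average time per step (lower is better)."""
    return old_avg_step_s / new_avg_step_s

def combined_speedup(core_matched_speedup, extra_gain_fraction):
    """Compound a core-matched speedup with a further fractional gain."""
    return core_matched_speedup * (1.0 + extra_gain_fraction)

if __name__ == "__main__":
    # 2.9x at 32 cores, then an additional 35% when using all 60 cores.
    total = combined_speedup(2.9, 0.35)
    print(f"Server-to-server speedup: {total:.2f}x")
```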
Ansys Fluent contains the broad physical modeling capabilities needed to model flow, turbulence, heat transfer, and reactions for industrial applications ranging from air flow over an aircraft wing to combustion in a furnace, from bubble columns to oil platforms, from blood flow to semiconductor manufacturing, and from clean room design to wastewater treatment plants.
In the charts below, we have plotted the performance results from this study relative to results from the previous generation platform.
We've used four input data sets for Fluent. We've considered “Solver Rating” (higher is better) as the performance metric for these test cases.
For all the test cases, Fluent scaled very well with 100% CPU utilization. Comparing generation to generation at a matched 32 cores, the R920 results are approximately 2x better than the previous generation in all the test cases. In a server-to-server comparison using all available cores, the R920 performs 3-3.5x better.
These results were gathered by explicitly setting processor affinity at the MPI level. To do this, the following two configuration options were used:
(define (set-affinity argv) (display "set-affinity disabled"))
The PowerEdge R920 server outperforms its previous-generation counterpart in both the benchmark and application comparisons studied in this exercise. Its advantages over the previous-generation platform come from support for the latest processors, increased memory speed and capacity, and overall system architecture improvements. This platform is a good choice for HPC applications that can scale up with the high processor core count (up to 60 cores) and large shared memory (up to 6TB). It is also a great choice for memory-intensive applications, given the large memory support.
The 26th annual HPCC Conference had the theme “Supercomputing: A Global Perspective,” and was held in Newport, RI at the end of March. The conference pulled together a variety of industry experts, including High Performance Computing (HPC) users and vendors. This blog includes some of my observations from the event.
There were three main themes throughout the event:
John West, the Director for the Department of Defense’s (DOD) High Performance Computing Modernization Program, kicked off the event discussing “The Missing Middle.” He asked, “given this unalloyed good that is HPC, how come everybody isn’t using it?” You can watch his entire message here.
A New Focus on Manufacturing
New to this year’s Newport HPCC show was a focus on Manufacturing. Per the Conference Leadership team:
“Bringing HPC to manufacturing is an important initiative, in the U.S. and the rest of the world. Competitiveness is an elusive goal that requires continual refinement and adoption of new technologies. HPCC 2012 will highlight this critical area with discussions on the application of HPC to modern manufacturing to address what many refer to as the ‘missing middle’ – referring to the thousands of small and mid-size businesses not currently taking advantage of high performance computing in areas such as design, manufacturing, logistics, transportation, etc.”
Speakers in this area included:
Interesting Debates About Achieving Exascale
Panelists also discussed at length the quest for Exascale computing. How do we get there? What are the obstacles to achieving Exascale? What are the drivers that will get HPC in the United States to Exascale? What does Exascale computing provide the HPC market that current systems can’t achieve today?
Our very own Dr. Mark Fernandez was on a panel discussing this. In fact, this last-day roundtable discussion included a “Lightning round” that did a fantastic job of encapsulating the entire event’s worth of content in one 30-minute session.
HPC Analyst Crossfire – Live from the National HPCC Conference 2012
Other interesting notables:
Please leave a comment, or add any additional insights from the event.
A recent NVIDIA blog highlighted the impressive results clients have realized using NVIDIA's Tesla K40 GPU accelerator.
The blog focuses on three very different applications: weather forecasting, Twitter trends, and financial risk analysis. Each of the applications has seen impressive improvement since adopting the accelerator.
You can read the NVIDIA blog and learn more about the Tesla K40 GPU results here.
A recent HPCwire story reported that the Texas Advanced Computing Center (TACC) at the University of Texas at Austin has released two new offerings: Agave API, a cloud-based science-as-a-service platform for gateway development; and Gateway DNA, a collection of open source components enabling the rapid development of science gateways.
Agave seeks to spur innovation from day one in the next generation of science gateways by providing a synergetic set of services that developers can use to provide reliable, core science capabilities in their applications. Gateway DNA, a collection of open source software and pre-built tools that users can mix and match to customize the gateway they need, helps to eliminate potential barriers to entry for researchers.
Read the HPCwire story to learn more about Agave API and Gateway DNA.
Recently, insideHPC highlighted an interesting video featuring researchers at The George Washington University who are using GPUs to study the aerodynamics of flying snakes!
Part of the funding for this research came from NVIDIA's Academic Partnership and its CUDA Fellowship Program.
You can learn more about their research and see the video starring flying snakes here.
In January 2013, Stampede, at the Texas Advanced Computing Center, became the first large-scale system to deploy the Intel® Xeon Phi™ Coprocessor on a massive scale.
HPCwire recently discussed TACC's experience with its acting director, Dan Stanzione.
On its way to the #7 spot on the Top 500, the Stampede experience has included:
You can read the HPCwire story and hear the audio interview with Dan Stanzione here.