High Performance Computing (HPC) at Dell

Thanks for visiting our online HPC and technical computing community. We have an active group of contributors, from our Dell engineers to our industry-leading customers and worldwide partners.
By Nishanth Dandapanthula
What can you expect from the latest servers from Dell? What kind of performance and energy efficiency do all those speeds and feeds translate to? We spent the last several weeks in the HPC lab at Dell putting the 12G servers through some tests and this blog captures some of those results.
Dell’s all new 12th generation (12G) dual socket PowerEdge® servers feature the Intel® Xeon® E5-2600 series processors. These processors are based on the latest Intel micro-architecture, codenamed Sandy Bridge. 12G servers include many features beyond Sandy Bridge: enhancements in systems management, power efficiency, and network adapters, with options to mix 1GbE and 10GbE, SSD drives, and so on.
In this blog, we focus on compute performance and energy efficiency. We quantify the performance improvement provided by the 12G servers when compared to the previous 11th generation (11G) servers. The 11G servers, released in 2009, were based on the Intel Xeon 5500 and 5600 series processors (Nehalem-EP and Westmere-EP). We use a variety of applications and micro benchmarks for our comparison. This article gives a detailed account of single server performance evaluation comparing 11G and 12G servers.
The 11G servers included the Xeon X5600 series processors (Westmere). Table 1 describes the basic differences between Sandy Bridge and Westmere. With the increased number of cores, memory channels, QPI links, etc., it is easy to see that Sandy Bridge will have a profound impact on performance compared to Westmere. Intel also introduced Advanced Vector Extensions (AVX) in Sandy Bridge. Among its many advantages, AVX doubles the number of FLOPS/cycle compared to Westmere or Nehalem, providing a large boost in performance. A complete description of AVX is available from Intel.
Table 1: E5-2600 Vs. X5600
To quantify the performance advantage provided by 12G servers over 11G servers for HPC compute workloads, we compare the PowerEdge R620 (12G) server to the PowerEdge R610 (11G) server. Both these servers have a 1 U form factor and are 2 socket systems. Table 2 and Table 3 provide the server configurations of both the machines. The versions of BIOS and iDRAC (integrated Dell Remote Access Controller) used were the latest revisions at the time of the experiments. Table 4 provides the BIOS settings used for the experiments. Note that the evaluation R620 used for the tests was an engineering prototype machine, with the latest test firmware and BIOS at the time.
Table 2: R620 Configuration
Table 3: R610 Configuration
Our base processor on the R620 is the Intel Xeon E5-2680, a 2.7GHz (C1 stepping - proto), 130W processor. To match the core speed of this base processor on the R610, we picked the X5660, a 2.8 GHz, 95W Westmere processor. We also used the X5690, a 3.46 GHz, 130W Westmere processor, to match the wattage of the base processor.
Table 4: BIOS Settings
Figure 1: Stream Triad Memory Bandwidth
The STREAM benchmark is used to measure the memory bandwidth of the system. Figure 1 shows that, relative to the previous generation PowerEdge R610 server:
- There is an ~85 % increase in memory bandwidth when 1600 MHz DIMMs are used on the R620
- A ~61 % increase in memory bandwidth is measured when 1333 MHz DIMMs are used on the R620.
- Taking into account the additional cores on the R620, the memory bandwidth per core on the R620 is still better by 21-39% when compared to the R610.
To accomplish the HPL runs on both server configurations, we used Intel MPI 4.0.3.008 and Intel MKL 10.3. The problem size for each HPL run was held constant at 90% of total server memory. HPL efficiency, measured as the ratio of sustained performance to theoretical peak performance, shows a 6% improvement on the R620 server when compared to the R610.
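To illustrate how these efficiency numbers are derived, here is a minimal sketch of the Rmax/Rpeak arithmetic. The peak figures are computed from the published core counts, clock speeds, and FLOPS/cycle (8 for Sandy Bridge with AVX, 4 for Westmere); they are illustrative, not the measured results above.

```python
# Theoretical peak (Rpeak) for a server: sockets x cores x GHz x FLOPS/cycle.
def rpeak_gflops(sockets, cores_per_socket, ghz, flops_per_cycle):
    """Theoretical peak floating-point rate in GFLOPS."""
    return sockets * cores_per_socket * ghz * flops_per_cycle

# R620 with dual E5-2680: 8 cores, 2.7 GHz, 8 FLOPS/cycle with AVX.
r620_peak = rpeak_gflops(2, 8, 2.7, 8)    # 345.6 GFLOPS
# R610 with dual X5660: 6 cores, 2.8 GHz, 4 FLOPS/cycle with SSE.
r610_peak = rpeak_gflops(2, 6, 2.8, 4)    # 134.4 GFLOPS

def hpl_efficiency(rmax_gflops, rpeak_gflops_value):
    """Efficiency = sustained HPL result (Rmax) / theoretical peak (Rpeak)."""
    return rmax_gflops / rpeak_gflops_value

print(round(r620_peak, 1), round(r610_peak, 1))   # 345.6 134.4
```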
Figure 2 represents the results pertaining to a single server. Results are presented relative to the PowerEdge R610 configured with 2.8GHz processors.
- There is a 175% increase in absolute performance when processors of similar core speed are used (bars for R610, 2.8 GHz, 95W and R620, 2.7 GHz, 130W, 1333MHz), with the 12G server performing significantly better.
- There is not much of a difference in HPL performance on the 12G servers when 1333MHz DIMMs are swapped for 1600MHz DIMMs. This validates earlier studies showing that HPL is not sensitive to memory speed.
Figure 2: HPL Performance
Figure 3 shows the energy efficiency of a server when HPL is being run. Energy efficiency is measured in terms of performance delivered for each watt of power consumed (performance/W or GFLOPS/W). Results are presented relative to the PowerEdge R610 configured with 3.46GHz processors.
- There is a 100% increase in GFLOPS/Watt when compared to the R610. That is, a 12G server provides double the performance while consuming the same amount of power as an 11G server. This can be attributed to the 33% more cores on the Sandy Bridge processors compared to the Westmeres, and to the increase in the number of floating point operations executed every clock cycle on Sandy Bridge. It is also indicative of the overall energy efficient design of the Dell PowerEdge R620.
Figure 3: Power Consumption when running HPL
Figure 4: Idle Power Consumption
Idle power is measured as the power consumed by a server after it has reached a stable state (the boot process is complete) but is idle, with no jobs running on the system. Most data centers tend to have some downtime during off-peak hours, so idle power is an important metric in determining the energy efficiency of the data center when no jobs are running. Figure 4 depicts the relative idle power usage for different configurations of the servers. Results are presented relative to the PowerEdge R610 configured with 3.46GHz processors.
- The 12G servers consume 21% less power compared to an 11G server when processors of similar wattage are used (bars corresponding to the R610 with 3.46GHz, 130W processors, 1333 MHz DIMMs and the R620 with 2.7 GHz, 130W processors, 1333 MHz DIMMs).
The idle power consumed by the 12G machines is strikingly low. This is due to the improvements made for energy efficiency, not just in the Intel Sandy Bridge processors but also in the overall Dell platform. As mentioned before, 12G servers from Dell have several new features and enhancements beyond Intel’s processors.
Studies comparing a 12G server and an 11G server indicate that the 12G server delivers substantially better performance and energy efficiency. Subsequent blogs will give a detailed account of the communication capabilities of the 12G servers and their advantages when compared to 11G servers. In the future, we also plan to follow up with a blog providing insight into application level performance.
By Nishanth Dandapanthula and Garima Kochhar
HPC (High Performance Computing) mode is a new feature introduced in the BIOS which improves the performance of certain workloads on Dell servers based on AMD Interlagos processors. This blog describes how to enable and take advantage of the HPC mode and includes some performance results of the impact of HPC mode on a PowerEdge R815 server.
Enabling HPC mode through BIOS
The BIOS version that introduces HPC mode on the R815 is 2.8.2. To enable HPC mode, the BIOS must be set as shown in Table 1. Figures 1 and 2 show screenshots of the steps involved in enabling HPC mode through the BIOS. Note that setting HPC mode to “enabled” within the Processor Settings tab alone will not fully enable HPC mode.
Table 1: Enabling HPC mode through BIOS
Figure 1: Enable HPC mode in the Processor Settings tab
Figure 2: Change options in the Power Management Tab
Enabling HPC mode through DTK
To enable HPC mode in a cluster environment, Dell’s OpenManage Deployment Toolkit (DTK) can be used. The power management settings listed above can be set using the existing syscfg command line. The new parameter for HPC mode is “--hpcmode”, introduced with DTK v4.1. Details are listed in Table 2 below.
Table 2: HPC mode through DTK
The Red Hat Enterprise Linux kernel 2.6.32-220.17.1.el6 or later is needed for HPC mode to function. Without the support provided by this kernel, the server will kernel panic on boot when HPC mode is enabled in the BIOS.
Impact of HPC mode
To measure the impact of HPC mode on the performance of the server we used the High Performance Linpack (HPL) benchmark. The prebuilt HPL binaries were obtained from http://developer.amd.com/libraries/acml/downloads/pages/default.aspx. These binaries were built using Open64 compilers. Table 3 shows the test server configuration and Table 4 details the performance results. This evaluation was done on a single server.
Table 3: Test Server Configuration
Table 4: Impact of HPC mode
From Table 4, it can be seen that HPC mode provides up to a 6% improvement in HPL efficiency. This increased performance comes at the expense of higher power consumption and is recommended only for environments where the available power can support this mode of operation. Another caveat is that the performance improvement provided by HPC mode for workloads other than HPL is minimal.
The table compares the results of the new “HPC mode” BIOS option to the previous “Max Performance” Power Management option. For the “HPC mode” BIOS option to take effect, the “Power Management” option must be set to “Custom”, the “CPU Power and Performance Management” option to “OS DBPM”, and the “Fan Power and Performance Management” option to “Maximum Performance”.
By: Nishanth Dandapanthula, Munira Hussain
The new generation of Intel Sandy Bridge servers has PCIe Generation 3 slots available, which offer many benefits in terms of bandwidth and latency, useful for inter-node communication in a High Performance Computing cluster. In terms of transfer rate, PCIe Gen3 offers up to 8 GT/s versus the 5 GT/s provided by the older generation PCIe Gen2 slots. Additionally, PCIe Gen3 uses a more efficient encoding scheme (128b/130b instead of Gen2’s 8b/10b), resulting in lower overhead, greater bandwidth, and lower latency.
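The effect of the two encoding schemes is easy to quantify with a little arithmetic. A small sketch; the 8b/10b and 128b/130b ratios and transfer rates are the PCIe specification values, and the x8 lane count is an illustrative assumption:

```python
# Effective per-lane data rate after encoding overhead.
def effective_gbps(transfer_gt_per_s, payload_bits, coded_bits):
    """Usable Gb/s per lane: raw transfer rate scaled by the encoding ratio."""
    return transfer_gt_per_s * payload_bits / coded_bits

gen2 = effective_gbps(5.0, 8, 10)      # 8b/10b   -> 4.0 Gb/s per lane (20% overhead)
gen3 = effective_gbps(8.0, 128, 130)   # 128b/130b -> ~7.88 Gb/s per lane (~1.5% overhead)

# Scaled to an x8 slot, as commonly used by InfiniBand adapters:
print(gen2 * 8, round(gen3 * 8, 1))    # 32.0 63.0
```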
In this blog we will focus on the performance comparison between PCIe Gen2 versus PCIe Gen3 and the impact it has on bandwidth and latency. The performance improvement will be measured for both Quad Data Rate (QDR) and Fourteen Data Rate (FDR) InfiniBand Adapters on the 12G servers.
The Fourteen Data Rate (FDR) InfiniBand adapters from Mellanox are PCIe Gen3 based cards delivering up to 54 Gbit/s of theoretical bandwidth on a 4X link. The Quad Data Rate (QDR) InfiniBand adapters are available as both PCIe Gen2 and PCIe Gen3 cards. The theoretical bandwidth for the QDR adapters is 32 Gbit/s on a 4X link.
With the introduction of FDR InfiniBand adapters and PCIe Gen3 slots, the 12G servers provide an enormous improvement in bandwidth and latency from the interconnect perspective. We used the experimental setup as shown in Table 1 (12G) and Table 2 (11G) to quantify the advantage provided by 12G servers at a micro benchmark level, when compared to the previous 11G servers. To obtain the best possible latency, the BIOS options have been set as mentioned in Table 3. The servers were connected back to back without a switch in order to demonstrate the absolute performance improvement without considering the overhead introduced by the switch.
Table 1: R620 Configuration
Table 2: R610 Configuration
Table 3: BIOS Settings
The following results were obtained using MVAPICH 1.2 and the OSU Micro-Benchmarks 3.1.1. We compared the performance of three different interconnect configurations, FDR PCIe Gen3, QDR PCIe Gen3 and QDR PCIe Gen2, using the latency, bandwidth and bi-directional bandwidth benchmarks from the OSU benchmark suite.
From Figure 1, we can infer that:
- An 87 % improvement in bandwidth was obtained when QDR PCIe Gen2 (11G) and FDR PCIe Gen3 (12G) are compared.
- A 16 % improvement when QDR PCIe Gen2 (11G) and QDR PCIe Gen3 (12G) are compared. This can be attributed to the benefits provided by the PCIe Gen3 Slot.
Figure 1: OSU Bandwidth
Figure 2 represents the performance comparison using the OSU Bidirectional Bandwidth benchmark:
- A 69% improvement is seen when QDR PCIe Gen2 (11G) and FDR PCIe Gen3 (12G) are compared.
- A 20 % improvement when QDR PCIe Gen2 (11G) and QDR PCIe Gen3 (12G) are compared.
Figure 2: OSU Bidirectional Bandwidth
Figure 3 and Figure 4 depict the OSU Latency benchmark comparison over different interconnect speeds. With the new 12G servers, we measured our lowest micro-benchmark latency yet when the FDR PCIe Gen3 adapters are used.
- For small message sizes, the latency is better by 40 % when QDR PCIe Gen2 (11G) and FDR PCIe Gen3 (12G) are compared.
- The latency numbers for small message sizes show minimal difference between QDR PCIe Gen3 and FDR PCIe Gen3 performance. For large message sizes, the difference in performance is significant when QDR PCIe Gen3 and FDR PCIe Gen3 are compared.
Figure 3: OSU Latency (Small Message Size)
Figure 4: OSU Latency (Large Message Size)
Studies comparing an FDR PCIe Gen3 Adapter and a QDR PCIe Gen2 Adapter using the OSU benchmark suite indicate that the:
- Bandwidth is better by 87%
- Bi-directional Bandwidth is better by 69%
- Latency is lower by 40%
In subsequent blogs, we plan application level studies on a larger cluster to understand the performance at scale.
Many HPC and high throughput computing (HTC) application environments are well served by gigabit Ethernet as the primary cluster interconnect. However, increasing processor core counts and the availability of cost effective quad socket systems are growing the IO demands of compute nodes. When the available bandwidth of a gigabit connection is exceeded, IO wait cycles are introduced, the overall throughput of a compute node is constrained, and CPU utilization drops.

Transitioning to 10Gb Ethernet is one way to address the increasing IO demand. A challenge with 10Gb Ethernet networks for clusters is deploying a cost effective network that scales as you grow. Using a conventional multi-tier design is one possible solution for building a scalable 10Gb network. However, the bandwidth available at the top tier will limit the size of the network that can be built, and often the first tier switches introduce over-subscription into the network because the amount of uplink bandwidth is less than the bandwidth required by the systems connected to the switches.

The solution to these limitations is to build the network using a switch like the Dell Force10 Z9000 in a “fat tree” topology, taking advantage of the Z9000’s distributed core capabilities. The Distributed Core Architecture Using the Z9000 Core Switching System White Paper discusses the advantages of this approach and the various communications protocols used in implementation, but does not cover how you actually design a fat tree network.

The Dell Force10 Z9000 is a line-rate, 32-port 40GbE, two rack unit, top of rack (TOR) switch. Each 40Gb QSFP port can be split into four 10Gb SFP+ ports using a simple splitter cable.
A two tier fat tree network uses what are commonly called “leaf” and “spine” switches. Leaf switches sit at the edge of the network and connect to the compute or storage nodes in the cluster. Spine switches make up the second tier, connecting the leaf switches together. I will cover non-blocking solution design in this posting and follow up with some possibilities for oversubscribed designs at a later date.
A non-blocking leaf switch configuration using the Z9000 is made by splitting half of the Z9000’s 32 40Gb ports into 10Gb ports enabling the connection of up to 64 compute nodes (each 40Gb port is split into 4 10Gb ports). The remaining 16 40Gb ports will be used as uplink ports to connect into spine switches. There is an equal number of 40Gb ports used for connection to compute nodes and spine switches so this is a non-blocking configuration.
To complete the fat tree network, leaf switches are connected to spine switches. The number of spine switches needed is determined by counting the number of 40Gb uplinks from all leaf switches and dividing by 32 (since there are 32 40Gb ports on the Z9000). To connect the network, leaf switch uplinks are evenly divided between the spine switches.
The following example has four leaf switches in non-blocking configuration which can support a 256-node cluster. There are a total of 64 40Gb uplinks spread across two Z9000 spine switches.
The maximum number of non-blocking 10Gb connected systems that can be configured into a single network fabric using the leaf switch configuration described here is 2048: the number of ports in the spine switch (32) multiplied by the number of nodes that can be connected to each leaf switch (64). If we change from 40Gb to 10Gb connections between the spine and leaf switches, by splitting each QSFP port into 4 SFP+ ports, the Z9000 becomes a 128 port 10Gb switch and the maximum number of non-blocking 10Gb ports in a single network fabric grows to 8192 (128 spine switch ports * 64 leaf switch node connections).
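The sizing rules above reduce to a little arithmetic. The following sketch just reproduces the numbers from the text using the Z9000's port counts:

```python
# Fat-tree sizing arithmetic for Z9000-based leaf/spine networks.
SWITCH_40G_PORTS = 32        # Z9000: 32 x 40GbE QSFP ports
SPLIT = 4                    # each 40Gb QSFP splits into four 10Gb SFP+ ports

# Non-blocking leaf: half the 40Gb ports face nodes (split to 10Gb), half face spines.
nodes_per_leaf = (SWITCH_40G_PORTS // 2) * SPLIT   # 64 10Gb node ports
uplinks_per_leaf = SWITCH_40G_PORTS // 2           # 16 40Gb uplinks

def spine_count(num_leaves):
    """Spines needed: total leaf uplinks divided by ports per spine (rounded up)."""
    total_uplinks = num_leaves * uplinks_per_leaf
    return -(-total_uplinks // SWITCH_40G_PORTS)   # ceiling division

# The example above: four leaves -> 64 uplinks -> 2 spines, 256 nodes.
print(spine_count(4), 4 * nodes_per_leaf)          # 2 256

# Maximum non-blocking fabric: spine ports x nodes per leaf.
print(SWITCH_40G_PORTS * nodes_per_leaf)           # 2048 (40Gb spine links)
print(SWITCH_40G_PORTS * SPLIT * nodes_per_leaf)   # 8192 (10Gb spine links)
```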
Designing a conventional 10Gb multi-tier topology either limits the size of your network to hundreds of ports or requires the introduction of oversubscription. Also, purchasing a core switch of sufficient size could cost hundreds of thousands of dollars. Using a fat tree topology based on the Z9000 enables a massive number of nodes to be connected into a scalable, high performing network. You can start small, grow as your cluster grows, and dramatically reduce the cost of the solution.
In part 2 I will cover how to design oversubscribed fat tree networks.
Dell recently updated the 12th generation PowerEdge server line with the Intel Xeon E5-2600 v2 series processors. In this blog we compare the performance of the Intel Xeon E5-2600 v2 processors against the previous E5-2600 series processors across a variety of HPC benchmarks and applications. We also compare the performance of 1600MT/s DIMMs with 1866MT/s DIMMs; 1866MT/s is only supported with Intel Xeon E5-2600 v2 series processors. Intel Xeon E5-2600 v2 series processors are supported on Dell PowerEdge R620, R720, M620, C6220 II, C8220 and C8220x platforms with the latest firmware and BIOS updates.
Intel Xeon E5-2600 series processors use a 32 nanometer manufacturing process with planar transistors. They fall under the “tock” step of Intel’s tick-tock model of development and introduced a new microarchitecture (codenamed Sandy Bridge) to replace the Intel Xeon 5500 series processors, which were built on the architecture codenamed Nehalem.
Intel Xeon E5-2600 v2 series processors (codenamed Ivy Bridge) are based on a 22 nm manufacturing process using 3D tri-gate transistors. They represent a die shrink, known as the “tick” step of Intel’s tick-tock model.
To maintain consistency across the server configurations, we used Intel Xeon E5-2695 v2 and Intel Xeon E5-2665 processors of the same frequency and wattage from both processor families.
Server configuration:
- Processors: dual Intel Xeon E5-2665 2.4GHz (8 cores) 115W, 16 cores total per server; or dual Intel Xeon E5-2695 v2 2.4GHz (12 cores) 115W, 24 cores total per server
- Memory: 128GB total per server, configured as 1 x 16GB dual rank DDR3 RDIMM per channel (8 * 1600MT/s 16GB DIMMs or 8 * 1866MT/s 16GB DIMMs)
- Interconnect: Mellanox InfiniBand ConnectX-3 FDR, connected back-to-back
- Operating system: Red Hat Enterprise Linux 6.4 (kernel version 2.6.32-358.el6 x86_64)
- Cluster manager: Bright Cluster Manager 6.1
- OFED: Mellanox OFED 2.0.3
- BIOS: System Profile set to Max Performance (Logical Processor disabled, Turbo enabled, C states disabled, Node Interleave disabled)

Benchmarks and applications:
- HPL: v2.1, from Intel MKL v11.1
- STREAM: v5.10, Array Size 160000000, Iterations 100
- NAS Parallel Benchmarks: v3.2, Problem Size = D Class
- WRF: v2.2, Input Data Conus 12K
The Linpack benchmark measures how fast a computer solves a dense system of linear equations, gauging the system’s floating-point computing power. It requires a software library for performing numerical linear algebra on digital computers.
The STREAM benchmark is a simple synthetic program that measures sustained memory bandwidth in MB/s. It uses the COPY, SCALE, SUM and TRIAD kernels to evaluate memory bandwidth. The operations of these kernels are shown below:
- COPY: a(i) = b(i)
- SCALE: a(i) = q*b(i)
- SUM: a(i) = b(i) + c(i)
- TRIAD: a(i) = b(i) + q*c(i)
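To make the TRIAD kernel and STREAM's bandwidth arithmetic concrete, here is a toy Python sketch. Real STREAM is compiled C or Fortran, so the rate printed below mostly reflects interpreter overhead rather than true memory bandwidth; only the arithmetic is faithful.

```python
import time

# Toy TRIAD kernel: a(i) = b(i) + q*c(i), plus STREAM's bandwidth arithmetic.
N = 2_000_000               # STREAM requires arrays much larger than the caches
q = 3.0
b = [2.0] * N
c = [1.0] * N

t0 = time.perf_counter()
a = [b[i] + q * c[i] for i in range(N)]
elapsed = time.perf_counter() - t0

# TRIAD touches three arrays of 8-byte doubles: read b, read c, write a.
bytes_moved = 3 * 8 * N
print(f"TRIAD rate: {bytes_moved / elapsed / 1e6:.0f} MB/s (interpreter-bound)")
```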
The NAS Parallel Benchmarks (NPB) is a set of benchmarks targeting performance evaluation of highly parallel supercomputers.
LU: Solves a synthetic system of nonlinear PDEs using a symmetric successive over-relaxation (SSOR) solver kernel. (The related NPB benchmarks BT and SP use block tri-diagonal and scalar penta-diagonal solvers, respectively.)
EP: Generates independent Gaussian random deviates using the Marsaglia polar method.
FT: Solves a three-dimensional partial differential equation using the fast Fourier transform (FFT).
The Weather Research and Forecasting (WRF) Model is a numerical weather prediction system designed to serve atmospheric research and weather forecasting needs. It features two dynamical cores, a data assimilation system, and a software architecture allowing for parallel computation and system extensibility.
The tests conducted with the Intel Xeon E5-2665 configuration are labeled SB-16c. Tests conducted with 16 cores on the Intel Xeon E5-2695 v2 are labeled IVB-16c. Finally, tests with all 24 cores on the Intel Xeon E5-2695 v2 are referenced as IVB-24c.
Tests that used 1600MT/s DIMMS are designated with the suffix 1600, while tests that used 1866MT/s DIMMs are designated with the suffix 1866.
Single Node Performance:
For single node runs, we have compared the performance obtained with the server’s default configurations with both SB and IVB processors, using all the cores available in the system. In addition, for WRF and NPB-EP, we have also compared the performance of the server after turning off 4 cores per processor for Intel Xeon E5-2600 v2 configurations.
The following graph shows the single node performance gain with the Intel Xeon E5-2600 v2 series when compared to the E5-2600 series. For HPL, only the out-of-box sustained performance is compared, utilizing all the cores in the server. Since NPB-LU and NPB-FT require the number of processor cores to be a power of 2, their runs are not shown for IVB-24c.
Relative performance is plotted using the SB-16c-1600 configuration as the baseline.
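The relative numbers in these charts are simple ratios against the baseline configuration. A minimal sketch, with placeholder values rather than the measured results:

```python
# Normalize raw benchmark results against a baseline configuration.
def relative(results, baseline_key):
    """Scale higher-is-better results so the baseline equals 1.0."""
    base = results[baseline_key]
    return {name: value / base for name, value in results.items()}

# Placeholder GFLOPS figures, purely to show the mechanics:
raw = {"SB-16c-1600": 300.0, "IVB-16c-1600": 330.0, "IVB-24c-1600": 459.0}
rel = relative(raw, "SB-16c-1600")
print(rel["SB-16c-1600"], rel["IVB-24c-1600"])   # 1.0 1.53
```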
HPL yielded 1.53x sustained performance on IVB-24c as compared to SB-16c, primarily due to the increase in the number of cores. NPB and WRF yielded up to ~7-10% improvement when executed on 16 cores of the Intel Xeon E5-2695 v2 compared to SB-16c. WRF performs 22% better with IVB-24c than SB-16c, and NPB-EP shows ~38% improvement with IVB-24c compared to SB-16c. NPB-EP improves more than WRF because of its embarrassingly parallel nature, which requires little communication among MPI processes and thus benefits greatly from the increase in the number of cores.
The performance increase of WRF, NPB-EP and NPB-FT on 1866MT/s DIMMs over 1600MT/s DIMMs is 2.35%, 0.26% and 2.73% respectively. NPB-LU shows a 10% increase in performance. NPB-LU’s large problem size makes it more sensitive to memory speed, so it shows a considerably larger performance increase with the faster memory than NPB-EP, NPB-FT or WRF.
For dual node tests we have compared the performance obtained with server’s default configurations with both SB and IVB processors. In addition, for WRF and NPB-EP we have also compared the performance of the server after turning off 4 cores per processor for Intel Xeon E5-2600 v2 configurations.
The two-node cluster in these tests is connected back-to-back via InfiniBand FDR. All dual node tests were conducted with 1600MT/s memory DIMMs.
The following graph shows dual node performance gains with the Intel Xeon E5-2600 v2 series when compared to the E5-2600 series processors, plotted as IVB-48c and SB-32c respectively. Since the E5-2600 has 8 cores per socket compared to 12 cores in the Intel Xeon E5-2600 v2, one set of results was taken with four cores per Intel Xeon E5-2600 v2 processor shut down through the BIOS; it is plotted as IVB-32c.
HPL was executed on a two node cluster with E5-2665 (32 cores total, 16 per server) and E5-2695 v2 (48 cores total, 24 per server) processors. HPL yielded 1.52x sustained performance on IVB-48c as compared to SB-32c, whereas WRF, NPB-EP and NPB-LU showed a performance improvement of ~2.5%. There is a ~7-8% increase in performance with WRF and NPB-EP on 32 cores, and a ~22-32% difference when the 48 core E5-2695 v2 runs are compared to the 32 core E5-2665 runs.
The graph compares the memory bandwidth of the E5-2600 v2 processor to its predecessor, the E5-2600. With E5-2600, the maximum supported memory speed is 1600MT/s. With E5-2600 v2, that maximum is 1866MT/s. We’ve compared 1600MT/s DIMMs for SB-16c and IVB-24c, and also plotted the improved memory bandwidth with 1866MT/s on IVB-24c.
The IVB-24c test shows an ~15% increase in memory bandwidth with 1866MT/s DIMMs when compared to IVB-24c with 1600MT/s DIMMs, due to the higher frequency of the 1866MT/s DIMMs. It also shows an ~27% increase when compared to SB-16c with 1600MT/s. This increase is because the E5-2695 v2 processor has dual memory controllers supporting 2 memory channels each, compared to the single memory controller with 4 channels on the E5-2665 processor.
In this study, we found that E5-2600 v2 processors provide a significant performance improvement over E5-2600 processors. The increase in the number of cores, the larger L3 cache and the dual memory controllers all contribute to performance. We saw huge improvements with embarrassingly parallel applications like NPB-EP, and an increase in performance with 1866MT/s DIMMs over 1600MT/s DIMMs.
As happens every six months, the Top500 list of fastest supercomputers was released again during SC12 in Salt Lake City. The list represents a great way to track and measure trends, and celebrate the advancements in technological achievement at the very high-end of the high performance computing (HPC) industry.
One of the newest systems to be recognized by the Top500 is the Stampede system at the University of Texas (UT) and the Texas Advanced Computing Center (TACC). We at Dell are excited because we view our partnership with TACC as very valuable in helping to advance not only the capabilities that researchers and students have access to at TACC, but also to push the envelope of what is possible. Personally, I most enjoy learning and understanding what HPC is doing to help advance science and research worldwide. To learn more, be sure to watch the Dell Tech Center and www.HPCatDell.com to view TACC's Jay Boisseau deliver a great presentation from SC12 titled Transforming Science with Stampede.
So congratulations to our friends at TACC, and the other 499 systems that made this November's list. To learn more about the Top500 or TACC, please see the links below to some recent news stories.
- Top500 Summary, Nov. 2012
- Stampede supercomputer gives scientists a powerful new tool
- TACC Presents Petascale Systems and More at SC12
- TACC Stampede Overview
By Munira Hussain and Anil Maurya
Rocks+ 6.0.1 is based on the open source ROCKS project and is supported by StackIQ. The solution stack is tested, verified and validated on the latest Dell hardware based on Intel Sandy Bridge and AMD Interlagos. The solution stack is designed to automate, deploy and manage High Performance Computing clusters. Additionally, it incorporates Dell recommended environment settings, parameters and scripts that are set and configured automatically during installation.
The main highlight of the Rocks+ 6.0.1 release is the addition of support for the following components:
- Support for Intel Sandy Bridge-EP Servers: R620, R720, M620 and C6220.
- Red Hat Enterprise Linux 6.2 (kernel -2.6.32-220.el6.x86_64).
- nVidia CUDA 4.1 roll for GPGPU for R720 and C6220/C6145 servers.
- Mellanox OFED 1.5.3-3 to support PCI-E Gen 3 Fourteen Data Rate (FDR) technology.
- Additional scripts and tools in the Dell roll to configure optimal BIOS settings and iDRAC/BMC settings specific for Dell hardware. (see below for more details)
- Introduction of a GUI based implementation to manage, monitor and run commands from the web console.
Rocks+ 6.0.1 has many features as an HPC software solution stack, including:
- Physical or virtual infrastructures can be quickly provisioned, deployed, monitored, and managed.
- Pre-packaged, automatically configured software stacks called “Rolls” are available to simplify Big Infrastructure deployments on physical or virtual servers. These rolls have both data center and cloud versions.
- syscfg commands are used to enforce HPCC recommended BIOS settings and console redirection on all the nodes.
- The HPC roll includes MPI libraries, so OpenMPI and MPICH2 are natively supported.
Additionally, StackIQ has incorporated a new and enhanced web-based interface which:
- Allows administrators to monitor the status of nodes and gather CPU and network statistics.
- Helps in viewing the network interface settings of all the nodes.
- Allows viewing and changing attribute settings, such as hostname, for all the nodes.
- If the Ganglia roll is installed, lets administrators use the Ganglia web interface to monitor CPU, IO and network statistics for the cluster.
- Rocks+ 6.0.1 uses the Avalanche installer to deploy nodes. A visualization of the deployment is displayed in the GUI when the Avalanche installer is deploying multiple systems: it shows nodes pulling packages from each other rather than from the frontend, removing the I/O bottleneck on the front end installer, or head node.
ROCKS+ 6.0.1 Monitoring with Ganglia Roll
An HPC cluster is a hierarchical architecture with various components. The jumbo package from StackIQ is a bundled solution that incorporates various software components that tie together nicely to deploy and configure a cluster. The package contains the following rolls; however, additional rolls can be added or removed per configuration.
I got some e-mail after the previous blog (http://dell.to/144sqai) on 2GB/core recommendations for HPC compute nodes. It turns out that some of you know the memory capacity requirements of your workloads, and that the requirement is currently 48GB per (2-socket) compute node. Kudos for determining the minimum amount of memory required!
But configuring the minimum required memory assumes that less memory is “better”: it costs less money and has fewer potential negatives. More on that later.
Continuing, the logic goes that 48GB/node is 24GB/socket on a 2-socket node. And since there are four (4) memory channels per socket on an Intel Sandy Bridge-EP processor (E5-2600), and one would like to maximize memory bandwidth, one needs 4 x 6 GB DIMMs to achieve the required 24GB per socket. But, alas, there is no such thing as a 6 GB DIMM.
Hence, a 4 GB DIMM and a 2 GB DIMM are used on each memory channel. Several of you shared this configuration with me. It does many things correctly:
Therefore, the memory configuration is balanced and a good one, technically speaking.
However, let’s dig deeper and take into account a few other things. One is my previous Rule #3: Always use 1 DPC (if possible to meet the required memory capacity). The others are to consider today’s price and tomorrow’s requirements.
As stated in the previous blog, I like to create the “best” memory configuration for a given compute node and then see if the memory/core capacity is sufficient. In other words, in high performance computing take memory performance into account (first) in addition to the age-old capacity requirements. And as usual, price comes into play. In this 48GB/node case, the price is indeed a driving factor.
To be consistent with the previous blog, we’ll use the same memory sizes and prices, based upon the Dell R620, a general purpose, workhorse, 1U, rack-mounted, 2-socket, Intel SandyBridge-EP (E5-2600) compute node platform. Below is that same snapshot of the memory options and their prices taken on 12-July-201
Here’s the layout of a 48GB/node configuration using 4 GB DIMMs and 2GB DIMMs. Also, in the figure is the total memory price for that configuration.
Here’s an alternate layout using 8 GB DIMMs. Also, in the figure is the total memory price for this configuration.
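The capacity arithmetic behind both layouts can be sketched in a few lines of Python (the helper name is mine; the per-DIMM prices shown in the figures are not reproduced here):

```python
# Sketch: total node capacity for a 2-socket Sandy Bridge-EP node with
# 4 memory channels per socket, assuming every channel carries the same
# DIMM mix. Capacities match the configurations discussed in the post.

CHANNELS_PER_SOCKET = 4
SOCKETS = 2

def node_capacity_gb(dimms_per_channel):
    """Total node capacity (GB) when all channels are populated identically."""
    return sum(dimms_per_channel) * CHANNELS_PER_SOCKET * SOCKETS

# 4GB + 2GB DIMMs on every channel (2 DPC):
print(node_capacity_gb([4, 2]))   # 48 GB/node, all channels balanced

# Alternate layout: a single 8GB DIMM per channel (1 DPC, Rule #3):
print(node_capacity_gb([8]))      # 64 GB/node
```

Note that the 1 DPC layout satisfies Rule #3 while leaving every second DIMM slot free for future expansion.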
Here are the key features of the second configuration:
“Future proof? What does he mean by that?” Did you notice the memory per core in the figures above? The 48GB/node configuration using 4 GB DIMMs and 2 GB DIMMs is 3GB/core for today’s mainstream 8-core processor. The 48GB/node specification may in fact be tied to the GB/core and the core count per processor. Today’s node may need 48GB, but a node with more cores may need more memory.
We know from several public places (e.g., http://www.sqlskills.com/blogs/glenn/intel-xeon-e5-2600-v2-series-processors-ivy-bridge-ep-in-q3-2013/ ) that the follow-on to the Intel SandyBridge-EP processor (E5-2600), codenamed Ivy Bridge-EP, will officially be called the Intel Xeon E5-2600 v2. The mainstream v2 processor will feature ten (10) cores, compared to today’s 8 cores. With this future processor, the alternate memory configuration above using 8 x 8GB provides a total of 64GB/node. This 64GB/node on a 2-socket node with 20 cores is 3.2GB/core, still exceeding the 3GB/core of the 48GB node today.
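The GB/core comparison above works out as follows (a quick check; the function name is mine):

```python
# Memory-per-core check for the two node configurations discussed above.

def gb_per_core(node_gb, sockets, cores_per_socket):
    """Memory per core for a multi-socket node."""
    return node_gb / (sockets * cores_per_socket)

print(gb_per_core(48, 2, 8))    # 3.0 GB/core: 48GB node, 8-core E5-2600
print(gb_per_core(64, 2, 10))   # 3.2 GB/core: 64GB node, 10-core E5-2600 v2
```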
If you have comments or can contribute additional information, please feel free to do so. Thanks. --Mark R. Fernandez, Ph.D.
High Performance Computing using commodity cluster components is generally agreed to have begun in 1993 with the Beowulf project at NASA, led by Thomas Sterling and Donald Becker. The recipe they perfected used Linux as the operating system. This choice led to a ‘virtuous cycle’ of improvement: the cluster community contributed improvements to Linux as an operating system platform, and the greater Linux community provided features that made Linux a better platform for cluster computing.
Near the tenth anniversary of Beowulf computing, I found and read the MIT Press book “How to Build a Beowulf: A Guide to the Implementation and Application of PC Clusters”. The idea of using commodity parts to do High Performance Computing was an exciting trend, and when combined with my ongoing interest in parallel computing and long-term interest in Linux, I committed my career to cluster-based computing. As we approach the twentieth anniversary of cluster computing, I thought it would be interesting to point out that, in addition to Linux-based commodity clusters, a second option is available for leveraging commodity hardware for HPC. That option is to use Windows NT-based operating systems and the Windows ecosystem to build HPC solutions.
While some in the open source community view Windows as inappropriate for use in a cluster because it is a closed and proprietary software platform, I look at it as an alternative operating system with the volume to make it a commodity. I feel that the real spirit of Beowulf is leveraging commodity economics and accelerated progress that comes from a broad base of users and competition. I can assure you that Microsoft has felt the competitive pressure from UNIX and Linux and that this has led to many improvements in the Windows Operating system. It also led to a significant effort by Microsoft to create a full featured HPC platform based on Windows. The fruits of that effort are currently represented by a suite of products generally referred to as Windows HPC.
Windows HPC is actually implemented by combining several Windows server and client based systems into a cluster that is managed and coordinated via the Windows HPC Pack. This Windows based HPC solution is now in its third generation and provides a rich platform for doing cluster computing in a Windows environment. The latest version of Windows HPC runs on dedicated on-premises hardware and can also incorporate resources in the cloud or run entirely in the cloud.
Yes, Virginia, there is a Windows HPC solution. In future posts I will dive more deeply into the details of what the various parts of Windows HPC are and how they can be combined to meet the need for HPC in a Windows environment.
Next up: “Windows HPC – A Recipe for Success”
In my first post on designing scalable 10Gb Ethernet networks I discussed some of the motivations for migrating HPC and HTC computing environments from 1Gb Ethernet to 10Gb Ethernet and how to design scalable, non-blocking 10Gb networks using the Dell Force10 Z9000TM. It may be the case that your current computing environment needs more bandwidth than 1Gb Ethernet offers but 10Gb is overkill and will be underutilized. It is possible to design scalable 10Gb networks that meet lower IO throughput requirements by introducing oversubscription into the architecture.
Oversubscription exists when the theoretical peak bandwidth needs of the systems connected to a switch exceed the theoretical peak bandwidth of the uplinks out of the switch. Oversubscription is expressed as the ratio of inputs to outputs (e.g., 3:1), or as a percent, calculated as (1 – (# outputs / # inputs)). For example, (1 – (1 output / 3 inputs)) = 67% oversubscribed. It is important to remember that oversubscription in a network is not inherently bad. It is a feature that must be designed as a part of an overall computing solution. Note that the oversubscription of a network describes the worst case bandwidth for an environment. If all the servers connected to a leaf switch are not saturating their individual 10Gb links at the same time, the actual bandwidth delivered to a server will be higher than the oversubscribed value and may even be the full 10Gb of bandwidth. The actual available bandwidth will depend on the IO access demands across all servers connected to a switch.
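The ratio and percent forms of oversubscription can be computed directly (a minimal sketch; the function name is mine):

```python
# Oversubscription expressed both ways: as an inputs:outputs ratio and
# as a percent, 1 - (outputs / inputs), per the definition above.

def oversubscription(inputs, outputs):
    """Return (ratio, percent) for a given inputs:outputs bandwidth split."""
    ratio = inputs / outputs
    percent = 1 - outputs / inputs
    return ratio, percent

ratio, pct = oversubscription(3, 1)
print(f"{ratio:.0f}:1 -> {pct:.0%} oversubscribed")  # 3:1 -> 67% oversubscribed
```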
Using the Dell Force10 Z9000TM to design oversubscribed distributed core 10Gb networks is very similar to designing non-blocking networks. Simply use more of the 40Gb ports of the switch to connect to servers than to connect to spine switches. Suppose it was determined that a 3:1 oversubscribed 10Gb network was the ideal configuration for an environment. How do you figure out how many downlinks you need to servers and uplinks to spine switches to achieve 3:1 oversubscription? It may sound too easy to be true, but the port counts are determined by dividing the total number of ports in the switch by the sum of the number of inputs and number of outputs in the oversubscription ratio. The result of this formula is the number of uplink ports.
Applying the formula to the Z9000, the number of uplink ports in a 3:1 oversubscribed leaf switch configuration is eight. Splitting the remaining 24 40Gb ports into 4 10Gb ports each results in 96 10Gb SFP+ links for connection to servers.
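The Z9000 port split described above can be sketched as follows (function name is mine; the switch has 32 40Gb ports, each splittable into four 10Gb SFP+ links):

```python
# Leaf-switch port counts for an oversubscribed Z9000 configuration:
# uplinks = total ports / (inputs + outputs), per the formula in the post;
# the remaining 40Gb ports each split into four 10Gb server links.

TOTAL_40G_PORTS = 32  # 40Gb ports on a Dell Force10 Z9000

def leaf_port_split(inputs, outputs, total_ports=TOTAL_40G_PORTS):
    """Return (40Gb uplink ports, 10Gb server ports) for an inputs:outputs ratio."""
    uplinks = total_ports // (inputs + outputs)
    downlinks_10g = (total_ports - uplinks) * 4
    return uplinks, downlinks_10g

print(leaf_port_split(3, 1))   # (8, 96): 8 uplinks, 96 10Gb server ports
```

As a sanity check, 96 x 10Gb of server bandwidth against 8 x 40Gb of uplink bandwidth is 960:320, i.e., exactly 3:1.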
As in a non-blocking configuration, leaf switches are connected to spine switches to complete the fat tree network. The number of spine switches needed is determined by the same method used for a non-blocking network. Count the number of 40Gb uplinks from all leaf switches and divide by 32 (since there are 32 40Gb ports in the Z9000). Leaf switch uplinks are evenly divided between the spine switches. To achieve balanced performance you should have the same number of uplinks from a leaf switch connected to each of the spine switches. Mathematically, the number of uplinks on each leaf switch should be evenly divisible by the number of spine switches.
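Sizing the spine follows directly from that method. A small sketch (the leaf count is a hypothetical example, and rounding up partial switches is my assumption):

```python
# Spine sizing for a Z9000 fat tree: total leaf uplinks divided by the
# 32 40Gb ports available on each spine switch, rounded up.

import math

SPINE_PORTS = 32  # 40Gb ports per Z9000 spine switch

def spine_switches(num_leaves, uplinks_per_leaf):
    """Number of Z9000 spine switches needed to terminate all leaf uplinks."""
    total_uplinks = num_leaves * uplinks_per_leaf
    return math.ceil(total_uplinks / SPINE_PORTS)

# e.g., 12 leaf switches, each with the 8 uplinks of the 3:1 configuration:
print(spine_switches(12, 8))   # 96 uplinks / 32 ports = 3 spine switches
```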
It is also possible to use the Dell Force10 S4810TM when building distributed core networks. The S4810 has 48 10Gb SFP+ ports plus four 40Gb uplink ports and may be used as both leaf and spine switches or can be used as a leaf switch in combination with Z9000 switches for the spine.
As I stated in my previous post, using a fat tree topology and the distributed core capabilities of the Dell Force10 Z9000TM and Dell Force10 S4810TM switches enables a massive number of nodes to be connected into a scalable, high performing 10Gb network. Start small and grow as your cluster grows.