High Performance Computing Blogs

A discussion venue for all things high performance computing (HPC), supercomputing, and the technologies that enable scientific research and discovery.
  • Cambridge University's Wilkes Air-Cooled Supercomputer Among "Greenest"

    According to a story in HPCwire, the number two supercomputer on the Green500 list is a little different from all the other honorees - it's the only one that is air-cooled.  

    Cambridge University's Wilkes - as the supercomputer has been named - is based on 128 Dell T620 servers, which provide greater energy efficiencies.

    Capable of 240 teraflops on LINPACK, Wilkes is part of Cambridge's "Square Kilometer Array (SKA) Open Architecture Lab," which is participating in a multinational collaboration to build the world's largest radio telescope.  As Paul Calleja mentioned in his SC13 video, it's his work with SKA that has his son convinced dad is looking for aliens in outer space!

    You can learn more about Cambridge's air-cooled supercomputer at HPCwire.

  • Accelerating High Performance LINPACK (HPL) with Kepler K20X GPUs

    by Saeed Iqbal and Shawn Gao 

    The NVIDIA Tesla K20 GPU has proven performance and power efficiency across many HPC applications in the industry.  The K20 is based on NVIDIA's latest Kepler GK110 architecture and incorporates several innovative features and micro-architectural enhancements of the Kepler design. Since the K20 release, NVIDIA has launched an upgrade to the K20 called the K20X.  The K20X has a higher number of processing units, more memory and higher memory bandwidth. This blog quantifies the performance and power efficiency improvements of the K20X compared to the K20. The information presented here should help in making an informed decision between these two powerful GPU options.

    High Performance LINPACK (HPL) is an industry-standard, compute-intensive benchmark traditionally used to stress the compute and memory subsystems.  With the increasingly common use of GPUs, a GPU-enabled version of HPL is now developed and maintained by NVIDIA; it utilizes both the traditional CPU compute subsystem and the GPU accelerators. We used the Kepler GPU-enabled HPL version 2.0 for this study.

    We use the Dell PowerEdge R720 for the performance comparisons.  The PowerEdge R720 is a dual-socket server that can have up to two internal GPUs installed; we keep the standard test configuration of two GPUs per server.  The PowerEdge R720 is a versatile, full-featured server with a large memory capacity.
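    For readers who want to reproduce a similar run, the launch typically looks something like the sketch below. The xhpl binary name and the one-MPI-rank-per-GPU mapping follow common HPL practice; the exact wrapper scripts, paths and rank-to-GPU binding vary with NVIDIA's GPU-enabled HPL distribution, so treat these lines as placeholders. The problem size N, block size NB and the P x Q process grid are set in the usual HPL.dat input file.

    export CUDA_VISIBLE_DEVICES=0,1          (expose both GPUs in the R720 to the run)
    mpirun -np 2 -hostfile ./hosts ./xhpl    (one MPI rank per GPU on the node listed in ./hosts)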

    Hardware Configuration and Results

    The Server and GPU configuration details are compared in the tables below.

    Table 1:  Server Configuration

    Server
      Model: PowerEdge R720
      Processor: Two Intel Xeon E5-2670 @ 2.6 GHz
      Memory: 128 GB (16 x 8 GB), 1600 MHz, 2 DPC
      GPUs: NVIDIA Tesla K20 and K20X
      Number of GPUs installed: 2
      BIOS: 1.6

    Software
      Benchmark: GPU-accelerated HPL, version 2.1
      CUDA, driver: 5.0, 304.54
      OS: RHEL 6.4

     Table 2:  K20 and K20X: Relevant parameter comparison

    GPU Model                 K20X           K20            Improvement (K20X)
    Number of cores           2,688          2,496          7.6%
    Memory (VRAM)             6 GB           5 GB           20.0%
    Memory bandwidth          250 GB/s       208 GB/s       20.2%
    Peak performance (SP)     3.95 TFLOPS    3.52 TFLOPS    12.2%
    Peak performance (DP)     1.31 TFLOPS    1.17 TFLOPS    11.9%
    TDP                       235 W          225 W          4.4%

    Figure 1: HPL performance and efficiency on R720 for K20X and K20 GPUs. 

    Figure 1 illustrates the HPL performance on the PowerEdge R720; the CPU-only performance is shown for reference. There is a clear performance improvement with the K20X of about 11.2% in HPL GFLOPS compared to the K20. Compared to the CPU-only configuration, the HPL acceleration with K20X GPUs is 7.7X; with the K20 GPUs it is 6.9X.  In addition to improved performance, the compute efficiency of the K20X is slightly better than that of the K20: as shown in Figure 1, the K20X achieves a compute efficiency of 82.6% and the K20 82.1%.  It is typical for CPU-only configurations to have higher efficiency than heterogeneous CPU+GPU configurations, as seen in Figure 1: the CPU-only efficiency is 94.6%, while the CPU+GPU configurations are in the low 80s.
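    For context, HPL efficiency is simply the measured GFLOPS divided by the theoretical peak of the hardware. As a rough back-of-the-envelope check using the Table 2 values and the standard Sandy Bridge figure of 8 double-precision FLOPs per core per cycle: the two E5-2670s contribute about 2 x 8 cores x 2.6 GHz x 8 = 332.8 GFLOPS of peak, and the two K20X cards add 2 x 1.31 TFLOPS, for a system peak of roughly 2.95 TFLOPS. At 82.6% efficiency that is about 2.4 TFLOPS measured, which is consistent with the reported 7.7X speedup over the roughly 315 GFLOPS (94.6% of 332.8 GFLOPS) CPU-only result.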

    Figure 2: Total Power and Power Efficiency on PowerEdge R720 for K20 and K20X GPUs. 

    Figure 2 illustrates the total system power consumption of the different configurations of the PowerEdge R720 server.  The first thing to note from Figure 2 is that GPUs consume substantial power.  The CPU-only configuration draws about 450W, which increases to above 800W when K20/K20X GPUs are installed in the server, an increase of up to 80% in power consumption. This should be taken into account when budgeting power and sizing the power delivery for large installations.  However, once the power is delivered to the GPUs, they are much better than CPUs alone at converting that energy into useful work, as the improved performance-per-watt numbers in Figure 2 show.  The K20X delivers 2.79 GFLOPS/W, about 4X better than the CPU-only configuration; similarly, the K20 delivers 2.68 GFLOPS/W, about 3.8X better than the CPU-only configuration.  It is interesting to note that the K20X shows a 7% improvement over its predecessor, the K20.

    Summary

    The K20X delivers about 11% higher performance and consumes 7% more power than the K20 for the HPL benchmark.   These results are in line with the expected increase when the theoretical parameters are compared.

     

  • Dell HPC Solution Refresh: Intel Xeon Ivy Bridge-EP, 1866 DDR3 memory and RHEL 6.4

    by Calvin Jacob and Ishan Singh

    Support for Intel Xeon Ivy Bridge-EP processors, 1866 DDR3 memory and RHEL 6.4 has been added to the current Dell HPC Solution. This solution is based on Bright Cluster Manager 6.1, with RHEL 6.4 as the base OS supported on Ivy Bridge processors. Bright Cluster Manager (BCM) is a complete HPC solution from Dell that can automate the deployment and management of an HPC cluster. Recommended BIOS settings for the supported platforms, along with BMC/iDRAC settings, are scripted and made available for users to apply if they choose. Dell system management tools are bundled with Bright Cluster Manager and are used to set, configure and manage Dell hardware.

    The highlights of this release include added support for:

    1. Intel Xeon Ivy Bridge-EP (E5-26xx v2) processors.

    2. Red Hat Enterprise Linux 6.4 (kernel-2.6.32-358.el6.x86_64).

    3. Mellanox OFED 2.0-3.

    4. CUDA 5.5.

    5. PEC Tools for systems management of PE-C servers.

    6. Hardware Match Check by BCM.

    Ivy Bridge-EP processors

    Support for Intel Xeon Ivy Bridge-EP (E5-26xx v2) processors has been added for the refreshed and existing servers: the R620, R720, M620, the C8000 series and the C6220 II. Intel Ivy Bridge-EP processors use Tri-Gate transistors, a 3-D (non-planar) design that packs more transistors into less space. These processors have up to 12 cores, 30MB of Last-Level Cache (LLC), DDR3 memory with speeds up to 1866 MHz, QPI speeds of 8 GT/s, up to 40 PCIe 3.0 lanes and a TDP of up to 130W. The previous-generation Intel Xeon Sandy Bridge processors used 32 nm technology, whereas the new Intel Xeon Ivy Bridge-EP processors are built on a 22 nm process, delivering higher density and more performance within the same power envelope.
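    A quick way to confirm the upgrade from the OS is to query the processor model and core count; the exact output will, of course, vary with the model installed:

    grep -m1 'model name' /proc/cpuinfo    (should report an Intel Xeon E5-26xx v2 part)
    grep -c ^processor /proc/cpuinfo       (total logical CPU count across both sockets)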

    Red Hat Enterprise Linux 6.4 (kernel-2.6.32-358.el6.x86_64)

    RHEL 6.4 (kernel 2.6.32-358.el6.x86_64) is the minimum supported operating system for Intel Ivy Bridge-EP processors. Some of the highlights of RHEL 6.4 are:

    1. Updated Resource Management Capabilities.

    2. New Tools and Improved Productivity Support.

    3. Updated network drivers and fixes for supported Intel and Broadcom network adapters.
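    A quick way to confirm that a node is on the required release and kernel:

    cat /etc/redhat-release    (should read Red Hat Enterprise Linux Server release 6.4)
    uname -r                   (should report 2.6.32-358.el6.x86_64 or a later errata kernel)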

    Mellanox OFED 2.0-3

    Support for Mellanox OFED 2.0-3 has been added. Mellanox OFED 2.0-3 includes drivers for mlx4 devices (ConnectX3 and ConnectX2) and mlx5 devices (ConnectIB). The officially supported devices are ConnectX3 and ConnectX2, with signaling rates of 20, 40 and 56 Gbps in InfiniBand mode.
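    Once the stack is installed, the OFED version and the InfiniBand link state can be checked from any node, assuming the standard OFED utilities are on the path:

    ofed_info -s    (prints the installed Mellanox OFED version string)
    ibstat          (shows the HCA model, port state and active link rate)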

    PEC Tools for systems management of PE-C servers

    PEC Tools are the official tools for systems management on PowerEdge-C servers. These tools can be used to configure BMC and BIOS parameters, and are included in BCM 6.1 under the folder /opt/dell/pec.

    Examples of tool usage:

    /opt/dell/pec/setupbios setting save > filename (To save the BIOS settings)

    /opt/dell/pec/setupbios setting readfile filename (To read and apply settings from a saved file)

    /opt/dell/pec/setupbios setting set [setting] [value] (To set value for a particular option)  
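    A typical workflow is to capture the settings from a known-good node and replay them on the others; the file name here is just a placeholder:

    /opt/dell/pec/setupbios setting save > golden_bios.txt      (capture the current BIOS settings on a reference node)
    /opt/dell/pec/setupbios setting readfile golden_bios.txt    (apply the saved settings on another node)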

    Hardware Match Check by BCM

    This tool is available with Bright Cluster Manager. It is used to check the consistency of the hardware across a large number of nodes being added to a cluster, by comparing the hardware profile of one node with the rest of the nodes in the same node group. The hardware match check can be automated; in case of any mismatch, the cluster administrator is notified. The monitoring sweep rate, which controls how often the alerts are raised, can be adjusted.

  • America's Cup: Real Time HPC Aims To Build a Faster Boat

    It wasn't THAT long ago that going to college clearly delivered an advantage to kids growing up and looking to enter the workplace. Today, it seems attending college has become a basic requirement to even be considered for many professions. You could argue the same is true of the use of technology in competition.

    Photo Credit: Emirates Team New Zealand website.

    Take the America's Cup as an example. The America's Cup is a sailing race that features the world's best sailors and the world's fastest yachts. Below is the definition I pulled from Wikipedia:

    The America's Cup, affectionately known as the "Auld Mug", is a trophy awarded to the winner of the America's Cup match races between two sailing yachts. One yacht, known as the defender, represents the yacht club that currently holds the America's Cup and the second yacht, known as the challenger, represents the yacht club that is challenging for the cup. The timing of each match is determined by an agreement between the defender and the challenger. The America's Cup is the oldest international sporting trophy.

    I can only imagine the level of skill and training required to become a sailor on one of the two competing yachts. It's truly amazing to watch the sailors in the America's Cup respond with such strength and precision during the race. The other part of this competition is the challenge of building the fastest boat. Now this is where things have gotten interesting, and where high performance computing (HPC) is playing a large role.

    Emirates Team New Zealand needed to design an entirely new boat, based on the new multihull requirements of this year's America's Cup. Amazingly, with the help of a Dell HPC cluster, Emirates Team was able to evaluate 300-400 boat designs through accurate computer testing in its quest to build the fastest boat possible. This is in contrast to the 30-40 physical designs the team was able to test for the 2007 competition! Simply amazing.

    The use of HPC clearly paid off early in the design phases, as the computer modeling allowed Emirates Team to develop a boat that could hydrofoil, providing an advantage by allowing the boat to lift out of the water while staying within the regulations of the competition. Their competitors eventually achieved hydrofoiling as well, but later than Emirates Team, which gave the team time to focus on other aspects of the boat design.

    Photo Credit: ANSYS Website.

    The key tools used by Emirates Team to design the boat included ANSYS simulation software running on the Dell HPC Cluster, and Latitude laptops. This marked the first time they were able to rely completely on numerical analysis and digital prototyping, which the team believes has helped them create a boat design that was 30-40 percent faster than their original concepts. 

    In the end, Oracle Team USA was able to make the biggest comeback in the history of the America's Cup, winning an unbelievable eight straight races. This competition is where technology and design meet the talent, experience, and strength of sailors. I would guess that the competition for next year's America's Cup is probably already underway. With the stakes so high, and the difference between winning and losing so slim, it's probably a good bet that HPC will play a larger role in the future.

    Other news coverage:

    TechHive: The America's Cup: nerves, skill, and computer design

    Video: ANSYS CFD in Action: Emirates Team New Zealand Profile 

    HPC Wire: America's Cup Challenger Emirates Team New Zealand Transform Boat Design with Dell Solution

    Attached Case Study: Team New Zealand takes on the America’s Cup with game-changing technology (see bottom of blog post to download)

    If you haven't seen how exciting America's Cup can be to watch, I embedded a short video below. 

     

  • Using Strace to Understand HPC Application Performance

    In a recent ADMIN HPC magazine article, Dr. Jeffrey Layton, a colleague of mine at Dell, introduces a three-part series focused on helping HPC organizations improve I/O performance.

    The article series will focus on the following three codes:

    • C; 
    • Fortran 90; and 
    • Python

    In the first article, Understanding I/O Patterns with C and strace, Jeff takes us on a walk-through of several application runs and reviews the results using strace, a debugging utility for Linux and some Unix systems. Jeff noted that he started with C, and not C++, because he still runs into C often in HPC environments, which I found interesting.

    After several examples, Jeff demonstrates how monitoring application performance via strace can deliver valuable insight with the goal of gaining application performance improvements. 
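    If you want to try the same approach on your own code, a minimal sketch looks like the lines below; ./my_app is a placeholder for your own binary, and the flags shown are standard strace options rather than necessarily the ones Jeff uses:

    strace -f -tt -T -e trace=open,read,write,lseek,close -o my_app.strace ./my_app    (full trace of I/O system calls with timestamps and per-call durations)
    strace -c -e trace=open,read,write,lseek,close ./my_app                            (summary table of call counts and time instead of a full trace)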

    You can read the first article here: http://www.admin-magazine.com/HPC/Articles/Tuning-Application-I-O-Patterns

    We'll be watching for the next article focused on Fortran 90. 

  • Is On-demand Supercomputing Possible Through the Cloud?

    Over the past decade, supercomputers have become more affordable and accessible as high performance computing (HPC) clusters were proven a viable alternative to traditional proprietary SMP systems. The price/performance improvement delivered by these new systems, built on standard off-the-shelf hardware components and open source software, represented a seismic shift in the landscape of supercomputing and in the amount of computing power being utilized. The HPC industry has continued to grow in terms of money spent on these systems, even though the cost of these supercomputers dropped more than 10x. One of the questions has been: are the same players just gaining more computing power, or have the price/performance improvements allowed the base of HPC users to grow?

    It's an interesting question. Now there is hope of making HPC more accessible to more organizations by delivering HPC resources to users via the cloud. By removing the requirement that organizations acquire, manage and maintain these supercomputing systems themselves, would the cloud allow new organizations to gain access to supercomputing power? It's possible the next seismic shift in HPC is currently underway with the promise of HPC via the cloud. A recent Scientific Computing World article called Meeting Demand explores that very concept. While not a new idea, it's finally come to fruition for some organizations. Whether you call it utility HPC or cloud bursting, the ability to access massive amounts of computing power through the internet is a reality. Today.

    Many of the case studies of this so-called cloud bursting feature organizations with existing HPC cluster systems on site. In these cases, utility HPC via the cloud is used where they need an extra push, or when they’ve reached the maximum capacity of their internal HPC systems.

    The article provides some guidance on when purchasing and owning makes sense versus accessing HPC systems via the cloud. A simple evaluation of predicted utilization can help organizations determine where the break-even point lies. For organizations that have constant demand for HPC systems, owning and managing these resources internally clearly makes sense. On the flip side, when HPC is used mostly to support special project work where utilization varies, investing in expensive infrastructure and HPC hardware that can quickly become outdated may not make sense.

    However, the most realistic situation is a combined approach. Perhaps you can call it a hybrid: internal systems are sized to run near maximum capacity at all times, and utility supercomputing comes into play when demand exceeds what those internal HPC resources can handle.

    In the article, Dell's own Bart Mellenbergh, EMEA Director of HPC, outlined some of the challenges to wider adoption of this so-called cloud bursting, or utility HPC. Some of these include licensing, HPC knowledge and the ability to run apps over the cloud, as well as data security.

    So it seems the majority of the traditional HPC sites will continue using their internally-owned HPC systems. Some visionary organizations have already set in motion a path that will allow them to gain access to cloud-based HPC when demand exceeds internal capacity.

    However, the question remains: will the promise of more accessible HPC through the cloud actually attract new users? In theory, any organization that stands to benefit from the powerful simulations and number crunching associated with HPC systems should be considering how it can leverage HPC via the cloud. Better data and better information lead to better decisions. What organization today wouldn't benefit from that?

    For more information on this article, please read:

    Scientific Computing World: Meeting Demand, by Beth Harlen http://content.yudu.com/A2byyn/SCWAUGSEP13/resources/18.htm

  • From Amdahl's Law to I/O, learn some reasons behind the limits of HPC scalable performance

    Theory rarely translates exactly into reality; there are too many variables. This is notably true in High Performance Computing (HPC), where both theoretical peak and real (measured) peak performance figures are commonplace. A colleague of mine from Dell, Dr. Jeff Layton, recently wrote an article in ADMIN HPC called A Failure to Scale that addresses scalability limits. The limits to improved performance and scalability can run counter to intuition; for example, many HPC administrators have added cores and CPUs but stopped seeing performance gains.

    Jeff points to Amdahl's Law, which injects some logic into this seemingly illogical behavior:

    Underlying the scalability limit is something called Amdahl’s Law, proposed by Gene Amdahl in 1967. ... This law illustrates the theoretical speedup of an application when running with more processes and how it is limited by the serial performance of the application. At some point, your application will not run appreciably faster unless you have a very small serial portion.

    Gene Amdahl
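    To make that concrete (the numbers here are my own illustration, not Jeff's): Amdahl's Law can be written as speedup S(N) = 1 / ((1 - p) + p/N), where p is the fraction of the run time that parallelizes and N is the number of processes. With p = 0.95, running on 64 processes gives S = 1 / (0.05 + 0.95/64), or about 15.4X, and even an infinite number of processes cannot exceed 1/0.05 = 20X; the serial 5% sets the ceiling.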

    As always, Jeff's article is loaded with detailed information on the subject, backed up by formulas and charts, along with interesting tidbits; for example, it explains where the term FUD came from. Most importantly, Jeff provides guidance to help system administrators who are facing performance and scalability limits due to Amdahl's Law, or even I/O.

    You can read the full article here: A Failure to Scale.

    Enjoy!

  • TACC's Stampede Achieves 6th Fastest in the World, in the Name of Advancing Open Science Research

    The team down at the Texas Advanced Computing Center (TACC) just doesn’t seem to want to rest. In fact, they’ve done just the opposite, pressing harder on the accelerator pedal and nearly doubling the Rmax of their Stampede supercomputer in just six months. The system now delivers an impressive 5.18 petaFLOPS of performance in 182 cabinets while using 75 miles of InfiniBand cables!

    STAMPEDE FROM TEXAS ADVANCED COMPUTING CENTER (TACC) - PHOTO: TACC

    This performance boost was enough to move Stampede from the seventh fastest supercomputer on Earth, according to the Top500.org list, to number six. I think that demonstrates the amazing leaps the high performance computing (HPC) industry is making, and the speed at which this is all occurring is mind-boggling.

    Of course the world’s attention is always captivated by the speed of these supercomputers, and the Top500 list does a great job of capturing that. However, the real excitement comes from the research and science that is enabled. TACC is rightfully proud not only of the power of its computing resources, but also of how much tools like Stampede will help scientists do more. Under the theme of open science research, researchers at any U.S. institution can gain access to Stampede through a proposal and approval process.

    We no longer need to imagine a day when researchers across the U.S. have access to the most powerful tools in modern science – high performance computing.

    Congratulations again to all of the newest members of the Top500 – and cheers to what we will discover and accomplish with this great computing power!

    Below are some interesting facts about Stampede:

    • 6,400 Dell C8220z nodes
    • 12,800 Intel Xeon E5-2680 processors
    • Over 6,400 Intel Xeon Phi Cards
    • 272 TB of RAM
    • 10 PF Theoretical Peak
    • 14 PB of Dell DCS Scorpion Storage
    • 182 racks
    • Weight:  500,000 pounds
    • 75 miles of cabling
    • Power usage:  Rated at 5 megawatts, but currently running at only 3 megawatts, saving UT TACC over $1M a year in electricity
    • Physical size:  10,000 square feet data center
    • 10 Quadrillion mathematical calculations per second
    • Estimated 6,000 users world-wide with over 1,000 research projects

    Additional Resources

    About TACC Stampede http://www.tacc.utexas.edu/resources/hpc/stampede
    Top 10 Supercomputers Illustrated June 2013 http://www.datacenterknowledge.com/top-10-supercomputers-illustrated-june-2013/2/
    TACC's Stampede Rumbles the Ground: Operational Jan. 7, 2013 http://en.community.dell.com/techcenter/high-performance-computing/b/general_hpc/archive/2013/01/10/tacc-s-stampede-rumbles-the-ground-operational-jan-7-2013.aspx
    TACC's Stampede Gallops to #7 Fastest Computer in the World http://en.community.dell.com/techcenter/high-performance-computing/b/general_hpc/archive/2012/11/14/tacc-s-stampede-gallops-to-7-fastest-computer-in-the-world.aspx
    Phinally, the Phull Phi Phamily is Announced http://en.community.dell.com/techcenter/high-performance-computing/b/weblog/archive/2013/06/18/phinally-the-phull-phi-phamily-is-announced.aspx

  • Phinally, the Phull Phi Phamily is Announced

    Intel has announced the full line-up of the now officially named “Intel® Xeon® Phi™ coprocessor x100 family”.   Whew!  What a mouthful.   I call it Phi for short. And we in HPC have been waiting a long time for Larrabee, uh, MIC, er, Knights Corner, I mean Phi to be announced and available to help advance our research.  

    I am very excited to be able to phinally talk more openly about this accelerator for HPC.  In a previous blog, I briefly described the already available 5110 model of the Phi coprocessor and how to compute its peak theoretical performance.

    Phi..., Nodes, Sockets, Cores and FLOPS, Oh, My!
    http://dell.to/YjFuN0

    STAMPEDE, Texas Advanced Computing Center (TACC)

    I also shared that the Dell TACC Stampede system used an early-access, special edition Phi called the SE10.  Stampede, which was ranked #7 on the November 2012 Top500 list, now moves up to #6 with the release of the June 2013 Top500 list (www.Top500.org).   Congrats to Tommy Minyard and the folks at TACC for the improved number.

    The production version of the Phi SE10 used in Stampede is called the 7120 and features a bit more performance than the special edition SE10 version.  The 7120 was announced at the 2013 International Supercomputing Conference (ISC’13 http://www.isc-events.com/isc13/), along with other details about the rest of the Phi models.

    For those that don’t have the time to read another blog or don’t want to spend the effort to do the math, here’s the summary of the peak performance of the three Phi models announced:

    • 3120:  1.00 TFLOPS (57 cores/Phi  *  1.1 GHz/core  * 16 GFLOPs/GHz  =  1,003.2 GFLOPS)
    • 5110:  1.01 TFLOPS  (60 cores/Phi  *  1.053 GHz/core  * 16 GFLOPs/GHz  =  1,010.88 GFLOPS)
    • SE10:  1.07 TFLOPS  (61 cores/Phi  *  1.1 GHz/core  * 16 GFLOPs/GHz  =  1,073.60 GFLOPS)  (Note: not available)
    • 7120:  1.20 TFLOPS  (61 cores/Phi  *  1.238 GHz/core  * 16 GFLOPs/GHz  =  1,208.28 GFLOPS)
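    (For those wondering where the 16 GFLOPs/GHz factor comes from: each Phi core has a 512-bit vector unit, which holds 8 double-precision values, and a fused multiply-add counts as 2 floating-point operations, so each core can retire 8 x 2 = 16 double-precision FLOPs per clock cycle.)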



    So, what does all this mean and how does it help HPC and Research Computing?  In short, we now have another 3 arrows in our quiver to attack the wide range of important problems that we face.

    How has the presence of Phi already affected HPC and Research Computing?  Well, the #1 system on the June 2013 Top500 list is using 48,000 Xeon Phi coprocessors.  Yes, 48 thousand.   See the www.Top500.org list for more details.  Of note is the fact that both TACC’s Stampede with 6,400 Phi coprocessors and the #1 system with 48,000 Phi coprocessors are operating at about 60% efficiency. That’s a consistent number over a wide range of coprocessors.

    If you have not yet had a chance to experiment with Phi then, as usual, I recommend a platform that is more suited to test and development than a production platform such as those deployed at TACC. To that end, Dell also announced at ISC’13 support for Phi in the PowerEdge R720 and T620, both of which are excellent development platforms for both GPUs and Phi coprocessors.  For more information about installing and configuring a Phi, see this posting:

    Deploying and Configuring the Intel Xeon Phi Coprocessor in a HPC Solution
    http://dell.to/14GtFRv


    When deploying larger quantities of Phi or GPU cards, the production platform used by TACC’s Stampede, the C8220x, is an option.

    To get you going on the software side: if you are already using Intel’s Cluster Studio XE (http://software.intel.com/en-us/intel-cluster-studio-xe), support for Phi is included.
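    If you want to try a first build, the two basic compilation modes look roughly like this; the source file name is a placeholder, and -mmic produces a binary that runs natively on the coprocessor itself:

    icc -O3 -xHost my_code.c -o my_code.host    (ordinary host build for the Xeon CPUs)
    icc -O3 -mmic my_code.c -o my_code.mic      (native build that runs directly on the Phi)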

    What does the future hold?   Personally, comparing and contrasting the performance of Phi coprocessors and GPUs is still on my list for a future blog.  Now that Phi is announced, I may be able to get to that sooner!

    Secondly, there is an upcoming whitepaper from Saeed Iqbal, Shawn Gao, and Kevin Tubbs from Dell’s HPC Engineering Team.  They present a performance analysis of the 7120 Phi in the R720.  Preliminary results indicate about a 6X speedup and 2X the energy efficiency compared to Xeon CPUs on LINPACK.  I’ll possibly update this blog with that link and definitely tweet about it as soon as it is available.

    Finally, Intel also revealed that the next-gen of Phi is code-named Knights Landing and will be available not only as a PCIe card version as today but also as a “host processor” directly installed in the motherboard socket. They also shared that the memory bandwidth will be improved.  This might help with the efficiency mentioned previously.

    CPUs, GPUs, Coprocessors and soon, “host processors”.  Interesting times ahead.   I’ll be following those developments and sharing critical information as it becomes available.

    If you have comments or can contribute additional information, please feel free to do so.  Thanks.  --Mark R. Fernandez, Ph.D.

    #Iwork4Dell
    Follow me on Twitter @MarkFatDell

  • Deploying and Configuring the Intel Xeon Phi Coprocessor in an HPC Solution

    By: Munira Hussain and Kevin Tubbs

     

    The Intel Xeon Phi coprocessor boosts and aggregates the parallel processing power available for computation in a cluster. It is designed to extend the parallel programming model of Intel Xeon processors and to benefit applications that are able to scale. Like a multi-socket Xeon system, it is cache coherent and its cores share memory; in effect, the Intel Xeon Phi coprocessor is an SMP on a chip that connects to the other devices in the system via the PCIe bus.

     

    To install and configure Intel Xeon Phi Coprocessors, the administrator must install the Intel Manycore Platform Software Stack (MPSS) and provide initial configuration of all the coprocessors in a cluster. Installing and configuring a new piece of technology can be complex and time-consuming.  This blog provides detailed steps and best practices to get started with Xeon Phi.

    Note that such a solution setup has also been simplified with the Bright Cluster Manager software. The drivers and software needed for Intel Xeon Phi are integrated in the software stack for ease of deployment and provisioning.

     

    Pre-Install Configuration:

    If the Xeon Phi coprocessor is visible to the hardware (it appears in “lspci”) but is not recognized by the Intel tools, confirm that the following BIOS setting is enabled.

    Enable the large BAR setting in the BIOS: Integrated Devices >> Memory I/O larger than 4GB >> Enabled

    This can also be done with Dell Deployment Tool Kit 4.2 or higher using the “syscfg” tool from the operating system.

    /opt/dell/toolkit/bin/syscfg --MmioAbove4Gb=enable

     

    Setting up Host Nodes:

    Install the host nodes with Bright Cluster Manager 6.1, which includes the Intel MPSS. The host nodes are the nodes to which the Intel Xeon Phi coprocessors are connected via PCIe slots.

     

    Installation:

    1. Install the Intel MPSS package on the compute node that owns the Intel Phi. Bright Cluster Manager packages the software for easy installation.
    2. The main drivers and tools are bundled in rpm format in Bright Cluster Manager. These are extracted from Intel MPSS and made easy to deploy. The components intel-mic-cross, intel-mic-driver, intel-mic-ofed, intel-mic-flash and intel-mic-runtime are installed on the host nodes.

      The k1om packages are meant to run inside the Intel Xeon Phi. The packages deployed are libgcrypt-k1om, slurm-client-k1om, strace-k1om, munge-k1om and libgpg-error-k1om.

    3. Once the host node is installed, run the “lspci” command in the OS to verify that the hardware detects the coprocessor.
    4. Load the module on the host: module add intel/mic/runtime/<2.x.version>, which loads the MIC modules and provides access to the tools.
    5. Update the bootloader and flash on the Intel Xeon Phi. This can be done for all of the Intel Xeon Phi cards through Bright Cluster Manager rather than by going to individual nodes.
      1. Stop the Intel mpss service before proceeding to flash the Phi (service mpss stop), from the CMGUI or cmsh.
      2. Reset the Intel Xeon Phi cards (if the micctrl command is not available, load the module as in step 4): micctrl -r -f -w (this resets the Phi and then puts it in the wait state; at this point the card is in a blank state, ready to be updated with an image/firmware).
      3. Update the respective boot loader and firmware image on the card: micflash -v -update -noreboot -device all (this updates the SMC bootloader for all Intel Phi cards attached to the specific host/compute node).
      4. Once the Phi has been flashed, start the mpss service on the host/compute node (service mpss start).
      5. Reboot the compute node. Make sure it is a power reset rather than an OS reboot.
    6. Once the host node comes back up:
      1. Go into CMGUI and set up the MIC nodes using the MIC setup wizard. With this tool you can easily attach the respective number of Intel Phi cards present in each host node.
      2. Configure the network bridge IP between the host node and the Intel Phi for communication purposes. Bright Cluster Manager automatically checks for IP conflicts across the connected Intel Xeon Phi cards in the cluster.
      3. The tool automatically creates and assigns IPs to the Intel Xeon Phi cards. At this stage the Intel Xeon Phi can be recognized and managed/monitored from Bright Cluster Manager.
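    Once the cards are configured, a few quick checks confirm that the MPSS stack and the coprocessors are healthy; mic0 below assumes the default MPSS naming for the first card:

    service mpss status    (confirms the MPSS service is running on the host)
    micinfo                (reports the driver, flash version and status of each card)
    ssh mic0 uname -a      (the coprocessor runs its own embedded Linux and should answer over the bridge)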

    Dell’s HPC offerings with the Xeon Phi are supported on the PowerEdge R720 and C8220x servers.  More info:

    Dell HPC Solutions: http://www.dellhpcsolutions.com/

    Bright Computing: http://info.brightcomputing.com/intel-xeon-phi/

    Intel Corporation: http://www.dellhpcsolutions.com/dellhpcsolutions/static/XeonPhi.html