High Performance Computing Blogs

High Performance Computing
A discussion venue for all things high performance computing (HPC), supercomputing, and the technologies that enable scientific research and discovery.
  • Newest South African Student Team Visits TACC

    All of the student teams attending this summer's HPC Advisory Council's International Supercomputing Conference (HPCAC-ISC) have the same goal in mind: win the student cluster competition. But the new team from South Africa may feel some added pressure when they arrive in Frankfurt this July. They hope to become the third consecutive champions from their country.

    Team "Wits-A" from the University of the Witwatersrand in Johannesburg won the right to defend South Africa's title at ISC '15 during the South African Center for High Performance Computing's (CHPC) Ninth National Meeting, held in December at Kruger National Park. The students bested seven other teams from around South Africa.

    As part of their victory, the South Africans recently traveled to the United States. On their itinerary was a tour of the Texas Advanced Computing Center (TACC), where they had the opportunity to see the Visualization Laboratory (Vislab) and the Stampede supercomputer, while gaining insights about how best to compete at the ISC '15 Student Cluster Challenge in July. Also on the itinerary was a Texas tradition - sampling some down-home BBQ!

    Hoping for that three-peat win are Ari Croock, James Allingham, Sasha Naidoo, Robert Clucas, Paul Osel Sekyere, and Jenalea Miller, with Vyacheslav Schevchenko and Nabeel Rajab serving as reserve team members.

    You can learn more about Team South Africa here.

  • New Collaboration Saving the Lives of Kids with Cancer

    by Suzanne Tracy

    Some 4,100 genetic diseases affect humans. Tragically, they are also the primary cause of death in infants, but identifying which specific genetic disease is affecting an afflicted child is a monumental task. Increasingly, however, medical teams are turning to high performance computing and big data to uncover the genetic cause of pediatric illnesses.

    Through the adoption of HPC and big data, clinicians are now able to accelerate the delivery of new diagnostic and personalized medical treatment options. Successful personalized medicine is the result of analyzing genetic and molecular data from both patient and research databases. The use of high performance computing allows clinicians to quickly run the complex algorithms needed to analyze the terabytes of associated data.

    The marriage of personalized medicine and high performance computing is now helping to save the lives of pediatric cancer patients thanks to a collaboration between Translational Genomics Research Institute (TGen) and the Neuroblastoma and Medulloblastoma Translational Research Consortium (NMTRC).

    The NMTRC conducts various medical trials, generating hundreds of measurements per patient that must then be analyzed and stored. Through a groundbreaking collaboration between TGen, Dell and Intel, NMTRC is now using TGen's highly specialized software and tools, which include Dell's Genomic Data Analysis Platform and cloud technology, to decrease the data analysis time from 10 days to as little as six hours. With this information, clinicians are able to treat their patients quickly and dramatically improve the efficacy of their trials.

    Thanks to the collaboration, NMTRC has launched personalized pediatric cancer medical trials to provide near real-time information on individual patients' tumors. This allows clinicians to make faster and more accurate diagnoses, while determining the most effective medications to treat each young patient. Clinicians are now able to target the exact malignant tumor, while limiting any potential residual harm to the patient.

    You can read more about this inspiring collaboration here.

  • Accelerating HPL Using the Intel Xeon Phi 7120P Coprocessors

    by Saeed Iqbal and Deepthi Cherlopalle

    The Intel Xeon Phi series can be used to accelerate HPC applications in the C4130. The highly parallel architecture of the Phi coprocessors works seamlessly with the standard Xeon E5 processor series, providing additional parallel hardware to boost parallel applications. A key benefit of the Xeon Phi series is that applications do not need to be redesigned; only compiler directives are required to make use of the coprocessor.

    Fundamentally, the Intel Xeon Phi series are many-core parallel processors, with each core having a dedicated L2 cache. The cores are connected through a bi-directional ring interconnect. Intel offers a complete set of development, performance monitoring and tuning tools through its Parallel Studio and VTune. The goal is to enable HPC users to take advantage of the parallel hardware with minimal changes to their code.

    The Xeon Phi has two main modes of operation, the offload mode and the native mode. In the offload mode, designated parts of the application are "offloaded" to the Xeon Phi, if one is available in the server. The required code and data are copied from the host to the coprocessor, the processing is done in parallel on the Phi, and the results are moved back to the host. There are two kinds of offload modes, the non-shared and virtual-shared memory modes; each offers a different level of user control over data movement to and from the coprocessor and incurs different types of overheads. In the native mode, the application is built for the coprocessor and runs directly on the Xeon Phi; when the application is instead run on both the host and the Xeon Phi simultaneously (often called the symmetric mode), the two sides communicate the required data between themselves as needed. A good reference on the Xeon Phi and its modes can be found here.
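
    As an illustration of how little code the offload mode typically involves, here is a minimal sketch of an offloaded loop using the Intel compiler's offload directives (the array, sizes and arithmetic are made up for this example and are not taken from the blog):

    ```c
    #include <stdio.h>
    #include <stdlib.h>

    #define N (1 << 24)

    int main(void) {
        float *a = malloc(N * sizeof(float));
        for (int i = 0; i < N; i++) a[i] = (float)i;

        /* Offload the loop to the first Xeon Phi card. The inout clause copies
           the array to the coprocessor before the region and back afterwards.
           A compiler without offload support simply ignores the pragma and
           runs the loop on the host. */
        #pragma offload target(mic:0) inout(a:length(N))
        {
            #pragma omp parallel for
            for (int i = 0; i < N; i++)
                a[i] = 2.0f * a[i] + 1.0f;
        }

        printf("a[10] = %.1f\n", a[10]);
        free(a);
        return 0;
    }
    ```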

    The Intel Xeon Phi 7120P coprocessor has the highest performance in the Phi series. It has 61 cores, can handle 244 threads, and is rated at 1.2 TFLOPS of double-precision performance. The 7120P also supports Intel Turbo Boost technology. In the runs below, the bulk of the compute-intensive calculations are done on the coprocessors.
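
    As a rough sanity check (not stated in the original post), that peak figure follows from the card's specifications: 61 cores × roughly 1.24 GHz × 16 double-precision floating-point operations per core per cycle (8-wide vectors with fused multiply-add) works out to about 1.2 TFLOPS.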

    The PowerEdge C4130 offers five configurations, "A" through "E". Among these, two are balanced configurations, "C" and "D", and these are the ones considered for acceleration in this blog. Configuration "C" is the balanced four-coprocessor option, with two coprocessors attached to each host processor, while configuration "D" has a single Xeon Phi attached to each host processor. The details of the two configurations are shown in Table 1 and the block diagram (Figure 1) below.

    This blog shows the results of the acceleration observed on the C4130 with Intel Xeon Phi 7120P coprocessors in configurations "C" and "D".

    Table 1: Two Balanced C4130 Configurations C and D


     Figure 1: PE C4130 Configuration Block Diagram

    Table 2 gives more information about the hardware configuration used for the tests.

    Table 2: Hardware Configuration

    Figure 2: HPL Acceleration (FLOPS compared to CPU-only) and Efficiency on the C4130 Configurations

    Figure 2 illustrates the HPL performance on the PowerEdge C4130 server. The offload execution mode was used for all the runs. In this mode the application splits the workload: the highly parallel code is offloaded to the coprocessors, and the Xeon host processors primarily run the serial code. Configuration C has two Phis connected to each CPU, and configuration D has a single Phi connected to each CPU. ECC is enabled and turbo mode is disabled for all the runs.

    The Intel Xeon Phi coprocessor delivers strong performance for highly parallel applications like HPL. In the graphs above, the CPU-only performance is shown for reference. The compute efficiency of the CPU-only configuration is 91.6%, whereas configuration C has a compute efficiency of 75.6% and configuration D 81.2%. CPU-only configurations generally have higher efficiency than CPU-plus-Phi configurations, and configuration D shows higher efficiency than C. Compared to the CPU-only configuration, the HPL acceleration for configuration C with four Xeon Phis is 5.3X, and for configuration D with two Xeon Phis it is 3.3X.

    Figure 3: Total power and performance/watt on the C4130 configurations

    Figure 3 shows the power consumption data for the HPL runs on the CPU-only configuration and configurations C and D. In general, accelerators can consume substantial power when loaded with compute-intensive workloads. The power consumption of the CPU-only configuration is 520W, and it increases for configurations C and D; each Intel Xeon Phi 7120P coprocessor can consume up to 300 watts. The power consumption of configurations C and D is 3.3X and 2.1X, respectively, that of the CPU-only configuration.

    The Intel Xeon Phi 7120P coprocessor provides high performance, a large memory capacity and good performance-per-watt metrics. Configuration C shows a performance-per-watt of 2.44 GFLOPS/W and configuration D 2.34 GFLOPS/W, whereas the CPU-only configuration gives 1.56 GFLOPS/W.
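
    As a quick back-of-the-envelope cross-check (not part of the original post), multiplying the reported power figures by the reported performance-per-watt numbers should land close to the HPL accelerations quoted above. A minimal sketch, using only the numbers quoted in this blog:

    ```c
    #include <stdio.h>

    int main(void) {
        /* Numbers quoted above: CPU-only draws ~520 W at 1.56 GFLOPS/W;
           configuration C draws ~3.3x that power at 2.44 GFLOPS/W;
           configuration D draws ~2.1x that power at 2.34 GFLOPS/W. */
        const double cpu_w      = 520.0;
        const double cpu_gflops = cpu_w * 1.56;        /* ~811 GFLOPS  */
        const double c_gflops   = 3.3 * cpu_w * 2.44;  /* ~4187 GFLOPS */
        const double d_gflops   = 2.1 * cpu_w * 2.34;  /* ~2555 GFLOPS */

        /* These ratios come out near the reported 5.3X and 3.3X. */
        printf("Config C speedup ~ %.1fX\n", c_gflops / cpu_gflops);
        printf("Config D speedup ~ %.1fX\n", d_gflops / cpu_gflops);
        return 0;
    }
    ```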

  • A Panel Discussion on Data-Intensive Computing

    Our second panel discussion at SC14 focused on data-intensive computing. Entitled Data Intensive Computing: The Gorilla Behind the Computation, it was an intriguing look at the challenges of moving and storing data. The discussion was moderated by Rich Brueckner of insideHPC, and we were honored to have had the participation of so many industry luminaries. Our distinguished panel included:

    • Niall Gaffney, Texas Advanced Computing Center
    • Kenneth Buetow, Ph.D., Arizona State University
    • William Law, Stanford University
    • Erik Deumens, Ph.D., University of Florida

    Thank you again to Rich and all the panelists!

    If you missed the panel, you can view it here.

  • A Panel Discussion on HPC in the Cloud

    At SC14, we invited some of the foremost thought leaders in the industry to join panel discussions in our booth. Once again, the panels were moderated by Rich Brueckner of insideHPC.

    HPC in the Cloud: The Overcast has Cleared was our panel discussion focusing on whether private and public clouds are now being seen as the de facto way things are accomplished in HPC. This proved to be a popular topic for those in attendance, and we were honored to have had the participation of:

    • Larry Smarr, Ph.D., University of California, San Diego
    • Muhammad Atif, Ph.D., National Computational Infrastructure, Australian National University
    • Roger Rintala, Intelligent Light
    • Boyd Wilson, Clemson University and Omnibond

    Thanks again to everyone for their participation!

    If you missed the panel, you can view it here.

  • The University of Cambridge Adds Phi to Ramp Up Research

    By David Detweiler

    Modern-day research is increasingly dependent on high performance computing to deliver faster, less expensive, and more accurate research discoveries. Since 2010, the University of Cambridge HPC Solution Center has looked into the real-world challenges researchers face and provided viable solutions. In order to help the wider research community make the most of new discoveries, the university's HPC Solution Center is expanding its focus to data analytics and cloud platforms.

    Recently, the University announced that researchers will now have access to larger HPC clusters based on Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors. These additions will help the Center accomplish its additional focus on large-scale, data-centric HPC, data analytics and multi-tenanted cloud HPC provision. This will help the Center meet the growing demand for parallel processing applications from a wide variety of university departments, including genomics and astronomy. Additionally, increasing numbers of businesses with large compute demands are turning to the Center for computational help.

    The addition of Phi will also provide ease of use and help the Center improve power efficiency, which is an important consideration for many of the University's workloads, especially those in astrophysics and genome processing.

    The Center, which is in the process of moving to a new data environment, currently includes 600 Dell servers with a total of 9,600 processing cores on Sandy Bridge–generation Xeon chips. A GPU environment consists of a 128-node, 256-card NVIDIA K20 GPU cluster, believed to be the fastest in the United Kingdom.
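
    (As a rough consistency check, 9,600 cores across 600 servers works out to 16 cores per server, in line with dual 8-core Sandy Bridge–generation Xeons, and 128 nodes with two K20 cards each accounts for the 256-card total.)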

    We are honored to be a part of the University of Cambridge's HPC Solution Center, and look forward to the exciting discoveries this addition will help researchers make.

    More information about this exciting collaboration is available at insideHPC.

  • Unsnarling Traffic Jams at TACC

    by Stephen Sofhauser

    Nobody enjoys being stuck in traffic. Congestion is an increasing reality for people living in cities with fast growing populations and transportation infrastructures ill-equipped to meet mounting demand. But that may not have to be a reality anymore.

    A group of researchers within the Center for Transportation Research (CTR) at the University of Texas at Austin is turning to high performance computing to help solve the city's growing traffic woes.

    Using the Stampede supercomputer at the Texas Advanced Computing Center (TACC) to run advanced transportation models, CTR is helping local transportation agencies to better understand, compare and evaluate a variety of solutions to traffic problems in Austin and the surrounding region. By teaming with TACC, they have seen computations run 5 to 10 times faster on Stampede.

    Researchers are using dynamic traffic assignments to support city planners by offering greater insights into how travel demand, traffic control and changes in the existing infrastructure combine to impact transportation in the region.

    To help with better decision making, an interactive web-based visualization tool has been developed. It allows researchers to see the results of the various traffic simulations and associated data in a multitude of ways. By providing various ways to view the area's transportation network, researchers can gain greater clarity into traffic, how their models are performing, and what the impact of suggested transportation strategies might be.

    You can learn about UT's efforts to unclog Austin's roads at TACC's news site.

  • Enhanced Molecular Dynamics Performance with K80 GPUs

    By: Saeed Iqbal & Nishanth Dandapanthula

    The advent of hardware accelerators in general has impacted Molecular Dynamics by reducing the time to results, thereby providing a tremendous boost in simulation capacity (e.g., previous NAMD blogs). Over time, applications from several domains, including Molecular Dynamics, have been optimized for GPUs. A comprehensive (although constantly growing) list can be found here. LAMMPS and GROMACS are two open source Molecular Dynamics (MD) applications which can take advantage of these hardware accelerators.

    LAMMPS stands for "Large-scale Atomic/Molecular Massively Parallel Simulator" and can be used to model solid-state materials and soft matter. GROMACS is short for "GROningen MAchine for Chemical Simulations". The primary use of GROMACS is simulating biochemical molecules (bonded interactions), but because of its efficiency in calculating non-bonded interactions (atoms not linked by covalent bonds), its user base is expanding to non-biological systems.

    NVIDIA's K80 offers significant improvements over the previous model, the K40. From the HPC perspective, the most important improvement is the 1.87 TFLOPS (double precision) compute capacity, which is about 30% more than the K40. The auto-boost feature in the K80 automatically provides additional performance if power headroom is available. The internal GPUs are based on the GK210 architecture and have a total of 4,992 cores, which represents a 73% improvement over the K40. The K80 has a total memory of 24 GB, divided equally between the two internal GPUs; this is 100% more memory capacity than the K40. The memory bandwidth of the K80 is improved to 480 GB/s. The rated power consumption of a single K80 card is a maximum of 300 watts.
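
    As a quick cross-check of those percentages (the K40 figures are not given in this post): the K40 has 2,880 CUDA cores and is rated at roughly 1.43 TFLOPS of double-precision performance, so 4,992 cores is about 73% more and 1.87 TFLOPS is roughly 30% more, matching the numbers above.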

    Dell has introduced a new high-density GPU server, the PowerEdge C4130, which offers five configurations, noted here as "A" through "E". Part of the goal of this blog is to find out which configuration is best suited for LAMMPS and GROMACS. The three quad-GPU configurations "A", "B" and "C" are compared. The two dual-GPU configurations "D" and "E" are also compared, for users interested in a lower GPU density of two GPUs per rack unit. The first two quad-GPU configurations ("A" and "B") have an internal PCIe switch module which allows seamless peer-to-peer GPU communication. We also want to understand the impact of the switch module on LAMMPS and GROMACS. Figure 1 below shows the block diagrams for configurations A to E.

    Combining K80s with the PowerEdge C4130 results in an extraordinarily powerful compute node. The C4130 can be configured with up to four K40 or K80 GPUs in a 1U form factor. The PowerEdge C4130 is also unique in offering several workload-specific configurations, potentially making it a better fit for MD codes in general, and for LAMMPS and GROMACS in particular.

    Figure 1: C4130 Configuration Block Diagram

    We recently evaluated the performance of NVIDIA's Tesla K80 GPUs on Dell's PowerEdge C4130 server using standard benchmarks and applications (HPL and NAMD).

    Performance Evaluation with LAMMPS and GROMACS

    In this blog, we quantify the performance of two molecular dynamics applications, LAMMPS and GROMACS, by comparing their performance on K80s to a CPU-only configuration. Performance is measured as "Jobs/day" for LAMMPS and "ns/day" (nanoseconds of simulated time per day of wall-clock time) for GROMACS; higher is better in both cases. Table 1 gives more information about the hardware configuration and application details used for the tests.

    Table 1: Hardware Configuration and Application Details
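
    As a side note on those two metrics, the sketch below (with made-up run times, not measurements from these tests) shows how each one is derived from a run's wall-clock time:

    ```c
    #include <stdio.h>

    int main(void) {
        /* Hypothetical run: 2 ns of simulated time finished in 1.5 hours. */
        const double simulated_ns  = 2.0;
        const double wallclock_sec = 5400.0;

        /* GROMACS-style metric: simulated nanoseconds per day of wall-clock time. */
        const double ns_per_day = simulated_ns / (wallclock_sec / 86400.0);

        /* LAMMPS-style metric: how many such jobs fit into one day. */
        const double jobs_per_day = 86400.0 / wallclock_sec;

        printf("ns/day   = %.1f\n", ns_per_day);    /* 32.0 */
        printf("Jobs/day = %.1f\n", jobs_per_day);  /* 16.0 */
        return 0;
    }
    ```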

    Figure 2: LAMMPS Performance on K80s Relative to CPUs

    Figure 2 quantifies the performance of LAMMPS over the five configurations mentioned above and compares them to the CPU-only server (i.e., the performance of the application on a server with two CPUs and no accelerators). The graph can be described as follows:

    • Configurations A and B are the switched configurations, with the only difference being that B has an extra CPU. Since LAMMPS primarily uses the GPUs, the extra CPU does not noticeably change performance.
    • Configurations "A", "B" and "C" are four-GPU configurations. Configuration C performs better than A and B. This can be attributed to the PCIe switch in configurations A and B, which introduces an extra hop of latency compared to "C", which is a more balanced configuration.
    • The two-GPU configurations are D and E. Configuration D performs slightly better than E, which again can be attributed to the more balanced nature of D. As mentioned previously, LAMMPS is not significantly affected by the extra CPU in D.
    • An interesting observation is that when moving from two K80s to four K80s (i.e., comparing configurations D and C in Figure 2), the performance almost quadruples, meaning each extra K80 added (two GPUs per K80) roughly doubles performance. This can be partially attributed to the size of the dataset used.


    Figure 3: GROMACS Performance on K80s relative to CPUs

    Figure 3 shows the performance of GROMACS among the five configurations and the CPU-only configuration. The explanation is as follows.

    • Among the quad-GPU configurations (A, B and C), B performs the best. In addition to the four GPUs attached to CPU1, GROMACS also uses the whole of the second CPU (CPU2), making B the best performing configuration. GROMACS appears to benefit from the second CPU as well as from the switch; it is likely that the application has substantial GPU-to-GPU communication.
    • Configuration C outperforms A. This can be attributed to the more balanced nature of C. Another contributing factor may be the latency penalty of the PCIe switch in A.
    • Even in the dual-GPU configurations (D and E), D, the more balanced of the two, slightly outperforms E.

    Performance is not the only criterion when a performance-optimized server as dense as the Dell PowerEdge C4130, with four 300-watt accelerators, is used. The other dominating factor is how much power these platforms consume. Figure 4 answers the questions pertaining to power.

    • In the case of LAMMPS, the order of power consumption is as follows: B > A >= C > D > E.
      • Configuration B is a switched configuration and has an extra CPU compared to configuration A.
      • Configuration A incurs the slight overhead of the switch and thus draws slightly more power than C.
      • Configuration D is a dual-GPU, dual-CPU configuration and thus draws more power than E, which is a single-CPU, dual-GPU configuration.
    • In the case of GROMACS, the order is the same, but B draws considerably more power relative to A and C than it does with LAMMPS. This is because GROMACS uses the extra CPU in B while LAMMPS does not.

    In conclusion, both GROMACS and LAMMPS benefit greatly from Dell's PowerEdge C4130 servers and NVIDIA's K80s. In the case of LAMMPS, we see a 16x improvement in performance while using only 2.6x more power. In the case of GROMACS, we see a 3.3x improvement in performance while using 2.6x more power. The comparisons here are against a dual-CPU-only configuration. Obviously, there are many other factors which come into play when scaling these results to multiple nodes; GPUDirect, the interconnect, and the size of the dataset/simulation are just a few of them.
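
    Put another way, the 16x LAMMPS speedup at 2.6x the power works out to roughly a 6x gain in performance per watt, while the 3.3x GROMACS speedup at 2.6x the power is closer to a 1.3x gain.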

  • Using HPC to Improve Competitiveness

    by Onur Celebioglu

    A story from insideHPC took a look at how the National Center for Supercomputing Applications (NCSA) is helping organizations in the manufacturing and engineering industries use high performance computing to improve their competitiveness. The posting suggests that companies utilizing supercomputing resources such as the iForge HPC cluster, which is based on Dell and Intel technologies, may realize some important benefits, including:

    • Large memory nodes performing 53 percent faster on demanding simulations
    • Benchmarking to predict performance and scalability benefits from larger software licenses
    • The ability to solve finite element analysis and computational fluid dynamics problems

    The manufacturing industry is becoming increasingly competitive, making the role high performance computing plays even more important. The ability to simulate models, for example, allows research and design to be conducted with greater cost efficiency. No longer is it necessary to build, test and repeat. Thus manufacturers are able to deliver a safer product to the market in a shorter period of time.

    As Dell CTO and Senior Fellow Jimmy Pike wrote in his blog, there is also a growing demand for iterative research - the emerging ability to change variables along the way without having to run a new batch - which will continue to shorten products' time to market while decreasing costs.

    And that's what competition is all about.

  • The Cypress Supercomputer Takes Root at Tulane

    In August of 2005, Hurricane Katrina ravaged the city of New Orleans and nearly destroyed the almost 180-year-old Tulane University. But along with the intrepid people of the city, Tulane's leaders refused to give up. The university rose out of the ashes, rebuilt and moved forward. Few accomplishments exemplify Tulane's rebirth better than the new Cypress supercomputer.

    In the years immediately following Katrina, Tulane's IT infrastructure lacked the power and capacity to meet the demand of a world-class university. The network was frequently clogged, and afternoon email slowdowns were a daily occurrence.

    Under the guidance of Charlie McMahon, Ph.D., Vice President of Information Technology, Tulane set out not only to develop a new, more powerful system, but also one befitting a place of learning as impressive as Tulane itself.

    The crowning achievement is the Cypress supercomputer, which will be used for a wide range of workloads, from sea-level calculations to molecular docking in support of pharmaceutical discovery. Tulane has even contracted with the National Football League Players' Association to conduct long-term tracking of players, who have a higher risk of traumatic brain injury.

    You can learn more about Tulane's remarkable journey to recovery in this video.

    Cypress arrived just in the nick of time: next year, Tulane is expected to see its largest, and arguably most diverse, graduating class in its history, with greater numbers of potential students applying every year.

    We are very proud of our partnership with Tulane, and are humbled to have played a small part in this amazing institution's bright future. You can read a case study of Dell's work with Tulane here, and learn more about Cypress at HPCwire.