High Performance Computing Blogs

High Performance Computing
A discussion venue for all things high performance computing (HPC), supercomputing, and the technologies that enable scientific research and discovery.
  • The Democratization of HPC

    The democratization of HPC is under way. Removing the complexities traditionally associated with HPC, and making insightful data more easily accessible to a company’s users, are the linchpins to greater adoption of high performance computing by organizations beyond the more traditional groups.

    HPC is no longer simply about crunching information. The science has evolved to include predicting and developing actionable insights. That is where the smaller, newer adopters uncover the true value of HPC.

    However, these organizations can become overwhelmed by the amount, size, and types of information they’re collecting, storing, and analyzing. Increasingly, these enterprises are identifying HPC as an efficient and cost-effective way to quickly glean valuable insights from their big data applications.

    That cost-effective efficiency can yield impressive, measurable results. In just one example, Onur Celebioglu, Dell’s director of HPC & SAP HANA solutions, Engineered Solutions and Cloud, cited how HPC has allowed life sciences organizations working with big data to slash genetic sequencing from four days to just four hours per patient. That reduction has provided an untold improvement in treatment plans, which has bettered the lives of patients and their families.

    Greater democratization also occurs when companies realize it is possible to leverage HPC, cloud, and big data to benefit their business without abandoning their existing systems. The ability to build onto an existing system as business needs warrant allows more organizations that otherwise couldn’t reap the benefits of HPC to do so.

    You can read more about the democratization of HPC at EnterpriseTech.

     

  • New IT Academy in South Africa Helps Students Pursuing HPC Careers

    Promising students in South Africa will now have an exciting new opportunity to obtain greater, more in-depth experiences in high performance computing (HPC). A partnership between the South African Department of Trade and Industry (DTI), the Center for High Performance Computing (CHPC), and Dell Computers has resulted in a new IT academy.

    Slated to open in January 2016, the Khulisa IT Academy will each year play host to promising students from economically disadvantaged areas throughout the country. "Khulisa" translates as "nurturing" in the isiZulu language.

    The purpose of the academy is to grow the skill set and experience of young South Africans pursuing careers in HPC. During their two-year terms at the academy, students will be able to marry the theoretical aspects of HPC they have learned in the classroom with real-life, practical experiences offered through various industry internships.

    To allow the students to concentrate on their education and future professions, each will receive a stipend for the duration of their time at the academy. Upon graduation, these rising HPC stars will be ready to enter into careers in any number of industries.

    Dell is honored to be able to play a small role in helping these worthy students. The company is investing financially in the academy, as well as offering startup funding for the ventures of students with proven entrepreneurial skills.

     

  • The Democratization of Genomics Continues: How Health IT Professionals Can Enable Genomic-Driven Precision Medicine

    by Seth Feder

    Genomics is no longer solely the domain of university research labs and clinical trials. Commercial entities such as tertiary care hospitals, cancer centers, and large diagnostics labs are now sequencing genomes. Perhaps ahead of the science, consumers are seeing direct marketing messages about genomic tumor assessments on TV. Not surprisingly, venture capitalists are looking for their slice of the pie, investing approximately $248 million in personalized medicine startups last year.

    So how can health IT professionals get involved? As in the past, technology coupled with innovation (and the right use case) can drive new initiatives to widespread adoption. In this case, genomic medicine has the right use case, and IT innovation is driving adoption.

    While the actual DNA and RNA sequencing takes place inside very sophisticated instrumentation, sequencing is just one step in the process. The raw data has to be processed, analyzed, interpreted, reported, shared, and then stored for later use. Sound familiar? It should, because we have seen this before in fields such as digital imaging, which drove the widespread deployment of Picture Archiving and Communication Systems (PACS) in just about every hospital and imaging clinic around the world.

    As with PACS, those in clinical IT must implement, operationalize, and support the workflow. The processing and analysis of genomic data is essentially a big data problem, solved by immense amounts of computing power. In the past, these resources were housed inside large, exotic supercomputers available only to elite institutions. Today, however, HPC built on scale-out x86 architectures with multi-core processors has made this power attainable to the masses, and thus democratized. Parallel file systems that support HPC are much easier to implement and support, as are standard high-bandwidth InfiniBand and Ethernet networks. Further, the public cloud is emerging as a supplement to on-premises computing power. Some organizations are exploring off-loading part of the work beyond their own firewall, either for added compute resources or as a location for long-term data storage.

    For example, in 2012 colleagues at Dell and I worked with the Translational Genomics Research Institute (TGen) to tune its system for genomics input/output demands by scaling its existing HPC cluster to include more servers, storage, and networking bandwidth. This allowed researchers to get the IT resources they needed faster, without having to depend on shared systems. TGen worked with the Neuroblastoma and Medulloblastoma Translational Research Consortium (NMTRC) to develop a methodology for fast sequencing of childhood cancer tumors, allowing NMTRC doctors to quickly identify appropriate treatments for young patients.

    You can now get pre-configured HPC systems that work with genomic software toolsets, which enables clinical and translational research centers like TGen to run large-scale sequencing projects. The ROI and price/performance are compelling for anyone running heavy genomic workloads. Essentially, with one rack of gear, any clinical lab now has all the compute power needed to process and analyze multiple genome sequences per day, which is a clinically relevant pace.

    Genomic medicine is here, and within a few years it will become standard care to sequence many diseases in order to determine the proper treatment. As the science advances, the HPC community will be ready to contribute to making this a reality. You can learn more here.

     

  • SDSC Transitions to Early Operations Stages of Comet

    by Tom Raisor

    The San Diego Supercomputer Center (SDSC) at the University of California, San Diego has transitioned into the early operations stages of its new Comet supercomputer. When it is fully operational, the new cluster will have an overall peak performance approaching two petaflops (a quick back-of-the-envelope check follows the hardware list below).

    Comet has been designed as a solution for the "long tail" of science, which refers to the significant amount of research that is computationally based but modest in size. Together, these projects represent a great amount of research and potential scientific impact. Much of this research is being conducted in disciplines that are new to high performance computing, such as economics, genomics, and the social sciences.

    The Comet cluster includes:

    • Intel Xeon® E5-2600 v3 family processors, with two processors per node and 12 cores per processor running at 2.5GHz.
    • 128 gigabytes (GB) of traditional DRAM and 320 GB of local flash memory on each compute node.
    • 27 racks of 72 nodes each (1,728 cores per rack), with a full bisection FDR InfiniBand interconnect from Mellanox and 4:1 over-subscription across the racks.
    • A total of 1,944 nodes, or 46,656 cores
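
    As a rough check on the "approaching two petaflops" figure, here is a minimal back-of-the-envelope sketch in Python. It assumes 16 double-precision floating-point operations per core per cycle (the AVX2 FMA rate commonly quoted for Haswell-generation Xeons), a figure that is not stated in the post itself.

        # Back-of-the-envelope peak performance for the Comet compute nodes.
        nodes = 1944                # 27 racks x 72 nodes
        cores_per_node = 24         # 2 sockets x 12 cores
        clock_hz = 2.5e9            # 2.5 GHz
        flops_per_cycle = 16        # assumed AVX2 FMA rate for Haswell-class Xeons

        peak_pflops = nodes * cores_per_node * clock_hz * flops_per_cycle / 1e15
        print(f"Peak: {peak_pflops:.2f} PFLOPS")   # ~1.87 PFLOPS, i.e. approaching two petaflops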

    You can learn more about Comet and its mission to serve the long tail of science here.

  • Unlocking the Value of Big Data

    Having the ability to quickly and effectively react to customer needs and market demands is invaluable to a business. Yet too many decision makers are stymied by a lack of useful insight into their data. However, agility and efficacy in analytics are possible. With the right mindset, tools, and technologies, organizations can become much more adroit about how they use the power of analytics to improve decision making.

    A recent survey indicated that an impressive 61% of organizations around the globe have data waiting to be processed. Unfortunately, a mere 39% felt they understood how to extract the value from that data.

    In order to unlock the value found in data, organizations must have:

    • The right analytics tools - Thanks to our ever more connected world, data miners in companies have access to greater amounts of data than ever before. That means your organization must be able to aggregate the various sources to produce a full understanding of what customers and market conditions are revealing.
    • Leadership dedicated to following the data - The point of analytics agility is to quickly alter your direction if a business decision proves flawed. Disagreeing with the data or hoping for different results isn't making the most of your data.
    • Empowered IT teams - IT teams that are free to continually and consistently collect and manage data can help guarantee that gathered data is properly aggregated and analyzed to provide a single, accurate version of what it is telling you.

    The analytics tools needed to drive fast and flexible business decisions are available. However, it also takes the right mindset for the power of analytics to improve decision making.

     You can read more about what IT decision makers are thinking about a variety of data-related topics here.

  • The Advantages of Using Intel Enterprise Edition for Lustre

    When it comes to processing big data, Hadoop has become the go-to platform. It allows vast amounts of data, especially unstructured or very diverse data, to be processed quickly. As the de facto open-source parallel file system for HPC environments, Lustre provides compute clusters with efficient storage and fast access to large data sets. Together, these technologies help to solve big data problems. However, the standard Hadoop storage layer, HDFS, presents some disadvantages, including a need for HTTP calls, added overhead, reduced efficiency, slower speeds, and a requirement for fairly large local storage on each Hadoop node.

    There is, however, a way to overcome those obstacles. Intel Enterprise Edition for Lustre (IEEL) includes a Hadoop software adapter that provides direct access to Lustre during MapReduce computations, improving performance.

    A presentation by J. Mario Gallegos at the recent LUG 15 conference highlighted some of the advantages gained, and some of the best practices to follow, when adding IEEL.

    Among the advantages observed:

    • Using Lustre is more efficient for accessing data - HDFS file transfers rely on the HTTP protocol, which results in higher overhead and slower access (a short sketch contrasting the two access patterns follows this list).
    • Centralized access from Lustre makes data available to all compute nodes - By eliminating transfers during the MapReduce “shuffle” phase, users gain better performance, such as higher job throughput.
    • Lustre allows convergence of HPC infrastructure with big data applications - The existing HPC cluster has limited storage on each compute node, a constraint a shared Lustre file system works around.
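
    To make the access-pattern argument concrete, here is a minimal, hypothetical sketch in Python (the hostname, port, and paths are invented for illustration). It contrasts fetching an intermediate file over HTTP, as the MapReduce shuffle does, with reading the same data directly from a shared Lustre mount that every compute node sees.

        import time
        import urllib.request

        # Hypothetical locations; adjust for a real cluster.
        HTTP_URL = "http://node01:8080/shuffle/part-00001"   # per-node HTTP fetch, shuffle-style
        LUSTRE_PATH = "/lustre/project/job42/part-00001"     # same data on a shared Lustre mount

        def fetch_over_http(url):
            """Pull the data over HTTP, paying connection and protocol overhead per transfer."""
            with urllib.request.urlopen(url) as resp:
                return resp.read()

        def read_from_lustre(path):
            """Read the data directly with POSIX I/O; every node sees the same shared file."""
            with open(path, "rb") as f:
                return f.read()

        def timed(label, fn, arg):
            start = time.perf_counter()
            data = fn(arg)
            print(f"{label}: {len(data)} bytes in {time.perf_counter() - start:.3f}s")

        timed("HTTP fetch", fetch_over_http, HTTP_URL)
        timed("Lustre read", read_from_lustre, LUSTRE_PATH)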

    You can read about Mario's other findings and see his LUG presentation here.

     

  • WRF benchmarking on a 4-node cluster with Intel Xeon Phi 7120P Coprocessors

    by Ashish Kumar Singh

    This blog explores the performance of the WRF (Weather Research and Forecasting) model on a cluster of PowerEdge R730 servers with Intel Xeon Phi 7120P coprocessors. All runs were carried out with Hyper-Threading (logical processors) disabled.

    The WRF (Weather Research and Forecasting) model is a next-generation mesoscale numerical weather prediction system designed to serve both atmospheric research and operational forecasting needs. The model serves a wide range of meteorological applications across scales from tens of meters to thousands of kilometers. WRF can generate atmospheric simulations based on real data (observations, analyses) or on idealized conditions.

    Test Cluster Configuration:

    The test cluster consisted of four PowerEdge R730 servers with two Intel Xeon Phi 7120P coprocessors each. Each PowerEdge R730 had two Intel Xeon E5-2695 v3 CPUs @ 2.3GHz and eight 16GB 2133MHz DIMMs, for a total of 128GB of memory. Each PowerEdge R730 also had one Mellanox FDR InfiniBand HCA in the low-profile x8 PCIe Gen3 slot (linked to CPU2).

                   Compute node configuration


    The BIOS options selected for this blog were as below:

    WRF performance analysis was run with the Conus-2.5km data set. Conus-2.5km is a single large domain covering the continental US at 2.5 km resolution; the benchmark runs the final 3-hour simulation for hours 3-6, starting from a provided restart file. It may also be run for the full 6 hours starting from a cold start.


    All the runs with the CPU plus Intel Xeon Phi configurations were performed in symmetric mode. For the single-node CPUs-only configuration, the average time was 7.425 seconds. On the single-node configuration with CPUs and two Intel Xeon Phis, the average time was 6.093 seconds, a 1.2-times improvement. With a two-node cluster of CPUs and Intel Xeon Phis, the average time was 2.309 seconds, an improvement of 3.2 times. For the four-node cluster of CPUs and Intel Xeon Phis, the performance improvement increased to 5.7 times.
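
    As a quick consistency check, the speedup figures above follow from the reported average times if performance is taken as the inverse of the average time; a minimal sketch (the four-node time is not listed in the post, so it is not recomputed here):

        # Speedups relative to the single-node CPUs-only run, from the reported average times.
        cpu_only = 7.425        # seconds, single node, CPUs only
        one_node_phi = 6.093    # seconds, single node, CPUs + two Xeon Phis
        two_node_phi = 2.309    # seconds, two nodes, CPUs + Xeon Phis

        print(f"1 node with Phi:  {cpu_only / one_node_phi:.1f}x")   # ~1.2x
        print(f"2 nodes with Phi: {cpu_only / two_node_phi:.1f}x")   # ~3.2x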

    The power consumption analysis for WRF with the Conus-2.5km benchmark is shown below. On a single node with the CPU-only configuration, the power consumption was 395.4 watts. With CPUs and one Intel Xeon Phi, power consumption was 526.3 watts, while with CPUs and two Intel Xeon Phis, it was 688.2 watts.

    The results showed that power consumption increased with the addition of Intel Xeon Phi coprocessors. However, they also showed an increase in performance per watt on the order of 2.6 times for the configuration with CPUs and two Intel Xeon Phis.

    Conclusion:

    The configuration of CPUs with Intel Xeon Phi 7120Ps showed sustained performance and power-efficiency gains in comparison to the CPUs-only configuration. With two Intel Xeon Phi 7120Ps, the WRF Conus-2.5km benchmark showed a 1.2-fold increase in performance on a single node, and performance per watt improved by more than 2.6 times, resulting in a powerful, easy-to-use, and energy-efficient HPC platform.

     

  • NAMD benchmarking on a 4-node cluster with Intel Xeon Phi 7120P Coprocessors

    by Ashish Kumar Singh

    This blog explores the application performance of NAMD (NAnoscale Molecular Dynamics) for large data sets on a cluster of PowerEdge R730 servers with Intel Xeon Phi 7120Ps. All runs were carried out with Hyper-Threading (logical processors) disabled, and the ibverbs version of NAMD was used throughout.

    Test Cluster Configuration:

    The test cluster consisted of four PowerEdge R730 servers with two Intel Xeon Phi 7120P coprocessors each. Each PowerEdge R730 had two Intel Xeon E5-2695 v3 CPUs @ 2.3GHz and eight 16GB 2133MHz DIMMs, for a total of 128GB of memory per server. Each PowerEdge R730 also had one Mellanox FDR InfiniBand HCA in the low-profile x8 PCIe Gen3 slot (linked to CPU2).

                     Compute node configuration

    The BIOS options selected for this blog are as below:

    NAMD (NAnoscale Molecular Dynamics) is a parallel, object-oriented simulation package written using the Charm++ parallel programming model and designed for high-performance simulation of large biomolecular systems. Charm++ simplifies parallel programming and also provides automatic load balancing, which is crucial to the performance of NAMD.

    All runs used the STMV (Satellite Tobacco Mosaic Virus) benchmark with the ibverbs version of NAMD; the performance analysis with the STMV benchmark is shown below. STMV is a small, icosahedral plant virus. On a single node, we observed a performance improvement of 2.5 times for the CPUs with Intel Xeon Phi configuration in comparison to the CPUs-only configuration.

     

    STMV showed performance of 0.2 ns/day with the CPUs-only configuration. With CPUs and two Intel Xeon Phis the performance was 0.5 ns/day, an increase of 2.5 times, while on a four-node cluster with CPUs and Intel Xeon Phi 7120Ps the performance increase was 8.5 times. Scaling from one node to four nodes thus resulted in almost 3.5 times scale-up.
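
    The scale-up claim can be checked from the ns/day figures above; here is a minimal sketch, with the four-node throughput inferred from the stated 8.5-times gain over the single-node CPUs-only run rather than taken from a published number:

        # NAMD STMV throughput, in ns/day (higher is better).
        cpu_only = 0.2                    # single node, CPUs only
        one_node_phi = 0.5                # single node, CPUs + two Xeon Phis
        four_node_phi = cpu_only * 8.5    # ~1.7 ns/day, inferred from the stated 8.5x gain

        print(f"Single-node gain: {one_node_phi / cpu_only:.1f}x")            # 2.5x
        print(f"1 -> 4 node scale-up: {four_node_phi / one_node_phi:.1f}x")   # ~3.4x, "almost 3.5 times"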

    The power analysis was done on a single node for the CPUs-only configuration, the CPUs with one Intel Xeon Phi 7120P configuration, and the CPUs with two Intel Xeon Phi 7120P configuration. With CPUs and two Intel Xeon Phis, the power consumption increased, but so did the performance per watt, which was 2.4 times that of the CPU-only configuration. The power-efficiency increase is shown in the picture below.

    Conclusion:

    With CPUs and two Intel Xeon Phi 7120Ps, the STMV benchmark demonstrated an increase of 2.5 times in performance and 2.4 times in power efficiency when compared to the CPUs-only configuration, resulting in a powerful and energy-efficient HPC platform.

     

  • LINPACK benchmarking on a 4-node cluster with Intel Xeon Phi 7120P Coprocessors


    This blog explores HPL (High Performance LINPACK) performance and power on an Intel Xeon Phi 7120P cluster built with current-generation PowerEdge R730 servers. All runs were carried out with Hyper-Threading (logical processors) disabled.

    Test Cluster Configuration:

    The test cluster consisted of four PowerEdge R730 servers with two Intel Xeon Phi 7120P coprocessors each. Each PowerEdge R730 had two Intel Xeon E5-2695 v3 CPUs @ 2.3GHz and eight 16GB 2133MHz DIMMs, for a total of 128GB of memory. Each PowerEdge R730 also had one Mellanox FDR InfiniBand HCA in the low-profile x8 PCIe Gen3 slot (linked to CPU2).

                                       Compute node configuration

    The BIOS options selected for this blog were as below:

    High Performance LINPACK is a benchmark that solves a (random) dense linear system in double-precision (64-bit) arithmetic on distributed-memory systems. HPL was run with a block size of NB=192 for the CPU-only configuration and NB=1280 for the Intel Xeon Phi (offload) runs, with problem sizes of N=118272 for single-node, N=172032 for two-node, and N=215040 for four-node cluster runs.
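
    The problem sizes are roughly what the usual HPL sizing rule of thumb suggests, in which the dense matrix occupies about 8*N^2 bytes and should fit within aggregate node memory with some headroom. A minimal sketch of that arithmetic follows; the rule of thumb is an assumption here rather than something stated in the post, and the offload runs are also constrained by coprocessor memory and tuning.

        # Approximate HPL matrix footprint: N x N double-precision elements, 8 bytes each.
        runs = {"1 node": (118272, 1), "2 nodes": (172032, 2), "4 nodes": (215040, 4)}

        for label, (n, node_count) in runs.items():
            total_gb = 8 * n * n / 1e9
            per_node = total_gb / node_count
            print(f"{label}: N={n}, ~{total_gb:.0f} GB total, ~{per_node:.0f} GB per 128 GB node")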

    Compared to the CPU-only configuration, the acceleration with the Intel Xeon Phi 7120Ps was about 3 times.

    On a single node with CPUs only, the PowerEdge R730 achieved 802.09 GFLOPS, while with two 7120Ps it achieved 2.553 TFLOPS, so the 7120Ps provide a 3.26X performance increase. Similarly, the two-node and four-node runs demonstrated a performance increase of 3.25X.

    The HPL power consumption analysis below compares the CPU-only, CPU with one Intel Xeon Phi, and CPU with two Intel Xeon Phi configurations.

    The power consumption of the single-node CPUs-only configuration was about 398.72 watts. With CPUs and two 7120Ps, it increased to 983.5 watts. So the CPUs-only configuration consumed less power than the system with Intel Xeon Phis, while the performance per watt of the Intel Xeon Phi configuration was 1.31 times that of the CPUs-only configuration.
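
    The performance-per-watt figure can be reproduced approximately from the single-node numbers above; a minimal sketch:

        # GFLOPS per watt from the single-node HPL results quoted above.
        cpu_only = 802.09 / 398.72        # ~2.0 GFLOPS/W, CPUs only
        with_two_phis = 2553.0 / 983.5    # ~2.6 GFLOPS/W, CPUs + two 7120Ps

        print(f"Efficiency gain: {with_two_phis / cpu_only:.2f}x")   # ~1.3x, in line with the 1.31x quoted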

    Conclusion:

    The Intel Xeon Phi 7120P showed sustained performance and power-efficiency gains in comparison to CPUs only. With two Intel Xeon Phi 7120Ps, the HPL benchmark showed a three-fold performance increase in comparison to CPUs only, and the performance per watt improved by roughly 1.3 times, resulting in a powerful and energy-efficient HPC platform.

  • LAMMPS benchmarking on a 4-node cluster with Intel Xeon Phi 7120P Coprocessors

    by Ashish Kumar Singh

    This blog explores the application performance of LAMMPS on a cluster of PowerEdge R730 servers with Intel Xeon Phi 7120Ps. All runs were carried out with Hyper-Threading (logical processors) disabled.

    LAMMPS (Large-scale Atomic/Molecular Massively Parallel Simulator) is a classical molecular dynamics code capable of simulating solid-state materials (metals, semiconductors), soft matter (biomolecules, polymers), and coarse-grained or mesoscopic systems. It can be used to model atoms, or more generically as a parallel particle simulator at the atomic, mesoscale, or continuum scale.

    Test Cluster Configuration:

    The test cluster consisted of four PowerEdge R730 servers with two Intel Xeon Phi 7120P coprocessors each. Each PowerEdge R730 had two Intel Xeon E5-2695 v3 CPUs @ 2.3GHz and eight 16GB 2133MHz DIMMs, for a total of 128GB of memory. Each PowerEdge R730 also had one Mellanox FDR InfiniBand HCA in the low-profile x8 PCIe Gen3 slot (linked to CPU2).

                            Compute node configuration

    The BIOS options selected for this blog were as below:


    LAMMPS was run with the rhodopsin benchmark. The rhodopsin benchmark simulates the movement of a protein in the retina that plays an important role in the perception of light. The protein sits in a solvated lipid bilayer and is simulated using the CHARMM force field with particle-particle particle-mesh long-range electrostatics and SHAKE constraints. The simulation was performed with 2,048,000 atoms at a temperature of 300K and a pressure of 1 atm.

    The results for a single node, two nodes, and four nodes are shown below. On one node with the CPU-only configuration, the loop time was 66.5 seconds, while the configuration with CPUs and two Intel Xeon Phi 7120Ps had a loop time of 34.8 seconds, a performance increase of 1.9X. Going from one node to four nodes, the CPUs plus coprocessors configuration showed a performance increase of 5.2X over CPUs only.

    The LAMMPS power consumption analysis with the rhodopsin benchmark is shown below. On a single node, the power consumption of the CPU-only configuration was 442.4 watts, while the configuration with CPUs and one coprocessor consumed around 423 watts, and the configuration with CPUs and two coprocessors consumed 450.8 watts.



    All the LAMMPS runs on coprocessors used the auto-balance mode. The performance per watt showed a two-fold increase with CPUs plus two coprocessors compared to CPUs only.
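
    Both the single-node speedup and the power-efficiency gain follow from the loop times and power draws reported above; a minimal sketch:

        # Single-node rhodopsin results quoted above (loop time in seconds, power in watts).
        cpu_time, cpu_watts = 66.5, 442.4      # CPUs only
        phi_time, phi_watts = 34.8, 450.8      # CPUs + two Xeon Phi 7120Ps

        speedup = cpu_time / phi_time                                          # ~1.9x
        perf_per_watt_gain = (cpu_time * cpu_watts) / (phi_time * phi_watts)   # ~1.9x, roughly the two-fold gain quoted
        print(f"Speedup: {speedup:.1f}x, perf/W gain: {perf_per_watt_gain:.1f}x")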

    Conclusion:

    The Intel Xeon Phi 7120P cluster built on Dell PowerEdge R730 servers showed a sustained two-fold performance increase, and power efficiency also increased by 2X with two Intel Xeon Phi 7120Ps in comparison to CPUs only, resulting in a powerful, energy-efficient HPC platform.