High Performance Computing Blogs

High Performance Computing
A discussion venue for all things high performance computing (HPC), supercomputing, and the technologies that enable scientific research and discovery.
  • Unlocking the Value of Big Data

    Having the ability to react quickly and effectively to customer needs and market demands is invaluable to a business. Yet too many decision makers are stymied by a lack of useful insight into their data. Agility and efficacy in analytics are possible, however. With the right mindset, tools and technologies, organizations can become much more adroit about how they use the power of analytics to improve decision making.

    A recent survey indicated that an impressive 61% of organizations around the globe have data waiting to be processed. Unfortunately, a mere 39% felt they understood how to extract value from that data.

    In order to unlock the value found in data, organizations must have:

    • The right analytics tools - thanks to our increasingly connected world, data miners in companies have access to greater amounts of data than ever before. That means your organization must be able to aggregate the various sources to produce a full understanding of what customers and market conditions are revealing.
    • Leadership dedicated to following the data - the point of analytics agility is to quickly alter your direction if your business decision is flawed. Disagreeing with the data or hoping for different results isn't making the most of your data.
    • Empowered IT teams - IT teams that are free to continually and consistently collect and manage data can help guarantee that gathered data is properly aggregated and analyzed to provide a single, correct version of what it is telling you.

    The analytics tools needed to drive fast and flexible business decisions are available. However, it also takes the right mindset to harness the power of analytics and improve decision making.

     You can read more about what IT decision makers are thinking about a variety of data-related topics here.

  • The Advantages of Using Intel Enterprise Edition for Lustre

    When it comes to processing big data, Hadoop has become the go-to platform. It allows vast amounts of data, especially unstructured or very diverse data, to be processed quickly. As the de facto open source parallel file system for HPC environments, Lustre provides compute clusters with efficient storage and fast access to large data sets. Together these technologies help to solve big data problems. However, the standard HDFS-based approach also presents some disadvantages, including a need for HTTP calls, added overhead, reduced efficiency, slower speed, and a requirement for fairly large local storage on each Hadoop node.

    There is, however, a way to overcome those obstacles. Through its Hadoop adapter, Intel Enterprise Edition for Lustre (IEEL) provides direct access to Lustre during MapReduce computations, improving performance.

    A presentation by J. Mario Gallegos at the recent LUG 15 conference highlighted some of the advantages gained and some of the best practices to follow when adding IEEL.

    Among the advantages observed:

    • Using Lustre is more efficient for accessing data - HDFS file transfers rely on the HTTP protocol, which results in higher overhead and slower access.
    • Centralized access from Lustre makes data available to all compute nodes - by eliminating transfers during the MapReduce “shuffle” phase, users gain better performance, such as higher job throughput.
    • Lustre allows convergence of HPC infrastructure with big data applications - the existing HPC cluster typically has limited local storage on each compute node, while Lustre provides the shared capacity that big data work needs.

    You can read about Mario's other findings and see his LUG presentation here.

     

  • WRF benchmarking on a 4-node cluster with Intel Xeon Phi 7120P Coprocessors

    by Ashish Kumar Singh

    This blog explores the performance analysis of the WRF (Weather Research and Forecasting) model on a cluster of PowerEdge R730 servers with Intel Xeon Phi 7120P coprocessors. All the runs were carried out with Hyper-Threading (logical processors) disabled.

    The WRF (Weather Research and Forecasting) model is a next-generation mesoscale numerical weather prediction system designed to serve both atmospheric research and operational forecasting needs. The model serves a wide range of meteorological applications across scales from tens of meters to thousands of kilometers. WRF can generate atmospheric simulations based on real data (observations, analyses) or idealized conditions.

    Test Cluster Configuration:

    The test cluster consisted of four PowerEdge R730 servers with two Intel Xeon Phi 7120P co-processors each. Each PowerEdge R730 had two Intel Xeon E5-2695 v3 CPUs @ 2.3GHz and eight 16GB 2133MHz DIMMs, for a total of 128GB of memory. Each PowerEdge R730 also had one Mellanox FDR InfiniBand HCA in the low-profile x8 PCIe Gen3 slot (linked to CPU2).

                   Compute node configuration


    The BIOS options selected for this blog were as below:

    WRF performance analysis was run with the Conus-2.5km data set. Conus-2.5km is a single-domain, large-size case: a 2.5 km resolution grid covering the continental US, running the final 3 hours of a simulation (hours 3-6) starting from a provided restart file. It may also be performed for the full 6 hours starting from a cold start.


    All the runs on the CPU with Intel Xeon Phi configuration were performed in symmetric mode. For the single-node CPUs-only configuration, the average time was 7.425 seconds. With CPUs and two Intel Xeon Phi coprocessors, the average time was 6.093 seconds, an improvement of 1.2 times. With a two-node cluster of CPUs and Intel Xeon Phi, the average time was 2.309 seconds, an improvement of 3.2 times. For a four-node cluster of CPUs and Intel Xeon Phi, the performance improvement increased to 5.7 times.
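
    These speedups follow directly from the average times reported above; a quick back-of-the-envelope check:

    \[
    \frac{7.425\ \text{s}}{6.093\ \text{s}} \approx 1.2\times \ (\text{1 node, CPUs + 2 Phi}),
    \qquad
    \frac{7.425\ \text{s}}{2.309\ \text{s}} \approx 3.2\times \ (\text{2 nodes})
    \]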

    The power consumption analysis for WRF with the Conus-2.5km benchmark is shown below. On a single node with the CPUs-only configuration, the power consumption was 395.4 watts. With CPUs and one Intel Xeon Phi, power consumption was 526.3 watts, while with CPUs and two Intel Xeon Phi coprocessors it was 688.2 watts.

    The results showed that power consumption rises with the addition of Intel Xeon Phi coprocessors. However, they also showed an increase in performance per watt, on the order of 2.6 times, for the CPUs with two Intel Xeon Phi configuration.

    Conclusion:

    The configuration of CPUs with Intel Xeon Phi 7120P coprocessors showed sustained performance and power-efficiency gains in comparison to the CPUs-only configuration. With two Intel Xeon Phi 7120Ps, the WRF Conus-2.5km benchmark showed a 1.2-fold performance increase, and performance per watt improved by more than 2.6 times, resulting in a powerful, easy-to-use and energy-efficient HPC platform.

     

  • NAMD benchmarking on a 4-node cluster with Intel Xeon Phi 7120P Coprocessors

    by Ashish Kumar Singh

    This blog explores the application performance of NAMD (NAnoscale Molecular Dynamics) for large data sets on a cluster of PowerEdge R730 servers with Intel Xeon Phi 7120Ps. All the runs were carried out with Hyper-Threading (logical processors) disabled. The ibverbs version of NAMD was used for all the runs.

    Test Cluster Configuration:

    The test cluster consisted of four PowerEdge R730 servers with two Intel Xeon Phi 7120P co-processors each. Each PowerEdge R730 had two Intel Xeon E5-2695 v3 CPUs @ 2.3GHz and eight 16GB 2133MHz DIMMs, for a total of 128GB of memory per server. Each PowerEdge R730 also had one Mellanox FDR InfiniBand HCA in the low-profile x8 PCIe Gen3 slot (linked to CPU2).

                     Compute node configuration

    The BIOS options selected for this blog are as below:

    NAMD (NAnoscale Molecular Dynamics) is a parallel, object-oriented simulation package written using the Charm++ parallel programming model, designed for high-performance simulation of large biomolecular systems. Charm++ simplifies parallel programming and also provides automatic load balancing, which is crucial to the performance of NAMD.

    All the runs with the STMV (virus) benchmark used the ibverbs version of NAMD. The performance analysis with the STMV benchmark is shown below. STMV (Satellite Tobacco Mosaic Virus) is a small, icosahedral plant virus. On a single node, we observed a performance improvement of 2.5 times for the CPUs with Intel Xeon Phi configuration in comparison to the CPUs-only configuration.

     

    STMV showed a performance of 0.2 ns/day with the CPUs-only configuration. With CPUs and two Intel Xeon Phi coprocessors, performance was 0.5 ns/day, a 2.5-times increase. On a four-node cluster with CPUs and Intel Xeon Phi 7120Ps, the performance increase was 8.5 times. Scaling from one node to four nodes thus gave an almost 3.5-times scale-up.
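
    These figures are mutually consistent; a quick check using the numbers above:

    \[
    \frac{0.5\ \text{ns/day}}{0.2\ \text{ns/day}} = 2.5\times \ (\text{single node}),
    \qquad
    \frac{8.5}{2.5} = 3.4 \approx 3.5\times \ (\text{scale-up from one to four nodes})
    \]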

    The power analysis was done on a single node for the CPUs-only configuration, the CPUs with one Intel Xeon Phi 7120P configuration, and the CPUs with two Intel Xeon Phi 7120P configuration. With CPUs and two Intel Xeon Phi coprocessors, the power consumption increased, but so did the performance per watt, which was 2.4 times that of the CPUs-only configuration. The power-efficiency increase is shown in the picture below.

    Conclusion:

    With CPUs and two Intel Xeon Phi 7120Ps, the STMV benchmark demonstrated a 2.5-times increase in performance and a 2.4-times increase in power efficiency when compared to the CPUs-only configuration, resulting in a powerful and energy-efficient HPC platform.

     

  • LINPACK benchmarking on a 4-node cluster with Intel Xeon Phi 7120P Coprocessors


    This blog explores HPL (High Performance LINPACK) performance and power on an Intel Xeon Phi 7120P cluster built from current-generation PowerEdge R730 servers. All the runs were carried out with Hyper-Threading (logical processors) disabled.

    Test Cluster Configuration:

    The test cluster consisted of four PowerEdge R730 servers with two Intel Xeon Phi 7120P co-processors each. Each PowerEdge R730 had two Intel Xeon E5-2695 v3 CPUs @ 2.3GHz and eight 16GB 2133MHz DIMMs, for a total of 128GB of memory. Each PowerEdge R730 also had one Mellanox FDR InfiniBand HCA in the low-profile x8 PCIe Gen3 slot (linked to CPU2).

                                       Compute node configuration

    The BIOS options selected for this blog were as below:

    High Performance LINPACK is a benchmark that solves a (random) dense linear system in double-precision (64-bit) arithmetic on distributed-memory systems. HPL was run with a block size of NB=192 for the CPU-only runs and NB=1280 for the Intel Xeon Phi (offload) runs, with problem sizes of N=118272 for the single-node run, N=172032 for the two-node run, and N=215040 for the four-node cluster run (all at NB=1280).
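
    For readers who want to set up a similar run, N and NB are specified in HPL's standard HPL.dat input file. The excerpt below is only a sketch using the single-node offload values quoted above; the process grid (P x Q) and the remaining lines are illustrative assumptions, not the exact file used for these results.

        HPLinpack benchmark input file
        Innovative Computing Laboratory, University of Tennessee
        HPL.out      output file name (if any)
        6            device out (6=stdout,7=stderr,file)
        1            # of problems sizes (N)
        118272       Ns          <- problem size for the single-node offload run
        1            # of NBs
        1280         NBs         <- block size used with Intel Xeon Phi (offload)
        0            PMAP process mapping (0=Row-,1=Column-major)
        1            # of process grids (P x Q)
        2            Ps          <- illustrative grid; choose P x Q to match the MPI rank count
        1            Qs
        16.0         threshold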

    Compared to the CPUs-only configuration, the acceleration with the Intel Xeon Phi 7120Ps was about 3 times.

    On a single node with CPUs only, the PowerEdge R730 achieved 802.09 GFLOPS, while with two 7120Ps it reached 2.553 TFLOPS, a 3.26X performance increase. Similarly, the two-node and four-node runs demonstrated performance increases of 3.25X.

    The HPL power consumption analysis below compares the CPUs-only, CPUs with one Intel Xeon Phi, and CPUs with two Intel Xeon Phi configurations.

    The power consumption of the single-node CPUs-only configuration was about 398.72 watts. With CPUs and two 7120Ps, it increased to 983.5 watts. So the CPUs-only configuration drew less power than the system with Intel Xeon Phi coprocessors, while the performance per watt of the configuration with Intel Xeon Phi was 1.31 times that of the CPUs-only configuration.
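
    Working from the numbers above, the performance-per-watt comparison is straightforward:

    \[
    \frac{802.09\ \text{GFLOPS}}{398.72\ \text{W}} \approx 2.0\ \text{GFLOPS/W (CPUs only)},
    \qquad
    \frac{2553\ \text{GFLOPS}}{983.5\ \text{W}} \approx 2.6\ \text{GFLOPS/W (CPUs + two 7120Ps)}
    \]

    The ratio of the two, roughly 1.3, is consistent with the 1.31-times figure quoted above.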

    Conclusion:

    The Intel Xeon Phi 7120P showed sustained performance and power-efficiency gains in comparison to CPUs only. With two Intel Xeon Phi 7120Ps, the HPL benchmark showed a three-fold performance increase in comparison to CPUs only, and the performance per watt improved by roughly 1.3 times, resulting in a powerful and energy-efficient HPC platform.

  • LAMMPS benchmarking on a 4-node cluster with Intel Xeon Phi 7120P Coprocessors

    by Ashish Kumar Singh

    This blog explores the application performance of LAMMPS on a cluster of PowerEdge R730 servers with Intel Xeon Phi 7120Ps. All the runs were carried out with Hyper-Threading (logical processors) disabled.

    LAMMPS (Large-scale Atomic/Molecular Massively Parallel Simulator) is a classical molecular dynamics code capable of simulating solid-state materials (metals, semiconductors), soft matter (biomolecules, polymers) and coarse-grained or mesoscopic systems. It can be used to model atoms, or more generically as a parallel particle simulator, at the atomic, meso or continuum scale.

    Test Cluster Configuration:

    The test cluster consisted of four PowerEdge R730 servers with two Intel Xeon Phi 7120P co-processors each. Each PowerEdge R730 had two Intel Xeon E5-2695 v3 CPUs @ 2.3GHz and eight 16GB 2133MHz DIMMs, for a total of 128GB of memory. Each PowerEdge R730 also had one Mellanox FDR InfiniBand HCA in the low-profile x8 PCIe Gen3 slot (linked to CPU2).

                            Compute node configuration

    The BIOS options selected for this blog were as below:


    LAMMPS was run with the Rhodopsin benchmark. The Rhodopsin benchmark simulates the movement of a protein in the retina, which in turn plays an important role in the perception of light. The protein sits in a solvated lipid bilayer and is modeled using the CHARMM force field with particle-particle particle-mesh long-range electrostatics and SHAKE constraints. The simulation was performed with 2,048,000 atoms at a temperature of 300K and a pressure of 1 atm. The results for one node, two nodes and four nodes are shown below. On one node with the CPUs-only configuration, the loop time was 66.5 seconds, while the configuration of CPUs and two Intel Xeon Phi 7120Ps had a loop time of 34.8 seconds, a performance increase of 1.9X. In comparison to the single-node CPUs-only run, scaling the CPUs + co-processors configuration from one node to four nodes showed a performance increase of 5.2X.

    The LAMMPS power consumption analysis with the Rhodopsin benchmark is shown below. On a single node, the power consumption of the CPUs-only configuration was 442.4 watts, while the configuration with CPUs and one co-processor consumed around 423 watts, and the configuration with CPUs and two co-processors consumed 450.8 watts.



    All the LAMMPS runs on co-processors used the auto-balance mode. The performance per watt showed a two-fold increase with CPUs + two co-processors compared to CPUs only.
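
    Both figures follow from the loop times and power numbers reported above:

    \[
    \text{speedup} = \frac{66.5\ \text{s}}{34.8\ \text{s}} \approx 1.9\times,
    \qquad
    \frac{\text{perf/W (CPUs + 2 co-processors)}}{\text{perf/W (CPUs only)}} = \frac{66.5 \times 442.4}{34.8 \times 450.8} \approx 1.9
    \]

    which is roughly the two-fold power-efficiency gain quoted in the conclusion below.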

    Conclusion:

    The Intel Xeon Phi 7120P cluster built on Dell PowerEdge R730 servers showed a sustained performance increase of about two fold. Power efficiency also increased by about 2X with two Intel Xeon Phi 7120Ps in comparison to CPUs only, resulting in a powerful, energy-efficient HPC platform.

     

  • The Student Cluster Competition - An Integral Part of SC15

    One of the highlights of the Supercomputing Conference is always the Student Cluster Competition. It's an opportunity to see some of the future superstars of our industry demonstrate their skills under some pretty intense (but fun) pressure!

    The student cluster competition also provides the competitors with a variety of opportunities. For some undergrads it is their first chance to focus exclusively on HPC. For other students the competition affords them the opportunity to make important networking and mentoring connections.

    College teams from around the world are already gearing up for the Student Cluster Competition at SC15 this November in Austin. The hometown favorites are no doubt feeling the pressure to four-peat! It's an exciting time for everyone involved - especially the students.

    Threepeat winners from SC14 in New Orleans.

    You can learn more about the Student Cluster Competition and its positive impact on the lives of participants in this video.

  • Using HPC to Improve Personalized Medicine

    Medical professionals and patient advocates agree genomics, as part of a personalized medical plan, can garner the best results for patients. However, there remain several challenges that prevent organizations from adopting the necessary technologies. Among those challenges:

    • Governmental Approval and Compliance - Whether it's FDA approval or Clinical Laboratory Improvement Amendments (CLIA) compliance, the government has daunting safety checks that must be followed for devices used to treat and diagnose diseases, as well as certifications that must be gained when working with lab instruments, appliances, and technology used to facilitate patient health.
    • Data Management - Large volumes of data must be managed in a genomics solution. A single genome is approximately 200GB to 300GB, and a single person has approximately 3 billion nucleotide bases. That creates a very large data file that must be accessed, stored and acted upon for each patient.
    • Perception of Needs - Many people believe that a genome analysis can only be done on hundreds of nodes or an expensive supercomputer.

    However, these challenges can be met. Working with an experienced vendor services team can help navigate governmental regulations. They're also able to help with the integration, storage and analysis of large amounts of data. And by starting small and growing with your organization's needs, servers and storage can be added incrementally.

    By recognizing that the perceived obstacles can be overcome, healthcare organizations can begin to harness the answers offered by genomics, leading to better patient outcomes.

    You can read more about the challenges and solutions, as well as access a white paper on the topic, here.

  • Calit2 is Using HPC to Unlock the Secrets of Microorganisms

    Over the past several years, Larry Smarr, director of the California Institute for Telecommunications and Information Technology (Calit2) at the University of California San Diego, has been exploring how advanced data analytics tools can find patterns in microbial distribution data. These discoveries can lead to newer and better clinical applications. However, with 100 trillion microorganisms in the human body, more than 10 times the number of human cells, carrying 100 times as many genes as the human genome, microbes are the epitome of big data! To unlock their secrets, Dr. Smarr employs high performance computing with impressive results.

    His interest in this topic arose from a desire to improve his own health. To obtain the data he needed, he began using himself as a "guinea pig." Working with his doctors, and quantifying a wide range of his personal microbial data, Smarr was finally able to obtain a diagnosis of Crohn's disease, which several traditional examinations had dismissed.

    From this discovery, Dr. Smarr is now working to unlock what role microorganisms may play in other intestinal diseases.

    You can hear his presentation on this fascinating topic at insideHPC.

     

  • The Elephant in the Room

    by Quy Ta

    This blog will explore a hybrid computing environment that takes Lustre®, a high-performance parallel file system, and integrates it with Hadoop®, a framework for processing and storing big data in a distributed environment. We will explore some reasons for and benefits of such a hybrid approach, and provide a foundation for how to easily and quickly implement the solution using Bright Cluster Manager® (BCM) to deploy and configure the hybrid cluster.

    First, let’s establish some definitions and technologies for our discussion. Hadoop is a software framework for distributed storage and processing of typically very large data sets on compute clusters. The Lustre file system is a parallel distributed file system that is often the choice for large scale computing clusters. In the context of this blog, we define a hybrid cluster as taking a traditional HPC cluster and integrating a Hadoop computing environment capable of processing MapReduce jobs using the Lustre File System. The hybrid solution that we will use as an example in this blog was jointly developed and consists of components from Dell, Intel, Cloudera and Bright Computing.

    Why would you want to use the Lustre file system with Hadoop? Why not just use the native Hadoop file system, HDFS? Scientists and researchers have been looking for ways to use both Lustre and Hadoop from within a shared HPC infrastructure. This hybrid approach allows them to use Lustre both as the file system for Hadoop analytics work and as the file system for their general HPC workloads. They can also avoid standing up two different clusters (HPC and Hadoop), and the associated resources required, by re-provisioning existing HPC cluster resources into a small to medium sized, self-contained Hadoop cluster. This solution would typically target HPC users who need to run periodic Hadoop-specific jobs.

    A key component in connecting the Hadoop and Lustre ecosystems is the Intel Hadoop Adapter for Lustre plug-in, or Intel HAL for short. Intel HAL is bundled with the Intel Enterprise Edition for Lustre software. It allows users to run MapReduce jobs directly on a Lustre file system. The immediate benefit is that Lustre delivers faster, stable and easily managed storage directly to the MapReduce applications. A potential long-term benefit of using Lustre as the underlying Hadoop storage is higher usable capacity compared to HDFS, which replicates data three times by default, as well as the performance benefits of running Lustre over InfiniBand. The following architectural diagram illustrates a typical topology for the hybrid solution.


    The following is a high-level recount of how we implemented the solution using the BCM tool to deploy and configure the hybrid cluster.

    The first thing we did was to establish an optimized and fully functional Lustre environment. For this solution, we used the Dell Storage for HPC with Intel Enterprise Edition (EE) for Lustre software as the Lustre solution, Cloudera CDH as the Hadoop distribution and Bright Cluster Manager (BCM) as the imaging and cluster deployment tool.

    Using the Intel Manager for Lustre (IML) GUI, we verified that the MDT and OST objects were healthy and in an optimal state. We also verified that the LNet interface and the Lustre kernel modules were loaded and the Lustre NIDs were accessible.

    Verify that the contents of /etc/modprobe.d/iml_lnet_module_parameters.conf are correct on each MDS and OSS server. An example is shown below.

    [root@boulder_mds1 ~]# cat /etc/modprobe.d/iml_lnet_module_parameters.conf
    # This file is auto-generated for Lustre NID configuration by IML
    # Do not overwrite this file or edit its contents directly
    options lnet networks=o2ib0(ib0)
    ### LNet Configuration Data
    ##  {
    ##    "state": "lnet_unloaded",
    ##    "modprobe_entries": [
    ##      "o2ib0(ib0)"
    ##    ],
    ##    "network_interfaces": [
    ##      [
    ##        "10.149.255.250",
    ##        "o2ib",
    ##        0
    ##      ]
    ##    ]
    ##  }
    [root@boulder_mds1 ~]#
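
    As a quick additional sanity check (not part of the original walkthrough), LNet connectivity can be verified with lctl; the NID shown is the one from the example configuration above:

        # List the local NIDs on each server and client
        lctl list_nids
        # From a client, ping a server NID over the o2ib network
        lctl ping 10.149.255.250@o2ib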

    Using the IML GUI, verify the status of the MDT and OST objects. There should be no file system alerts, and all MDT and OST objects should have green status.

    Configuration > File Systems > Current File Systems > “lustrefs” 


    Verify that UIDs and GIDs are consistent on Lustre clients. This must be done before installing Hadoop software. In particular, the following users and groups should be checked:

    users: hdfs, mapred, yarn, hbase, zookeeper

    groups: hadoop, zookeeper, hbase
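
    One simple way to spot-check this consistency, assuming pdsh is available and the Lustre clients are named node01-node04 (adjust the host list to your environment), is to compare the account entries across all nodes:

        # UIDs/GIDs must match on every node that will mount Lustre and run Hadoop
        for u in hdfs mapred yarn hbase zookeeper; do
            pdsh -w node[01-04] "getent passwd $u" | sort
        done
        for g in hadoop zookeeper hbase; do
            pdsh -w node[01-04] "getent group $g" | sort
        done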

    We used the following script to set up our Hadoop users prior to installing Hadoop:

    # Create the Hadoop service groups and users with matching, fixed UIDs/GIDs
    VALUE=10000;
    for i in hive hbase hdfs mapred yarn;
    do
        VALUE=$(expr $VALUE + 1);
        groupadd -g $VALUE $i;
        adduser -u $VALUE -g $VALUE $i;
    done;
    # Add the processing accounts to a common 'hadoop' group
    groupadd -g 10006 hadoop;
    groupmems -g hadoop -a yarn;
    groupmems -g hadoop -a mapred;
    groupmems -g hadoop -a hdfs;
    # Set the expected home directories and shells for the service accounts
    usermod -d /var/lib/hive -s /sbin/nologin hive;
    usermod -d /var/run/hbase -s /sbin/nologin hbase;
    usermod -d /var/lib/hadoop-yarn -s /sbin/nologin yarn;
    usermod -d /var/lib/hadoop-mapreduce -s /sbin/nologin mapred;
    usermod -d /var/lib/hadoop-hdfs -s /bin/bash hdfs

    As a sanity check, we verified that the nodes we wanted to re-provision as Hadoop nodes were able to read from and write to the Lustre file system.
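
    A minimal version of that check, assuming the Lustre file system is mounted at /mnt/lustre as in the examples below, might look like this:

        # Confirm the mount and available space on the Lustre file system
        lfs df -h /mnt/lustre
        # Write a small test file, read it back, then clean up
        dd if=/dev/zero of=/mnt/lustre/write_test bs=1M count=100
        md5sum /mnt/lustre/write_test
        rm -f /mnt/lustre/write_test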

    Once we verified all the pre-requisite items above and established that we had a working Lustre environment, we proceeded with the following steps to build, configure and deploy the Hadoop nodes that mount and use the Lustre file system.

    Steps we took to build the hybrid solution:

    1) Created a software image ‘ieel-hadoop’ using BCM (an existing software image can be cloned).

    2) Created a node category ‘ieel-hadoop’ using BCM (an existing node category can be cloned).

    3) Assigned the nodes selected to be provisioned as Hadoop nodes to the ieel-hadoop node category.

    4) Installed Cloudera CDH 5.1.2 and the Intel Hadoop Adapter for Lustre (HAL) plug-in into the ieel-hadoop software image.

    5) Installed the Intel EE for Lustre client software into the ieel-hadoop software image.

    6) Prepared the Lustre directory for Hadoop on the ieel-hadoop software image.

    Example:

    #chmod 0777 /mnt/lustre/hadoop
    #setfacl -R -m group:hadoop:rwx /mnt/lustre/hadoop
    #setfacl -R -d -m group:hadoop:rwx /mnt/lustre/hadoop

    7) Added the Lustre file system mount point to the ieel-hadoop node category for automatic mounting at boot.

    Example: 192.168.4.140@tcp0:192.168.4.141@tcp0:/lustre /mnt/lustre   lustre defaults,_netdev 0 0

    8) Added several Lustre client tuning parameters to the ieel-hadoop software image.

    Example:

        lctl set_param osc.*.max_dirty_mb=512
        lctl set_param osc.*.max_rpcs_in_flight=32
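
    Note that lctl set_param changes do not persist across a client reboot. One way to make them persistent (an assumption on our part, not part of the original walkthrough) is to set them once on the MGS with lctl conf_param, using the file system name “lustrefs” shown earlier:

        # Run on the MGS; pushed to all clients of the lustrefs file system
        lctl conf_param lustrefs.osc.max_dirty_mb=512
        lctl conf_param lustrefs.osc.max_rpcs_in_flight=32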

    To further optimize the solution, you can edit the core-site.xml and mapred-site.xml with the following Hadoop configuration for Lustre. 

    • core-site.xml


    •  mapred-site.xml
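
    As a purely illustrative sketch (the property names below are generic Hadoop settings and the paths are placeholders, not the exact configuration used in this solution, which is defined by the Intel HAL documentation for your IEEL release), core-site.xml is where Hadoop's working file system would be pointed at the shared Lustre mount:

        <configuration>
          <!-- Placeholder values: point Hadoop's default file system at the shared Lustre mount -->
          <property>
            <name>fs.defaultFS</name>
            <value>file:///mnt/lustre/hadoop</value>
          </property>
          <property>
            <name>hadoop.tmp.dir</name>
            <value>/mnt/lustre/hadoop/tmp</value>
          </property>
        </configuration>

    mapred-site.xml would similarly carry the MapReduce settings (for example, mapreduce.framework.name) plus any HAL-specific properties required by the plug-in.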