Researchers analyzing large genomic data sets face several challenges: the need for immense computational and storage infrastructure, the difficulty of deploying and managing such an infrastructure in-house with minimal IT experience, and privacy concerns around sensitive genomic data. The Active Infrastructure for HPC Life Sciences solution [1] was designed with these challenges in mind. Figure 1 depicts the server, storage, and networking components of the solution. At a high level, the solution comprises:

  • A 180 TB NFS storage solution
  • A 360 TB Lustre file system
  • 32 Dell PowerEdge M420 Blades and a Dell PowerEdge R820
  • Dell PowerEdge R420 head nodes and login nodes
  • Bright Cluster Manager cluster management software
  • Choice of FDR InfiniBand or 10GbE as the cluster interconnect

A technical whitepaper providing a detailed account of the what, why, and how of the solution is available at http://www.dellhpcsolutions.com/asset/174155510/95990/3710410/119266. The performance and energy efficiency of the solution were benchmarked by running whole-genome analysis pipelines, using the bcbio-nextgen pipeline framework [2].
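For readers curious about what such a run looks like in practice, below is a minimal sketch of driving bcbio-nextgen from Python. The sample files, genome build, core count, and configuration values shown are illustrative assumptions, not the settings used in the benchmark; the exact YAML format should be checked against the bcbio-nextgen documentation [2] for the version in use.

```python
# Minimal sketch of launching a bcbio-nextgen variant-calling run from Python.
# Paths, sample names, and core counts are placeholders; consult the
# bcbio-nextgen documentation [2] for the authoritative configuration format.
import subprocess
import textwrap
from pathlib import Path

# A small project configuration in the style of bcbio-nextgen's sample YAML.
# The keys follow the project's documented examples but may vary by version.
config = textwrap.dedent("""\
    details:
      - files: [/data/NA12878_1.fastq.gz, /data/NA12878_2.fastq.gz]
        description: NA12878
        analysis: variant2
        genome_build: GRCh37
        algorithm:
          aligner: bwa
          variantcaller: gatk
""")

config_path = Path("wgs_benchmark.yaml")
config_path.write_text(config)

# Launch the pipeline, distributing work across 32 cores (-n 32).
subprocess.run(["bcbio_nextgen.py", str(config_path), "-n", "32"], check=True)
```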


For benchmarking, 10x-coverage whole-genome sequencing data from the Illumina Platinum Genomes project [3] was used. The whitepaper also details the time the pipeline spends in each analysis step, such as alignment, alignment post-processing, variant calling, and variant post-processing. Rather than relying on traditional HPC metrics like GFLOPS, these tests aim to present metrics that are relevant to the life sciences domain: kilowatt-hours per genome and the number of genomes analyzed per day.
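As an illustration of how these two metrics can be derived from raw measurements, the sketch below computes both from a batch run's genome count, wall-clock time, and average power draw. The numeric inputs are hypothetical placeholders, not the measurements reported in the whitepaper.

```python
# Sketch of deriving the two domain-specific metrics from benchmark measurements.
# The input numbers below are illustrative only.

def genomes_per_day(genomes_completed: int, wall_clock_hours: float) -> float:
    """Throughput: genomes analyzed per 24-hour day."""
    return genomes_completed * 24.0 / wall_clock_hours

def kwh_per_genome(avg_power_kw: float, wall_clock_hours: float,
                   genomes_completed: int) -> float:
    """Energy efficiency: kilowatt-hours consumed per genome analyzed."""
    return avg_power_kw * wall_clock_hours / genomes_completed

if __name__ == "__main__":
    # Hypothetical run: 16 genomes finish in a 10.4-hour batch while the
    # cluster draws an average of 12 kW at the PDU (illustrative values).
    print(f"{genomes_per_day(16, 10.4):.1f} genomes/day")
    print(f"{kwh_per_genome(12.0, 10.4, 16):.1f} kWh/genome")
```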


From the results obtained, the solution is capable of analyzing 37 genomes per day while consuming ~7.5 kWh per genome.

Figure 1: Active Infrastructure for HPC Life Sciences

Going forward, we plan to stress the solution with higher-coverage datasets, experiment with the scaling behavior of various pipelines across different numbers of compute nodes, and tune Lustre file system performance.

References

  1. http://www.dell.com/learn/us/en/555/hpcc/high-performance-computing-life-sciences
  2. https://bcbio-nextgen.readthedocs.org/en/latest/index.html
  3. http://www.illumina.com/platinumgenomes/

You can learn more about Active Infrastructure for HPC Life Sciences by listening to a recent podcast from the Intel Chip Chat program featuring Glen Otero, Dell’s Life Sciences HPC Solution Architect: Revolutionizing Genomic Workloads with Dell – Intel® Chip Chat episode 258.