Dell Community
High Performance Computing Blogs

## High Performance Computing

A discussion venue for all things high performance computing (HPC), supercomputing, and the technologies that enables scientific research and discovery.

### Life Sciences

Sort by: | |

#### How much data do we need to use a ML algorithm?

Although this is the most common question, it is hard to answer since the amount of data mainly depends on how complex the learning concept is. In Machine Learning (ML), the learning complexity can be broken down into informational and computational complexities. Further, informational complexity considers two aspects, how many training examples are needed (sample complexity) and how fast a learner/model’s estimate can converge to the true population parameters (rate of convergence). Computational complexity refers the types of algorithms and the computational resources to extract the learner/model’s prediction within a reasonable time. As you can guess now, this blog will cover informational complexity to answer the question.

#### Learn from an example – ‘To be or Not to be banana’

Let’s try to learn what banana is. In this example, banana is the learning concept (one hypothesis, that is ‘to be’ or ‘not to be banana’), and the various descriptions associated with banana can be features such as colors and shapes. Unlike the way human can process the concept of banana – the human does not require non-banana information to classify a banana, typical machine learning algorithm requires counter-examples. Although there is One Class Classification (OCC) which has been widely used for outlier or anomaly detection, this is harder than the problem of conventional binary/multi-class classification.

Let’s place another concept ‘Apple’ into this example and make this practice as a binary-class classification. By doing this, we just made the learning concept simpler, ‘to be banana = not apple’ and ‘not to be banana = apple’. This is little counter-intuitive since adding an additional learning concept into a model makes the model simpler: however, OCC basically refers one versus all others, and the number of all other cases are pretty much infinite. This is where we are in terms of ML; one of the simplest learning activities for human is the most difficult problem to solve in ML. Before generating some data for banana, we need to define some terms.

• Instances[ii] X describes banana with features such as color (f1 = yellow, green or red, |f1|=3), shape (f2 = cylinder or sphere, |f2|=2) and class label (C → {banana, apple}, |C|=2). These values for color and shape need to be enumerated. For examples, we can assign integers to each value like (Yellow=1, Green=2, Red=3), (Cylinder=1, Sphere=2) and (banana=0, apple=1) (Table 1)
• The target function t generates a prediction for ‘is this banana or apple’ as a number ranging between 0 ≤ t(xi) ≤ 1. Typically, we want to have a prediction, t(xi) as close as c(xi), 0 ≤ in, where n is the total number of samples.
• The hypothesis space H can be defined as the conjunction of features and target function h(xi) = (f1i, f2i, t(xi)).
• Training examples S must contain roughly the same number of banana (0) and apple (1) examples. A sample is described as s(xi) = (f1i, f2i, c(xi))

#### Sample complexity – estimate the size of training data set in a quick and dirty way

Ideally, we want to have all the instances in the training sample set S covering all the possible combinations of features with respect to t as you can see in Table 1. There are three possible values for f1 and two possible values for f2. Also, there are two classes in this example. Therefore, the number of all the possible instances |X| = |f1| x |f2| x |C| = 3 x 2 x 2 = 12. However, f2 is a lucky feature[iii] that is mutually exclusive between banana and apple. Hence, |f2| is considered as 1 in this case. In addition to that, we can subtract one case because there is no red banana. For this example, only 5 instances can exhaust the entire sample space H. In general, the number of features (columns) in a data set is exponentially proportional to the required number of training samples (|S| = n, where n is the unique number of samples in the set). If we assume that all features are binary like a simple value of yes or no, then |X| = 2 x 2 x 2 = 23. Two to the power of the number of columns is the minimum n in the simplest case. This example only works when the values in all the features are discrete values. If we use the gradient color values for Color (RGB 256 color pallet ranges from 0 to 16777215 in decimal), the required number of training samples will increase quite significantly because now you need to multiply 16777216 for f1 if all the possible colors exist in H.

It is worth noting that the number of instances we calculate here does not always guarantee that a learner/model can converge properly. If you have the number of data equal or below this number, the amount of data is simply too small for the most of algorithm except a ML algorithm evaluating one feature at a time such as a decision tree. As a rough rule of thumb, many statisticians say that a sample size of 30 is large enough. This rule can be applied for a regression based ML algorithm that assumes one smooth linear decision boundary. Although an optimal n could be different on a case-by-case basis, it is not a bad idea to start from the total number of samples of N = |X| x 30.

Table 1 Training sample set S

 Instances f1 Color (Yellow=1, Green=2, Red=3) f2 Shape (Cylinder=1, Sphere=2) Ci Class = Banana (0) or Apple (1) x1, Banana 1 1 1 0 x2, Banana 2 2 1 0 x3, Apple 1 1[iv] 2 1 x4, Apple 2 2 2 1 x5, Apple 3 3 2 1

#### Learning curve – an accurate way to estimate the size of training data set

In ML, learning curve refers a plot of the classification accuracy against the training set size. This is not an estimation method; it requires building a classification model multiple times with the different size of training data set. This is a good technique for sanity-checks (underfitting – high bias and overfitting – high variance) for a model. It also can be utilized to optimize/improve the performance.

Figure 1 shows two examples of learning curves that represent underfitting (left-side plot) and overfitting (right-side plot). Basically, these are the plots of the training and cross-validation error (root mean square error in this example) when the size of training set increases. For underfitting case (left-side plot), both the training and cross-validation errors are very high, and the increment of training set size does not help. This indicates that the features in the data set are not quite relevant to the target learning concept. The examples are confusing the model. On the other hand, the right-side plot shows an overfitting case. The validation error is much higher than the training error in this case. The model trains on the identical samples repeatedly, and the training error climbs continuously while the cross-validation error continue to decrease. The performance for an overfitted model usually looks good; however, it will fail miserably for the real-life data.

Back in the day, not a single ML paper was accepted without a learning curve. Without this simple plot, the entire performance claim will be unverifiable.

#### Resources

Internal web page

External web page

Contacts
Americas
Kihoon Yoon
Sr. Principal Systems Dev Eng
Kihoon.Yoon@dell.com
+1 512 728 4191

Or attributes. A feature is an individual measurable property or characteristic of a phenomenon being observed.

[ii] Instance is also referred as example or sample

[iii] If a feature like f2 exists in a data set, we could make 100% accurate prediction simply by looking at f2.

[iv] Is there yellow apple? Yes, Golden Delicious is yellow.

### Overview

We published the whitepaper, “Dell EMC PowerEdge R940 makes De Novo Aseembly easier”, last year to study the behavior of SOAPdenovo2 [1]. However, the whitepaper is limited to one De Novo assembly application. Hence, we want to expand our application coverage little further. We decided to test SPAdes (2012) since it is a relatively new application and reported for some improvement on the Euler-Velvet-SC assembler (2011) and SOAPdenovo. SPAdes is also based on de Bruijn graph algorithm like most of the assemblers targeting Next Generation Sequencing (NGS) data. De Bruijin graph-based assemblers would be more appropriate for larger datasets having more than a hundred-millions of short reads.

As shown in Figure 1, Greedy-Extension and overlap-layout-consensus (OLC) approaches were used in the very early next gen assemblers [2]. Greedy-Extension’s heuristic is that the highest scoring alignment takes on another read with the highest score. However, this approach is vulnerable to imperfect overlaps and multiple matches among the reads and leads to an incomplete assembly or an arrested assembly. OLC approach works better for long reads such as Sanger or other technology generating more than 100bp due to minimum overlap threshold (454, Ion Torrent, PacBio, and so on). De Bruijin graph-based assemblers are more suitable for short read sequencing technologies such as Illumina. The approach breaks the sequencing reads into successive k-mers, and the graph maps the k-mers. Each k-mer forms a node, and edges are drawn between each k-mer in a read.

Figure 1 Overview of de novo short reads assemblers. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3056720/

SPAdes is a relatively recent application based on de Bruijn graph for both single-cell and multicell data. It improves on the recently released Euler Velvet Single Cell (E +V- SC) assembler (specialized for single-cell data) and on popular assemblers Velvet and SoapDeNovo (for multicell data).

All tests were performed on Dell EMC PowerEdge R940 configured as shown in Table 1. The total number of cores available in the system is 96, and the total amount of memory is 1.5TB.

 Dell EMC PowerEdge R940 CPU 4x Intel® Xeon® Platinum 8168 CPU, 24c @ 2.70GHz (Skylake) RAM 48x 32GB @2666 MHz OS RHEL 7.4 Kernel 3.10.0-693.el7.x86_64 Local Storage 12x 1.2TB 10K RPM SAS 12Gbps 512n 2.5in Hot-plug Hard Drive in RAID 0 Interconnect Intel® Omni-Path BIOS System Profile Performance Optimized Logical Processor Disabled Virtualization Technology Disabled SPAdes Version 3.10.1 Python Version 2.7.13

The data used for the tests is a paired-end read, ERR318658 which can be downloaded from European Nucleotide Archive (ENA). The read generated from blood sample as a control to identify somatic alterations in the primary and metastatic colorectal tumors. This data contains 3.2 Billion Reads (BR) with the read length of 101 nucleotides.

### Performance Evaluation

SPAdes runs three sets of de Bruijn graphs with 21-mer, 33-mer, and 55-mer consecutively. This is the main difference with regards to SOAPdenovo2 which run a single k-mer, either 63-mer or 127-mer.

In Figure 2, the runtimes, wall-clock times, are plotted in days (blue bars) with various number of cores, 28, 46, and 92 cores. Since we do not want to use the entire cores of each socket, 92 cores were picked as the maximum number of cores for the system. One core per socket was reserved for OS and other maintenance processes. Subsequent tests were done by reducing the number of cores in half. Peak memory consumptions for each case is plotted as a line graph. SPAdes runs significantly longer than SOAPdenovo2 due to the multiple iterations on three different k-mers.

The peak memory consumption is very similar to SOAPdenovo2. Both applications require slightly less than 800GB memory to process 3.2 BR.

### Conclusion

Utilizing more cores helps to reduce the runtime of SPAdes significantly as shown in Figure 2. For SPAdes, it is recommendable to use the highest core count CPUs like Intel Xeon Plantinum 8180 processor with 28 cores and 3.80GHz to bring down the runtime further.

### Resources

Internal web page

External web page

Contacts
Americas
Kihoon Yoon
Sr. Principal Systems Dev Eng
Kihoon.Yoon@dell.com
+1 512 728 4191

It refers an earlier version of SOAPdenovo, not SOAPdenovo2.

## Deferentially Expressed Gene (DEG) Analysis with Tuxedo Pipeline

### Overview

Gene expression analysis is as important as identifying Single Nucleotide Polymorphism (SNP), InDel or chromosomal restructuring. Eventually, the entire physiological and biochemical events depend on the final gene expression products, proteins. Many quantitative scientists, non-biologists tend to oversimplify the flow of genetic information and forget about what the actual roles of proteins are. Simplification is the beginning of most science fields; however, it is too optimistic to think that this practice also works for biology. Although all the human organs contain the identical genomic composition, the actual protein expressed in various organs are completely different. Ideally, a technology enables to quantify the entire proteins in a cell could excel the progress of Life Science significantly; however, we are far from to achieving it. Here, in this blog, we test one popular RNA-Seq data analysis pipeline known as Tuxedo pipeline. The Tuxedo suite offers a set of tools for analyzing a variety of RNA-Seq data, including short-read mapping, identification of splice junctions, transcript and isoform detection, differential expression, visualizations, and quality control metrics.
A typical RNA-Seq data set consists of multiple samples as shown in Figure 1. Although the number of sample sets depends on the biological experimental designs, two sets of samples are used to make comparisons between normal vs. cancer samples or untreated vs. treated samples, for examples.

Figure 1 Tested Tuxedo pipeline workflow

All the samples are aligned individually in Step 1. In this pipeline, the Tophat process uses Bowtie 2 version 2.3.1 as an underlying short sequence read aligner. Step 3, Cuffmerge job has a dependency from all the previous jobs in Step 2. The results from Cufflinks jobs are collected at this step to merge together multiple Cufflinks assemblies which is required for Cuffdiff step. Cuffmerge also runs Cuffcompare in the background and automatically filters out transcription artifacts. Cuffnorm generates tables of expression values that are properly normalized for library size, and these tables can be used for other statistical analysis instead of CummeRbund. At Step 5, CummeRbund step is set to generate three plots, gene density, gene box and volcano plots by using R script.
A performance study of RNA-Seq pipeline is not trivial because the nature workflow requires non-identical input files. 185 RNA-Seq paired-end read data are collected from public data repositories. All the read data files contain around 25 Million Fragments (MF)[1] and have similar read lengths. The samples for a test randomly selected from the pool of 185 paired-end read files. Although these randomly selected data will not have any biological meaning, certainly these data will put the tests on the worst-case scenario with very high level of noise.
The test cluster configurations are summarized in Table 1.

 8x Dell EMC PowerEdge C6420 CPU 2x Xeon® Gold 6148 20c 2.4GHz (Skylake) RAM 12x 16GB @2666 MHz OS RHEL 7.4 Interconnect Intel® Omni-Path BIOS System Profile Performance Optimized Logical Processor Disabled Virtualization Technology Disabled

The test clusters and H600 storage system was connected via 4 x 100GbE links between two Dell Networking Z9100-ON switches. Each compute node was connected to the test cluster side Dell Networking Z9100-ON switch via single 10GbE. Four storage nodes in Dell EMC Isilon H600 was connected to the other switch via 4x 40GbE links. The configuration of the storage is listed in Table 2.

 Dell EMC Isilon H600 Number of nodes 4 CPU per node Intel® Xeon™ CPU E5-2680 v4 @2.40GHz Memory per node 256GB Storage Capacity Total usable space: 126.8 TB, 31.7 TB per node SSD L3 Cache 2.9 TB per node Network Front end network: 40GbE, Back end network: IB QDR OS Isilon OneFS v8.1.0.0 B_8_1_0_011

### Performance Evaluation

#### Two sample Test – Bare-minimum

DEG analysis requires at least two samples. In Figure 2, each step described in Figure 1 is submitted to Slurm job scheduler with proper dependencies. For example, Cuffmerge step must wait for all the Cufflinks jobs are completed. Two samples, let’s imagine one normal and one treated sample, begin with Tophat step individually and followed by Cufflinks step. Upon the completion of all the Cufflinks steps, Cuffmerge aggregates gene expressions in the entire samples provided. Then, subsequent steps, Cuffdiff and Cuffnorm begin. The output of Cuffnorm can be used for other statistical analysis. Cuffdiff steps generates gene expression differences at the gene level as well as isoformer level. CummeRbund step uses R-package CummeRbund to visualize the results as shown in Figure 3. The total runtime[2]  with 38 cores and two PowerEdge C6420s is 3.15 hours.

Figure 2 Tuxedo pipeline with two samples

Figure 3 shows differentially expressed genes in red with significantly lower p-values (Y-axis) compared to other gene expressions illustrated in black. X-axis is fold changes in log base of 2, and these fold changes of each genes are plotted against p-values. More samples will bring a better gene expression estimation. The right upper plot are gene expressions in sample 2 in comparisons with sample 1 whereas the left lower plot are gene expressions in sample 1 compared to sample 2. Gene expressions in black dots are not significantly different in both samples.

Figure 3 Volcano plot of the Cuffdiff results

#### Throughput Test – Single pipeline with more than two samples - Biological / Technical replicates

Typical RNA-Seq studies consist of multiple samples, sometime 100s of different samples, normal versus disease or untreated versus treated samples. These samples tend to have high level of noisy due to their biological reasons; hence, the analysis requires vigorous data preprocessing procedure.
Here, we tested various numbers of samples (all different RNA-Seq data selected from 185 paired-end reads data set) to see how much data can be processed by 8 nodes PowerEdge C6420 cluster. As shown in Figure 4, the runtimes with 2, 4, 8, 16, 32 and 64 samples grow exponentially when the number of samples increases. Cuffmerge step does not slow down as the number of samples grows while Cuffdiff and Cuffnorm steps slow down significantly. Especially, Cuffdiff step becomes a bottle-neck for the pipeline since the running time grows exponentially (Figure 5). Although Cuffnorm’s runtime increases exponentially like Cuffdiff, it is ignorable since Cuffnorm’s runtime is bounded by Cuffdiff’s runtime.

Figure 4 Runtime and throughput results

Figure 5 Behaviors of Cuffmerge, Cuffdiff and Cuffnorm

### Conclusion

The throughput test results show that 8 node PowerEdge C6420s with Isilon H600 can process roughly 1 Billion Fragments which is little more than 32 samples with ~50 million paired reads each (25 MF) through Tuxedo pipeline illustrated in Figure 1.

Since Tuxedo pipeline is relatively faster than other popular pipelines, it is hard to generalize or utilize these results for sizing a HPC system. However, this provides a good reference point to help designing a right size HPC system.

### Resources

Internal web page
External web page

Contacts

Americas
Kihoon Yoon
Sr. Principal Systems Dev Eng
Kihoon.Yoon@dell.com
+1 512 728 4191

[1] “For RNA sequencing, determining coverage is complicated by the fact that different transcripts are expressed at different levels. This means that more reads will be captured from highly expressed genes, and few reads will be captured from genes expressed at low levels. When planning RNA sequencing experiments, researchers usually think in terms of numbers of millions of reads to be sampled.” – cited from https://www.illumina.com/documents/products/technotes/technote_coverage_calculation.pdf
[2] Runtime refers Wall-Clock time throughout the blog.

#### Refresh with 14th Generation servers

Dell EMC's flexible HPC architecture for Life Sciences has been through a dramatic improvement with new Intel® Xeon® Scalable Processors. Dell EMC Ready Bundle for HPC Life Sciences equipped with better 14G servers, faster CPUs, and more memory bring a much higher performance for genomic data processing.

#### Refresh with 14th Generation servers

Dell EMC’s flexible HPC architecture for Life Sciences has been through a dramatic improvement with new Intel® Xeon® Scalable Processors. Dell EMC Ready Bundle for HPC Life Sciences equipped with better 14G servers, faster CPUs, and more memory bring a much higher performance in terms of throughput compared to the previous generation especially in genomic data processing.

### Optimized Relion Tests on Dell EMC PowerEdge C6420s with Dell EMC Isilon

#### Overview

The purpose of this blog is to validate the optimized Relion (for REgularised LIkelihood OptimizatioN) on Dell EMC PowerEdge C6420s with Skylake CPUs. Relion was developed from the Scheres lab at MRC Laboratory of Molecular Biology. It uses an empirical Bayesian approach to refining multiple 3D images or 2D class averages for the data generated from CryoElectron Microscopy (Cryo-EM). The impressive performance gain from Intel®’s efforts in the collaboration of Relion development team reduced the performance gap between CPUs and GPUs. The CPU/GPU performance comparison results are not shown here; however, the performance gap becomes single digit fold between Skylake CPU systems and Broadwell CPU/Tesla P100 GPU systems.

Essentially, Cryo-EM is a type of transmission electron microscopy (TEM)[1] for imaging frozen-hydrated specimens at cryogenic temperatures[2]. Specimens remain in their native state without the need for dyes or fixatives allowing the study of fine cellular structures, viruses and protein complexes at molecular resolution. A rapid vitrification at cryogenic temperature is the key step to avoid water molecule crystallization and forming amorphous solid that does almost no damage to the sample structure. Regular electron microscopy requires samples to be prepared in complex ways, and the sample preparations make hard to retaining the original molecular structures. Cryo-EM is not perfect like X-ray crystallography; however, it gains the popularity in the research community fast due to the simple sample preparation steps and flexibility to the sample size, complexity, and non-rigid structure. As the resolution revolution in Cryo-EM progresses due to the 40+ years of dedicated work from the structural biology community, we now have the ability to yield accurate, detailed, 3D models of intricate biological structures at the sub-cellular and molecular scales.

The tests were performed on 8 nodes of Dell PowerEdge C6420s which is a part of Dell EMC Ready Bundle for HPC Life Sciences. A summary of the configuration is listed in Table 1.

 Dell EMC PowerEdge C6420 Dell EMC PowerEdge R740xd CPU 2x Xeon® Gold 6148F 20c 2.4GHz (Skylake) 2x Xeon® Gold 6148F 20c 2.4GHz (Skylake) RAM 12x 32GB @2666 MHz 12x 32GB @2666 MHz OS RHEL 7.4 RHEL 7.4 Interconnect Intel® Omni-Path Intel® Omni-Path BIOS System Profile Performance Optimized Performance Optimized Logical Processor Enabled Enabled Virtualization Technology Dissabled Disabled Compiler and Libraries Intel Compiler and Libraries 2017.4.196 gcc 4.8.5, CUDA 9.1, OpenMPI 3.0.0 GLIBC 2.17-196.el7 2.17-196.el7 GPU N/A NVIDIA Tesla V100 with 16GB memory

The test clusters and F800/H600 storage systems were connected via 4 x 100GbE links between two Dell Networking Z9100-ON switches. Each of the compute nodes was connected to the test cluster side Dell Networking Z9100-ON switch via single 10GbE. Four storage nodes in the Dell EMC Isilon F800/H600 were connected to the other switch via 8x 40GbE links. The configuration of the storage is listed in Table 2. The detailed network topology is illustrated in Figure 1.

 Dell EMC Isilon F800 Dell EMC Isilon H600 Number of Nodes 4 4 CPU per node Intel® Xeon™ CPU E5-2697A v4 @2.60 GHz Intel® Xeon™ CPU E5-2680 v4 @2.40GHz Memory per node 256GB 256GB Storage Capacity Total usable space: 166.8 TB, 41.7 TB per node Total usable space: 126.8 TB, 31.7 TB per node SSD L3 Cache N/A 2.9 TB per node Network Front end network: 40GbE Back end network: 40GbE Front end network: 40GbE Back end network: IB QDR OS Isilon OneFS v8.1.0 DEV.0 Isilon OneFS v8.1.0.0 B_8_1_0_011

Figure 1 Illustration of Networking Topology

The original tests with an optimized Relion from Intel include Lustre as the main storage; however, we tested the identical optimized Relion binary against Dell EMC Isilon, F800 and H600 due to the limited availability of Dell EMC Lustre Storage at the time of testing. The detailed running scripts were also obtained from Intel as well to ensure that we reproduce the results.

The test data is the recommend standard benchmark data which is presented in Wong et al., 2014. This Plasmodium ribosome data set is download from the ftp site.

#### Performance Evaluation

Performance Comparisons of the Optimized Relion on Dell EMC PowerEdge C6420

Dell EMC PowerEdge C6420s were configured as listed in Table 1, and two Dell EMC Isilons, F800 and H600, were used for the tests. Although Intel’s results shown here in Figure 2 from the tests with a Lustre storage, we were not able to test with Dell EMC Lustre Storage due to the limited availability. As shown in Figure 2, the running times with various numbers of compute nodes are consistently slightly faster than the Intel’s results. This is small variations in our configurations such as Xeon® Gold 6148F instead of non-F series CPUs Intel used and the performance differences in Lustre and Isilon storages.

Figure 2 Optimized Relion Benchmark Comparisons with Intel®’s results and Dell EMC PowerEdge C6420. The storage used for the tests were Dell EMC Isilon, F800 and H600. Intel’s results shown here were using Lustre as a storage system; However, we were not able to run our tests with Dell EMC Lustre Storage due to the limited availability.

Performance Comparisons of Relion without the optimization

Figure 3 shows Relion performances without the code optimization. The bars labeled with ‘Intel E5-2697 v4’ are from the Intel’s publication. In the CPU test results, Relion v2.0.3 took 47.6 hours on Intel E5-2697 v4 (Broadwell) whereas the same version of Relion took 34.5 hours on Dell EMC PowerEdge C6420 and the latest version of Relion (Relion2-beta-v2.0) took 32.8 hours on Dell EMC PowerEdge R740xd.

Unfortunately, a GPU currently available, NVIDIA Tesla V100, is not compatible with Relion v2.0.3. Hence, the test was performed with Relion2-beta-v2.0. Therefore, the results for GPU tests are not fair comparisons.

NVIDIA Tesla V100 with 16GB memory performs similar to the result from 2 nodes – 8 rank test in Figure 2.

Figure 3 Relion performance comparisons with CPU and GPU

#### Conclusion

Dell EMC PowerEdge C6420 shows that it is an ideal compute platform for the Optimized Relion. It scales well over various number of compute nodes with Plasmodium ribosome data. In the future study, we plan to use a larger protein data and more compute nodes to accomplish more comprehensive scaling tests.

#### Resources

Internal web page
External web page

Contacts
Americas
Kihoon Yoon
Sr. Principal Systems Dev Eng
Kihoon.Yoon@dell.com
+1 512 728 4191

[1] Transmission Electron Microscopy refers to a technique where the image of specimen is formed by directing a high energy electron beam at a thin sample.

[2] It is not well defined; however, cryogenic temperatures is temperatures below around 123 K, which equals -150°C or -238°F.

NOTE: Updated on 5/11/2018

• #### Variant Calling (BWA-GATK) pipeline benchmark with Dell EMC Ready Bundle for HPC Life Sciences

13G/14G server performance comparisons with Dell EMC Isilon and Lustre Storage

## Overview

Variant calling is a process by which we identify variants from sequence data. This process helps determine if there is single nucleotide polymorphisms (SNPs), insertions and deletions (indels), and/or structural variants (SVs) at a given position in an individual genome or transcriptome. The main goal of identifying genomic variations is linking to human diseases. Although not all human diseases are associated with genetic variations, variant calling can provide a valuable guideline for geneticists working on a particular disease caused by genetic variations. BWA-GATK is one of the Next Generation Sequencing (NGS) computational tools that is designed to identify germline and somatic mutations from human NGS data. There are a handful of variant identification tools, and we understand that there is not a single tool performs perfectly (Pabinger et al., 2014). However, we chose GATK which is one of most popular tools as our benchmarking tool to demonstrate how well Dell EMC Ready Bundle for HPC Life Sciences can process complex and massive NGS workloads.

The purpose of this blog is to provide valuable performance information on the Intel® Xeon® Gold 6148 processor and the previous generation Xeon® E5-2697 v4 processor using a BWA-GATK pipeline benchmark on Dell EMC Isilon F800/H600 and Dell EMC Lustre Storage. The Xeon® Gold 6148 CPU features 20 physical cores or 40 logical cores when utilizing hyper threading. This processor is based on Intel’s new micro-architecture codenamed “Skylake”. Intel significantly increased the L2 cache per core from 256 KB on Broadwell to 1 MB on Skylake. The 6148 also touts 27.5 MB of L3 cache and a six channel DDR4 memory interface. The test cluster configurations are summarized in Table 1.

 Dell EMC PowerEdge C6420 Dell EMC PowerEdge C6320 CPU 2x Xeon® Gold 6148 20c 2.4GHz (Skylake) 2x Xeon® E5-2697 v4 18c 2.3GHz (Broadwell) RAM 12x 16GB @2666 MHz 8x 16GB @2400 MHz OS RHEL 7.3 Interconnect Intel® Omni-Path BIOS System Profile Performance Optimized Logical Processor Disabled Virtualization Technology Disabled BWA 0.7.15-r1140 sambamba 0.6.5 samtools 1.3.1 GATK 3.6

The test clusters and F800/H600 storage systems were connected via 4 x 100GbE links between two Dell Networking Z9100-ON switches. Each of the compute nodes was connected to the test cluster side Dell Networking Z9100-ON switch via single 10GbE. Four storage nodes in the Dell EMC Isilon F800/H600 were connected to the other switch via 8x 40GbE links. The configuration of the storage is listed in Table 2. For the Dell EMC Lustre Storage connection, four servers (a Metadata Server pair (MDS) and an Object Storage Server pair (OSS)) were connected to 13/14th Generation servers via Intel® Omni-Path. The detailed network topology is illustrated in Figure 1.

Figure 1 Illustration of Networking Topology: only 8 compute nodes are illustrated here for the simplicity, 64 nodes of 13G/14G servers used for the actual tests.

 Dell EMC Isilon F800 Dell EMC Isilon H600 Dell EMC Lustre Storage Number of nodes 4 4 2x Dell EMC PowerEdge R730 as MDS 2x Dell EMC PowerEdge R730 as OSS CPU per node Intel® Xeon™ CPU E5-2697A v4 @2.60 GHz Intel® Xeon™ CPU E5-2680 v4 @2.40GHz 2x Intel® Xeon™ E5-2630V4 @ 2.20GHz Memory per node 256GB Storage Capacity Total usable space: 166.8 TB, 41.7 TB per node Total usable space: 126.8 TB, 31.7 TB per node 960TB Raw, 768TB (698 TiB) usable MDS Storage Array: 1 x Dell EMC PowerVault MD3420 ( Total 24 - 2.5" 300GB 15K RPM SAS)OSS Storage Array: 4 x Dell EMC PowerVault MD3460 (Total 240 3.5" 4 TB 7.2K RPM NL SAS) SSD L3 Cache N/A 2.9 TB per node N/A Network Front end network: 40GbE Back end network: 40GbE Front end network: 40GbE Back end network: IB QDR Front end network: Intel® Omni-Path Internal connections: 12Gbps SAS OS Isilon OneFS v8.1.0 DEV.0 Isilon OneFS v8.1.0.0 B_8_1_0_011 Red Hat Enterprise Linux 7.2 (3.10.0-327.el7.x86_64)

The test data was chosen from one of Illumina’s Platinum Genomes. ERR091571 was processed with Illumina HiSeq 2000 submitted by Illumina and can be obtained from EMBL-EBI. The DNA identifier for this individual is NA12878. Although the description of the data from the linked website shows that this sample has a 30x depth of coverage, in reality it is closer to 10x coverage according to the number of reads counted.

Performance Evaluation

### BWA-GATK Benchmark over Generations

Dell EMC PowerEdge C6320s and C6420s were configured as listed in Table 1. The tests performed here are designed to demonstrate performance at the server level, not for comparisons on individual components. At the same time, the tests were also designed to estimate the sizing information of Dell EMC Isilon F800/H600 and Dell EMC Lustre Storage. The data points in Figure 2 are calculated based on the total number of samples (X axis in the figure) that were processed concurrently. The number of genomes per day metric is obtained from total running time taken to process the total number of samples in a test. The smoothed curves are generated by using a polynomial spline with the piecewise polynomial degree of 3 generating B-spline basis matrix. The details of BWA-GATK pipeline information can be obtained from the Broad Institute web site.

Figure 2 BWA-GATK Benchmark over 13/14 generation with Dell EMC Isilon and Dell EMC Lustre Storage: the number of compute nodes used for the tests are 64x C6420s and 63x C6320s (64x C6320s for testing H600). The number of samples per node was increased to get the desired total number of samples processed concurrently. For C6320 (13G), 3 samples per node was the maximum number of samples each node can process. 64, 104, and 126 test results for 13G system (blue) were with 2 samples per node while 129, 156, 180, 189 and 192 sample test results were obtained from 3 samples per node. For C6420 (14G), the tests were performed with maximum 5 samples per node. The plot for 14G was generated by processing 1, 2, 3, 4, and 5 samples per node. The number of samples per node is limited by the amount of memory in a system. 128 GB and 192 GB of RAM were used in 13G and 14G system, respectively as shown in Table 1. C6420s show a better scaling behavior than C6320s. 13G server with Broadwell CPUs seems to be more sensitive to the number of samples loaded onto system as shown from the results of 126 vs 129 sample tests on all the storages tested in this study.

The results with Dell EMC Isilon F800 that indicate C6320 with Broadwell/128GB RAM performs roughly 50 genomes per day less when 3 samples are processed per compute node and 30 genomes per day less when 2 samples are processed in each compute node compared to C6420. It is not clear if C6320’s performance will drop again when more samples are added to each compute node; however, it is obvious that C6420 does not show this behavior when the number of samples is increased on each compute node. The results also allow estimating the maximum performance of Dell EMC Isilon F800. As the total number of genomes increases, the increment of the number of genomes per day metric is slow down. Unfortunately, we were not able to identify the exact number of C6420s that would saturate Dell EMC Isilon F800 with four nodes. However, it is safe to say that more than 64x C6420s will require additional Dell EMC Isilon F800/H600 nodes to maintain high performance with more than 320 10x whole human genome samples. Dell EMC Lustre Storage did not scale as well as Dell EMC Isilon F800/H600. However, we observed that some optimizations are necessary to make Dell EMC Lustre Storage perform better. For example, the aligning, sorting, and marking duplicates steps in the pipeline perform extremely well when the file system’s stripe size was set to 2MB while other steps perform very poorly with 2MB stripe size. This suggests that Dell EMC Lustre Storage needs to be optimized further for these heterogeneous workloads in the pipeline. Since there is not any concrete configuration for the pipeline, we will further investigate the idea of using multiple tier file systems to cover different requirements in each step for both Dell EMC Isilon and Lustre Storage.

Dell EMC PowerEdge C6320 with Dell EMC Isilon H600 performance reached the maximum around 140 concurrent 10x human whole genomes. Running three 10x samples concurrently on a single node is not ideal. This limit appears to be on the compute node side, since H600 performance is much better with C6420s running a similar number of samples.

Conclusion

Dell EMC PowerEdge C6420 has at least a 12% performance gain compared to the previous generation. Each C6420 compute node with 192 GB RAM can process about seven 10x whole human genomes per day. This number could be increased if the C6420 compute node is configured with more memory. In addition to the improvement on the 14G server side, four Isilon F800 nodes in a 4U chassis can support 64x C6420s and 320 10x whole human genomes concurrently.

Resources

External web page https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3956068/

Contacts
Americas
Kihoon Yoon
Sr. Principal Systems Dev Eng
Kihoon.Yoon@dell.com
+1 512 728 4191

• #### Understanding the Role of Dell EMC Isilon SmartConnect in Genomics Workloads

Coming together with EMC has opened many new opportunities for the Dell EMC HPC Team to develop high-performance computing and storage solutions for the Life Sciences. Our lab recently stood up a ‘starter' 3 node Dell EMC Isilon X410 cluster. As a loyal user of the Isilon X210 in a previous role, I couldn’t wait to start profiling genomics applications using the X410 with Dell EMC HPC System for Life Sciences.

Because our current Isilon X410 storage cluster is currently fixed at the 3 node minimum, we aren’t set up yet to evaluate the scalability of the X410 with genomics workflows. We will tackle this work once our lab receives additional X nodes and the new the Isilon All-Flash node (formerly project Nitro).

In the meantime, I wanted to understand how the Isilon storage behaves relative to other storage solutions and decided to focus on the role of Isilon SmartConnect.

Through a single host name, SmartConnect enables client connection load balancing and dynamic network file system (NFS) failover and failback of client connections across storage nodes to provide optimal utilization of the Isilon cluster resources.

Without the need to install client-side drivers, administrators can easily manage a large and growing number of clients and ensure in the event of a system failure, in-flight reads and writes will successfully finish without failing.

Traditional storage systems with two-way failover typically sustain a minimum 50 percent degradation in performance when a storage head fails, as all clients must fail over to the remaining head. With Isilon SmartConnect, clients are evenly distributed across all remaining nodes in the cluster during failover, helping to ensure minimal performance impact.

To test this concept, I ran the GATK pipeline varying the number of samples and compute nodes without and with SmartConnect enabled on the Isilon storage cluster.

The configuration of our current lab environment and whole human genome sequencing data used for this evaluation are listed below.

Table 1 System configuration, software, and data

 Dell EMC HPC System for Life Sciences Server 40 x PowerEdge C6320 Processor 2 x Intel Xeon E5-2697 v4. 18 cores per socket, 2.3 GHz Memory 128 GB at 2400 MT/s Interconnect 10GbE NIC and switch for accessing Isilon & Intel Omni-Path fabric Software Operating System Red Hat Enterprise Linux 7.2 BWA 0.7.2-r1039 Samtools 1.2.1 Sambamba 0.6.0 GATK 3.5 ERR091571 10x Whole human genome sequencing data from Illumina HiSeq 2000, Total number of reads = 211,437,919

As noted earlier, the Isilon in our environment is currently set up in a minimum 3 node configuration. The current generation of Isilon is scalable up to 144 nodes. As you add additional Isilon nodes, the aggregate memory, throughput, and IOPS scale linearly. For a deep dive on Isilon and OneFS file system, see this technical white paper

The Isilon storage cluster in our lab is summarized in Table 2. The Isilon storage is mounted on each compute nodes, up to 40 nodes, through NFS service (version 3) over a 10GbE network.

Table 2: Current Isilon Configuration

 Server 3 x X410 Processor 2 x Intel(R) Xeon(R) CPU E5-2640 v2 @ 2.00GHz (1995.20-MHz K8-class CPU), 16 cores. Memory 256 GB at 2400 MT/s Back End Networking 2 x IB QDR links Front-end Networking 2 x 1GbE ports and 2 x 10GbE SFP+ ports Storage Capacity 4TB x 30 HDDs, 120TB (usable) Software Operating System OneFS 8.0 Isilon SmartConnect Round Robin Mode

Table 3 summarizes all the tests we performed. To mimic a storage environment without proper load balancing, all the tests were performed without SmartConnect enabled except the concurrent 120 sample run.

Each pipeline (job) runs with one sample and uses 11 cores on a single compute node. A maximum three pipelines run concurrently on a single compute node. The tests were performed up to 40 nodes and 120 samples.

I included the detailed running times for each sub-step in BWA-GATK pipeline in Table 3. The running times for Aligning & Sorting and HaplotypeCaller steps are the bottlenecks in the pipeline, but the Aligning & Sorting step is more sensitive to the number of samples. In this benchmark, GenotypeGVCFs is not a bottleneck since we used an identical sequence data for all the concurrent pipelines. In real data analysis, a large number of different samples are used, and GenotypeGVCFs becomes the major bottleneck.

Table 3 Test results for BWA-GATK pipeline without and with Isilon SmartConnect enabled.

 Number of samples (Data Size) 3 15 30 60 90 120 *120 with SmartConnect Number of compute nodes 1 5 10 20 30 40 40 Aligning & Sorting 3.48 3.54 3.64 4.21 4.94 5.54 4.69 Mark/Remove Duplicates 0.46 0.49 0.79 1.27 2.52 3.07 1.84 Generate Realigning Targets 0.19 0.18 0.19 0.19 0.18 0.2 0.18 Realign around InDel 2.22 2.2 2.24 2.27 2.26 2.27 2.29 Base Recalibration 1.17 1.18 1.19 1.18 1.16 1.13 1.17 HaplotypeCaller 4.13 4.35 4.39 4.34 4.31 4.32 4.29 GenotypeGVCFs 0.02 0.02 0.02 0.02 0.02 0.02 0.02 Variant Recalibration 0.58 0.5 0.53 0.55 0.57 0.53 0.41 Apply Recalibration 0.02 0.02 0.02 0.02 0.02 0.02 0.02 Total Running Time (Hrs) 12.3 12.5 13 14.1 16 17.1 15.0 Number of Genomes per Day 6 29 53 96 129 156 185

As shown in Table 3, after 30 samples on 10 compute nodes, the total running time of BWA-GATK began to increase. When the number of compute nodes doubled, the total run time continued to climb without SmartConnect enabled. Also starting at 30 samples, we saw jobs starting to fail presumably due to unbalanced client connections and inability to failover and failback those connections.

However, when we enabled SmartConnect using the default round-robin settings, we saw a significant improvement on the total run time and daily sample throughput.

As expected, SmartConnect, maximized performance by keeping client connections balanced across all three Isilon storage nodes. In the three X410 configuration with SmartConnect enabled, the 120 samples processed with 40 compute nodes showed a 14% speed-up and 19% increased daily sample throughput.

This test also suggests a starting point identifying the number client connections per Isilon node for this genomics workflow. In our case, adding one additional X410 to the Isilon storage cluster for each 15 additional compute nodes may be a reasonable place to start.

As we add additional Isilon nodes to our cluster, we will perform additional studies to refine recommendations for the number of client connections per Isilon node for this genomics workflow. We’ll also take a deeper dive with the advanced SmartConnect load balancing options like CPU utilization, connection count, and network throughput. The Isilon SmartConnect White Paper provides a detailed summary and examples for each of these modes.

If you are using Bright Cluster Manger and need some tips to set it up with Isilon SmartConnect, read this post.

• #### Scaling behavior of short read sequence aligners

This blog explores the scaling behaviors of the most popular three short read sequence aligners; BWA-mem, Bowtie and Bowtie2. Aligning process is the beginning of all Next Generation Sequencing (NGS) data analysis and typically the most time-consuming step in any NGS data analysis pipeline.

It is clear that using more cores for the aligning process will help to speed up the process, but as you already know parallelization comes with cost. Higher complexity, perhaps order of magnitude, due to multiple instruction streams and data flowing between the instruction streams can easily overshadow the speed gain by parallelizing the process. Hence, it is not wise to use the entire cores in a compute node to speed up blindly especially when the overall throughput is more important. Identifying a sweet spot for the optimal number of cores for each aligning process and maximizing the overall throughput at the same time is not an easy task in a complex workflow.

Table 1 Server configuration and software

 Component Detail Server PowerEdge FC430 in FX2 chassis - 2 Sockets Processor Intel® Xeon® Dual E5-2695 v3 - 14 cores, total 28 physical cores Memory 128GB - 8x 16GB RDIMM, 2133 MT/s, Dual Rank, x4 Data Width Storage 480TB IEEL (Lustre) Interconnect InfiniBand FDR OS Red Hat Enterprise 6.6 Cluster Management tool Bright Cluster Manager 7.1 Short Sequence Aligner BWA 0.7.2-r1039, Bowtie 1.1.2, Bowtie 2.2.6

Table 1 shows the summary of the system for the test here. Hyperthreading was not enabled for the test although it helps to improve the overall performance.

In Figure 1, BWA running times were measured for each different sizes of paired end read sequencing data. The size of sequence read data is represented as million fragments (MF) here. For example, 2MF means that the sequence data consist of two fastq files containing two million sequence reads in each file. One read is in a fastq file, and the corresponding paired read is in the other fastq file. A sweet spot for BWA-mem is from 4 to 16 cores roughly depending on the sequence read file size. Typically, the sequence read size is larger than 10 million and hence, 2MF and 10MF results are not realistic. However, larger sequence read data follow the behavior of the smaller input sizes as well. In the speedup chart, blue solid line labeled as 'ideal' represents the theoretical speedup we could obtain by increasing the number of cores.

Bowtie results in Figure 2 shows that the similar sweet spot comparing to BWA-mem results. However, the running time is slightly faster than BWA-mem’s. It is notable that Bowtie and Bowtie2 are sensitive to the read lengths while BWA-mem shows more consistent scaling behavior regardless of the sequence read lengths.

Although Bowtie2 is even faster than Bowtie, it is more sensitive to the length of the sequence reads as shown in Figure 3. Actually, the total number of nucleotides could be a better matrix to estimate the running time for a given sequence read data.

In practice, there are more factors to consider to maximize the overall throughput. One of critical factors is the bandwidth of existing storages in order to utilize the sweet spot instead of using the entire cores in a compute node. For example, if we decided to use 14 cores for each aligning process instead of using 28 cores, this will double up the number of samples processed simultaneously. However, the twice number of processes will fill up the limited storage bandwidth, and overly used storages will slow down significantly the entire processes running simultaneously.

Also, these aligning processes are not used alone typically in a pipeline. It is frequently tied up with a file conversion and a sorting process since subsequent analysis tools requiring these alignment results sorted in either chromosome coordinates or the name of the reads. The most popular approach is to use ‘pipe and redirection’ to save time to write multiple output files. However, this practice makes the optimization harder since it generally requires more computational resources. More detailed optimization for NGS pipelines in this aspect will be discussed in the next blogs.

• #### Variant Calling Benchmark -- Not Only Human

Variant call process refers to the identification of a nucleotide difference from reference sequences at a given position in an individual genome or transcriptome. It includes single nucleotide polymorphism (SNPs), insertion/deletions (indels) and structural variants. One of the most popular variant calling applications is GenomeAnalysisTK (GATK) from Broad Institute. Often this GATK is used with BWA to compose a variant calling workflow focusing on SNPs and indels. After we published Dell HPC System for Genomics White Paper last year, there were significant changes in GATK. The key process, variant call step UnifiedGenotyper is no longer recommended in their best practice. Hence, here we recreate BWA-GATK pipeline according to the recommended practice to test whole genome sequencing data from mammals and plants in addition to human’s whole genome sequencing data. This is a part of Dell’s effort to help customers estimating their infrastructure needs for their various genomics data loads by providing a comprehensive benchmark.

# Variant Analysis for Whole Genome Sequencing data

## System

The detailed configuration is in Dell HPC System for Genomics White Paper, and the summary of system configuration and software is in Table 2.

Table 1 Server configuration and software

 Component Detail Server 40x PowerEdge FC430 in FX2 chassis Processor Total of 1120 cores: Intel® Xeon® Dual E5-2695 v3 - 14 cores Memory 128GB - 8x 16GB RDIMM, 2133 MT/s, Dual Rank, x4 Data Width Storage 480TB IEEL (Lustre) Interconnect InfiniBand FDR OS Red Hat Enterprise 6.6 Cluster Management tool Bright Cluster Manager 7.1 Short Sequence Aligner BWA 0.7.2-r1039 Variant Analysis GATK 3.5 Utilities sambamba 0.6.0, samtools 1.2.1

## BWA-GATK pipeline

The current version of GATK is 3.5, and the actual workflow tested obtained from the workshop, ‘GATK Best Practices and Beyond’. In this workshop, they introduce a new workflow with three phases.

• Best Practices Phase 1: Pre-processing
• Best Practices Phase 2A: Calling germline variants
• Best Practices Phase 2B: Calling somatic variants
• Best Practices Phase 3: Preliminary analyses

Here we tested out phase 1, phase 2A and phase3 for germline variant call pipeline. The details of commands used in benchmark are listed below.

### Phase 1. Pre-processing

#### Step 1. Aligning and Sorting

bwa mem -c 250 -M -t [number of threads] -R ‘@RG\tID:noID\tPL:illumine\tLB:noLB\tSM:bar’ [reference chromosome] [read fastq 1] [read fastq 2] | samtools view -bu - | sambamba sort -t [number of threads] -m 30G --tmpdir [path/to/temp] -o [sorted bam output] /dev/stdin

#### Step 2. Mark and Remove Duplicates

sambamba markdup -t [number of threads] --remove-duplicates --tmpdir=[path/to/temp] [input: sorted bam output] [output: bam without duplicates]

#### Step 3. Generate Realigning Targets

java -d64 -Xms4g -Xmx30g -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -nt [number of threads] -R [reference chromosome] -o [target list file] -I [bam without duplicates] -known [reference vcf file]

#### Step 4. Realigning around InDel

java -d64 -Xms4g -Xmx30g -jar GenomeAnalysisTK.jar -T IndelRealigner -R [reference chromosome] -I [bam without duplicates] -targetIntervals [target list file] -known [reference vcf file] -o [realigned bam]

#### Step 5. Base Recalibration

java -d64 -Xms4g -Xmx30g -jar GenomeAnalysisTK.jar -T BaseRecalibrator -nct [number of threads] -l INFO -R [reference chromosome] -I [realigned bam] -known [reference vcf file] -o [recalibrated data table]

#### Step 6. Print Recalibrated Reads - Optional

java -d64 -Xms8g -Xmx30g -jar GenomeAnalysisTK.jar -T PrintReads -nct [number of threads] -R [reference chromosome] -I [realigned bam] -BQSR [recalibrated data table] -o [recalibrated bam]

#### Step 7. After Base Recalibration - Optional

java -d64 -Xms4g -Xmx30g -jar GenomeAnalysisTK.jar -T BaseRecalibrator -nct [number of threads] -l INFO -R [reference chromosome] -I [recalibrated bam] -known [reference vcf file] -o [post recalibrated data table]

#### Step 8. Analyze Covariates - Optional

java -d64 -Xms8g -Xmx30g -jar GenomeAnalysisTK.jar -T AnalyzeCovariates -R [reference chromosome] -before [recalibrated data table] -after [post recalibrated data table] -plots [recalibration report pdf] -csv [recalibration report csv]

### Phase 2. Variant Discovery – Calling germline variants

#### Step 1. Haplotype Caller

java -d64 -Xms8g -Xmx30g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -nct [number of threads] -R [reference chromosome] -ERC GVCF -BQSR [recalibrated data table] -L [reference vcf file] -I [recalibrated bam] -o [gvcf output]

#### Step 2. GenotypeGVCFs

java -d64 -Xms8g -Xmx30g -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -nt [number of threads] -R [reference chromosome] -V [gvcf output] -o [raw vcf]

### Phase 3. Preliminary Analyses

#### Step 1. Variant Recalibration

java -d64 -Xms512m -Xmx2g -jar GenomeAnalysisTK.jar -T VariantRecalibrator -R [reference chromosome] --input [raw vcf] -an QD -an DP -an FS -an ReadPosRankSum -U LENIENT_VCF_PROCESSING --mode SNP --recal_file [raw vcf recalibration] --tranches_file [raw vcf tranches]

#### Step 2. Apply Recalibration

java -d64 -Xms512m -Xmx2g -jar GenomeAnalysisTK.jar -T ApplyRecalibration -R [reference chromosome] -input [raw vcf] -o [recalibrated filtered vcf] --ts_filter_level 99.97 --tranches_file [raw vcf tranches] --recal_file [raw vcf recalibration] --mode SNP -U LENIENT_VCF_PROCESSING

## Job Scheduling

Torque/Maui is used to manage a large number of jobs to process sequencing samples simultaneously. Optional steps, 6, 7 and 8 in phase 1 were not included in the benchmark since Step 6 PrintRead took 12.5 hours with 9 threads for Bos Taurus sample (18 hours with single thread). These optional steps are not required, but these steps are useful for the reporting purpose. If it is necessary, it can be added as a side workflow to the main procedure. For each job, 9 cores were assigned when 120 concurrent jobs were processed concurrently and 13 cores were used for the test of 80 concurrent jobs.

## Data

In addition to the benchmark for human whole genome sequencing data published in the whitepaper, we gathered cow, pig, two sub-species of rice (japonica and indica) and corn reference genomes from Illumina’s iGenome site and Ensembl database. Fortunately, reference variant call data exist as a standard VCF file format for human, cow and pig. A variant data for japonica rice were obtained from 3000 Rice Genome on AWS and was modified according to the standard VCF file format. However, the chromosome coordinates in this VCF file do not match to the actual reference chromosome sequences, and we were not able to find matching version of reference variant information from public databases. For indica rice and corn, we gathered the variant information from Ensembl and converted them into a compatible VCF format. Whole genome sequencing data were obtained from European Nucleotide Archive (ENA). ENA Sample IDs in Table 1 are the identifiers allow to retrieve sequence data from the site. Although it is not ideal to test an identical input for large number of processes, it is not feasible to obtain large number of similar sample data from public databases.

Table 2 WGS test data for the different species: * x2 indicates the data is paired end reads. ƚTest ID column represent identifiers for the sequence data used throughout the test.

 Species Test IDƚ ENA Sample ID Sample Base Count Single file Size x2* Reference Genome size (bp) Depth of Coverage Number of variants in Ref Homo sapiens (human) Hs1 ERR091571 42,710,459,638 17 GB x2 3,326,743,047 13x 3,152,430 Hs2 ERR194161 171,588,070,386 54 GB x2 3,326,743,047 52x Bos Taurus (cow) Bt1 SRR1706031 82,272,305,762 35 GB x2 2,649,685,036 31x 93,347,258 Bt2 SRR1805809 32,681,063,800 12 GB x2 2,649,685,036 12x Sus scrofa (pig) Ss1 SRR1178925 41,802,035,944 19 GB x2 3,024,658,544 14x 52,573,286 Ss2 SRR1056427 24,901,150,040 10 GB x2 3,024,658,544 8x Oryza sativa (rice) japonica Osj SRR1450198 49,676,959,200 22 GB x2 374,424,240 132x 19,409,227 indica Osi SRR3098100 12,191,702,544 4 GB x2 411,710,190 30x 4,538,869 Zea mays (corn) Zm SRR1575496 36,192,217,200 14 GB x2 3,233,616,351 11x 51,151,183

# Benchmark Results

## Data Quality

After mapping and sorting of the sequence input files, quality statistics were obtained from the output files of Phase 1, Step 1. SRR17060031 sample is from bovine gut metagenomics study and was not well mapped onto Bos taurus UMD3.1 reference genome from Ensembl as expected. The majority of DNAs from bovine gut is foreign and has different sequence composition.

Table 3 Mapping qualities of sequence reads data; obtained by using ‘samtools flagstat’. ‘Total QC-passed reads’ is the number of reads passed the criteria of sequencing quality. Among all QC-passed reads, the number of reads actually mapped on a reference genome and its percentage is on ‘Mapped reads (%)’ column. ‘Paired in sequencing’ column is the number of paired reads properly paired by a sequencer. Among the reads properly paired in sequencing, the number of those paired reads mapped on a reference genome as paired reads is listed in ‘Properly paired (%) in mapping.

 Species Sequencing Reads Test ID Total QC-passed reads Mapped reads (%) Paired in sequencing Properly paired (%) in mapping Human ERR091571 Hs1 424,118,221 421,339,198 (99.34%) 422,875,838 412,370,120 (97.52%) ERR194161 Hs2 1,691,135,957 1,666,486,126 (98.54%) 1,686,908,514 1,621,073,394 (96.10%) Cow SRR1706031 Bt1 813,545,863 29,291,792 (  3.60%) 813,520,998 28,813,072 (  3.54%) SRR1805809 Bt2 327,304,866 316,654,265 (96.75%) 326,810,638 308,600,196 (94.43%) Pig SRR1178925 Ss1 416,854,287 379,784,341 (91.11%) 413,881,544 344,614,170 (83.26%) SRR1056427 Ss2 249,096,674 228,015,545 (91.54%) 246,546,040 212,404,874 (86.15%) Rice SRR1450198 Osj 499,853,597 486,527,154 (97.33%) 496,769,592 459,665,726 (92.53%) SRR3098100 Osi 97,611,519 95,332,114 (97.66%) 96,759,544 86,156,978 (89.04%) Corn SRR1575496 Zm 364,636,704 358,393,982 (98.29%) 361,922,172 315,560,140 (87.19%)

The rest of samples were properly aligned on the reference genome with high quality; more than 80% of reads paired in sequencing data properly mapped as pairs on reference genomes.

It is also important to check what level of mismatches exists in the aligning results. The estimated variance in human genome is one in every 1,200 to 1,500 bases. This makes 3 million base differences between any two people randomly picked. However, as shown in Table 4, the results are not quite matched to the 3 million base estimation. Ideally, 36 million mismatches should be shown in Hs1 data set since it covers the human reference genome 13 times. However, the rate of mismatches is quite higher than the estimation, and at least one out of two variants reported by the sequencing might be an error.

Table 4 The number of reads mapped perfectly on a reference genome and the number of reads mapped partially

 Test ID Depth Mapped reads Number of reads mapped with mismatches (mm) Perfect match (%) One mm (%) Two mm (%) Three mm (%) Four mm (%) Five mm (%) Hs1 13x 421,339,198 328,815,216 (78.0) 53,425,338 (12.7) 13,284,425 (3.2) 6,842,191 (1.6) 5,140,438 (1.2) 4,082,446 (1.0) Hs2 52x 1,666,486,126 1,319,421,905 (79.2) 201,568,633 (12.1) 47,831,915 (2.9) 24,862,727 (1.5) 19,052,800 (1.1) 15,568,114 (0.9) Bt1 31x 29,291,792 25,835,536 (88.2) 2,684,650 (9.2) 338,781 (1.2) 147,841 (0.5) 89,706 (0.3) 70,789 (0.24) Bt2 12x 316,654,265 158,463,463 (50.0) 68,754,190 (21.7) 29,544,252 (9.3) 17,337,205 (5.5) 12,639,289 (4.0) 10,015,029 (3.2) Ss1 14x 379,784,341 228,627,231 (60.2) 69,912,403 (18.4) 29,142,572 (7.7) 16,701,248 (4.4) 11,036,852 (2.9) 7,652,513 (2.0) Ss2 8x 228,015,545 112,216,441 (49.2) 53,739,562 (23.6) 25,132,226 (11.0) 13,874,636 (6.1) 8,431,144 (3.7) 5,375,833 (2.4) Osj 132x 486,527,154 208,387,077 (42.8) 113,948,043 (23.4) 61,697,586 (12.7) 37,520,642 (7.7) 23,761,302 (4.9) 15,370,422 (3.2) Osi 30x 95,332,114 54,462,837 (57.1) 17,325,526 (18.2) 8,190,929 (8.6) 5,146,096 (5.4) 3,394,245 (3.6) 2,322,355 (2.4) Zm 11x 358,393,982 150,686,819 (42.1) 82,912,817 (23.1) 44,823,583 (12.5) 28,375,226 (7.9) 19,093,235 (5.3) 12,503,856 (3.5)

## Time Measurement

Total run time is the elapsed wall time from the earliest start of Phase 1, Step 1 to the latest completion of Phase 3, Step 2. Time measurement for each step is from the latest completion time of the previous step to the latest completion time of the current step as described in Figure 1.

The running time for each data set is summarized in Table 4. Clearly the input size, size of sequence read files and reference genomes are the major factors affecting to the running time. The reference genome size is a major player for ‘Aligning & Sorting’ step while the size of variant reference affects most on ‘HaplotypeCaller’ step.

Table 5 running time for BWA-GATK pipeline

 Species Homo sapiens (human) Bos taurus (cow) Sus scrofa (pig) Oryza sativa (rice) Zea mays (corn) japonica Indica Depth of Coverage 13x 52x 31x 12x 14x 8x 132x 30x 11x Test ID Hs1 Hs2 Bt1 Bt2 Ss1 Ss2 Osj Osi Zm Total read size, gzip compressed (GB) 34 108 70 22 38 20 44 8 28 Number of samples ran concurrently 80 80 120 80 120 80 120 80 80 Run Time (hours) Aligning & Sorting 3.93 15.79 7.25 5.77 7.53 3.04 9.50 1.18 11.16 Mark/Remove Duplicates 0.66 2.62 3.45 0.73 1.07 0.27 1.27 0.12 0.72 Generate Realigning Targets 0.29 1.08 3.12 1.57 0.47 0.27 0.22 0.05 0.26 Realign around InDel 2.50 8.90 4.00 3.15 2.87 1.83 7.37 1.25 3.18 Base Recalibration 1.82 6.80 1.39 1.96 2.37 1.01 3.16 0.36 1.91 HaplotypeCaller 4.52 10.28 2.75 9.33 26.21 14.65 8.95 1.77 16.72 GenotypeGVCFs 0.03 0.03 0.20 0.05 0.34 0.06 1.12 0.01 0.04 Variant Recalibration 0.67 0.37 0.32 0.86 0.58 0.56 0.92 0.04 0.46 Apply Recalibration 0.04 0.04 0.03 0.06 0.03 0.08 0.03 0.01 0.05 Total Run Time 14.5 45.9 22.5 23.5 41.5 21.8 32.5 4.78 34.5 Number of Genomes per day 133 42 128 82 46 88 89 402 56

## Discussion

The running time of the current version, GATK 3.5 is overly slower than the version of 2.8-1 we tested in our white paper. Particularly, HaplotypeCaller in the new workflow took 4.52 hours while UnifiedGenotyper in the older version took about 1 hour. Despite of the significant slow-down, GATK team believes HaplotypeCaller brings a better result, and that is worthy for the five times longer run.

There are data issues in non-human species. As shown in Table 4, for the similar size of inputs, Hs1 and Ss1 show large difference in the running time. The longer running time in non-human species can be explained by the quality of reference data. Aligning and sorting process takes more than twice times in other mammals, and it became worse in plants. It is known that plants genomes contain large number of repeat sequences which make mapping process difficult. It is important to note that the shorter running time for HaplotypeCaller in rice does not reflect a real time since the size of the reference variant file was reduced significantly due to the chromosome length/position mismatches in the data. All the variant records outside of chromosome range were removed, but position mismatches were used without corrections. The smaller size of the reference variant information and wrong position information the running time of HaplotypeCaller shorter. Corn’s reference data is not any better in terms of the accuracy of these benchmark. These data errors are the major causes of longer processing time.

Nonetheless, the results shown here could serve as good reference points for the worst case running time. Once reference data are cleaned up by researchers, the overall running time for other mammals should be similar to the one from Hs1 in Table 4 with a proper scaling of input size. However, it is hard to estimate an accurate running times for non-human species at this moment.