The purpose of this blog is to validate the optimized Relion (for REgularised LIkelihood OptimizatioN) on Dell EMC PowerEdge C6420s with Skylake CPUs. Relion was developed from the Scheres lab at MRC Laboratory of Molecular Biology. It uses an empirical Bayesian approach to refining multiple 3D images or 2D class averages for the data generated from CryoElectron Microscopy (Cryo-EM). The impressive performance gain from Intel®’s efforts in the collaboration of Relion development team reduced the performance gap between CPUs and GPUs. The CPU/GPU performance comparison results are not shown here; however, the performance gap becomes single digit fold between Skylake CPU systems and Broadwell CPU/Tesla P100 GPU systems.
Essentially, Cryo-EM is a type of transmission electron microscopy (TEM) for imaging frozen-hydrated specimens at cryogenic temperatures. Specimens remain in their native state without the need for dyes or fixatives allowing the study of fine cellular structures, viruses and protein complexes at molecular resolution. A rapid vitrification at cryogenic temperature is the key step to avoid water molecule crystallization and forming amorphous solid that does almost no damage to the sample structure. Regular electron microscopy requires samples to be prepared in complex ways, and the sample preparations make hard to retaining the original molecular structures. Cryo-EM is not perfect like X-ray crystallography; however, it gains the popularity in the research community fast due to the simple sample preparation steps and flexibility to the sample size, complexity, and non-rigid structure. As the resolution revolution in Cryo-EM progresses due to the 40+ years of dedicated work from the structural biology community, we now have the ability to yield accurate, detailed, 3D models of intricate biological structures at the sub-cellular and molecular scales.
The tests were performed on 8 nodes of Dell PowerEdge C6420s which is a part of Dell EMC Ready Bundle for HPC Life Sciences. A summary of the configuration is listed in Table 1.
The test clusters and F800/H600 storage systems were connected via 4 x 100GbE links between two Dell Networking Z9100-ON switches. Each of the compute nodes was connected to the test cluster side Dell Networking Z9100-ON switch via single 10GbE. Four storage nodes in the Dell EMC Isilon F800/H600 were connected to the other switch via 8x 40GbE links. The configuration of the storage is listed in Table 2. The detailed network topology is illustrated in Figure 1.
Front end network: 40GbE
Back end network: 40GbE
Back end network: IB QDR
Figure 1 Illustration of Networking Topology
The original tests with an optimized Relion from Intel include Lustre as the main storage; however, we tested the identical optimized Relion binary against Dell EMC Isilon, F800 and H600 due to the limited availability of Dell EMC Lustre Storage at the time of testing. The detailed running scripts were also obtained from Intel as well to ensure that we reproduce the results.
The test data is the recommend standard benchmark data which is presented in Wong et al., 2014. This Plasmodium ribosome data set is download from the ftp site.
Performance Comparisons of the Optimized Relion on Dell EMC PowerEdge C6420
Dell EMC PowerEdge C6420s were configured as listed in Table 1, and two Dell EMC Isilons, F800 and H600, were used for the tests. Although Intel’s results shown here in Figure 2 from the tests with a Lustre storage, we were not able to test with Dell EMC Lustre Storage due to the limited availability. As shown in Figure 2, the running times with various numbers of compute nodes are consistently slightly faster than the Intel’s results. This is small variations in our configurations such as Xeon® Gold 6148F instead of non-F series CPUs Intel used and the performance differences in Lustre and Isilon storages.
Figure 2 Optimized Relion Benchmark Comparisons with Intel®’s results and Dell EMC PowerEdge C6420. The storage used for the tests were Dell EMC Isilon, F800 and H600. Intel’s results shown here were using Lustre as a storage system; However, we were not able to run our tests with Dell EMC Lustre Storage due to the limited availability.
Dell EMC PowerEdge C6420 shows that it is an ideal compute platform for the Optimized Relion. It scales well over various number of compute nodes with Plasmodium ribosome data. In the future study, we plan to use a larger protein data and more compute nodes to accomplish more comprehensive scaling tests.
Internal web page External web page
Contacts Americas Kihoon Yoon Sr. Principal Systems Dev Eng Kihoon.Yoon@dell.com +1 512 728 4191
 Transmission Electron Microscopy refers to a technique where the image of specimen is formed by directing a high energy electron beam at a thin sample.
 It is not well defined; however, cryogenic temperatures is temperatures below around 123 K, which equals -150°C or -238°F.
13G/14G server performance comparisons with Dell EMC Isilon and Lustre Storage
Variant calling is a process by which we identify variants from sequence data. This process helps determine if there is single nucleotide polymorphisms (SNPs), insertions and deletions (indels), and/or structural variants (SVs) at a given position in an individual genome or transcriptome. The main goal of identifying genomic variations is linking to human diseases. Although not all human diseases are associated with genetic variations, variant calling can provide a valuable guideline for geneticists working on a particular disease caused by genetic variations. BWA-GATK is one of the Next Generation Sequencing (NGS) computational tools that is designed to identify germline and somatic mutations from human NGS data. There are a handful of variant identification tools, and we understand that there is not a single tool performs perfectly (Pabinger et al., 2014). However, we chose GATK which is one of most popular tools as our benchmarking tool to demonstrate how well Dell EMC Ready Bundle for HPC Life Sciences can process complex and massive NGS workloads.
The purpose of this blog is to provide valuable performance information on the Intel® Xeon® Gold 6148 processor and the previous generation Xeon® E5-2697 v4 processor using a BWA-GATK pipeline benchmark on Dell EMC Isilon F800/H600 and Dell EMC Lustre Storage. The Xeon® Gold 6148 CPU features 20 physical cores or 40 logical cores when utilizing hyper threading. This processor is based on Intel’s new micro-architecture codenamed “Skylake”. Intel significantly increased the L2 cache per core from 256 KB on Broadwell to 1 MB on Skylake. The 6148 also touts 27.5 MB of L3 cache and a six channel DDR4 memory interface. The test cluster configurations are summarized in Table 1.
The test clusters and F800/H600 storage systems were connected via 4 x 100GbE links between two Dell Networking Z9100-ON switches. Each of the compute nodes was connected to the test cluster side Dell Networking Z9100-ON switch via single 10GbE. Four storage nodes in the Dell EMC Isilon F800/H600 were connected to the other switch via 8x 40GbE links. The configuration of the storage is listed in Table 2. For the Dell EMC Lustre Storage connection, four servers (a Metadata Server pair (MDS) and an Object Storage Server pair (OSS)) were connected to 13/14th Generation servers via Intel® Omni-Path. The detailed network topology is illustrated in Figure 1.
Figure 1 Illustration of Networking Topology: only 8 compute nodes are illustrated here for the simplicity, 64 nodes of 13G/14G servers used for the actual tests.
2x Dell EMC PowerEdge R730 as MDS
Front end network: Intel® Omni-Path
Internal connections: 12Gbps SAS
The test data was chosen from one of Illumina’s Platinum Genomes. ERR091571 was processed with Illumina HiSeq 2000 submitted by Illumina and can be obtained from EMBL-EBI. The DNA identifier for this individual is NA12878. Although the description of the data from the linked website shows that this sample has a 30x depth of coverage, in reality it is closer to 10x coverage according to the number of reads counted.
Dell EMC PowerEdge C6320s and C6420s were configured as listed in Table 1. The tests performed here are designed to demonstrate performance at the server level, not for comparisons on individual components. At the same time, the tests were also designed to estimate the sizing information of Dell EMC Isilon F800/H600 and Dell EMC Lustre Storage. The data points in Figure 2 are calculated based on the total number of samples (X axis in the figure) that were processed concurrently. The number of genomes per day metric is obtained from total running time taken to process the total number of samples in a test. The smoothed curves are generated by using a polynomial spline with the piecewise polynomial degree of 3 generating B-spline basis matrix. The details of BWA-GATK pipeline information can be obtained from the Broad Institute web site.
Figure 2 BWA-GATK Benchmark over 13/14 generation with Dell EMC Isilon and Dell EMC Lustre Storage: the number of compute nodes used for the tests are 64x C6420s and 63x C6320s (64x C6320s for testing H600). The number of samples per node was increased to get the desired total number of samples processed concurrently. For C6320 (13G), 3 samples per node was the maximum number of samples each node can process. 64, 104, and 126 test results for 13G system (blue) were with 2 samples per node while 129, 156, 180, 189 and 192 sample test results were obtained from 3 samples per node. For C6420 (14G), the tests were performed with maximum 5 samples per node. The plot for 14G was generated by processing 1, 2, 3, 4, and 5 samples per node. The number of samples per node is limited by the amount of memory in a system. 128 GB and 192 GB of RAM were used in 13G and 14G system, respectively as shown in Table 1. C6420s show a better scaling behavior than C6320s. 13G server with Broadwell CPUs seems to be more sensitive to the number of samples loaded onto system as shown from the results of 126 vs 129 sample tests on all the storages tested in this study.
The results with Dell EMC Isilon F800 that indicate C6320 with Broadwell/128GB RAM performs roughly 50 genomes per day less when 3 samples are processed per compute node and 30 genomes per day less when 2 samples are processed in each compute node compared to C6420. It is not clear if C6320’s performance will drop again when more samples are added to each compute node; however, it is obvious that C6420 does not show this behavior when the number of samples is increased on each compute node. The results also allow estimating the maximum performance of Dell EMC Isilon F800. As the total number of genomes increases, the increment of the number of genomes per day metric is slow down. Unfortunately, we were not able to identify the exact number of C6420s that would saturate Dell EMC Isilon F800 with four nodes. However, it is safe to say that more than 64x C6420s will require additional Dell EMC Isilon F800/H600 nodes to maintain high performance with more than 320 10x whole human genome samples. Dell EMC Lustre Storage did not scale as well as Dell EMC Isilon F800/H600. However, we observed that some optimizations are necessary to make Dell EMC Lustre Storage perform better. For example, the aligning, sorting, and marking duplicates steps in the pipeline perform extremely well when the file system’s stripe size was set to 2MB while other steps perform very poorly with 2MB stripe size. This suggests that Dell EMC Lustre Storage needs to be optimized further for these heterogeneous workloads in the pipeline. Since there is not any concrete configuration for the pipeline, we will further investigate the idea of using multiple tier file systems to cover different requirements in each step for both Dell EMC Isilon and Lustre Storage.
Dell EMC PowerEdge C6320 with Dell EMC Isilon H600 performance reached the maximum around 140 concurrent 10x human whole genomes. Running three 10x samples concurrently on a single node is not ideal. This limit appears to be on the compute node side, since H600 performance is much better with C6420s running a similar number of samples.
Dell EMC PowerEdge C6420 has at least a 12% performance gain compared to the previous generation. Each C6420 compute node with 192 GB RAM can process about seven 10x whole human genomes per day. This number could be increased if the C6420 compute node is configured with more memory. In addition to the improvement on the 14G server side, four Isilon F800 nodes in a 4U chassis can support 64x C6420s and 320 10x whole human genomes concurrently.
Internal web page http://en.community.dell.com/techcenter/blueprints/blueprint_for_hpc/m/mediagallery/20442903
External web page https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3956068/
Coming together with EMC has opened many new opportunities for the Dell EMC HPC Team to develop high-performance computing and storage solutions for the Life Sciences. Our lab recently stood up a ‘starter' 3 node Dell EMC Isilon X410 cluster. As a loyal user of the Isilon X210 in a previous role, I couldn’t wait to start profiling genomics applications using the X410 with Dell EMC HPC System for Life Sciences.
Because our current Isilon X410 storage cluster is currently fixed at the 3 node minimum, we aren’t set up yet to evaluate the scalability of the X410 with genomics workflows. We will tackle this work once our lab receives additional X nodes and the new the Isilon All-Flash node (formerly project Nitro).
In the meantime, I wanted to understand how the Isilon storage behaves relative to other storage solutions and decided to focus on the role of Isilon SmartConnect.
Through a single host name, SmartConnect enables client connection load balancing and dynamic network file system (NFS) failover and failback of client connections across storage nodes to provide optimal utilization of the Isilon cluster resources.
Without the need to install client-side drivers, administrators can easily manage a large and growing number of clients and ensure in the event of a system failure, in-flight reads and writes will successfully finish without failing.
Traditional storage systems with two-way failover typically sustain a minimum 50 percent degradation in performance when a storage head fails, as all clients must fail over to the remaining head. With Isilon SmartConnect, clients are evenly distributed across all remaining nodes in the cluster during failover, helping to ensure minimal performance impact.
To test this concept, I ran the GATK pipeline varying the number of samples and compute nodes without and with SmartConnect enabled on the Isilon storage cluster.
The configuration of our current lab environment and whole human genome sequencing data used for this evaluation are listed below.
Table 1 System configuration, software, and data
Dell EMC HPC System for Life Sciences
40 x PowerEdge C6320
2 x Intel Xeon E5-2697 v4. 18 cores per socket, 2.3 GHz
128 GB at 2400 MT/s
10GbE NIC and switch for accessing Isilon &
Intel Omni-Path fabric
Red Hat Enterprise Linux 7.2
10x Whole human genome sequencing data from Illumina HiSeq 2000, Total number of reads = 211,437,919
As noted earlier, the Isilon in our environment is currently set up in a minimum 3 node configuration. The current generation of Isilon is scalable up to 144 nodes. As you add additional Isilon nodes, the aggregate memory, throughput, and IOPS scale linearly. For a deep dive on Isilon and OneFS file system, see this technical white paper.
The Isilon storage cluster in our lab is summarized in Table 2. The Isilon storage is mounted on each compute nodes, up to 40 nodes, through NFS service (version 3) over a 10GbE network.
Table 2: Current Isilon Configuration
Dell EMC Isilon
3 x X410
2 x Intel(R) Xeon(R) CPU E5-2640 v2 @ 2.00GHz (1995.20-MHz K8-class CPU), 16 cores.
256 GB at 2400 MT/s
2 x IB QDR links
2 x 1GbE ports and 2 x 10GbE SFP+ ports
4TB x 30 HDDs, 120TB (usable)
Round Robin Mode
Table 3 summarizes all the tests we performed. To mimic a storage environment without proper load balancing, all the tests were performed without SmartConnect enabled except the concurrent 120 sample run.
Each pipeline (job) runs with one sample and uses 11 cores on a single compute node. A maximum three pipelines run concurrently on a single compute node. The tests were performed up to 40 nodes and 120 samples.
I included the detailed running times for each sub-step in BWA-GATK pipeline in Table 3. The running times for Aligning & Sorting and HaplotypeCaller steps are the bottlenecks in the pipeline, but the Aligning & Sorting step is more sensitive to the number of samples. In this benchmark, GenotypeGVCFs is not a bottleneck since we used an identical sequence data for all the concurrent pipelines. In real data analysis, a large number of different samples are used, and GenotypeGVCFs becomes the major bottleneck.
Table 3 Test results for BWA-GATK pipeline without and with Isilon SmartConnect enabled.
Number of samples (Data Size)
*120 with SmartConnect
Number of compute nodes
Aligning & Sorting
Generate Realigning Targets
Realign around InDel
Total Running Time (Hrs)
Number of Genomes per Day
As shown in Table 3, after 30 samples on 10 compute nodes, the total running time of BWA-GATK began to increase. When the number of compute nodes doubled, the total run time continued to climb without SmartConnect enabled. Also starting at 30 samples, we saw jobs starting to fail presumably due to unbalanced client connections and inability to failover and failback those connections.
However, when we enabled SmartConnect using the default round-robin settings, we saw a significant improvement on the total run time and daily sample throughput.
As expected, SmartConnect, maximized performance by keeping client connections balanced across all three Isilon storage nodes. In the three X410 configuration with SmartConnect enabled, the 120 samples processed with 40 compute nodes showed a 14% speed-up and 19% increased daily sample throughput.
This test also suggests a starting point identifying the number client connections per Isilon node for this genomics workflow. In our case, adding one additional X410 to the Isilon storage cluster for each 15 additional compute nodes may be a reasonable place to start.
As we add additional Isilon nodes to our cluster, we will perform additional studies to refine recommendations for the number of client connections per Isilon node for this genomics workflow. We’ll also take a deeper dive with the advanced SmartConnect load balancing options like CPU utilization, connection count, and network throughput. The Isilon SmartConnect White Paper provides a detailed summary and examples for each of these modes.
If you are using Bright Cluster Manger and need some tips to set it up with Isilon SmartConnect, read this post.
This blog explores the scaling behaviors of the most popular three short read sequence aligners; BWA-mem, Bowtie and Bowtie2. Aligning process is the beginning of all Next Generation Sequencing (NGS) data analysis and typically the most time-consuming step in any NGS data analysis pipeline.
It is clear that using more cores for the aligning process will help to speed up the process, but as you already know parallelization comes with cost. Higher complexity, perhaps order of magnitude, due to multiple instruction streams and data flowing between the instruction streams can easily overshadow the speed gain by parallelizing the process. Hence, it is not wise to use the entire cores in a compute node to speed up blindly especially when the overall throughput is more important. Identifying a sweet spot for the optimal number of cores for each aligning process and maximizing the overall throughput at the same time is not an easy task in a complex workflow.
Table 1 Server configuration and software
PowerEdge FC430 in FX2 chassis - 2 Sockets
Intel® Xeon® Dual E5-2695 v3 - 14 cores, total 28 physical cores
128GB - 8x 16GB RDIMM, 2133 MT/s, Dual Rank, x4 Data Width
480TB IEEL (Lustre)
Red Hat Enterprise 6.6
Cluster Management tool
Bright Cluster Manager 7.1
Short Sequence Aligner
BWA 0.7.2-r1039, Bowtie 1.1.2, Bowtie 2.2.6
Table 1 shows the summary of the system for the test here. Hyperthreading was not enabled for the test although it helps to improve the overall performance.
In Figure 1, BWA running times were measured for each different sizes of paired end read sequencing data. The size of sequence read data is represented as million fragments (MF) here. For example, 2MF means that the sequence data consist of two fastq files containing two million sequence reads in each file. One read is in a fastq file, and the corresponding paired read is in the other fastq file. A sweet spot for BWA-mem is from 4 to 16 cores roughly depending on the sequence read file size. Typically, the sequence read size is larger than 10 million and hence, 2MF and 10MF results are not realistic. However, larger sequence read data follow the behavior of the smaller input sizes as well. In the speedup chart, blue solid line labeled as 'ideal' represents the theoretical speedup we could obtain by increasing the number of cores.
Bowtie results in Figure 2 shows that the similar sweet spot comparing to BWA-mem results. However, the running time is slightly faster than BWA-mem’s. It is notable that Bowtie and Bowtie2 are sensitive to the read lengths while BWA-mem shows more consistent scaling behavior regardless of the sequence read lengths.
Although Bowtie2 is even faster than Bowtie, it is more sensitive to the length of the sequence reads as shown in Figure 3. Actually, the total number of nucleotides could be a better matrix to estimate the running time for a given sequence read data.
In practice, there are more factors to consider to maximize the overall throughput. One of critical factors is the bandwidth of existing storages in order to utilize the sweet spot instead of using the entire cores in a compute node. For example, if we decided to use 14 cores for each aligning process instead of using 28 cores, this will double up the number of samples processed simultaneously. However, the twice number of processes will fill up the limited storage bandwidth, and overly used storages will slow down significantly the entire processes running simultaneously.
Also, these aligning processes are not used alone typically in a pipeline. It is frequently tied up with a file conversion and a sorting process since subsequent analysis tools requiring these alignment results sorted in either chromosome coordinates or the name of the reads. The most popular approach is to use ‘pipe and redirection’ to save time to write multiple output files. However, this practice makes the optimization harder since it generally requires more computational resources. More detailed optimization for NGS pipelines in this aspect will be discussed in the next blogs.
Variant call process refers to the identification of a nucleotide difference from reference sequences at a given position in an individual genome or transcriptome. It includes single nucleotide polymorphism (SNPs), insertion/deletions (indels) and structural variants. One of the most popular variant calling applications is GenomeAnalysisTK (GATK) from Broad Institute. Often this GATK is used with BWA to compose a variant calling workflow focusing on SNPs and indels. After we published Dell HPC System for Genomics White Paper last year, there were significant changes in GATK. The key process, variant call step UnifiedGenotyper is no longer recommended in their best practice. Hence, here we recreate BWA-GATK pipeline according to the recommended practice to test whole genome sequencing data from mammals and plants in addition to human’s whole genome sequencing data. This is a part of Dell’s effort to help customers estimating their infrastructure needs for their various genomics data loads by providing a comprehensive benchmark.
The detailed configuration is in Dell HPC System for Genomics White Paper, and the summary of system configuration and software is in Table 2.
40x PowerEdge FC430 in FX2 chassis
Total of 1120 cores: Intel® Xeon® Dual E5-2695 v3 - 14 cores
sambamba 0.6.0, samtools 1.2.1
The current version of GATK is 3.5, and the actual workflow tested obtained from the workshop, ‘GATK Best Practices and Beyond’. In this workshop, they introduce a new workflow with three phases.
Here we tested out phase 1, phase 2A and phase3 for germline variant call pipeline. The details of commands used in benchmark are listed below.
bwa mem -c 250 -M -t [number of threads] -R ‘@RG\tID:noID\tPL:illumine\tLB:noLB\tSM:bar’ [reference chromosome] [read fastq 1] [read fastq 2] | samtools view -bu - | sambamba sort -t [number of threads] -m 30G --tmpdir [path/to/temp] -o [sorted bam output] /dev/stdin
sambamba markdup -t [number of threads] --remove-duplicates --tmpdir=[path/to/temp] [input: sorted bam output] [output: bam without duplicates]
java -d64 -Xms4g -Xmx30g -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -nt [number of threads] -R [reference chromosome] -o [target list file] -I [bam without duplicates] -known [reference vcf file]
java -d64 -Xms4g -Xmx30g -jar GenomeAnalysisTK.jar -T IndelRealigner -R [reference chromosome] -I [bam without duplicates] -targetIntervals [target list file] -known [reference vcf file] -o [realigned bam]
java -d64 -Xms4g -Xmx30g -jar GenomeAnalysisTK.jar -T BaseRecalibrator -nct [number of threads] -l INFO -R [reference chromosome] -I [realigned bam] -known [reference vcf file] -o [recalibrated data table]
java -d64 -Xms8g -Xmx30g -jar GenomeAnalysisTK.jar -T PrintReads -nct [number of threads] -R [reference chromosome] -I [realigned bam] -BQSR [recalibrated data table] -o [recalibrated bam]
java -d64 -Xms4g -Xmx30g -jar GenomeAnalysisTK.jar -T BaseRecalibrator -nct [number of threads] -l INFO -R [reference chromosome] -I [recalibrated bam] -known [reference vcf file] -o [post recalibrated data table]
java -d64 -Xms8g -Xmx30g -jar GenomeAnalysisTK.jar -T AnalyzeCovariates -R [reference chromosome] -before [recalibrated data table] -after [post recalibrated data table] -plots [recalibration report pdf] -csv [recalibration report csv]
java -d64 -Xms8g -Xmx30g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -nct [number of threads] -R [reference chromosome] -ERC GVCF -BQSR [recalibrated data table] -L [reference vcf file] -I [recalibrated bam] -o [gvcf output]
java -d64 -Xms8g -Xmx30g -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -nt [number of threads] -R [reference chromosome] -V [gvcf output] -o [raw vcf]
java -d64 -Xms512m -Xmx2g -jar GenomeAnalysisTK.jar -T VariantRecalibrator -R [reference chromosome] --input [raw vcf] -an QD -an DP -an FS -an ReadPosRankSum -U LENIENT_VCF_PROCESSING --mode SNP --recal_file [raw vcf recalibration] --tranches_file [raw vcf tranches]
java -d64 -Xms512m -Xmx2g -jar GenomeAnalysisTK.jar -T ApplyRecalibration -R [reference chromosome] -input [raw vcf] -o [recalibrated filtered vcf] --ts_filter_level 99.97 --tranches_file [raw vcf tranches] --recal_file [raw vcf recalibration] --mode SNP -U LENIENT_VCF_PROCESSING
Torque/Maui is used to manage a large number of jobs to process sequencing samples simultaneously. Optional steps, 6, 7 and 8 in phase 1 were not included in the benchmark since Step 6 PrintRead took 12.5 hours with 9 threads for Bos Taurus sample (18 hours with single thread). These optional steps are not required, but these steps are useful for the reporting purpose. If it is necessary, it can be added as a side workflow to the main procedure. For each job, 9 cores were assigned when 120 concurrent jobs were processed concurrently and 13 cores were used for the test of 80 concurrent jobs.
In addition to the benchmark for human whole genome sequencing data published in the whitepaper, we gathered cow, pig, two sub-species of rice (japonica and indica) and corn reference genomes from Illumina’s iGenome site and Ensembl database. Fortunately, reference variant call data exist as a standard VCF file format for human, cow and pig. A variant data for japonica rice were obtained from 3000 Rice Genome on AWS and was modified according to the standard VCF file format. However, the chromosome coordinates in this VCF file do not match to the actual reference chromosome sequences, and we were not able to find matching version of reference variant information from public databases. For indica rice and corn, we gathered the variant information from Ensembl and converted them into a compatible VCF format. Whole genome sequencing data were obtained from European Nucleotide Archive (ENA). ENA Sample IDs in Table 1 are the identifiers allow to retrieve sequence data from the site. Although it is not ideal to test an identical input for large number of processes, it is not feasible to obtain large number of similar sample data from public databases.
Table 2 WGS test data for the different species: * x2 indicates the data is paired end reads. ƚTest ID column represent identifiers for the sequence data used throughout the test.
ENA Sample ID
Sample Base Count
Single file Size x2*
Reference Genome size (bp)
Depth of Coverage
Number of variants in Ref
Homo sapiens (human)
17 GB x2
54 GB x2
Bos Taurus (cow)
35 GB x2
12 GB x2
Sus scrofa (pig)
19 GB x2
10 GB x2
Oryza sativa (rice)
22 GB x2
4 GB x2
Zea mays (corn)
14 GB x2
After mapping and sorting of the sequence input files, quality statistics were obtained from the output files of Phase 1, Step 1. SRR17060031 sample is from bovine gut metagenomics study and was not well mapped onto Bos taurus UMD3.1 reference genome from Ensembl as expected. The majority of DNAs from bovine gut is foreign and has different sequence composition.
Table 3 Mapping qualities of sequence reads data; obtained by using ‘samtools flagstat’. ‘Total QC-passed reads’ is the number of reads passed the criteria of sequencing quality. Among all QC-passed reads, the number of reads actually mapped on a reference genome and its percentage is on ‘Mapped reads (%)’ column. ‘Paired in sequencing’ column is the number of paired reads properly paired by a sequencer. Among the reads properly paired in sequencing, the number of those paired reads mapped on a reference genome as paired reads is listed in ‘Properly paired (%) in mapping.
Total QC-passed reads
Mapped reads (%)
Paired in sequencing
Properly paired (%) in mapping
29,291,792 ( 3.60%)
28,813,072 ( 3.54%)
The rest of samples were properly aligned on the reference genome with high quality; more than 80% of reads paired in sequencing data properly mapped as pairs on reference genomes.
It is also important to check what level of mismatches exists in the aligning results. The estimated variance in human genome is one in every 1,200 to 1,500 bases. This makes 3 million base differences between any two people randomly picked. However, as shown in Table 4, the results are not quite matched to the 3 million base estimation. Ideally, 36 million mismatches should be shown in Hs1 data set since it covers the human reference genome 13 times. However, the rate of mismatches is quite higher than the estimation, and at least one out of two variants reported by the sequencing might be an error.
Table 4 The number of reads mapped perfectly on a reference genome and the number of reads mapped partially
Number of reads mapped with mismatches (mm)
Perfect match (%)
One mm (%)
Two mm (%)
Three mm (%)
Four mm (%)
Five mm (%)
Total run time is the elapsed wall time from the earliest start of Phase 1, Step 1 to the latest completion of Phase 3, Step 2. Time measurement for each step is from the latest completion time of the previous step to the latest completion time of the current step as described in Figure 1.
The running time for each data set is summarized in Table 4. Clearly the input size, size of sequence read files and reference genomes are the major factors affecting to the running time. The reference genome size is a major player for ‘Aligning & Sorting’ step while the size of variant reference affects most on ‘HaplotypeCaller’ step.
Table 5 running time for BWA-GATK pipeline
Bos taurus (cow)
Total read size, gzip compressed (GB)
Number of samples ran concurrently
Run Time (hours)
Total Run Time
Number of Genomes per day
The running time of the current version, GATK 3.5 is overly slower than the version of 2.8-1 we tested in our white paper. Particularly, HaplotypeCaller in the new workflow took 4.52 hours while UnifiedGenotyper in the older version took about 1 hour. Despite of the significant slow-down, GATK team believes HaplotypeCaller brings a better result, and that is worthy for the five times longer run.
There are data issues in non-human species. As shown in Table 4, for the similar size of inputs, Hs1 and Ss1 show large difference in the running time. The longer running time in non-human species can be explained by the quality of reference data. Aligning and sorting process takes more than twice times in other mammals, and it became worse in plants. It is known that plants genomes contain large number of repeat sequences which make mapping process difficult. It is important to note that the shorter running time for HaplotypeCaller in rice does not reflect a real time since the size of the reference variant file was reduced significantly due to the chromosome length/position mismatches in the data. All the variant records outside of chromosome range were removed, but position mismatches were used without corrections. The smaller size of the reference variant information and wrong position information the running time of HaplotypeCaller shorter. Corn’s reference data is not any better in terms of the accuracy of these benchmark. These data errors are the major causes of longer processing time.
Nonetheless, the results shown here could serve as good reference points for the worst case running time. Once reference data are cleaned up by researchers, the overall running time for other mammals should be similar to the one from Hs1 in Table 4 with a proper scaling of input size. However, it is hard to estimate an accurate running times for non-human species at this moment.
Author - Kihoon Yoon
First step in NGS workflow is to obtain samples for the sequencing. Either DNAs or RNAs need to be extracted from samples. Unfortunately, sequencing technology is not at the stage that we could load samples directly onto a sequencer. The samples have to be sheared and become random ‘short’ fragments. Also, depending on NGS platforms, DNA or RNA short fragments have to be amplified in order to obtain sufficient amount of short DNA/RNA fragments before sequencing. This non glamorous traditional laboratory work is still very critical for obtaining high quality sequencing results.
There are small number of different NGS platforms available currently, but Illumina sequencers are the industry leading platform. More or less, these different platforms are based on sequencing by synthesis (SBS) technology except Oxford Nanopore. The fundamental of SBS technology is to capture signals when DNA polymerases add labeled complementary nucleotides on target sequences.
After a sequencer generates short sequence reads, these reads need to be mapped onto a reference genome to figure out the origins since a sequencer generates nothing, but millions of short DNA/RNA fragments. Adding meaningful labels on these short sequences, so called aligning, is the beginning of NGS data analysis. This enables downstream analyses. There are many different flavor of aligners available, but not all aligners are RNA-Seq analysis ready.
Figure 5 DNA-Seq vs RNA-Seq (The original image was obtained form https://upload.wikimedia.org/wikipedia/commons/9/91/Chromosome.gif and modified.)
As illustrated on Figure 2, there are many ways to apply NGS in different studies. However, in terms of target sequences, those applications can be organized into two groups, DNA-Seq and RNA-Seq. Only about 9.2 percent of human DNA does something, and little over 1 percent of human genomes codes proteins. And, these coding regions are shared by about less than 20,000 genes. Hence, a small fraction of chromosome sequences transcribed to mRNAs in the nucleusof cell as shown in Figure 5. Once pre-RNAs are synthesized, splicing mechanism process removes introns and rejoins exons to create messenger RNAs (mRNAs). These matured RNAs are called messenger RNAs since they are transported out to cytoplasm to deliver genetic information to protein synthesis apparatus located in cytoplasm. This makes mRNA sequences (transcriptome) quite different from gene sequences on chromosomes.
DNA sequences tell scientists which part of chromosomes carries genetic information (locations of genes) and which stretch carries regulatory information. More importantly, scientists can spot the changes in genes or regulatory regions that may cause disease. Unfortunately, these gene information do not come with sequence itself. There has been tremendous effort to label these sequences according to their functions such as ENCODE project. Without having a proper annotation for genome sequences, sequences are just very expensive garbage. Nonetheless, the ability to sequence genomes quickly and cheaply allows doctors to identify the particular type of disease a patient has. It will still takes some time that sequencing becomes a routine in the doctor’s office, but the cheaper and faster NGS technology creates vast potentials for diagnostics and therapies.
Preparation of DNA sequencing sample is called as ‘library preparation’. There are many different version of reagent kits commercially available. All these kits rely on a series of standard molecular biology reactions. A common procedure includes purifying genomic DNAs, making random DNA fragments and adding adaptors (known short DNA fragments attached to both ends of purified- fragmented DNAs). Genomic DNA library is then amplified through PCR (polymerase chain reaction) to increase the amount of target sequences to meet the requirements for various sequencers. Once sequencing is completed, sequence read files need to be mapped onto proper reference sequences with annotations. This is the so-called ‘aligning’ process. Since genomic DNAs were sheared randomly, the final sequence read files do not contain any information of their origin. Aligning process is used to map the locations of sequence reads on chromosomes and finding what genes are on that locations by looking up the annotation comes with a reference sequence. After alignment is done, further statistical analyses reveal where nucleotide variations are located on genes. This is an example of variance analysis, and the procedure is subjective and differs for different types of studies.
Not all genes are expressed at a given moment in a cell, and the distribution of mRNAs is cell or organ specific. Although all cells have the same copy of genes in their chromosomes, cells in different organs exhibit distinct gene expression patterns. There is not a direct correlation between the distributions of mRNAs and proteins, but mRNA profiling provides a better clue to estimate protein level activities.
Library preparation for RNA-Seq is nearly identical to the one for DAN-Seq. Extracting RNAs is the first step, but due to the instability of RNA molecules, these RNA sequences have to be converted to complementary DNAs before sequencing. The purpose of RNA-Seq is to obtain the information of which genes are expressed and the amount they are expressed at a given moment. Normally, RNA-Seq data is much harder to automate since there are many potential workflows. There should not be an expectation of uniform results for all RNA-Seq like what you expect from DNA-Seq because gene expression is changed over the time. However, this variability provides a unique opportunity to compare differences between two cells under different conditions. For example, RNA-Seq is useful to compare gene expressions between normal cells and cancer cells from a patient.
Neither DNA-Seq nor RNA-Seq alone can provide a complete answer for any biological question. This is not only NGS’s problem, but also any experimental process or single technology in life sciences. A conclusion drawn from a controlled experiment will not actually reflect the real conditions in a life form. There is always the possibility that the same experiment with similar conditions might end up with a completely different result. Besides these innate difficulties NGS comes with errors, like any other technology; hence it requires a careful interpretation. Again, NGS is currently the best technology we can use, but nothing magical.
Recently, Genomics has been getting spotlights upon the completion of the Human Genome Project in 2003 and a series of innovations in sequencing technologies. As NGS technology matures and the cost of sequencing is dropping fast, there is an explosion in the amount of data, and it elevates the necessity of computer infrastructures. Clearly, there is no doubt that NGS technology brought the “Gene-omics” revolution to the field and opened up a whole new HPC market. However, from a Biologist's viewpoint, there are some important aspects missed by media and Sci-fi movies.
Figure 1 Central Dogma (Obtained from http://www.dbriers.com/tutorials/wp-content/uploads/2013/12/Central_Dogma-GODS-wikimedia.jpg)
Life forms arose from the edge of Chaos. This is a statement that well describes the uncertainty and complexity of life forms. Unlike many other scientific fields, especially quantitative research areas, Biology is a science where the analysis and predictions were probabilistic. Most outcomes of biological research depend on hypothesis, speculations and their controlled environment. Nonetheless, the field of Genomics equipped with mighty sequencers still provide only a part of the entire story. As you can see from the central dogma illustration, Figure 1, genetic information flow does not stop at DNA or RNA level. The ultimate goal is to gain detailed insight on proteins, the final products of gene expression as well as work-horses for maintaining life. There are complex controlling mechanisms between each step in Figure 1. Although proteins are the final products, they also regulate transcription and translation at DNA and RNA levels. It is like a feedback-loop in a simple view, but the complexity of genomic system is way beyond a single line feedback-loop. They are actually multilayer feedback-networks intertwined such as transcriptional, post-transcriptional and translational regulatory networks. We still are quite far from the completion!
All the recent excitement about NGS is reminiscent of gene therapy from the late 80s. The rush to apply gene therapy led to a series of failures caused tragic deaths and unexpected cancers and put the field into dark days. Without having complete and accurate knowledge, manipulating genes should be taken cautiously since we do not know yet what we do not know now.
Figure 2 NGS usages in Genomics
Knowledge in DNA sequences is not only important in basic life science research, but also indispensable in biotechnology, medical diagnosis and forensic biology. NGS technology brings tremendous advances in those fields consuming DNA sequences as a key information: by lowering sequencing cost, shortening sequencing time and increasing its accuracy.
In Figure 2, typical NGS applications based on the sequencing targets are listed. It is natural that these applications fall into two major groups, DNA-Seq or RNA-Seq. Most of the DNA-Seq studies are focused on genomic mutations or Single Nucleotide Polymorphism (SNP) in chromosomes while RNA-Seq studies are obtaining snapshots of the entire gene expressions at a given point in time. RNA-Seq rapidly replaces microarray technology and becomes a standard procedure for differentially expressed gene and RNA-Profiling studies.
The ability to perform DNA-Seq and RNA-Seq at the same time for the same sample opened up a new opportunity in translational research area (Figure 3). Identifying mutations from genes in chromosomes and knowing how those mutations change the gene expressions brings a tremendous advantage to completing the puzzle from ‘the central dogma’. Although we still do not have comprehensive protein information, it is much easier to look into a small number of target proteins instead of testing large number of proteins. Currently, there are various projects to label genomic variations in humans according to their disease type, phenotypic characters, drug treatments, clinical outcomes and so on. This data integration efforts will provide the basis for precision medicine and personalized medicine in the near future.
Figure 4 Overview of NGS data flow
NGS technology has brought a broad impact on the different layer of studies (see Figure 3), but the computational infrastructure requirements for these areas have not been correctly assessed. By the nature of basic life science and translational study, these two areas require the most flexible, powerful, and customizable infrastructure in terms of analysis phase in Figure 4. Large amount of samples and existing data are constantly regrouped and reanalyzed to elucidate biological phenomena in basic life science research while experts in translational research area seek any applicable knowledge toward to medical area. The data and workloads should be similar in basic life science research and translational research.
However, translational research brings simple and consolidated methods which can be applied at the clinical area, to help making treatment decisions, or to provide any useful process that can improve human health and well-being. For an example, a set of key genes, say 10 genes, are identified as breast cancer progression in basic life science area. Experts in translational research pick that knowledge to develop method to suppress those identified genes, to stop or slow down cancer progression efficiently in different individuals through clinical trials or large cohort studies. As the results of the study, experts might figure out there are different subtypes of breast cancers which respond differently to various drug treatments based on the expression of those 10 key genes. This will allow development of a gene panel to detect such a subtype to go through different drug treatment options. Dealing with only 10 genes out of 20,000 genes will be a lot more applicable. As you can guess from this scenario, there are different computational requirements among the different layer of studies. While high throughput is an important metric in basic life science and translational research areas, the speed of single sample processing time is more important in the clinical field.