
High Performance Computing
A discussion venue for all things high performance computing (HPC), supercomputing, and the technologies that enable scientific research and discovery.
  • Understanding the Role of Dell EMC Isilon SmartConnect in Genomics Workloads

Coming together with EMC has opened many new opportunities for the Dell EMC HPC Team to develop high-performance computing and storage solutions for the life sciences. Our lab recently stood up a 'starter' 3-node Dell EMC Isilon X410 cluster. As a loyal user of the Isilon X210 in a previous role, I couldn't wait to start profiling genomics applications using the X410 with the Dell EMC HPC System for Life Sciences.

Because our Isilon X410 storage cluster is currently fixed at the 3-node minimum, we aren't yet set up to evaluate the scalability of the X410 with genomics workflows. We will tackle this work once our lab receives additional X nodes and the new Isilon All-Flash node (formerly project Nitro).

     In the meantime, I wanted to understand how the Isilon storage behaves relative to other storage solutions and decided to focus on the role of Isilon SmartConnect.

    Through a single host name, SmartConnect enables client connection load balancing and dynamic network file system (NFS) failover and failback of client connections across storage nodes to provide optimal utilization of the Isilon cluster resources.

Because no client-side drivers need to be installed, administrators can easily manage a large and growing number of clients and ensure that, in the event of a system failure, in-flight reads and writes complete successfully.

     Traditional storage systems with two-way failover typically sustain a minimum 50 percent degradation in performance when a storage head fails, as all clients must fail over to the remaining head. With Isilon SmartConnect, clients are evenly distributed across all remaining nodes in the cluster during failover, helping to ensure minimal performance impact.

To test this concept, I ran the BWA-GATK pipeline while varying the number of samples and compute nodes, both without and with SmartConnect enabled on the Isilon storage cluster.

    The configuration of our current lab environment and whole human genome sequencing data used for this evaluation are listed below. 

Table 1 System configuration, software, and data

Dell EMC HPC System for Life Sciences
Server: 40 x PowerEdge C6320
Processor: 2 x Intel Xeon E5-2697 v4, 18 cores per socket, 2.3 GHz
Memory: 128 GB at 2400 MT/s
Interconnect: 10GbE NIC and switch for accessing Isilon; Intel Omni-Path fabric

Software
Operating System: Red Hat Enterprise Linux 7.2
BWA: 0.7.2-r1039
Samtools: 1.2.1
Sambamba: 0.6.0
GATK: 3.5

Benchmark Data
ERR091571: 10x whole human genome sequencing data from an Illumina HiSeq 2000; total number of reads = 211,437,919

As noted earlier, the Isilon in our environment is currently set up in the minimum 3-node configuration. The current generation of Isilon scales up to 144 nodes, and as you add Isilon nodes, the aggregate memory, throughput, and IOPS scale linearly. For a deep dive on Isilon and the OneFS file system, see this technical white paper.

The Isilon storage cluster in our lab is summarized in Table 2. The Isilon storage is mounted on each compute node (up to 40 nodes) through NFS (version 3) over a 10GbE network.

Table 2: Current Isilon Configuration

Dell EMC Isilon
Server: 3 x X410
Processor: 2 x Intel Xeon E5-2640 v2 @ 2.00 GHz, 16 cores
Memory: 256 GB at 2400 MT/s
Back-end Networking: 2 x IB QDR links
Front-end Networking: 2 x 1GbE ports and 2 x 10GbE SFP+ ports
Storage Capacity: 30 x 4TB HDDs, 120TB (usable)

Software
Operating System: OneFS 8.0
Isilon SmartConnect: Round-robin mode
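For reference, each compute node reaches the storage through the SmartConnect zone name rather than an individual node address, so OneFS can balance and fail over the client connections. Below is a minimal sketch of the NFSv3 mount on a compute node; the zone name, export path, and mount point are illustrative assumptions, not our exact values.

# Mount the Isilon export via the SmartConnect zone name (NFSv3 over 10GbE).
# "isilon-sc.lab.example.com" and the paths are placeholders.
sudo mkdir -p /mnt/isilon
sudo mount -t nfs -o vers=3,tcp,hard,rw isilon-sc.lab.example.com:/ifs/data /mnt/isilon

# Confirm the mount; the client is pinned to whichever node IP SmartConnect handed out.
mount | grep /mnt/isilon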

Table 3 summarizes all the tests we performed. To mimic a storage environment without proper load balancing, all the tests were performed without SmartConnect enabled except the concurrent 120-sample run.

Each pipeline (job) processes one sample and uses 11 cores on a single compute node, with a maximum of three pipelines running concurrently per compute node. The tests were performed on up to 40 nodes and 120 samples.
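Purely as an illustration of this packing, three single-sample pipelines could be started on one node as sketched below; the wrapper script name and launch mechanism are assumptions, and in practice a job scheduler would handle the placement.

# Hypothetical sketch: three single-sample pipelines on one compute node, 11 cores each.
for sample in sample01 sample02 sample03; do
    ./run_bwa_gatk_pipeline.sh "$sample" 11 &   # placeholder wrapper: <sample> <threads>
done
wait   # block until all three pipelines on this node finish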

Table 3 includes the detailed running times for each sub-step of the BWA-GATK pipeline. The Aligning & Sorting and HaplotypeCaller steps are the bottlenecks in the pipeline, but Aligning & Sorting is more sensitive to the number of samples. In this benchmark, GenotypeGVCFs is not a bottleneck because we used identical sequence data for all the concurrent pipelines; in real analyses, where a large number of different samples are combined, GenotypeGVCFs becomes the major bottleneck.

Table 3 Test results for the BWA-GATK pipeline without and with Isilon SmartConnect enabled (times in hours)

Number of samples | 3 | 15 | 30 | 60 | 90 | 120 | 120 (with SmartConnect)
Number of compute nodes | 1 | 5 | 10 | 20 | 30 | 40 | 40
Aligning & Sorting | 3.48 | 3.54 | 3.64 | 4.21 | 4.94 | 5.54 | 4.69
Mark/Remove Duplicates | 0.46 | 0.49 | 0.79 | 1.27 | 2.52 | 3.07 | 1.84
Generate Realigning Targets | 0.19 | 0.18 | 0.19 | 0.19 | 0.18 | 0.20 | 0.18
Realign around InDel | 2.22 | 2.20 | 2.24 | 2.27 | 2.26 | 2.27 | 2.29
Base Recalibration | 1.17 | 1.18 | 1.19 | 1.18 | 1.16 | 1.13 | 1.17
HaplotypeCaller | 4.13 | 4.35 | 4.39 | 4.34 | 4.31 | 4.32 | 4.29
GenotypeGVCFs | 0.02 | 0.02 | 0.02 | 0.02 | 0.02 | 0.02 | 0.02
Variant Recalibration | 0.58 | 0.50 | 0.53 | 0.55 | 0.57 | 0.53 | 0.41
Apply Recalibration | 0.02 | 0.02 | 0.02 | 0.02 | 0.02 | 0.02 | 0.02
Total Running Time (hrs) | 12.3 | 12.5 | 13.0 | 14.1 | 16.0 | 17.1 | 15.0
Number of Genomes per Day | 6 | 29 | 53 | 96 | 129 | 156 | 185

As shown in Table 3, beyond 30 samples on 10 compute nodes the total running time of the BWA-GATK pipeline began to increase, and it continued to climb as the number of compute nodes doubled without SmartConnect enabled. Also starting at 30 samples, we saw jobs begin to fail, presumably because client connections were unbalanced and could not fail over and fail back.

However, when we enabled SmartConnect with the default round-robin setting, we saw a significant improvement in both total run time and daily sample throughput.

As expected, SmartConnect maximized performance by keeping client connections balanced across all three Isilon storage nodes. In the three-node X410 configuration with SmartConnect enabled, the 120 samples processed on 40 compute nodes showed a 14% speed-up and a 19% increase in daily sample throughput.

This test also suggests a starting point for sizing the number of client connections per Isilon node for this genomics workflow. In our case, adding one X410 to the Isilon storage cluster for every 15 additional compute nodes may be a reasonable place to start.

As we add additional Isilon nodes to our cluster, we will perform further studies to refine the recommended number of client connections per Isilon node for this genomics workflow. We'll also take a deeper dive into the advanced SmartConnect load-balancing options, such as CPU utilization, connection count, and network throughput. The Isilon SmartConnect White Paper provides a detailed summary and examples for each of these modes.

If you are using Bright Cluster Manager and need some tips on setting it up with Isilon SmartConnect, read this post.

  • Scaling behavior of short read sequence aligners

This blog explores the scaling behavior of the three most popular short-read sequence aligners: BWA-MEM, Bowtie, and Bowtie2. Alignment is the first step of all Next Generation Sequencing (NGS) data analysis and typically the most time-consuming step in any NGS pipeline.

It is clear that using more cores speeds up the aligning process, but parallelization comes at a cost. The added complexity of coordinating multiple instruction streams and moving data between them can easily overshadow the gains, so it is not wise to blindly use every core in a compute node, especially when overall throughput matters most. Identifying the optimal number of cores for each aligning process while maximizing overall throughput is not an easy task in a complex workflow.

Table 1 Server configuration and software

Server: PowerEdge FC430 in FX2 chassis, 2 sockets
Processor: Dual Intel Xeon E5-2695 v3, 14 cores each (28 physical cores total)
Memory: 128GB - 8x 16GB RDIMM, 2133 MT/s, Dual Rank, x4 Data Width
Storage: 480TB IEEL (Lustre)
Interconnect: InfiniBand FDR
OS: Red Hat Enterprise Linux 6.6
Cluster Management Tool: Bright Cluster Manager 7.1
Short Sequence Aligners: BWA 0.7.2-r1039, Bowtie 1.1.2, Bowtie 2.2.6

     

Table 1 summarizes the system used for these tests. Hyper-threading was not enabled, although it can help improve overall performance.

In Figure 1, BWA-MEM running times were measured for paired-end read data sets of different sizes. The size of the sequence read data is expressed in million fragments (MF); for example, 2MF means the data consists of two FASTQ files with two million reads each, where one read is in one FASTQ file and its mate is in the other. The sweet spot for BWA-MEM is roughly 4 to 16 cores, depending on the read file size. Real data sets typically contain more than 10 million fragments, so the 2MF and 10MF results are less realistic; however, larger read data sets follow the same behavior as the smaller inputs. In the speedup chart, the solid blue line labeled 'ideal' represents the theoretical speedup we could obtain by increasing the number of cores.
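For clarity, the speedup plotted in these charts is the usual ratio of the single-core running time to the n-core running time, and the 'ideal' line corresponds to perfectly linear scaling:

\[
S(n) = \frac{T(1)}{T(n)}, \qquad S_{\mathrm{ideal}}(n) = n, \qquad E(n) = \frac{S(n)}{n}
\]

where T(n) is the measured wall time on n cores and E(n) is the parallel efficiency.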

      

The Bowtie results in Figure 2 show a sweet spot similar to the BWA-MEM results, although the running times are slightly shorter than BWA-MEM's. It is notable that Bowtie and Bowtie2 are sensitive to read length, while BWA-MEM shows more consistent scaling behavior regardless of read length.

     

Although Bowtie2 is even faster than Bowtie, it is more sensitive to read length, as shown in Figure 3. The total number of nucleotides may therefore be a better metric than read count for estimating the running time on a given data set.

In practice, there are more factors to consider when maximizing overall throughput. One critical factor is storage bandwidth: exploiting the sweet spot rather than using every core in a compute node only pays off if the storage can keep up. For example, using 14 cores per aligning process instead of 28 doubles the number of samples processed simultaneously, but twice as many processes also consume twice the storage bandwidth, and an oversubscribed storage system will significantly slow down all of the processes running at once.

Also, an aligner is rarely run alone in a pipeline. It is typically chained to a file conversion and a sorting step, because downstream analysis tools require the alignment results sorted either by chromosome coordinate or by read name. The most popular approach is to use pipes and redirection to avoid writing multiple intermediate files, as illustrated in the sketch below. However, this practice makes optimization harder because the combined processes require more computational resources. More detailed optimization of NGS pipelines in this respect will be discussed in future blogs.
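As a concrete illustration of the pipe-and-redirection pattern, the sketch below streams BWA-MEM output straight into BAM conversion and a coordinate sort, so no intermediate SAM file ever touches storage; the same pattern appears in the BWA-GATK pipeline later on this page. Thread counts, file names, and the sort buffer size are placeholder choices, not tuned recommendations.

# Align, convert to BAM, and coordinate-sort in one stream (no intermediate SAM on disk).
bwa mem -M -t 14 ref.fa sample_R1.fastq.gz sample_R2.fastq.gz \
    | samtools view -bu - \
    | sambamba sort -t 14 -m 30G --tmpdir /scratch/tmp -o sample.sorted.bam /dev/stdin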

  • Variant Calling Benchmark -- Not Only Human

Variant calling refers to identifying nucleotide differences from a reference sequence at given positions in an individual genome or transcriptome. It covers single nucleotide polymorphisms (SNPs), insertions/deletions (indels), and structural variants. One of the most popular variant calling applications is the Genome Analysis Toolkit (GATK) from the Broad Institute, which is often combined with BWA to compose a variant calling workflow focused on SNPs and indels. After we published the Dell HPC System for Genomics White Paper last year, there were significant changes in GATK: the key variant calling step, UnifiedGenotyper, is no longer recommended in the GATK Best Practices. Here we recreate the BWA-GATK pipeline according to the recommended practice and test whole genome sequencing data from mammals and plants in addition to human whole genome sequencing data. This is part of Dell's effort to help customers estimate their infrastructure needs for various genomics workloads by providing a comprehensive benchmark.

    Variant Analysis for Whole Genome Sequencing data

    System

The detailed configuration is in the Dell HPC System for Genomics White Paper; a summary of the system configuration and software is in Table 1.

Table 1 Server configuration and software

Server: 40x PowerEdge FC430 in FX2 chassis
Processor: Dual Intel Xeon E5-2695 v3, 14 cores each (1,120 cores total)
Memory: 128GB - 8x 16GB RDIMM, 2133 MT/s, Dual Rank, x4 Data Width
Storage: 480TB IEEL (Lustre)
Interconnect: InfiniBand FDR
OS: Red Hat Enterprise Linux 6.6
Cluster Management Tool: Bright Cluster Manager 7.1
Short Sequence Aligner: BWA 0.7.2-r1039
Variant Analysis: GATK 3.5
Utilities: sambamba 0.6.0, samtools 1.2.1

     

    BWA-GATK pipeline

The current version of GATK is 3.5, and the workflow tested here was obtained from the 'GATK Best Practices and Beyond' workshop, which introduces a new workflow organized into three phases (Phase 2 is split into germline and somatic variant calling):

    • Best Practices Phase 1: Pre-processing
    • Best Practices Phase 2A: Calling germline variants
    • Best Practices Phase 2B: Calling somatic variants
    • Best Practices Phase 3: Preliminary analyses

Here we tested Phase 1, Phase 2A, and Phase 3 for the germline variant calling pipeline. The details of the commands used in the benchmark are listed below.

    Phase 1. Pre-processing

    Step 1. Aligning and Sorting

bwa mem -c 250 -M -t [number of threads] -R '@RG\tID:noID\tPL:illumina\tLB:noLB\tSM:bar' [reference chromosome] [read fastq 1] [read fastq 2] | samtools view -bu - | sambamba sort -t [number of threads] -m 30G --tmpdir [path/to/temp] -o [sorted bam output] /dev/stdin

    Step 2. Mark and Remove Duplicates

    sambamba markdup -t [number of threads] --remove-duplicates --tmpdir=[path/to/temp] [input: sorted bam output] [output: bam without duplicates]

    Step 3. Generate Realigning Targets

     java -d64 -Xms4g -Xmx30g -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -nt [number of threads] -R [reference chromosome] -o [target list file] -I [bam without duplicates] -known [reference vcf file]

    Step 4. Realigning around InDel

    java -d64 -Xms4g -Xmx30g -jar GenomeAnalysisTK.jar -T IndelRealigner -R [reference chromosome] -I [bam without duplicates] -targetIntervals [target list file] -known [reference vcf file] -o [realigned bam]

    Step 5. Base Recalibration

    java -d64 -Xms4g -Xmx30g -jar GenomeAnalysisTK.jar -T BaseRecalibrator -nct [number of threads] -l INFO -R [reference chromosome] -I [realigned bam] -known [reference vcf file] -o [recalibrated data table]

    Step 6. Print Recalibrated Reads - Optional

    java -d64 -Xms8g -Xmx30g -jar GenomeAnalysisTK.jar -T PrintReads -nct [number of threads] -R [reference chromosome] -I [realigned bam] -BQSR [recalibrated data table] -o [recalibrated bam]

    Step 7. After Base Recalibration - Optional

    java -d64 -Xms4g -Xmx30g -jar GenomeAnalysisTK.jar -T BaseRecalibrator -nct [number of threads] -l INFO -R [reference chromosome] -I [recalibrated bam] -known [reference vcf file] -o [post recalibrated data table]

    Step 8. Analyze Covariates - Optional

    java -d64 -Xms8g -Xmx30g -jar GenomeAnalysisTK.jar -T AnalyzeCovariates -R [reference chromosome] -before [recalibrated data table] -after [post recalibrated data table] -plots [recalibration report pdf] -csv [recalibration report csv]

    Phase 2. Variant Discovery – Calling germline variants

    Step 1. Haplotype Caller

    java -d64 -Xms8g -Xmx30g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -nct [number of threads] -R [reference chromosome] -ERC GVCF -BQSR [recalibrated data table] -L [reference vcf file] -I [recalibrated bam] -o [gvcf output]

    Step 2. GenotypeGVCFs

    java -d64 -Xms8g -Xmx30g -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -nt [number of threads] -R [reference chromosome] -V [gvcf output] -o [raw vcf]

    Phase 3. Preliminary Analyses

    Step 1. Variant Recalibration

    java -d64 -Xms512m -Xmx2g -jar GenomeAnalysisTK.jar -T VariantRecalibrator -R [reference chromosome] --input [raw vcf] -an QD -an DP -an FS -an ReadPosRankSum -U LENIENT_VCF_PROCESSING --mode SNP --recal_file [raw vcf recalibration] --tranches_file [raw vcf tranches]

    Step 2. Apply Recalibration

    java -d64 -Xms512m -Xmx2g -jar GenomeAnalysisTK.jar -T ApplyRecalibration -R [reference chromosome] -input [raw vcf] -o [recalibrated filtered vcf] --ts_filter_level 99.97 --tranches_file [raw vcf tranches] --recal_file [raw vcf recalibration] --mode SNP -U LENIENT_VCF_PROCESSING

    Job Scheduling

Torque/Maui is used to manage the large number of jobs needed to process sequencing samples simultaneously. The optional Steps 6, 7, and 8 in Phase 1 were not included in the benchmark because Step 6 (PrintReads) alone took 12.5 hours with 9 threads for the Bos taurus sample (18 hours with a single thread). These optional steps are not required, but they are useful for reporting and can, if necessary, be added as a side workflow to the main procedure. Each job was assigned 9 cores when 120 jobs were processed concurrently, and 13 cores for the test with 80 concurrent jobs.
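For illustration, a Torque submission for one pipeline job in the 120-concurrent-job case might look like the sketch below. The 9-core request matches the description above, while the job name, queue, walltime, and script path are assumptions rather than the exact values used in the benchmark.

# Submit one single-sample BWA-GATK pipeline job requesting 9 cores on one node.
qsub -N bwa-gatk-sample01 -l nodes=1:ppn=9 -l walltime=48:00:00 -q batch run_bwa_gatk.sh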

    Data

In addition to the benchmark for human whole genome sequencing data published in the white paper, we gathered cow, pig, two sub-species of rice (japonica and indica), and corn reference genomes from Illumina's iGenomes site and the Ensembl database. Fortunately, reference variant call data exist in standard VCF format for human, cow, and pig. Variant data for japonica rice were obtained from the 3000 Rice Genomes project on AWS and reformatted to standard VCF; however, the chromosome coordinates in this VCF file do not match the actual reference chromosome sequences, and we could not find a matching version of the reference variant information in public databases. For indica rice and corn, we gathered the variant information from Ensembl and converted it into a compatible VCF format. Whole genome sequencing data were obtained from the European Nucleotide Archive (ENA); the ENA Sample IDs in Table 2 are the identifiers used to retrieve the sequence data from that site. Although it is not ideal to test an identical input across a large number of processes, it is not feasible to obtain a large number of similar samples from public databases.

Table 2 WGS test data for the different species. *x2 indicates paired-end reads. ƚThe Test ID column lists the identifiers used for each sequence data set throughout the tests.

Species | Test IDƚ | ENA Sample ID | Sample Base Count | Single File Size x2* | Reference Genome Size (bp) | Depth of Coverage | Number of Variants in Ref
Homo sapiens (human) | Hs1 | ERR091571 | 42,710,459,638 | 17 GB x2 | 3,326,743,047 | 13x | 3,152,430
Homo sapiens (human) | Hs2 | ERR194161 | 171,588,070,386 | 54 GB x2 | 3,326,743,047 | 52x | 3,152,430
Bos taurus (cow) | Bt1 | SRR1706031 | 82,272,305,762 | 35 GB x2 | 2,649,685,036 | 31x | 93,347,258
Bos taurus (cow) | Bt2 | SRR1805809 | 32,681,063,800 | 12 GB x2 | 2,649,685,036 | 12x | 93,347,258
Sus scrofa (pig) | Ss1 | SRR1178925 | 41,802,035,944 | 19 GB x2 | 3,024,658,544 | 14x | 52,573,286
Sus scrofa (pig) | Ss2 | SRR1056427 | 24,901,150,040 | 10 GB x2 | 3,024,658,544 | 8x | 52,573,286
Oryza sativa japonica (rice) | Osj | SRR1450198 | 49,676,959,200 | 22 GB x2 | 374,424,240 | 132x | 19,409,227
Oryza sativa indica (rice) | Osi | SRR3098100 | 12,191,702,544 | 4 GB x2 | 411,710,190 | 30x | 4,538,869
Zea mays (corn) | Zm | SRR1575496 | 36,192,217,200 | 14 GB x2 | 3,233,616,351 | 11x | 51,151,183

    Benchmark Results

    Data Quality

After mapping and sorting the sequence input files, quality statistics were obtained from the output of Phase 1, Step 1. As expected, sample SRR1706031, which comes from a bovine gut metagenomics study, did not map well onto the Bos taurus UMD3.1 reference genome from Ensembl: the majority of DNA from the bovine gut is foreign and has a different sequence composition.

Table 3 Mapping quality of the sequence read data, obtained with 'samtools flagstat'. 'Total QC-passed reads' is the number of reads that passed the sequencing quality criteria. 'Mapped reads (%)' is the number (and percentage) of QC-passed reads actually mapped to a reference genome. 'Paired in sequencing' is the number of reads paired by the sequencer. 'Properly paired (%) in mapping' is the number (and percentage) of those reads that mapped to the reference genome as proper pairs.

Species | Sequencing Reads | Test ID | Total QC-passed reads | Mapped reads (%) | Paired in sequencing | Properly paired (%) in mapping
Human | ERR091571 | Hs1 | 424,118,221 | 421,339,198 (99.34%) | 422,875,838 | 412,370,120 (97.52%)
Human | ERR194161 | Hs2 | 1,691,135,957 | 1,666,486,126 (98.54%) | 1,686,908,514 | 1,621,073,394 (96.10%)
Cow | SRR1706031 | Bt1 | 813,545,863 | 29,291,792 (3.60%) | 813,520,998 | 28,813,072 (3.54%)
Cow | SRR1805809 | Bt2 | 327,304,866 | 316,654,265 (96.75%) | 326,810,638 | 308,600,196 (94.43%)
Pig | SRR1178925 | Ss1 | 416,854,287 | 379,784,341 (91.11%) | 413,881,544 | 344,614,170 (83.26%)
Pig | SRR1056427 | Ss2 | 249,096,674 | 228,015,545 (91.54%) | 246,546,040 | 212,404,874 (86.15%)
Rice | SRR1450198 | Osj | 499,853,597 | 486,527,154 (97.33%) | 496,769,592 | 459,665,726 (92.53%)
Rice | SRR3098100 | Osi | 97,611,519 | 95,332,114 (97.66%) | 96,759,544 | 86,156,978 (89.04%)
Corn | SRR1575496 | Zm | 364,636,704 | 358,393,982 (98.29%) | 361,922,172 | 315,560,140 (87.19%)

The rest of the samples aligned to their reference genomes with high quality; more than 80% of the reads paired in sequencing were properly mapped as pairs.
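For reference, the mapping statistics in Table 3 come from running samtools flagstat on each sorted BAM file; a minimal invocation (the file name is a placeholder) looks like this:

# Report mapping statistics (QC-passed reads, mapped %, properly paired %) for a sorted BAM.
samtools flagstat sample.sorted.bam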

It is also important to check the level of mismatches in the alignment results. The estimated variation in the human genome is roughly one base in every 1,200 to 1,500, which works out to about 3 million base differences between any two randomly chosen people. However, as shown in Table 4, the results do not match that 3 million base estimate. Ideally, about 36 million mismatches would be expected in the Hs1 data set, since it covers the human reference genome 13 times. The observed mismatch rate is considerably higher than this estimate, suggesting that as many as one out of every two variants reported by sequencing could be an error.

Table 4 The number of reads mapped to a reference genome perfectly and the number mapped with mismatches (mm)

Test ID | Depth | Mapped reads | Perfect match (%) | One mm (%) | Two mm (%) | Three mm (%) | Four mm (%) | Five mm (%)
Hs1 | 13x | 421,339,198 | 328,815,216 (78.0) | 53,425,338 (12.7) | 13,284,425 (3.2) | 6,842,191 (1.6) | 5,140,438 (1.2) | 4,082,446 (1.0)
Hs2 | 52x | 1,666,486,126 | 1,319,421,905 (79.2) | 201,568,633 (12.1) | 47,831,915 (2.9) | 24,862,727 (1.5) | 19,052,800 (1.1) | 15,568,114 (0.9)
Bt1 | 31x | 29,291,792 | 25,835,536 (88.2) | 2,684,650 (9.2) | 338,781 (1.2) | 147,841 (0.5) | 89,706 (0.3) | 70,789 (0.24)
Bt2 | 12x | 316,654,265 | 158,463,463 (50.0) | 68,754,190 (21.7) | 29,544,252 (9.3) | 17,337,205 (5.5) | 12,639,289 (4.0) | 10,015,029 (3.2)
Ss1 | 14x | 379,784,341 | 228,627,231 (60.2) | 69,912,403 (18.4) | 29,142,572 (7.7) | 16,701,248 (4.4) | 11,036,852 (2.9) | 7,652,513 (2.0)
Ss2 | 8x | 228,015,545 | 112,216,441 (49.2) | 53,739,562 (23.6) | 25,132,226 (11.0) | 13,874,636 (6.1) | 8,431,144 (3.7) | 5,375,833 (2.4)
Osj | 132x | 486,527,154 | 208,387,077 (42.8) | 113,948,043 (23.4) | 61,697,586 (12.7) | 37,520,642 (7.7) | 23,761,302 (4.9) | 15,370,422 (3.2)
Osi | 30x | 95,332,114 | 54,462,837 (57.1) | 17,325,526 (18.2) | 8,190,929 (8.6) | 5,146,096 (5.4) | 3,394,245 (3.6) | 2,322,355 (2.4)
Zm | 11x | 358,393,982 | 150,686,819 (42.1) | 82,912,817 (23.1) | 44,823,583 (12.5) | 28,375,226 (7.9) | 19,093,235 (5.3) | 12,503,856 (3.5)

     

    Time Measurement


    Total run time is the elapsed wall time from the earliest start of Phase 1, Step 1 to the latest completion of Phase 3, Step 2. Time measurement for each step is from the latest completion time of the previous step to the latest completion time of the current step as described in Figure 1.
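One simple way to capture such per-step wall times in a batch script is to record a timestamp before and after each step; this is only a sketch of the bookkeeping, not the exact instrumentation used in the benchmark, and the step script name is a placeholder.

# Record wall time for one pipeline step (epoch seconds); repeat around each step.
step_start=$(date +%s)
bash run_step1_align_sort.sh           # placeholder for the actual step command
step_end=$(date +%s)
elapsed=$(( step_end - step_start ))
echo "Aligning & Sorting wall time: ${elapsed} s ($(( elapsed / 3600 )) h)"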

The running time for each data set is summarized in Table 5. Clearly, the input size, that is, the size of the sequence read files and of the reference genome, is the major factor affecting running time. The reference genome size dominates the 'Aligning & Sorting' step, while the size of the reference variant file most affects the 'HaplotypeCaller' step.

Table 5 Running times for the BWA-GATK pipeline

Test ID | Hs1 | Hs2 | Bt1 | Bt2 | Ss1 | Ss2 | Osj | Osi | Zm
Species | Homo sapiens (human) | Homo sapiens (human) | Bos taurus (cow) | Bos taurus (cow) | Sus scrofa (pig) | Sus scrofa (pig) | Oryza sativa japonica (rice) | Oryza sativa indica (rice) | Zea mays (corn)
Depth of Coverage | 13x | 52x | 31x | 12x | 14x | 8x | 132x | 30x | 11x
Total read size, gzip compressed (GB) | 34 | 108 | 70 | 22 | 38 | 20 | 44 | 8 | 28
Number of samples run concurrently | 80 | 80 | 120 | 80 | 120 | 80 | 120 | 80 | 80
Aligning & Sorting (hrs) | 3.93 | 15.79 | 7.25 | 5.77 | 7.53 | 3.04 | 9.50 | 1.18 | 11.16
Mark/Remove Duplicates (hrs) | 0.66 | 2.62 | 3.45 | 0.73 | 1.07 | 0.27 | 1.27 | 0.12 | 0.72
Generate Realigning Targets (hrs) | 0.29 | 1.08 | 3.12 | 1.57 | 0.47 | 0.27 | 0.22 | 0.05 | 0.26
Realign around InDel (hrs) | 2.50 | 8.90 | 4.00 | 3.15 | 2.87 | 1.83 | 7.37 | 1.25 | 3.18
Base Recalibration (hrs) | 1.82 | 6.80 | 1.39 | 1.96 | 2.37 | 1.01 | 3.16 | 0.36 | 1.91
HaplotypeCaller (hrs) | 4.52 | 10.28 | 2.75 | 9.33 | 26.21 | 14.65 | 8.95 | 1.77 | 16.72
GenotypeGVCFs (hrs) | 0.03 | 0.03 | 0.20 | 0.05 | 0.34 | 0.06 | 1.12 | 0.01 | 0.04
Variant Recalibration (hrs) | 0.67 | 0.37 | 0.32 | 0.86 | 0.58 | 0.56 | 0.92 | 0.04 | 0.46
Apply Recalibration (hrs) | 0.04 | 0.04 | 0.03 | 0.06 | 0.03 | 0.08 | 0.03 | 0.01 | 0.05
Total Run Time (hrs) | 14.5 | 45.9 | 22.5 | 23.5 | 41.5 | 21.8 | 32.5 | 4.78 | 34.5
Number of Genomes per Day | 133 | 42 | 128 | 82 | 46 | 88 | 89 | 402 | 56

     

    Discussion

The current version, GATK 3.5, runs considerably slower than version 2.8-1, which we tested in our white paper. In particular, HaplotypeCaller in the new workflow took 4.52 hours, while UnifiedGenotyper in the older version took about 1 hour. Despite the significant slow-down, the GATK team believes HaplotypeCaller produces better results, which makes the roughly five-times-longer run worthwhile.

There are data issues for the non-human species. As shown in Table 5, Hs1 and Ss1 have similarly sized inputs yet show a large difference in running time. The longer running times for non-human species can be explained by the quality of the reference data: the aligning and sorting process takes more than twice as long for the other mammals, and it is even worse for the plants. Plant genomes are known to contain large numbers of repeat sequences, which make the mapping process difficult. It is important to note that the shorter HaplotypeCaller running time for rice does not reflect a realistic time, because the reference variant file was reduced significantly due to the chromosome length/position mismatches in the data. All variant records outside the chromosome ranges were removed, but position mismatches were used without correction; the smaller reference variant file and the incorrect position information make the HaplotypeCaller running time artificially short. Corn's reference data is no better in terms of accuracy for this benchmark. These data errors are the major causes of the longer processing times.

Nonetheless, the results shown here can serve as reference points for worst-case running times. Once the reference data are cleaned up by researchers, the overall running time for other mammals should be similar to that of Hs1 in Table 5, with appropriate scaling for input size. For now, however, it is hard to estimate accurate running times for non-human species.

  • Genomics at a Glance – Part 2 of 2

    Author - Kihoon Yoon

    NGS Workflow

The first step in an NGS workflow is to obtain samples for sequencing. Either DNA or RNA must be extracted from the samples. Unfortunately, sequencing technology is not yet at the stage where samples can be loaded directly onto a sequencer: the samples have to be sheared into random 'short' fragments, and depending on the NGS platform, those DNA or RNA fragments must be amplified to obtain a sufficient amount of material before sequencing. This unglamorous, traditional laboratory work is still critical for obtaining high-quality sequencing results.

A small number of different NGS platforms are currently available, with Illumina sequencers as the industry-leading platform. These platforms are, more or less, all based on sequencing-by-synthesis (SBS) technology, with the exception of Oxford Nanopore. The fundamental idea of SBS is to capture signals as DNA polymerases add labeled complementary nucleotides to the target sequences.

After a sequencer generates short sequence reads, these reads need to be mapped onto a reference genome to determine their origins, since a sequencer produces nothing but millions of short DNA/RNA fragments. Attaching meaningful labels to these short sequences, the so-called aligning step, is the beginning of NGS data analysis and enables all downstream analyses. Many different flavors of aligners are available, but not all of them are ready for RNA-Seq analysis.

Figure 5 DNA-Seq vs RNA-Seq (The original image was obtained from https://upload.wikimedia.org/wikipedia/commons/9/91/Chromosome.gif and modified.)

As illustrated in Figure 2, there are many ways to apply NGS in different studies. In terms of target sequences, however, those applications can be organized into two groups: DNA-Seq and RNA-Seq. Only about 9.2 percent of human DNA is functional, and a little over 1 percent of the human genome codes for proteins; these coding regions are shared by fewer than 20,000 genes. Hence, only a small fraction of the chromosome sequence is transcribed into mRNA in the nucleus of a cell, as shown in Figure 5. Once pre-mRNAs are synthesized, the splicing mechanism removes introns and rejoins exons to create messenger RNAs (mRNAs). These mature RNAs are called messenger RNAs because they are transported out to the cytoplasm to deliver genetic information to the protein synthesis apparatus located there. This makes mRNA sequences (the transcriptome) quite different from the gene sequences on chromosomes.

    DNA-Seq

DNA sequences tell scientists which parts of the chromosomes carry genetic information (the locations of genes) and which stretches carry regulatory information. More importantly, scientists can spot the changes in genes or regulatory regions that may cause disease. Unfortunately, this gene information does not come with the sequence itself; there has been a tremendous effort, such as the ENCODE project, to label these sequences according to their functions. Without proper annotation, genome sequences are just very expensive garbage. Nonetheless, the ability to sequence genomes quickly and cheaply allows doctors to identify the particular type of disease a patient has. It will still take some time before sequencing becomes routine in the doctor's office, but cheaper and faster NGS technology creates vast potential for diagnostics and therapies.

Preparing a DNA sequencing sample is called 'library preparation'. Many different reagent kits are commercially available, and all of them rely on a series of standard molecular biology reactions. A common procedure includes purifying genomic DNA, generating random DNA fragments, and adding adaptors (known short DNA fragments attached to both ends of the purified, fragmented DNA). The genomic DNA library is then amplified through PCR (polymerase chain reaction) to increase the amount of target sequence to meet the requirements of the sequencer. Once sequencing is completed, the sequence read files need to be mapped onto the proper annotated reference sequences, the so-called 'aligning' process. Because the genomic DNA was sheared randomly, the sequence read files contain no information about the origin of each read; aligning maps the locations of the reads on the chromosomes and identifies which genes lie at those locations by looking up the annotation that comes with the reference sequence. After alignment, further statistical analyses reveal where nucleotide variations are located within genes. This is an example of variant analysis, and the procedure is subjective and differs across study types.

    RNA-Seq

Not all genes are expressed at a given moment in a cell, and the distribution of mRNAs is cell or organ specific. Although all cells carry the same copy of the genes in their chromosomes, cells in different organs exhibit distinct gene expression patterns. There is no direct correlation between the distributions of mRNAs and proteins, but mRNA profiling provides a useful clue for estimating protein-level activity.

Library preparation for RNA-Seq is nearly identical to that for DNA-Seq. Extracting RNA is the first step, but because RNA molecules are unstable, the RNA has to be converted to complementary DNA before sequencing. The purpose of RNA-Seq is to determine which genes are expressed, and how strongly, at a given moment. RNA-Seq analysis is generally much harder to automate because there are many potential workflows. Unlike DNA-Seq, one should not expect uniform results across RNA-Seq runs, because gene expression changes over time. However, this variability provides a unique opportunity to compare two cell populations under different conditions; for example, RNA-Seq is useful for comparing gene expression between normal cells and cancer cells from the same patient.

    Limitations of NGS

Neither DNA-Seq nor RNA-Seq alone can provide a complete answer to a biological question. This is not a problem unique to NGS; it applies to any experimental process or single technology in the life sciences. A conclusion drawn from a controlled experiment may not reflect the real conditions in a living organism, and there is always the possibility that the same experiment under similar conditions will end with a completely different result. Beyond these innate difficulties, NGS, like any other technology, comes with errors and therefore requires careful interpretation. Again, NGS is currently the best technology available, but it is not magic.

  • Genomics at a Glance – Part 1 of 2

    Author - Kihoon Yoon

Genomics has been in the spotlight since the completion of the Human Genome Project in 2003 and a series of innovations in sequencing technologies. As NGS technology matures and the cost of sequencing drops rapidly, the amount of data is exploding, which raises the need for computing infrastructure. There is no doubt that NGS technology brought the "gene-omics" revolution to the field and opened up a whole new HPC market. However, from a biologist's viewpoint, there are some important aspects missed by the media and by sci-fi movies.

Even if we were done with genomics, that would not be the whole story…

    Figure 1 Central Dogma (Obtained from http://www.dbriers.com/tutorials/wp-content/uploads/2013/12/Central_Dogma-GODS-wikimedia.jpg)

"Life forms arose from the edge of chaos" is a statement that well describes the uncertainty and complexity of living systems. Unlike many other scientific fields, especially quantitative research areas, biology is a science in which analysis and prediction are probabilistic. Most outcomes of biological research depend on hypotheses, speculation, and a controlled environment. Even so, the field of genomics, equipped with mighty sequencers, still provides only part of the whole story. As the central dogma illustration in Figure 1 shows, the flow of genetic information does not stop at the DNA or RNA level. The ultimate goal is to gain detailed insight into proteins, the final products of gene expression and the workhorses that maintain life. There are complex control mechanisms between each step in Figure 1: although proteins are the final products, they also regulate transcription and translation at the DNA and RNA levels. In a simple view this is a feedback loop, but the complexity of the genomic system goes far beyond a single feedback loop; it is really a set of intertwined, multilayer feedback networks, including transcriptional, post-transcriptional, and translational regulatory networks. We are still quite far from completion!

All the recent excitement about NGS is reminiscent of gene therapy in the late 1980s. The rush to apply gene therapy led to a series of failures that caused tragic deaths and unexpected cancers and sent the field into dark days. Without complete and accurate knowledge, manipulating genes should be approached cautiously, because we do not yet know what we do not know.

    The role of Next Generation Sequencing in Genomics and Medicine

Figure 2 NGS usage in Genomics


Knowledge of DNA sequences is not only important in basic life science research but also indispensable in biotechnology, medical diagnostics, and forensic biology. NGS technology brings tremendous advances to the fields that consume DNA sequences as key information, by lowering sequencing cost, shortening sequencing time, and increasing accuracy.

Figure 2 lists typical NGS applications organized by sequencing target. Naturally, these applications fall into two major groups, DNA-Seq and RNA-Seq. Most DNA-Seq studies focus on genomic mutations or single nucleotide polymorphisms (SNPs) in chromosomes, while RNA-Seq studies capture snapshots of overall gene expression at a given point in time. RNA-Seq is rapidly replacing microarray technology and becoming a standard procedure for differential gene expression and RNA profiling studies.

The ability to perform DNA-Seq and RNA-Seq on the same sample at the same time has opened up new opportunities in translational research (Figure 3). Identifying mutations in chromosomal genes and knowing how those mutations change gene expression is a tremendous advantage in completing the puzzle of the central dogma. Although we still lack comprehensive protein information, it is much easier to investigate a small number of target proteins than to test a large number of them. There are currently various projects that label genomic variations in humans according to disease type, phenotypic characteristics, drug treatments, clinical outcomes, and so on. These data integration efforts will provide the basis for precision and personalized medicine in the near future.

    Figure 4 Overview of NGS data flow

NGS technology has had a broad impact across the different layers of study (see Figure 3), but the computational infrastructure requirements of these areas have not been correctly assessed. By their nature, basic life science and translational studies require the most flexible, powerful, and customizable infrastructure for the analysis phase shown in Figure 4. In basic life science research, large numbers of samples and existing data sets are constantly regrouped and reanalyzed to elucidate biological phenomena, while experts in translational research look for any knowledge that can be applied in the medical field. The data and workloads are therefore similar in basic life science research and translational research.

However, translational research produces simple, consolidated methods that can be applied in the clinic, to help make treatment decisions or to provide any useful process that improves human health and well-being. For example, suppose a set of key genes, say 10 genes, is identified in basic life science research as driving breast cancer progression. Experts in translational research take that knowledge and develop methods to suppress those genes and stop or slow cancer progression efficiently in different individuals, through clinical trials or large cohort studies. As a result of such studies, experts might discover different subtypes of breast cancer that respond differently to various drug treatments based on the expression of those 10 key genes. This allows the development of a gene panel to detect each subtype and direct patients to different drug treatment options; dealing with only 10 genes out of 20,000 is far more practical. As this scenario suggests, the different layers of study have different computational requirements: while high throughput is the important metric in basic life science and translational research, the speed of processing a single sample matters more in the clinical field.