Dell Community
High Performance Computing Blogs

High Performance Computing
A discussion venue for all things high performance computing (HPC), supercomputing, and the technologies that enable scientific research and discovery.
  • Dell EMC Ready Bundle for HPC Life Sciences - Whitepaper 20180319

    Refresh with 14th Generation servers

    Dell EMC's flexible HPC architecture for Life Sciences has seen a dramatic improvement with the new Intel® Xeon® Scalable Processors. The Dell EMC Ready Bundle for HPC Life Sciences, equipped with 14G servers, faster CPUs, and more memory, delivers much higher performance for genomic data processing.

  • Dell EMC Ready Bundle for HPC Life Sciences - Reference Architectures 20180319

    Refresh with 14th Generation servers

    Dell EMC's flexible HPC architecture for Life Sciences has seen a dramatic improvement with the new Intel® Xeon® Scalable Processors. The Dell EMC Ready Bundle for HPC Life Sciences, equipped with 14G servers, faster CPUs, and more memory, delivers much higher throughput than the previous generation, especially for genomic data processing.

  • Cryo-EM – Relion benchmarks on Skylake

    Optimized Relion Tests on Dell EMC PowerEdge C6420s with Dell EMC Isilon

    Overview

    The purpose of this blog is to validate the optimized Relion (REgularised LIkelihood OptimizatioN) on Dell EMC PowerEdge C6420s with Skylake CPUs. Relion was developed in the Scheres lab at the MRC Laboratory of Molecular Biology. It uses an empirical Bayesian approach to refine multiple 3D reconstructions or 2D class averages from data generated by cryo-electron microscopy (Cryo-EM). The impressive performance gains from Intel®'s collaboration with the Relion development team have reduced the performance gap between CPUs and GPUs. The CPU/GPU comparison results are not shown here; however, the gap narrows to a single-digit factor between Skylake CPU systems and Broadwell CPU/Tesla P100 GPU systems.

    Essentially, Cryo-EM is a type of transmission electron microscopy (TEM)[1] for imaging frozen-hydrated specimens at cryogenic temperatures[2]. Specimens remain in their native state without the need for dyes or fixatives, allowing the study of fine cellular structures, viruses, and protein complexes at molecular resolution. Rapid vitrification at cryogenic temperature is the key step: it prevents water molecules from crystallizing and instead forms an amorphous solid that does almost no damage to the sample structure. Regular electron microscopy requires samples to be prepared in complex ways, and these preparations make it hard to retain the original molecular structures. Cryo-EM does not reach the precision of X-ray crystallography; however, it is quickly gaining popularity in the research community due to its simple sample preparation and its flexibility with respect to sample size, complexity, and non-rigid structure. As the resolution revolution in Cryo-EM progresses, built on 40+ years of dedicated work from the structural biology community, we now have the ability to produce accurate, detailed 3D models of intricate biological structures at the sub-cellular and molecular scales.

    The tests were performed on eight Dell EMC PowerEdge C6420 nodes, which are part of the Dell EMC Ready Bundle for HPC Life Sciences. A summary of the configuration is listed in Table 1.

    Table 1 Test Cluster Configuration

                                 Dell EMC PowerEdge C6420                  Dell EMC PowerEdge R740xd
    CPU                          2x Xeon® Gold 6148F 20c 2.4GHz (Skylake)  2x Xeon® Gold 6148F 20c 2.4GHz (Skylake)
    RAM                          12x 32GB @2666 MHz                        12x 32GB @2666 MHz
    OS                           RHEL 7.4                                  RHEL 7.4
    Interconnect                 Intel® Omni-Path                          Intel® Omni-Path
    BIOS System Profile          Performance Optimized                     Performance Optimized
    Logical Processor            Enabled                                   Enabled
    Virtualization Technology    Disabled                                  Disabled
    Compiler and Libraries       Intel Compiler and Libraries 2017.4.196   gcc 4.8.5, CUDA 9.1, OpenMPI 3.0.0
    GLIBC                        2.17-196.el7                              2.17-196.el7
    GPU                          N/A                                       NVIDIA Tesla V100 with 16GB memory

    The test cluster and the F800/H600 storage systems were connected via 4x 100GbE links between two Dell Networking Z9100-ON switches. Each compute node was connected to the test-cluster-side Dell Networking Z9100-ON switch via a single 10GbE link. The four storage nodes in the Dell EMC Isilon F800/H600 were connected to the other switch via 8x 40GbE links. The configuration of the storage is listed in Table 2. The detailed network topology is illustrated in Figure 1.

    Table 2 Storage Configuration

                        Dell EMC Isilon F800                            Dell EMC Isilon H600
    Number of Nodes     4                                               4
    CPU per node        Intel® Xeon® CPU E5-2697A v4 @2.60 GHz          Intel® Xeon® CPU E5-2680 v4 @2.40 GHz
    Memory per node     256GB                                           256GB
    Storage Capacity    Total usable space: 166.8 TB, 41.7 TB per node  Total usable space: 126.8 TB, 31.7 TB per node
    SSD L3 Cache        N/A                                             2.9 TB per node
    Network             Front end: 40GbE; Back end: 40GbE               Front end: 40GbE; Back end: IB QDR
    OS                  Isilon OneFS v8.1.0 DEV.0                       Isilon OneFS v8.1.0.0 B_8_1_0_011

    Figure 1 Illustration of Networking Topology

    The original tests of the optimized Relion from Intel used Lustre as the main storage; however, we tested the identical optimized Relion binary against the Dell EMC Isilon F800 and H600 because Dell EMC Lustre Storage had limited availability at the time of testing. The detailed run scripts were also obtained from Intel to ensure that we could reproduce the results.

    The test data is the recommended standard benchmark data set presented in Wong et al., 2014. This Plasmodium ribosome data set was downloaded from the ftp site.
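
    The exact run scripts came from Intel and are not reproduced here; the sketch below only illustrates how a Relion 2.x 3D classification job on this benchmark data set is typically launched. The particle STAR file, reference map, output prefix, MPI rank count, and thread count are placeholders, not the settings used in our tests.

    # Hypothetical launch of a Relion 2.x 3D classification run on the Plasmodium
    # ribosome benchmark; paths, rank counts, and thread counts are illustrative only.
    mpirun -np 16 relion_refine_mpi \
        --i Particles/shiny_2sets.star --ref emd_2660.map:mrc --o Class3D/run1 \
        --ini_high 60 --ctf --K 6 --particle_diameter 360 --sym C1 \
        --pool 100 --j 5 --dont_combine_weights_via_disc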

     

    Performance Evaluation

    Performance Comparisons of the Optimized Relion on Dell EMC PowerEdge C6420

    Dell EMC PowerEdge C6420s were configured as listed in Table 1, and two Dell EMC Isilon systems, F800 and H600, were used for the tests. Although Intel's results shown in Figure 2 came from tests with a Lustre storage system, we were not able to test with Dell EMC Lustre Storage due to its limited availability. As shown in Figure 2, the running times with various numbers of compute nodes are consistently slightly faster than Intel's results. This is likely due to small variations in our configurations, such as the Xeon® Gold 6148F instead of the non-F series CPUs Intel used, and the performance differences between the Lustre and Isilon storage systems.

    Figure 2 Optimized Relion benchmark comparisons between Intel®'s results and Dell EMC PowerEdge C6420. The storage used for our tests was Dell EMC Isilon, F800 and H600. Intel's results shown here used Lustre as the storage system; however, we were not able to run our tests with Dell EMC Lustre Storage due to its limited availability.

     

    Performance Comparisons of Relion without the optimization

    Figure 3 shows Relion performance without the code optimization. The bars labeled 'Intel E5-2697 v4' are from Intel's publication. In the CPU test results, Relion v2.0.3 took 47.6 hours on the Intel E5-2697 v4 (Broadwell), whereas the same version of Relion took 34.5 hours on Dell EMC PowerEdge C6420 and the latest version of Relion (Relion2-beta-v2.0) took 32.8 hours on Dell EMC PowerEdge R740xd.

    Unfortunately, the currently available GPU, NVIDIA Tesla V100, is not compatible with Relion v2.0.3; hence, the GPU test was performed with Relion2-beta-v2.0. The results for the GPU tests are therefore not a like-for-like comparison.

    The NVIDIA Tesla V100 with 16GB memory performs similarly to the 2-node, 8-rank result in Figure 2.

    Figure 3 Relion performance comparisons with CPU and GPU

    Conclusion

    Dell EMC PowerEdge C6420 proves to be an ideal compute platform for the optimized Relion. It scales well over various numbers of compute nodes with the Plasmodium ribosome data. In a future study, we plan to use a larger protein data set and more compute nodes to perform more comprehensive scaling tests.

    Resources

    Internal web page
    External web page

    Contacts
    Americas
    Kihoon Yoon
    Sr. Principal Systems Dev Eng
    Kihoon.Yoon@dell.com
    +1 512 728 4191

    [1] Transmission electron microscopy refers to a technique where the image of a specimen is formed by directing a high-energy electron beam at a thin sample.

    [2] The term is not precisely defined; however, cryogenic temperature generally refers to temperatures below about 123 K, which equals -150°C or -238°F.

    NOTE: Updated on 5/11/2018

  • Variant Calling (BWA-GATK) pipeline benchmark with Dell EMC Ready Bundle for HPC Life Sciences

    13G/14G server performance comparisons with Dell EMC Isilon and Lustre Storage

    Overview

    Variant calling is the process of identifying variants from sequence data. It determines whether single nucleotide polymorphisms (SNPs), insertions and deletions (indels), and/or structural variants (SVs) exist at a given position in an individual genome or transcriptome. The main goal of identifying genomic variations is to link them to human diseases. Although not all human diseases are associated with genetic variations, variant calling can provide valuable guidance for geneticists working on a particular disease caused by genetic variations. BWA-GATK is one of the Next Generation Sequencing (NGS) computational pipelines designed to identify germline and somatic mutations from human NGS data. There are a handful of variant identification tools, and we understand that no single tool performs perfectly (Pabinger et al., 2014). However, we chose GATK, one of the most popular tools, as our benchmarking tool to demonstrate how well the Dell EMC Ready Bundle for HPC Life Sciences can process complex and massive NGS workloads.

    The purpose of this blog is to provide valuable performance information on the Intel® Xeon® Gold 6148 processor and the previous generation Xeon® E5-2697 v4 processor using a BWA-GATK pipeline benchmark on Dell EMC Isilon F800/H600 and Dell EMC Lustre Storage. The Xeon® Gold 6148 CPU features 20 physical cores or 40 logical cores when utilizing hyper threading. This processor is based on Intel’s new micro-architecture codenamed “Skylake”. Intel significantly increased the L2 cache per core from 256 KB on Broadwell to 1 MB on Skylake. The 6148 also touts 27.5 MB of L3 cache and a six channel DDR4 memory interface. The test cluster configurations are summarized in Table 1.

    Table 1 Test Cluster Configurations

                                 Dell EMC PowerEdge C6420                  Dell EMC PowerEdge C6320
    CPU                          2x Xeon® Gold 6148 20c 2.4GHz (Skylake)   2x Xeon® E5-2697 v4 18c 2.3GHz (Broadwell)
    RAM                          12x 16GB @2666 MHz                        8x 16GB @2400 MHz
    OS                           RHEL 7.3
    Interconnect                 Intel® Omni-Path
    BIOS System Profile          Performance Optimized
    Logical Processor            Disabled
    Virtualization Technology    Disabled
    BWA                          0.7.15-r1140
    sambamba                     0.6.5
    samtools                     1.3.1
    GATK                         3.6

    The test clusters and the F800/H600 storage systems were connected via 4x 100GbE links between two Dell Networking Z9100-ON switches. Each compute node was connected to the test-cluster-side Dell Networking Z9100-ON switch via a single 10GbE link. The four storage nodes in the Dell EMC Isilon F800/H600 were connected to the other switch via 8x 40GbE links. The configuration of the storage is listed in Table 2. For the Dell EMC Lustre Storage connection, four servers (a Metadata Server (MDS) pair and an Object Storage Server (OSS) pair) were connected to the 13th/14th generation servers via Intel® Omni-Path. The detailed network topology is illustrated in Figure 1.

    Figure 1 Illustration of networking topology: only 8 compute nodes are illustrated here for simplicity; 64 nodes of 13G/14G servers were used for the actual tests.

    Table 2 Storage Configurations

    Dell EMC Isilon F800
      Number of nodes: 4
      CPU per node: Intel® Xeon® CPU E5-2697A v4 @2.60 GHz
      Memory per node: 256GB
      Storage capacity: total usable space 166.8 TB, 41.7 TB per node
      SSD L3 cache: N/A
      Network: front end 40GbE, back end 40GbE
      OS: Isilon OneFS v8.1.0 DEV.0

    Dell EMC Isilon H600
      Number of nodes: 4
      CPU per node: Intel® Xeon® CPU E5-2680 v4 @2.40 GHz
      Memory per node: 256GB
      Storage capacity: total usable space 126.8 TB, 31.7 TB per node
      SSD L3 cache: 2.9 TB per node
      Network: front end 40GbE, back end IB QDR
      OS: Isilon OneFS v8.1.0.0 B_8_1_0_011

    Dell EMC Lustre Storage
      Number of nodes: 2x Dell EMC PowerEdge R730 as MDS, 2x Dell EMC PowerEdge R730 as OSS
      CPU per node: 2x Intel® Xeon® E5-2630 v4 @2.20 GHz
      Memory per node: 256GB
      Storage capacity: 960TB raw, 768TB (698 TiB) usable; MDS storage array: 1x Dell EMC PowerVault MD3420 (total 24x 2.5" 300GB 15K RPM SAS); OSS storage array: 4x Dell EMC PowerVault MD3460 (total 240x 3.5" 4TB 7.2K RPM NL SAS)
      SSD L3 cache: N/A
      Network: front end Intel® Omni-Path, internal connections 12Gbps SAS
      OS: Red Hat Enterprise Linux 7.2 (3.10.0-327.el7.x86_64)

    The test data was chosen from one of Illumina's Platinum Genomes. ERR091571 was sequenced on an Illumina HiSeq 2000, submitted by Illumina, and can be obtained from EMBL-EBI. The DNA identifier for this individual is NA12878. Although the description of the data on the linked website states that this sample has a 30x depth of coverage, in reality it is closer to 10x coverage based on the number of reads counted.

    Performance Evaluation

    BWA-GATK Benchmark over Generations

    Dell EMC PowerEdge C6320s and C6420s were configured as listed in Table 1. The tests performed here are designed to demonstrate performance at the server level, not to compare individual components. At the same time, the tests were also designed to estimate sizing information for Dell EMC Isilon F800/H600 and Dell EMC Lustre Storage. The data points in Figure 2 are calculated based on the total number of samples (x axis in the figure) that were processed concurrently. The number-of-genomes-per-day metric is obtained from the total running time taken to process all samples in a test. The smoothed curves are generated using a polynomial spline (a B-spline basis matrix with piecewise polynomials of degree 3). Details of the BWA-GATK pipeline can be obtained from the Broad Institute web site.
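
    As a quick illustration of how the genomes-per-day metric above can be computed (a hypothetical helper, not part of the benchmark scripts; the sample count and wall time below are placeholder values):

    # genomes per day = samples processed concurrently / total wall time, scaled to 24 h
    total_samples=320      # e.g. 64 nodes x 5 samples per node
    total_hours=24.5       # total wall time of the run (placeholder value)
    echo "$total_samples $total_hours" | awk '{printf "%.1f genomes per day\n", $1 / $2 * 24}'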

    Figure 2 BWA-GATK benchmark over 13/14 generation servers with Dell EMC Isilon and Dell EMC Lustre Storage: the numbers of compute nodes used for the tests are 64x C6420s and 63x C6320s (64x C6320s for testing H600). The number of samples per node was increased to reach the desired total number of samples processed concurrently. For C6320 (13G), 3 samples per node was the maximum each node could process. The 64, 104, and 126 sample test results for the 13G system (blue) used 2 samples per node, while the 129, 156, 180, 189, and 192 sample test results used 3 samples per node. For C6420 (14G), the tests were performed with a maximum of 5 samples per node; the 14G plot was generated by processing 1, 2, 3, 4, and 5 samples per node. The number of samples per node is limited by the amount of memory in a system: 128 GB and 192 GB of RAM were used in the 13G and 14G systems, respectively, as shown in Table 1. C6420s show better scaling behavior than C6320s. The 13G server with Broadwell CPUs appears to be more sensitive to the number of samples loaded onto the system, as shown by the 126 vs. 129 sample tests on all the storage systems tested in this study.

    The results with Dell EMC Isilon F800 indicate that C6320 with Broadwell/128GB RAM delivers roughly 50 fewer genomes per day when 3 samples are processed per compute node, and 30 fewer genomes per day when 2 samples are processed per compute node, compared to C6420. It is not clear whether C6320's performance would drop again if more samples were added to each compute node; however, it is clear that C6420 does not show this behavior when the number of samples per node is increased. The results also allow us to estimate the maximum performance of Dell EMC Isilon F800. As the total number of genomes increases, the growth of the number-of-genomes-per-day metric slows down. Unfortunately, we were not able to identify the exact number of C6420s that would saturate a four-node Dell EMC Isilon F800. However, it is safe to say that more than 64x C6420s will require additional Dell EMC Isilon F800/H600 nodes to maintain high performance with more than 320 10x whole human genome samples. Dell EMC Lustre Storage did not scale as well as Dell EMC Isilon F800/H600. However, we observed that some optimizations are necessary to make Dell EMC Lustre Storage perform better. For example, the aligning, sorting, and duplicate-marking steps in the pipeline perform extremely well when the file system's stripe size is set to 2MB, while other steps perform very poorly with a 2MB stripe size. This suggests that Dell EMC Lustre Storage needs to be optimized further for the heterogeneous workloads in the pipeline. Since there is no single configuration that suits every step of the pipeline, we will further investigate the idea of using multiple tiers of file systems to cover the different requirements of each step for both Dell EMC Isilon and Lustre Storage.
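
    For reference, the stripe-size tuning mentioned above can be applied per directory on Lustre; a minimal sketch (the path is a placeholder) might look like this:

    # set a 2 MB stripe size, striped across all OSTs, on the directory used by the
    # aligning/sorting/duplicate-marking steps, then verify the layout
    lfs setstripe -S 2M -c -1 /lustre/scratch/alignment
    lfs getstripe /lustre/scratch/alignment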

    Dell EMC PowerEdge C6320 with Dell EMC Isilon H600 performance reached the maximum around 140 concurrent 10x human whole genomes. Running three 10x samples concurrently on a single node is not ideal. This limit appears to be on the compute node side, since H600 performance is much better with C6420s running a similar number of samples.

    Conclusion

    Dell EMC PowerEdge C6420 has at least a 12% performance gain compared to the previous generation. Each C6420 compute node with 192 GB RAM can process about seven 10x whole human genomes per day. This number could be increased if the C6420 compute node is configured with more memory. In addition to the improvement on the 14G server side, four Isilon F800 nodes in a 4U chassis can support 64x C6420s and 320 10x whole human genomes concurrently.

    Resources

    Internal web page http://en.community.dell.com/techcenter/blueprints/blueprint_for_hpc/m/mediagallery/20442903

    External web page https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3956068/

    Contacts
    Americas
    Kihoon Yoon
    Sr. Principal Systems Dev Eng
    Kihoon.Yoon@dell.com
    +1 512 728 4191

  • Understanding the Role of Dell EMC Isilon SmartConnect in Genomics Workloads

    Coming together with EMC has opened many new opportunities for the Dell EMC HPC Team to develop high-performance computing and storage solutions for the Life Sciences. Our lab recently stood up a ‘starter' 3 node Dell EMC Isilon X410 cluster. As a loyal user of the Isilon X210 in a previous role, I couldn’t wait to start profiling genomics applications using the X410 with Dell EMC HPC System for Life Sciences.

    Because our current Isilon X410 storage cluster is fixed at the 3-node minimum, we aren't set up yet to evaluate the scalability of the X410 with genomics workflows. We will tackle this work once our lab receives additional X nodes and the new Isilon All-Flash node (formerly project Nitro).

     In the meantime, I wanted to understand how the Isilon storage behaves relative to other storage solutions and decided to focus on the role of Isilon SmartConnect.

    Through a single host name, SmartConnect enables client connection load balancing and dynamic network file system (NFS) failover and failback of client connections across storage nodes to provide optimal utilization of the Isilon cluster resources.

    Without the need to install client-side drivers, administrators can easily manage a large and growing number of clients and ensure that, in the event of a system failure, in-flight reads and writes finish successfully.

     Traditional storage systems with two-way failover typically sustain a minimum 50 percent degradation in performance when a storage head fails, as all clients must fail over to the remaining head. With Isilon SmartConnect, clients are evenly distributed across all remaining nodes in the cluster during failover, helping to ensure minimal performance impact.

     To test this concept, I ran the GATK pipeline varying the number of samples and compute nodes without and with SmartConnect enabled on the Isilon storage cluster.

    The configuration of our current lab environment and whole human genome sequencing data used for this evaluation are listed below. 

    Table 1 System configuration, software, and data

    Dell EMC HPC System for Life Sciences
      Server: 40 x PowerEdge C6320
      Processor: 2 x Intel Xeon E5-2697 v4, 18 cores per socket, 2.3 GHz
      Memory: 128 GB at 2400 MT/s
      Interconnect: 10GbE NIC and switch for accessing Isilon, and Intel Omni-Path fabric

    Software
      Operating System: Red Hat Enterprise Linux 7.2
      BWA: 0.7.2-r1039
      Samtools: 1.2.1
      Sambamba: 0.6.0
      GATK: 3.5

    Benchmark Data
      ERR091571: 10x whole human genome sequencing data from Illumina HiSeq 2000, total number of reads = 211,437,919

    As noted earlier, the Isilon in our environment is currently set up in a minimum 3-node configuration. The current generation of Isilon is scalable up to 144 nodes, and as you add Isilon nodes, the aggregate memory, throughput, and IOPS scale linearly. For a deep dive on Isilon and the OneFS file system, see this technical white paper.

    The Isilon storage cluster in our lab is summarized in Table 2. The Isilon storage is mounted on each compute node, up to 40 nodes, through the NFS service (version 3) over a 10GbE network.
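
    For illustration, a client mount in this kind of setup might look like the sketch below, where the SmartConnect zone name and paths are placeholders; mounting through the zone name is what lets SmartConnect balance each new connection across the storage nodes.

    # NFSv3 mount of the Isilon cluster on a compute node via the SmartConnect zone name
    mount -t nfs -o vers=3,rw,hard isilon.hpc.lab:/ifs/data /mnt/isilon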

    Table 2: Current Isilon Configuration

    Dell EMC Isilon
      Server: 3 x X410
      Processor: 2 x Intel(R) Xeon(R) CPU E5-2640 v2 @ 2.00GHz (1995.20-MHz K8-class CPU), 16 cores
      Memory: 256 GB at 2400 MT/s
      Back-end Networking: 2 x IB QDR links
      Front-end Networking: 2 x 1GbE ports and 2 x 10GbE SFP+ ports
      Storage Capacity: 4TB x 30 HDDs, 120TB (usable)

    Software
      Operating System: OneFS 8.0
      Isilon SmartConnect: Round Robin Mode

    Table 3 summarizes all the tests we performed. To mimic a storage environment without proper load balancing, all the tests were performed without SmartConnect enabled except the concurrent 120 sample run.

    Each pipeline (job) runs with one sample and uses 11 cores on a single compute node. A maximum of three pipelines run concurrently on a single compute node. The tests were performed with up to 40 nodes and 120 samples.
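
    For illustration, a job of this shape under a scheduler such as Torque (used elsewhere in these benchmarks) might be described as in the sketch below; the job name, walltime, wrapper script, and input files are hypothetical placeholders.

    #!/bin/bash
    #PBS -N bwa-gatk-NA12878
    #PBS -l nodes=1:ppn=11
    #PBS -l walltime=24:00:00
    cd $PBS_O_WORKDIR
    # placeholder wrapper that runs the full BWA-GATK pipeline on one sample with 11 cores
    ./run_bwa_gatk_pipeline.sh ERR091571_1.fastq.gz ERR091571_2.fastq.gz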

    Table 3 includes the detailed running times for each sub-step in the BWA-GATK pipeline. The Aligning & Sorting and HaplotypeCaller steps are the bottlenecks in the pipeline, but the Aligning & Sorting step is more sensitive to the number of samples. In this benchmark, GenotypeGVCFs is not a bottleneck since we used identical sequence data for all the concurrent pipelines. In real data analysis, a large number of different samples are used, and GenotypeGVCFs becomes the major bottleneck.

    Table 3 Test results for the BWA-GATK pipeline without and with Isilon SmartConnect enabled (step times in hours)

    Number of samples (Data Size)    3      15     30     60     90     120    *120 with SmartConnect
    Number of compute nodes          1      5      10     20     30     40     40
    Aligning & Sorting               3.48   3.54   3.64   4.21   4.94   5.54   4.69
    Mark/Remove Duplicates           0.46   0.49   0.79   1.27   2.52   3.07   1.84
    Generate Realigning Targets      0.19   0.18   0.19   0.19   0.18   0.20   0.18
    Realign around InDel             2.22   2.20   2.24   2.27   2.26   2.27   2.29
    Base Recalibration               1.17   1.18   1.19   1.18   1.16   1.13   1.17
    HaplotypeCaller                  4.13   4.35   4.39   4.34   4.31   4.32   4.29
    GenotypeGVCFs                    0.02   0.02   0.02   0.02   0.02   0.02   0.02
    Variant Recalibration            0.58   0.50   0.53   0.55   0.57   0.53   0.41
    Apply Recalibration              0.02   0.02   0.02   0.02   0.02   0.02   0.02
    Total Running Time (Hrs)         12.3   12.5   13.0   14.1   16.0   17.1   15.0
    Number of Genomes per Day        6      29     53     96     129    156    185

    As shown in Table 3, after 30 samples on 10 compute nodes, the total running time of BWA-GATK began to increase. Each time the number of compute nodes doubled, the total run time continued to climb without SmartConnect enabled. Also starting at 30 samples, we saw jobs begin to fail, presumably due to unbalanced client connections and the inability to fail over and fail back those connections.

    However, when we enabled SmartConnect using the default round-robin settings, we saw a significant improvement on the total run time and daily sample throughput.

    As expected, SmartConnect maximized performance by keeping client connections balanced across all three Isilon storage nodes. In the three-X410 configuration with SmartConnect enabled, the 120 samples processed with 40 compute nodes showed a 14% speed-up and a 19% increase in daily sample throughput.

    This test also suggests a starting point for identifying the number of client connections per Isilon node for this genomics workflow. In our case, adding one additional X410 to the Isilon storage cluster for every 15 additional compute nodes may be a reasonable place to start.

    As we add additional Isilon nodes to our cluster, we will perform additional studies to refine recommendations for the number of client connections per Isilon node for this genomics workflow. We’ll also take a deeper dive with the advanced SmartConnect load balancing options like CPU utilization, connection count, and network throughput. The Isilon SmartConnect White Paper provides a detailed summary and examples for each of these modes. 

    If you are using Bright Cluster Manager and need some tips to set it up with Isilon SmartConnect, read this post.

  • Scaling behavior of short read sequence aligners

    This blog explores the scaling behavior of the three most popular short-read sequence aligners: BWA-MEM, Bowtie, and Bowtie2. The aligning process is the starting point of all Next Generation Sequencing (NGS) data analysis and typically the most time-consuming step in any NGS data analysis pipeline.

    It is clear that using more cores will speed up the aligning process, but as you already know, parallelization comes at a cost. The higher complexity, perhaps an order of magnitude higher, of multiple instruction streams and of the data flowing between them can easily cancel out the speed gained by parallelizing the process. Hence, it is not wise to blindly use all of the cores in a compute node, especially when overall throughput is more important. Identifying a sweet spot for the optimal number of cores per aligning process while maximizing overall throughput is not an easy task in a complex workflow.

    Table 1 Server configuration and software

    Component                  Detail
    Server                     PowerEdge FC430 in FX2 chassis - 2 sockets
    Processor                  Intel® Xeon® Dual E5-2695 v3 - 14 cores, total 28 physical cores
    Memory                     128GB - 8x 16GB RDIMM, 2133 MT/s, Dual Rank, x4 Data Width
    Storage                    480TB IEEL (Lustre)
    Interconnect               InfiniBand FDR
    OS                         Red Hat Enterprise 6.6
    Cluster Management tool    Bright Cluster Manager 7.1
    Short Sequence Aligner     BWA 0.7.2-r1039, Bowtie 1.1.2, Bowtie 2.2.6

     

    Table 1 summarizes the system used for these tests. Hyperthreading was not enabled for the tests, although it helps to improve overall performance.

    In Figure 1, BWA running times were measured for different sizes of paired-end read sequencing data. The size of the sequence read data is represented here as million fragments (MF). For example, 2MF means that the sequence data consists of two fastq files containing two million sequence reads each: one read is in one fastq file, and the corresponding paired read is in the other. The sweet spot for BWA-MEM is roughly 4 to 16 cores, depending on the sequence read file size. Typically, real sequence read sets are larger than 10 million fragments, so the 2MF and 10MF results are not realistic; however, larger sequence read data follow the same behavior as the smaller input sizes. In the speedup chart, the solid blue line labeled 'ideal' represents the theoretical speedup obtainable by increasing the number of cores.
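
    The per-process sweet spot can be located with a simple timing sweep like the sketch below; the reference and fastq file names are placeholders, and the thread counts are just examples for a 28-core node.

    # time bwa mem on the same paired-end input at increasing thread counts
    for t in 1 2 4 8 14 16 28; do
        /usr/bin/time -f "$t threads: %e s" \
            bwa mem -t $t hg19.fa sample_R1.fastq sample_R2.fastq > /dev/null
    done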

      

    The Bowtie results in Figure 2 show a sweet spot similar to the BWA-MEM results; however, the running times are slightly faster than BWA-MEM's. It is notable that Bowtie and Bowtie2 are sensitive to the read lengths, while BWA-MEM shows more consistent scaling behavior regardless of the sequence read lengths.

     

    Although Bowtie2 is even faster than Bowtie, it is more sensitive to the length of the sequence reads, as shown in Figure 3. In fact, the total number of nucleotides could be a better metric for estimating the running time for a given set of sequence reads.

    In practice, there are more factors to consider when maximizing overall throughput. One critical factor is the bandwidth of the existing storage when exploiting the sweet spot instead of using all of the cores in a compute node. For example, if we decided to use 14 cores for each aligning process instead of 28 cores, we would double the number of samples processed simultaneously. However, twice as many processes would saturate the limited storage bandwidth, and an over-taxed storage system would significantly slow down all of the processes running simultaneously.

    Also, these aligning processes are typically not used alone in a pipeline. They are frequently tied to a file conversion and a sorting process, since subsequent analysis tools require the alignment results to be sorted either by chromosome coordinates or by read name. The most popular approach is to use 'pipe and redirection' to avoid writing multiple intermediate output files, as sketched below. However, this practice makes the optimization harder since it generally requires more computational resources. More detailed optimization of NGS pipelines in this respect will be discussed in future blogs.
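
    A minimal sketch of the pipe-and-redirection pattern (thread counts and file names are placeholders): alignment, BAM conversion, and coordinate sorting run as one stream, so no intermediate SAM file is written, but three processes now share the cores and memory of the node.

    bwa mem -t 14 hg19.fa sample_R1.fastq sample_R2.fastq \
        | samtools view -bu - \
        | sambamba sort -t 14 -o sample.sorted.bam /dev/stdin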

  • Variant Calling Benchmark -- Not Only Human

    The variant calling process refers to the identification of nucleotide differences from a reference sequence at given positions in an individual genome or transcriptome. It includes single nucleotide polymorphisms (SNPs), insertions/deletions (indels), and structural variants. One of the most popular variant calling applications is GenomeAnalysisTK (GATK) from the Broad Institute. GATK is often used with BWA to compose a variant calling workflow focused on SNPs and indels. After we published the Dell HPC System for Genomics White Paper last year, there were significant changes in GATK: the key variant calling step, UnifiedGenotyper, is no longer recommended in their Best Practices. Hence, we recreated the BWA-GATK pipeline according to the recommended practice to test whole genome sequencing data from mammals and plants in addition to human whole genome sequencing data. This is part of Dell's effort to help customers estimate their infrastructure needs for various genomic data loads by providing a comprehensive benchmark.

    Variant Analysis for Whole Genome Sequencing data

    System

    The detailed configuration is in the Dell HPC System for Genomics White Paper, and the summary of the system configuration and software is in Table 1.

    Table 1 Server configuration and software

    Component                  Detail
    Server                     40x PowerEdge FC430 in FX2 chassis
    Processor                  Total of 1120 cores: Intel® Xeon® Dual E5-2695 v3 - 14 cores
    Memory                     128GB - 8x 16GB RDIMM, 2133 MT/s, Dual Rank, x4 Data Width
    Storage                    480TB IEEL (Lustre)
    Interconnect               InfiniBand FDR
    OS                         Red Hat Enterprise 6.6
    Cluster Management tool    Bright Cluster Manager 7.1
    Short Sequence Aligner     BWA 0.7.2-r1039
    Variant Analysis           GATK 3.5
    Utilities                  sambamba 0.6.0, samtools 1.2.1

     

    BWA-GATK pipeline

    The current version of GATK is 3.5, and the actual workflow tested was obtained from the workshop 'GATK Best Practices and Beyond'. In this workshop, they introduce a new workflow organized into the following phases:

    • Best Practices Phase 1: Pre-processing
    • Best Practices Phase 2A: Calling germline variants
    • Best Practices Phase 2B: Calling somatic variants
    • Best Practices Phase 3: Preliminary analyses

    Here we tested Phase 1, Phase 2A, and Phase 3 for the germline variant calling pipeline. The commands used in the benchmark are listed below.

    Phase 1. Pre-processing

    Step 1. Aligning and Sorting

    bwa mem -c 250 -M -t [number of threads] -R '@RG\tID:noID\tPL:illumina\tLB:noLB\tSM:bar' [reference chromosome] [read fastq 1] [read fastq 2] | samtools view -bu - | sambamba sort -t [number of threads] -m 30G --tmpdir [path/to/temp] -o [sorted bam output] /dev/stdin

    Step 2. Mark and Remove Duplicates

    sambamba markdup -t [number of threads] --remove-duplicates --tmpdir=[path/to/temp] [input: sorted bam output] [output: bam without duplicates]

    Step 3. Generate Realigning Targets

     java -d64 -Xms4g -Xmx30g -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -nt [number of threads] -R [reference chromosome] -o [target list file] -I [bam without duplicates] -known [reference vcf file]

    Step 4. Realigning around InDel

    java -d64 -Xms4g -Xmx30g -jar GenomeAnalysisTK.jar -T IndelRealigner -R [reference chromosome] -I [bam without duplicates] -targetIntervals [target list file] -known [reference vcf file] -o [realigned bam]

    Step 5. Base Recalibration

    java -d64 -Xms4g -Xmx30g -jar GenomeAnalysisTK.jar -T BaseRecalibrator -nct [number of threads] -l INFO -R [reference chromosome] -I [realigned bam] -known [reference vcf file] -o [recalibrated data table]

    Step 6. Print Recalibrated Reads - Optional

    java -d64 -Xms8g -Xmx30g -jar GenomeAnalysisTK.jar -T PrintReads -nct [number of threads] -R [reference chromosome] -I [realigned bam] -BQSR [recalibrated data table] -o [recalibrated bam]

    Step 7. After Base Recalibration - Optional

    java -d64 -Xms4g -Xmx30g -jar GenomeAnalysisTK.jar -T BaseRecalibrator -nct [number of threads] -l INFO -R [reference chromosome] -I [recalibrated bam] -known [reference vcf file] -o [post recalibrated data table]

    Step 8. Analyze Covariates - Optional

    java -d64 -Xms8g -Xmx30g -jar GenomeAnalysisTK.jar -T AnalyzeCovariates -R [reference chromosome] -before [recalibrated data table] -after [post recalibrated data table] -plots [recalibration report pdf] -csv [recalibration report csv]

    Phase 2. Variant Discovery – Calling germline variants

    Step 1. Haplotype Caller

    java -d64 -Xms8g -Xmx30g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -nct [number of threads] -R [reference chromosome] -ERC GVCF -BQSR [recalibrated data table] -L [reference vcf file] -I [recalibrated bam] -o [gvcf output]

    Step 2. GenotypeGVCFs

    java -d64 -Xms8g -Xmx30g -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -nt [number of threads] -R [reference chromosome] -V [gvcf output] -o [raw vcf]

    Phase 3. Preliminary Analyses

    Step 1. Variant Recalibration

    java -d64 -Xms512m -Xmx2g -jar GenomeAnalysisTK.jar -T VariantRecalibrator -R [reference chromosome] --input [raw vcf] -an QD -an DP -an FS -an ReadPosRankSum -U LENIENT_VCF_PROCESSING --mode SNP --recal_file [raw vcf recalibration] --tranches_file [raw vcf tranches]

    Step 2. Apply Recalibration

    java -d64 -Xms512m -Xmx2g -jar GenomeAnalysisTK.jar -T ApplyRecalibration -R [reference chromosome] -input [raw vcf] -o [recalibrated filtered vcf] --ts_filter_level 99.97 --tranches_file [raw vcf tranches] --recal_file [raw vcf recalibration] --mode SNP -U LENIENT_VCF_PROCESSING

    Job Scheduling

    Torque/Maui was used to manage the large number of jobs needed to process sequencing samples simultaneously. The optional steps 6, 7, and 8 in Phase 1 were not included in the benchmark, since Step 6, PrintReads, took 12.5 hours with 9 threads for the Bos taurus sample (18 hours with a single thread). These optional steps are not required, but they are useful for reporting purposes; if necessary, they can be added as a side workflow to the main procedure. Nine cores were assigned to each job when 120 jobs were processed concurrently, and 13 cores were used for the test with 80 concurrent jobs.
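
    For illustration, a submission loop under Torque matching the core counts above might look like the sketch below; the wrapper script name and sample list are hypothetical placeholders, not the actual scripts used.

    PPN=9   # cores per job for the 120-concurrent-job runs; use 13 for the 80-job runs
    while read sample; do
        qsub -l nodes=1:ppn=${PPN} -v SAMPLE=${sample},THREADS=${PPN} run_bwa_gatk.sh
    done < sample_list.txt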

    Data

    In addition to the benchmark for human whole genome sequencing data published in the whitepaper, we gathered cow, pig, two sub-species of rice (japonica and indica), and corn reference genomes from Illumina's iGenomes site and the Ensembl database. Fortunately, reference variant call data exist in the standard VCF file format for human, cow, and pig. Variant data for japonica rice were obtained from the 3000 Rice Genomes project on AWS and modified to follow the standard VCF file format. However, the chromosome coordinates in this VCF file do not match the actual reference chromosome sequences, and we were not able to find a matching version of the reference variant information in public databases. For indica rice and corn, we gathered the variant information from Ensembl and converted it into a compatible VCF format. Whole genome sequencing data were obtained from the European Nucleotide Archive (ENA). The ENA Sample IDs in Table 2 are the identifiers that allow the sequence data to be retrieved from the site. Although it is not ideal to test an identical input across a large number of processes, it is not feasible to obtain a large number of similar sample data sets from public databases.

    Table 2 WGS test data for the different species: * x2 indicates that the data is paired-end reads. ƚ The Test ID column contains the identifiers for the sequence data used throughout the test.

    Homo sapiens (human)
      Hs1: ENA sample ERR091571, sample base count 42,710,459,638, single file size 17 GB x2, reference genome size 3,326,743,047 bp, depth of coverage 13x
      Hs2: ENA sample ERR194161, sample base count 171,588,070,386, single file size 54 GB x2, reference genome size 3,326,743,047 bp, depth of coverage 52x
      Number of variants in reference: 3,152,430

    Bos taurus (cow)
      Bt1: ENA sample SRR1706031, sample base count 82,272,305,762, single file size 35 GB x2, reference genome size 2,649,685,036 bp, depth of coverage 31x
      Bt2: ENA sample SRR1805809, sample base count 32,681,063,800, single file size 12 GB x2, reference genome size 2,649,685,036 bp, depth of coverage 12x
      Number of variants in reference: 93,347,258

    Sus scrofa (pig)
      Ss1: ENA sample SRR1178925, sample base count 41,802,035,944, single file size 19 GB x2, reference genome size 3,024,658,544 bp, depth of coverage 14x
      Ss2: ENA sample SRR1056427, sample base count 24,901,150,040, single file size 10 GB x2, reference genome size 3,024,658,544 bp, depth of coverage 8x
      Number of variants in reference: 52,573,286

    Oryza sativa (rice), japonica
      Osj: ENA sample SRR1450198, sample base count 49,676,959,200, single file size 22 GB x2, reference genome size 374,424,240 bp, depth of coverage 132x
      Number of variants in reference: 19,409,227

    Oryza sativa (rice), indica
      Osi: ENA sample SRR3098100, sample base count 12,191,702,544, single file size 4 GB x2, reference genome size 411,710,190 bp, depth of coverage 30x
      Number of variants in reference: 4,538,869

    Zea mays (corn)
      Zm: ENA sample SRR1575496, sample base count 36,192,217,200, single file size 14 GB x2, reference genome size 3,233,616,351 bp, depth of coverage 11x
      Number of variants in reference: 51,151,183

    Benchmark Results

    Data Quality

    After mapping and sorting the sequence input files, quality statistics were obtained from the output files of Phase 1, Step 1. The SRR1706031 sample comes from a bovine gut metagenomics study and, as expected, did not map well onto the Bos taurus UMD3.1 reference genome from Ensembl: the majority of DNA from the bovine gut is foreign and has a different sequence composition.
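
    The mapping statistics summarized in Table 3 are the kind of output produced by 'samtools flagstat'; for example (the BAM file name is a placeholder):

    # report QC-passed, mapped, paired, and properly paired read counts for one sample
    samtools flagstat Hs1.sorted.bam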

    Table 3 Mapping qualities of the sequence read data, obtained by using 'samtools flagstat'. 'Total QC-passed reads' is the number of reads that passed the sequencing quality criteria. Of all QC-passed reads, the number actually mapped onto a reference genome, and its percentage, is in the 'Mapped reads (%)' column. 'Paired in sequencing' is the number of reads properly paired by the sequencer. Of the reads properly paired in sequencing, the number mapped onto a reference genome as pairs is listed in 'Properly paired (%) in mapping'.

    Species  Sequencing Reads  Test ID  Total QC-passed reads  Mapped reads (%)          Paired in sequencing  Properly paired (%) in mapping
    Human    ERR091571         Hs1      424,118,221            421,339,198 (99.34%)      422,875,838           412,370,120 (97.52%)
    Human    ERR194161         Hs2      1,691,135,957          1,666,486,126 (98.54%)    1,686,908,514         1,621,073,394 (96.10%)
    Cow      SRR1706031        Bt1      813,545,863            29,291,792 (3.60%)        813,520,998           28,813,072 (3.54%)
    Cow      SRR1805809        Bt2      327,304,866            316,654,265 (96.75%)      326,810,638           308,600,196 (94.43%)
    Pig      SRR1178925        Ss1      416,854,287            379,784,341 (91.11%)      413,881,544           344,614,170 (83.26%)
    Pig      SRR1056427        Ss2      249,096,674            228,015,545 (91.54%)      246,546,040           212,404,874 (86.15%)
    Rice     SRR1450198        Osj      499,853,597            486,527,154 (97.33%)      496,769,592           459,665,726 (92.53%)
    Rice     SRR3098100        Osi      97,611,519             95,332,114 (97.66%)       96,759,544            86,156,978 (89.04%)
    Corn     SRR1575496        Zm       364,636,704            358,393,982 (98.29%)      361,922,172           315,560,140 (87.19%)

    The rest of the samples were properly aligned to their reference genomes with high quality: more than 80% of the reads paired in sequencing were properly mapped as pairs on the reference genomes.

    It is also important to check what level of mismatches exists in the alignment results. The estimated variation rate in the human genome is one base in every 1,200 to 1,500 bases, which amounts to roughly 3 million base differences between any two randomly chosen people. However, as shown in Table 4, the results do not quite match this estimate. Ideally, about 36 million mismatched bases should appear in the Hs1 data set, since it covers the human reference genome 13 times. In practice the rate of mismatches is considerably higher than the estimate, so at least one out of every two variants reported by the sequencing might be an error.
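
    As a rough back-of-the-envelope check of the 36 million figure, using the human reference genome size from Table 2 and the one-in-1,200 estimate:

    \frac{3.33\times10^{9}\ \text{bp}}{1{,}200\ \text{bp per variant}} \times 13 \approx 2.8\times10^{6} \times 13 \approx 3.6\times10^{7}\ \text{mismatched bases}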

    Table 4 The number of reads mapped perfectly on a reference genome and the number of reads mapped with mismatches (mm)

    Test ID  Depth  Mapped reads    Perfect match (%)      One mm (%)           Two mm (%)          Three mm (%)        Four mm (%)         Five mm (%)
    Hs1      13x    421,339,198     328,815,216 (78.0)     53,425,338 (12.7)    13,284,425 (3.2)    6,842,191 (1.6)     5,140,438 (1.2)     4,082,446 (1.0)
    Hs2      52x    1,666,486,126   1,319,421,905 (79.2)   201,568,633 (12.1)   47,831,915 (2.9)    24,862,727 (1.5)    19,052,800 (1.1)    15,568,114 (0.9)
    Bt1      31x    29,291,792      25,835,536 (88.2)      2,684,650 (9.2)      338,781 (1.2)       147,841 (0.5)       89,706 (0.3)        70,789 (0.24)
    Bt2      12x    316,654,265     158,463,463 (50.0)     68,754,190 (21.7)    29,544,252 (9.3)    17,337,205 (5.5)    12,639,289 (4.0)    10,015,029 (3.2)
    Ss1      14x    379,784,341     228,627,231 (60.2)     69,912,403 (18.4)    29,142,572 (7.7)    16,701,248 (4.4)    11,036,852 (2.9)    7,652,513 (2.0)
    Ss2      8x     228,015,545     112,216,441 (49.2)     53,739,562 (23.6)    25,132,226 (11.0)   13,874,636 (6.1)    8,431,144 (3.7)     5,375,833 (2.4)
    Osj      132x   486,527,154     208,387,077 (42.8)     113,948,043 (23.4)   61,697,586 (12.7)   37,520,642 (7.7)    23,761,302 (4.9)    15,370,422 (3.2)
    Osi      30x    95,332,114      54,462,837 (57.1)      17,325,526 (18.2)    8,190,929 (8.6)     5,146,096 (5.4)     3,394,245 (3.6)     2,322,355 (2.4)
    Zm       11x    358,393,982     150,686,819 (42.1)     82,912,817 (23.1)    44,823,583 (12.5)   28,375,226 (7.9)    19,093,235 (5.3)    12,503,856 (3.5)

     

    Time Measurement


    Total run time is the elapsed wall time from the earliest start of Phase 1, Step 1 to the latest completion of Phase 3, Step 2. Time measurement for each step is from the latest completion time of the previous step to the latest completion time of the current step as described in Figure 1.
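
    A minimal sketch of how this kind of per-step timing can be captured (the wrapper script name is a hypothetical placeholder, not part of the benchmark): record the wall clock around each step for every sample, then derive the step time from the latest completion of the previous step to the latest completion of the current step.

    start=$(date +%s)
    bash run_step.sh "$SAMPLE" "$STEP"     # placeholder for one pipeline step
    end=$(date +%s)
    echo "$SAMPLE $STEP $start $end" >> step_times.log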

    The running time for each data set is summarized in Table 5. Clearly, the input size, i.e., the size of the sequence read files and the reference genomes, is the major factor affecting the running time. The reference genome size is the major factor for the 'Aligning & Sorting' step, while the size of the variant reference affects the 'HaplotypeCaller' step the most.

    Table 5 Running time for the BWA-GATK pipeline
    (Hs = Homo sapiens, human; Bt = Bos taurus, cow; Ss = Sus scrofa, pig; Osj = Oryza sativa japonica, rice; Osi = Oryza sativa indica, rice; Zm = Zea mays, corn)

    Test ID                                  Hs1     Hs2     Bt1     Bt2     Ss1     Ss2     Osj     Osi     Zm
    Depth of Coverage                        13x     52x     31x     12x     14x     8x      132x    30x     11x
    Total read size, gzip compressed (GB)    34      108     70      22      38      20      44      8       28
    Number of samples run concurrently       80      80      120     80      120     80      120     80      80

    Run Time (hours)
    Aligning & Sorting                       3.93    15.79   7.25    5.77    7.53    3.04    9.50    1.18    11.16
    Mark/Remove Duplicates                   0.66    2.62    3.45    0.73    1.07    0.27    1.27    0.12    0.72
    Generate Realigning Targets              0.29    1.08    3.12    1.57    0.47    0.27    0.22    0.05    0.26
    Realign around InDel                     2.50    8.90    4.00    3.15    2.87    1.83    7.37    1.25    3.18
    Base Recalibration                       1.82    6.80    1.39    1.96    2.37    1.01    3.16    0.36    1.91
    HaplotypeCaller                          4.52    10.28   2.75    9.33    26.21   14.65   8.95    1.77    16.72
    GenotypeGVCFs                            0.03    0.03    0.20    0.05    0.34    0.06    1.12    0.01    0.04
    Variant Recalibration                    0.67    0.37    0.32    0.86    0.58    0.56    0.92    0.04    0.46
    Apply Recalibration                      0.04    0.04    0.03    0.06    0.03    0.08    0.03    0.01    0.05
    Total Run Time                           14.5    45.9    22.5    23.5    41.5    21.8    32.5    4.78    34.5
    Number of Genomes per Day                133     42      128     82      46      88      89      402     56

     

    Discussion

    The current version, GATK 3.5, is considerably slower than version 2.8-1, which we tested in our white paper. In particular, HaplotypeCaller in the new workflow took 4.52 hours, while UnifiedGenotyper in the older version took about 1 hour. Despite the significant slow-down, the GATK team believes HaplotypeCaller produces better results, which justifies the roughly five-times-longer run.

    There are data issues for the non-human species. As shown in Table 5, for similar input sizes, Hs1 and Ss1 show a large difference in running time. The longer running time for non-human species can be explained by the quality of the reference data. The aligning and sorting process takes more than twice as long in the other mammals, and it gets worse in plants; plant genomes are known to contain a large number of repeat sequences, which make the mapping process difficult. It is important to note that the shorter HaplotypeCaller running time for rice does not reflect the real time, since the size of the reference variant file was reduced significantly due to the chromosome length/position mismatches in the data. All variant records outside the chromosome ranges were removed, but position mismatches were used without correction. The smaller reference variant file and the wrong position information made the HaplotypeCaller running time shorter. Corn's reference data is no better in terms of accuracy for this benchmark. These data errors are the major causes of the longer processing times.

    Nonetheless, the results shown here can serve as good reference points for worst-case running times. Once the reference data are cleaned up by researchers, the overall running time for the other mammals should be similar to that of Hs1 in Table 5, with proper scaling for input size. However, it is hard to estimate accurate running times for non-human species at this moment.

  • Genomics at a Glance – Part 2 of 2

    Author - Kihoon Yoon

    NGS Workflow

    The first step in an NGS workflow is to obtain samples for sequencing. Either DNA or RNA needs to be extracted from the samples. Unfortunately, sequencing technology is not yet at the stage where we can load samples directly onto a sequencer: the samples have to be sheared into random 'short' fragments. Also, depending on the NGS platform, the short DNA or RNA fragments have to be amplified in order to obtain a sufficient amount of material before sequencing. This unglamorous, traditional laboratory work is still critical for obtaining high quality sequencing results.

    A small number of different NGS platforms are currently available, but Illumina sequencers are the industry-leading platform. These platforms are, more or less, based on sequencing-by-synthesis (SBS) technology, with the exception of Oxford Nanopore. The fundamental principle of SBS technology is to capture signals as DNA polymerases add labeled complementary nucleotides to target sequences.

    After a sequencer generates short sequence reads, these reads need to be mapped onto a reference genome to determine their origins, since a sequencer produces nothing but millions of short DNA/RNA fragments. Adding meaningful labels to these short sequences, the so-called aligning step, is the beginning of NGS data analysis and enables the downstream analyses. There are many different flavors of aligners available, but not all of them are ready for RNA-Seq analysis.

    Figure 5 DNA-Seq vs RNA-Seq (The original image was obtained from https://upload.wikimedia.org/wikipedia/commons/9/91/Chromosome.gif and modified.)

    As illustrated in Figure 2, there are many ways to apply NGS in different studies. However, in terms of target sequences, those applications can be organized into two groups: DNA-Seq and RNA-Seq. Only about 9.2 percent of human DNA does something, and a little over 1 percent of the human genome codes for proteins; these coding regions are shared by fewer than 20,000 genes. Hence, only a small fraction of the chromosome sequence is transcribed to mRNA in the nucleus of the cell, as shown in Figure 5. Once pre-RNAs are synthesized, the splicing mechanism removes introns and rejoins exons to create messenger RNAs (mRNAs). These matured RNAs are called messenger RNAs because they are transported out to the cytoplasm to deliver genetic information to the protein synthesis apparatus located there. This makes mRNA sequences (the transcriptome) quite different from the gene sequences on chromosomes.

    DNA-Seq

    DNA sequences tell scientists which parts of chromosomes carry genetic information (the locations of genes) and which stretches carry regulatory information. More importantly, scientists can spot the changes in genes or regulatory regions that may cause disease. Unfortunately, this gene information does not come with the sequence itself. There has been a tremendous effort, such as the ENCODE project, to label these sequences according to their functions. Without a proper annotation for genome sequences, the sequences are just very expensive garbage. Nonetheless, the ability to sequence genomes quickly and cheaply allows doctors to identify the particular type of disease a patient has. It will still take some time before sequencing becomes routine in the doctor's office, but cheaper and faster NGS technology creates vast potential for diagnostics and therapies.

    The preparation of a DNA sequencing sample is called 'library preparation'. There are many different versions of commercially available reagent kits, and all of them rely on a series of standard molecular biology reactions. A common procedure includes purifying genomic DNA, making random DNA fragments, and adding adaptors (known short DNA fragments attached to both ends of the purified, fragmented DNA). The genomic DNA library is then amplified through PCR (polymerase chain reaction) to increase the amount of target sequence to meet the requirements of the various sequencers. Once sequencing is completed, the sequence read files need to be mapped onto a proper reference sequence with annotations. This is the so-called 'aligning' process. Since the genomic DNA was sheared randomly, the final sequence read files do not contain any information about their origin. The aligning process maps the locations of the sequence reads on the chromosomes and finds which genes are at those locations by looking up the annotation that comes with a reference sequence. After alignment is done, further statistical analyses reveal where nucleotide variations are located in genes. This is an example of variant analysis, and the procedure is subjective and differs for different types of studies.

    RNA-Seq

    Not all genes are expressed at a given moment in a cell, and the distribution of mRNAs is cell or organ specific. Although all cells have the same copy of genes in their chromosomes, cells in different organs exhibit distinct gene expression patterns. There is not a direct correlation between the distributions of mRNAs and proteins, but mRNA profiling provides a better clue to estimate protein level activities.

    Library preparation for RNA-Seq is nearly identical to the one for DNA-Seq. Extracting RNA is the first step, but due to the instability of RNA molecules, these RNA sequences have to be converted to complementary DNA before sequencing. The purpose of RNA-Seq is to determine which genes are expressed, and at what level, at a given moment. RNA-Seq analysis is normally much harder to automate since there are many potential workflows. One should not expect uniform results across all RNA-Seq experiments, as one would for DNA-Seq, because gene expression changes over time. However, this variability provides a unique opportunity to compare differences between two cells under different conditions. For example, RNA-Seq is useful for comparing gene expression between normal cells and cancer cells from a patient.

    Limitations of NGS

    Neither DNA-Seq nor RNA-Seq alone can provide a complete answer to any biological question. This is not only NGS's problem, but a limitation of any experimental process or single technology in the life sciences. A conclusion drawn from a controlled experiment will not fully reflect the real conditions in a life form, and there is always the possibility that the same experiment under similar conditions might end up with a completely different result. Besides these innate difficulties, NGS comes with errors, like any other technology; hence it requires careful interpretation. Again, NGS is currently the best technology we can use, but there is nothing magical about it.

  • Genomics at a Glance – Part 1 of 2

    Author - Kihoon Yoon

    Recently, genomics has been in the spotlight, following the completion of the Human Genome Project in 2003 and a series of innovations in sequencing technologies. As NGS technology matures and the cost of sequencing drops fast, there is an explosion in the amount of data, which elevates the need for computing infrastructure. Clearly, there is no doubt that NGS technology brought the "Gene-omics" revolution to the field and opened up a whole new HPC market. However, from a biologist's viewpoint, there are some important aspects missed by the media and sci-fi movies.

    Even if we were done with genomics, that is not all …

    Figure 1 Central Dogma (Obtained from http://www.dbriers.com/tutorials/wp-content/uploads/2013/12/Central_Dogma-GODS-wikimedia.jpg)

    'Life forms arose from the edge of chaos' is a statement that well describes the uncertainty and complexity of life forms. Unlike many other scientific fields, especially quantitative research areas, biology is a science where analyses and predictions are probabilistic. Most outcomes of biological research depend on hypotheses, speculation, and a controlled environment. Nonetheless, the field of genomics, even equipped with mighty sequencers, still provides only part of the entire story. As you can see from the central dogma illustration in Figure 1, the flow of genetic information does not stop at the DNA or RNA level. The ultimate goal is to gain detailed insight into proteins, the final products of gene expression and the workhorses that maintain life. There are complex control mechanisms between each step in Figure 1. Although proteins are the final products, they also regulate transcription and translation at the DNA and RNA levels. In a simple view it is like a feedback loop, but the complexity of the genomic system goes far beyond a single feedback loop: it is actually a set of intertwined, multilayer feedback networks, such as transcriptional, post-transcriptional, and translational regulatory networks. We are still quite far from completion!

    All the recent excitement about NGS is reminiscent of gene therapy in the late 80s. The rush to apply gene therapy led to a series of failures that caused tragic deaths and unexpected cancers and put the field into its dark days. Without complete and accurate knowledge, manipulating genes should be approached cautiously, since we do not yet know what we do not know.

    The role of Next Generation Sequencing in Genomics and Medicine

    Figure 2 NGS usages in Genomics


    Knowledge of DNA sequences is not only important in basic life science research, but also indispensable in biotechnology, medical diagnosis, and forensic biology. NGS technology brings tremendous advances to those fields that consume DNA sequences as key information: by lowering sequencing cost, shortening sequencing time, and increasing accuracy.

    Figure 2 lists typical NGS applications based on their sequencing targets. It is natural that these applications fall into two major groups, DNA-Seq or RNA-Seq. Most DNA-Seq studies focus on genomic mutations or single nucleotide polymorphisms (SNPs) in chromosomes, while RNA-Seq studies obtain snapshots of all gene expression at a given point in time. RNA-Seq is rapidly replacing microarray technology and becoming a standard procedure for differential gene expression and RNA-profiling studies.

    The ability to perform DNA-Seq and RNA-Seq at the same time on the same sample has opened up new opportunities in translational research (Figure 3). Identifying mutations in the genes on chromosomes and knowing how those mutations change gene expression brings a tremendous advantage to completing the puzzle of 'the central dogma'. Although we still do not have comprehensive protein information, it is much easier to look into a small number of target proteins than to test a large number of proteins. Currently, there are various projects to label genomic variations in humans according to disease type, phenotypic characteristics, drug treatments, clinical outcomes, and so on. These data integration efforts will provide the basis for precision medicine and personalized medicine in the near future.

    Figure 4 Overview of NGS data flow

    NGS technology has had a broad impact on the different layers of study (see Figure 3), but the computational infrastructure requirements for these areas have not been correctly assessed. By the nature of basic life science and translational study, these two areas require the most flexible, powerful, and customizable infrastructure for the analysis phase in Figure 4. In basic life science research, large numbers of samples and existing data sets are constantly regrouped and reanalyzed to elucidate biological phenomena, while experts in translational research seek any knowledge applicable to the medical field. The data and workloads should therefore be similar in basic life science research and translational research.

    However, translational research produces simple, consolidated methods that can be applied in the clinical setting, to help make treatment decisions, or to provide any useful process that can improve human health and well-being. For example, suppose a set of key genes, say 10 genes, is identified in basic life science research as driving breast cancer progression. Experts in translational research pick up that knowledge and develop methods to suppress those genes and so stop or slow down cancer progression efficiently in different individuals through clinical trials or large cohort studies. As a result of such a study, the experts might find that there are different subtypes of breast cancer that respond differently to various drug treatments based on the expression of those 10 key genes. This allows the development of a gene panel that detects such a subtype so the patient can go through different drug treatment options. Dealing with only 10 genes out of 20,000 is far more practical. As you can guess from this scenario, the different layers of study have different computational requirements. While high throughput is an important metric in basic life science and translational research, single-sample processing speed is more important in the clinical field.