The Stampede2 system is the result of a collaboration between the Texas Advanced Computing Center (TACC), Dell EMC, and Intel. Stampede2 consists of 1,736 Dell EMC PowerEdge C6420 nodes with dual-socket Intel Skylake processors and 4,204 Dell EMC PowerEdge C6320p nodes with Intel Knights Landing bootable processors, for a total of 5,940 compute nodes, plus 24 additional login and management servers and Dell EMC Networking H-series switches, all interconnected by an Intel Omni-Path Architecture (OPA) fabric.
Two technical white papers were recently published through the joint efforts of TACC, Dell EMC, and Intel. One white paper describes network integration and testing best practices on the Stampede2 cluster. The other discusses the application performance of Intel Skylake and Intel Knights Landing processors on Stampede2 and highlights the significant performance advantage of Intel Skylake processors at multi-node scale in four commonly used applications: NAMD, LAMMPS, GROMACS, and WRF. For build details, please contact your Dell EMC representative. If you have a VASP license, we are happy to share VASP benchmark results as well.
Deploying Intel Omni-Path Architecture Fabric in Stampede2 at the Texas Advanced Computing Center–Network Integration and Testing Best Practices (H17245)
Application Performance of Intel Skylake and Intel Knights Landing Processors on Stampede2 (H17212)
This blog is written by the Dell Hypervisor Engineering team.
Persistent memory (also known as non-volatile memory, or NVM) is a type of random access memory that retains its contents when system power goes down, whether from an unexpected power loss, a user-initiated shutdown, or a system crash. Dell EMC introduced support for NVDIMM-N starting with its 14th generation of PowerEdge servers, and VMware announced support for NVDIMM-N from vSphere ESXi 6.7 onwards. The NVDIMM-N resides in a standard CPU memory slot, placing data closer to the processor, which reduces latency and maximizes performance. This document details the support stance for NVDIMM-N and VMware ESXi specific to Dell EMC PowerEdge servers, and provides insight into the use cases where NVDIMM is involved and their behavioral caveats.
Dell EMC support for Persistent Memory (PMem) and VMware ESXi
This blog explains why the industry transitioned from 512-byte sector disks to 4096-byte (4K) sector disks, and why a 4K sector disk should be chosen for OS installation. It first describes the sector layout to motivate the migration, then gives the reasoning behind it, and finally covers the benefits of 4K sector drives over 512-byte sector drives.
A sector is the minimum storage unit of a hard disk drive; it is a subdivision of a track. The sector size is an important factor in operating system design because it represents the atomic unit of I/O operations on a hard disk drive. In Linux, you can check the sector size of a disk using the "fdisk -l" command.
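The sector-size line printed by "fdisk -l" can also be consumed programmatically. Here is a minimal sketch that parses that line from captured output; the regular expression assumes the output format of recent util-linux versions, and the sample string is illustrative:

```python
import re

def parse_sector_sizes(fdisk_output):
    """Extract (logical, physical) sector sizes in bytes from `fdisk -l` output."""
    m = re.search(
        r"Sector size \(logical/physical\):\s*(\d+)\s*bytes\s*/\s*(\d+)\s*bytes",
        fdisk_output,
    )
    if m is None:
        raise ValueError("sector size line not found")
    return int(m.group(1)), int(m.group(2))

# Example line from a 512e drive: 4K physical sectors exposed as 512-byte logical sectors.
sample = "Sector size (logical/physical): 512 bytes / 4096 bytes"
print(parse_sector_sizes(sample))  # (512, 4096)
```

A drive reporting 512/512 is a native 512-byte drive, 512/4096 is a 512-emulation (512e) drive, and 4096/4096 is a 4K-native drive.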
Figure-1: The disk sector size in Linux
As shown in Figure-1, both the logical and physical sectors are 512 bytes long on this Linux system.
The sector layout is structured as follows:
1) Gap section: Separates each sector on the drive from the next.
2) Sync section: Indicates the beginning of the sector.
3) Address Mark section: Contains information identifying the sector, e.g. its number and location.
4) Data section: Contains the actual user data.
5) ECC section: Contains error correction codes that are used to repair and recover data that might be damaged during the disk read/write process.
Each sector stores a fixed amount of user data, traditionally 512 bytes for hard disk drives. But to achieve better data integrity at higher densities and more robust error correction, newer HDDs now store 4096 bytes (4K) in each sector.
Need for large sector
The number of bits stored on a given length of track is termed areal density. Increasing areal density is a trend in the disk drive industry, not only because it allows greater volumes of data to be stored in the same physical space but also because it improves the transfer speed at which the medium can operate. As areal density increases, each sector occupies a smaller and smaller area on the hard drive surface. This creates a problem: the physical size of sectors has shrunk, but media defects have not. When the data in a sector occupies a smaller area, error correction becomes more challenging, because a media defect of a given size damages a higher percentage of the data in a small sector than in a large one.
There are two approaches to solving this problem. The first is to dedicate more disk space to ECC bytes to ensure continued data reliability. But spending more disk space on ECC bytes lowers disk format efficiency, defined as (number of user data bytes X 100) / total number of bytes on disk. Another disadvantage is that the more ECC bits included, the more processing power the disk controller needs to run the ECC algorithm.
The second approach is to increase the size of the data block and only slightly increase the ECC bytes for each block. As the data block size grows, the per-sector overhead for control information (the gap, sync, and address mark sections) shrinks. The ECC bytes per sector increase, but the overall ECC bytes required for the disk decrease because there are fewer, larger sectors. Reducing the overall space used for error correction improves format efficiency, and the larger per-sector ECC field enables more efficient and powerful error-correction algorithms. Thus, the transition to a larger sector size has two benefits: improved reliability and greater disk capacity.
Why 4K only?
From a throughput perspective, the ideal block size should be roughly equal to the characteristic size of a typical data transaction, and the average file today is far larger than 512 bytes. Applications in modern systems use data in large blocks, much larger than the traditional 512-byte sector size. Block sizes that are too small cause excessive transaction overhead, while block sizes that are too large force each transaction to transfer unnecessary data.
The size of a standard transaction in relational database systems is 4K, and the consensus in the hard disk drive industry has been that a 4K physical block size provides a good compromise. It also corresponds to the paging size used by operating systems and processors.
Figure-2: 512 bytes block vs 4096 bytes block
Figure-3: Format Efficiency improvement in 4K disk
Table-1: Format Efficiency improvement in 4K disk
As we see in Figure-2, 4K sectors are eight times as large as traditional 512-byte ones. Hence, for the same data payload, one needs one-eighth the gap, sync, and address mark sections and one-quarter the error correction code section. Reducing the amount of space used for error correction code and the other non-data sections improves format efficiency for the 4K format. The improvement is shown in Figure-3 and Table-1: there is a gain of 8.6% in format efficiency for a 4K sector disk over a 512-byte sector disk.
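The efficiency gain can be approximated with the format efficiency formula given earlier. A back-of-the-envelope sketch, assuming roughly 15 bytes of gap/sync/address-mark overhead per sector (an illustrative figure; exact overheads vary by drive) and the 50-byte versus 100-byte ECC fields discussed in this post:

```python
def format_efficiency(data_bytes, ecc_bytes, other_overhead=15):
    """Format efficiency = user data bytes * 100 / total bytes per sector."""
    total = data_bytes + ecc_bytes + other_overhead
    return 100.0 * data_bytes / total

eff_512 = format_efficiency(512, 50)    # ~88.7% for the 512-byte format
eff_4k = format_efficiency(4096, 100)   # ~97.3% for the 4K format
# Difference is ~8.5%, in line with the ~8.6% gain reported in Table-1.
print(round(eff_512, 1), round(eff_4k, 1), round(eff_4k - eff_512, 1))
```

The exact percentage depends on the per-sector overhead assumed, but the direction and rough magnitude of the gain hold for any reasonable values.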
Figure-4: Effect of media defect on disk density
As shown in Figure-4, a media defect affects a disk with higher areal density more than one with lower areal density. As areal density increases, more ECC bytes are needed to retain the same level of error correction capability. The 4K format provides enough space to expand the ECC field from 50 to 100 bytes to accommodate new ECC algorithms. The enhanced ECC coverage improves the ability to detect and correct data errors beyond the 50-byte defect length associated with the 512-byte sector format.
4K drive Support on OS & Dell PowerEdge Servers
4K data disks are supported on Windows Server 2012, but as boot disks they are supported only in UEFI mode. For Linux, 4K hard drives require a minimum of RHEL 6.1 or SLES 11 SP2, and 4K boot drives are likewise supported only in UEFI mode; kernel support for 4K drives is available in kernel versions 2.6.31 and above. PERC H330, H730, H730P, H830, FD33xS, and FD33xD cards support 4K block size disk drives, which enables you to use the storage space efficiently. 4K disks can be used on the Dell PowerEdge servers supporting the above PERC cards.
The physical size of each sector has become smaller as a result of increasing areal density in disk drives. Because media defects do not shrink at the same rate, more sectors are expected to be corrupted, and stronger error correction capability is needed for each sector. Disk drives with larger physical sectors and more ECC bytes per sector provide enhanced data protection and correction algorithms. The 4K format achieves better format efficiency and improves reliability and error correction capability. This transition results in a better user experience, which is why a 4K drive should be chosen for OS installation.
We published the white paper "Dell EMC PowerEdge R940 makes De Novo Assembly easier" last year to study the behavior of SOAPdenovo2. However, that white paper is limited to one de novo assembly application, so we want to expand our application coverage a little further. We decided to test SPAdes (2012), since it is a relatively new application and is reported to improve on the Euler-Velvet-SC assembler (2011) and SOAPdenovo. Like most assemblers targeting Next Generation Sequencing (NGS) data, SPAdes is based on the de Bruijn graph algorithm. De Bruijn graph-based assemblers are more appropriate for larger datasets with more than a hundred million short reads.
As shown in Figure 1, Greedy-Extension and overlap-layout-consensus (OLC) approaches were used in the very early next-generation assemblers. Greedy-Extension's heuristic is to extend the current assembly with the read giving the highest-scoring overlap alignment. However, this approach is vulnerable to imperfect overlaps and multiple matches among the reads, and it leads to incomplete or arrested assemblies. The OLC approach works better for long reads, such as those from Sanger or other technologies generating more than 100 bp (454, Ion Torrent, PacBio, and so on), due to its minimum overlap threshold. De Bruijn graph-based assemblers are more suitable for short-read sequencing technologies such as Illumina. The approach breaks the sequencing reads into successive k-mers, and the graph maps the k-mers: each k-mer forms a node, and edges are drawn between successive k-mers in a read.
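The k-mer decomposition described above can be sketched in a few lines of code. This is a toy illustration of the idea, not how any production assembler stores its graph: each k-mer becomes a node, and an edge connects each k-mer to the one starting one base later in the same read.

```python
from collections import defaultdict

def build_de_bruijn(reads, k):
    """Toy de Bruijn graph: nodes are k-mers, edges link successive k-mers in a read."""
    edges = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            kmer = read[i:i + k]
            next_kmer = read[i + 1:i + k + 1]
            edges[kmer].add(next_kmer)
    return edges

# The read ACGTAC yields 3-mers ACG, CGT, GTA, TAC and edges ACG->CGT->GTA->TAC.
graph = build_de_bruijn(["ACGTAC"], k=3)
print(dict(graph))
```

Real assemblers add error correction, coverage statistics, and graph simplification on top of this basic structure, but the node/edge construction is the common core.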
Figure 1 Overview of de novo short reads assemblers. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3056720/
SPAdes is a relatively recent de Bruijn graph-based application for both single-cell and multicell data. It improves on the recently released Euler-Velvet-SC (E+V-SC) assembler (specialized for single-cell data) and on the popular assemblers Velvet and SOAPdenovo (for multicell data).
All tests were performed on Dell EMC PowerEdge R940 configured as shown in Table 1. The total number of cores available in the system is 96, and the total amount of memory is 1.5TB.
The data used for the tests is a paired-end read set, ERR318658, which can be downloaded from the European Nucleotide Archive (ENA). The reads were generated from a blood sample used as a control to identify somatic alterations in primary and metastatic colorectal tumors. This data set contains 3.2 Billion Reads (BR) with a read length of 101 nucleotides.
SPAdes runs three sets of de Bruijn graphs, with 21-mers, 33-mers, and 55-mers, consecutively. This is the main difference from SOAPdenovo2, which runs a single k-mer, either 63-mer or 127-mer.
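The multi-k run is requested through the SPAdes `-k` option, which accepts a comma-separated list. A small sketch of building such a command line (the input file names are hypothetical, and the flags should be verified against the SPAdes manual for your installed version):

```python
def spades_command(r1, r2, outdir, kmers=(21, 33, 55), threads=92, memory_gb=1500):
    """Build a SPAdes command line for a paired-end data set with an explicit k-mer list.

    Flags follow the SPAdes manual: -k (k-mer sizes), -1/-2 (paired reads),
    -t (threads), -m (memory limit in GB), -o (output directory).
    """
    return (
        f"spades.py -k {','.join(map(str, kmers))} "
        f"-1 {r1} -2 {r2} -t {threads} -m {memory_gb} -o {outdir}"
    )

print(spades_command("ERR318658_1.fastq.gz", "ERR318658_2.fastq.gz", "spades_out"))
```

The defaults above mirror this test setup: 92 worker cores and the 1.5TB of memory available on the R940.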
In Figure 2, the runtimes (wall-clock times) are plotted in days (blue bars) for various core counts: 28, 46, and 92 cores. Since we did not want to use every core of each socket, 92 cores was picked as the maximum for the system, reserving one core per socket for the OS and other maintenance processes. Subsequent tests were done by halving the number of cores. Peak memory consumption for each case is plotted as a line graph. SPAdes runs significantly longer than SOAPdenovo2 due to the multiple iterations over three different k-mers.
The peak memory consumption is very similar to SOAPdenovo2's: both applications require slightly less than 800GB of memory to process 3.2 BR.
Utilizing more cores reduces the runtime of SPAdes significantly, as shown in Figure 2. For SPAdes, it is advisable to use the highest-core-count CPUs, such as the 28-core Intel Xeon Platinum 8180 processor with a 3.80GHz max turbo frequency, to bring the runtime down further.
Contacts
Americas: Kihoon Yoon, Sr. Principal Systems Dev Eng, Kihoon.Yoon@dell.com, +1 512 728 4191
This refers to an earlier version of SOAPdenovo, not SOAPdenovo2.
I am excited to announce the availability of Quick Start on the newly launched Wyse 5070 WIE10 thin client. Quick Start runs on first boot and can be launched manually as required. It provides the end user with an enhanced first-time out-of-box experience (OOBE) and informs the user about the product details, both hardware and software. Upon walking through the screens, the end user is prompted to configure the thin client if they choose to, or to simply proceed with using their brand-new Dell Wyse 5070 thin client.
Here are some screenshots:
A bit of background on the write filter: Microsoft provides the Unified Write Filter (UWF) for Windows 10 IoT Enterprise thin clients (previously there were other write filters, such as EWF and FBWF, for WES7, WE8S, etc.). The write filter starts on boot (enabling or disabling it prompts a reboot) and captures all writes to disk in an overlay called the write filter cache. This cache can live in RAM (typically the case for thin clients, and what we are discussing today) or on storage. So apps think they are writing persistently to disk when in actuality they are writing to volatile RAM, and those writes are lost when the unit reboots. Since these are typically VDI apps writing temporary data, and the user-generated data lives in the back-end VDI infrastructure, this is actually ideal. While not entirely relevant to this discussion, I should note that the UWF does provide a mechanism to bypass itself, with file, folder, and registry exclusions for programs such as Windows Defender that need to persist their writes frequently.
The write filter has two main functions: it redirects all disk writes into the overlay cache so the underlying image stays pristine, and it allows exclusions so that selected files, folders, and registry keys can persist across reboots.
Typical Windows thin clients have at most 1 GB of UWF cache (some with 8 GB of RAM have up to 2 GB). Once this UWF cache fills up, the OS starts complaining about low memory or low UWF cache size. Usually, once the UWF cache reaches a critical 90% level, the unit has to be rebooted. In most cases this doesn't happen for weeks, but in some deployments it happens almost daily or more often (causes can include excessively verbose logging by some applications, browser cache, etc.). This is an industry-wide issue for all Windows thin clients running with the write filter enabled.
I am excited to announce the release of our brand-new patent-pending product, Overlay Optimizer, which solves this very issue. Without getting into the details (the "sausage-making"), I will say that Overlay Optimizer ensures that your Windows thin client doesn't need to reboot as frequently as it used to. Not only does this mean greater system up-time and therefore a much better end-user experience; you can also avoid the need to upgrade your thin clients from, say, 4 GB of RAM to 8 GB. This patent-pending software is only available on Dell thin clients and will help our customers extract more performance and up-time from them.
Overlay Optimizer is available for all Dell Thin Clients running Windows 10 IoT Enterprise and can be downloaded for free from here:
Hope this helps. Please comment if this helps solve your issues.
Gene expression analysis is as important as identifying Single Nucleotide Polymorphisms (SNPs), InDels, or chromosomal restructuring. Ultimately, all physiological and biochemical events depend on the final gene expression products: proteins. Many quantitative scientists and non-biologists tend to oversimplify the flow of genetic information and forget what the actual roles of proteins are. Simplification is the beginning of most scientific fields; however, it is too optimistic to think that this practice also works for biology. Although all human organs contain identical genomic compositions, the actual proteins expressed in different organs are completely different. Ideally, a technology that could quantify all the proteins in a cell would accelerate the progress of life science significantly; however, we are far from achieving that. Here, in this blog, we test one popular RNA-Seq data analysis pipeline known as the Tuxedo pipeline. The Tuxedo suite offers a set of tools for analyzing a variety of RNA-Seq data, including short-read mapping, identification of splice junctions, transcript and isoform detection, differential expression, visualizations, and quality control metrics.

A typical RNA-Seq data set consists of multiple samples, as shown in Figure 1. Although the number of sample sets depends on the biological experimental design, two sets of samples are commonly used to make comparisons, for example between normal and cancer samples or between untreated and treated samples.
Figure 1 Tested Tuxedo pipeline workflow
All the samples are aligned individually in Step 1. In this pipeline, the Tophat process uses Bowtie 2 version 2.3.1 as the underlying short-read aligner. In Step 3, the Cuffmerge job depends on all the jobs in Step 2: the results of the Cufflinks jobs are collected at this step, merging the multiple Cufflinks assemblies as required by the Cuffdiff step. Cuffmerge also runs Cuffcompare in the background and automatically filters out transcription artifacts. Cuffnorm generates tables of expression values that are properly normalized for library size, and these tables can be used for other statistical analyses instead of CummeRbund. In Step 5, the CummeRbund step is set to generate three plots (gene density, gene box, and volcano plots) using an R script.

A performance study of an RNA-Seq pipeline is not trivial because the nature of the workflow requires non-identical input files. 185 RNA-Seq paired-end read data sets were collected from public data repositories. All the read data files contain around 25 Million Fragments (MF) and have similar read lengths. The samples for a test were randomly selected from this pool of 185 paired-end read files. Although the randomly selected data will not have any biological meaning, they certainly put the tests in a worst-case scenario with a very high level of noise.

The test cluster configurations are summarized in Table 1.
The test cluster and the H600 storage system were connected via 4 x 100GbE links between two Dell Networking Z9100-ON switches. Each compute node was connected to the test-cluster-side Dell Networking Z9100-ON switch via a single 10GbE link. The four storage nodes in the Dell EMC Isilon H600 were connected to the other switch via 4 x 40GbE links. The configuration of the storage is listed in Table 2.
Front end network: 40GbE, Back end network: IB QDR
DEG analysis requires at least two samples. In Figure 2, each step described in Figure 1 is submitted to the Slurm job scheduler with the proper dependencies; for example, the Cuffmerge step must wait until all the Cufflinks jobs are completed. Two samples, say one normal and one treated, each begin with a Tophat step followed by a Cufflinks step. Upon completion of all the Cufflinks steps, Cuffmerge aggregates the gene expressions across all the samples provided. Then the subsequent steps, Cuffdiff and Cuffnorm, begin. The output of Cuffnorm can be used for other statistical analyses. The Cuffdiff step generates gene expression differences at the gene level as well as the isoform level. The CummeRbund step uses the R package CummeRbund to visualize the results, as shown in Figure 3. The total runtime with 38 cores and two PowerEdge C6420s is 3.15 hours.
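The dependency handling described above can be expressed with sbatch's `--dependency=afterok` flag, which holds a job until all listed jobs finish successfully. A simplified sketch (the script name and job IDs are hypothetical; in a real pipeline the IDs are parsed from the output of each earlier sbatch call):

```python
def dependency_flag(job_ids):
    """Build an sbatch --dependency flag that waits for all listed jobs to succeed."""
    return "--dependency=afterok:" + ":".join(str(j) for j in job_ids)

# Hypothetical job IDs from the per-sample Tophat -> Cufflinks chains (Steps 1-2).
cufflinks_jobs = [1001, 1002]
cmd = f"sbatch {dependency_flag(cufflinks_jobs)} cuffmerge.sbatch"
print(cmd)  # sbatch --dependency=afterok:1001:1002 cuffmerge.sbatch
```

The same pattern chains Cuffdiff and Cuffnorm behind the Cuffmerge job ID, so Slurm enforces the whole Figure 1 ordering without any manual waiting.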
Figure 2 Tuxedo pipeline with two samples
Figure 3 shows differentially expressed genes in red, with significantly lower p-values (Y-axis) than the other gene expressions, illustrated in black. The X-axis shows fold changes in log base 2, and the fold change of each gene is plotted against its p-value. More samples will yield a better gene expression estimate. The upper-right plot shows gene expression in sample 2 relative to sample 1, whereas the lower-left plot shows gene expression in sample 1 relative to sample 2. Genes plotted as black dots are not significantly different between the two samples.
Figure 3 Volcano plot of the Cuffdiff results
Typical RNA-Seq studies consist of multiple samples, sometimes hundreds of different samples: normal versus disease, or untreated versus treated. These samples tend to have a high level of noise for biological reasons; hence, the analysis requires a rigorous data preprocessing procedure. Here, we tested various numbers of samples (all different RNA-Seq data sets selected from the 185 paired-end read data sets) to see how much data can be processed by an 8-node PowerEdge C6420 cluster. As shown in Figure 4, the runtimes with 2, 4, 8, 16, 32, and 64 samples grow exponentially as the number of samples increases. The Cuffmerge step does not slow down as the number of samples grows, while the Cuffdiff and Cuffnorm steps slow down significantly. In particular, the Cuffdiff step becomes a bottleneck for the pipeline, since its running time grows exponentially (Figure 5). Although Cuffnorm's runtime also increases exponentially like Cuffdiff's, it can be ignored because it is bounded by Cuffdiff's runtime.
Figure 4 Runtime and throughput results
Figure 5 Behaviors of Cuffmerge, Cuffdiff and Cuffnorm
The throughput test results show that eight PowerEdge C6420 nodes with an Isilon H600 can process roughly 1 Billion Fragments, a little more than 32 samples of ~50 million paired reads (25 MF) each, through the Tuxedo pipeline illustrated in Figure 1.
Since the Tuxedo pipeline is relatively faster than other popular pipelines, it is hard to generalize these results for sizing an HPC system. However, they provide a good reference point to help design a right-sized HPC system.
Americas: Kihoon Yoon, Sr. Principal Systems Dev Eng, Kihoon.Yoon@dell.com, +1 512 728 4191
“For RNA sequencing, determining coverage is complicated by the fact that different transcripts are expressed at different levels. This means that more reads will be captured from highly expressed genes, and few reads will be captured from genes expressed at low levels. When planning RNA sequencing experiments, researchers usually think in terms of numbers of millions of reads to be sampled.” – cited from https://www.illumina.com/documents/products/technotes/technote_coverage_calculation.pdf
Runtime refers to wall-clock time throughout this blog.
Written by Bruce Wagner
This white paper compares the compute throughput and energy consumption of the user-selectable system power profiles available on the 14th generation PowerEdge R740 2U/2S rack server.
Shane Kavanagh – Member Technical Staff, ESI Architecture, Dell EMC
For years now, the tech industry has been talking about the "data tsunami", the ongoing trend of increasing amounts of data that need to be stored and analyzed. This is primarily driven by the explosion in active connected devices and the desire to use the data they collect to provide better services (e.g., more efficient homes, smart cities, self-driving vehicles). While this trend has been going on for quite some time, its end is nowhere in sight.
In response to this pressing market need, the industry is delivering ever-increasing storage density to match the growth in generated data. But more than just density is at play. As data centers apply these solutions at scale, cost becomes a limiting concern, and as ever, performance is always a consideration. So, the challenge is really to provide greater density at lower cost and acceptable performance.
The density challenge
The storage density challenge for the industry is to deliver one petabyte of storage in a 1U form factor. There are multiple ways this could be achieved. One approach is to use 10 x 128 TB U.2 SSD devices, but at today's prices that would be cost prohibitive. You could consider using a custom form factor in your solution, but this makes it difficult to leverage the cost and supply benefits of high-volume offerings in the market and requires changes to platform designs.
In response to the cost challenge of deploying large SSDs, an innovative approach that the Dell EMC Extreme Scale Infrastructure group is employing with a select group of customers is to use smaller, relatively low capacity, lower cost SSDs in standard form factors (i.e., M.2 devices) integrated in proven platforms (i.e., PowerEdge C4140). This allows us to provide highly dense NVMe based systems at costs approaching today’s SATA SSD cost points – and this approach has the added benefit of a more granular failure domain.
The solution we are exploring with these customers delivers M.2 devices on a PCIe card that conforms to a standard GPU adapter size, making it easy to plug into existing platforms that accommodate GPUs. (See example illustration.)
One of the keys to success with this approach is the inclusion of a high performance PCIe switch that fans out the PCIe lanes among the M.2 devices.
At today's M.2 capacities, this results in almost 100 TB per card, but note that M.2 device capacities are about to double in the next year, allowing the card to approach almost 200 TB. Once this higher capacity is reached, placing four of these cards in a PowerEdge C4140 provides in excess of half a petabyte, and as M.2 capacities grow, this design readily scales beyond one petabyte in 1U.
Performance
Keep in mind, while this dense storage capacity is being delivered at SATA-level costs, it is also significantly faster. Because we are delivering SSDs using the NVMe interface, the system will have performance levels well in excess of those available with SATA in an equivalent 1U system.
When delivered in a bandwidth-optimized system like the PowerEdge C4140 and paired with two 100 Gb NICs, this solution can deliver 200 Gb of bandwidth in a 1U form factor. So in just 5U, that quickly adds up to 1 Tb of throughput and millions of IOPS, along with more than 1.5 petabytes of storage (readily scaling to 3 PB when M.2 capacities double!).
4 high density NAND modules in a C4140
Data Driven Workloads
This high density all-flash solution is ideal for handling the sustained ingest of massive amounts of data, for example, as a front end in an edge computing architecture. It can work in conjunction with a Machine Learning backend, or any number of IoT functions that require large amounts of data to feed real-time analytics, like self-driving vehicles, satellite imagery, and weather telemetry.
Impressive storage density, extremely high bandwidth, and easy, technology-paced scalability: Dell EMC can offer large-scale customers innovative all-flash solutions for their toughest data challenges. Inquiries about Extreme Scale Infrastructure solutions can be made at ESI@dell.com.
This blog post is written by Revathi A from the Dell Hypervisor Engineering team.
NOTE: This list is updated as new Dell EMC platforms launch. We recommend reviewing the VMware HCL page for updates.