By Garima Kochhar and Kihoon Yoon. Dell EMC HPC Innovation Lab. October 2016

This blog presents performance results for the 2D alignment and 2D classification phases of the Cryo-electron microscopy (Cryo-EM) data processing workflow using the new Intel Knights Landing architecture, and compares these results to the performance of the Intel Xeon E5-2600 v4 family. A quick description of Cryo-EM and the different phases in the process of reconstructing 3D molecular structures with electron microscopy is provided below, followed by the specific tests conducted in this study and the performance results.

Cryo-EM allows molecular samples to be studied in near-native states and down to nearly atomic resolutions. Studying the 3D structure of these biological specimens can lead to new insights into their functioning and interactions, especially for proteins and nucleic acids, and allows structural biologists to examine how alterations in their structures affect their functions. This information can be used in systems biology research to understand the cell signaling network, a complex communication system that controls fundamental cell activities and maintains normal cell homeostasis. Errors in the cellular signaling process can lead to diseases such as cancer, autoimmune disorders, and diabetes. Studying the functioning of the proteins responsible for an illness enables a biologist to develop specific drugs that interact with the protein effectively, thus improving the efficacy of treatment.

The workflow from the time a molecular sample is created to the creation of a 3D model of its molecular structure involves multiple steps. These steps are briefly (and simplistically!) described below.

  1. Samples of the molecule (protein, enzyme, etc.) are purified and concentrated in a solution.
  2. This sample is placed on an electron microscope grid and plunge-frozen. This forms a very thin layer of vitreous ice that surrounds and immobilizes the sample in its near-native state.
  3. The frozen sample is now placed in the microscope for imaging.
  4. The output of the microscope consists of many large image files across multiple fields of view (many terabytes of data).
  5. Due to the low energy beams used in Cryo-EM (to avoid damaging the structures being studied), the images produced by the microscope have a poor signal-to-noise ratio. To improve the results, the microscope takes multiple images of each field of view. Motion-correction techniques are then applied so that the multiple images of the same molecule can be added together into an image with less noise.
  6. The next step is a manual process of picking “good-looking” molecule images from a few fields of view.
  7. The frozen sample consists of many molecules that are in many different positions. The resultant Cryo-EM images therefore show the particle from many different angles. So, the next step is a 2D alignment phase to uniformly orient the images by image rotation and translation.
  8. Next, a 2D classification phase searches through these oriented images and sorts them into “classes”, grouping images that share the same view (a simplified sketch of these two steps follows this list).
  9. After alignment and classification, there should be multiple collections of images, where each collection contains images showing the molecule from the same angle and with the same shape (a “class”). The images in a class are now combined into a composite image that provides a higher quality representation of that shape.
  10. Finally, a 3D reconstruction of the molecule is built from all the composite 2D images.
  11. This 3D model can then be handed back to the structural biologist for further analysis, visualization, etc.
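
To make steps 7 and 8 concrete, the sketch below shows the basic idea in a very simplified form: each noisy particle image is brought into a common orientation by searching over rotations and translations against a reference, and the aligned images are then grouped into classes and averaged. This is not the ROME or RELION algorithm (both use far more sophisticated statistical methods, and real alignment is reference-free); it is just a toy illustration of the concept using numpy, scipy and scikit-learn, with randomly generated "particles" standing in for real micrograph data.

```python
# Toy illustration of 2D alignment (step 7) and 2D classification (step 8).
# NOT the ROME/RELION method -- just the basic idea, for orientation.
import numpy as np
from scipy.ndimage import rotate, shift
from sklearn.cluster import KMeans

def align_to_reference(image, reference, angles=range(0, 360, 10), max_shift=4):
    """Brute-force search for the rotation and translation that best match the
    reference, scored by cross-correlation. Returns the aligned image."""
    best_score, best_img = -np.inf, image
    for angle in angles:
        rotated = rotate(image, angle, reshape=False, order=1)
        for dy in range(-max_shift, max_shift + 1):
            for dx in range(-max_shift, max_shift + 1):
                candidate = shift(rotated, (dy, dx), order=1)
                score = np.sum(candidate * reference)   # cross-correlation score
                if score > best_score:
                    best_score, best_img = score, candidate
    return best_img

def classify(aligned_images, n_classes=4):
    """Group aligned images into classes by pixel-space similarity (k-means),
    then average each class into a composite with better signal-to-noise."""
    data = np.stack([img.ravel() for img in aligned_images])
    labels = KMeans(n_clusters=n_classes, n_init=10).fit_predict(data)
    shape = aligned_images[0].shape
    return [data[labels == c].mean(axis=0).reshape(shape) for c in range(n_classes)]

# Hypothetical usage with synthetic noisy "particles":
rng = np.random.default_rng(0)
reference = rng.normal(size=(64, 64))
particles = [rotate(reference, rng.integers(0, 360), reshape=False, order=1)
             + rng.normal(scale=2.0, size=(64, 64)) for _ in range(20)]
aligned = [align_to_reference(p, reference) for p in particles]
class_averages = classify(aligned, n_classes=4)
```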

As is now clear, the Cryo-EM processing workflow must handle a large volume of data, requires computationally intensive algorithms and considerable compute power for the 2D and 3D phases, and must move data efficiently across the multiple phases in the workflow. Our goal is to design a complete HPC system that can support the Cryo-EM workflow from start to finish and is optimized for performance, energy efficiency and data efficiency.


Performance Tests and Configuration

Focusing for now on the 2D phases of the workflow, this blog presents results for steps #7 and #8 listed above - the 2D alignment and 2D classification phases. Two software packages in this domain, ROME and RELION, were benchmarked on Knights Landing (KNL, the code name for the Intel Xeon Phi 7200 family) and Broadwell (BDW, the code name for the Intel Xeon E5-2600 v4 family) processors.

The tests were run on systems with the following configuration.

Broadwell-based systems

Server: 12 * Dell PowerEdge C6320
Processor: Intel Xeon E5-2697 v4, 18 cores per socket, 2.3 GHz
Memory: 128 GB at 2400 MT/s
Interconnect: Intel Omni-Path fabric

KNL-based systems

Server: 12 * Dell PowerEdge C6320p
Processor: Intel Xeon Phi 7230, 64 cores, 1.3 GHz
Memory: 96 GB at 2400 MT/s
Interconnect: Intel Omni-Path fabric

Software

Operating System: Red Hat Enterprise Linux 7.2
Compilers: Intel 2017, 17.0.0.098 Build 20160721
MPI: Intel MPI 5.1.3
ROME: 1.0a
RELION: 1.4

Benchmark Datasets

RING11_ALL: Set1. Inflammasome data: 16306 images of NLRC4/NAIP2 inflammasome, 250 x 250 pixels each
DATA6: Set4. RP-a: 57001 images of proteasome regulatory particles (RP), 160 x 160 pixels each
DATA8: Set2. RP-b: 35407 images of proteasome regulatory particles (RP), 160 x 160 pixels each


ROME

ROME performs the 2D alignment (step #7 above) and the 2D classification (step #8 above) in two separate phases, called the MAP phase and the SML phase respectively. For our tests we set “-k” to 50 for MAP (i.e., 50 initial classes) and to 1000 for SML (i.e., 1000 final 2D classes).

The first set of graphs below, Figure 1 and Figure 2, show the performance of the SML phase on KNL. The compute portion of the SML phase scales linearly as more KNL systems are added to the test bed, from 1 to 12 servers, as shown in Figure 1. The total time to run, shown in Figure 2, scales slightly below linear because it includes an I/O component in addition to the compute component. The test bed used in this study did not have a parallel file system and used only the local disks on the KNL servers. Future work for this project includes evaluating the impact of adding a Lustre parallel file system to this test bed and its effect on the total time for SML.
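
The "scales linearly" observation can be quantified as speedup and parallel efficiency relative to the single-server run; a minimal sketch of that arithmetic is below. The timings in the example are placeholders for illustration, not the measured values behind Figures 1 and 2.

```python
# Speedup and parallel efficiency relative to a 1-server baseline.
# The times below are illustrative placeholders, not measured results.
def scaling_report(times_by_nodes):
    """times_by_nodes: {node_count: wall_clock_seconds}"""
    base_nodes = min(times_by_nodes)
    base_time = times_by_nodes[base_nodes]
    for nodes in sorted(times_by_nodes):
        speedup = base_time / times_by_nodes[nodes]
        efficiency = speedup / (nodes / base_nodes)
        print(f"{nodes:>2} servers: speedup {speedup:5.2f}x, efficiency {efficiency:6.1%}")

# Example with made-up timings (seconds) for 1, 2, 4, 8 and 12 servers:
scaling_report({1: 1200, 2: 610, 4: 310, 8: 160, 12: 110})
```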

Figure 1 - ROME SML scaling on KNL, compute time

Figure 2 - ROME SML scaling on KNL, total time

The next set of graphs compares ROME SML performance on KNL and Broadwell. Figure 3, Figure 4 and Figure 5 plot the compute time for SML on 1 to 12 servers. The black circle on each graph shows the improvement in KNL runtime relative to BDW. For all three datasets that were benchmarked, KNL is about 3x faster than BDW. Note that we are comparing a single-socket KNL server to a dual-socket Broadwell server, so this is a server-to-server comparison (not socket-to-socket). This 3x advantage holds across different numbers of servers, showing that ROME SML scales well over Omni-Path on both KNL and BDW, with KNL's absolute compute time about 3x lower irrespective of the number of servers in the test.

Considering total time to run on KNL versus BDW, we measured KNL to be 2.4x to 3.4x faster than BDW at all node counts. Specifically, DATA6 is ~2.4x faster on KNL, DATA8 is 3x faster on KNL and RING11_ALL is 3.4x faster on KNL when considering total time to run. As mentioned before, the total time includes an I/O component, and one of the next steps in this study is to evaluate the performance improvement from adding a parallel file system to the test bed.

Figure 3 - DATA8 ROME SML on KNL and BDW

Figure 4 - DATA6 ROME SML on KNL and BDW


Figure 5 - RING11_ALL ROME SML on KNL and BDW


RELION

RELION accomplishes the 2D alignment and classification steps mentioned above in a single phase. Figure 6 shows our preliminary RELION results on KNL across 12 servers and on two of the test datasets. The “--K” parameter for RELION was set to 300, i.e., 300 classes for 2D classification. There are still several things to be tried here – the impact of a parallel file system on RELION (as discussed for ROME earlier) and dataset sensitivity to the parallel file system. Additionally, we plan to benchmark RELION on Broadwell, across different node counts and with different input parameters.
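
For readers unfamiliar with how a run like this is launched, the sketch below shows a hypothetical Python driver for a RELION 2D classification job over MPI. Only the "--K 300" setting comes from this study; the rank count, file names, iteration count and output path are assumptions for illustration, so consult the RELION documentation for the exact command line appropriate to your data.

```python
# Hypothetical driver for a RELION 2D classification run over MPI.
# File names, rank counts and iteration count are assumptions for illustration.
import subprocess

cmd = [
    "mpirun", "-np", "48",        # e.g. 4 MPI ranks per server on 12 servers (assumed layout)
    "relion_refine_mpi",
    "--i", "particles.star",      # input particle metadata (assumed file name)
    "--o", "Class2D/run1",        # output rootname (assumed path)
    "--K", "300",                 # 300 classes, as used in this study
    "--iter", "25",               # number of iterations (assumed value)
]
subprocess.run(cmd, check=True)
```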

Figure 6 - RELION 2D alignment and classification on KNL

Next Steps

The next steps in this project include adding a parallel file system to measure the impact on the workflow, tuning the test parameters for ROME MAP, SML and RELION, and testing on more datasets. We also plan to measure the power consumption of the cluster when running Cryo-EM workloads to analyze performance per watt and performance per dollar metrics for KNL.