ExaBiome

The microbiome plays a critical role in human and environmental health, yet the makeup and interactions of complex microbial communities are not well understood. Genome sequencing of DNA extracted from microbiomes is used to study the diversity, integration, and dynamics of the organisms in these communities. Because the datasets involved are large, complex, and growing rapidly, assembly and analysis are the most computationally demanding aspects of this branch of bioinformatics. The ExaBiome project is developing scalable assembly and analysis tools to address current needs and, through the use of exascale computing power, provide solutions for anticipated increases in biological data.

Project Details

Metagenomics, the application of high-throughput genome sequencing to DNA extracted from microbial communities, is a powerful and general method for studying microbial diversity, integration, and dynamics. Since its introduction over a decade ago, metagenomics has become an essential and routine tool. The assembly and analysis of metagenomic datasets are among the most computationally demanding tasks in bioinformatics, and the scale and growth rate of these datasets will require exascale resources to process (i.e., assemble) and to interpret through annotation and comparative analysis.

The ExaBiome project aims to provide scalable tools for three core computational problems in metagenomics: (1) metagenome assembly, which takes raw sequence data and produces long genome sequences for each species; (2) protein clustering and annotation, which finds families of closely related proteins and identifies their functional behavior; and (3) signature-based approaches to enable scalable and efficient comparative metagenome analysis, which might show, for example, the variability of an environmental community over time.
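A common signature-based technique is the MinHash k-mer sketch: each dataset is reduced to a small, fixed-size set of minimal k-mer hashes, and the overlap between two sketches estimates the k-mer Jaccard similarity of the underlying communities. The Python toy below is a minimal illustration of that general idea, not ExaBiome's implementation; all names in it are illustrative.

    import hashlib

    def kmer_hashes(seq, k=21):
        """Hash every k-mer of a DNA sequence to a 64-bit integer."""
        for i in range(len(seq) - k + 1):
            yield int.from_bytes(hashlib.sha1(seq[i:i + k].encode()).digest()[:8], "big")

    def sketch(reads, k=21, size=256):
        """Bottom-k MinHash signature: keep the `size` smallest k-mer hashes."""
        hashes = set()
        for read in reads:
            hashes.update(kmer_hashes(read, k))
        return set(sorted(hashes)[:size])

    def jaccard_estimate(a, b, size=256):
        """Estimate k-mer Jaccard similarity of two samples from their sketches."""
        merged = sorted(a | b)[:size]  # bottom-k of the union of both sketches
        shared = sum(1 for h in merged if h in a and h in b)
        return shared / len(merged)

    # Toy reads; real inputs are billions of reads per sample.
    sample_a = ["ACGTACGTACGTACGTACGTACGT"]
    sample_b = ["ACGTACGTACGTACGTACGTAGGT"]
    print(jaccard_estimate(sketch(sample_a), sketch(sample_b)))  # estimated similarity

Because a fixed-size sketch stands in for billions of k-mers, comparing many samples (e.g., tracking an environmental community over time) reduces to fast set operations on small signatures.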

The ExaBiome team has developed MetaHipMer, the first scalable distributed-memory metagenome assembler. MetaHipMer scales to thousands of compute nodes on today’s petascale architectures and has assembled the largest environmental datasets to date (up to 7.7 TB), something that was impossible with previous assemblers. The team continues to work on further scalability improvements and node-level optimizations that exploit fine-grained on-node parallelism and memory hierarchies, including GPUs. MetaHipMer’s assembly quality is competitive with other assemblers, and the team continues to add new features driven by the experience of science teams. MetaHipMer is designed for short-read (Illumina) data; a second assembler for long reads, diBELLA, is also under development. Its even higher computational intensity may make it a good fit for exascale systems, especially GPU-based ones.
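At its core, MetaHipMer, like most short-read assemblers, is a de Bruijn graph assembler: reads are decomposed into overlapping k-mers, and contigs are produced by walking unambiguous paths through the resulting graph. The single-node toy below illustrates only that core idea; in MetaHipMer the graph lives in a distributed hash table spanning thousands of nodes, and the sketch omits the assembler's error handling and scaffolding entirely.

    from collections import defaultdict

    def build_graph(reads, k=4):
        """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
        graph = defaultdict(set)
        for read in reads:
            for i in range(len(read) - k + 1):
                kmer = read[i:i + k]
                graph[kmer[:-1]].add(kmer[1:])
        return graph

    def walk_contig(graph, start):
        """Extend a contig from `start` while the path through the graph is unambiguous."""
        contig, node = start, start
        while len(graph.get(node, ())) == 1:
            (node,) = graph[node]
            contig += node[-1]
            if node == start:  # guard against cycles
                break
        return contig

    reads = ["ACGTAG", "GTAGCT", "AGCTTA"]
    print(walk_contig(build_graph(reads, k=4), "ACG"))  # ACGTAGCTTA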

ExaBiome’s challenge problem is to demonstrate a high-quality assembly, or set of assemblies, of at least 50 TB of environmental sequence data (reads) running across a full exascale machine. The intent is to assemble an environmental dataset drawn from multiple temporal or spatial samples, which could reveal new species and insights into the makeup of complex communities. Co-assembling samples in this way has been shown to improve on current state-of-the-art assembly pipelines, which are forced to subsample when datasets grow large; subsampling limits researchers’ ability to assemble rare, low-coverage species and can introduce confusing genome duplications. Furthermore, assembling data across both temporal and spatial scales will not only enhance assembly quality but could also reveal functions that would otherwise remain hidden. Addressing this challenge problem will demonstrate a first-in-class science capability by combining the power of exascale computing with novel graph algorithms.

The project is expected to yield many beneficial science impacts, such as improving our understanding of microbial functions that can aid environmental remediation, food production, and medical research. Given the growth of genomic data, a scientifically interesting 50 TB environmental sample should be available by 2022 and is expected to be large enough to fully use an exascale machine. Alternatively, the challenge problem could use synthetic data with environmental characteristics or an ensemble assembly of multiple independent environmental datasets, and it might use short reads, long reads, or a hybrid of the two.

More details about the ExaBiome project, including publicly available software, can be found at https://sites.google.com/lbl.gov/exabiome.

Principal Investigator(s):

Katherine Yelick, Lawrence Berkeley National Laboratory

Collaborators:

Lawrence Berkeley National Laboratory, Joint Genome Institute, Los Alamos National Laboratory

Progress to Date

  • Scalable HipMer and MetaHipMer performance was demonstrated on over 1,000 nodes.
  • The team completed the assembly of a 7.7 TB tropical soils dataset, the largest metagenome assembled to date. The computation took approximately 1.4 hours on 512 Summit nodes and used a total of 84 TB of memory.
  • HipMCL clustered 383 million proteins in less than 1 hour on 729 Summit nodes, and scalable HipMCL performance was demonstrated on over 1,000 Summit nodes. On very dense protein similarity networks, the GPU-accelerated HipMCL achieves an order-of-magnitude speedup over CPU-only HipMCL. (The Markov clustering iteration at the heart of HipMCL is sketched after this list.)
  • A high-performance distributed-memory overlapper/aligner for long reads was implemented. (A toy example of the k-mer seeding idea behind overlap detection follows the clustering sketch below.)
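
HipMCL is a high-performance implementation of the Markov Cluster (MCL) algorithm. MCL finds protein families by iterating two steps on a column-stochastic similarity matrix: expansion (matrix squaring, which spreads flow along paths) and inflation (elementwise powering plus column renormalization, which sharpens strong connections) until the matrix converges; the nonzero rows of the converged matrix mark the clusters. The dense NumPy toy below is a minimal single-node illustration; HipMCL performs the same iteration on enormous sparse matrices distributed across thousands of nodes.

    import numpy as np

    def mcl(adjacency, inflation=2.0, iters=50, tol=1e-6):
        """Toy dense MCL: return the converged flow matrix."""
        m = adjacency + np.eye(len(adjacency))  # self-loops stabilize the iteration
        m = m / m.sum(axis=0)                   # make columns stochastic
        for _ in range(iters):
            prev = m
            m = m @ m                           # expansion
            m = m ** inflation                  # inflation
            m = m / m.sum(axis=0)               # renormalize columns
            if np.abs(m - prev).max() < tol:
                break
        return m

    def clusters(flow, eps=1e-3):
        """Each surviving (attractor) row's nonzero columns form one cluster."""
        return sorted({tuple(int(j) for j in np.nonzero(row > eps)[0])
                       for row in flow if row.sum() > eps})

    # Two triangles joined by a single edge resolve into two clusters.
    a = np.array([[0, 1, 1, 0, 0, 0],
                  [1, 0, 1, 0, 0, 0],
                  [1, 1, 0, 1, 0, 0],
                  [0, 0, 1, 0, 1, 1],
                  [0, 0, 0, 1, 0, 1],
                  [0, 0, 0, 1, 1, 0]], dtype=float)
    print(clusters(mcl(a)))  # expected: [(0, 1, 2), (3, 4, 5)]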
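Overlap detection between long reads typically starts from k-mer seeds: reads that share enough k-mers become candidate pairs for full alignment. The dict-based toy below illustrates only that seeding step, assuming exact k-mer matches; the actual distributed-memory overlapper/aligner must also handle sequencing errors and scale this computation across many nodes.

    from collections import defaultdict
    from itertools import combinations

    def shared_kmer_counts(reads, k=5):
        """Count k-mers shared by each pair of reads (candidate overlaps)."""
        owners = defaultdict(set)              # k-mer -> set of read indices
        for idx, read in enumerate(reads):
            for i in range(len(read) - k + 1):
                owners[read[i:i + k]].add(idx)
        pairs = defaultdict(int)               # (i, j) -> number of shared k-mers
        for readset in owners.values():
            for i, j in combinations(sorted(readset), 2):
                pairs[(i, j)] += 1
        return dict(pairs)

    reads = ["ACGTACGTTT", "CGTACGTTTA", "GGGGCCCCAA"]
    print(shared_kmer_counts(reads))  # {(0, 1): 5}: reads 0 and 1 overlap; read 2 matches nothing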

The ExaBiome project is providing exascale solutions for the assembly and analysis of metagenomic data that will address both current and future data processing needs in bioinformatics.
