The microbiome plays a critical role in human and environmental health, yet the makeup and interactions of complex communities are not well understood. Genome sequencing on DNA extracted from microbiomes is used to study the diversity, integration, and dynamics of organisms in microbiomes. Due to the size and complexity of the datasets involved and the growth of community datasets, assembly and analysis are the most computationally demanding aspects of this branch of bioinformatics. The ExaBiome project is developing scalable data assembly and analysis tools to address current needs and, through the use of exascale computing power, provide solutions for anticipated increases in biological data.
Metagenomics, the application of high-throughput genome sequencing to DNA extracted from microbial communities, is a powerful and general method for studying microbial diversity, integration, and dynamics. Introduced over a decade ago, it has since become an essential and routine tool. The assembly and analysis of metagenomic datasets are among the most computationally demanding tasks in bioinformatics. Given the scale and rate of growth of these datasets, exascale resources will be required to process (i.e., assemble) them and to interpret them through annotation and comparative analysis.
The ExaBiome project aims to provide scalable tools for three core computational problems in metagenomics: (1) metagenome assembly, which takes raw sequence data and produces long genome sequences for each species; (2) protein clustering and annotation, which finds families of closely related proteins and identifies their functional behavior; and (3) signature-based approaches to enable scalable and efficient comparative metagenome analysis, which might show, for example, the variability of an environmental community over time.
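As a rough illustration of what a signature-based comparison can look like, the sketch below reduces each sample to a small MinHash signature over its k-mers and estimates the Jaccard similarity of two samples from the signatures alone. This is a generic technique chosen for illustration, not a description of the ExaBiome tools; the function names, the choice of k, the signature size, and the toy sequences are all assumptions.

```python
import hashlib

def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def _hash(kmer):
    # Stable 64-bit hash of a k-mer (illustrative choice of hash function).
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def signature(seqs, k=5, size=64):
    """Bottom-`size` MinHash sketch: the `size` smallest distinct k-mer hashes."""
    hashes = set()
    for s in seqs:
        hashes.update(_hash(km) for km in kmers(s, k))
    return sorted(hashes)[:size]

def jaccard_estimate(sig_a, sig_b, size=64):
    """Estimate the Jaccard similarity of two samples from their sketches."""
    # Take the bottom-`size` hashes of the union, then count how many of
    # them appear in both sketches.
    merged = sorted(set(sig_a) | set(sig_b))[:size]
    if not merged:
        return 0.0
    shared = set(sig_a) & set(sig_b)
    return sum(1 for h in merged if h in shared) / len(merged)

if __name__ == "__main__":
    # Hypothetical toy "samples"; real inputs would be sequencing reads.
    sample_t0 = ["ACGTACGTTGCA", "GGATCCGATTAC"]
    sample_t1 = ["ACGTACGTTGCA", "TTTTGGGGCCCC"]
    print(jaccard_estimate(signature(sample_t0), signature(sample_t1)))
```

Because each signature has a fixed size regardless of how large the sample is, comparing many samples (for example, one community sampled at several time points) costs far less than comparing the full k-mer sets.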
The ExaBiome team has developed MetaHipMer, the first scalable distributed-memory metagenome assembler. MetaHipMer scales to thousands of compute nodes on today’s petascale architectures and has assembled the largest environmental datasets to date (up to 7.7 TB), something that was impossible with previous assemblers. The team continues to work on further scalability improvements and node-level optimizations to take advantage of fine-grained on-node parallelism and memory structures, including GPUs. MetaHipMer produces assemblies whose quality is competitive with that of other assemblers, and the team continues to add new features driven by the experience of science teams. MetaHipMer is designed for short-read (Illumina) data, but a second assembler for long reads, DiBella, is also under development. DiBella shows even higher computational intensity, which might make it a good fit for exascale systems, especially GPU-based ones.
ExaBiome’s challenge problem is to demonstrate a high-quality assembly or set of assemblies on at least 50 TB of environmental data (reads) that runs across a full exascale machine. The intent is to assemble an environmental dataset drawn from multiple temporal or spatial samples, which could reveal new species and insights into the makeup of complex communities. This approach has been demonstrated to improve on current state-of-the-art assembly pipelines, which are forced to use subsampling when datasets get large. This constraint limits researchers’ ability to assemble rare, low-coverage species and can result in confusing genome duplications. Furthermore, assembling data across both temporal and spatial scales will not only enhance assembly quality but could also reveal functions that would otherwise remain hidden. Addressing this challenge problem will demonstrate a first-in-class science capability by combining the power of exascale computing with novel graph algorithms.
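To see why subsampling is hard on rare species, consider the standard back-of-the-envelope estimate of sequencing depth: the bases attributable to a genome divided by that genome's length. The sketch below applies it with purely hypothetical numbers (one billion 150-base reads, a 5 Mbp genome contributing 0.1% of the sequenced bases, and a 10% subsample); none of these figures come from the project.

```python
def expected_coverage(total_reads, read_length, base_fraction, genome_size):
    """Mean depth of coverage for one genome in a mixed sample.

    base_fraction is the share of all sequenced bases that originate
    from this genome (a simplifying assumption for illustration).
    """
    return total_reads * read_length * base_fraction / genome_size

def coverage_after_subsampling(coverage, keep_fraction):
    """Uniform random subsampling of reads scales expected depth linearly."""
    return coverage * keep_fraction

if __name__ == "__main__":
    # Hypothetical rare species in a large environmental dataset.
    full = expected_coverage(
        total_reads=1_000_000_000,
        read_length=150,
        base_fraction=0.001,
        genome_size=5_000_000,
    )
    reduced = coverage_after_subsampling(full, keep_fraction=0.1)
    print(f"full dataset: {full:.1f}x, after 10% subsample: {reduced:.1f}x")
```

Under these assumptions the rare genome drops from roughly 30x depth to roughly 3x, which is why assembling the full dataset rather than a subsample matters for low-abundance organisms.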
This project is expected to deliver a range of scientific benefits, such as enhancing our understanding of microbial functions that can aid in environmental remediation, food production, and medical research. Given the growth of genomic data, a scientifically interesting 50 TB environmental sample should be available by 2022 and is expected to be large enough to fully use an exascale machine. However, the challenge problem could also use synthetic data with environmental characteristics or an ensemble assembly of multiple independent environmental datasets. It might also use short reads, long reads, or a hybrid of the two.
More details about the ExaBiome project, including publicly available software, can be found at https://sites.google.com/lbl.gov/exabiome