The microbiome plays a critical role in human and environmental health, yet the makeup and interactions of complex microbial communities are not well understood. Genome sequencing of DNA extracted from microbiomes is used to study the diversity, integration, and dynamics of the organisms they contain. Because of the size and complexity of the datasets involved, and the continued growth of community datasets, assembly and analysis are the most computationally demanding aspects of this branch of bioinformatics. The ExaBiome project is developing scalable data assembly and analysis tools to address current needs and, through the use of exascale computing power, provide solutions for anticipated increases in biological data.
Metagenomics, the application of high-throughput genome sequencing to DNA extracted from microbial communities, is a powerful and general method for studying microbial diversity, integration, and dynamics. Since its introduction over a decade ago, metagenomics has become an essential and routine tool. The assembly and analysis of metagenomic datasets are among the most computationally demanding tasks in bioinformatics. The scale and rate of growth of these datasets will require exascale resources to process (i.e., assemble) and to interpret through annotation and comparative analysis.
The ExaBiome project aims to provide scalable tools for three core computational problems in metagenomics: (1) metagenome assembly, which takes raw sequence data and produces long genome sequences for each species; (2) protein clustering and annotation, which finds families of closely related proteins and identifies their functional behavior; and (3) signature-based approaches to enable scalable and efficient comparative metagenome analysis, which might show, for example, the variability of an environmental community over time.
The ExaBiome team has developed MetaHipMer, the first scalable distributed-memory metagenome assembler. MetaHipMer scales to thousands of compute nodes on today’s petascale architectures and has assembled the largest environmental datasets to date (up to 30 TB), which was impossible with previous assemblers. The team continues to work on scalability improvements and node-level optimizations that take advantage of fine-grained on-node parallelism and memory structures, including GPUs. MetaHipMer produces assemblies whose quality is competitive with that of other assemblers, and the team continues to add new features driven by the experience of science teams. MetaHipMer is designed for short-read (Illumina) data. A second assembler for long reads, DiBella, is also under development and shows even higher computational intensity, which might make it a good fit for exascale systems, especially GPU-based ones.
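To make the assembly problem concrete, the following is a minimal single-process sketch of the general idea behind k-mer-based assembly: count k-mers, discard rare ones that are likely sequencing errors, and walk unambiguous paths in a de Bruijn graph to form contigs. It is a toy illustration only and is not MetaHipMer's implementation. The function names, the toy reads, and the `min_count` threshold are all assumptions made for this example, and reverse complements, distributed memory, and GPUs are ignored.

```python
from collections import Counter, defaultdict

def count_kmers(reads, k):
    """Count every length-k substring (k-mer) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def build_graph(kmers):
    """De Bruijn graph: nodes are (k-1)-mers, each k-mer is an edge prefix -> suffix."""
    succ, pred = defaultdict(set), defaultdict(set)
    for kmer in kmers:
        succ[kmer[:-1]].add(kmer[1:])
        pred[kmer[1:]].add(kmer[:-1])
    return succ, pred

def extend_contig(start, succ, pred):
    """Walk right from `start` while exactly one edge leaves and one edge enters."""
    contig, node, seen = start, start, {start}
    while len(succ[node]) == 1:
        nxt = next(iter(succ[node]))
        if len(pred[nxt]) != 1 or nxt in seen:
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig

def assemble(reads, k, min_count=2):
    """Toy assembler: keep k-mers seen at least `min_count` times, then build contigs."""
    counts = count_kmers(reads, k)
    solid = [kmer for kmer, c in counts.items() if c >= min_count]
    succ, pred = build_graph(solid)
    # Start each contig at a node that has no incoming edge.
    starts = [node for node in list(succ) if not pred.get(node)]
    return sorted(extend_contig(s, succ, pred) for s in starts)

# Hypothetical toy input: overlapping reads at 2x coverage plus one erroneous read.
toy_reads = ["ATGGCGTA", "GCGTACCT", "TACCTGA"] * 2 + ["GGCGAA"]
contigs = assemble(toy_reads, k=4)
print(contigs)
```

In this toy input, the erroneous read contributes k-mers that appear only once. Filtering them out removes the false branch in the graph, so the walk can continue through the whole sequence. Lowering `min_count` to 1 keeps the branch and the contig stops early.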
ExaBiome’s challenge problem is to demonstrate a high-quality assembly or set of assemblies on at least 50 TB of environmental data (reads) that runs across a full exascale machine. The reads will be drawn from the Tara Oceans dataset, which consists of multiple temporal and spatial environmental samples from oceans around the world. Coassembly of this dataset could reveal new species and insights into the makeup of complex ocean microbial communities. The coassembly approach has been demonstrated to improve on current state-of-the-art assembly pipelines, which are forced to subsample when datasets get large. Subsampling limits researchers’ ability to assemble rare, low-coverage species and can result in confusing genome duplications. Furthermore, coassembling data across both time and space will not only enhance assembly quality but could also reveal functions that would otherwise remain hidden. Addressing this challenge problem will demonstrate a first-in-class science capability by combining the power of exascale computing with novel graph algorithms.
For protein analysis, the ExaBiome team has developed the similarity search tool PASTIS (Protein Alignment via Sparse Matrices) and the clustering tool HipMCL (High-performance Markov Clustering). Scalability and high performance are of paramount importance for uncovering novel phenomena that occur at very large scales in proteins arising in metagenomics research. Other tools in this field often sacrifice quality and sensitivity when the scale of data grows beyond a few tens of millions. Both ExaBiome tools cut the time required from weeks to hours without sacrificing quality or sensitivity. In addition to customized load-balancing techniques, PASTIS contains novel algorithms such as Distributed Blocked 2D Sparse SUMMA for overlap detection. This algorithm overcomes the memory requirements of the many-against-many search and enables in-memory similarity search. HipMCL relies on fast randomized algorithms to estimate the memory required by the clustering and performs the clustering in stages to reduce memory overhead. Both tools have extensive GPU support and make use of all the resources on a node by distributing components between GPU and CPU resources according to factors such as memory footprint and computational intensity. Recently, HipMCL enabled clustering of more than half a billion protein sequences and helped in the exploration of protein dark matter, a task that proved infeasible with other clustering tools.
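For readers unfamiliar with Markov clustering, the sketch below shows the textbook form of the algorithm on a small dense matrix: alternate expansion (squaring the column-stochastic matrix) with inflation (element-wise powering followed by column renormalization) until the matrix stops changing, then read clusters off the surviving rows. It is a generic single-process illustration and not HipMCL's code. The sparse distributed matrices, randomized memory estimation, and staged execution described above are omitted, and the function names, the toy similarity matrix, and the parameter defaults are assumptions made for this example.

```python
def normalize_columns(m):
    """Scale each column so it sums to 1 (column-stochastic matrix)."""
    n = len(m)
    col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / col_sums[j] for j in range(n)] for i in range(n)]

def expand(m):
    """Expansion step: matrix squaring, which spreads flow along longer walks."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def inflate(m, r):
    """Inflation step: raise entries to the power r, then renormalize columns."""
    return normalize_columns([[x ** r for x in row] for row in m])

def markov_cluster(similarity, inflation=2.0, max_iterations=100, tol=1e-9):
    """Textbook Markov clustering on a small dense similarity matrix."""
    n = len(similarity)
    # Add self-loops so every column has nonzero mass, then normalize.
    m = normalize_columns(
        [[similarity[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)])
    for _ in range(max_iterations):
        nxt = inflate(expand(m), inflation)
        delta = max(abs(nxt[i][j] - m[i][j]) for i in range(n) for j in range(n))
        m = nxt
        if delta < tol:
            break
    # Each row that keeps nonzero entries is an attractor; its columns form a cluster.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)

# Hypothetical toy graph: two tightly connected triples joined by one weak edge.
w = 0.1
toy_similarity = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, w, 0, 0],
    [0, 0, w, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
clusters = markov_cluster(toy_similarity)
print(clusters)
```

A higher `inflation` value tends to produce more, smaller clusters. The dense squaring used here costs cubic time and quadratic memory, which hints at why memory-aware sparse methods matter at the scale of hundreds of millions of proteins.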
This project is expected to have many beneficial scientific impacts. A better understanding of microbial composition can aid environmental remediation, improve understanding of the impacts of climate change (including wildfires, algae blooms, and nature-based carbon capture), and support improvements in food production and medical research. The work also helps to answer fundamental biological questions, for example by exploring functional dark matter and revealing novel protein structures.
More details about the ExaBiome project, including publicly available software, can be found at https://sites.google.com/lbl.gov/exabiome.