ExaBiome

The microbiome plays a critical role in human and environmental health, yet the makeup and interactions of complex communities are not well understood. Genome sequencing on DNA extracted from microbiomes is used to study the diversity, integration, and dynamics of organisms in microbiomes. Due to the size and complexity of the datasets involved and the growth of community datasets, assembly and analysis are the most computationally demanding aspects of this branch of bioinformatics. The ExaBiome project is developing scalable data assembly and analysis tools to address current needs and, through the use of exascale computing power, provide solutions for anticipated increases in biological data.

Summary

Accelerating foundational biological research and biotechnology development are critical steps in addressing the ecological problems of the 21st century, from the spread of antibiotic-resistant diseases to the accelerating formation of deserts and ocean dead zones due to pollution and climate change. The CDC estimates that antibiotic-resistant diseases infect more than 2.8 million U.S. citizens and cost more than $55 billion in treatment and lost productivity annually, and the collapse of ocean and terrestrial ecosystems has been identified as a catalyst to starvation and water shortage, the spread of disease, and mass migration from affected areas.

Historically, identification and reverse engineering of microbes—the most widespread and diverse form of life on Earth—has yielded powerful technologies such as uniquely effective antibiotics, revolutionary tools for genome editing such as CRISPR-Cas, and advanced biomanufacturing techniques for vaccines and agricultural products. Applying high performance computing to this field will greatly improve the speed with which microorganisms can be identified and reverse engineered for new biotechnologies, providing new solutions in medicine, agriculture, and environmental science.

The ExaBiome team within the Department of Energy’s Exascale Computing Project has created a software platform which enables unprecedented insight into a staple biological research technique called metagenomics—the reconstruction and comparison of genomic information from entire communities of microbes found in soil, water, or tissue samples. With the application of exascale computation, researchers can more quickly identify new species of microbes and viruses, map population changes in a community or environment over time, and understand the function of unique cellular machinery by comparing huge numbers of similar genes and proteins.

Without exascale computing, researchers cannot fully analyze complex microbial datasets, which are often dozens of terabytes in size, in a reasonable timeframe. Instead, they must analyze subsamples of large datasets and computationally aggregate the results. This approach limits researchers’ ability to discover and characterize uncommon species, reduces the accuracy of assembled genomes, and obscures the function of genes and proteins.

To address these issues, the ExaBiome team has created scalable tools for metagenome assembly, protein clustering, and similarity search. These tools allow researchers to assemble raw sequence data into genome sequences for individual species, cluster proteins and accurately identify their function, and compare multiple metagenomes, which can show how an environment has changed over time. The ExaBiome team has used these tools to analyze over 400 million protein sequences in less than 4 hours, a process that would take weeks using previous methods. The team has also used metagenomic techniques to assemble the largest environmental dataset to date, allowing for the discovery of new species and insight into the composition of microbial communities that cannot be replicated with subsampling.

Exascale delivers unprecedented performance and fidelity to move research forward at a faster pace. These new capabilities will dramatically accelerate the development of new biotechnologies by unlocking the potential of as-yet poorly understood biological systems. As an example, these technologies can be implemented in the development of next-generation fertilizers for increased agricultural yield and improved soil health, bioremediation tools to stabilize critical ecosystems and keep our environment livable, new methods to analyze and treat infection and diseases of the human microbiome, and beyond.

Technical Discussion

Metagenomics—the application of high-throughput genome sequencing to DNA extracted from microbial communities—is a powerful and general method for studying microbial diversity, integration, and dynamics. Since the introduction of metagenomics over a decade ago, it has become an essential and routine tool. The assembly and analyses of metagenomic datasets are among the most computationally demanding tasks in bioinformatics. The scale and rate of growth of these datasets will require exascale resources to process (i.e., assemble) and interpret through annotation and comparative analysis.

The ExaBiome project aims to provide scalable tools for three core computational problems in metagenomics: (1) metagenome assembly, which takes raw sequence data and produces long genome sequences for each species; (2) protein clustering and annotation, which finds families of closely related proteins and identifies their functional behavior; and (3) signature-based approaches to enable scalable and efficient comparative metagenome analysis, which might show, for example, the variability of an environmental community over time.
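One widely used family of signature-based methods is MinHash sketching, in which each sample is reduced to a small set of k-mer hashes that can be compared cheaply. The sketch below illustrates that general idea only. The source does not say which signature scheme ExaBiome uses, and the function names, k-mer length, and sketch size are all assumptions chosen for the example.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all length-k substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash64(kmer):
    """Stable 64-bit hash of a k-mer (BLAKE2b, so values do not change between runs)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def signature(seqs, k=5, size=16):
    """Bottom-`size` sketch: the `size` smallest k-mer hashes found in a sample."""
    hashes = set()
    for s in seqs:
        hashes.update(hash64(km) for km in kmers(s, k))
    return sorted(hashes)[:size]


def jaccard_estimate(sig_a, sig_b, size=16):
    """Estimate the Jaccard similarity of two samples from their sketches."""
    # Take the smallest hashes of the union, then count how many occur in both sketches.
    merged = sorted(set(sig_a) | set(sig_b))[:size]
    if not merged:
        return 0.0
    shared = set(sig_a) & set(sig_b)
    return sum(1 for h in merged if h in shared) / len(merged)


# Hypothetical toy samples standing in for two time points of one environment.
sample_t0 = ["ACGTACGTTAGC", "TTGACCAGTA"]
sample_t1 = ["ACGTACGTTAGC", "GGCATCATCG"]
print(jaccard_estimate(signature(sample_t0), signature(sample_t1)))
```

Comparing the signatures of samples taken at different times gives a rough measure of how much the community's sequence content has shifted, without aligning the full datasets against each other.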

The ExaBiome team has developed MetaHipMer, the first scalable distributed-memory metagenome assembler. MetaHipMer scales to thousands of compute nodes on today’s petascale architectures and has assembled the largest environmental datasets to date (up to 30 TB), something that was impossible with previous assemblers. The team continues to work on further scalability improvements and node-level optimizations to take advantage of fine-grained on-node parallelism and memory structures, including GPUs. MetaHipMer’s assembly quality is competitive with that of other assemblers, and the team continues to add new features driven by the experience of science teams. MetaHipMer is designed for short-read (Illumina) data; a second assembler for long reads, DiBella, is also under development. DiBella shows even higher computational intensity, which might make it a good fit for exascale systems, especially GPU-based ones.
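To give a feel for what assembly involves, the following toy sketch counts k-mers in short reads, discards k-mers seen fewer than min_count times as likely sequencing errors, and greedily extends the survivors into contigs. It is a single-process illustration of generic k-mer-based (de Bruijn graph style) assembly, not MetaHipMer's algorithm. The function names and parameters are hypothetical, reverse complements are ignored, and none of the distributed-memory or GPU machinery described above is represented.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def _walk(contig, solid, used, k, forward):
    """Extend a contig one base at a time while the next k-mer is unambiguous."""
    while True:
        if forward:
            anchor = contig[-(k - 1):]
            candidates = [anchor + b for b in "ACGT" if anchor + b in solid]
        else:
            anchor = contig[:k - 1]
            candidates = [b + anchor for b in "ACGT" if b + anchor in solid]
        # Stop at dead ends, at forks, and at k-mers already placed in a contig.
        if len(candidates) != 1 or candidates[0] in used:
            return contig
        used.add(candidates[0])
        if forward:
            contig = contig + candidates[0][-1]
        else:
            contig = candidates[0][0] + contig


def assemble(reads, k=5, min_count=2):
    """Greedy toy assembler: returns a list of contigs built from solid k-mers."""
    counts = count_kmers(reads, k)
    solid = {km for km, c in counts.items() if c >= min_count}
    used, contigs = set(), []
    for seed in sorted(solid):          # sorted only to make the output deterministic
        if seed in used:
            continue
        used.add(seed)
        contig = _walk(seed, solid, used, k, forward=True)
        contig = _walk(contig, solid, used, k, forward=False)
        contigs.append(contig)
    return contigs
```

A real metagenome assembler must also handle uneven coverage across species, repeats shared between genomes, and k-mer tables too large for one node's memory, which is where the distributed-memory design matters.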

ExaBiome’s challenge problem is to demonstrate a high-quality assembly or set of assemblies on at least 50 TB of environmental data (reads) that runs across a full exascale machine. The reads will be drawn from the Tara Oceans dataset, which consists of multiple temporal and spatial environmental samples from oceans around the world. Coassembly of this dataset could reveal new species and insights into the makeup of complex ocean microbial communities. The coassembly approach has been demonstrated to improve on current state-of-the-art assembly pipelines, which are forced to use subsampling when datasets get large. Subsampling limits researchers’ ability to assemble rare, low-coverage species and can result in confusing genome duplications. Furthermore, coassembling data across both temporal and spatial scales will not only enhance assembly quality but could also reveal functions that would otherwise remain hidden. Addressing this challenge problem will demonstrate a first-in-class science capability by using the power of exascale computing combined with novel graph algorithms.
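The effect of subsampling on rare species can be seen with a back-of-the-envelope coverage estimate. Under the standard Lander-Waterman (Poisson) model, a genome sequenced to mean depth c has roughly a 1 - e^(-c) fraction of its bases covered by at least one read. The sketch below applies that formula. The function names are hypothetical, and every number in the usage lines is invented for illustration rather than taken from the Tara Oceans data.

```python
import math


def expected_depth(total_reads, read_len, read_fraction, genome_len):
    """Mean sequencing depth for one species.

    read_fraction is the share of all reads that originate from this species.
    """
    return total_reads * read_fraction * read_len / genome_len


def fraction_covered(depth):
    """Poisson estimate of the fraction of bases hit by at least one read."""
    return 1.0 - math.exp(-depth)


# Hypothetical rare species: 1 read in 100,000 comes from its 3 Mbp genome.
for keep in (1.0, 0.5, 0.1):            # fraction of the dataset retained
    depth = expected_depth(
        total_reads=int(2_000_000_000 * keep),
        read_len=150,
        read_fraction=1e-5,
        genome_len=3_000_000,
    )
    print(f"keep={keep:.0%} depth={depth:.2f} covered={fraction_covered(depth):.1%}")
```

Because depth scales linearly with the number of reads kept, a species that is only marginally covered in the full dataset can fall well below what an assembler needs once the data are subsampled, whereas pooling samples in a coassembly pushes depth in the other direction.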

For protein analysis, the ExaBiome team has developed the similarity search tool PASTIS (Protein Alignment via Sparse Matrices) and the clustering tool HipMCL (High-performance Markov Clustering). Scalability and high performance are of paramount importance for uncovering novel phenomena that emerge only at very large scales in the protein sets produced by metagenomics research. Other tools in this field often sacrifice quality and sensitivity when the data grow beyond a few tens of millions of sequences. Both ExaBiome tools cut the time required from weeks to hours without sacrificing quality and sensitivity. In addition to customized load balancing techniques, PASTIS contains novel algorithms such as Distributed Blocked 2D Sparse SUMMA for overlap detection. This approach overcomes the memory requirements of the many-against-many search and enables in-memory similarity search. HipMCL relies on fast randomized algorithms to estimate the memory required by the clustering and performs the clustering in stages to reduce the memory overhead. Both tools have extensive GPU support and make use of all the resources on a node by distributing their components among GPU and CPU resources according to factors such as memory footprint and computational intensity. Recently, HipMCL enabled clustering of more than half a billion protein sequences and helped in the exploration of protein dark matter, a task that proved infeasible with other clustering tools.
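The two ideas in this paragraph can be illustrated at toy scale. The sketch below first finds candidate overlaps by forming the nonzero entries of A·Aᵀ, where A is a sparse sequence-by-k-mer matrix, and then runs the textbook Markov Cluster (MCL) iteration of expansion and inflation on a similarity graph. It is a single-process, pure-Python illustration of these general techniques under stated assumptions, not the PASTIS or HipMCL implementation. The function names, the k-mer length, the self-loop rule, and the stopping rule (a fixed iteration count) are all choices made for the example.

```python
from collections import defaultdict
from itertools import combinations


def shared_kmer_counts(seqs, k=3):
    """Nonzero off-diagonal entries of A @ A.T, where A[i][j] = 1 if sequence i
    contains k-mer j. Only pairs that share at least one k-mer are ever visited."""
    column = defaultdict(set)               # k-mer -> indices of sequences containing it
    for i, s in enumerate(seqs):
        for p in range(len(s) - k + 1):
            column[s[p:p + k]].add(i)
    counts = defaultdict(int)               # (i, j) with i < j -> number of shared k-mers
    for members in column.values():
        for i, j in combinations(sorted(members), 2):
            counts[(i, j)] += 1
    return counts


def normalize_columns(m):
    """Scale each column to sum to 1 so the matrix is column-stochastic."""
    n = len(m)
    sums = [sum(m[r][c] for r in range(n)) for c in range(n)]
    return [[m[r][c] / sums[c] if sums[c] else 0.0 for c in range(n)] for r in range(n)]


def mcl(weights, n, inflation=2.0, iterations=30):
    """Markov clustering on an undirected weighted graph given as {(i, j): weight}."""
    m = [[0.0] * n for _ in range(n)]
    for (i, j), w in weights.items():
        m[i][j] = m[j][i] = float(w)
    for i in range(n):
        # Assumed self-loop rule: weight equal to the node's strongest edge.
        m[i][i] = max(m[i]) or 1.0
    m = normalize_columns(m)
    for _ in range(iterations):
        # Expansion: square the matrix (two-step random walks).
        m = [[sum(m[r][t] * m[t][c] for t in range(n)) for c in range(n)] for r in range(n)]
        # Inflation: raise entries to a power, then renormalize, which favors strong flows.
        m = normalize_columns([[x ** inflation for x in row] for row in m])
    # Each row that still carries flow names one cluster: the columns it reaches.
    clusters = set()
    for r in range(n):
        members = frozenset(c for c in range(n) if m[r][c] > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


# Hypothetical toy protein fragments; the counts become edge weights for clustering.
proteins = ["MKVLAAG", "KVLAAGT", "PQRSTWY"]
overlaps = shared_kmer_counts(proteins, k=3)
print(dict(overlaps))
print(mcl(overlaps, n=len(proteins)))
```

At production scale the dense lists used here are impossible: the many-against-many search and the expansion step are both sparse matrix products whose intermediate results can exceed memory, which is the problem the blocked distributed multiplication in PASTIS and the staged, memory-estimating approach in HipMCL are described as addressing.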

This project is expected to deliver a broad range of scientific benefits. A better understanding of microbial composition can aid environmental remediation and clarify the impacts of climate change, including wildfires, algae blooms, and nature-based carbon capture, in addition to improving food production and medical research. The work also helps to answer fundamental biological questions, such as exploring functional dark matter and revealing novel protein structures.

More details about the ExaBiome project, including publicly available software, can be found at https://sites.google.com/lbl.gov/exabiome
