Providing Exascale Solutions for the Assembly and Analysis of Metagenomic Data
By Scott Gibson
The latest episode of the Let’s Talk Exascale podcast features Kathy Yelick and Lenny Oliker of Lawrence Berkeley National Laboratory (LBNL). Yelick is principal investigator and Oliker is executive director of a project named ExaBiome: Exascale Solutions for Microbiome Analysis, which is part of the US Department of Energy’s Exascale Computing Project (ECP).
Yelick, a professor of electrical engineering and computer sciences at UC Berkeley and the associate lab director for Computing Sciences at LBNL, has for many years focused on programming models and tools: systems for making high-performance computing (HPC) easier and more efficient. Oliker, leader of the Performance and Algorithms Research Group in LBNL's Computational Research Division, works on performance issues for HPC systems used in scientific computing, with attention to optimization, evaluation, and modeling.
Microbiome Ubiquity and Interconnectedness
The ExaBiome project is developing computational tools to analyze microbial species—bacteria or viruses that typically live in communities of hundreds of different species.
“For example, the human microbiome is made up of microbes that live in our digestive systems, skin, or other organs, and they’re linked to many important health issues: obesity, mental health, and cancer,” Oliker said. “In fact, it’s estimated that the human body has at least as many bacteria in its microbiome as human cells. This is a pretty significant community living inside each of us. To understand the behavior and application of this rich genomic community, we first have to learn to analyze what’s called the metagenome.”
A genome is all of the DNA information of a particular organism; a metagenome is all of the genetic information of the community of microorganisms found in an environmental sample. Metagenomics—the application of high-throughput genome sequencing technologies to DNA extracted from microbiomes—is a powerful and general method for studying microbial diversity, integration, and dynamics.
Microbiomes are ubiquitous and dominant not only in humans but also in the atmosphere, the ocean, and soils and sediments.
“Beyond health care, understanding the microbiome and these species is really important to studying the Earth system and things like climate change and environmental remediation,” Yelick said. “The microbial communities that live naturally in the environment are some of the largest and most complex. Scientists are interested in learning more about how to respond to major fires or chemical spills and how we can use microbial communities to manufacture chemicals such as antibiotics or industrial materials and things like that. The ExaBiome project is about using high-performance computing, and eventually exascale computing, to analyze these very large microbial communities, understand their functional behavior, and then compare them with different communities to learn how different communities might behave in reference to carbon capture or to other applications.”
Examining metagenomics is a relatively new endeavor. “I think computational analysis of it has been going on for about a decade,” Oliker said. “And it’s my understanding that we’ve cultured less than 1 percent of the microbiomes that are out there, and we’ve sequenced an even smaller percentage of those species. So really, our understanding of these communities is in its very early stages. I think it has tremendous potential, but we’re at the dawn of our understanding.”
Metagenomic study is essential due to the interconnectedness of the microbiomes. “One of the big challenges is that you can’t culture the microbiomes because many of them exist only in these communities, and to understand their behavior, you also need to understand their entire community and how it fits together; so that’s the reason for sequencing and interpreting, which is where metagenomics comes into play,” Yelick said.
A New Approach to Computation
With respect to computation, genomic analysis is a departure from the traditional approach to simulation problems. “The initial high-level structure of the relationship between the different sequences or of the genomics is unknown,” Oliker said. “So that makes it much more difficult to parallelize, and it requires data structures that are much harder to handle at large scale: hash tables, histograms, graphs, and very sparse unstructured matrices and structures. We also have to worry about dynamic load balancing. We have little locality and unpredictable communication, and the connections between the processors are arbitrary, so there’s irregularity in both space and time. Putting all of those things together creates a very complex computational problem, especially as we scale up toward the exascale regime.”
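The irregularity Oliker describes can be seen even in one of the simplest kernels of metagenome assembly: counting the short subsequences, or k-mers, that occur in the raw reads. The sketch below is a toy, serial illustration of that hash-table and histogram workload, with made-up reads and a deliberately tiny k; it is not ExaBiome code, and the project's assembly codes spread a table like this across the memories of many processors, as discussed later in the article.

```cpp
// Toy k-mer counter: a serial illustration of the hash-table and histogram
// workload described above (illustrative only, not ExaBiome code). Which
// table entries get touched depends entirely on the input sequences, so the
// access pattern is unpredictable and hard to partition in advance.
#include <cstddef>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

int main() {
    const std::size_t k = 4;  // k-mer length (real assemblers use much larger k)

    // A few short "reads" standing in for billions of sequencer outputs.
    const std::vector<std::string> reads = {
        "GATTACAGATT", "TTACAGATTAC", "ACAGATTACAG"};

    // Hash table: k-mer -> number of times it occurs across all reads.
    std::unordered_map<std::string, long> counts;
    for (const std::string& read : reads)
        for (std::size_t i = 0; i + k <= read.size(); ++i)
            ++counts[read.substr(i, k)];

    // Histogram of those counts, used, for example, to separate likely
    // sequencing errors (k-mers seen once) from k-mers that occur repeatedly.
    std::unordered_map<long, long> histogram;
    for (const auto& kv : counts) ++histogram[kv.second];

    for (const auto& kv : histogram)
        std::cout << kv.second << " distinct k-mers occur " << kv.first
                  << " time(s)\n";
}
```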
The ExaBiome project aims to provide scalable tools for three core computational problems in metagenomics. “The first is genome assembly, which is a problem of turning raw sequencing data into genomes,” Yelick said. “So in this case, we’re examining sequencing data that comes from, say, a scoop of soil or from the human microbiome—where all the microbes are mixed together and we’re trying to then turn those into complete genomes for each species or something that at least has much longer strands so that we can find out what genes they have, what proteins they code for, and so on. The second problem is what we call protein clustering. That’s exploring the relationships between the different proteins that come from those genes. And then the third problem is a comparative metagenome analysis where you have maybe two different samples of soil from different points in time or from nearby locations and you’re trying to understand the similarities or how they may change over time.”
The three core computational problems lead to very fine-grained and irregular communication patterns. “For that reason, we deploy one-sided communication and partitioned global address space (PGAS) languages, at least in the assembly problem,” Yelick said. “And we work closely with other parts of ECP on the software support for this communication in the algorithms.”
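As a rough illustration of what that one-sided, PGAS style looks like in code, here is a minimal sketch written against UPC++, the communication library developed under ECP’s Pagoda project (discussed below). The specifics are hypothetical rather than taken from ExaBiome’s codes: one rank writes a value directly into another rank’s shared memory, and every rank reads it back, with no receive or handshake on the owning side.

```cpp
// Minimal sketch of one-sided PGAS communication with UPC++ (illustrative,
// not ExaBiome code). Rank 0 exposes an integer in the global address
// space; other ranks read and write it without rank 0's involvement.
#include <iostream>
#include <upcxx/upcxx.hpp>

int main() {
    upcxx::init();
    const int me = upcxx::rank_me();
    const int n  = upcxx::rank_n();

    // Rank 0 allocates one shared integer; its global pointer is broadcast.
    upcxx::global_ptr<int> slot = nullptr;
    if (me == 0) slot = upcxx::new_<int>(-1);
    slot = upcxx::broadcast(slot, 0).wait();

    // One-sided write into rank 0's memory: rank 0 does nothing to help.
    if (me == n - 1) upcxx::rput(42, slot).wait();
    upcxx::barrier();

    // One-sided read of the same location from every rank.
    int value = upcxx::rget(slot).wait();
    std::cout << "rank " << me << " sees " << value << "\n";

    upcxx::barrier();
    if (me == 0) upcxx::delete_(slot);
    upcxx::finalize();
}
```

Compiled with the upcxx wrapper and launched with upcxx-run, every rank prints 42 after the barrier; the rank that owns the memory never posts a receive or otherwise participates in the transfers.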
The ExaBiome team interacts especially closely with the Pagoda project in ECP’s Software Technology research focus area. “Pagoda is looking at how to support this kind of one-sided communication,” Yelick said. “Imagine you have a big exascale computer and each processor has its own memory. With one-sided communication, a processor can directly read and write the memory of another processor without asking that other processor to help. And so that’s what we need when we’re building something like a hash table because the hash table is spread out over all the memories of all the processors. If you want to look something up in that hash table, you want to do it without getting the processor on the other side involved with that operation.” A hash table is a data structure that stores a large number of items and lets any single item be looked up quickly, even when the table is huge.
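Building on the sketch above, a distributed hash table of the kind Yelick describes might look roughly like the following: each rank owns a slice of the slots, a key’s hash selects the owning rank and slot, and inserts and lookups are single one-sided puts and gets. Collision handling, aggregation of small messages, and the rest of the engineering that makes this fast at scale are omitted, and the key and value types are hypothetical; this is a sketch of the idea, not ExaBiome’s implementation.

```cpp
// Sketch of a distributed hash table probed with one-sided reads and writes
// (UPC++). Illustrative only: collisions and updates are ignored so the
// remote access pattern stays visible. Key and value types are hypothetical.
#include <cstdint>
#include <functional>
#include <iostream>
#include <string>
#include <utility>
#include <upcxx/upcxx.hpp>

struct Entry {              // one fixed-size slot; trivially copyable
    std::uint64_t key_hash;
    long          value;
};

constexpr std::size_t SLOTS_PER_RANK = 1 << 16;

int main() {
    upcxx::init();
    const int nranks = upcxx::rank_n();

    // Each rank owns a slice of the table; dist_object lets any rank
    // fetch another rank's base pointer and cache it.
    upcxx::dist_object<upcxx::global_ptr<Entry>> table(
        upcxx::new_array<Entry>(SLOTS_PER_RANK));

    // A key's hash picks the owning rank and the slot within its slice.
    auto locate = [&](const std::string& key) {
        std::uint64_t h = std::hash<std::string>{}(key);
        int owner = static_cast<int>(h % nranks);
        std::size_t slot = (h / nranks) % SLOTS_PER_RANK;
        return std::make_pair(owner, slot);
    };

    // Insert: one one-sided write into the owning rank's slice.
    auto insert = [&](const std::string& key, long value) {
        auto [owner, slot] = locate(key);
        upcxx::global_ptr<Entry> base = table.fetch(owner).wait();
        upcxx::rput(Entry{std::hash<std::string>{}(key), value},
                    base + slot).wait();
    };

    // Lookup: one one-sided read; the owning rank never gets involved.
    auto lookup = [&](const std::string& key) {
        auto [owner, slot] = locate(key);
        upcxx::global_ptr<Entry> base = table.fetch(owner).wait();
        return upcxx::rget(base + slot).wait().value;
    };

    upcxx::barrier();
    if (upcxx::rank_me() == 0) {
        insert("GATTACA", 7);
        std::cout << "GATTACA -> " << lookup("GATTACA") << "\n";
    }
    upcxx::barrier();
    upcxx::finalize();
}
```

Here upcxx::dist_object lets every rank publish the base pointer of its own slice so any other rank can fetch it once, which is what makes the subsequent rput and rget calls purely one-sided.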
Massive-Scale Datasets
Pushing past the traditional shared-memory-system approach, the ExaBiome team has developed efficient distributed-memory implementations and analyzed some of the largest datasets in the metagenomics community. At SC18 in Dallas, the team presented a paper that described the first whole-dataset assembly of the Twitchell Wetlands metagenome. “This is a complex massive-scale metagenome dataset of 7.5 billion reads and over 2.6 terabytes,” Oliker said. “These time-series soil samples were collected in the San Francisco delta. At the time, this was the largest metagenome sample ever collected. Since then, we have examined an even larger dataset, 3.3 terabytes, from soil representing carbon cycle experiments.”
In the computational area of protein clustering, biologists are creating datasets that contain hundreds of millions of proteins and other cellular components. Clustering algorithms must be applied to these datasets to identify patterns such as new classes of proteins. Although these techniques have been used for many years, they cannot process the emerging datasets without distributed-memory implementations. “In our recent work—another nice collaboration with the Joint Genome Institute—we developed an efficient distributed-memory protein analysis based on sparse matrix methods,” Oliker said. “It’s called HipMCL, and HipMCL has recently clustered the largest biological network to date—it was over 300 million proteins with over 37 billion connections.”
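HipMCL is a distributed-memory scaling of Markov clustering (MCL), which alternates an “expansion” step (multiplying the similarity matrix by itself) with an “inflation” step (raising each entry to a power and renormalizing the columns) until the matrix settles into a pattern from which the clusters can be read off. The toy below is a dense, single-node sketch of that iteration on a six-protein similarity graph with a made-up inflation parameter; HipMCL’s contribution is performing the same computation with distributed sparse matrices on networks with billions of edges, and nothing here is taken from its source code.

```cpp
// Toy, dense, single-node sketch of the Markov Cluster (MCL) iteration:
// alternate "expansion" (matrix squaring) and "inflation" (elementwise power
// followed by column normalization). HipMCL scales this idea with distributed
// sparse matrices; this is an illustration, not ExaBiome code.
#include <cmath>
#include <cstddef>
#include <iostream>
#include <vector>

using Matrix = std::vector<std::vector<double>>;  // dense here for clarity

Matrix multiply(const Matrix& a, const Matrix& b) {
    std::size_t n = a.size();
    Matrix c(n, std::vector<double>(n, 0.0));
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k)
            for (std::size_t j = 0; j < n; ++j)
                c[i][j] += a[i][k] * b[k][j];
    return c;
}

// Inflation: raise entries to the power r, then renormalize each column.
void inflate(Matrix& m, double r) {
    std::size_t n = m.size();
    for (std::size_t j = 0; j < n; ++j) {
        double sum = 0.0;
        for (std::size_t i = 0; i < n; ++i) sum += (m[i][j] = std::pow(m[i][j], r));
        for (std::size_t i = 0; i < n; ++i) m[i][j] /= sum;
    }
}

int main() {
    // Tiny similarity graph: two triangles (proteins 0-2 and 3-5) joined by
    // a single edge between 2 and 3, with self-loops on the diagonal.
    Matrix m = {
        {1, 1, 1, 0, 0, 0},
        {1, 1, 1, 0, 0, 0},
        {1, 1, 1, 1, 0, 0},
        {0, 0, 1, 1, 1, 1},
        {0, 0, 0, 1, 1, 1},
        {0, 0, 0, 1, 1, 1}};
    inflate(m, 1.0);                    // r = 1 just column-normalizes

    for (int iter = 0; iter < 20; ++iter) {
        m = multiply(m, m);             // expansion
        inflate(m, 2.0);                // inflation with parameter r = 2
    }

    // After convergence, each column has most of its mass on one "attractor"
    // row; columns sharing an attractor belong to the same cluster.
    for (std::size_t j = 0; j < m.size(); ++j) {
        std::size_t attractor = 0;
        for (std::size_t i = 1; i < m.size(); ++i)
            if (m[i][j] > m[attractor][j]) attractor = i;
        std::cout << "protein " << j << " -> cluster of attractor "
                  << attractor << "\n";
    }
}
```

On this small example the loop settles after a few iterations and the two triangles end up attached to different attractors; at the scale Oliker mentions, both the multiply and the inflation must run on sparse matrices partitioned across thousands of nodes.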
ExaBiome’s Expected Enduring Legacy
The ExaBiome project has already made contributions that will leave an enduring legacy. “We have new classes of codes that take problems that formerly ran in shared memory and enable them to run on current petascale machines, and therefore, they can effectively handle metagenomic datasets,” Yelick said. “For the HPC community, the most obvious thing is that we’re looking now at an 8-terabyte dataset. Our exascale goal is a 50-terabyte dataset, so you can see the quantitative increase in this. From a biological standpoint, what that means is being able to find species that are represented at very small levels in a sample and still assemble them because you have this enormous dataset to work with; whereas, when you break it up into pieces as people were doing before, you can’t assemble those kinds of rare species in the sample. And yet those rare species can still be very important in the function of the microbial community. So we believe that both the clustering code, HipMCL, and the assembly codes are going to be a long-lasting legacy of the project. And in addition to the assemblies done through collaborations with the science community, we believe we’re going to be able to understand datasets that hadn’t been interpreted before.”