Delivering Exascale Machine Learning Algorithms and Tools for Scientific Research

By Scott Gibson

Experiments, observations, and computer simulations enable scientists to ask questions and form hypotheses about the natural world that lead to breakthroughs. Now machine learning (ML), artificial intelligence, and data analytics are converging with high-performance computing (HPC) to open up new opportunities for scientific discovery on upcoming exascale computing systems and to influence the design and use of those systems.

The US Department of Energy’s (DOE) Exascale Computing Project (ECP) is leveraging the ML revolution in the development of various science applications. The pre-exascale systems Summit and Sierra are packed with thousands of CPU cores and GPUs to carry the workloads on the path to exascale.

Aurora and Frontier, slated to be America’s first exascale supercomputers in 2021, are expected to hasten the convergence of traditional HPC, data analytics, and ML. The El Capitan exascale machine is to follow in 2023. Meanwhile, an effort called ExaLearn, one of six ECP co-design centers, is focused on ML and related activities to inform the requirements for these coming exascale machines.

From left, Peter Nugent (Lawrence Berkeley National Laboratory), Frank Alexander (Brookhaven National Laboratory), and Brian Van Essen (Lawrence Livermore National Laboratory)

Frank Alexander, deputy director of the Computational Science Initiative at Brookhaven National Laboratory and leader of ExaLearn, is a guest on ECP’s podcast, Let’s Talk Exascale. He is joined by ExaLearn team members Peter Nugent of Lawrence Berkeley National Laboratory (LBNL) and Brian Van Essen of Lawrence Livermore National Laboratory (LLNL).

Nugent is a senior scientist, division deputy for Scientific Engagement, and department head for Computational Science at LBNL. Van Essen is LLNL’s informatics group leader and project lead for the Livermore Big Artificial Neural Network (LBANN) open-source deep learning toolkit. The interview was conducted this past November in Denver at SC19: The International Conference for High Performance Computing, Networking, Storage, and Analysis.

ExaLearn is a co-design partnership composed of experts from Argonne National Laboratory, Brookhaven Lab, LLNL, LBNL, Los Alamos National Laboratory, Oak Ridge National Laboratory, Pacific Northwest National Laboratory, and Sandia National Laboratories.

While Big Tech companies such as Google and Amazon use trained ML models to predict what users and customers may want based on previous behavior, ExaLearn’s focus is entirely different, Alexander said. He stressed that ExaLearn is providing exascale ML software for scientific research. ExaLearn’s algorithms and tools will be used by the ECP applications, other ECP co-design centers, and DOE experimental facilities and leadership-class computing facilities.

As one of ExaLearn’s early successes, Nugent highlighted the use of ML on cosmology datasets to generate surrogate models that make complicated simulations less expensive. Nugent’s team at LBNL established a method for creating the surrogate models, which Van Essen’s team at LLNL implemented with deep neural networks on the Sierra and Summit systems.

“The hypothesis was that if we could teach neural networks to see all of the data, they could fundamentally get a better understanding of the underlying science and predict better results in the training,” Van Essen said. “So we, as part of the ExaLearn multi-lab collaboration, have been able to work on scaling up and training these deep neural networks on datasets that have been previously unattainable throughout the community. We’ve delivered new results on the CosmoFlow network at Berkeley Lab. Fundamentally, this is a great scientific achievement that we can give back to other ECP projects like ExaSky, but it also provides a foundational capability to the ExaLearn project and to DOE and ECP at large.”
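To illustrate the surrogate-model idea in general terms, a neural network can be fit to map simulation input parameters directly to the simulation’s outputs, so that new parameter settings can be evaluated without rerunning the expensive code. The following Python sketch uses PyTorch and random placeholder data; it is a minimal illustration of the fit-then-reuse pattern, not ExaLearn’s actual code, and all array shapes and names are hypothetical.

import torch
from torch import nn

# Hypothetical training data: each row of `params` holds input parameters for
# one simulation run, and each row of `outputs` holds a summary statistic
# produced by the expensive simulation for those parameters.
params = torch.rand(1024, 8)    # 1024 runs, 8 input parameters each
outputs = torch.rand(1024, 64)  # 64-value summary per run

# A small fully connected network acting as the surrogate model.
surrogate = nn.Sequential(
    nn.Linear(8, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 64),
)

optimizer = torch.optim.Adam(surrogate.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(200):
    optimizer.zero_grad()
    prediction = surrogate(params)       # forward pass through the surrogate
    loss = loss_fn(prediction, outputs)  # compare to the simulation outputs
    loss.backward()                      # backpropagate the error
    optimizer.step()

# Once trained, the surrogate predicts outputs for new parameter settings
# in a fraction of the cost of rerunning the full simulation.
new_params = torch.rand(1, 8)
predicted_output = surrogate(new_params)

In practice, the ExaLearn collaboration trains far larger networks on full cosmology datasets across systems such as Sierra and Summit, but the same pattern of fitting a network to simulation data and then reusing it as an inexpensive stand-in applies.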

Van Essen said that ExaLearn achieved two major technical advances in its first year: it developed a unique capability for training surrogate models, and it impacted scientific research by creating models that are an order of magnitude more capable than previous ones.

The ExaLearn team is developing four major types of ML algorithms, Alexander said: surrogates, control, inverse problem solvers, and design. By the end of the project, ExaLearn hopes to have applied all four in a given domain area, he added.

A key legacy of the ExaLearn project, Van Essen said, will be to use the strength of the eight-lab collaboration to find technologies across the science, ML, and HPC space and combine them for the benefit of the DOE national laboratories and the scientific community at large. And, Nugent pointed out, ExaLearn is making all of its datasets publicly available.