The development of capable exascale systems was made possible by a collaborative interdisciplinary co-design strategy. The Exascale Computing Project (ECP) established collaborations among software developers and hardware technology experts in co-design centers such as ExaLearn, thereby fostering a participatory development process to meet the complex and often conflicting needs of current and future exascale applications. The co-design teams worked closely with application developers to deliver efficient and reliable software products that are integral to the unprecedented results generated on exascale supercomputers such as Frontier.
Artificial intelligence (AI) is catalyzing paradigm shifts in scientific research and industry, thereby enabling automated data analysis, far more efficient production practices, and new predictive capabilities in fields such as biology, materials science, and cosmology. Integrating AI models with the world’s fastest computers offers powerful synergistic effects: allowing these systems to process huge datasets and run complex experimental workflows, improving the accuracy and applicability of AI systems, and enhancing the computational efficiency of supercomputers. This synergy will open broad new research avenues and greatly accelerate the pace of many existing projects. However, successfully integrating AI and supercomputing requires overcoming several technical hurdles.
The Exascale Computing Project (ECP) sponsored the ExaLearn Co-design Center to advance the use of AI and machine learning (ML) in the fastest supercomputers on earth. The ExaLearn team collaborated with ECP hardware and application developers to improve several common learning methods and to enable their efficient integration into exascale systems. Using this co-design approach, ExaLearn reduced simulation costs while creating and expanding a unique exascale-enabled AI and ML tool set that incorporates information from relevant fields and provides researchers with tools for uncertainty quantification and data analysis.
Integrating AI and ML methods with modern supercomputers is not a straightforward task. Exascale supercomputers' unique hardware and software designs—especially their extensive use of graphics processing units for computation and their reliance on specific programming languages—require extensive adjustments to AI and ML frameworks for successful integration. Simply getting such a system to function is difficult in itself, but these adjustments must also yield unique, scalable programs that offer practical ways to increase the combined efficiency of both systems.
The ExaLearn team met and exceeded their goals by applying new AI, ML, and reinforcement learning methods to exascale-class supercomputers and advancing research methodologies in key areas. These methods and advances include new neural network emulators for reduced simulation costs, ML-based solvers for improved data analysis from neutron scattering data, reinforcement learning–based frameworks for benchmarking system performance and modeling atomic and molecular interactions with advanced physics-aware algorithms, and much more. The team also created a catalog of ML training data for a variety of fields, thereby enabling reproducible and efficient training and testing of new models on exascale systems.
The software tool set enabled by ExaLearn will be applied to problems across the US Department of Energy mission space and will greatly accelerate research progress on exascale computing systems. These advances will support solutions in multiple domains, including fundamental work in chemistry, physics, and biology and research in materials science and cosmology. The advances also provide flexibility for developing new exascale-enabled AI and ML systems with easy access to robust training data, thereby ensuring that these technologies remain applicable to the next generation of computational problems.
The ExaLearn Co-design Center continues to advance how artificial intelligence (AI) and machine learning (ML) are developed to run on the world’s fastest supercomputers. In addition to providing scalable AI/ML tools that enhance Exascale Computing Project (ECP) applications, the center is improving the efficiency and effectiveness of US Department of Energy (DOE) leadership-class computing resources and large-scale experimental user facilities. For its overall focus, ExaLearn selected four classes of learning problems, specifically using ML to develop surrogate models, inverse solvers, control policies, and design strategies. Each class is being demonstrated on a different ECP application area, employing a focused co-design process that targets common learning methods using deep neural networks, transformer methods, kernel and tensor methods, decision trees, ensemble methods, graphical models, and reinforcement learning.
To understand the limitations posed by constraints related to application development costs, application fidelity, performance portability, scalability, and power efficiency, ExaLearn has engaged directly with developers of ECP hardware, system software, programming models, learning algorithms, and applications. These collaborations have enhanced the program in several ways.
To replace costly simulation methods, ExaLearn uses the latest techniques from generative adversarial networks and variational autoencoders to construct fast, accurate surrogate models, notably in computational cosmology, where complex N-body and hydrodynamics algorithms are replaced with fast neural network emulators. In the area of ML-based inverse solvers, ExaLearn is applying inverse methods to extract complex materials structures from neutron scattering data at Oak Ridge National Laboratory's Spallation Neutron Source. In areas of optimal control and steering of complex computer simulation workflows, ExaLearn also provides scalable ML software for various ECP applications.
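The surrogate-model idea can be illustrated with a minimal sketch: sample an expensive simulation once, fit a cheap emulator to the samples, and answer subsequent queries with the emulator. The target function, sample sizes, and polynomial form below are illustrative stand-ins, not ExaLearn code or its neural network emulators.

```python
import numpy as np

# Hypothetical stand-in for an expensive simulation code.
def expensive_simulation(x):
    return np.sin(3 * x) + 0.5 * x**2

# Run the simulation once to collect training data.
rng = np.random.default_rng(0)
x_train = rng.uniform(-1, 1, 200)
y_train = expensive_simulation(x_train)

# Fit a cheap polynomial surrogate to the sampled data.
coeffs = np.polyfit(x_train, y_train, deg=6)
surrogate = np.poly1d(coeffs)

# New queries hit the surrogate, not the simulation.
x_new = np.linspace(-1, 1, 5)
max_err = np.max(np.abs(surrogate(x_new) - expensive_simulation(x_new)))
print(f"surrogate max error on queries: {max_err:.4f}")
```

In practice the emulator is a neural network trained on far higher-dimensional simulation outputs, but the workflow is the same: pay the simulation cost once during training, then amortize it over many fast surrogate evaluations.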
EXARL, a software framework that enables exascale reinforcement learning for science and benchmarking, is demonstrating and testing an initial use case for the temperature control of block copolymer self-annealing in light source experiments on DOE leadership computing facilities. In the design area, ExaLearn is tailoring reinforcement learning algorithms with physics-aware ML algorithms to develop interpretable ML models for use with graph-based models of atomic/molecular structure (e.g., generating novel electrolyte molecules and water cluster models). ExaLearn's design and control groups have also created a reinforcement learning pipeline for graph-based networks.
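The control-policy idea behind such a use case can be sketched with tabular Q-learning on a toy environment. The environment below (discrete temperature levels, heat/cool/hold actions, reward peaking at a target temperature) is hypothetical and is not the EXARL API, which targets far richer experiment-control problems at scale.

```python
import numpy as np

# Toy annealing environment: states are discrete temperature levels,
# actions nudge the setpoint, reward is highest at a target temperature.
N_STATES, TARGET = 11, 5
ACTIONS = (-1, 0, +1)  # cool, hold, heat

def step(state, action):
    nxt = min(max(state + ACTIONS[action], 0), N_STATES - 1)
    return nxt, -abs(nxt - TARGET)  # reward peaks at the target

rng = np.random.default_rng(1)
q = np.zeros((N_STATES, len(ACTIONS)))
alpha, gamma, eps = 0.5, 0.9, 0.1  # learning rate, discount, exploration

for episode in range(500):
    s = int(rng.integers(N_STATES))
    for _ in range(20):
        # Epsilon-greedy action selection.
        a = int(rng.integers(len(ACTIONS))) if rng.random() < eps else int(np.argmax(q[s]))
        s2, r = step(s, a)
        # Standard Q-learning update.
        q[s, a] += alpha * (r + gamma * np.max(q[s2]) - q[s, a])
        s = s2

# The learned greedy policy should heat below the target, hold at it,
# and cool above it.
policy = [int(np.argmax(q[s])) for s in range(N_STATES)]
print(policy)
```

Because the toy environment is deterministic and tiny, tabular Q-learning converges quickly; real experiment steering replaces the table with a neural network policy and the toy dynamics with the actual instrument or simulation.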
To facilitate reproducible experiments by organizing and distributing data, ExaLearn has established and is populating a searchable catalog of ML training data. This system (https://petreldata.net/exalearn/) enables large quantities of data, organized in forms suitable for training and testing ML models, to be browsed, searched, and accessed at high speed.
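The catalog concept (records of datasets with searchable metadata fields) can be sketched in a few lines. The schema and record contents below are invented for illustration; the actual ExaLearn catalog is a hosted web service, not this in-memory structure.

```python
# Hypothetical catalog records with illustrative metadata fields.
records = [
    {"name": "cosmology-emulator-train", "domain": "cosmology", "size_gb": 120},
    {"name": "neutron-scattering-spectra", "domain": "materials", "size_gb": 40},
    {"name": "water-cluster-graphs", "domain": "chemistry", "size_gb": 8},
]

def search(catalog, **filters):
    """Return records whose fields match all of the given filter values."""
    return [r for r in catalog if all(r.get(k) == v for k, v in filters.items())]

hits = search(records, domain="materials")
print(hits)
```

Keyed, filterable metadata like this is what makes training data reproducible to locate: a paper or workflow can cite the exact record it trained against rather than an ad hoc file path.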
ExaLearn is succeeding in its goal to build a software tool set that can be applied to multiple problems within the DOE mission space, use exascale platforms directly, and provide essential components to an exascale workflow. This AI/ML tool set does not replicate capabilities easily obtainable from existing, widely available packages, and it builds in domain knowledge (e.g., physics, chemistry, biology) wherever possible. The tool set quantifies predictive uncertainty in a manner that is interpretable, reproducible, and grounded in solid mathematical methods.
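One common way to attach uncertainty to a learned model's predictions is an ensemble: train several models on bootstrap resamples of the data and report the spread of their predictions. The sketch below uses that generic technique with an invented noisy dataset; it is one standard approach, not necessarily the specific method ExaLearn employs.

```python
import numpy as np

# Noisy synthetic observations of an underlying signal.
rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, 100)
y = np.sin(2 * x) + rng.normal(0, 0.1, 100)

# Train an ensemble of cheap models on bootstrap resamples.
models = []
for _ in range(20):
    idx = rng.integers(0, len(x), len(x))  # sample with replacement
    models.append(np.poly1d(np.polyfit(x[idx], y[idx], deg=3)))

# The ensemble mean is the prediction; the spread is the uncertainty.
x_query = 0.5
preds = np.array([m(x_query) for m in models])
mean, std = preds.mean(), preds.std()
print(f"prediction {mean:.3f} +/- {std:.3f}")
```

The ensemble standard deviation grows where data is sparse or noisy, which gives downstream users a mathematically grounded, reproducible signal for how far to trust a given prediction.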