Building a Capable Computing Ecosystem for Exascale and Beyond

The exascale computing era is here.

With the delivery of the U.S. Department of Energy’s (DOE’s) first exascale system, Frontier, in 2022, and the deployment of the Aurora and El Capitan systems expected by next year, researchers will have the most sophisticated computational tools at their disposal to conduct groundbreaking research. Exascale machines, which can perform more than a quintillion operations per second, are 1,000 times faster and more powerful than their petascale predecessors, enabling simulations of complex physical phenomena in unprecedented detail and pushing the boundaries of scientific understanding well beyond its current limits. This incredible feat of research, development, and deployment has been made possible through a national effort to maximize the benefits of high-performance computing (HPC) for strengthening U.S. economic competitiveness and national security. The Exascale Computing Project (ECP) has been an integral part of that endeavor.

Seven years ago, DOE’s Office of Science and National Nuclear Security Administration embarked on a fundamentally different approach to advance HPC capabilities in the national interest. Within the HPC community, application developers, software technology experts, and hardware vendors tend to work independently, producing products that are integrated only later. While effective, this process can create an implementation gap between software tools and the applications that must use them to exploit the full performance of new, more advanced machines, slowing the realization of full computing capability. ECP recognized this challenge and strategically brought these different groups together as one community at the outset—fostering a computing ecosystem that supports co-design of applications, software, and hardware to accelerate scientific innovation and technical readiness for exascale systems.

Fast forward to 2023—the project’s final year—and ECP collaborations have involved more than 1,000 team members working on 25 different mission-critical applications for research in areas ranging from energy and environment to materials and data science; 70 unique software products; and integrated continuous testing and delivery of ECP products on targeted DOE systems. The results achieved as part of the ECP ecosystem development reflect the synergy, interdependency, and collaboration forged between the project’s three focus areas—Application Development, Software Technology, and Hardware and Integration—and the close working relationships with DOE HPC facilities and the vendors that are fielding the exascale machines. “ECP emphasizes the commonalities between each of the focus areas and provides an environment where we can identify with each other, share experiences and ideas, and understand one another while still being unique in our abilities,” says Andrew Siegel, a senior scientist at Argonne National Laboratory and the director of ECP Application Development. “ECP has provided the stability and needed vision for a diverse community to work together to achieve targets we all care about.”

A Computing Symphony

ECP is an extensive, seven-year, $1.8 billion project that harnesses the collective brainpower of computer science experts from DOE national laboratories, universities, and industrial partners under a single funding umbrella.

With this funding paradigm, integrated teams have been able to surpass their target goals: 50 times the application performance of the 20-petaflop (floating-point operations per second) systems in use when ECP began in 2016, and 5 times the performance of the 200-petaflop Summit supercomputer (ranked the world’s most powerful computer in 2018 and 2019). Mike Heroux, a senior scientist at Sandia National Laboratories and the director of ECP Software Technology, says, “ECP is unique in that everyone involved in the project has the same mission, and we have healthy funding that is holistic across all the participating organizations, so we can collaborate in ways that have been essential to our success.”

At a basic level, software technology products such as math libraries, input/output (I/O) libraries, and performance analysis tools provide the building blocks for applications—sophisticated computer programs that run complex underlying mathematical calculations to deliver the necessary predictive capabilities. Applications are dependent upon the available software products, and both must be developed with computing architectures in mind—what types of processors are used, for example—so that they will run efficiently and effectively when integrated. Together, these three pieces—applications, software technology, and hardware—orchestrate the computing symphony that enables advanced scientific simulation.
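As a deliberately simplified sketch of that layering, consider an application-level update that delegates its arithmetic to a math library through the standard C BLAS interface. The field names and time step below are hypothetical, and the single BLAS call stands in for the far richer library stacks ECP delivers:

```cpp
#include <vector>
#include <cblas.h>  // C interface to BLAS, a classic math-library layer

int main() {
  const int n = 100000;                  // hypothetical problem size
  std::vector<double> velocity(n, 1.0);  // hypothetical application state
  std::vector<double> position(n, 0.0);
  const double dt = 1.0e-3;              // hypothetical time step

  // The application expresses the physics (position += dt * velocity);
  // the math library supplies the tuned kernel (daxpy: y = a*x + y).
  cblas_daxpy(n, dt, velocity.data(), 1, position.data(), 1);

  return 0;
}
```

The application never touches the low-level loop; when the library is retuned for a new processor, the application inherits the speedup without changing its own code.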

According to Erik Draeger, the Scientific Computing group leader in the Center for Applied Scientific Computing at Lawrence Livermore National Laboratory and the deputy director of ECP Application Development, the working model that has prevailed for most of the history of computational science can be likened to frontier homesteading. “In general, homesteaders supply for their own needs and build their own structures. They may get input, but they still want to build and fix it themselves. They want to understand how it works,” he says. “In this case, the homesteader isn’t going to be comfortable with having a service where someone shows up and fertilizes his field once a week. He would have a hard time trusting that the person would do it the way he would want it done because the situation is counter to the model.” As the computing landscape becomes more complex, this homesteader mentality becomes less tenable. HPC has become so intricate that it is no longer efficient for one person to be an expert in every subdomain. Draeger continues, “ECP enabled these different groups—applications, software, and hardware—to establish healthy, collaborative working relationships where specialists in each area came together to create something greater than the sum of its parts.”

In the past, development paths for applications and software technologies have often been somewhat disconnected from one another. Part of the reason for this disconnect is that technology products are not typically created with specific applications in mind. Although this more isolated approach has produced software products that have become extremely useful for many applications with large user communities, such products are the exception rather than the rule. With ECP, working together was a prerequisite for participation. “From the beginning, the teams had this so-called ‘shared fate,’” says Siegel. When incorporating new capabilities, application teams had to consider relevant software tools developed by others that could help meet their performance targets, and if they chose not to use them, they needed to justify why not. Simultaneously, software technology teams had their success measured by the number of sustainable integrations they achieved with applications and other users of their products. “This early communication incentivized teams to be knowledgeable of each other’s work and identify gaps between what the application teams needed and what the software technologies could provide,” says Siegel. “Initially, we had to foster these types of collaborations, but eventually the process gained momentum. Teams wanted to help each other and demonstrate that effort quantitatively.”

Creating this type of push–pull effect, where teams can iterate back and forth, offers other benefits in addition to improved application performance. Heroux says, “A substantial level of effort is required to integrate a library or utilize a tool, which results in a short-term loss in productivity because you first must learn how to do it. However, once you’ve made the investment, then you reap the benefits going forward if those libraries and tools are high quality, and for us, quality was a top priority.” Such collaborations also boost confidence in the products being provided.

By having the application developers leverage the libraries and tools from the software technology teams, software experts gleaned important information about how to adapt and build upon existing technologies to meet the needs of the exascale user. ECP enabled this type of interaction by providing an environment where that creative problem solving could occur in a collaborative space, and the benefits of that paradigm are evident in the types of projects that have thrived over the last seven years.

Collaboration Pays Dividends

A significant challenge that ran across the entire ECP community was adapting codes and technologies to work on heterogeneous architectures—systems based on a mix of graphics processing units (GPUs) and central processing units (CPUs). GPUs, which were originally designed to accelerate computer graphics workloads, have proven to be much more energy efficient than CPUs. Given the substantial power demands of exascale computers, GPUs became an effective way to increase a machine’s processing capability with much lower power consumption than a system fully dependent on CPU performance. All present exascale computing architectures derive more than 95 percent of their performance from GPUs and less than 5 percent from CPUs. Thus, to fully exploit the capability of exascale systems, applications must successfully utilize GPUs. “Refactoring codes to run well on exascale systems, which have heterogeneous architectures from different vendors, has required innovations in data structures, algorithms, and software development methodologies that perform well independently of specific accelerator features,” says former ECP Director Doug Kothe. “ECP projects are first movers in exercising performance portable programming models.”

The ECP paradigm of cross-project collaboration allowed application development and software technology teams to coordinate with vendors and pivot when the need arose. Heroux says, “We knew we needed to be running on GPUs. Working directly with the vendors across all the laboratories allowed us to address the need to adapt and change our algorithms and software to run on these machines more quickly than if we had been operating in the traditional, more siloed way.” Teams worked together to develop models, algorithms, and methods that would work with the requisite software technologies and hardware to improve application performance and demonstrate successful execution of “challenge problems”—high-priority strategic problems of national interest that would be impossible to address without exascale capability.

Several ECP projects serve as exemplars of how the “shared fate” philosophy of well-executed multidisciplinary teamwork has propelled the exascale ecosystem forward. Kokkos and RAJA are production-level solutions for writing modern C++ applications that enable codes to be hardware agnostic. Essentially, applications written using Kokkos and RAJA can be portably compiled to work on a variety of different hardware architectures. These libraries were game-changing for application developers, especially given that the exascale ecosystem uses GPUs from three different vendors with different hardware characteristics and supporting software stacks. Instead of having to write specialized code for each system, developers only had to write their code once (or with minimal revision), and Kokkos and RAJA would do the rest. Says Siegel, “Kokkos and RAJA are examples of higher level, more application friendly ways of programming for GPUs that represent an intermediate layer between the application and the native programming languages of the platform.”
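To make that intermediate layer concrete, here is a minimal, hedged sketch of the Kokkos programming model; the kernel and array names are illustrative rather than taken from any ECP application. The same C++ loop body is written once and dispatched by the library to whichever backend Kokkos was built for:

```cpp
#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    const int N = 1000000;

    // Views are portable arrays; their memory lives wherever the default
    // execution space runs (e.g., GPU memory on an exascale node).
    Kokkos::View<double*> x("x", N);
    Kokkos::View<double*> y("y", N);

    // Written once, this kernel is compiled for CUDA, HIP, SYCL, or host
    // threads depending on how Kokkos was configured, with no source changes.
    Kokkos::parallel_for("axpy", N, KOKKOS_LAMBDA(const int i) {
      y(i) = 2.0 * x(i) + y(i);
    });

    // Reductions are expressed portably as well.
    double sum = 0.0;
    Kokkos::parallel_reduce("dot", N,
        KOKKOS_LAMBDA(const int i, double& partial) {
      partial += x(i) * y(i);
    }, sum);
  }
  Kokkos::finalize();
  return 0;
}
```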

RAJA was instrumental in porting ECP’s EQSIM (earthquake simulation) application to Frontier and future exascale systems. With EQSIM, scientists for the first time can run physics-based simulations that predict earthquake ground motion at a high enough resolution that the effects on buildings and infrastructure can be modeled to improve hazard and risk assessments. According to Draeger, this work is a classic example of how applications and software teams have come together through ECP. “Initially, the application developers were planning to write the code for the GPU architecture on their own, but they soon realized what a monstrous task it would be, so they reached out for expertise on the software technology side. With RAJA, the EQSIM team didn’t have to write all the underlying machine-specific programming to port their code; it was done for them with this intermediate layer.” He continues, “EQSIM application developers iterated back and forth with the RAJA team and worked with them to evaluate what worked well and what didn’t, and the combination was a massive success.” The EQSIM team has successfully run their challenge problem on Frontier and far exceeded their performance target while also demonstrating the portability of the code to different machines.
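RAJA expresses the same idea through execution policies. The hedged sketch below is not EQSIM code; the array and loop body are placeholders meant only to show how swapping a policy type retargets an otherwise unchanged loop:

```cpp
#include <RAJA/RAJA.hpp>
#include <vector>

int main() {
  const int N = 1024;
  std::vector<double> ground_motion(N, 0.0);  // placeholder field
  double* gm = ground_motion.data();

  // The execution policy selects the backend. Swapping seq_exec for a GPU
  // policy such as RAJA::hip_exec<256> retargets the loop; GPU policies
  // also require a device-annotated lambda and device-accessible memory.
  using policy = RAJA::seq_exec;

  RAJA::forall<policy>(RAJA::RangeSegment(0, N), [=](int i) {
    gm[i] = 0.5 * static_cast<double>(i);  // stand-in for real physics
  });

  return 0;
}
```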

The development team for the Exascale Atomistics for Accuracy, Length, and Time (EXAALT) application was one of several groups that adopted Kokkos to successfully port their code to Frontier. In a recent run, the team achieved a more than 500x speedup over its 2016 baseline when the results were extrapolated to the machine’s full capability, almost a factor of 10 beyond its target. Notably, the functionality of Kokkos has already extended beyond the bounds of ECP. Heroux says, “Kokkos is part of more than 100 projects and is now being used in the majority of ECP application domains but also even more broadly in the external HPC community. We’ve seen widespread adoption of Kokkos through open-source avenues, such as GitHub, which allows the larger HPC community to efficiently leverage exascale computing platforms.”

The benefits of integrated teams are perhaps best illustrated through a small subset of projects in ECP known as co-design centers. As the name implies, co-design projects were established from the beginning as blended application development–software technology teams so that the two areas would evolve together. The goal of co-design activity is to integrate the rapidly developing exascale software stack with emerging hardware technologies while developing software components that embody the most common patterns, or “motifs,” of computation and communication in ECP applications. These projects include development of capabilities such as adaptive mesh refinement, particle-based applications, and exascale machine-learning technologies. “We have six designated co-design projects that involve collaborators working together daily to fit the software products into an application based on its unique need,” says Siegel. “For these projects, separating the teams into two different focus areas would be an unnecessary wall to progress. By having the areas come together early on, the whole application and software development process works better. We’ve learned this process is also beneficial for developing vendor software and hardware, too, because the teams can help provide feedback as the hardware is being designed.”

As an example, the adaptive mesh refinement for exascale (AMReX) co-design project supports the development of block-structured AMR algorithms for solving systems of partial differential equations on exascale architectures. It is a numerical library that enables the computational power of the machine to focus on the most interesting parts of a simulation in the most efficient way. Block-structured AMR is integral to ECP applications in the areas of accelerator design, additive manufacturing, astrophysics, combustion, cosmology, multiphase flow, and wind plant modeling.
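The sketch below is a minimal, hedged illustration of the AMReX data model, assuming a standard three-dimensional AMReX build; the domain size and field are illustrative and not drawn from any ECP application. The library decomposes the index space into blocks, distributes them across processes, and manages the field data, so the application supplies only the physics:

```cpp
#include <AMReX.H>
#include <AMReX_BoxArray.H>
#include <AMReX_DistributionMapping.H>
#include <AMReX_MultiFab.H>
#include <AMReX_Print.H>

int main(int argc, char* argv[]) {
  amrex::Initialize(argc, argv);
  {
    // A 128^3 index space for one level (sizes are illustrative).
    amrex::Box domain(amrex::IntVect(0, 0, 0),
                      amrex::IntVect(127, 127, 127));

    // Chop the domain into logically rectangular blocks and let AMReX
    // assign them to processes.
    amrex::BoxArray grids(domain);
    grids.maxSize(32);  // blocks of at most 32^3 cells
    amrex::DistributionMapping dmap(grids);

    // A distributed field: one component, no ghost cells.
    amrex::MultiFab phi(grids, dmap, 1, 0);
    phi.setVal(0.0);

    amrex::Print() << "Level 0 holds " << grids.size() << " boxes\n";
  }
  amrex::Finalize();
  return 0;
}
```

In a full AMR calculation, finer levels of boxes are then added only where the solution demands resolution, which is how the library focuses the machine’s power on the most interesting parts of a simulation.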

WarpX, one of several application projects working with the AMReX team, exemplifies the co-design strategy. Over the course of ECP, the two teams have worked together to create a highly parallel and optimized simulation code for modeling plasma-based particle colliders on exascale systems. Insights from this work could lead to the design of more affordable particle accelerators for a host of applications ranging from fundamental science research to disease treatment. WarpX has exceeded its project goal by running at scale on Frontier—achieving a 500x speedup over the original Warp application. In 2022, the WarpX team also won the coveted Gordon Bell Prize for implementing and deploying WarpX to deliver significantly advanced particle-in-cell simulations of kinetic plasmas, optimized to run on 4 of the 10 fastest supercomputers in the world.

Exascale-ready applications will enable simulations of previously intractable problems related to scientific discovery; health care; and energy, economic, and national security. Heroux notes that the successes achieved so far would not have been possible without the teams at the DOE facilities who have worked so hard to run applications and deploy the software technologies on systems as they became available. Early on, these teams helped establish requirements and supported collaboration with vendors to realize the necessary hardware; later work also included ramping up performance to exascale, running the codes, facilitating time on the machines, developing deployment plans, and offering training to support sustainable practices. Richard Gerber, HPC Department Head and Senior Science Advisor at the National Energy Research Scientific Computing Center (NERSC) at Lawrence Berkeley National Laboratory and the director of ECP Hardware and Integration, says, “The investments ECP made in the entire ecosystem, from application development to execution, are having payoffs that exceed expectations. The project specifically supported the effort needed to integrate, deploy, and test applications and software on first-of-their-kind systems—a task that requires specialized knowledge, expertise, and advanced tools. Our goal is that these applications and software technologies are realized on the hardware, and we’ve worked together to make that possible.”

Sustainability for the Future of Computing

As ECP winds down, reflection on the project is yielding important insights into its challenges and successes and into the future sustainability of its products. Draeger says, “Collaboration is not without risks. The more you depend on other people, the more potential modes of failure there are if others don’t want to prioritize what you need them to. ECP has been a good model of stability. It incentivized people to work together, and eventually teams began to broaden their own collaborations.” Notably, ECP provided a space for experts in different scientific disciplines to relate and engage with one another—a climate scientist may be on the same project as a particle physicist, for example—and this environment allows for cross-domain conversations that enable progress. Heroux notes, “We learn from our conversations, on the edges, between people who are involved throughout ECP.” Today, what has been achieved through those collaborations and conversations is remarkable.

Frontier, the world’s fastest supercomputer according to the TOP500 (2022, 2023) and the first exascale system, is exceeding its performance expectations, as are many of the software technologies and applications. The applications running on Frontier are demonstrating exascale capability for conducting scientific investigations in realms previously out of reach.

With sustained commitment to the work performed through ECP, the HPC community will have an exceptional methodology for addressing future computing needs as supercomputers become increasingly powerful and complex.

Heroux says, “What we’ve learned is that the model works. We should avoid going back to the siloed approach, working without an avenue to continue this model of product development, as we’ve seen it can and does produce results.”

“With every generation of machine, it becomes much harder for a domain specialist to be an expert in everything needed to do both the science on the machine and get the applications to run efficiently or run at all,” says Draeger. As the machinery becomes more exotic, computational specialists will need to rely on the expertise of others in different subdomains to maximize future computing potential. ECP’s shared-fate paradigm has promoted unity among DOE laboratories, academia, and industry and has pushed computing innovation to new heights. ECP Director Lori Diachin says, “The Exascale Computing Project has provided the foundation for the next generation of computational science breakthroughs. The lessons learned in how to effectively use accelerator-based computing will impact the HPC community for the next decade and beyond.”

In nature, ecosystems are fundamental to the survival of the organisms living within them, each one relying on the others to achieve ecological sustainability. It turns out that for computational science, when each area comes together to work toward a common goal, a capable “computing” ecosystem emerges. Exascale systems will drive breakthroughs in energy production, storage, and transmission; national security; materials science; additive manufacturing; chemical design; artificial intelligence and machine learning; cancer research and treatment; and many other areas. With continued focus on delivering computing solutions and sustaining this capable ecosystem, the nation will be well positioned to deliver future technological advances and the next generation of advanced machines. ECP has been a key component of bringing exascale to fruition, opening the door to scientific discoveries beyond our current comprehension. Revolutionary science awaits.
