By Matt Lakin, Oak Ridge National Laboratory
When the U.S. Department of Energy presses the start button on the world’s first generation of exascale supercomputers, scientists won’t want to wait to take full advantage of the unprecedented power behind the screens. The inaugural pair of machines, Aurora at Argonne National Laboratory near Chicago and Frontier at Oak Ridge National Laboratory (ORNL) in Tennessee, are set to run at speeds that top 1.5 exaflops apiece.
For the scorekeepers, that’s 1.5 quintillion (10^18) calculations per second, or 15 million times the average power of the human brain and its 100 billion neurons. Summit, the world’s #2 supercomputing champion at ORNL, clocks in at 200 petaflops, or 200×10^15 (200 quadrillion) calculations per second. Frontier and Aurora promise to run nearly eight times faster.
That scale of processing power could help resolve fundamental questions of modern science, from designing cleaner-burning combustion engines to identifying new treatments for cancer. More than 50 teams of researchers will be waiting to test out applications aimed at tackling such questions. Making the most of exascale’s power means building a software stack that exploits every advancement in parallel computing and heterogeneous architecture to leave no FLOP behind, said Doug Kothe, director of the DOE’s Exascale Computing Project (ECP).
“We’re building what I expect to be the tools of the trade for decades to come, and they need to be ready to go on Day One,” Kothe said. “At the base of this whole pyramid is the software stack. One node on the exascale system will be the same or greater power as the highest-grade consumer processor. These nodes are going to be complicated beasts, and they’re going to be very challenging to program efficiently. But if we don’t find a way to do that, we’re going to leave a lot of computing power on the floor, and that’s not acceptable.”
Software engineers and computer scientists with the ECP have spent the past five years laboring to build a pyramid that will support exascale’s full processing power. The software work is one half of a twin effort as the project pushes toward that massive leap in computing speed.
Unlike the average developer, these teams don’t have the luxury of testing out approaches on a finished product with trial audiences over time.
“The more traditional approach would be to build every application and install them one at a time,” said Mike Heroux, a senior scientist at Sandia National Laboratories and director of software technology for the ECP. “We want to be able to take this software and install it as the operating system on any device from a laptop to the largest supercomputer and have it up and running immediately. This is a first-of-its-kind effort, because if it were available before, we would have done it before.
“A lot of this hardware and software is all brand new, and it takes a lot of time to just debug something. Not only does the software need to run quickly, it needs to run smoothly, and it needs to run consistently across all platforms. It’s like writing software for both iPhone and Android while the phones are still in production.”
Achieving those levels of speed and consistency requires rethinking the classical software architecture of processing and memory. To run efficiently at exascale speed, the new supercomputing giants will need to balance parallel operations running on parallel processors at an unprecedented scale.
“Because any single processing element runs on the order of a few gigahertz, you can do at most a billion operations per second per processor,” Heroux said. “The transistors aren’t getting any faster. So the only way you can do a billion billion operations is to have a billion of these processors doing a billion operations per second, all at once. It scales to human endeavor as well.
“Say you have a printing shop. You have a person who can print 100 pages per minute. If you want a hundred hundred pages—which is 10,000—then you have to have 100 printers running in parallel. That doesn’t make the first page come out any faster. It’s still going to take just as long to get a single page out, but if you’ve got 10,000 pages to do, you can get those done in one minute instead of 100 minutes, thanks to that concurrent assembly line. It’s all about getting more work done at the same time. In our case, we have quintillions of operations to do. We can’t do it consecutively. So we need algorithms that say, ‘Let’s do these 1 billion things at once, these next 1 billion things at once,’ and so on. The reality is even more complicated in the sense that most operations on a processing element take many clock cycles to complete, so we need to start the first billion and then start the next billion, and so on—before the first billion are even done!”
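Heroux’s printing-shop arithmetic can be written out as a minimal sketch. The function names below are illustrative and do not come from the ECP; the model assumes work divides evenly with no coordination cost, and the pipelining formula is the standard textbook simplification in which a new batch starts every cycle.

```python
def batch_minutes(pages, pages_per_minute, printers):
    """Minutes to finish a batch split evenly across identical printers."""
    pages_per_printer = -(-pages // printers)  # integer ceiling division
    return pages_per_printer / pages_per_minute


def elements_needed(target_ops_per_sec, ops_per_sec_per_element):
    """Processing elements required for a target rate, assuming perfect scaling."""
    return -(-target_ops_per_sec // ops_per_sec_per_element)


def serial_cycles(batches, latency_cycles):
    """Cycles when each batch must finish before the next one starts."""
    return batches * latency_cycles


def pipelined_cycles(batches, latency_cycles):
    """Cycles when a new batch starts every cycle (idealized pipeline).

    The first batch completes after `latency_cycles`; every later batch
    completes one cycle after the one before it.
    """
    return latency_cycles + (batches - 1)


if __name__ == "__main__":
    print(batch_minutes(10_000, 100, 1))     # one printer
    print(batch_minutes(10_000, 100, 100))   # one hundred printers
    print(elements_needed(10**18, 10**9))    # a billion billion ops per second
    print(serial_cycles(1_000, 10), pipelined_cycles(1_000, 10))
```

Adding printers shortens the batch from 100 minutes to one but does nothing for the first page, and overlapping batches hides most of the per-operation latency. That is the sense in which the next billion operations start before the first billion are done.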
The hardware foundation for Frontier and Aurora, like Summit’s, will rest on graphics processing units (GPUs), which have proven ideal for splitting up and juggling the computation tasks necessary for high-performance computing. That means software libraries for applications originally designed to run on more traditional central processing units (CPUs) must be translated, sharpened, and brought up to date.
“We’re rethinking the architecture in the sense that the broader community is mapping their applications to GPUs like they haven’t been forced to do in the past,” said Jeffrey Vetter, an ORNL corporate fellow and the lead for ECP’s Development Tools group. “Software vendors have largely been able to either choose to not participate or just run on platforms without GPUs. Now, the next-generation supercomputing platforms for DOE are all GPU-based, so if you want to make use of them, you’re going to have to reprogram your application to make use of GPUs. We’re working on development tools for writing software easily and compilers for translating codes on these new heterogeneous systems.”
Writing that software requires creating new languages and building out existing languages to make the new systems run effectively. Most of the solutions Vetter and his team have developed rely on LLVM (originally short for Low-Level Virtual Machine), an open-source compiler infrastructure that translates widely used computer languages such as Fortran and C++ into machine-specific code for the processors made by major vendors such as Intel, IBM, NVIDIA and AMD.
“The biggest challenge is developing a programming system that is simultaneously portable and efficient on all these systems,” Vetter said. “There’s not really one programming model that runs across all of them. So our approach is layered. The higher-level programming models rely on the features of lower-level systems, such as the compiler and runtime system provided by the target architecture. But we’ve got to have that higher-level programming model to abstract enough detail so applications can be portable across systems.
“There’s no one silver bullet. With so much software out there already, most of the work is improving and enhancing existing compilers, so it’s more evolutionary than revolutionary. LLVM is a compiler used by virtually everybody in the industry—Google, Facebook, IBM, they all converge on LLVM. Any time we add something to LLVM, all those companies can benefit from it and vice versa. In the end, we’re making LLVM that’s better for all users.”
Besides LLVM, development teams plan to use OpenMP, a multi-threaded parallel programming model based on compiler directives, to express computations for the nodes on the forthcoming exascale systems. That means enhancing OpenMP’s existing features, working with software vendors to add new features, and making sure all the pieces fit together.
“It was sort of like jumping onto a moving train at first,” said Barbara Chapman, chair of computer science and mathematics at Brookhaven National Laboratory, who’s leading the effort to scale up OpenMP and LLVM for exascale. “OpenMP emerged as the favorite approach for exploiting parallelism across computing cores. Our goal was to add in all the new features that the applications need, especially for GPUs, and the OpenMP standards committee was already working on their next release. With help from the application teams, we were able to convince the vendors to adopt these features in a very short time.
“Since then we’ve had to focus on clarifications, improvements, all the little things you haven’t thought about when you try to specify features, such as some details of how two of them interact. We have to encourage vendors to quickly implement the extensions that we need for exascale, and we have to work directly on the open-source LLVM compiler to make it ready to provide the OpenMP features and performance we need. We’ve had some very encouraging results, especially in the last six months. What we’re going to be moving onto is the phase of getting more vendor compilers that meet our needs and gaining experience with them.”
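OpenMP directives are written in C, C++ and Fortran source code, so the sketch below is only a Python analog of the pattern a parallel loop directive expresses: split the iterations among workers, let each compute a partial result, then combine the partial results. The helper names are hypothetical. Python threads also share one interpreter lock, so this shows the structure of the computation, not a speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_bounds(n, workers):
    """Split range(n) into contiguous, nearly equal [start, stop) chunks."""
    base, extra = divmod(n, workers)
    bounds = []
    start = 0
    for w in range(workers):
        stop = start + base + (1 if w < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def partial_dot(x, y, start, stop):
    """Dot product over one worker's share of the index range."""
    return sum(x[i] * y[i] for i in range(start, stop))


def parallel_dot(x, y, workers=4):
    """Fork-join dot product: fan out the chunks, then reduce the partial sums.

    In C or C++ the same intent is stated with a single directive on the
    loop, for example: #pragma omp parallel for reduction(+:total)
    """
    bounds = chunk_bounds(len(x), workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(lambda b: partial_dot(x, y, b[0], b[1]), bounds)
        return sum(partials)


if __name__ == "__main__":
    print(parallel_dot([1, 2, 3], [4, 5, 6], workers=2))
```

With a directive, the chunking and the final reduction are left to the compiler and runtime. That is why the quality of each vendor’s OpenMP implementation matters so much to the application teams.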
The only challenge equal to ramping up performance for a machine that hasn’t been built yet might be building and calibrating tools to measure that performance. Jack Dongarra—a supercomputing veteran and fellow of the Institute of Electrical and Electronics Engineers, the National Academy of Engineering, and the Royal Society of London—hasn’t blinked.
“It’s always challenging whenever you face building software for a new architecture,” said Dongarra, a distinguished researcher at ORNL and professor of computer science and electrical engineering at the University of Tennessee. “You end up having to constantly retool the apps, but that’s normal in the face of a new system. It’s a challenge we can deal with.”
Dongarra leads a group at the university that is developing the Exascale Performance Application Programming Interface (ExaPAPI), a diagnostic tool to measure exascale output and efficiency.
“Think of it as a dashboard for what’s going on inside the machine,” he said. “This way you can see how efficiently it’s running, how much is coming in vs. how much is going out, how much energy is being used. We want to make sure all these apps will perform at peak levels, and to make the apps more efficient, we need that feedback. We already have the basic performance counters: How many flops, how much memory is in use, how much power is running. But without the other tools, the user is faced with this black box and unable to understand what’s going on inside. ExaPAPI is what’s going to let us really get a view of what’s going on inside that black box.”
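As a rough illustration of the dashboard idea, the sketch below times a small kernel and reports a few metrics. It does not use ExaPAPI or any PAPI call. The operation count is worked out by hand rather than read from hardware counters, and the function and field names are assumptions made for this example.

```python
import time


def measure_dot(n):
    """Time a length-n dot product and report dashboard-style metrics."""
    x = [1.0] * n
    y = [2.0] * n
    start = time.perf_counter()
    total = 0.0
    for a, b in zip(x, y):
        total += a * b  # one multiply and one add per element
    elapsed = time.perf_counter() - start
    flop_count = 2 * n  # counted analytically, not by hardware
    rate = flop_count / elapsed if elapsed > 0 else float("inf")
    return {
        "result": total,
        "flop_count": flop_count,
        "seconds": elapsed,
        "flops_per_second": rate,
    }


if __name__ == "__main__":
    for name, value in measure_dot(1_000_000).items():
        print(name, value)
```

A hardware-counter tool reports what the processor actually executed instead of what the programmer expected it to execute, which is what makes the gap between the two visible.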
ExaPAPI provides a detailed assessment of exascale performance. HPCToolkit, another diagnostic tool for exascale, acts as the zoom lens, intended to pinpoint opportunities for optimization in application codes.
“What ExaPAPI does is great, but if your piece of code is 100,000 lines, you need to know where to fix it if there’s a problem,” said John Mellor-Crummey, a professor of computer science and electrical and computer engineering at Rice University, who’s leading development of HPCToolkit. “It’s not necessarily enough just to know how long it takes to run the program. What we acquire is not only the place where a metric was measured, but we find out exactly where we are in the context of program execution when the problem was encountered. That way we can track down where you’re spending your time, attribute this to individual source lines in the program, and tell you: Here’s where it’s working effectively, here’s where it’s wasting time, here’s where it’s spending time waiting for synchronization.”
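The idea of attributing time to a place in the code together with its calling context can be sketched with Python’s built-in cProfile module. This is a stand-in chosen for illustration and not HPCToolkit’s own measurement method; the three workload functions are hypothetical.

```python
import cProfile
import pstats


def wait_for_sync(n):
    """Hypothetical stand-in for time spent waiting on other workers."""
    return sum(range(n))


def compute(n):
    """Hypothetical stand-in for useful numerical work."""
    return sum(i * i for i in range(n))


def solver_step(n):
    return compute(n) + wait_for_sync(n)


def profile_call_edges(func, *args):
    """Run func under cProfile and return caller-to-callee timing rows.

    Each row is (caller, callee, callee_line, cumulative_seconds),
    sorted so the most expensive edges come first.
    """
    profiler = cProfile.Profile()
    profiler.runcall(func, *args)
    stats = pstats.Stats(profiler)
    rows = []
    for (_, line, callee), (_, _, _, _, callers) in stats.stats.items():
        for (_, _, caller), (_, _, _, cumulative) in callers.items():
            rows.append((caller, callee, line, cumulative))
    return sorted(rows, key=lambda row: row[3], reverse=True)


if __name__ == "__main__":
    for row in profile_call_edges(solver_step, 200_000):
        print(row)
```

Each row answers both questions at once: which function the time was spent in, and which caller led there.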
The vast number of parallel operations required for exascale presents a particular challenge for measurement and analysis.
“We’re building tools that are very different from what’s been built in the past,” Mellor-Crummey said. “It’s a billion threads, each doing a billion things a second, and we’ve got to measure it and figure out what’s going wrong. How can we analyze all the measurements we collect and do it fast enough? We’ve got to build our own parallel applications to process all this performance data and our own visualizations as well. But I’m confident in our approach.”
The ECP scientists ultimately envision a rich, diverse ecosystem of software for exascale, where advanced libraries for math, visualization, and data analytics build upon these foundational programming models and tools to provide the versatile capabilities needed by scientific applications teams.
“The developers of reusable software libraries are pushing new frontiers of research in algorithms and data structures to exploit emerging architectural features, while relying on these advances in compilers and performance tools,” said Lois Curfman McInnes, a senior computational scientist at Argonne and deputy director of software technology for the ECP. “Developers of applications and libraries will leverage these new features in programming models, compilers, and development tools in order to advance their software on emerging architectures, while also providing important feedback about their wish lists for future functionality.”
That work won’t end when Frontier, Aurora, and the third planned exascale system, El Capitan at Lawrence Livermore National Laboratory in California, are stood up. The next generation of exascale will need fine-tuning, and so will whatever comes next, whether zettascale, quantum, or neuromorphic computing.
“The goal has always been that what we’re building would still translate onto the newer systems,” said Vetter, the software technology tools manager. “It’s hard to understand if what we’re doing now would apply as far out as quantum, but for some of the next generation systems, we can anticipate what the software architectures need to look like, and we’ve got people doing active research trying to find solutions. In some cases, we’re not even looking at the same questions. There are too many fun things to do, too many possibilities to ever think about stopping.”
The problems to be solved won’t stop, either. The scientists wouldn’t have it any other way.
“This is the nature of high-performance computing,” said Heroux, the ECP’s software technology director. “We’re always trying to get more performance and innovate new ways to get that, always bringing in new technology. It’s like building racecars: Racing is grabbing onto the edge of disaster and not letting go, progressively reaching further as we try to go faster. We’re always on the edge. That’s where we want to be.”