An exascale software discussion with HPC veteran Jack Dongarra
Professor Jack Dongarra is one of the distinguished SC Perennials, a group of 13 individuals who have attended each SC conference since the first event in 1988. This article is based on an interview with Jack conducted by Mike Bernhardt, also an SC Perennial, discussing the role of Dongarra’s team as they tackle several ECP-funded software development projects.
Article by: Katie Jones, ORNL
Jack Dongarra, director of the University of Tennessee’s Center for Information Technology Research and a distinguished research staff member at Oak Ridge National Laboratory, is perhaps best known for his development of the LINPACK benchmark, which is used to evaluate high-performance computing (HPC) performance and to rank supercomputers on the international TOP500 list. Dongarra is also principal investigator on three of the 35 software development proposals funded for the first year of the US Department of Energy’s (DOE’s) Exascale Computing Project (ECP):
- Software for Linear Algebra Targeting Exascale (SLATE), a numerical linear algebra library
- Distributed Tasking for Exascale (PaRSEC), a runtime system
- The Exascale Performance Application Programming Interface (EXA-PAPI)
These and other software development projects are enabling ECP to create a comprehensive software stack for exascale systems, including programming models and runtime libraries, mathematical libraries and frameworks, tools, lower-level system software, data management and I/O, and in situ visualization and data analysis.
Math to motherboard
First used to rate supercomputing performance in 1993, the LINPACK benchmark was an incidental byproduct of a software package of linear algebra routines for HPC (hence “lin” and “pack”). Because these fundamental operations, such as solving systems of linear equations, are so important to computer simulations, they were also an effective way to evaluate performance.
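The kind of computation LINPACK times can be pictured with a small sketch: solving a dense linear system Ax = b. The real benchmark uses highly optimized, blocked routines at enormous scale; this toy version (plain Gaussian elimination with partial pivoting) only illustrates the underlying problem.

```python
# Toy illustration of the problem the LINPACK benchmark times: solving a
# dense linear system A x = b. This is a minimal sketch, not the
# optimized, blocked implementation the real benchmark uses.

def solve_dense(A, b):
    """Gaussian elimination with partial pivoting for small dense systems."""
    n = len(A)
    # Work on copies so the caller's data is untouched.
    A = [row[:] for row in A]
    x = b[:]
    for k in range(n):
        # Partial pivoting: bring the largest remaining entry to the diagonal.
        p = max(range(k, n), key=lambda i: abs(A[i][k]))
        A[k], A[p] = A[p], A[k]
        x[k], x[p] = x[p], x[k]
        # Eliminate the entries below the pivot.
        for i in range(k + 1, n):
            m = A[i][k] / A[k][k]
            for j in range(k, n):
                A[i][j] -= m * A[k][j]
            x[i] -= m * x[k]
    # Back substitution.
    for k in range(n - 1, -1, -1):
        x[k] = (x[k] - sum(A[k][j] * x[j] for j in range(k + 1, n))) / A[k][k]
    return x

# Solve a 2x2 system: 4x + y = 1, x + 3y = 2.
print(solve_dense([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0]))  # → [1/11, 7/11]
```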
As supercomputers adopted increasingly parallel architectures, Dongarra’s team developed LAPACK as a linear algebra library for highly parallel systems. LAPACK is widely used and often incorporated into commercial software, and one of LINPACK’s algorithms, the solution of a dense system of linear equations, remains the basis for the TOP500 ranking.
“[LAPACK] really is thought of as the gold standard for numerical software,” Dongarra said.
Now, with ECP funding, Dongarra’s team is preparing a new generation of linear algebra software known as SLATE for exascale systems.
“What we’re planning with ECP is to take the algorithms and the problems that are tackled with LAPACK and rearrange, rework, and reimplement the algorithms so they run efficiently across exascale-based systems,” he said.
As with the move from LINPACK to LAPACK, Dongarra said software must be upgraded about every 10 to 15 years to complement a new class of supercomputers.
Although the development of SLATE may be part of a natural progression, it is no less challenging.
“The software libraries need to be enhanced to effectively deal with that underlying structure that we have, and compilers, operating systems, communication—all that needs to be enhanced,” Dongarra said.
That underlying structure, the layer connecting the math (linear algebra equations in this case) with the hardware, is the focus of another project Dongarra is spearheading.
“That project is called PaRSEC. It’s the runtime system that we’re planning to use within the linear algebra library,” he said.
PaRSEC will allocate tasks from SLATE to the hardware components that are available for use on the supercomputer—a complicated undertaking considering there are many ways to prioritize and execute these tasks.
Dongarra expects PaRSEC will be incorporated into other libraries as well.
“There will be a number of other numerical libraries developed in the course of ECP, and this runtime system could be used as a mechanism for doing the scheduling on the hardware itself,” he said.
The third project, EXA-PAPI, is a performance application programming interface (API) that helps users understand how well software components are performing on the hardware by tracking network communication, memory-bandwidth usage, and other diagnostics.
“It provides the capability for performance debugging and gives a better understanding of where some of the bottlenecks are in applications or in software,” Dongarra said.
EXA-PAPI can help during development of applications, runtimes, and other software by identifying where communication lags or breaks down.
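PAPI itself exposes hardware performance counters through a C interface; as a loose, hypothetical illustration of the underlying idea of section-level instrumentation (this is not the PAPI API), one can imagine accumulating time per named code section and reporting the hottest one:

```python
# Hypothetical illustration of section-level performance instrumentation,
# in the spirit of what a tool like PAPI enables. This is NOT the PAPI
# API, which is a C interface to hardware performance counters.

import time
from collections import defaultdict
from contextlib import contextmanager

timings = defaultdict(float)

@contextmanager
def instrument(section):
    """Accumulate wall-clock time spent in a named code section."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[section] += time.perf_counter() - start

with instrument("compute"):
    total = sum(i * i for i in range(100_000))

with instrument("communicate"):
    time.sleep(0.1)  # stand-in for a message exchange

# Report the bottleneck: the section that consumed the most time.
bottleneck = max(timings, key=timings.get)
print(f"bottleneck: {bottleneck} ({timings[bottleneck]:.3f}s)")
```

A real counter interface would additionally attribute cache misses, memory-bandwidth stalls, and network traffic to such sections, which is what makes the bottleneck diagnosis Dongarra describes possible.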
And communication is everything for science applications that are trying to rapidly simulate realistic processes with many variables that respond or adapt to each other.
“Communication is one of the critical things today [in petascale systems that] in our exascale systems will limit performance,” he said. “We’re talking about systems which are going to have billions of threads of execution and looking at passing information around in this very complex network.”
Dongarra said that understanding when communication failures occur in a supercomputer of exascale magnitude and learning how developers can adapt and transition through failure to carry on computation are among exascale’s great challenges—but also among its most interesting.
Research domains like biology, nuclear physics, combustion, materials science—in other words “the key computational problems we’re trying to solve,” Dongarra said—are pushing for better computational performance.
“Computational science is such an important driver of all scientific discovery that it’s critical in the path to expand the computational platform that we have and move to exascale,” he said. “I have to say, this is one of the most exciting times that I have faced in my career. There are so many problems to overcome, and those problems bring about great research interests and require us to rethink and reinvestigate ideas.”