This conversation recorded on video and podcast audio with Doug Kothe, director of the US Department of Energy’s (DOE) Exascale Computing Project (ECP), highlights how disciplined and tailored project management led to very impressive results in what was likely the most comprehensive independent review of the project to date; the documentation of lessons learned will capture aspects of ECP’s many enduring legacies; Software Technology is developing a diverse, robust software stack to seamlessly port and optimally and sustainably perform on accelerated architectures; many Applications are showing promise for a 50x performance improvement; Hardware and Integration, through continuous software deployment and integration at facilities, is ensuring that ECP’s products will be robust, production ready, and functional right out of the box; and ECP is driving the sharing of information through regular training not only with ECP participants but also the broader US high-performance computing community to lower barriers to using exascale systems and accelerated architectures in general. The interview, conducted at the end of January 2020 by ECP’s Communications Manager Mike Bernhardt, also looks at what is ahead for the project—its aggressive plans and myriad deliverables and milestones as well as annual reviews.
Bernhardt: Let’s start this update by putting the spotlight on the ECP project office team. It’s right at the heart of the ECP activity, but it’s an area that’s not often very visible to the outside world.
ECP recently went through this CD-2/3 review, and I heard one person, not affiliated with the project, describe CD-2 as the approval of a project’s plans, and CD-3 as the approval to release the project’s funds. Now, that’s a vast oversimplification; I understand that.
But the point is, a project of this size and one that operates under such very strict DOE formal guidelines and processes, well, it requires a tremendous amount of discipline and talent to keep the engine oiled and keep everything moving smoothly. During that recent review, much of what the reviewers focused on was the myriad reports and the spreadsheets and the analysis that’s prepared and assembled primarily by the ECP project office team.
Maybe you could give our viewers and listeners your perspective of what this CD-2/3 review is, why it’s important, and then maybe recap the role that the project office played in putting it all together.
Kothe: OK, Mike. Boy, there’s a long answer to that, and I’ll try to be short here. ECP follows a DOE order, meaning there’s a set of requirements for formal project management. And it’s order 413.3b. I guess that means there are lots of orders, but, in this case, it really is a set of required best-practice processes for project management. This is done for larger projects, historically for procurement projects, construction projects, and what we’ve done is tailor it for research, development, and deployment at ECP. Certainly, the tailoring has not been easy because software development has certainly been shown over the last decade to usually be best done within a more agile process where your requirements are better understood as things evolve. So, we worked hard, really leaning on the project office to tailor what we want to do to 413.
In the 413 project there are five critical decisions from zero to four, and we went through—in our mind—the most important, critical decision review, which was critical decision 2/3. And as you noted, it really is setting your performance baseline and formalizing that baseline so we could fund with fair amounts of stability from this period to the end of the project, which is formally for us targeted to 2024.
That said, the review really involved laying out a case for hundreds of millions of dollars of investment over the next four years—from now until project completion—and that constitutes many hundreds and, frankly, thousands of milestones and mileposts along the way. Our technical team—most of us are PhD scientists in our various domains who are not formally trained project managers. And that’s where the project office comes in.
The project office consists of thirty to forty staff. Most are formally trained—or at least the key people that helped us with the baseline—in project management. What they’ve done is worked carefully with us to make sure what we’re laying out can be tracked from a cost and schedule point of view, is attainable, has a line of sight, so to speak, to our key metrics.
These reviews are fairly onerous. They’re typically three days. They look very carefully at what we call our performance baseline, and that is our detailed schedule from 2020 to 2024. Does it make sense? Is it attainable? Have you taken into account the risk that might arise, the unknown unknowns and known unknowns? Do you have cost and schedule contingency to address those risks? We have four key performance parameters that are quantitative metrics for the project. Are we going to be able to meet those metrics by 2023?
The review itself had five formal charge questions, the answers to which needed to be yes for us to be able to proceed. The review concluded that all five charge questions, which related to what I just spoke to, were indeed resounding yeses and recommended that we proceed to CD-2/3. And the Department of Energy will then go through a formal process to say, “This team has been reviewed. We concur. We would like to have approval for moving forward.”
This is very likely our most difficult and comprehensive breadth and depth review to date, and as a leadership team, at least on the technical side, about twenty-five of us, we worked since February 2019—about eight months—preparing this schedule and this baseline, which sounds like a long time. But when you’re talking about such a large investment over multiple years, it’s important to do that.
We felt like the review went very well.
We received two recommendations from the review team, which basically were to publish and share our project management approach—look, it’s never perfect, but we’ve learned a lot in terms of what to do, and probably more importantly, what not to do. We also were given guidance to be more aggressive in our outreach. And, you know, certainly, videos and interviews like this are part of that, but to really take seriously documenting our lessons learned and best practices for our software development, our applications development, and our integration with facilities.
You know, you mentioned we prepare a lot of detailed documentation for these reviews. The recommendation was to sift through that documentation, pull out the pieces that, really, we think the larger HPC community would benefit from with regard to preparing for exascale, preparing for accelerated nodes. So, those two recommendations were certainly welcome. I mean, it was basically saying, “Hey, we think you’re doing a pretty good job; get the material out that will help the broader community.”
There were a number of other more detailed—what we call comments—which are opinions from the review team as to things we ought to tweak and improve in terms of corrective action, and we’ll take those into account as well.
We’re very happy the review is behind us. It went very well. And now we’re taking a deep breath and saying, “OK, we’re going to execute on this baseline and deliver.”
Bernhardt: And also then it sounds like basically the learning experience at the project level will become part of ECP’s legacy.
Kothe: I think so. You know, I can’t commend the project office highly enough. I’m not going to get into the details of the staff, but we have people who help us lay out the plan and people who help us track our plan; they’re called project controls people. In particular, Kathlyn Boudwin, our project manager/director; Doug Collins; and Manuel Vigil really led the project office.
They were really more on point here for this review than our technical staff and really delivered. We’re really very pleased with the outcomes of the review, and the recommendations and comments are only just going to help us.
Bernhardt: All right. That’s awesome. Let’s move the spotlight over to Software Technology, Doug. I’d like to try and connect some dots for our listeners and our viewers. I think in general the people who follow the ECP are aware that this is the team that’s responsible for the nation’s first exascale software stack.
They think of ECP bringing together dozens of the software products, you know, to work in harmony with the targeted applications and the computing environments, but the question is how does what ECP is architecting in terms of the exascale software stack work hand in hand with these new accelerated architectures that are being developed by exascale system and hardware providers? How do you know what we’re building at this point is going to work with the systems that we won’t even see for quite some time?
Kothe: Very, very good question. You can only develop for so long with hardware that you can’t get your hands on. A really key point for us is to be able to work closely with the vendors involved in ECP, and, you know, we have six PathForward vendors, many of which are going to be deploying these early systems: Cray, HPE, IBM, Intel, NVIDIA, and AMD.
It’s important to work closely with the vendors to understand their hardware roadmaps, to understand their portions of the software stack. But the rubber hits the road, so to speak, when you actually get your hands on early hardware. So, early hardware we would view as one or two generations upstream in terms of time relative to the actual systems that are going to be deployed.
As of last fall we were able to begin taking an early look at some of the AMD hardware. And a lot of this information was rolled out at SC19 in November. We’ve also been working closely with Intel, and when I say “we,” it’s really through the leadership computing facilities that are going to be deploying these systems. We’ve also had, last fall, some pretty in-depth, deep-dive hack-a-thons and training sessions with the vendors with regard to what’s coming and what to prepare for.
That said, in ECP—in our Software Technology portfolio led by Mike Heroux—we’re not narrowing down to one particular programming model. We do see a very diverse accelerated node ecosystem coming, and we think that’s good for the community and good for us, meaning not just one type of accelerator but multiple types of accelerators, say, from NVIDIA, AMD, and Intel.
And so that’s really forcing us—and I think this is for the good of the community and moving forward—to have a diverse, robust software stack that can enable applications to, ideally, seamlessly port and get performance on multiple GPUs. This is a very difficult and daunting task, but we’re now really getting into the details of how to develop whether it’s abstraction layers or push for certain programming models that best allow our applications to achieve performance on these different types of accelerators.
Bernhardt: OK. Software development kits. A lot of talk around E4S/SDK. What’s it all about? Why is it so important?
Kothe: Yeah, so in ECP we have identified in our Software Technology portfolio seventy different unique products. I’m a little bit biased here, but many of these products have been evolving for years or for decades, kind of independently and autonomously on their own, and have been deployed and utilized very effectively.
In many cases we’re continuing this development of existing products but more aggressively with accelerated nodes in mind. We realized that many of these products have similar functionalities or they were meeting similar requirements, and by grouping these together, let’s say in programming models or in math libraries or in I/O or in DataVis, we can really ensure interoperability, nice sort of horizontal integration, meaning applications can ideally plug and play some of these techniques, some of these technologies. So we realized by grouping them together in five or six different related thematic areas that we could create software development kits along these themes—say, math libraries is probably our most mature—containerize them in different types of containers, and then deploy them for the community writ large.
The requirement for an application is to not swallow an SDK whole. An SDK in Math Libraries might contain right now, say, a dozen different types of math libraries. But by being able to pull in an SDK, now say an application can literally plug and play and test different types of math libraries, maybe sparse linear solvers or dense solvers or Eigensolvers or whatever. And so it’s going to be a tremendous advantage for applications in the HPC and the software community in general to be able to have these things containerized and put together.
The SDKs roll up into what we call the Extreme-scale Scientific Software Stack, or E4S. And we’ve released several versions of E4S; if you go to E4S.io, our latest release, 1.0, occurred in November, last fall. That release has fifty different full-release products, and I think a half dozen partial-release products out there for folks to try in four different types of containers. And we’re really optimistic, and we’re really seeing the returns on our investment in doing things like this, not just for ECP but the community at large, both nationally and internationally. So that’s a key responsibility of ECP, to ensure what I’ll claim is better software quality, better robustness, better interoperability. That’s going to benefit us all.
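The plug-and-play idea Kothe describes for the math libraries SDK can be illustrated with a minimal Python sketch. It is only an illustration, assuming a shared calling convention across solvers; the names `SOLVERS` and `solve` are hypothetical and are not part of E4S or any ECP product. The application asks for a solver by name and can swap a direct solver for an iterative one without changing its own code.

```python
from typing import Callable, Dict, List

Matrix = List[List[float]]
Vector = List[float]
Solver = Callable[[Matrix, Vector], Vector]


def gaussian_elimination(a: Matrix, b: Vector) -> Vector:
    """Dense direct solve with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the caller's data is untouched.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def jacobi(a: Matrix, b: Vector, iterations: int = 200) -> Vector:
    """Iterative solve; converges for diagonally dominant matrices."""
    n = len(b)
    x = [0.0] * n
    for _ in range(iterations):
        x = [
            (b[i] - sum(a[i][j] * x[j] for j in range(n) if j != i)) / a[i][i]
            for i in range(n)
        ]
    return x


# Hypothetical registry standing in for an SDK's menu of solvers.
SOLVERS: Dict[str, Solver] = {
    "direct": gaussian_elimination,
    "jacobi": jacobi,
}


def solve(name: str, a: Matrix, b: Vector) -> Vector:
    """Application-side call: the solver is chosen by name only."""
    return SOLVERS[name](a, b)


# Small diagonally dominant test system (illustrative values).
A = [[4.0, 1.0], [2.0, 5.0]]
B = [9.0, 12.0]
results = {name: solve(name, A, B) for name in SOLVERS}
```

Both entries in `results` agree to within floating-point tolerance, which is the property that lets an application try one solver and then another.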
Bernhardt: Without going into a lot of detail—because we can point people to these on our website—there are a few specific software projects that you might call out as good examples in this area.
Kothe: Yeah, so in our programming models area, what we’re seeing is a lot of traction with what we call abstraction layers and, in particular, the Kokkos abstraction layer developed by Sandia National Labs and the RAJA abstraction layer by Lawrence Livermore. Those are key abstraction layers for applications in particular but also for software technologies that a lot of our projects internal to ECP are embracing. But also externally as well.
For RAJA and Kokkos, what we mean by an abstraction layer is that it handles the details of making sure that your data is laid out in a way that takes advantage of the accelerators and that certain for-loops and do-loops are executed in ways that take advantage of the accelerators. Those details, whatever the particular GPU type, are really hidden from the applications and from software technologies wanting to use those layers. So, now I can essentially call on Kokkos or RAJA to lay out the data for me, to execute certain floating-point operations for me, and whether I’m on an Intel or an AMD or an NVIDIA GPU, that complexity is hidden. And so these abstraction layers essentially sit kind of on top of the metal, so to speak; whether you’re using OpenMP or OpenACC or CUDA, those complexities are hidden.
In our application projects, we have twenty-four projects that really map to almost fifty different, separate, distinct codes. On the order of fifteen or sixteen have already said, “We’re committing to these abstraction layers.” We’re also seeing the vendors do the same, which is, “Hey, we’re going to make sure that Kokkos and RAJA are not only ported but performant for you.” In other words, they’re working closely with us to make sure that those aren’t high-risk bets that the applications make, but lower-risk bets, meaning they’re going to be there. They’re going to be not just ported but performant.
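The abstraction-layer idea described above can be sketched in a few lines of Python. This is a conceptual illustration only: it is not the Kokkos or RAJA API, the names `parallel_for` and `BACKENDS` are hypothetical, and the two CPU backends merely stand in for the GPU programming models mentioned in the interview. It shows only the loop-execution half of the idea; data layout is not modeled.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def serial_for(n: int, body: Callable[[int], None]) -> None:
    """Backend 1: run the loop body one index at a time."""
    for i in range(n):
        body(i)


def threaded_for(n: int, body: Callable[[int], None]) -> None:
    """Backend 2: hand the same loop body to a pool of threads."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        # list() forces completion and re-raises any worker exception.
        list(pool.map(body, range(n)))


# Hypothetical backend table; stands in for per-device implementations.
BACKENDS = {"serial": serial_for, "threads": threaded_for}


def parallel_for(backend: str, n: int, body: Callable[[int], None]) -> None:
    """The only loop-execution call the application sees."""
    BACKENDS[backend](n, body)


def axpy(backend: str, alpha: float, x: List[float], y: List[float]) -> List[float]:
    """Application kernel, written once: out[i] = alpha * x[i] + y[i]."""
    out = [0.0] * len(x)

    def body(i: int) -> None:
        # Each index writes a distinct slot, so iterations are independent.
        out[i] = alpha * x[i] + y[i]

    parallel_for(backend, len(x), body)
    return out


x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]
serial_result = axpy("serial", 2.0, x, y)
threaded_result = axpy("threads", 2.0, x, y)
```

The kernel `axpy` never mentions how the loop runs; switching the `backend` string is the only change needed, which is the kind of portability the interview attributes to these layers.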
Bernhardt: And just to point people to other examples on the website, I think you’d mentioned Flang?
Kothe: Yeah, another key example is Flang. We are supporting a Fortran front-end portion of the LLVM compiler, and that’s a low-level virtual machine; but the LLVM compiler is an open-source compiler that we are heavily investing in, and so are vendors in terms of the back end.
I’ll note that the NNSA, which is a key funding partner for ECP, really started this development pre-ECP, and we’re continuing the investment in a Fortran front end, which is very important for our Fortran codes. I’ll also point out that with LLVM we’re investing in certain programming models like OpenACC and OpenMP, as well as in optimization of LLVM. So this is an investment we think is really exciting, with the work ongoing in these more recent months.
Bernhardt: Awesome. Let’s move forward from software to apps. What’s going on in Application Development?
Kothe: Well, a lot. One of the things that we made the case for in our recent CD-2 review was our twenty-four applications. We committed eleven of those applications to a certain performance metric, 50x, and thirteen to a certain metric. And so we laid that out in a fair amount of detail and the review agreed this is the right thing to do. We recently reviewed all those projects and wrote up the results of those reviews in an application assessment report that is finished and has been submitted to DOE as a key milestone. We’re going to work hard to redact pieces of that report and make it publicly available.
But of the eleven applications that are shooting for a 50x performance metric over where we started in 2016, we’re seeing performance gains ranging from 3x to 200x. So we’re very confident that many of those applications are going to surpass 50x. And this isn’t easy because we’re not just riding a hardware curve, because a hardware curve won’t get us there between 2016 and 2023. There’s been a lot of exciting work going on there. In particular, I’ll call out three applications very briefly. One is a collection of applications in support of small nuclear reactor commercial licensing, design and licensing. Another is in support of fundamental materials science for materials in extreme conditions. And a third is cosmological simulations. These projects tend to have “exa” associated with them, but ExaSMR is the reactor project; ExaSky is the cosmological project; and EXAALT is the materials science project.
I can get into details of what they’ve been able to accomplish, but let me just say that those three projects are really seeing performance gains anywhere from 25x to 200x.
It’s really been achieved by understanding the hardware, programming on the metal, so to speak, whether it be with OpenMP or CUDA, in this case, which is a really good proxy for our exascale systems. And in every case as well revisiting algorithms and seeing that improvements can be made with algorithms.
The EXAALT project is a really good example of a best practice. We have spoken to our facilities and brought in some of their expert performance engineers as part of ECP. A performance engineer is somebody who really understands the hardware well. They could be trained in computer science or a domain science. In the EXAALT example, a performance engineer at NERSC at Lawrence Berkeley looked carefully at how the molecular potentials were being calculated, peeled that off into a kernel that we call a proxy app, did a lot of analysis, and with various programming models asked, “Is there a way to speed up this potential?” It was a small piece of code. And lo and behold a 40x speedup resulted from this detailed analysis.
And so then that kernel was imported back into the base code, and that really resulted in the EXAALT project now projecting a 200x performance improvement. This potential code had been around for years and not really been looked at with fresh eyes, with the point of view of how do I exploit the accelerators? So that’s just a great example of what ECP is about, which is getting fresh eyes on existing code, thinking about new programming models, thinking about new algorithms with an eye toward accelerators. So, that was a really fantastic example that you can’t plan, but by putting a milepost out there saying, “Look, we want at least 50x; let’s go for it,” and having people think about it often. You know, you actually can see that happen, and this is a great example.
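The arithmetic behind a target such as 50x can be sketched as a ratio of current to baseline performance. The interview does not define how ECP measures the metric, so this minimal Python sketch assumes a simple throughput ratio; the application names and rates are placeholders chosen for illustration and are not ECP data.

```python
from typing import Dict, Tuple

TARGET_SPEEDUP = 50.0  # the 50x goal quoted in the interview


def speedup(baseline_rate: float, current_rate: float) -> float:
    """Ratio of current to baseline throughput (work units per second)."""
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return current_rate / baseline_rate


def meets_target(factor: float, target: float = TARGET_SPEEDUP) -> bool:
    """True when the measured factor reaches or exceeds the target."""
    return factor >= target


# Placeholder measurements, NOT ECP data: (baseline rate, current rate).
measurements: Dict[str, Tuple[float, float]] = {
    "app_a": (2.0, 6.0),
    "app_b": (1.0, 50.0),
    "app_c": (0.5, 100.0),
}

report = {name: speedup(b, c) for name, (b, c) in measurements.items()}
on_track = sorted(name for name, f in report.items() if meets_target(f))
```

With these placeholder numbers the factors span 3x to 200x, and only the applications at or above 50x appear in `on_track`.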
Bernhardt: That’s awesome. Are the teams right now getting enough advance information about the forthcoming Aurora and Frontier systems to be effective in starting to set these applications up for exascale?
Kothe: They are. Now, we’re always hungry for more and more information, but we have gotten, I think, adequate information right now on the software and hardware roadmaps for these systems that we’re able to really press ahead and take more calculated bets on what we think will work. I think information will continue to flow out, and that’s important to interact with our Office of Science and NNSA facility projects that are actually doing the procurement and deployment. But relative to the last time we spoke, we’re getting a lot more information, a lot closer interaction with our vendor partners, and I think we understand what the known unknowns are and are going after and knocking them out.
Bernhardt: OK. So, on this point ECP, as we’ve mentioned many, many times, does not own the responsibility for standing up exascale systems, but yet even though ECP is all about the ecosystem, we have this functional technical area called Hardware and Integration, which kind of makes people think of systems. What is the Hardware and Integration group all about?
Kothe: Well, it really was, you know, as I’d noted last year, it kind of filled a key piece in terms of ECP, which is we can’t just build apps and the software stack. We have to integrate them and deploy them on the facilities; test them, really own that responsibility to make sure that what we’re building is robust, is production quality, works out of the box. And we can’t wait till the exascale system is here to do that.
A key aspect of the Hardware and Integration area, led by Terri Quinn at Livermore and Susan Coghlan at Argonne, is to support the vendor R&D—that’s our PathForward program; I think I’ve mentioned that in the past. But right now we’re really focused on continuous integration of our products and working with facilities to identify those performance engineers that we can fund and matrix onto our apps and software teams to really make sure that what we’re building is going to be performant and portable and robust. It’s hard—I really can’t overemphasize enough the importance of continuous integration, which is a key piece of Hardware and Integration.
This is, at a high level, an automated, ideally 24/7, daily/nightly deployment of our products onto the hardware (pre-exascale hardware now, but soon the early hardware and ultimately the exascale hardware) to test, test, test for robustness, for performance, the ST products that come out of the E4S release and all of our applications as well. And so that’s something that our review committees have noted, our design review committees have noted as crucial, and we’re really glad we started this over a year ago because it’s really starting to gain momentum. We think it’s really going to be kind of the bow on the gift, so to speak, with regard to really making sure that our stuff is going to be production quality and ready to go.
Bernhardt: So, in the grand scheme of things, if you think about all of HPC, even just if you think US HPC, there’s only a small percentage of folks who are now getting exposed to what it means to put systems together, what it means to get applications and software up and running on the exascale systems. It seems like a big component of moving forward to make this successful for all of DOE and for all of the US is going to be some training. Is ECP playing a role in that?
Kothe: We are, and I’ll note our DOE program sponsors in 2016 said, “You really need to have a training element as part of your project.” And, you know, that’s a no-brainer in retrospect, but the first thing we realized was that we have to train ourselves. We have some teams that really know this well and some teams that don’t, so we’ve done a lot of internal training. But lately, in the past year in particular, we’ve been able to really share a lot of our training more broadly. And I’ll call out Ashley Barker at Oak Ridge in really driving this forward. So, we do have regular training sessions, and most of them are put out on our website, exascaleproject.org.
They’re typically one- or two-hour webcasts, sometimes longer, and the topics are developed more organically, meaning we interact with folks external to ECP but also internal, obviously, and say, “What is it that you really need to know more about?” Or if some team has an aha moment, we want to share that. So, training is very, very important.
We really want to be more aggressive in getting those lessons learned and best practices out.
There are hundreds of people involved in ECP, as you know, but there’s a much larger community that we owe responsibility to in terms of getting this information out and sort of lowering the barrier to get onto not just exascale systems but accelerated node architectures in general, which are here to stay, from desktop to clusters to the largest systems in the world.
Bernhardt: Why don’t we wrap it up with your thoughts for the rest of this calendar year, and what can the US HPC community and the rest of the world expect to see from ECP?
Kothe: OK. You know, because of our performance baseline being approved, we have a very aggressive plan, lots of deliverables, lots of milestones, and they’re month to month, quarter to quarter. I’m not going to get into those details. But thinking about software, we’re going to continue to release E4S. So, there’s going to be a couple of new releases. Mike Heroux at Sandia and his team will release every six months updated capability assessment reports for the public to see in terms of where we are with our software.
On the Application side, now we’ve got a good feeling for how well our apps are doing with existing pre-exascale hardware with regard to accelerated nodes. The next step is to do more quantitative performance projections for the exascale systems, so we’re going to be working hard and working with our US vendors in understanding, now that we know about the hardware and software that’s coming, what can we project in 2023?
And on the Hardware and Integration side, really to take that continuous integration from, you know, we’ve tried a few products and apps and it seems to be working, to really testing our comprehensive apps and software stack. And I think on the Hardware and Integration side as well is even more intimate collaboration and reliance on our facilities to help us with performance engineering.
Of course, we’ll go through our annual reviews. We’ll likely have one this fall so that DOE and external folks can come in and say, “OK, you’ve been executing a year on your performance baseline. How are you doing? Are you staying on cost and schedule?” We certainly look forward to sharing at places like SC and a lot of workshops with the HPC community at large what we’re doing, and, you know, how we need help, frankly, and how we can help them.
Bernhardt: We’ve been talking with Doug Kothe, the director of the US Department of Energy’s Exascale Computing Project. For Let’s Talk Exascale, I’m Mike Bernhardt.