Update with Director Doug Kothe—July 2019
Exascale Computing Project (ECP) Director Doug Kothe sat down with Mike Bernhardt, ECP communications manager, this month to talk about a variety of topics. Covered in the discussion were the aspects of ensuring that a capable exascale computing ecosystem will come to fruition in conjunction with the arrival of the nation’s first exascale systems, objectively assessing whether the project’s efforts are on track, correcting course and instilling confidence through Red Team reviews, addressing the challenges posed by hardware accelerators, consolidating projects for greater technical synergy, acknowledging behind-the-scenes leadership, tracking costs and schedule performance, and reflecting on ECP’s enduring legacy.
The following is a transcript of the interview.
Bernhardt: Doug, now that plans for the first exascale systems have been formally and officially announced—Aurora at Argonne and Frontier at Oak Ridge—and we know that the El Capitan system at Lawrence Livermore is right around the corner.
Kothe: Right. Right.
Bernhardt: The announcement of the nation’s first exascale systems is such a huge milestone for this country and for the Department of Energy. What do we do to ensure that we will have a capable exascale ecosystem, the software stack, and the exascale-ready applications when these systems are actually standing up?
Kothe: A good question. As we’ve talked in the past, Mike, we—the ECP team and the staff—knew enough in terms of what we thought was coming to where we weren’t shooting in the dark, so to speak, with regard to building our software stack and our apps. But now with these announcements coming out, we have a more refined, focused target and for the most part, it’s not surprising; it’s a matter of expectations, and I feel like our preparations have us on a good path.
I do believe that both architectures, as we know of them—Aurora at Argonne and Frontier at Oak Ridge—are very exciting, have tremendous upsides, and are consistent with our overall preparations, meaning the nodes feature what we call accelerators, which is hardware acceleration of certain floating-point operations; so, it allows us to exploit those accelerators for our good. Overall, we’re really excited. We are three years in and we have very specific targets now. The announced systems really met with our expectations, and from what we can tell in terms of the speeds and feeds and the overall design, they really do look very solid, very exciting, and I’m very confident we’re going to be able to deliver on those systems.
Bernhardt: And as you mentioned, we’re into our third year with the project. If you think back to three years ago and where you thought the project would be at this point in time, are we on track? Have there been any surprises along the way that changed the schedule you had envisioned?
Kothe: First of all, we are on track. There are always surprises in R&D and many of them are good. Some are, I would say, setbacks or things you have to really prepare for. Relative to three years ago, we have teams now that have really figured out how to fit well into our project structure. We have defined very specific metrics that are quantitative but also directly reflect the overall goals and objectives and, in particular, the science and engineering, and national security objectives.
What we are seeing over the past year and now is that we have a really good sense of how to track performance. So when we say that, it’s not really just a subjective answer. We really have a lot of objective evidence for being on track, and being a formal project with very specific metrics helps us to make collective decisions about what matters and what doesn’t. It’s all about achieving our objectives, which we’ve mapped into these specific metrics.
So, yeah, we are definitely on track. That doesn’t mean that it’s not going to be a challenging, tough road ahead. I think we understand what the risks are, and certainly with Aurora and Frontier being announced, many of the unknown unknowns become known unknowns and so I think we’re even better prepared moving forward.
Bernhardt: You’ve mentioned that it’s not just simply a subjective view of whether or not we’re on track. As I understand it, the team just went through something called a Red Team Review. Could you explain for our followers what a Red Team Review is and what its significance and impact are for the project?
Kothe: Yeah. And so Red Team, many years ago for me, 30-plus years ago, in the DOE was a new term. One could Google it and kind of see the historical aspects. I think it’s a term that’s readily used in business, in large organizations, in the military. For us it has a very specific meaning that at a high level is not too dissimilar from other organizations and agencies, and that is, we bring in a team that is there specifically to poke, to prod, to find any sort of flaw or hole in our plan, and the team is there to help us. They’re not there to be punitive, but they’re also not friends and family. They are there to help us and specifically find problems in our plan, in our objectives.
So, typically we go through at least one formal review a year by our Department of Energy sponsors, and so what we do with the Red Team is two to three months before that formal review, we have a Red Team Review. You could view it as an internal review where we essentially mimic the formal DOE review in terms of what we’re going to do, in terms of presentations, and breakout sessions, et cetera, but with an external, independent, separate team with no conflict of interest with ECP, meaning folks working on ECP aren’t a part of this. So it is very formal and independent.
These reviews at times are painful, because it requires a lot of work and a lot of heads-down focus, but in the end, they always help us in terms of finding areas where we need to do better and make necessary course corrections. In the end, at post review, I think we all sit back and go, boy, that was painful, but we’re glad we went through it because we’re better off now; we have a better plan; we have corrective actions in place.
We did recently go through a Red Team Review and design review. Our next big review with DOE is this December, and so we intend to not wait for the last minute to really be ready to show DOE that we’re on a good path or on track, as you say.
Bernhardt: Got it. So, the Red Team Reviews go a long way in helping instill confidence with you, the leadership team, and the program office, your sponsors, that things are, in fact, on track, that you’ve identified the risk, you’re taking all of the mitigation steps, et cetera?
Kothe: They do. I would say that the outcomes of a Red Team Review are typically, hey, we recommend you consider doing this, this, and this; or, fix this, this, and this. So, we call those formal recommendations. Typically we give ourselves two or three months to respond to those recommendations, to make those fixes. Obviously, if there is a systemic problem found by the Red Team that takes longer to fix, that’s a problem for us. In the end, our Red Team Reviews have been very successful, suggesting kind of minor tweaks, sort of realizing or relaying to us that we’re, for the most part, on track. And so right now we are actually making a few course corrections and a few changes in our plans as we prepare for our next DOE review; but I do feel like they’re all necessary, needed, and probably the best news is they’re consistent with our expectations as to where we needed to work. So, we’ve not heard things that were sort of orthogonal to our own internal assessment. Having independent assessments is good, and it’s even better when the results are consistent with our own view of where we still need to put some work in.
Bernhardt: Awesome. So Doug, I’d like to dive into one discussion just quickly here, and it’s in reference to something we’ve heard recently at a number of conferences. Would it be accurate to say that accelerated computing and the implementation of GPUs is going to play a key role in delivering the necessary performance of our DOE exascale systems, and if that is the case, what’s ECP doing to prepare the community for this?
Kothe: Good question. It is accurate. I think we’re going to see more and more of this, and maybe it’s disingenuous to even call them GPUs, because they’re very purpose-fit, hardware accelerators for specific floating-point operations, or specific operations that may not be floating point. The way I like to think about it is, in ECP—and this is an aspect of co-design—we’re working on hardware-driven algorithm design, but we’re also working on algorithm-driven hardware design; and so there is really a give and take there. Based on our experience with Summit and Sierra, Summit at Oak Ridge, Sierra at Lawrence Livermore, and the coming Perlmutter system, and certainly Titan at Oak Ridge, we have seen, and will continue to see, hardware acceleration on a node. That doesn’t mean it’s easy. The point is we’ve been through this. I think we know what to expect. It is a tremendous potential, this sort of design. So there’s a lot of concurrency, local concurrency, that we can exploit with an accelerator.
I can now embody my simulation with richer physics, with broader, deeper physical phenomena, with higher-confidence results, because I can afford now to offload some additional physics onto the hardware accelerators. Or, if the current algorithms I have in place don’t adapt well to the accelerator, I’ve got to redesign and rethink my algorithms. And so, we’ve been doing that, and the recent announcements of Aurora and Frontier basically tell us that we’re on a good path.
I think with regard to acceleration moving into the future, my own opinion is we’ll continue to see this post exascale and it could be even more purpose fit, more along the line of ASICs that are very specific to current algorithms. And again, I think what we’re doing in ECP now is really hardware-driven algorithm design, meaning we know accelerators are here. We are figuring out how to best exploit them. In many cases it’s rethinking of our algorithms. I think the hardest part is to figure out how do I change my data structures, how do I rework my algorithms, and so in some cases it’s a wholesale restructure of an application or a software technology. In some cases it’s very surgical for the compute-intensive portions.
In the end, the hard part is rethinking your algorithms; implementing those reworked algorithms is often much easier than the rethinking itself. So whether we’re looking at accelerators from NVIDIA, or AMD, or Intel, the programming models won’t be as dissimilar as one might think. The real challenge is rethinking your algorithms and we’ve been doing that since the start of ECP. So, not that we’re not going to have some challenges and hurdles, but I do think that these recent announcements have pretty much met with where we thought things were going to go, and so in that sense I do believe we really are on track relative to our objectives.
Bernhardt: Recent comments from some conferences indicate that folks in the application development community think that this (more widespread use of accelerators) is a pretty big, heavy lift. It’s a learning curve that they’re going to have to go through with the growing use of accelerators. Is that the proper way to frame it, do you think?
Kothe: It certainly isn’t easy, and I don’t want to downplay the fact that this can be difficult and challenging. I think it requires conceptual rethinking of algorithms. Now, in ECP we have a whole spectrum of application software maturity relative to the accelerators. We have many applications and software technology products that have already reworked their algorithm design and are achieving fantastic performance on, say, Summit.
And so we would anticipate, I think with fairly low risk, that moving that implementation from Summit to Aurora or Frontier may not be seamless, but won’t be a heavy lift, so to speak.
We have other applications and software technology products that are not quite there yet in terms of rethinking and redesigning their algorithms, and so these comments certainly do apply to some aspect of ECP.
In terms of, say, our upcoming DOE review, one key aspect of this review is determining if we are prepared to really help those teams move along more quickly, more with a sense of urgency. Can we take the successful experiences of some applications and apply those lessons learned and best practices to others, and I think we can.
In our three focus areas—software, applications, and hardware and integration—we have a number of projects that have more or less a direct line of sight to essentially figuring out the techniques for exploiting those hardware accelerations. So I feel that in terms of the way we’re scoped, we have the efforts in place to help bring along everybody and, you know, the fact that we’re a large project with lots of teams allows us to cross-fertilize and share experiences and lessons learned, and that helps reduce risk with regard to moving things along.
So, I think when you’re first exposed to these accelerators, you have to sit back and go, okay, wow; this is a tremendous opportunity, but I’ve also got to rethink how I’ve been doing things. In many cases it’s back to the future. Some of the algorithms designed for Cray vector machines in the ’70s and ’80s are now apropos and work well on accelerated systems such as Summit. We have direct evidence that this is not necessarily reinventing or inventing from whole cloth. It might be sort of accessing an algorithm that was used successfully in the past and is again useful now.
Bernhardt: Just another tool in the application developer’s bag of tricks, huh?
Kothe: That’s right. Indeed it is, and I think the teams realize that they’re not going to succeed, or succeed on the path that we have in front of us, by closing their door and trying to do all of this on their own. And so, we really are managing and tracking and forcing, frankly, integration of efforts, especially of the software stack: key products that applications need to not just be aware of, but actually use. And so in many cases the applications are, to some extent, passing the risk or the challenge of exploiting on-node accelerators to the software technology products, and that makes a lot of sense. And in many cases as well, they’re not doing that, for a good reason. So, this is one of the advantages of having a large project where we can plug pieces together to make basically the whole greater than the sum of the parts.
Bernhardt: Got it. Yeah. It makes a lot of sense. So, within ECP, some efforts that I’ve noticed have been expanding and some have been consolidating. Maybe you could give us a few of the current stats to frame where we are today for the listeners, more like ECP by the numbers.
Kothe: Okay. So, we always are taking a hard look at how we’re organized and trying to see if there is a simpler way to put our organization together in terms of managing. Really, it’s not about the boxology so much, because the challenges are always managing at the interfaces, but we have worked hard to consolidate and simplify where possible and where it makes sense. So right now in ECP we have 81 R&D projects and that’s come down from about 100. So, where we found areas where we could consolidate, we did that, and it wasn’t oil and water. We didn’t force it just for the sake of trying to decrease the number; but in every case that we’ve done this, it has helped.
So, let me give an example: In software technology, led by Mike Heroux at Sandia and Jonathan Carter at Berkeley, they recognized that there were several smaller projects, say, looking at I/O, and by putting them together there were synergies that we could take advantage of, where they could adopt and use each other’s approaches, and we could move toward maybe one API for a particular I/O instance. And so the consolidation wasn’t just, hey, let’s reduce the number of projects because this is too hard to manage. It was really driven by what makes technical sense.
And so right now I think we’re in really good shape to move into what we call our performance baseline period, which will be this fall and early next year. Our current structure is 81 teams, still with over 1,000 researchers across the HPC and computational science community in industry, academia, and DOE labs, and I think this restructuring has us in a really good position for the stretch run as we see Aurora and Frontier delivered.
Bernhardt: You mentioned a few of the folks there, and that leads into what I wanted to get to next. ECP’s success, in fact, the Nation’s success with exascale and bringing it to life depends on a very, very large group of people and it’s more than just the ECP. You know, the collaborating agencies, the collaborating universities, the technical vendor community that ultimately will stand up the systems. I know it’s difficult to single out just a few individuals when there are so many that are making these important contributions, but perhaps you could take a few minutes to acknowledge at least some of the folks, maybe from the leadership team level and so forth that often work behind the scenes a fair amount and don’t get the recognition they deserve.
Kothe: Yeah. That’s a good point. Let me start first with our Department of Energy sponsors. Barb Helland in the Advanced Scientific Computing Research (ASCR) office in the Office of Science (SC), Thuc Hoang in the Advanced Simulation and Computing (ASC) Program in the National Nuclear Security Administration (NNSA), and Dan Hogue in the Oak Ridge National Lab site office here, who’s our Federal Project Director. They have been fantastic in their support, and that doesn’t always mean it’s a thumbs up, team, you guys are doing great. It could mean you guys need to work on this, and so they give us a good, honest, objective assessment and they’re always there. We speak to them daily, weekly, all of the time. So our sponsors have been fantastic in making sure we’re on the right course and giving us the support that we need.
Our leadership team, again, consists of about 30 or so—I think 32 by last count—DOE staff across six labs; and we’ve been really fortunate to have leaders in the community with a proven track record and the trust and respect of their colleagues. We’ve been together now as a team for most of the time ECP has been in existence, meaning there hasn’t been a lot of turnover—not that that’s bad—but people are all in; they’re committed; they have the passion and the energy.
Many people, to quote some of our leaders, feel like this is the culmination of their careers, feeling like their whole career was built for this, and so, you know, that really helps during, say, tough times where you’re trying to prepare for a review when you realize that this is something that I feel like my whole career was built around. We have many people who feel that way.
To single out some names across our three focus areas: software technology, led by Mike Heroux at Sandia National Lab and Jonathan Carter at Lawrence Berkeley, really is up and running on all cylinders. Mike and Jonathan have made a lot of very productive and useful changes in how things are running and organized. They work hard to make sure our software products have a line of sight up into software development kits and are released and deployed on the facilities.
Terri Quinn at Lawrence Livermore and Susan Coghlan at Argonne National Lab run our hardware and integration focus area. Both Terri and Susan (I don’t know if people appreciate this) are dual hatted in that Terri is really on point for a large part of the El Capitan procurement and deployment at Lawrence Livermore, and Susan for Aurora. And so, we’re really fortunate to have two leaders in the field for procuring, deploying, and operating HPC systems who are also leading our staff in terms of what it takes to make sure that products and applications are production quality and get deployed and used on these systems. So, their feet are sort of on both sides of the fence there.
And then in the applications area, Andrew Siegel at Argonne and Eric Draeger at Livermore lead that area, and they’ve really taken our applications from what looked like, say, three years ago some interesting may-work sort of R&D efforts to the applications now that have very specific challenge problems.
We’re assessing them annually. They have very specific metrics and they’re really, for the most part, all on track. So, these folks have been fantastic in leading these efforts. And I said there were over 30 leaders, so Andrew and Eric, for example, have a team of five or six that each oversee over half a dozen of these R&D projects. But the 81 R&D projects all have principal investigators leading these projects who are, for the most part, senior people with career-track records. I try to, and I think our leadership team does as well, call out these PIs, because that’s really where the work is getting done; and we’re lucky to have these PIs, who are all in, just like the leadership team, to make sure we succeed.
Bernhardt: And a lot of the behind-the-scenes, heavy lifting that takes place is with the project office, which happens to be housed at Oak Ridge.
Kothe: Yes, and I’m glad you brought that up. They are. This really isn’t a customer-client relationship between the PhD scientists and the project office. They really are our peers. The PhD scientists responsible for leading the technical areas have learned a lot from the project office about what good project management looks like; what is our responsibility; how do we need to track costs and schedule performance. It’s a tremendous responsibility with the budget we have. And so the project office is in itself a small organization that’s made up of people who care about risk, project controls, budget, procurement. All of these things are day-to-day sort of contact sports, so to speak, with regard to our technical leaders. So, I sit personally at Oak Ridge National Lab, and I think this lab in particular, as many other labs, has a very good track record in project management and leading and executing on large projects. So, we’re fortunate to have a project office staffed almost entirely here at Oak Ridge that has been through the trenches in running and being part of large projects, and knows what to expect. This is a unique gig in ECP, but I think we’ve figured out how to really tailor this to formal project management, sort of in and around doing more exploratory high-risk research.
Bernhardt: Great. This has been a good update, Doug. I’d like to wrap up with one topic that I know is near and dear to you. Talk a little bit about, if you could, the enduring legacy of the Exascale Computing Project.
Kothe: Very good point. I wouldn’t be here, and I don’t think the leadership team or the staff would be here if we didn’t think that there was going to be an enduring legacy. The beauty of a seven-year project is it allows you to have a sense of urgency, and a sprint, and you pay attention to metrics, and you really make sure you can dot I’s and cross T’s, but a project would fail if the leave-behind wasn’t useful. So, let me take you through applications, for example.
Enduring legacy translates to having dozens of application technologies that will be used to tackle some of the toughest problems in DOE and the nation, and so the applications are now going to be positioned to address their challenge problems and in many cases help solve them or be a part of the solution. So, an enduring legacy for us is the applications now are going to be ready at exascale to tackle currently intractable problems and when I say tackle, many, many program offices in DOE—by last count there were ten of them—and other federal agencies are going to essentially use these as their science and engineering tools, so that’s an important legacy. In software technology I think what we’re seeing with the leadership of Mike and Jonathan is the genesis of a probably multi-decade software stack that’s going to be used and deployed on many HPC systems, well beyond Aurora, Frontier, and El Capitan. And I think that by paying attention to what it takes to containerize and package things up, and make them production quality, and make them basically adhering to application and hardware requirements, we’re going to see a software stack that I think DOE will continue to support, maintain, and require on HPC systems in the future. Time will tell post ECP. But we wouldn’t be involved in the ECP if we didn’t expect and, frankly, require our efforts to really have a line of sight well beyond 2023.
Bernhardt: Great. That’s all I have. Is there anything else that you’d like to throw out there for the community at this point in time?
Kothe: Just that we appreciate the support, the engagement of the HPC, R&D, and computational science community. I’m not going to claim that we always have all of the answers, so we encourage the community to feel free to touch base with us, myself personally, or the leadership team. There are ways that you can collaborate and work with us. There are certainly ways that you can engage and help us move forward. We’re really lucky to be a part of this big project and always happy to hear about new suggestions and new possibilities from the community at large.