Delivering Impactful Science, Deploying Aurora, and Partnering with ECP at the ALCF

02/27/23

Exascale Computing Project · Episode 102: Delivering Impactful Science, Deploying Aurora, and Partnering with ECP at the ALCF

Katherine Riley is director of science at the Argonne Leadership Computing Facility. Credit: Argonne National Laboratory

Hello. This is Let’s Talk Exascale, the podcast from the Department of Energy’s Exascale Computing Project, or ECP. I’m your host, Scott Gibson.

Argonne National Laboratory and the Argonne Leadership Computing Facility, or ALCF, have been an essential part of ECP since the project’s planning stages. Along with the other DOE computing facilities, Argonne has participated and led in all components of ECP. Additionally, the first director of ECP was Paul Messina, an Argonne Distinguished Fellow and Argonne Associate.

The ALCF, a DOE Office of Science user facility, enables breakthroughs in science and engineering by providing supercomputing resources and expertise to the research community. Led by Director Michael Papka, the ALCF is supported by DOE’s Advanced Scientific Computing Research, or ASCR, program. The ALCF and its partner organization, the Oak Ridge Leadership Computing Facility, or OLCF, at Oak Ridge National Laboratory, operate leadership-class supercomputers that are orders of magnitude more powerful than the systems typically used in open science.

Argonne is in the process of deploying the Aurora exascale-class supercomputer. Aurora will support advanced machine learning and data science workloads alongside more traditional modeling and simulation campaigns.

In this episode, we’re joined by Katherine Riley, the ALCF’s director of science. She leads a team of computational science experts who work with facility users to maximize their use of ALCF computing resources. Katherine has been at the ALCF since 2007 and has the distinction of being one of the facility’s first hires.

I talked to Katherine on February 2nd. We’ll hear the story of how her fascination with designing a science application tool for high-performance computing sparked the beginning of her professional life. We discussed the following topics as well:

Who the ALCF serves and the ways users are granted access to the facility’s systems
The types of research conducted using ALCF systems
Some of the major activities currently taking place at the ALCF
A summary of the innovations that Aurora will offer
Progress in deploying Aurora
Argonne’s role in partnering with ECP
The impact ECP’s products are already having on the high-performance computing and research communities
What ECP will leave in its wake and the “then and now” perspective on how it has changed high-performance computing since the project began
And thoughts about maintaining the continuity of ECP’s work after the project ends

[Scott] As already mentioned, Katherine has been with the ALCF since its earliest days.

[Katherine] I am the director of science at the ALCF. And I really started in this whole field … my professional life has really been spent in the field of high-performance computing [HPC]. And this started because as I was working toward a degree in astrophysics and applied math, I got very distracted. I got, in the end, not a little distracted but fully diverted into understanding what you need to do to a science application to work on an HPC system. This was in the mid-to-late nineties, and I was working with a project that many ECP listeners might have heard of, which was the Flash project at the University of Chicago. And the reason I specifically named that is that was a really phenomenal experience to have, because it was one of the rare circumstances where a project had a substantial amount of funding not just for the science that the team needed to accomplish but also to design and architect and create a real science application tool that was well architected. And that’s really what pulled me off, sort of the process of designing basically good scientific software that performed well for the systems of that time was really fascinating and was and is a hard problem and, frankly, didn’t receive a whole bunch of attention. So that’s really what started me on this, and this, as I said, was quite a while ago. And really, for someone who is interested in that, the natural process is you end up at a national lab. It is not the only place you could end up but certainly a very natural place you can end up.

So I’ve been at the Argonne Leadership Computing Facility since it was started. I initially started working as what we call our catalyst, our science consultant. These are the people who are collaborating with people in various different fields to get their application ready to think about how they’re using these big systems. And then it’s just evolved over time to the director of science position. And here, the big one-liner really is that the director of science makes sure we deliver on our mission, and our mission, which we’ll get to, is really to deliver science on these systems. We don’t build these big systems for the fun of building these big systems. It’s to deliver science, impactful research that you could not do otherwise.

And so sometimes that is overseeing how we design these systems and what we’re putting on these systems, how we’re actually executing things in production, but it’s also thinking ahead into our future systems and how we get science ready to go on these—pretty tricky—supercomputers.

[Scott] When did the ALCF start, and what’s the story behind it?

[Katherine] That’s a fantastic question. It’s about 2006 that the ALCF was founded, and that was actually the first year that we joined sort of the larger pool of things that the Department of Energy was building at that time to serve the open-science community in terms of science. That start is actually a really interesting one, I think, because around 2004, Congress responded to Japan building the Earth Simulator, and we as a country really wanted to respond to that because it so outpaced anything we had on the floor at that time. That was so much larger and so much more capable than we’d seen. So they passed an Act of Congress to sort of say, ‘We need to be more competitive in the supercomputing fields’ and tasked ASCR [DOE’s Advanced Scientific Computing Research program] in the Office of Science with creating a program and systems that could actually deliver large-scale science, and that really grew into the LCFs [DOE’s leadership computing facilities] in 2006.

[Scott] The leadership computing facilities, ALCF and OLCF, serve the portion of the open-science community that needs resources larger and more powerful than they could get anywhere else to pursue the most compelling and impactful research projects.

[Katherine] This is very connected to what I just mentioned—because the ALCF and the OLCF [Oak Ridge Leadership Computing Facility], I’ll mention—this is referring generally to the LCFs. They are national user facilities, and that has a very specific meaning. When these two facilities were created, the premise was that they have to serve open science—so this is the stuff that will be published; it’s not going to be kept behind a fence, for example—and that anyone can compete. The entire research community who thinks that they have a problem that needs substantially larger resources than they could get any other place, can compete for time, regardless of funding source, regardless of location. It’s really open for anyone who might really have impactful work.

So as I mentioned, we’re also looking for those projects that are not only super impactful but where they could not do that work without maybe ten to a hundred times the resource they might get at another compute facility. The primary way that people get access to deliver on that mission is through the INCITE program. This is a program that Oak Ridge and Argonne jointly manage. I happen to be the program manager for that. It’s a yearly call, for as I said, the most competitive, most compelling, most impactful projects that need the scale of resources that we build. But that’s really how we deliver on that mission because it’s agnostic to funding sources; it’s agnostic to field. It’s not tied to DOE mission at all. And they get about 60 percent of the time at both facilities. So that’s the primary way. There’s other mechanisms that people can get access.

ASCR itself runs an allocation program [the ASCR Leadership Computing Challenge, or ALCC] using 30 percent of the time on the systems, and that is a little bit more focused to DOE mission. That doesn’t necessarily mean that you need DOE funding, but it’s focused to priorities for the Department of Energy at that particular time. But then, given those two programs that everybody’s competing for, the thing that many people use to get started is the facilities have a discretionary program. And this is where you can apply, get really relatively rapid turnaround in getting an award. It tends to be small, but it allows you a chance to get onto the system and really get your feet wet. INCITE and ALCC are competitive, and you have to be ready to use the system to really be successful in those competitions, and so that’s what the discretionary program is for. I’ll also point out INCITE is prepping for its 2024 call. We are going to announce that, fully all the details, in April, but information will continue to be uploaded as we go forward in the next couple of months. And that will be the call where we’ll be awarding Aurora, and obviously, Frontier was awarded for 2023 as well.

[Scott] Katherine described the types of research conducted using ALCF systems.

[Katherine] So this is one of my favorite questions, in part because I have what’s a not very helpful, snarky answer, which is everything in terms of research—almost all areas of science and engineering have used these resources at some point. The reason that’s the really broad answer is that computing is fundamental to doing science today.

Does it mean that every single science question, every single line of research needs supercomputers? No. And everybody needs them on different scales, but computing, at this point, is really a pillar of how most science is executed. And there are types of questions that you really can’t advance without big resources, and on top of that, even in some areas, just in terms of pushing ahead and seeing what the future might be in a particular field, that’s what large supercomputers can really bring. With an ALCF system you’re able to maybe be doing research that your field might not have otherwise been doing until like 10 years from now because the average compute capabilities wouldn’t have been there.

But as I mentioned, we see broad areas. You see projects that are studying cosmology, the structure of the universe, how we got here—perhaps not why we’re here—but how we got here. It’s a very compelling problem because we have a lot of observations. So images of the sky, and we can couple those images of the sky with our simulations of how the universe might have evolved and use those to test hypotheses.

From that really big scale, it even goes to things like energy storage. We all want better batteries for our phone. We want better batteries for electric cars. But really identifying what materials work best in a battery and how can we reduce the impact of imperfections in batteries. We’re looking for materials for alternative energy. Can we have better solar energy materials?

Perhaps even more approachable for people: can we find new cancer treatments? How can we make personalized medicine so targeted personalized treatments for your specific tumor? How can we make that happen? How can we transition that from like a big question and exploration into something that a physician would be able to use on a daily basis to improve treatment for cancer?

Or understanding climate change. Our understanding of climate change and global climate is entirely because of supercomputers. And that’s really where the entire field has done its exploration and tried to understand how the climate works. But we can do everything from understanding the risk of climate change. So can we better plan for the risks in communities based on some of the changes to the climate that will happen? Can we understand what the climate will look like over time? Can we execute changes or not? You know, things like that.

But it’s a broad area. I’m only touching on a few, and I could keep going. But I think these are some areas that I think are tractable for a lot of people.

There’s other things we do that are even looking at the essence of matter. What is the structure of matter? What is the fundamental physics of matter? Sometimes those are harder for people to get their brains around, but it’s important and necessary to even inform things like the other sciences we’re studying like cancer treatment and energy storage. So it’s a broad, broad area.

[Scott] I asked Katherine if she’d highlight some of the major activities that are taking place right now at the ALCF.

[Katherine] Yes! I think perhaps the most exciting one is we are installing compute blades. The compute blades for Aurora are going in now, and so that process of build is very exciting and such a relief to see. And related to that is Aurora readiness, and fundamentally, that is, in my mind the heart of all of the reason we’re doing any of this. We’re getting those science applications ready to actually do work on day one, and I’ll sort of defer some points to that for, I think, a little bit later in some questions. But I think that is the most important activity that we do in the end once we’ve built these systems.

The other exciting part to that is not only are we working on that readiness, [but] we’ve got early science and ECP applications running on early test hardware, and that’s extremely valuable because at that point, we’re not only testing those applications, [but] we’re [also] testing the software stack—we’re testing everything and we’re really working on making it the most robust it can be even before all the compute blades are in.

One of the other things that’s always going on is our training and basically preparation of other projects for how they use the ALCF. Some of this is straightforward. There’s a lot of online materials we develop and hackathons and workshops with students and postdocs who come in. We have currently opened the call for the Argonne Training Program in Extreme-Scale Computing [ATPESC 2023]. This is a firehose of a summer school. It’s two weeks really focused on taking someone and giving them the best practices in high-performance computing at that time. So we started this a very long time ago. There was a lot of overlap with ECP, but fundamentally this is our way of saying there’s so much information here. Many people who are coming from, say, a domain science background might not have all of that exposure because it’s a really multidisciplinary skill.

So if you are someone who is thinking about using these large systems and you sort of want at least like, you know, the outline—these are the things to think about; these are the people you might even want to meet because the instructors are all experts in these fields. These are the people leading development in some of the hardware and software that you’d be using, so it’s a fantastic opportunity.

And finally, the last thing that we are doing currently at the ALCF is we’re in production. So we’ve got Polaris, which is a bridge machine really into Aurora running in full production. Our INCITE 2023 year just started, so that’s actually pretty exciting. And generally, we’ve all these activities that we’re doing on our production systems and getting science out right now, not just building something new.

[Scott] Katherine shared the elevator pitch for the innovations that Aurora will provide.

[Katherine] I take two spins on this elevator pitch, depending on who I’m talking to. I think many people in the ECP space who might be listening to this perhaps don’t need the ‘why is the science exciting that it can do,’ but I have that one. I think many people don’t understand that, can we actually tackle the level of complexity in science today, because science has gotten more complicated. We collect so much more data and ask very complicated questions. And Aurora’s really focused on how problems with that combination of challenges can really be done.

Can we build a system that can design more efficient renewable energy? Can we plan cities around effectively managing risk during climate change? Can we move cancer treatment that is personalized into an everyday thing and not a special one-off research experiment? Can we even understand blood flow in the human brain or the structure in the human brain so that we can understand disease and aging?

Those types of questions are so complex you can’t solve them without Aurora. You can’t solve them without large-scale computing that’s really focused on taking huge amounts of data and huge amounts of simulation capability and coupling those. You might be coupling those with something like a learning technology or not, but Aurora is really designed around being able to have all three of those levels of complexity in a science problem, running together really effectively and really sort of bringing many of these science questions into the next generation, leading that way. That’s my elevator pitch.

[Scott] She shared with us about progress in deploying Aurora.

[Katherine] As I mentioned a little bit, we are installing compute blades. So we have all of the hardware installed for Aurora and are in the process of putting in the compute blades. So that in and of itself is, I think, a really exciting place to be. In the machine room you see all of the racks that will be Aurora, and they’re fully powered up. We’re just getting the compute components installed as we go.

The other thing in terms of progress, as I mentioned, is we have test hardware. So we have early-access hardware that people are working on, that’s the ALCF staff working on early science projects and working on exascale computing applications and software. As I already mentioned, that is really crucially important. We’re coming. It is real.

[Scott] Along with ECP’s other collaborators, Argonne has been deeply involved with the project from the beginning.

[Katherine] It’s a fantastic question, and you are right. From the very onset of ECP and the initial planning, we had substantial participation, including Paul Messina, who was the first head of the project who was from Argonne. And we have had many people participating in all components and leading in all components of ECP, just, frankly, like all of the other DOE computing facilities. And there’s a reason for that.

I would say that our relationship, ALCF’s relationship, with ECP is as it is with OLCF, very tight, very coupled, and there’s reasons for that. We have at the LCFs this history of deploying these large systems and getting software and applications ready, and we’ve done this as a family. We have a tight family. What I think has been sort of fascinating with ECP is that compelled us in very good ways to expand an umbrella and really bring a huge amount of people and planning that ECP had in terms of thinking about especially the software and application space, and really challenging us to figure out mechanisms to make all of that work. And they have to be tight to make those work. So what I mean by that is, you can do general work maybe on an application or a piece of software if you just know that, hey, this big system’s coming—it will be accelerated. But it’s hard to really target effective performant work if you don’t have more details. And so really finding that way that we are tightly coupled with the applications and the software that we have these lines of communication and collaboration, and really growing the scale of what we’re used to doing, has been really fascinating and fantastic. And I think, frankly, really beneficial to both the facility side, the ALCF side, but also, I hope beneficial for the larger community out there as they get a little bit of a closer line of sight on sort of the challenges sometimes of deploying these systems.

[Scott] What impact does Katherine believe ECP products are already having on the high-performance computing community?

[Katherine] So I have a very high-level response to this, which frames anything else we might dive into in more detail. There was something really fantastic that comes from this large funding of ECP over its lifetime, and that was the level of focus in conversation around the software technologies and around the applications. And this recognition that—for example, on the application side—the conversation had to be not just about the science, which is fundamental, but also the quality of the tools that we’re using to get that science out. And while the HPC field, the scientific computing field, had absolutely been talking about software engineering ideas, what’s relevant to science applications, what’s not, good practices. We’d been talking about these beforehand, but ECP brought many of these conversations of the importance of these software tools to the front, frankly, just because of the size of the effort. It said that we really have to prioritize the software environment. We need to make the software environment more friendly because there was no software environment before—it was kind of like whatever you got on any system. We need to talk about how we can save our investment in these applications, which are the experiment. That is the science. That’s how you do the scientific experiment.

So we see these conversations happening in a way that they did not happen before. I would not claim that this means that our worlds have just completely pivoted and it’s all about, say, software engineering on the application space, but the value of that, I think, is a lot more obvious to people than it has been. And as I said, it’s not just software engineering for software engineering; it is not just a software environment for the sake of it. All of these things are crucial to improving sort of the quantity and the quality of the science that you can get out of these huge tools that we build. And, as I said, that’s the mission. So I think that’s actually one of the most compelling, really real-world, right-in-the-middle things that ECP has brought out in our community and pulled these conversations about reliable software and predictable software into the forefront. And I’m really grateful for that because this is connected to why I got started in this. I got started in this because I cared about building science apps that mattered and that would sustain and would survive, and we’re seeing that conversation really centered now.

[Scott] And more on the effects of ECP and what it may leave in its wake …

[Katherine] I suppose if you think about this, what I was mentioning before is very internal, internal to the larger scientific computing community in HPC. We want to preserve these immense person years that have gone into building the software technologies that we’re trying to use in our environment in these applications. So that’s important. And thankfully, we’re talking about how we can do that, and hopefully, we don’t blow it, frankly. But there’s another external impact I have found interesting and I have noticed over the past couple of years. It really stems from the fact that ECP, the exascale lift from DOE, was so large because it had to be large. So there was a huge investment, and when you have a huge investment, you really want to make that investment understood. So there were so many more people involved in this project and so much more attention on it. I see this as actually incredibly valuable because the idea that you get more interest perhaps because you’re getting more coverage of the idea of what these big systems can bring; you’re bringing this to the public more.

Without a doubt, every time we’ve deployed a big system, we would have messaging, and we would have messaging about cool science that was getting done, but I think this has been a higher level of attention, a higher [level] of communication with the larger communities outside of HPC and the scientific computing community. And that is so valuable. I don’t think it can be [overstated], the value of having someone excited by that message of, ‘Look at these huge, cool tools we’re building and the science we’re doing.’ That can be exciting for kids. That can be exciting for adults as well. But it also helps improve this narrative of ‘what are we using these systems for?’ ‘Why are we building such a big system?’ ‘What science can we really get out of this?’ ‘Can you actually believe the science?’ And the answer is yes, you can. It has real impact. It has real benefit to using these, and we’ve seen more of that attention.

And so while that’s very external and it’s very optimistic in some respects, I see this as another benefit. I care profoundly about moving the state of scientific research forward, and frankly, you get that done best when you have the population around you supporting that. So it’s another external bonus, I think, coming out of the scale of what ECP was able to do.

[Scott] We also considered ECP through the “then and now” lens, looking at how ECP has changed the computing landscape since it began more than 6 years ago.

[Katherine] I love this question, and I think perhaps one of the most profound ‘then and now’s’ is the most concrete one that people see, could see figuratively anyway, is in software technologies, in all of the tools and libraries that these science applications need to use in order to be effective. So this concept of creating a software environment that could be predictable across big systems that is broader than the software environment people ran into before, has really taken hold, and it’s real at this point. There’s mechanisms where we are able to go out and get these SDKs [software development kits] for the system that might be the same one that NERSC [National Energy Research Scientific Computing Center] is running and make life basically a lot more approachable and manageable for people trying to use these systems. This was, I know, a very big lift for the Software Technology side and really viewed as crucial to making these big systems more approachable and more usable and increasing the efficiency even of the AD [ECP Application Development] projects.

So I think that’s really one of these exciting then and now’s that we not only have a broader scope of tools than we did before ECP, but we also have a way of deploying them in a way that could have some future if we’re able to stick with it as a larger community. Programming environments need to be robust, and ECP’s really sort of captured that and moved us into a slightly more modern place.

[Scott] Katherine provided her opinion concerning how the continuity of ECP’s work could be maintained after the project has ended.

[Katherine] The concept here is that the products that come out of ECP have to have a future, and that means that they have to be valued. So not just a software technology being valued for, say, the research it’s doing but also value in moving that software product into production, moving it to like a stable production place. And really, what has to be done there is funding that effort and funding the career paths for the people who might execute that, and that is tricky when people are used to thinking in a research way versus a production way. It’s also actually a big outcome out of ECP and another then and now. It’s just really the nature of that conversation has, I think, taken better root because of the demands of, say, taking some of these software technologies and moving them into production and saying it matters that they’re supported, for example. And in that same vein, you need to value that in the science space too. You need to say that it’s not just important to fund the outcome of papers that you’re going to publish, which are important, but that codifying the science that you’re doing in these applications is crucial work. And you need to have people who are focused on doing that and maintaining the provenance of the data and the engineering within the application.

So the short answer to both of these things is that you need funding, and you need funding that specifically values the type of career paths that would make either of these things happen. It’s kind of a slightly different pool but not entirely different on the software technology versus the applications technology, and you have to have a place where that falls. So that requires a lot of lift, meaning the place where those might fall in terms of funding needs to be identified. You need to have people onboard with doing that, the people who control the money. But that’s the only way you save this investment and move it forward.

[Scott] Thanks so much to Katherine Riley, director of science at the Argonne Leadership Computing Facility, for joining us on Let’s Talk Exascale.

And thank you for listening. Visit exascaleproject.org. Subscribe to ECP’s YouTube channel—our handle is Exascale Computing Project. Additionally, follow ECP on Twitter @exascaleproject.

The Exascale Computing Project is a US Department of Energy multi-lab collaboration to develop a capable and enduring exascale ecosystem for the nation.

Scott Gibson is a communications professional who has been creating content about high-performance computing for over a decade.

Topics: ALCF, Leadership Computing Facilities