Discussing NERSC’s Unique Global Role and Close Collaboration with ECP

Exascale Computing Project · Episode 101: Discussing NERSC’s Unique Global Role and Close Collaboration with ECP

By Scott Gibson

Richard Gerber of NERSC

Richard Gerber is the senior science advisor and High-Performance Computing department head at the National Energy Research Scientific Computing Center, or NERSC. Credit: Lawrence Berkeley National Laboratory

The National Energy Research Scientific Computing Center, or NERSC, supports all Department of Energy Office of Science–funded research that needs large-scale supercomputers and big data systems. And NERSC has been—and continues to be—an integral part of DOE’s Exascale Computing Project since ECP began several years ago.

Richard Gerber, NERSC’s senior science advisor and HPC department head, joins us in this episode of Let’s Talk Exascale. Hi, I’m your host, Scott Gibson.

Richard’s role at NERSC is to ensure that the center remains keenly responsive to the needs of the scientific researchers it serves and continues to tell the facility’s science stories.

I talked to Richard on 21 December 2022 to gain insights about NERSC of potential interest to many people involved in high-performance computing. We covered the expanse of its impact on diverse areas of scientific research across the United States and the world. Additionally, we delved into the collaborative NERSC–ECP connections.

NERSC, the Argonne Leadership Computing Facility, and the Oak Ridge Leadership Computing Facility are the supercomputer centers within DOE’s Advanced Scientific Computing Research, or ASCR, program. ASCR is part of the Office of Science. The leadership facilities differ from NERSC in that they are more technology-oriented and support not only Office of Science work but also outside projects.

Among the topics covered:

  • NERSC’s unique role in the world
  • The similarities between NERSC and ECP in their efforts to optimize the performance of scientific applications
  • How the NERSC workload represents the diversity of science areas within the Office of Science
  • Perlmutter—NERSC’s first large GPU-based system
  • The NERSC Exascale Science Applications Program, or NESAP
  • The integration of ECP Application Development projects into the NESAP program
  • How the HPC experts at NERSC are ensuring products developed by ECP Software Technology are installed properly at NERSC
  • Richard’s perspectives on how ECP products are already making a difference and what he thinks ECP’s legacy will be

Transcript

[Scott] First things first: Richard’s role at NERSC …

[Richard] Yes, thank you very much. I’m the NERSC high-performance computing department head. We have three departments right now: a systems department, a data department, and then my department.

In my department, I have a number of groups. User engagement covers consulting, third-party software, and that sort of thing. Application performance does a lot of work that is very similar to the kind of work that went into the Application Development part of ECP; in fact, they’re very, very similar. We have an advanced technologies group that’s looking a few years out at the kinds of technologies that will be in HPC systems and how they map onto our workload. We just started a programming environments and models group, and then I also have a business operations and services group. I’m also the senior science advisor. In that role, I make sure that NERSC remains science focused and help tell science stories to staff, DOE, the general public, and the lab.

I started in HPC back when I was in graduate school in the ’80s and ’90s at the University of Illinois, where I was studying astrophysics and physics and was introduced to NCSA, which was just getting started. They had one of the biggest Cray systems in the world at the time, and I started working on that in my research. Then when I graduated, I had a postdoctoral fellowship at NASA Ames, which at the time also had one of the biggest computers in the world—the Connection Machine, the CM-2—as well as a very large Cray system.

I got a lot of experience with vector processors, with the beginnings of distributed-memory processors, and with the Connection Machine, which was a little bit different but was really the first parallel machine that I worked on. After that, I was hired here at NERSC, and I’ve been through all the generations of machines that NERSC has had. When I started in ’96 we had some vector machines, and they’ve all been distributed-memory systems since then.

[Scott] NERSC’s workload reflects the projects of the entire DOE Office of Science research community.

[Richard] We have the unique role in the world of being the high-performance computing and data center that supports all the research that’s being done within the Office of Science. If you look at our workload, it reflects the distribution of science areas: if you look at the Office of Science org chart, our workload pretty well represents that diversity of science areas.

We have advanced computing. We have bioenergy and genomics, earth and environmental systems, a lot of climate research that is done in that area. And then in Basic Energy Sciences, we have materials science and chemistry, and those two areas actually represent the largest fraction of our users and the hours that are used at NERSC. There is also a user facilities part of BES in which we’re supporting a lot of experimental work that’s being done at the Office of Science user facilities.

There is research going on in nuclear physics, which is a lot of QCD but also nucleosynthesis and supernova explosions, and in high-energy physics, which is particle physics, astrophysics, cosmology, that sort of thing. And then there is fusion energy sciences, which is looking at fusion energy and plasma physics. In fact, NERSC started as a center to support fusion energy sciences at Lawrence Livermore National Laboratory back in 1974. So, we’ve been around for a long time.

[Scott] Richard compared and contrasted NERSC and the leadership computing facilities—the Argonne Leadership Computing Facility and the Oak Ridge Leadership Computing Facility.

[Richard] Those three centers—NERSC, the OLCF, and the ALCF—make up the supercomputer centers within ASCR, the Advanced Scientific Computing Research program, within the Office of Science.

As I said, we primarily support research that’s funded by the Office of Science that needs the kind of systems we have—very large-scale supercomputers and big-data systems.

The leadership facilities are a little bit different, in that their allocation programs are open to all researchers in government, academia, and industry whose scientific problems can take advantage of the advanced technologies they provide. All three of the facilities also offer access to a broad range of staff expertise that helps research teams use our resources for scientific discovery.

[Scott] The Perlmutter supercomputer at NERSC is ranked number 8 on the most recent TOP500 list of the most powerful commercially available machines known to the TOP500 organization. It’s based on the HPE Cray Shasta platform and is a heterogeneous system with AMD EPYC-based CPU-only nodes and 1,536 NVIDIA A100 GPU-accelerated nodes. Richard summarized his view of the system’s contribution to scientific research so far.

[Richard] Like all of our systems, we try to get the system that we think will bring the most value, the most opportunity, to the scientists from the Office of Science. Perlmutter was our first large GPU-based system, and so we were a little uncertain about how well the workload would be able to map onto it. That was one of the reasons we started what we call NESAP, the NERSC Exascale Science Applications Program. And for those of you familiar with ECP, it’s structured very much like the Application Development part of ECP.

exterior of the Perlmutter supercomputer at the National Energy Research Scientific Computing Center

The final panels on the exterior of the Perlmutter supercomputer, photographed at the National Energy Research Scientific Computing Center (NERSC) at Lawrence Berkeley National Laboratory (Berkeley Lab), Berkeley, California, 09/28/2022. Perlmutter, NERSC-9, is a customized HPE Cray EX supercomputer named after Saul Perlmutter, a Berkeley Lab astrophysicist and recipient of the 2011 Nobel Prize in Physics. Credit: Lawrence Berkeley National Laboratory

We engaged with about 25 research teams that were developing codes that made up a large percentage—about half—of our workload. And we’ve been working with them for about 3 and a half years to make sure that their codes are ready for GPUs. And partially because of that—also partially because of all the work that went on at the same time within ECP itself and the fact that GPUs have been around now for a little bit—from day 1, our users just jumped on the system.

As soon as we were able to open it up to users, the machine was completely full, with scientists running all kinds of things they’ve never been able to run before, so we immediately had some large applications running.

Earthquake simulations were doing things at scale that they never could before. We had some protein-interaction projects using new AI techniques, for which the system is really, really well suited, that were able to do some things that, again, were never done before. So, there was a lot of work on the system immediately. And since we opened it up to users starting last year and then more so this year, it’s been essentially 100% used all the time. And the backlog in the queues is probably greater than on any system that I’ve seen in all my years at NERSC.

Even though it’s not yet in its final configuration, it’s just been a scientific workhorse for everybody. So, we’re really pleased about that.

[Scott] Perlmutter was deployed in two phases.

[Richard] It’s been a little slower coming fully online than we had hoped, in large part because of supply chain issues and that sort of thing—and Covid didn’t help. We brought it in in two phases. The first phase was the GPU-accelerated nodes—about 1,500 of them—which we brought in in late 2021, running what’s known as the Slingshot 10 high-speed network. That was not meant to be its final configuration, but we wanted to get the nodes available to scientists and to get our hands on the system. That started in late 2021, and we started letting users on in waves. The first wave was the NESAP teams, and we let all the ECP teams that wanted access get on.

So that was in late ’21, and then by spring of this year, we had enabled all our users—anybody who had a GPU-enabled code could get on the system. Since that time, we’ve been slowly adding the CPU-only nodes. We have about 3,000 CPU-only nodes, and we have not yet quite finished integrating all of them into the system. Work is still ongoing.

[Scott] Richard explained how NERSC and ECP closely collaborate.

[Richard] We’ve been very involved with ECP from the beginning. And so, for instance, we have people in various roles. Katie Antypas, the division deputy, is the Hardware and Integration director within ECP. And Jack Deslippe, who heads our NESAP program, leads the ECP apps in chemistry and materials science. That’s been a good relationship for us. And then we’ve had other people involved at other levels within the project as well. I mentioned our NESAP program.

We also effectively integrated five or six AD projects into our NESAP program. We kind of adopted them, and so that was in addition to the ones we had already been working with. So, we’ve been very involved there as well.

We’ve also been very involved with the ST teams, the Software Technology teams. We were very involved, very early on, with the software integration and continuous integration projects going on within ECP. We have three people now working just on that part of ECP at NERSC to make sure that the software products developed through ST—including E4S [Extreme-scale Scientific Software Stack] and the SDKs [software development kits]—are installed well at NERSC and that users can log into our system, get access to them, and use them.

That means figuring out the things that do work and that don’t work. This is a lot of very hard, non-trivial work—taking the products and actually making them work in the software environment and with the hardware that actually exists at the facilities.

We’ve also been very involved and very collaborative with the project in terms of training and hackathons and that sort of thing. And then there is a part of ECP that is focused on how the project integrates with the facilities—us, the OLCF, and the ALCF. So, we’re very actively part of that discussion.

And like the user facilities, we have made time available to the ECP teams that wanted to use NERSC through our director’s reserve allocation. I looked up some of the numbers: I think last year we had 56 ECP projects that were using NERSC—our Cori system and the Perlmutter system when it came online. A lot of that work has been testing codes, developing codes, and a lot of code optimization, because Perlmutter, at a very high level, looks a lot like the exascale systems are going to look, with CPUs and GPUs.

[Scott] Richard said ECP products are making a big impact in the high-performance computing community.

[Richard] ECP has been, I think, a really great project and enabled a lot of great things for us as well—and the other systems.

I mentioned when Perlmutter came online we had lots of projects that were able to start using it immediately. And many of those were because of codes that had been developed through ECP.

EQSIM is an earthquake simulation code. And there are a lot of codes that were enabled through the AMR [adaptive mesh refinement] efforts within ECP. So, from an application perspective, the impact has been great and immediate, I think. That’s been fantastic.

And in the Software Technologies area, there are so many enabling technologies, like libraries, tools, and programming models such as Kokkos, that people are already using. So, it’s already making a big impact in the community. And NERSC and its users have really benefited as well from having access to all that software.

[Scott] His perspective on ECP’s legacy …

[Richard] I really like what ECP became. And I think that in addition to all the software that is currently available and being used productively on the systems, its legacy will have a lot to do with what it did to bring the community together to work in a more coordinated way. A lot of different efforts were going on from a lot of different people, but the community at large didn’t have a center of gravity or the focus that ECP has brought.

For instance, bringing together the domain scientists, the applied mathematicians, the computer scientists, the facility people, and the optimization experts to work on problems, the way AD did. That way of working will continue; it’s now kind of part of the ethos and part of the landscape of HPC.

And so, I think that will be one great legacy from ECP. On the Software Technology side, the same thing has happened. When ECP started, one of the first things it did was create this idea of the SDKs that would work together. A lot of great software products are out there already, and a lot of scientists are using them, but again, some of what you might think of as simple things weren’t really happening. There was no guarantee that one software library would be compatible with another one, even if you wanted to use both together. That could cause namespace collisions or whatever.

These SDKs really brought the developers together to work out all these issues and made toolkits that scientists could use productively without having to worry about these things. And then, of course, we have E4S, which I think is an extension of that and has really defined a new way for the community to look at software as a whole.

It [ECP] has done a lot with promoting best practices for software engineering; for example: best practices for how you support software, how you test it, how you do continuous integration, how you get it integrated into the facility so that it actually will work with the software environments, the hardware, the nuances of the facilities. We’re working in a new way that we didn’t before, and I think it’s to the benefit of everybody.

[Scott] Richard discussed the magnitude and diversity of the work being done at NERSC.

[Richard] We have about 10,000 users, and these users are mostly funded by the Office of Science. But they are actually mostly at universities. A university professor will get a grant of money to work on a project in battery technology, materials for batteries, or whatever. Then they will use NERSC.

A lot of people that end up using NERSC are students and postdocs. Of course, we have a lot of users at national labs as well, and we have a few from industry, nonprofits, and that sort of thing. We have getting close to 1,000 different projects that people are working on, and so that means there are a lot of codes. It’s a diverse user base, and they’re not just from the US—they’re from all over the world. They do have a commonality in that they’re all working on these DOE-funded projects, but it’s a very diverse user base that we have.

[Scott] He said that digging down to understand the needs of users leads to better system design choices and a better experience for all the researchers.

[Richard] I’m personally very interested in what you would call workload characterization, and the reason really has to do with understanding our users, understanding their codes, understanding their needs, and understanding how they use our system. And it’s just a piece of the puzzle.

We talk to scientists and have them fill out surveys, and we look at their applications for time to see what they’re doing. But synthesizing all these sources of information really lets us understand, I think, as well as we can—and we’re always trying to do more—how users use our system and what their codes are doing. The reason we’re interested is so we can figure out how to best address their needs and also how to best configure, design, and procure our next system.

We have lots of design choices we can make when procuring a system, and those go into what we ask for—so our call for proposals that we put out onto the street. And so, this helps a lot with that.

One example of how we’ve used this: I talked about NESAP and how we have 25 codes in NESAP, and those codes do represent about half of our workload. But there’s another half that isn’t represented there. That’s a significant portion of our workload, but if you actually look at it, it’s distributed amongst what we call the long tail.

There are hundreds of codes and thousands of users in that long tail, all doing different things. We’re going through and looking at different communities—say, the materials science community—and using the workload characterization infrastructure we have so far to go in and see what codes they’re running and how they’re running them, and then trying to map those codes onto the GPU readiness we might know about externally. And so, we were able to do all that, and make that assessment, without actually having to talk to all of the, say, 2,000 or so users that are doing materials science research.

For the most part, that materials science community currently using NERSC is actually quite well positioned to use GPUs. That was comforting to us and to the program managers within DOE who oversee materials science, and it also let us know that this is not an area we need to spend a huge amount of effort looking into further, because they’re already doing quite well.
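The long-tail assessment Richard describes boils down to rolling up job records by application, ranking them by machine hours, and checking the biggest contributors against what is externally known about GPU readiness. Below is a minimal Python sketch of that idea; the job records, field names, and readiness list are hypothetical and do not reflect NERSC’s actual data or tooling.

```python
# Minimal sketch of a long-tail workload roll-up: aggregate node-hours by
# application, rank applications by usage share, and tag each against an
# externally maintained GPU-readiness list. All names and numbers below are
# made up for illustration.
from collections import defaultdict

# Hypothetical job records: (application name, node-hours charged)
jobs = [
    ("vasp", 120_000), ("lammps", 80_000), ("quantum_espresso", 60_000),
    ("in_house_md", 5_000), ("legacy_fortran_app", 2_000),
]

# Codes known (by assumption, for this example) to run well on GPUs
gpu_ready = {"vasp", "lammps", "quantum_espresso"}

hours_by_app = defaultdict(float)
for app, hours in jobs:
    hours_by_app[app] += hours

total = sum(hours_by_app.values())
ready_hours = sum(h for app, h in hours_by_app.items() if app in gpu_ready)

# Print each application's share of the workload and its readiness tag
for app, hours in sorted(hours_by_app.items(), key=lambda kv: -kv[1]):
    status = "GPU-ready" if app in gpu_ready else "unknown"
    print(f"{app:20s} {hours / total:6.1%} of hours  [{status}]")

print(f"\nEstimated GPU-ready share of this community: {ready_hours / total:.1%}")
```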

[Scott] Is what’s learned from the workload characterization at NERSC extensible to other facilities?

[Richard] I mean, it certainly would be if they are running similar codes. And I do know that we get asked a lot about this, particularly by vendors. For example, NVIDIA and HPE Cray are very interested in this kind of information for the same reason that we are: making sure that future systems are able to support the kinds of activities that are going on.

[Scott] Richard said that nailing down the best approaches for collecting, sorting, and making sense of workload characterization data is perpetually challenging.

[Richard] It is very challenging to collect this information, partly because there’s potentially a lot of it. Just dealing with lots of information, lots of data, is itself an issue at times. You oftentimes don’t really know what the right questions are to ask, so that’s another challenge.

But beyond those two, if you look at the data itself, it’s often data that we can get our hands on but that was not necessarily designed to answer the questions we might be interested in.

Chip manufacturers, network designers, the people who design networks and interconnects, put counters and things in the hardware that they can measure, and we have hooks for measuring them. But usually they are designed to answer some question the designer was interested in, like, ‘Is this thing working right?’ or ‘Is this thing doing the right thing?’ Mapping those onto metrics that are interesting to us is also a challenge.

And the data that you can pull from these various counters is of wildly different quality and format, which makes this kind of data wrangling, the merging and joining of data streams, extremely difficult.
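To make that wrangling problem concrete, here is a small, hypothetical Python example of aligning two counter streams that arrive at different sampling rates and in different units before they can be analyzed together. The column names and values are invented for illustration and do not come from any NERSC system.

```python
# Align two heterogeneous counter streams onto a common timeline with pandas.
import pandas as pd

# Stream A: network counters sampled every 10 s, reported in bytes
net = pd.DataFrame({
    "time": pd.to_datetime(["2022-12-21 10:00:00", "2022-12-21 10:00:10",
                            "2022-12-21 10:00:20"]),
    "rx_bytes": [1_000_000, 2_500_000, 4_000_000],
})

# Stream B: GPU utilization sampled every 7 s, reported in percent
gpu = pd.DataFrame({
    "time": pd.to_datetime(["2022-12-21 10:00:02", "2022-12-21 10:00:09",
                            "2022-12-21 10:00:16", "2022-12-21 10:00:23"]),
    "gpu_util_pct": [35.0, 80.0, 78.0, 90.0],
})

# Join each network sample to the nearest GPU sample within 5 seconds.
merged = pd.merge_asof(net.sort_values("time"), gpu.sort_values("time"),
                       on="time", direction="nearest",
                       tolerance=pd.Timedelta("5s"))

# Convert units so both columns are directly comparable per interval.
merged["rx_mbytes"] = merged["rx_bytes"] / 1e6
print(merged[["time", "rx_mbytes", "gpu_util_pct"]])
```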

[Scott] What sorts of systems are in place to make the task a little easier?

[Richard] Well, this is an ongoing issue for us, so we’re collaborating with a lot of other centers to try to figure out how best to do this. There are some things that are making it easier. LDMS is a way to collect and report data that is now more commonly used at various centers. And there are technologies for storing data that make it easier to get at large data sets and to query them. But we’re still trying to piece those all together.

I’m not sure I have a system right now that I could give you that I would say, ‘You know, copy us and do what we’re doing.’ We’re both exploring ways to do it better and trying to use the data as we get it at the same time.

[Scott] Here’s more on LDMS, which for reference, by the way, is the Lightweight Distributed Metric Service.

[Richard] It provides a way to collect data from nodes while jobs are running and a transport mechanism that then puts it on a bus or has some way to transport that out to whatever external system you want to store it on. So, it’s a way of moving performance and counter data that you might want to collect off a system onto something else.
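As a rough illustration of that sampler-plus-transport pattern, the toy Python sketch below has per-node samplers hand metric records to a transport queue that an aggregator drains into an external store. It is only a conceptual stand-in for what Richard describes and does not use the real LDMS daemons, plugins, or configuration.

```python
# Toy sketch of the collect-and-transport pattern: samplers on each node
# gather metrics while a job runs, a transport carries the records off the
# system, and an aggregator persists them elsewhere. Values are fabricated.
import json
import queue
import random
import time

transport = queue.Queue()   # stand-in for a transport layer
store = []                  # stand-in for an external database

def sample_node(node_id: str) -> dict:
    """Read a few per-node metrics (fabricated values for illustration)."""
    return {
        "node": node_id,
        "timestamp": time.time(),
        "cpu_util_pct": random.uniform(0, 100),
        "mem_used_gb": random.uniform(0, 256),
    }

def sampler_loop(node_id: str, samples: int) -> None:
    """On each node: periodically sample and hand records to the transport."""
    for _ in range(samples):
        transport.put(sample_node(node_id))

def aggregator_loop() -> None:
    """Off the system: drain the transport and persist records elsewhere."""
    while not transport.empty():
        record = transport.get()
        store.append(json.dumps(record))

for node in ("nid001", "nid002"):
    sampler_loop(node, samples=3)
aggregator_loop()
print(f"collected {len(store)} metric records")
```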

[Scott] What is NERSC currently up to?

[Richard] Oh, we are very busy. Right now, we’re working on many things. As I mentioned before, we’re trying to finish off configuring and integrating all the pieces of Perlmutter and making that available to all our users as a full system. We’re already working really hard on NERSC 10. We number our procurements, and Perlmutter was NERSC 9, and NERSC 10 will be our next one.

For those of you who know what CD-0 is, we have CD-0; it’s basically developing and showing a mission need for the system and getting that approved. We’ve done that, and we will be putting out a call for proposals for that system sometime later this year. So, a lot of activities.

We’re looking at NESAP and trying to redefine its role, making it a little less focused on the performance of an individual application and more focused on the performance of an entire scientific workflow. That could include moving data, post-run analysis, integration of simulation and experimental data, and ‘How do you optimize that?’ ‘How do you measure that?’ ‘How do you enable that to work better?’

We’re looking at a lot of workflows, as I said. We’re thinking about ‘How can we make our systems more resilient?’ A lot of that is being driven by the interactions we’re having with the experimental facilities within the Office of Science. And ‘How can we support their needs for high-performance computing in their data analysis?’ Oftentimes, they have needs for real-time or live data analysis. And ‘How do we coordinate all that and make our systems available in a more resilient way?’ So that’s a big thing we’re working on.

We’re looking at how to continue to leverage AI and deep learning for science, for data analysis as it applies to data but also as it applies to enabling simulations to be able to do things faster.

We’re also very much exploring how quantum systems are going to impact science in the future. So ‘How can quantum computers or quantum accelerators be used to attack the same kinds of problems that our users are attacking right now?’ And ‘What new kinds of problems will they enable?’ And ‘How does a center like NERSC or user facility look when quantum technologies are available?’

[Scott] Richard summed up the overall frenetic activity at NERSC.

[Richard] So much is going on, but at the same time, it’s really exciting—all the work on Perlmutter. It’s really challenging and exciting looking for the next system also.

And then all the new techniques, the new kinds of science, and the new capabilities that are being enabled by these GPU systems with their tensor cores, along with the developments in AI and how they are being applied to science and to calculations, are really changing the landscape. So, there’s a lot going on.

It’s all really fascinating and interesting and just trying to apply our finite resources to kind of an infinite number of interesting things, I think, is kind of how a lot of people feel right now.

[Scott] He shared more about the ECP–NERSC connection.

[Richard] I think that NERSC has been really involved with ECP. And I hope it’s been to the benefit of both of us—our user community and the community of software technologies that are involved in ECP are really one and the same.

As I said before, I really like what ECP has done and what they’ve become—and partially it’s because I really think that there will be a big impact. There’s a big opportunity. I think there will be a big impact on how scientific computing is done and the benefit to science. It’s really going to be great.

I’ve really been happy how much we’ve been able to work with ECP and how welcoming they’ve been to us working with them. I really can’t say enough about, I think, the impact that it has had. And you asked about the legacy. I think the legacy of having the community work together more closely is really exciting to me, and I think it’s something that we really didn’t have before to the extent that we do now.

[Scott] Much appreciation to Richard Gerber of NERSC for being a guest on Let’s Talk Exascale.

And thank you for listening. Visit exascaleproject.org. Subscribe to ECP’s YouTube channel—our handle is Exascale Computing Project. Additionally, follow ECP on Twitter @exascaleproject.

The Exascale Computing Project is a US Department of Energy multi-lab collaboration to develop a capable and enduring exascale ecosystem for the nation.

Related Links

Richard Gerber’s NERSC bio

NERSC systems

Katie Antypas’s NERSC bio

ECP Hardware & Integration

Jack Deslippe’s NERSC bio

ECP Chemistry and Materials

Scott Gibson is a communications professional who has been creating content about high-performance computing for over a decade.