Siting the El Capitan Exascale Supercomputer at Lawrence Livermore National Laboratory

Exascale Computing Project · Episode 104: Siting the El Capitan Exascale Supercomputer at Lawrence Livermore Lab.

By Scott Gibson

Hi. Welcome to the Let’s Talk Exascale podcast from the US Department of Energy’s Exascale Computing Project. I’m your host, Scott Gibson.

A special thanks to Jeremy Thomas for his input concerning this episode on the upcoming El Capitan exascale supercomputer at Lawrence Livermore National Laboratory, or LLNL. Jeremy is the public information officer for Engineering and Computing at LLNL.

El Capitan, LLNL’s first exascale-class supercomputer, is projected to exceed two exaFLOPS, which is two quintillion floating-point operations per second of peak performance. That capability could make El Capitan the most powerful supercomputer in the world when it comes online. El Capitan is projected to arrive in 2024 and deliver science in 2025.

Bronis R. de Supinski of Lawrence Livermore National Laboratory

As the National Nuclear Security Administration’s (NNSA) first exascale supercomputer, El Capitan will enable the NNSA Tri-Labs (LLNL, Sandia, and Los Alamos) to meet the increasingly demanding requirements for ensuring the safety, security, and reliability of the nation’s nuclear stockpile without nuclear testing. It will help LLNL fulfill its core mission by providing scientists with the tools to perform the complex predictive modeling and simulation required by NNSA’s Stockpile Stewardship Program, particularly the multiple Life Extension Programs and Modernization Programs, as well as secondary missions impacting national security, such as nuclear nonproliferation and counterterrorism.

When the exascale era arrives at LLNL, researchers will be able to more efficiently model and simulate complex physics with a level of detail, accuracy, and realism not possible today.

Our guest is Bronis R. de Supinski. Bronis is chief technology officer for Livermore Computing at LLNL. He formulates LLNL’s large-scale computing strategy and oversees its implementation. He frequently interacts with supercomputing leaders and oversees many collaborations with industry and academia. Previously, Bronis led several research projects in LLNL’s Center for Applied Scientific Computing.

He earned his Ph.D. in computer science from the University of Virginia in 1998, and he joined LLNL in July 1998.

In addition to his work with LLNL, Bronis is a professor of exascale computing at Queen’s University, Belfast.

Throughout his career, Bronis has won several awards, including the prestigious Gordon Bell Prize in 2005 and 2006 as well as two R&D 100 Awards. He is a fellow of the ACM and IEEE.

[Scott] Welcome, Bronis. Thank you.

[Bronis] Thanks. Nice to be here.

[Scott] Great. Well, let’s get going here talking about El Capitan. How is the siting process going for El Capitan?

[Bronis] It’s going well. You know, we first had to start with a big project to get the whole building ready; that was called the Exascale Computing Facility Modernization Project, which [High Performance Computing Chief Engineer] Anna Maria Bailey led for Livermore. And that has increased the power available in our main data center, on our main compute floor, to 85 megawatts. And that’s been done for about a year now. That also gives us about another 15 megawatts for cooling. So we actually have a 100-megawatt data center now.

Since then, there have of course also been preparations for the system itself, beyond that building upgrade. Inside the building, we needed to deploy water through the primary cooling loop and do some upgrades to the electrical system that actually brings the power from the wall all the way to the system. And that basically also just finished about a week or so ago. So that’s now all done, and we’re ready to start siting the computer in our machine room.

[Scott] All right. Well, how is El Capitan going to impact Livermore Lab’s core mission of national nuclear security?

[Bronis] Well, so we expect El Cap to be a transformative system. Our existing system is Sierra, and one of my happiest moments was when I heard members of our code teams state that Sierra was really the first system that they found truly transformative, in that it had actually made it so that 3D simulations are now fairly routine; they can complete them in a reasonable period of time.

With El Capitan, we’re going to significantly increase the capability that we provide to our users. And I expect it’ll again have a similar transformative effect, in that now they’ll be able to run those 3D simulations so routinely that they’ll be able to use them for uncertainty quantification on a very rapid turnaround basis.

[Scott] Will there be an unclassified companion system for El Capitan like you have with Lassen for Sierra?

[Bronis] Lassen, yes. Lassen is a national park in Northern California near Lake Shasta. So, yes, we’re planning to get an unclassified system that will be called Tuolumne. We pretty regularly take the names for our biggest systems from California landmarks, particularly mountains. El Capitan, of course, is an iconic rock face in Yosemite. Tuolumne Meadows is a nice area up near the highest point in Yosemite, near the Tioga Pass. So, yep, we’ll have a system. It’ll be roughly 10 to 15 percent of the size of El Capitan.

[Scott] All right. With the recent success at LLNL of fusion ignition, will El Capitan be used for fusion research?

[Bronis] Some. Primarily, El Cap will be used for the Advanced Simulation and Computing program for the stockpile stewardship mission. But we do have a team actively working on an application that they call ICECAP, and that uses a variety of techniques to simulate the NIF [National Ignition Facility] beams. The goal of that set of simulations is to understand the fusion process, the ignition process, sufficiently that we can make achieving energy gain a regular occurrence with NIF, which is where the big fusion energy experiments take place.

[Scott] What other scientific areas might benefit from the capabilities of El Capitan?

[Bronis] El Capitan will be pretty heavily used, pretty much—not quite exclusively, but nearly exclusively—for stockpile stewardship. Tuolumne will contribute more to the wider range of scientific areas. Now, there’s a wide range of scientific disciplines that get explored as part of stockpile stewardship. There’s a lot of materials modeling, so a lot of just kind of basic ways that the universe fits together. We’ve typically had a wide range of molecular dynamics, and some QCD [quantum chromodynamics] and seismic modeling get run on the system.

What will probably happen is that those sorts of applications, climate and that sort of thing, will run on Tuolumne. And if there’s a particular case to be made, we can occasionally provide briefer runs on the big system.

[Scott] What is the role of AI going to be on El Capitan, and moving forward even beyond El Capitan? What do you see as the role of AI?

[Bronis] So the ICECAP application that I mentioned actually uses AI. So we’ve been very actively exploring cognitive simulation, which is where we use AI techniques—primarily deep neural networks—to short-circuit the need to do detailed physics simulations of some aspects of these large multi-physics simulations.

So ICECAP is using a model called the Hermit model that models portions of the overall fusion process. I don’t think I want to get into all the details of what it does, but we’re actively looking at ways to apply these techniques. I mentioned uncertainty quantification. That’s where we run basically a wide parameter sweep of a specific type of simulation and then try to understand the uncertainties involved in that simulation. We tend to use AI models to guide the parameter choices in those simulations. And then, in ICECAP, it’s actually using AI at the lowest level of the simulation, right within the inner loop, to simulate specific physical aspects.
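
To make the inner-loop idea concrete, here is a minimal, hypothetical sketch of the pattern described above: inside a multi-physics time-stepping loop, an expensive first-principles calculation can be short-circuited by a call to a trained surrogate model. All names (ZoneState, detailed_response, surrogate_response) are illustrative assumptions, not taken from ICECAP or any LLNL code, and the surrogate body is just a stand-in for real neural-network inference.

    #include <cmath>
    #include <vector>

    // Hypothetical state for one zone of a multi-physics simulation.
    struct ZoneState {
        double density;
        double temperature;
        double response;  // quantity that is expensive to compute from first principles
    };

    // Stand-in for a detailed physics package (the expensive path).
    double detailed_response(const ZoneState& z) {
        // ... imagine a fine-grained, iterative physics solve here ...
        return z.density * std::exp(-1.0 / (z.temperature + 1e-12));
    }

    // Stand-in for a trained deep-neural-network surrogate; in practice this
    // would wrap an inference call to a model trained on detailed results.
    double surrogate_response(const ZoneState& z) {
        return z.density * std::exp(-1.0 / (z.temperature + 1e-12));
    }

    // Cognitive-simulation pattern: the AI model can replace the detailed
    // physics call right inside the innermost loop of the simulation.
    void advance(std::vector<ZoneState>& zones, int steps, bool use_surrogate) {
        for (int s = 0; s < steps; ++s) {
            for (ZoneState& z : zones) {
                z.response = use_surrogate ? surrogate_response(z)
                                           : detailed_response(z);
                z.temperature += 0.01 * z.response;  // ... rest of the physics update ...
            }
        }
    }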

[Scott] What can you tell us about the El Capitan software? I think TOSS, RHEL, Spack, Flux.

[Bronis] Yeah, so we site a pretty large set of systems at Livermore, not just systems like Sierra and El Capitan, which are ASC advanced technology systems, or ATSs, but also many more ordinary Linux clusters. Most of those are bought through our commodity technology systems procurement. And those have for years been running TOSS for system management, which is based directly on the RHEL, or Red Hat Enterprise Linux, distribution.

So initially El Cap will be running RHEL 8, and not too long after we site it, it will move over to RHEL 9. TOSS will be the mechanism by which all the system management functions are handled.

You mentioned Flux. So Flux is Livermore’s next-generation resource manager. TOSS has traditionally used Slurm as its resource manager, but we, some years ago now, were looking at what we saw coming in future systems and what it would take to adapt Slurm to really support those capabilities. We found that it was going to be too difficult. Slurm was originally developed at Livermore, but we decided that its relatively homogeneous, node-centric view of the world wasn’t going to work.

For El Capitan, in addition to a fairly large number of compute nodes, we’re also getting something called Rabbit nodes, which I like to think of as data-analysis nodes. They’re connected over PCIe to a subset of the compute nodes. Each set of compute nodes in one of the HPE Olympus cabinets is connected to a Rabbit module that consists of several NVMe SSDs and also an AMD EPYC processor. In order to make use of that capability—it’s a near-node local storage capability—we need to be able to allocate that storage as part of a compute job and also understand which Rabbit processors, which of the EPYCs, are associated with that compute job. And Slurm wasn’t going to be able to handle that. When we asked HPE to provide the appropriate resource management support for it, they talked with SchedMD about it, and it worked out—as we kind of expected—that they were not going to be able to make that type of varied allocation within a single job work within Slurm.

So we were quite happy that we had been developing Flux, and now we’ll be using Flux as our resource manager on El Capitan. We’ve already got Flux deployed as the system-level resource manager on some of our EAS3s [third-generation Early Access Systems] as well as some of our commodity technology systems, or CTS, at Livermore.

Flux also has a unique design: it’s a hierarchical resource management framework. Within a Flux instance, you can create additional Flux instances, and that allows us to actually run it at user level. It’s been used quite widely already for that. And it’s had ECP funding, which has primarily gone toward making it so that other ECP applications could use it to do things like UQ [uncertainty quantification] runs, managing a partition that they get allocated, potentially under a different resource manager.
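
As a rough illustration of that hierarchical model, here is a small conceptual sketch. It is not the Flux API; the Instance class and its methods are hypothetical, and they only show the idea that an instance can carve child instances out of its own allocation, for example a user-level instance managing many small UQ runs inside one batch allocation.

    #include <iostream>
    #include <memory>
    #include <string>
    #include <vector>

    // Hypothetical, simplified model of a hierarchical resource manager instance.
    class Instance {
    public:
        Instance(std::string name, int nodes) : name_(std::move(name)), free_nodes_(nodes) {}

        // Carve a child instance out of this instance's free nodes.
        std::shared_ptr<Instance> spawn_child(const std::string& name, int nodes) {
            if (nodes > free_nodes_) return nullptr;
            free_nodes_ -= nodes;
            auto child = std::make_shared<Instance>(name, nodes);
            children_.push_back(child);
            return child;
        }

        // Schedule a job against this instance's own free nodes.
        bool run_job(const std::string& job, int nodes) {
            if (nodes > free_nodes_) return false;
            free_nodes_ -= nodes;
            std::cout << name_ << " runs " << job << " on " << nodes << " nodes\n";
            return true;
        }

    private:
        std::string name_;
        int free_nodes_;
        std::vector<std::shared_ptr<Instance>> children_;
    };

    int main() {
        Instance system("system-level", 1024);                      // system resource manager
        auto batch = system.spawn_child("batch-allocation", 128);   // one user's allocation
        // A user-level instance inside the allocation farms out many small UQ members.
        for (int i = 0; i < 4; ++i) {
            batch->run_job("uq-member-" + std::to_string(i), 32);
        }
        return 0;
    }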

[Scott] So will you describe for us what the effort is as far as porting codes—RAJA—over to El Capitan?

[Bronis] The COE is the Center of Excellence, and that’s the basic mechanism under which our application teams and our software experts, in general, interact with HPE and AMD. RAJA is a portability suite built on C++ abstractions, primarily lambdas, that many of our applications adopted in the process of porting from CPUs, prior to our Sierra system, in order to be able to run on GPUs and to simplify the effort involved in porting to new systems. It’s similar to the Kokkos infrastructure that’s produced at Sandia; the two, RAJA and Kokkos, are actually very similar.
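
As a small sketch of what that looks like in practice, here is a generic daxpy loop written with RAJA’s forall interface. The kernel body is captured once in a lambda, and the execution policy selects the backend (sequential CPU, CUDA for Sierra-class GPUs, or HIP for AMD GPUs); this assumes RAJA is built with the corresponding backend enabled and that the arrays are accessible from the device. The daxpy kernel itself is only an illustration, not code from an LLNL application.

    #include "RAJA/RAJA.hpp"

    // Pick an execution policy at compile time; the loop body itself never changes.
    #if defined(RAJA_ENABLE_HIP)
    using exec_policy = RAJA::hip_exec<256>;   // AMD GPUs (the EAS3s, El Capitan)
    #elif defined(RAJA_ENABLE_CUDA)
    using exec_policy = RAJA::cuda_exec<256>;  // NVIDIA GPUs (Sierra, Lassen)
    #else
    using exec_policy = RAJA::seq_exec;        // plain sequential CPU fallback
    #endif

    // y = a*x + y over n elements; x and y must be device-accessible when a GPU
    // policy is selected (e.g., allocated in unified or device memory).
    void daxpy(double* y, const double* x, double a, int n) {
        RAJA::forall<exec_policy>(RAJA::RangeSegment(0, n),
            [=] RAJA_HOST_DEVICE (int i) {
                y[i] = a * x[i] + y[i];
            });
    }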

In general, our application teams have found preparations for AMD GPUs to be pretty straightforward. We largely credit the use of RAJA for that. Our application teams basically spent 3 to 5 years getting ready to run on Sierra, while the effort involved in getting ready to run on AMD GPUs has been more like 3 to 5 person-months.

So our EAS3s are systems that are very similar to Frontier. They have MI250X GPUs and Trento CPUs—they’re just about identical. They don’t have the on-node SSDs; instead, we’re deploying Rabbits into those systems so that we can get ready to use them in El Capitan. But our teams have really found the experience of getting ready to use the AMD GPUs—and, as a result, getting ready to use El Capitan—greatly simplified.

[Scott] If you will, please describe for us the progress that has been made in the programming environment.

[Bronis] Well, there are always challenges in programming environments. But, you know, I think significant progress has been made, and the ease with which our teams have been able to get up and running on the AMD GPUs really speaks to that.

[Scott] Will you discuss for us the partnership with HPE and AMD in getting El Capitan to this point?

[Bronis] Well, it’s been a really good partnership. There’s been a lot of education on all sides. This is, you know, really the first really large system that we’ve bought from HPE. I mean, HPE acquired Cray, and the last time a big Cray system was deployed at Livermore was before I started working at Livermore, so more than 25 years ago. We’ve had smaller Cray systems and also other systems from HPE in that interim, but not one quite this size. We’ve also had some systems with AMD processors in them. But this is the biggest system we’ve done; it certainly is the biggest in basically every way you would measure it, not just the capability, which kind of reflects the inexorable march of time.

So it’s been a learning experience. They both have been really great to work with. There’s definitely been an education process with HPE. We told them right away that we planned to run Red Hat on our systems, and that we meant everywhere, on every node and every type of node. That was something we had to convince them about, but now that we have, they’re working hard with us on that. We’ve been able to get them to actually deploy TOSS on systems at Chippewa Falls and to ship systems to us with TOSS already deployed. So that’s been really good. AMD is really great to work with; they’re quite open. They bring a lot of interesting ideas to the table, and when we tell them what we want to get done, they really help us figure out how to get it done.

[Scott] Now, here I’m asking for a summary because I know it’s a big thing to ask. But what did it take to get the Livermore Computing facility ready for El Capitan, if you had to put that in relatively few words?

[Bronis] Well, I mean, the ECFM project that I mentioned was a huge effort. Most of that took place during the pandemic, so I think Anna Maria [Bailey] and her team deserve the utmost kudos for managing to roughly double our power and cooling capability while very few people were able to be on-site. That was a huge effort.

For several years, we’ve all been involved in working closely with HPE and AMD, figuring out what we’re going to need to do to be able to run well on the system and what we want the system to look like. You know, you don’t deploy a system of this size and capability without really involving the entire center. We’ve got on the order of 120 employees in Livermore Computing, and in addition, we work closely with the Center for Applied Scientific Computing, and a large number of those people have been involved. So it really requires everybody pulling in the same direction for multiple years.

[Scott] All right. I want to step back in time to PathForward. That program was critical to ECP’s co-design process, which brought together technical expertise from diverse sources for collaboration. And you led that effort within ECP. Will you tell us more about PathForward and the impact it had on ECP?

[Bronis] Sure. Well, you know, PathForward was on the order of a $300-million advanced R&D project, and that’s in addition to the non-recurring engineering that’s been funded for the exascale systems. It was really advanced preparation for getting the ecosystem ready to offer what we would require for fielding successful exascale systems. It funded six different companies; if I think about it, I can probably name them all: Cray, HPE, AMD, Intel, IBM, and NVIDIA.

Not all of them have had success in exascale system procurements, but we saw technology from all of those projects offered in systems that we could have chosen for the exascale systems. So the projects were quite successful at making major impacts on the large-scale computing ecosystem in the US. We’re actually seeing significant technology from those projects in the systems that we’re siting.

There is, I believe, some technology that Intel is fielding for the Aurora system. There’s quite a bit that’s being used already in the Frontier system at Oak Ridge, and the El Capitan system wouldn’t look anything like what it’s going to look like without that project.

To give you some specific examples for El Cap, we’ll be using the HPE—formerly Cray—Slingshot network in El Capitan. Significant portions of that networking technology were developed through ECP funding. We’re also using AMD technology: we’ll be using the MI300A. The A is for APU, or accelerated processing unit, which provides integrated CPU and GPU technology in the same package. It uses CPU chiplets and GPU chiplets all together to form a single processing unit. And that type of technology would not have been available for El Capitan without the work that AMD did under PathForward.

[Scott] All right. Bronis, you mentioned the word ecosystem. Are there other things you would credit ECP with in terms of really laying the groundwork for El Capitan to become a reality? Anything you’d particularly like to mention?

[Bronis] Significant parts of our software base have been involved in ECP. We’ve already discussed Flux; that gets significant funding through ECP, and it was also funded before that through the ASC program. Several of the application teams have had some funding that’s been included under the ECP umbrella. In addition, Kokkos and RAJA, which we already mentioned, also received ECP funding. So, you know, without that funding, we wouldn’t be ready to site El Capitan.

Another area that I’ve been involved in is OpenMP. The SOLLVE project under ECP has funded the development of a wide range of OpenMP technology. I’ve been involved—I’m the chair of the OpenMP Language Committee. And so that’s funded a lot of interactions with ECP application teams, understanding their needs for OpenMP and ensuring that the latest versions of the OpenMP specification reflect their needs.

It’s funded a lot of work on developing OpenMP technologies in LLVM, which is really the backbone of the compiler infrastructure for El Capitan. The Cray compiler team uses its own software base for its Fortran compiler, but it is now based on LLVM for C/C++. In addition, the AMD compiler suite uses LLVM, so it’s benefited a lot from the ECP work to improve OpenMP in LLVM. And SOLLVE also developed an OpenMP correctness test suite that’s used for Frontier and will be used for El Capitan to verify that the compiler suites implement the OpenMP specification correctly.
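
For a sense of the kind of code this OpenMP work targets, here is a minimal sketch of the same sort of loop expressed with standard OpenMP target-offload directives, which LLVM-based C/C++ compilers of the kind described above map onto GPUs. The daxpy example itself is just an illustration, not taken from any LLNL application.

    // Standard OpenMP device offload: the target construct moves execution to
    // the GPU, and the map clauses describe the data movement for x and y.
    void daxpy_offload(double* y, const double* x, double a, int n) {
        #pragma omp target teams distribute parallel for \
            map(to: x[0:n]) map(tofrom: y[0:n])
        for (int i = 0; i < n; ++i) {
            y[i] = a * x[i] + y[i];
        }
    }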

Pretty much, you know, soup to nuts, a lot of the software work has directly impacted El Cap, I would say.

[Scott] Great answer. I want to ask if there’s anything else that you’d like to cover that hasn’t been discussed?

[Bronis] You know, I mentioned that one of my proudest moments was hearing that Sierra was transformative for our application teams, and I’m really hoping to hear similar reviews once El Capitan is sited, accepted, moved to our classified network, and put in the hands of our users for production work. I really expect that it will provide a significant change in the way they’re able to get their work done. It’s going to be one of the most capable systems on the planet, if not the most capable system on the planet, in terms of its ability to allow our application teams to get their day-to-day work done and then move forward with cognitive simulation. Its capability in terms of AI will be thoroughly impressive, I’m quite confident.

[Scott] Fantastic. Well, thank you, Bronis.

[Bronis] Thank you for having me.

Scott Gibson is a communications professional who has been creating content about high-performance computing for over a decade.