A Conversation with Ben Bergen of Los Alamos National Laboratory
Computational scientist Ben Bergen of Los Alamos National Laboratory (LANL) leads the Advanced Technology Development and Mitigation (ATDM) subprogram’s Flexible Computational Science Infrastructure (FleCSI) project. He recently spoke with ECP Communications to offer insights about FleCSI, including an update on progress. This is an edited transcript.
What is FleCSI all about and how does it fit into the overall ECP effort?
FleCSI is a newish project. It’s part of the ATDM subprogram, which is part of the ASC program. ASC is the Advanced Simulation and Computing program in the National Nuclear Security Administration (NNSA), which is responsible for the stockpile stewardship mission of the United States. ASC’s primary mission is to provide high-performance simulation services for nuclear stockpile stewardship and to ensure that everything works when, and only when, it is supposed to. Part of the motivation for starting ATDM about three and a half years ago was to develop a new production weapons capability at the national nuclear security laboratories. The problem we’re trying to address is that some of our production codes are 20 or 30 years old, at least, and some are older still. So, you can imagine that the technology we have available for developing codes and projects has advanced quite a bit since then.
Another challenge is that the computer systems have really changed over that time period. We had something of a golden age of doing what’s called distributed memory programming using a tool called MPI, the Message Passing Interface. That let us write codes that would run across clusters of computers where data needed to be communicated between the individual nodes of the machine. We went through a transition, probably in the early 90s, that caused some rewriting of codes and required some new development, but once we adopted that programming model, we were able to go for quite some time, really through around 2005.
But then machines and computer architectures started to change. A lot of that was related to the walls we ran up against in advancing the basic hardware technology of the computers. We ran into something called the power wall: as we shrank the transistors on a processor die, we couldn’t maintain their state anymore without excessive amounts of energy, and it became infeasible to continue down that road. There was a noticeable shift in the consumer computer industry as clock speeds stopped going up.
We still had the advances of Moore’s Law, which, in layperson’s terms, means that the number of transistors on a chip doubles roughly every 18 months. In the past, that meant we got better performance out of it, but around 2005, that stopped being true. We were still getting more transistors, but we needed to figure out what we could do with them. That led to a lot of innovation and a lot of splits in the approaches that vendors used to try to figure out what they could do with this new transistor budget.
All of that innovation and experimentation led to problems in maintaining the programming models we had been using, because they didn’t align well with the new systems. Some things like GPUs came out of this change, as did the Cell processor that was part of the Sony PlayStation 3. A version of that processor was actually used in a supercomputer at Los Alamos called Roadrunner.
This presented challenges for the physicists, applied mathematicians, and software developers at the laboratories who were trying to develop and maintain the production codes we need to satisfy our stewardship mission. So ATDM, the program I mentioned before, was a recognition by the US government, by Congress, that we needed to do something to fix that. Money was carved out for that effort, which preceded the Exascale Computing Project, and we started trying different approaches across the laboratories—Los Alamos, Sandia National Laboratories, and Lawrence Livermore National Laboratory—to develop new ideas for how we could create production codes that would run well and be more robust to the kinds of architectural changes that were coming.
So that was really the beginning of the FleCSI project, from the ATDM program.
Ben, what do you anticipate will be the ultimate effect of the project’s efforts, the impacts?
If we’re successful, then we will enable physicists and applied methods developers and software developers to write production and open science applications that can do new science. So, really, we want to empower the people who are trying to develop new understanding of physical phenomena and the ability to do that through software simulation.
So tell us a little bit about the collaborations that have been involved in your project.
Sure. One of the main collaborations is with the ECP Legion project.
I mentioned before that we had been using MPI, the Message Passing Interface, to do distributed memory communication, and one of the challenges with that is that the communication is bulk synchronous. That means that when the different compute nodes are running the simulation, at various, fairly fine-grained points everybody has to synchronize and agree on where they’re at.
One of the ways to address that challenge is to move to what are called task-based models, and the Legion project is an implementation of a task-based model. Basically, it allows the different parts of the program to make progress without syncing up with each other, and that means we can be more efficient and do a better job of keeping the resources of a supercomputer busy doing the simulation. The FleCSI project depends on Legion, and FleCSI actually is a realization of the Legion programming model. The reason FleCSI exists, really, is that the interface for Legion is pretty low level, so it’s not something you would want to program to, even if you were a computational physicist or an applied mathematician, because it requires a lot of boilerplate code to describe information to the runtime.
I guess if anything, FleCSI is an abstraction layer for the Legion programming model that lets people interact with Legion in a more comfortable way. That’s an important project going forward because, in the future, we believe it will allow us to scale to much larger problems, and it also gives us portability, in a performant way, across really diverse architectures, or at least across system architectures whose components vary from one machine to another. The Legion runtime and FleCSI let you schedule work on those compute nodes in a way that’s very flexible, which means you don’t have to change what you’re doing from one machine to another, at least not in the high-level code.
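The contrast Bergen draws between bulk-synchronous and task-based execution can be caricatured in a few lines. This is a toy Python sketch, not FleCSI or Legion code: the bulk-synchronous version forces an implicit barrier between every step, while the task-based version lets each chain of dependent tasks run to completion independently.

```python
from concurrent.futures import ThreadPoolExecutor

def work(x):
    # Stand-in for a compute kernel on one piece of the mesh.
    return x * x

data = list(range(8))

# Bulk-synchronous style: every step is a global round trip.
# All of step 1 must finish before any of step 2 begins.
step1 = [work(x) for x in data]      # implicit barrier here
step2 = [work(y) for y in step1]     # ...and here

# Task-based style: each piece's chain of tasks progresses on its
# own; piece 3 can finish step 2 before piece 7 finishes step 1.
with ThreadPoolExecutor() as pool:
    futures = [pool.submit(lambda x: work(work(x)), x) for x in data]
    task_based = [f.result() for f in futures]

assert task_based == step2  # same answer, more scheduling freedom
```

A real task-based runtime like Legion tracks the data dependencies between tasks itself, which is exactly the boilerplate FleCSI hides from the application developer.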
How about milestone accomplishments to date? Where’s the project with respect to that?
Well, one of the milestones we met last year is an ECP milestone. We have separate milestones, so we’re accountable to both the ASC program and to ECP. We had a milestone for the ASC program last September, and that was to get a hydrodynamics code running with FleCSI on top of Legion. We also support an MPI back end as a risk mitigation strategy. So we got a physics application that does hydrodynamics. The most intuitive way to think about that is if you were standing next to a stream and you dropped a cork into it, the cork would start moving in the water because of the current: because everything around it is moving, the cork is moving. This is called advective transport, and it’s different from another kind of motion, called diffusive transport, which would be like putting a drop of ink in a swimming pool or in a glass of water. The ink would diffuse out until it was more or less evenly distributed. Those are two of the ways that things can have motion.
Hydrodynamics describes advective transport (you’re moving because everything else around you is moving), and it’s one of the phenomena we need to simulate. This milestone was to do a hydrodynamics simulation with both Eulerian and Lagrangian models. I won’t go into the details of what those are, but we met that milestone last September: we were able to run this physics application built on top of FleCSI using either the Legion or the MPI back end with the same user code. That gives us portability across runtimes that we were unable to achieve before.
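The cork-and-ink distinction has a simple one-dimensional caricature. This hypothetical Python sketch (not from the milestone application) evolves a pulse with a first-order upwind advection step and, separately, with an explicit diffusion step: the advected pulse travels downstream with the flow, while the diffused pulse stays put and flattens out.

```python
# 1D advection (the cork) vs. diffusion (the ink), explicit schemes.
n, dx, dt = 50, 1.0, 0.5
u = 1.0       # stream velocity (advection)
kappa = 0.2   # diffusivity

adv = [0.0] * n
dif = [0.0] * n
adv[10] = dif[10] = 1.0   # initial pulse in cell 10

for _ in range(20):
    # First-order upwind advection: the pulse moves with the flow.
    adv = [adv[i] - u * dt / dx * (adv[i] - adv[i - 1])
           for i in range(n)]
    # Explicit diffusion: the pulse spreads out symmetrically.
    dif = [dif[i] + kappa * dt / dx**2 *
           (dif[(i + 1) % n] - 2 * dif[i] + dif[i - 1])
           for i in range(n)]

peak_adv = max(range(n), key=lambda i: adv[i])
print(peak_adv)          # pulse center has moved downstream of cell 10
print(max(dif) < 1.0)    # diffused pulse has flattened: True
```

The choice of velocity, diffusivity, and grid size here is arbitrary; both updates are kept inside their stability limits (Courant number 0.5 for advection, diffusion number 0.1).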
Those back ends have different capabilities. When we’re using the Legion runtime, we get task-based parallelism that can be more efficient on big architectures. But the MPI one also runs pretty well. Actually, it turns out that the tasking model we’re using in FleCSI, which was developed for Legion, applies fairly well to the MPI back end and improves it. So that was a good milestone for us.
You know, ECP is a special project. It’s a special opportunity for researchers. I want to ask you what you believe are some of the immeasurable, intangible benefits of the Exascale Computing Project.
I think the main thing that I’ve observed is that this is the first time in my career that we’ve had a project that really spanned as many institutions and has brought people together.
We have an annual ECP meeting. I mean, the project as a whole has just driven a lot of interaction and collaboration that really wouldn’t have existed unless we had this project. So it’s been great from that point of view. I think the other thing is this isn’t just with the different scientists at different national laboratories. It’s with vendors. It’s with the universities and people in academia. And so I think it’s been a breath of fresh air. It’s kind of an HPC [high-performance computing] spring, in a way.
It really is far-reaching and cross-cutting, right?
Yeah, absolutely. It consumes all of my life now, but really in a good way. I was just at a meeting with a vendor involved with some new work that they’re doing as part of the ECP PathForward project. I’m going to meetings coming up in Zurich, including one meeting with people from another vendor to talk about some new innovations they’re doing. We’re participating in the C++ standards committee. There’s no end to the kinds of collaborations that we’re doing, and there’s a lot of new work that’s going on that just wasn’t the case before. So it’s really lit a fire under people, I think.
Yeah, the thing that came to mind when you were talking about that was, if I were a researcher, I would think what great experience this is to be involved with ECP.
Yeah. I mean, it’s also fostered opportunities for bringing in more students, so the collaborations with academia have really been rewarding. ECP is funding new things. We have some summer schools and a summer program at the laboratory, and it seems like in the past five years or so, those have really grown. That’s great for Los Alamos too, because we’re actually suffering some attrition from people retiring. They’re not leaving because they want to quit the lab; they’re just retiring, or dying in some cases. We really need to bring in young people, and, you know, they come with an energy that’s hard to describe. It’s given me hope that we can actually accomplish what we need to do.
That sounds very exciting. All right, what is next for FleCSI? Where are things going for your project?
Well, this year we’ve been working on adding to the basic capability. We had a set of requirements we needed to meet for the milestone last September. Since then, we’ve made improvements in hardening the library and the toolkit to make them easier for people to use, and we’ve been working on documentation. But we’ve also been adding a lot of new capability, so we’ve expanded and polished the kinds of data we can represent.
I mean, that’s exciting too because I’m the team lead for our co-design team and for the co-design project. So FleCSI started as part of that co-design project. And co-design really just means having people from different disciplines really work together across a broad spectrum of expertise and understanding. So I think that we’ve really been successful at that. I give a lot of credit to my sort of direct project leads, David Daniel and Aimee Hungerford. They’ve been instrumental in trying to set that up, but what’s happened is that we formed the interdisciplinary teams and we’ve gotten very effective at communicating with each other, which is difficult. So that’s an exciting thing.
And so the way that we develop in FleCSI is we start with something that a physicist wants to do, and there’s an applied mathematician probably in between that physicist and the next person over, who would be maybe a computational scientist, and then we have computer scientists and then maybe even a computer architect. Those disciplines span a pretty broad scope. The requirements we’re getting really are from the physicists and the end users: what they want and need to do to satisfy the stewardship mission and to investigate things like climate change and energy, anything you can imagine. We do a lot of astrophysics as well. But we’ve set up this interaction that’s been very effective at bringing new capability into FleCSI to address the needs of those communities.
It sounds like the needs are coming from a very organic level, so to speak.
Ben, thanks a lot for talking with us. This has been very enlightening.
Okay, great. I was happy to do it.