Reflecting on the ‘Why’ behind Supercomputing Simulations: Advancing Science

Exascale Computing Project · Episode 100: Reflecting on the ‘Why’ behind Supercomputing Simulations: Advancing Science
Bronson Messer, director of science, Oak Ridge Leadership Computing Facility, Oak Ridge National Laboratory

Bronson Messer is a distinguished scientist and director of science at the Oak Ridge Leadership Computing Facility, a US Department of Energy user facility located at Oak Ridge National Laboratory.

Hi. If you follow the work of the Department of Energy’s Exascale Computing Project, you know what ECP is about: ensuring the necessary pieces are in place for the nation’s first exascale systems. The components are critical applications and an integrated software stack.

This podcast, Let’s Talk Exascale, looks at the impact of ECP and exascale computing from different angles. I’m your host, Scott Gibson.

Bronson Messer has an up-close, expert perspective on computer modeling and simulation for advancing science. He is a distinguished scientist and director of science at the Oak Ridge Leadership Computing Facility, or OLCF. The OLCF is an Office of Science user facility at DOE’s Oak Ridge National Laboratory.

Bronson is also a joint faculty associate professor in the Department of Physics and Astronomy at the University of Tennessee. His primary research interests are related to the explosion mechanisms and phenomenology of supernovae—both thermonuclear and core-collapse. And he is especially interested in neutrino transport and signatures, dense matter physics, and the details of turbulent nuclear combustion.

The OLCF houses Frontier, currently the world’s only exascale computer and, at 1.1 exaflops, its fastest supercomputer. I talked with Bronson on December 7th.

Transcript

[Scott] His role as the OLCF’s director of science is ideal for someone who is passionate about science.

[Bronson] So the director of science is an interesting position. It’s actually terrific for a bit of a science junkie like me. My primary responsibility is to make sure that the facilities that are fielded by the Oak Ridge Leadership Computing Facility are used to actually accomplish groundbreaking science across a variety of disciplines. And that variety extends almost infinitely.

The LCFs in particular are sort of charged with being the highest-end computing destination not only for the nation but, indeed, for anyone in the world who can make use of it. So, for example, our biggest allocation program, the INCITE [Innovative and Novel Computational Impact on Theory and Experiment] program, is open to anyone on the planet who can show that they have a need for and an ability to use the very largest scales of computing that are embodied in our major facilities, our major platforms.

That sort of catholic purview, which I think is sort of unique to supercomputing, gives me an opportunity to learn about a lot of different kinds of science at more than just a pedestrian level. And part and parcel of that is also learning about the culture of scientists from discipline to discipline, which is quite different in a lot of cases. So, that’s been fun as well.

I sort of handle everything from allocations on the machine to promulgating the results after the projects are done or finishing up, and then, longer term, trying to determine the requirements, from an application and scientific point of view, for the next machine that we’ll field.

[Scott] Bronson said ECP’s role has been to bring a fresh way of preparing applications and incorporating efficiencies and capabilities.

[Photo: Bronson Messer, OLCF director of science, with the Frontier supercomputer at DOE’s Oak Ridge National Laboratory]

[Bronson] Ever since we first fielded Titan back at the beginning of the last decade, we’ve had, as part of our project to build new machines, an application-readiness program. We call our particular one CAAR, the Center for Accelerated Application Readiness. I always make this pun, and I don’t apologize for it: CAAR is our vehicle for ensuring application readiness. In a lot of ways, ECP is sort of like CAAR but on steroids.

ECP has allowed a lot of teams to take a completely fresh new look at the way they’re actually doing what they’re doing and at the codes they’ve been working on. They’ve had a nice, long runway and a substantial amount of support to be able to do that. And because of that, we’re going to have codes running on the exascale platforms with efficiencies and capabilities the likes of which, I think, we haven’t seen before compared to the previous generation.

We had a good, big jump when we first did hybrid CPU–GPU with Titan. We saw another good-sized jump when we went to Summit, when we really increased the total number of GPUs per node. But moving to Frontier, there are significant architectural and hardware differences that are going to help—a lot more GPU memory, for example, and a lot more nodes.

ECP has really catalyzed what always wins when it comes to new codes and getting new insights, and that’s algorithms and implementations. It’s really the software that’s going to buy you the biggest gains, and I think ECP has sort of made that manifest in a big way.

[Scott] What about entering the exascale era is most exciting to Bronson Messer?

[Bronson] One of the beauties of supercomputing to me, period, is just how applicable it is to all facets of human inquiry. And I think that the variety of ECP applications and software projects speaks to that. Writ large, though, I think it’s the ability to do simulations with a sort of base level of physical fidelity—a base level of believability, of predictability—and to be able to do those simulations at what I would call human work time scales. Something like a day, or overnight, or a few hours, right? Instead of, for example, weeks. There are lots of multiphysics simulations, for example, that have typically required weeks even on the largest supercomputers.

The ability to do those kinds of simulations on a sort of “I can keep what I’m working on in my head” caching time scale is huge for making real scientific progress. Good examples of this will be the kinds of engineering studies we see in the wind energy projects that are part of ECP, and other engineering projects where engineers who do multiphysics simulations—which are a good part of the portfolio of ECP—really do want that design cycle to happen on partial-day time scales. But they also need the physical fidelity to back that up.

Same thing with, for example, climate modeling. You of course have to have weather and climate modeling that runs faster than the weather actually happens if you want to be able to make predictions. And that’s the kind of thing they’re looking at.

There are a variety of other things. My personal interest is in stellar astrophysics, and the ExaStar project is really going to make it so that we can do some of the most intensive multiphysics simulations in a reasonable amount of time, and therefore not have to do them for a single star but for whole classes of stars. And that’s really important, because that’s really where all of the elements that make us, us come from—not from just single stars but from a whole ensemble of those stars.

So, shortening the time to solution and increasing the physical fidelity of what’s going on—at exascale, that minimum level of physical fidelity has really been raised for everyone to a point where you can talk about quantitative predictability—that is, being able to make a prediction, put a number to it, and expect maybe to be able to measure it later and verify it. To me, that’s one of the most exciting things.

[Scott] Discussions about supercomputing and modeling and simulation often reference the importance of the relationship between those things and experimentation. What’s a reasonable description of that relationship?

[Bronson] This is a great question, and I’ve thought about it quite a bit recently, as a matter of fact. There’s a whole other community—our cousins over in large-scale data analysis and data science, using machine learning techniques on data that’s obtained through other methods. They’re experimentalists. But the folks who do modeling and simulation, for the most part, in every community are thought of as theorists. And there are really three different ‘flavors’ of theory that people who do modeling and simulation worry about a lot.

One is what I would call the kind that has direct, immediate ties to experiment. So, I want to be able to, for example, go up to a beam line at SNS [the Spallation Neutron Source at ORNL], put a sample in the way of the beam, get a measurement, and do a molecular dynamics simulation to see what measurement I should do next, or to explain the result I just got. So, something that’s very tightly coupled to the experiment.

That’s going to be enabled by exascale because we’ll be able to do those simulations, again, at the requisite fidelity, good enough that you’ll be able to provide constructive feedback and take real advantage of that experimental time.

There’s another kind of theory that people do through modeling and simulation where you’re still looking to get the right answer. You still have to match physical reality as measured through experiment, because otherwise you’re just playing a video game. But the connection is a little more distant, and the reason it’s more distant is that the theory is about trying to understand fundamentally what is going on.

Molecular dynamics is a shortcut; molecular dynamics is not what’s really going on in a molecule. It’s an approximation for what’s actually going on. Now, quantum theory is coming in and being joined with molecular dynamics, but at zeroth order, molecular dynamics is strictly Newtonian—so that’s not really what stuff does. But it’s a really, really good, and useful, approximation.
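
For readers who want to see what “strictly Newtonian” means in practice, here is a minimal toy sketch (not any production MD code, and the particle setup and parameters are invented for illustration): pairwise Lennard-Jones forces integrated with velocity Verlet, i.e., nothing but F = ma.

```python
# Toy classical MD step: Newton's F = m*a integrated with velocity Verlet.
# Illustrative only; units and the Lennard-Jones parameters are generic
# textbook choices, not taken from any ECP application.
import numpy as np

def lj_forces(pos, eps=1.0, sigma=1.0):
    """Pairwise Lennard-Jones forces for a small set of particles."""
    n = len(pos)
    forces = np.zeros_like(pos)
    for i in range(n):
        for j in range(i + 1, n):
            r_vec = pos[i] - pos[j]
            r = np.linalg.norm(r_vec)
            # -dV/dr for V(r) = 4*eps*((sigma/r)^12 - (sigma/r)^6)
            f_mag = 24 * eps * (2 * (sigma / r) ** 12 - (sigma / r) ** 6) / r
            f = f_mag * r_vec / r
            forces[i] += f
            forces[j] -= f
    return forces

def velocity_verlet(pos, vel, mass, dt, n_steps):
    """Advance positions and velocities with plain Newtonian dynamics."""
    f = lj_forces(pos)
    for _ in range(n_steps):
        pos = pos + vel * dt + 0.5 * (f / mass) * dt ** 2
        f_new = lj_forces(pos)
        vel = vel + 0.5 * (f + f_new) / mass * dt
        f = f_new
    return pos, vel

# Three particles on a line, slightly perturbed from equilibrium spacing.
positions = np.array([[0.0, 0.0, 0.0], [1.1, 0.0, 0.0], [2.3, 0.0, 0.0]])
velocities = np.zeros_like(positions)
positions, velocities = velocity_verlet(positions, velocities, mass=1.0,
                                        dt=1e-3, n_steps=1000)
print(positions)
```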

With other kinds of theory, you want to be able to predict from first principles what’s going to happen because you want to understand what’s actually going on. A good example of that would be nuclear physics: lattice QCD, for example. You could explain a lot of high-energy physics events without having to resort to lattice QCD, but, actually, if you want to understand what’s going on, you need to understand it at that level.

And then the third kind is a kind of simulation that I mentioned earlier, which is multiphysics simulation. Again, you want to get the right answer, but you could never do an experiment. I can’t blow up a star, right? I don’t want to crash 14 planes to be able to get a final answer. I can do some limited experimental investigations, but really what I want to be able to do is to predict the behavior of singleton, highly exotic, non-controllable observations. And that’s yet another flavor.

So, in all those different cases, there’s a direct tie to experiment, but it sort of depends on what kind of theory you’re doing, how close that tie is. But ECP sort of embodies all three of those kinds of modeling and simulation.

[Scott] We turned our attention to computer hardware and architecture and considered the future.

[Bronson] So I think it’s becoming more and more evident that the primary constraint is power, and we’ve known this for a long time. In fact, that’s a huge part—maybe even the primary reason—that we went to GPU computing, hybrid CPU–GPU computing: to keep the power costs as small as we could get them. And, basically, those power savings we realized through GPUs are what has allowed us to achieve exascale.

That’s not going to go away, and, in fact, it may become more acute in the near future, because for the bulk of my career, it’s been said that Moore’s Law is dead, and for the bulk of my career, that’s not been true. I think it might finally be true—finally.

And so, the gains that we get in just pure, unadulterated processing speed will probably be modest in the coming years compared to what we’ve seen in the past half decade to a decade. So, what does that leave? Well, it leaves something pretty important, actually. What it means is that I still want to be able to do things fast. So, if I can’t do things a lot, lot, lot faster, as a scientist what would I want to do? Well, maybe I want to do things better. I want to be able to get better physical fidelity. What that usually means is memory, more and more memory in machines. Here’s the downside to that: the most power-hungry part of a big computer is the memory. It costs 10 times as much to move an operand from memory to a register on a GPU as it does to actually operate on it. So, then we run into this power wall again, this question of how much energy we’re going to use.
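
To put rough numbers on that point, here is a back-of-the-envelope sketch. The picojoule figures are assumed, order-of-magnitude placeholders rather than measurements of Frontier or any particular GPU; the only point is that kernels which do few flops per operand loaded spend most of their energy moving data.

```python
# Back-of-the-envelope energy accounting for a GPU kernel. The per-operation
# energies below are assumed, order-of-magnitude placeholders, not vendor data.
ENERGY_PER_FLOP_PJ = 1.0            # assumed ~1 pJ per double-precision flop
ENERGY_PER_DRAM_OPERAND_PJ = 10.0   # assumed ~10x more to move one operand from memory

def kernel_energy_pj(flops, dram_operands):
    """Total energy in picojoules for a kernel that performs `flops`
    arithmetic operations and loads `dram_operands` values from device memory."""
    return flops * ENERGY_PER_FLOP_PJ + dram_operands * ENERGY_PER_DRAM_OPERAND_PJ

# Two hypothetical kernels touching the same 3e9 operands in memory:
# a streaming update (few flops per operand) and a blocked matrix multiply
# (many flops per operand, thanks to data reuse in registers and caches).
for name, flops, operands in [("stream-like", 2e9, 3e9), ("GEMM-like", 2e12, 3e9)]:
    total = kernel_energy_pj(flops, operands)
    mem_fraction = operands * ENERGY_PER_DRAM_OPERAND_PJ / total
    print(f"{name}: {total / 1e12:.3f} J, {mem_fraction:.0%} of it spent moving data")
```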

I’m hoping—and I think there’s reason to be optimistic—that there’ll be innovations in memory technology that will allow us to add considerable amounts of memory to future machines and increase their utility for science that way. We’ll also get boosts in speed. We’ll get boosts in scalability, as well. I think we’ll be able to build some bigger machines. But we really need to attack the power problem from a memory standpoint, because memory’s so important.

[Scott] How about the convergence of HPC and artificial intelligence and machine learning applications? What is that going to produce?

[Bronson] I think we’re already seeing how that’s happening, and the way I say it typically is that AI and machine learning techniques are suffusing the whole workflow from the beginning all the way to the end. They’re everywhere.

A decade ago, we talked a lot about a distinction between modeling and simulation and what was called data science. There was a great quote I heard from Thomas Schulthess this one time where he gave a whole talk about data science. And he kicked the talk off with … he said, ‘I’ve been researching this and I’ve been thinking about it for quite a while, and as best I can tell, doing data science is just doing science.’ And I think there is more than just a kernel of truth to that.

So, regardless of whether the data comes from sensors or experiments or from simulation and modeling, the ability to glean insight from it is, I think, what scientists really do. And I think AI and ML for scientists is going to fall into probably a handful of buckets. On the front end, it’ll be the ability to do what I call design of experiments; that is, given the whole set of parameters that I could possibly study with a large number of simulations, I use AI and ML to tell me where I should walk in that parameter space first to try to get the best set of answers.
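
To make that idea concrete, here is a minimal sketch of model-guided sampling, with `expensive_simulation` and the single parameter invented for illustration (this is not an ECP tool): a small ensemble of cheap surrogates is fit to the runs done so far, and the next expensive run is placed where the surrogates disagree the most.

```python
# Sketch of ML-guided "design of experiments": fit several cheap surrogate
# models to the runs done so far, then place the next expensive simulation
# where the surrogates disagree the most. `expensive_simulation` is a
# hypothetical stand-in for a real multiphysics run over one parameter.
import numpy as np

def expensive_simulation(x):
    # Placeholder for a real simulation; here just a bumpy 1-D function.
    return np.sin(3 * x) + 0.3 * x ** 2

# A handful of initial runs spread across the parameter range.
X = list(np.linspace(0.0, 3.0, 5))
Y = [expensive_simulation(x) for x in X]
candidates = np.linspace(0.0, 3.0, 301)

for step in range(5):
    # Cheap surrogate ensemble: polynomials of different degrees.
    preds = [np.polyval(np.polyfit(X, Y, deg=d), candidates) for d in (1, 2, 3)]
    disagreement = np.std(preds, axis=0)

    # Run the next expensive simulation where the surrogates disagree most.
    x_next = float(candidates[np.argmax(disagreement)])
    X.append(x_next)
    Y.append(expensive_simulation(x_next))
    print(f"step {step}: next run at x = {x_next:.2f}")
```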

Then, during the runs themselves, especially in multiphysics simulations, there are a lot of places where I can use surrogate models to replace what are called sub-grid models; that is, to stand in for physics and physical processes that I can’t resolve with my simulation code.
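
And a similarly stripped-down sketch of the surrogate idea itself, with the “expensive” closure and the toy evolution equation invented purely for illustration: a cheap fitted model stands in for a costly sub-grid calculation inside the time-stepping loop.

```python
# Sketch of a surrogate replacing an expensive sub-grid closure inside a solver
# loop. The closure and the simple decay equation are invented for illustration;
# real ECP codes use far more sophisticated models.
import numpy as np

def expensive_closure(t):
    """Pretend this is a costly sub-grid calculation (unresolved physics)."""
    return np.exp(-t) * np.cos(5 * t)

# Offline: sample the expensive closure and fit a cheap polynomial surrogate.
t_train = np.linspace(0.0, 2.0, 50)
surrogate_coeffs = np.polyfit(t_train, expensive_closure(t_train), deg=8)

def surrogate(t):
    return np.polyval(surrogate_coeffs, t)

# Online: call the surrogate inside the time loop instead of the expensive model.
u, dt = 1.0, 0.01
for n in range(200):
    t = n * dt
    source = surrogate(t)              # was: expensive_closure(t)
    u = u + dt * (-0.5 * u + source)   # toy evolution equation with the closure term

print(f"u(t=2) with surrogate source: {u:.4f}")
```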

Then at the end, AI and ML is going to shine, right, because being able to look at the large data sets produced by very large simulation and modeling runs and help a scientist discern what’s actually there is going to be the really interesting thing. And there’s a ton of tools that can be brought to bear that are used in lots of different fields. Now, we’re not interested in the kind of AI and ML implementations that, for example, notice that somebody bought a vacuum cleaner and suggest they buy some bags to go along with it. That’s not quite the flavor that we want. But there are enough places where we’re going to stick it into the workflow that I think it’s going to be from soup to nuts. The whole enterprise is going to be sort of suffused throughout by this. I think that’s actually really, really good news. I think you can get excited practitioners for all of that.

[Scott] Where does quantum computing factor into high-performance computing?

[Bronson] The truth of the matter is I don’t know, and I don’t think anybody knows. That’s probably a safe thing to say. Everybody’s very excited about the possibility of quantum computing. I think there already is a small handful of problems that it’s obviously incredibly well suited for; the current ones are often cryptography and things like that. But there’s another class of problems, scientific problems, that it’s probably pretty well suited for as well.

One, for example, that I’m very interested in is how neutrinos power exploding stars. We now know that because neutrinos have mass, they also change flavor, and that process is purely a quantum mechanical process. It would be really, really hard to model on a classical computer. But on a quantum computer, it could be very straightforward to model.

And so I can imagine having a quantum sidecar to a classical computer where I solve the equations of fluid motion and energy production and nuclear transmutation of elements alongside a quantum computer solution to neutrino flavor mixing.

I think the days of a fully general-purpose quantum computer are probably pretty far off. I don’t think it’ll be in my career. I have hope that by the end of my career we will see small-scale sidecars like I just referred to—small quantum computers connected directly to classical computers that you can offload some work onto. That’s going to require quantum computers becoming less like a molecular-ion experiment that lives in a chemistry or condensed matter lab and more like a computer part, because that’s the only way we’re going to be able to get the noise down.

The whole problem in quantum computing is that the noise from the environment the quantum computer lives in tends to swamp any of the operations that are actually being carried out by the quantum computer. So, maybe things like topological qubits, making qubits out of silicon rather than out of an ion trapped in a magnetic field in somebody’s lab, are going to be the central breakthrough that enables that, I think.

[Scott] Bronson shared his perspective concerning ECP-developed products that are already making a difference or likely will across the research community.

[Bronson] I think on the software technology side, ECP had an appropriately large tent to start with and made sure it had a good variety of projects: some that were new and somewhat speculative, but also a mix that continued and enhanced the development of things that are very much bread and butter when it comes to HPC for modeling and simulation.

I mean, projects like PETSc, HDF5, Open MPI, and a variety of others are absolutely bedrock for the kinds of things we do. And making sure that those actually run well on exascale computers and beyond is key. I think those have already shown their value, even in the early days on Frontier. And the expertise that’s been developed has also been important for us to tap, for example, when we test certain things.

As for the application codes themselves, we’re starting to see some results from them on Frontier. We’ve had a handful of ECP application development teams try their hand at Frontier. A couple already have preliminary numbers indicating that their figures of merit have increased by the requisite amount, and they’re very excited to be able to carry out longer-term scientific runs with those codes.

I think the real legacy of ECP is going to be encased in those application codes, which to some extent, we hope, have been future-proofed; they’ve certainly been re-engineered to a place where they should be extensible in the future. That thinking about code management, scientific code development, and scientific software engineering is, I think, also a major legacy of ECP.

[Scott] What exciting things are on the horizon for the Frontier machine in the new year?

[Bronson] In the new year, I think one of the most important and exciting things will be the full production runs using ECP application codes to attack scientific problems that haven’t been attacked before. They’re going to range from relatively short runs that use a big chunk of the machine but are done many times over, to runs that use a big chunk of the machine for a nice long period of time to do something that nobody’s ever done before. Both of those dimensions are exciting to me. So, I’m really looking forward to seeing how that’s going to work out. I’m sure there are going to be surprises—there are always surprises at scale the first time you do things, and I don’t think this will be any different in flavor, if you will, from our previous experience, and maybe even a little richer.

[Scott] And now, an overview of what’s going to be possible on Frontier. What will be the first apps to run?

[Bronson] Yeah, so I don’t know about first up, but we are going to be able to do, for example, things like whole-device modeling for fusion reactors. That’s going to be a really demanding challenge problem, something that, of course, is absolutely key to advancing the science that ITER is going to do in the search to harness the power of the sun here on Earth in the form of a fusion reactor. And again, that’s a case where people have worked on that problem for a long, long time. And it’s like so many other problems. This is something I didn’t mention earlier: people often ask me, ‘Is there a single killer app for supercomputing and for exascale computing?’ and I think there’s not. That’s the beauty of it.

But there is this one physical thing that I think exascale is going to be able to help us with, and that thing is turbulence. Turbulence is the last great classical physics problem. There’s this apocryphal tale, which is absolutely false, but it’s a great, pithy quote. It was said that Werner Heisenberg once said that when he got to heaven, he was going to ask God two things: ‘Why relativity, and why turbulence?’ And he actually thought He’d have an answer for the first one. It’s remarkable how ubiquitous turbulence is in our lives and in science, and how much exascale computing is going to help us attack it.

We’re still going to be a long way away from being able to resolve all turbulence scales, but we’re going to get close in some places where connections are starting to be made. And turbulence matters in, for example, fusion reactors. It’s what drives our weather and climate—turbulence in the atmosphere and in the ocean. It’s responsible for so much of what makes things like wind turbines or other machined engineering parts hard to manage or in need of extra engineering to go along with them.

It’s absolutely essential to the way stars end their lives and actually blow up in supernovae. It’s found at all scales, and it’s something that can only be sort of grokked, or understood, through computing. And that’s why I think exascale computing and ECP codes are going to really shine in that particular regard.

So again, that includes whole-device modeling for fusion reactors. I think we’re going to be able to do models of the climate system at a level of fidelity we haven’t been able to achieve before. I think we’re going to be able to do fully coupled modeling of fission reactors as well, at a level we haven’t been able to achieve before, which could really help with a renaissance of nuclear power in this country and in others as we march toward a future where energy is going to become more and more important.

It’s still going to be a while before we get fusion energy, so we probably still need to take a long look at fission as a possible source of energy. Having really, really reliable predictability for that kind of machinery is, of course, important, and I think exascale allows us to do that by being able to solve, all at once, all the pieces and parts that go into a fission reactor.

I think we’re also going to be able to see, pretty early on, what the ExaSky project’s addition of what’s called baryonic matter is going to do. If you’ve ever seen one of these simulations of the so-called cosmic web—the way matter sort of clumps and forms structure—what you’re really looking at is a visualization of dark matter. You wouldn’t actually be able to see that. The over-densities and the blobs are these so-called dark matter haloes that quote, unquote real matter falls into, eventually forming galaxies. Well, the ExaSky project is adding that baryonic, or non-dark, matter, the stuff that glows, to their simulations, and they’ll be able to trace where the stuff that we can actually see ends up. That kind of matter acts a little differently than dark matter; that’s the whole point. Being able to do that all at once is going to be a big-time advance as well.

[Scott] Our thanks to Bronson Messer for joining us on Let’s Talk Exascale.

And thank you for listening. Visit exascaleproject.org. Subscribe to ECP’s YouTube channel—our handle is Exascale Computing Project. Additionally, follow ECP on Twitter @exascaleproject.

The Exascale Computing Project is a US Department of Energy multi-lab collaboration to develop a capable and enduring exascale ecosystem for the nation.

 

Scott Gibson is a communications professional who has been creating content about high-performance computing for over a decade.