Let’s Talk Exascale Code Development: HACC
By Scott Gibson
This is episode 79 of the Let’s Talk Exascale podcast, where we explore the efforts of the Department of Energy’s (DOE) Exascale Computing Project (ECP)—from the development challenges and achievements to the ultimate expected impact of exascale computing on society.
And this is the second in a series of episodes based on an effort aimed at sharing best practices in preparing applications for the upcoming Aurora exascale supercomputer at the Argonne Leadership Computing Facility (ALCF).
The series is highlighting achievements in optimizing code to run on GPUs and providing developers with lessons learned to help them overcome any initial hurdles.
The focus in the current episode is on a cosmological code called HACC, and joining us to discuss it are Nicholas Frontiere, Steve Rangel, Michael Buehlmann, and JD Emberson of DOE’s Argonne National Laboratory.
Our topics: what HACC is and how it’s applied, how the team is preparing it for exascale and for the upcoming Aurora system, and we’ll cover more.
Interview Transcript
Gibson: Nicholas, will you start us off by telling us about HACC, including what it’s currently used for?
Frontiere: Yeah, so HACC stands for Hardware/Hybrid Accelerator Cosmology Code, and as its name implies, it’s a cosmology N-body and hydrodynamics solver that studies the structure and formation of the universe evolved over cosmic time.
The code itself consists of a gravitational solver, which was later modified to resolve gas dynamics, and it includes astrophysical subgrid models that emulate galaxy formation. One important characteristic, I would say, in addition to the physics is that HACC was designed from the beginning to run at extreme scales on all high-performance systems operated by DOE labs and continue to do so as new hardware emerged. So, in fact, HACC began development, I think, over a decade ago when the first petascale machine came online, called Roadrunner, at Los Alamos. And since then, we’ve optimized the code to run at tens and hundreds of petaflop scales leading into the exascale computing effort that we’re currently preparing for.
Gibson: OK. This seems important: how will HACC benefit from exascale computing power?
Frontiere: That is a good question. One important element to consider in that respect for us is that there’s going to be cosmological measurements being made now and by future observational surveys. Upcoming instruments like the Vera Rubin Observatory, for example, will be making extremely precise measurements, which in turn, require commensurate theoretical predictions from codes like HACC. And exascale computing, in particular, will play a large role in meeting the resolution requirements for those predictions.
I would say that since the machines are getting a significant boost in computational performance—roughly an order of magnitude, I believe, at this point—it allows us to include a lot more physics into our simulations at these immense volumes, which then, in turn, drastically increases the fidelity of our predictions. And none of that would be possible without exascale computing. So it’s actually quite exciting for us that we’ll be able to run these simulations with the new computers that are going to come out.
Gibson: Nicholas, what is the team doing to prepare HACC for exascale?
Frontiere: HACC has two ALCF Early Science projects, and it is the primary code for the ECP project called ExaSky. And all of those activities have a goal to make sure that the code is performant on the upcoming exascale machines, namely Aurora and Frontier. These projects allow us to access early hardware and software compiler stacks to ensure that the code is optimized for each new hardware and that we’re ready basically to run on day one of launch for each machine. Since HACC has a long history of optimization for a variety of supercomputers, we have a lot of experience in what preparations we need to make to get the code ready.
Gibson: HACC has been developed to run on diverse architectures across DOE computing facilities. How have the code’s earlier portability efforts helped in preparations for DOE’s upcoming exascale systems?
Frontiere: It’s a very good point that HACC essentially was designed to emphasize performance over, I’d say, strict portability, yet it achieved both with tractable effort, which is pretty interesting. We accomplish this by combining a long-range gravitational spectral solver, which encapsulates most of our global communication within the code. And we couple it to a limited number of computationally intensive short-range force kernels, which are node local, in that they don’t require communication. And if these kernels were a majority of the time spent within the solver, that decomposition that we chose since the beginning of HACC has the advantage that the spectral solver already efficiently scales to millions of ranks and is written in MPI, so that’s already portable to the new machines that we get access to. And, as such, all of our performance efforts can be restricted to optimizing the limited number of short-range force kernels for whatever hardware we’re running on since they contribute to the majority of our time to solution.
This strategy we’ve taken, I would say, has a number of interesting advantages. The first is that we don’t have any external library dependencies from portability frameworks to worry about. So we don’t need to wait on any specific machine support for other codes. And, secondly, we have the flexibility to rewrite these kernels in whatever hardware-specific language is required to achieve the best performance. So that could be CUDA, OpenCL, DPC++, whatever the machine requires. And because there are only a small number of kernels to rewrite, that actual performance and optimization effort is quite manageable, without having to rewrite the whole code.
One other additional benefit is that because the kernels are node local, we can optimize on small early-hardware machines ahead of time, which essentially means that day one when a new machine drops or the supercomputers come online, we can immediately perfectly weak scale with the spectral solver and run state of the art out of the gate.
So this strategy’s been really successful in the past for a ton of different machines with quite different hardware. And, I would say, subsequently, we’re pretty confident that with the same strategy we should be able to optimally run on Frontier and Aurora using the same approach.
Gibson: Thanks, Nicholas. Steve, what is the team currently doing to get ready for Argonne’s Aurora system?
Rangel: Well, with respect to code development, the focus for our effort for Aurora has been porting and optimizing kernels for the solvers. This is for both the gravity-only and the hydro versions of HACC. So, as Nick mentioned, because of HACC’s structure, this work is at the node level for Aurora. It just means we’ll be targeting the Intel GPUs.
I should note Aurora isn’t the first GPU-accelerated system HACC will run on. The gravity-only version of HACC ran on Titan at Oak Ridge with an OpenCL implementation and later with a CUDA implementation. And the hydro version is being developed using CUDA, also running at Oak Ridge on Summit.
Very generally speaking, GPU architectures share a lot of computational characteristics. So, across manufacturers—AMD, NVIDIA, Intel—the underlying algorithms require only a little bit of change for us to begin optimizing. For Aurora, this lets us focus on evaluating the programming models, testing the SDK [software development kit], getting familiar with the Intel GPU hardware, and trying to develop a porting strategy that would ease maintainability across our different exascale implementations.
The work to evaluate the programming models really began in earnest in June of 2019 when we had our first Intel–Argonne hackathon. During this time we were able to get our existing gravity-only code written in OpenCL, originally written for Titan, running on the first Intel GPU hardware that was in our test bed, with very little modification. We were able to rewrite the gravity-only kernels in DPC++ and OpenMP, and we conducted some preliminary tests. Subsequently as the SDK matured and more representative hardware became available, we were able to conduct more extensive evaluations, and, ultimately, we chose DPC++ as a programming model for HACC on Aurora.
So in preparation for our second hackathon with Intel. This time we were going to focus more on the hydro code. We began exploring porting strategies because the hydro code involved significantly more effort, being that there are more kernels, and the kernels themselves are more complex.
Since we had a CUDA version of the gravity code, we began testing the DPC++ compatibility tool to migrate code from CUDA to DPC++. We finally converged on a semi-automatic strategy where we used the compatibility to migrate individual kernels, and we manually wrote kernel calling code and code for explicit data management between the host and the device using these newly available features for unified shared memory in DPC++. So in this way, at least from the perspective of host code, the data movement and device kernel calls looked remarkably similar to our CUDA implementation, hopefully getting us a little closer to a more manageable code base, and although this work is not completely finalized, we think it’s going in the right direction.
Lastly, we still have this ongoing collaboration with Intel where we’re optimizing the performance of the code on the new hardware assets becoming available.
Gibson: Thank you, Steve. We’ll pivot over to Michael and J.D. for this question. Michael, please begin a discussion about the next steps.
Buehlmann: So maybe I can start answering this question with talking a little bit about the physics that we’re adding to HACC.
As Nick already mentioned earlier, observational surveys that are currently being done and will be done in the future, they become both wider and deeper in scale but also much more precise. For us, from the numerical modeling side, we have to keep up with that development to get most of that data that is provided from these surveys. That means that we have to simulate both a very large volume of the universe as well as include all the physical processes that are important and affect the properties of the galaxies that we observe in those surveys.
To give some perspective, in the past it was often sufficient to treat the universe as if it only consisted of dark energy and dark matter, and that of course simplifies doing numerical simulations by a lot, because we only have to consider gravitational forces in an expanding background as the universe evolves.
Now with more and better observational data, the effects of the baryonic content of the universe could become more and more important, and they also open an exciting new window to study these astrophysical processes.
However, with baryons, there are a lot more physical properties attached than just gravity. So for example, baryonic gas can cool and heat due to radiation. Especially in the high-density regions, cooling becomes very efficient so that stars and galaxies can form there. So we also have to model star formation. And those stars, in turn, will heat up the surrounding gas and also change the composition of the baryonic gas and the environment as heavier elements are formed in the stars.
And at the center of massive galaxies, we find supermassive black holes, which accrete the baryonic gas and eject enormous amounts of energy that has to be accurately modeled to explain how the center of galaxies look like.
Currently, we are working on incrementally adding these processes and validating each as we add them. And I think JD will talk a little bit more afterwards about how we do this validation.
The scale of most of these baryon processes is well below the scales that we resolve in our simulation. For example, the mass resolution that we typically have is 106 to 109 solar masses. So we cannot follow the evolution of individual stars. Instead, we follow an entire population and model how it evolves.
One problem with these kinds of models is they have some free parameters, and those parameters are often resolution dependent. So these models are not resolution converged and have to be re-tuned for every simulation run.
With HACC in the future, we’re planning to go a slightly different route. With HACC, unlike other cosmology codes, we’re really focused on large-scale simulations. And we are, therefore, more interested in treating galaxies as a whole instead of resolving galaxy substructures and individual star-forming regions within that.
We are working on a new sub-grid model with what we call super particles or galaxy particles that will be more suitable for us and for the kind of simulations we want to do. And this will both help us to speed up the simulation as well as get more stable and converged results.
This new technique, together with the performance leap in the exascale machines, will allow us to run the largest multi-species simulations that have been run so far.
Emberson: So one thing I’ll add is that an important consideration that we need to make with all of our development work is that we thoroughly understand the error properties as well as the numerical convergence properties of these developments. And there are basically two main ways that we are addressing this topic.
The first is to consider how our code changes impact the gravity solver; that’s the part of the code that already existed beforehand. And the second is to consider the newly added hydro solver itself. With regards to the first point, as mentioned by Michael earlier, when we go from simulations that include only gravity to simulations that include both gravity and hydrodynamics, one of the fundamental differences is that the code now has to evolve two separate particle species. This includes both the dark matter, which interacts only via gravity, as well as baryons, which have both gravity and hydro interactions.
Now, in traditional gravity-only simulations that have only the dark matter component—so only one particle species—there are various numerical artifacts that are known to exist due to the discreteness of the problem. That is, since we’re using a set of discrete macroscopic particles to represent the continuous matter field of the universe, it’s easy to see that the numerical solution we obtain will deviate from the continuum limit when we approach the particle scale in the simulation. And knowing how to handle these errors and restrict them to small scales is something that’s well understood in the single-species case. However, in the case where we have multiple particle species interacting with each other, it’s possible that, (a) the numerical issues that we see in the single-species case become amplified due to the fact that we have particles with different masses and/or (b) that having two species introduces completely new numerical issues that do not have an analog in the single-species case. It turns out that both of these statements are true, and therefore a large part of our development work is to thoroughly investigate this topic and figure out the best ways to handle any numerical issues associated with having multiple particle species and make sure that those errors are confined to small scales and that the code behaves in a numerically converged and robust way.
The second part of this work is to understand the systematics associated with the hydro solver itself. The hydro solver in HACC is based upon a novel Lagrangian particle-based method, which is actually the first of its kind to be used in cosmological simulations. As such, we can expect that our hydro simulations produce slightly different results compared to other cosmological codes that have their own different hydro methods with their own different internal free parameters.
For instance, the companion code, Nyx, which is also part of our ECP project, uses a fundamentally different approach to solve hydrodynamic forces, namely a Eulerian mesh-based method. And so as part of our ECP efforts, we have begun an extensive comparison campaign between HACC and Nyx that serves not only as an internal verification and validation pipeline for our development work but also enables a more rigorous investigation of the convergence properties of each code.
And so to do this comparison properly, we start by comparing the codes in the simplest case of gravity-only dynamics and then systematically work our way upwards to increased complexity with all of the full range of sub-grid models we want to implement. Performing careful comparisons in this manner, between the two codes which have fundamentally different hydro schemes, we’re able to answer the question of what range of length and time scales we can expect cosmological codes to agree at a given level of accuracy.
Gibson: Thanks, JD, Michael. To close, let’s go back to Nicholas. Finally, do you have any lessons learned or advice that may help researchers in preparing their codes for GPU-accelerated exascale systems?
Frontiere: Yeah, so, specifically, GPU-accelerated systems, I think there were a number of good lessons we learned that others might find valuable.
The first is the standard advice you normally hear, which is that you want to reduce the memory management complexity and increase your data locality as much as possible. Essentially, attempting to take advantage of the brute force nature of GPUs, I should say. And this is because many applications over the years have implemented a variety of algorithms that were selected to optimally reduce the order of computation, essentially make the code as efficient as possible. For example, tree-based algorithms are a classic example of that. But people typically have issues translating that work to GPUs because they depend on complicated pointer chasing data structures, for example, and non-contiguous memory accesses.
And so rethinking which algorithms can be used and even potentially resorting to more computationally expensive but simpler algorithms can actually greatly improve your performance on GPUs. That’s like the standard advice people often give when thinking about GPUs. But to go beyond that, I would say we’ve also discovered that you, in general want to write your algorithms to be what I call GPU resident as opposed to GPU accelerated. And what I mean by that is, typically, people are taught that when you’re accelerating your application, you want to pipeline everything so you accelerate a bunch of tiny kernels, which are all reading and writing and computing asynchronously at the same time and then you’re trying to pipeline everything to avoid latencies, and I think we’ve actually conversely found that simple kernels with large—and even serial—data transfers to the GPU, where the data just resides and is batch computed, can not only achieve peak performance but also circumvent a lot of problems, potential hardware and software concerns that arise when you have new things that drop. So every time you get a new GPU or a machine comes online, you always expect there to be hardware and software concerns. And almost all of them immediately impact you when you take the traditional accelerated strategy as opposed to opting for some more simpler GPU-resident algorithms.
I would say that essentially making your data layout as simple as possible and doing these bulk data transfers to a device and computing away and repeating, if necessary, can be essentially a simple strategy to implement and maximize performance.
In modern GPUs, if you look at the memory capacity, it’s growing significantly. So I think intuitively for most scientists, it should make sense that your data or a large fraction of it should actually reside on the GPU as long as possible instead of trying to stream and pipeline it as fast as possible.
And so for those reasons, GPU-resident algorithms, I would say, should really be considered and I think are a good recommendation.
Gibson: Thank you all for being on the podcast. Again, our guests have been Nicholas Frontiere, Steve Rangel, Michael Buehlmann, and JD Emberson of Argonne National Laboratory.
Related Links
HACC: https://cpac.hep.anl.gov/projects/hacc/
Get Ready for Aurora (Aurora Documentation): https://www.alcf.anl.gov/support-center/aurora
oneAPI DPC++: https://software.intel.com/content/www/us/en/develop/tools/oneapi/components/dpc-library.html