Let’s Talk Exascale Code Development: EQSIM

Exascale Computing Project · Episode 77

By Scott Gibson


From left: Brian Homerding, Argonne National Laboratory; Houjun Tang, Lawrence Berkeley National Laboratory

If you follow what we do on the Let’s Talk Exascale podcast, you know we explore the efforts of the Exascale Computing Project (ECP), from the development challenges and achievements to the ultimate expected impact of exascale computing on society. With the latest episode, we add a new dimension to the content.

We introduce the first in a special series based on an effort aimed at sharing best practices in preparing applications for the upcoming Aurora exascale supercomputer at the US Department of Energy’s Argonne National Laboratory.

In the series discussions, we’ll highlight achievements in optimizing code to run on GPUs and provide developers with lessons learned to help them overcome any initial hurdles.

This first episode focuses on preparing an earthquake risk assessment application for exascale computing and features Houjun Tang of Lawrence Berkeley National Laboratory and Brian Homerding of the Argonne Leadership Computing Facility (ALCF).

Interview Transcript

Gibson: Houjun and Brian, let’s begin by having you give us a general overview of your work and how it fits into the Exascale Computing Project. First, Houjun, your angle is your involvement in the Earthquake Simulation, or EQSIM, ECP subproject and the SW4 application that EQSIM uses. In terms of SW4, it will be great to hear how it’s used and how it will benefit from exascale computing power.

Tang: The EQSIM project aims at developing and implementing an earthquake simulation and analysis environment to establish a coupled assessment of earthquake hazard and risk. SW4 (Seismic Waves, 4th-order accuracy) is our main software for simulating seismic wave propagation on HPC systems. The simulation results help us address key questions such as how earthquake ground motions vary across a region, how they impact risk to infrastructure, and how realistic, complex incident ground-motion waveforms interact with a particular building. With the upcoming exascale computing power, we will be able to run large models at higher frequencies much faster, which also enables us to run many different scenarios that take into account uncertainties such as the fault rupture and provide more insight into earthquake analysis. For more information on the science aspects of our project, please listen to episode 76, in which our project's PI [principal investigator], Dr. David McCallen, discusses them in detail.

On the code development side, besides making algorithmic improvements such as supporting mesh refinement for curvilinear grids, we have utilized many state-of-the-art software libraries that made it much easier for us to transition toward exascale computing. We have adopted HDF5 for efficient I/O and workflow data management, ZFP for data compression, and RAJA for the portability that allows us to run our code on GPUs with different system architectures at various supercomputing facilities. Brian will share more details of the work to prepare our codes for next-generation systems like Aurora at the ALCF.

Homerding: As was mentioned, I’m working at Argonne towards preparing code for Aurora. Essentially, what we’re trying to do is get ready for the exascale system Aurora, which is a GPU-accelerated system. It’s developed by Intel and HPE. And we’re really focusing on getting applications ready to run and run successfully as quickly as possible when the system arrives.

Gibson: What are the challenges associated with preparing CPU codes to run on GPU architectures? And how does porting codes from CPU to GPU systems compare with porting from CPU to CPU?

Homerding: I think one of the biggest challenges in preparing for executing on a GPU is that they require explicit parallel programming models to enable the heterogeneous execution. From a less technical standpoint, one of the challenges is that you need to decide which kernels need to be executed on the GPU. This is influenced by a few factors, such as the amount of work the kernel is going to be doing and the overhead involved with running on the accelerator. Part of that overhead is getting the data that's required over to the accelerator so that it can run there. And that's also done explicitly with the programming model, where we do memory copies, say, to send data from the host CPU to the GPU. If you're just going from one CPU to another CPU, that's naturally taken care of by the compiler and the system. While the overhead is one consideration, the other part is that you need to determine how much work you have available to do on the GPU. Part of preparing for running on GPUs is to maximize the amount of parallelism you have. This is because GPUs offer a great deal more parallelism than we have on CPUs. So maximizing the amount of parallelism in your code is a very important step. This can be done at the level of how your kernels are executed, but then you can go up to a higher level and start looking at the algorithm you're using. With certain science applications, you may want to use a different algorithm that has a lot more parallelism in it to be executed on the GPU.
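To make the explicit data movement concrete, here is a minimal, hypothetical sketch using SYCL's explicit memory management (one of the models discussed later in this episode); the array name, sizes, and kernel are made up for illustration and are not SW4 code. The two memcpy calls are the overhead that has no counterpart in a CPU-to-CPU port, and they are part of what determines whether a kernel is profitable to offload.

```cpp
#include <sycl/sycl.hpp>
#include <vector>

int main() {
  const size_t n = 1 << 20;
  std::vector<double> u(n, 1.0);               // host data

  sycl::queue q;                               // targets a GPU if one is available
  double* d_u = sycl::malloc_device<double>(n, q);

  // Explicit host-to-device copy: overhead that does not exist CPU-to-CPU.
  q.memcpy(d_u, u.data(), n * sizeof(double)).wait();

  // Offloaded kernel: only worthwhile if it has enough work to amortize the copies.
  q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
    d_u[i] = 2.0 * d_u[i] + 1.0;
  }).wait();

  // Explicit device-to-host copy to bring results back.
  q.memcpy(u.data(), d_u, n * sizeof(double)).wait();
  sycl::free(d_u, q);
  return 0;
}
```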

Gibson: What kinds of tools are available to researchers working to prepare for Aurora and other exascale systems?

Homerding: There are a lot of tools available currently to help researchers and developers prepare their code for Aurora and the other exascale systems. There are the different programming models, some of which offer different levels of support for executing across these different exascale systems. Two of those are Kokkos and RAJA, which are portability layers from national labs. These two sit one level of abstraction above the vendor-specific programming models. And the EQSIM team, as we mentioned earlier, is using RAJA to enable their performance portability, along with its related project, Umpire, for memory management.

Another useful tool that I’ve found for researchers is having access to a proxy application. This is essentially a smaller version of the application that is focused on representing a specific aspect of the full application, often the compute code that you’re going to be looking to run on the GPU. When you’re making a big shift like this to run your code on accelerators, having a small version of your code without complications like external dependencies, I/O, checkpointing, or even big initialization can really simplify this early development so you can focus on the parts you care about. And there are also a lot of useful tools specifically for preparing for Aurora. Intel has its Advisor tool, which can help you identify potential kernels to offload that have enough work to be profitable.

There’s an Argonne-developed tool called iProf for lightweight profiling on Intel GPUs. There’s also, of course, Intel’s VTune, which is very useful for doing an in-depth analysis of your code as it executes. Additionally, there’s Intel’s DPC++ Compatibility Tool, which can assist developers in migrating CUDA code to DPC++ code. You can try this out on Intel’s DevCloud. If you just do a Google search for Intel’s DevCloud, you’ll find it. These are resources that anybody can request access to from Intel to be able to experiment with their upcoming programming models.

Gibson: OK. Great. Let’s focus specifically on RAJA for a bit. Could you tell us more about it, including how it helps prepare or port your code to a GPU-based exascale machine?

Homerding: This is kind of a programming model, but it’s really an abstraction layer over other programming models. This is one of two; you might also have heard of Kokkos. These are both coming out of national labs. And what they do is offer a way to have portable execution that’s also performant across many of these machines that are coming up.

Now, RAJA, specifically, is designed with some goals in mind. They want to enable portability with manageable disruption to your source code. So it’s not that RAJA code is going to run on every kind of device unchanged; it’s that the amount of code you need to change is minimal, and they try to minimize that as much as possible.

It essentially gives you a common interface to make these changes so you can control how your code is going to be executed. Ideally, your kernel stays the same while the code you change just tells RAJA how to execute the kernel. This can include some complex changes such as implementing tiling or loop interchange, and you can do this very simply with RAJA. This kind of control can be very useful for stencil codes like SW4.
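As a rough illustration of that common interface, here is a hedged sketch of a RAJA forall in which the loop body is written once and only the execution policy template argument changes per target; the function, macro names, and block size are assumptions for illustration, not SW4 code.

```cpp
#include <RAJA/RAJA.hpp>

void scale(double* x, double a, int n) {
  // The kernel body is written once...
  auto body = [=] RAJA_HOST_DEVICE (int i) { x[i] *= a; };

  // ...and only the execution policy changes for each target.
#if defined(USE_CUDA)
  RAJA::forall<RAJA::cuda_exec<256>>(RAJA::RangeSegment(0, n), body);
#elif defined(USE_OPENMP)
  RAJA::forall<RAJA::omp_parallel_for_exec>(RAJA::RangeSegment(0, n), body);
#else
  RAJA::forall<RAJA::seq_exec>(RAJA::RangeSegment(0, n), body);
#endif
}
```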

RAJA itself is a project that’s focused on loop execution, and it does not include memory management. RAJA applications can either use the underlying backend’s memory management model directly or use the related project, Umpire, which SW4 is using. Umpire is another abstraction layer that sits on top of these other programming models, providing a single interface to do memory management such as memory copies and allocations. This design is nice because it offers a separation of concerns. RAJA is very much focused on executing your kernel on the GPU, while Umpire is just focused on memory management and moving memory around.
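A hedged sketch of that separation of concerns, assuming Umpire’s built-in "HOST" and "DEVICE" allocators and a CUDA backend for the RAJA loop; the names and sizes are illustrative, not taken from SW4.

```cpp
#include <RAJA/RAJA.hpp>
#include <umpire/Umpire.hpp>

void run(int n) {
  auto& rm = umpire::ResourceManager::getInstance();
  auto host_alloc = rm.getAllocator("HOST");
  auto dev_alloc  = rm.getAllocator("DEVICE");

  // Umpire handles allocation and data movement...
  double* h_x = static_cast<double*>(host_alloc.allocate(n * sizeof(double)));
  double* d_x = static_cast<double*>(dev_alloc.allocate(n * sizeof(double)));
  for (int i = 0; i < n; ++i) h_x[i] = 1.0;
  rm.copy(d_x, h_x);                            // host -> device

  // ...while RAJA only handles kernel execution on the device data.
  RAJA::forall<RAJA::cuda_exec<256>>(RAJA::RangeSegment(0, n),
      [=] RAJA_DEVICE (int i) { d_x[i] *= 2.0; });

  rm.copy(h_x, d_x);                            // device -> host
  dev_alloc.deallocate(d_x);
  host_alloc.deallocate(h_x);
}
```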

The nice thing for preparing for Aurora is that the SW4 team has already ported the code to run using RAJA with some of the existing backends such as CUDA and OpenMP. And that really simplifies the work that we’re going to be doing to prepare for Aurora.

Gibson: What was the level of effort pertaining to porting the code to a new platform using a portability layer?

Homerding: This is what I was speaking to briefly. Essentially, we’re trying to minimize the amount of effort. So for RAJA, when you are moving to a new platform such as the Intel GPU, you need to implement new policies. This is an execution policy, and the way that it is handled in RAJA is that it is a template argument of the kernel execution API.

Essentially, you’re able to take all of your kernels that you want to offload, and in this template argument you can control how they’re going to be offloaded. This allows us to collect all these kernel execution policies and define them in header files, so you can have your CUDA policies and your SYCL policies in separate files and just swap out which one is used for the system you’re targeting.
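A minimal sketch of what such a policy header might look like; the macro names and block sizes here are hypothetical, and the SYCL policy is assumed to follow the same pattern as RAJA’s CUDA policy.

```cpp
// policies.hpp: one place to decide how every kernel is offloaded.
#pragma once
#include <RAJA/RAJA.hpp>

#if defined(EQSIM_USE_SYCL)
  // Target Intel GPUs through RAJA's SYCL backend.
  using DefaultExec = RAJA::sycl_exec<256>;
#elif defined(EQSIM_USE_CUDA)
  // Target NVIDIA GPUs through RAJA's CUDA backend.
  using DefaultExec = RAJA::cuda_exec<256>;
#else
  // Fall back to sequential host execution.
  using DefaultExec = RAJA::seq_exec;
#endif

// Kernels elsewhere stay unchanged; they just use the alias:
//   RAJA::forall<DefaultExec>(RAJA::RangeSegment(0, n), body);
```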

These policies are a collection of template statements that define how to execute a kernel. The RAJA team provides many different statements because they want to expose that control to the developer. This allows for such things as tiling, atomics, reductions, even conditional execution. But to start with the basic ones: we’re going to execute on this backend, and it’s going to be a forall loop over this set of iteration statements. These are the kinds of things you’re going to need to define. As you can tell, it really is separated from the scientific kernel you’re going to be running on the device. It’s really just focused on how to run it.
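For nested loops, those statements compose into a kernel policy. Below is a hedged, host-sequential sketch of the idea, tiling a 2D loop nest; the tile size, array layout, and loop bounds are made up, and a GPU version would swap in the appropriate backend statements.

```cpp
#include <RAJA/RAJA.hpp>

void scale2d(double* a, const double* b, int ni, int nj) {
  // Tile both loops in blocks of 32, then iterate inside each tile.
  using POL = RAJA::KernelPolicy<
    RAJA::statement::Tile<1, RAJA::tile_fixed<32>, RAJA::seq_exec,
      RAJA::statement::Tile<0, RAJA::tile_fixed<32>, RAJA::seq_exec,
        RAJA::statement::For<1, RAJA::seq_exec,
          RAJA::statement::For<0, RAJA::seq_exec,
            RAJA::statement::Lambda<0>
          >
        >
      >
    >
  >;

  RAJA::kernel<POL>(
      RAJA::make_tuple(RAJA::RangeSegment(0, ni), RAJA::RangeSegment(0, nj)),
      [=](int i, int j) {                 // the scientific kernel body
        a[j * ni + i] = 0.5 * b[j * ni + i];
      });
}
```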

The nice thing about this, as I mentioned, is when porting to a new system. When we’re preparing for Aurora, we’re able to begin with some fairly simple execution policies that say, OK, target this to an Intel GPU, and then try to launch it and run it. Then we can go back and optimize, and we can try to do some tiling and introduce some of these more complex loop execution controls to maximize our performance. We were able to get up and running very quickly by just executing the kernel naively.

Gibson: How does RAJA enable kernels to execute on Intel GPUs?

Homerding: RAJA is being developed to run on Intel GPUs using DPC++. That is part of Intel’s oneAPI, and it’s an implementation of the SYCL programming model. Now, SYCL is an open programming standard that’s focused on heterogeneous C++ programming. DPC++ is kind of an overloaded term. It refers to the name of Intel’s implementation of the SYCL standard, with some important extensions that they made to the language. The execution policy statements are set up in a header file. All of these different statements funnel down into the backend as information such as how big your loops are, what kind of block sizes you want to use on them, and what order they should be executed in. All this information gets funneled down into the backend. Then we set up and launch the kernel to execute on the accelerator. For the Intel GPUs, this is going to be launching a SYCL kernel. So, basically, all these policies get turned into information, which then sets up and runs a SYCL parallel_for kernel.
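Roughly speaking, and leaving aside everything RAJA actually layers on top, the information coming out of a policy, a trip count and a work-group size, ends up in a launch along the lines of the following hypothetical sketch; this illustrates the pattern only and is not the actual RAJA backend source.

```cpp
#include <sycl/sycl.hpp>

// Hypothetical helper: launch a 1D loop body with a chosen work-group size.
template <typename Body>
void launch_1d(sycl::queue& q, size_t len, size_t wg_size, Body body) {
  // Round the global size up to a multiple of the work-group size.
  const size_t global = ((len + wg_size - 1) / wg_size) * wg_size;

  q.parallel_for(
      sycl::nd_range<1>(sycl::range<1>(global), sycl::range<1>(wg_size)),
      [=](sycl::nd_item<1> item) {
        const size_t i = item.get_global_id(0);
        if (i < len) body(i);             // guard the rounded-up range
      }).wait();
}
```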

Gibson: Which SYCL features do you find most useful for implementing RAJA to run on Intel GPUs, and what are their advantages?

Homerding: To start with, it’s a fairly standard feature, but it’s very important for RAJA: just having multidimensional nd_ranges. This is because they offer access to and control over the work-groups, and you’re allowed to do things like work-group reductions. It gives you a lot more control, and it enables a lot more complex kernel execution.

There are basic foralls, and they really don’t need all these features; they can be executed a lot more simply using just a regular range. A regular range just says, I have a range of 10; iterate over all 10. An nd_range, by contrast, defines a work-group size and a global size that you’re going to be executing over. It’s very important for RAJA because we’re a library, and we don’t know the kernels that we’re going to be executing; so we need to be able to handle the most complex case while also handling the simple cases. Because of that, we need these nd_ranges; they are important for us.
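To make the distinction concrete, here is a small hedged comparison; q is a sycl::queue, x is assumed to already be device-resident, and n is assumed to be a multiple of the work-group size of 256.

```cpp
// Basic form: just an iteration count, no control over work-groups.
q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
  x[i] *= 2.0;
});

// nd_range form: explicit global and work-group sizes, with access to the
// work-group through nd_item (needed for things like group reductions).
q.parallel_for(sycl::nd_range<1>(sycl::range<1>(n), sycl::range<1>(256)),
               [=](sycl::nd_item<1> item) {
  const size_t i = item.get_global_id(0);
  x[i] *= 2.0;
});
```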

Beyond that, DPC++ is offering us some features which are very useful for RAJA. Among these are extended atomics, so we can have full support for atomics. Also, there’s an extension called unnamed kernel lambdas. This essentially allows for not naming your kernels, which is something that was originally required in SYCL. This is absolutely needed for an abstraction-layer library like RAJA, where we don’t know anything about the kernel we’re going to be executing. Another example feature, and one that is indirectly very important for us, is USM [unified shared memory].

SYCL offers two different forms of memory management, implicit and explicit. With the implicit form of memory management, kernels are defined along with their accessors, which say what memory I’m going to use and how I’m going to use it. That’s very useful for being able to do some complex kernel DAG execution, but it’s difficult to implement in an abstraction library. And, specifically, with RAJA, since we have the separation of concerns and we’re just focused on kernel execution, we don’t even have a memory management module in place.

So the implicit model is just not going to match up with the way RAJA is designed. But USM is essentially a different form of memory management in SYCL that goes back to being very explicit: allocate this memory on the device; copy this from the host to the device; copy this back. This enables us because we essentially execute assuming all the data that we need is already on the device. So, while we’re not using USM directly, we’re expecting the application to make use of it, either directly to manage this memory or through the abstraction layer, the related project, Umpire.
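In a hedged nutshell, the contrast looks like the following; both snippets assume a sycl::queue q and a host std::vector<double> u, and are illustrative rather than RAJA or SW4 code. With the implicit model the runtime schedules data movement from the accessor declarations, while with USM the application does the copies itself, which is the model a kernel-execution-only layer like RAJA, paired with Umpire, can rely on.

```cpp
// Implicit (buffer/accessor): the runtime derives data movement from the
// accessor declarations; awkward for a library that only sees the kernel.
{
  sycl::buffer<double, 1> buf(u.data(), sycl::range<1>(u.size()));
  q.submit([&](sycl::handler& h) {
    sycl::accessor a(buf, h, sycl::read_write);
    h.parallel_for(sycl::range<1>(u.size()), [=](sycl::id<1> i) { a[i] += 1.0; });
  });
}  // buffer destruction writes the results back to u

// Explicit (USM): the application allocates and copies; the kernel can then
// assume its data is already resident on the device.
double* d = sycl::malloc_device<double>(u.size(), q);
q.memcpy(d, u.data(), u.size() * sizeof(double)).wait();
q.parallel_for(sycl::range<1>(u.size()), [=](sycl::id<1> i) { d[i] += 1.0; }).wait();
q.memcpy(u.data(), d, u.size() * sizeof(double)).wait();
sycl::free(d, q);
```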

Gibson: Do you have any lessons learned or advice that may help researchers in preparing their codes for GPU-accelerated exascale systems?

Homerding: If possible, I’d say try to start simple and scale up in complexity. Just make sure to get things running, slowly scale things up, and then start looking at the more important performance issues. Don’t prioritize performance at first, but also don’t completely ignore it, so you don’t hit any big issues later. Beyond that, I’d say the main advice is to start now. There’s a lot you can do currently. At Argonne we have the Early Adopters webinars available on demand, and you can reach them through the Argonne Leadership Computing Facility website. Specifically, go to alcf.anl.gov/aurora. That’s the landing page for Aurora, which will direct you to information about the machine itself along with the Early Adopters series. A lot of this can be very useful when getting started.

Gibson: Again, our thanks to Houjun Tang of Lawrence Berkeley National Laboratory and Brian Homerding of Argonne National Laboratory.

Related Links