Helping Large-Scale Multiphysics Engineering and Scientific Applications Achieve Their Goals

Exascale Computing Project · Episode 89: Helping Large-Scale Multiphysics Engineering and Scientific Applications
Siva Rajamanickam of Sandia National Laboratories


Hi. This is where we explore the efforts of the US Department of Energy’s (DOE’s) Exascale Computing Project (ECP)—from the development challenges and achievements to the ultimate expected impact of exascale computing on society.

In this episode, we take a look at Trilinos, a collection of reusable scientific software packages with a long-standing history of supporting multiphysics and engineering applications. Within ECP, Trilinos is valuable to DOE’s Advanced Scientific Computing Research (ASCR) program, the Office of Science, and the National Nuclear Security Administration’s (NNSA’s) Advanced Technology Development and Mitigation (ATDM) subprogram. For example, Trilinos recently helped two NNSA ATDM applications reach an important milestone. Trilinos is essential to ECP’s ExaWind wind farm simulation project and is part of several other ECP projects. In the Software Technology research focus area, a new project called Sake—Solvers and Kernels for Exascale—focuses on Trilinos.

To discuss the Trilinos efforts within the ECP, we have Siva Rajamanickam of Sandia National Laboratories.

Our topics: A general summary of Trilinos; a little bit about its capabilities; its use in scientific and engineering applications; the aspect of portable, parallel execution; Trilinos structure; some project successes; and more.

Interview Transcript

Gibson: Let’s start with some very high-level information about Trilinos for those who aren’t very familiar with the project or don’t know anything about it at all.

Rajamanickam: It’s a large software framework. The focus for the software framework is to solve problems faced by large-scale, multiphysics engineering and scientific applications. And within that—going one step further—our focus is on large-scale simulations on HPC [high-performance computing] architectures that are of interest to DOE.

Trilinos as a project has been around for 20-plus years, so it is not new. We’ve been fortunate to have funding from different sources within DOE. I work at Sandia, so NNSA’s ASC [Advanced Simulation and Computing] program has funded Trilinos for a long time. The ASCR and SciDAC [Scientific Discovery through Advanced Computing] programs have contributed to Trilinos development, and the Sandia LDRD [Laboratory Directed Research and Development] program has contributed as well.

In summary, this is a software framework that has been developed for the last 20 years with funding from different sources, but the functionality itself has been developed with the goal of helping our typical multiphysics engineering codes and scientific applications achieve their mission goals and science goals. The software is on GitHub; anyone can go get it. It is completely open source. And the purpose of this podcast is to talk about what we are doing in the ECP project itself. Hopefully, that gives a broad overview.

Gibson: Yeah, that’s great. That’s very helpful. So, with that overview, let’s move to some high-level details. Will you share with us about the capabilities of Trilinos?

Rajamanickam: Of course. I can describe the capabilities of Trilinos at a very high level. Let’s say you wanted to run on a large parallel computer—say, you want to solve a linear system—then the first thing you would need is what we call a distributed linear algebra. So, [in Trilinos] we have what we call data services products that provide the ability to create matrices or vectors on many different compute nodes on a large supercomputer.

The next thing you’d like the capability to do is solve the linear system—solve the equation that you want to solve. Trilinos provides linear solvers as a product for that. And then there are nonlinear solvers, discretizations for a mesh, meshing capabilities. And when you run on these parallel systems, you also have several load-balancing issues, so we have tools for better load balancing. We also have different kinds of solvers that are adapted for different applications. For example, you would not use the same solvers for a climate simulation as you would for a circuit simulation. Trilinos has different solver capabilities [for each of these cases].

It is a collection of data services or linear algebra data structures for a parallel computer, linear solvers, nonlinear solvers, optimization libraries, load-balancing capabilities. Trilinos is all of these put together—it’s a collection of all of these.
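To make the linear-solver capability concrete: Trilinos’s actual solvers are distributed C++ libraries with preconditioning and many algorithm choices, but the kind of iterative Krylov method such a library provides can be sketched in a few lines. This pure-Python conjugate-gradient sketch uses a dense matrix and made-up names purely for illustration; it is not Trilinos’s API.

```python
# Minimal conjugate gradient (CG) for a symmetric positive-definite system.
# Illustrative only: real Trilinos solvers are distributed, preconditioned
# C++ libraries; this dense, serial sketch just shows the algorithm's shape.

def cg_solve(A, b, tol=1e-10, max_iter=1000):
    n = len(b)
    x = [0.0] * n
    r = b[:]                      # residual r = b - A*x (x starts at 0)
    p = r[:]                      # search direction
    rs_old = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rs_old / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new ** 0.5 < tol:   # converged: residual norm below tolerance
            break
        p = [r[i] + (rs_new / rs_old) * p[i] for i in range(n)]
        rs_old = rs_new
    return x

# A small SPD system: a tridiagonal 1D Laplacian-like matrix.
A = [[2.0, -1.0, 0.0],
     [-1.0, 2.0, -1.0],
     [0.0, -1.0, 2.0]]
b = [1.0, 0.0, 1.0]
x = cg_solve(A, b)
```

In a production setting, the matrix and vectors would be the distributed objects the "data services" products provide, and the solver would be one of many interchangeable Krylov methods.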

Gibson: How is Trilinos used to help scientific and engineering applications?

Rajamanickam: As you can see, [with Trilinos] being around for 20 years, you have many examples. But I’ll give you some recent examples related to ECP. Within ECP, Trilinos is used on the ASCR/Office of Science side of things, and it also gets used in NNSA’s ATDM work.

Let me give a very recent example. There are two NNSA ATDM applications: SPARC and EMPIRE. They were going after what we call an L1 milestone, similar to our KPPs [key performance parameters] on the ECP side of things. In order for them to reach their milestone goals, Trilinos had to meet certain requirements on how fast our solvers have to be. We were able to help EMPIRE, which is an electromagnetic code at Sandia, [and] SPARC, which is a CFD [computational fluid dynamics] code. Both of them achieved their milestone goals last December.

The next one I would like to point out is ExaWind. This is an ECP application itself, and Trilinos is an integral part of ExaWind, along with other solvers. So, whatever ExaWind’s requirements are—say, when they want to run on Summit or Sierra, or on the upcoming machines Frontier or Aurora—Trilinos has to adapt and help them achieve their KPP goals. We’ve been able to help them so far, and we hope to help them on the upcoming systems as well.

The final example I will give you is climate simulations. For example, if you want to simulate ice sheet problems, multigrid solvers are very important for solving these problems, and Trilinos has been used heavily for those kinds of codes.

These are some examples. There are several other examples you can choose from. I just picked something off the top of my head.

Gibson: Okay. Will you discuss Trilinos relative to portable parallel execution?

Rajamanickam: ECP by itself and the exascale architectures that are coming up in the next few months to a year have essentially made it necessary for us to focus on performance portability. What I mean by that is that the code you develop should be able to run on current architectures, say, NVIDIA GPUs, and it should be able to run on the Intel GPUs and AMD GPUs that are coming, with minimal changes. And you should be able to achieve good performance on all of them. You don’t want to sacrifice performance. Our focus has been on performance for a long time.

Performance portability has taken a front seat because of these changes [on the architecture side recently]. But [for] Trilinos as a whole, performance portability is not completely new. The Kokkos ecosystem that you see today includes the Kokkos Core library, the programming model, and the Kokkos Kernels library, which provides linear algebra and graph algorithm implementations. This is the current incarnation of a long line of research on portable parallel execution. There were two versions of Kokkos before the current version we see today.

The first two versions of Kokkos were part of Trilinos, and even the third version of Kokkos was part of Trilinos when it [Kokkos] started. We spun Kokkos off by itself because we saw a very important use case going beyond Trilinos itself. Kokkos was originally designed to make Trilinos portable. Now that we have taken Kokkos out, it is also helping several applications write directly to Kokkos itself. So, in some sense, portability is not new to us. We’ve been working on this for a while, and the current work is completely dependent on Kokkos. Trilinos uses Kokkos Core, the programming model, and Kokkos Kernels, the linear algebra and graph kernels, underneath its solvers and data structures to achieve portability on different architectures.
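The core idea behind this kind of portability—write the loop body once and let a swappable execution back end decide how it runs—can be sketched briefly. This is not the Kokkos API (Kokkos is C++, with `Kokkos::parallel_for` and execution spaces for CUDA, HIP, OpenMP, and so on); the class and function names below are a hypothetical Python analogue of the pattern.

```python
# Toy "parallel_for" whose back end is chosen at setup time, sketching the
# performance-portability pattern: application code writes one loop body;
# the execution back end (serial, threads, GPU, ...) is interchangeable.
from concurrent.futures import ThreadPoolExecutor

class SerialBackend:
    def parallel_for(self, n, body):
        for i in range(n):
            body(i)

class ThreadBackend:
    def __init__(self, workers=4):
        self.workers = workers
    def parallel_for(self, n, body):
        # Each index writes a distinct slot, so threads do not conflict.
        with ThreadPoolExecutor(max_workers=self.workers) as pool:
            list(pool.map(body, range(n)))

def axpy(backend, alpha, x, y):
    """y[i] += alpha * x[i] -- one loop body, any back end."""
    backend.parallel_for(len(x), lambda i: y.__setitem__(i, y[i] + alpha * x[i]))
    return y

# The same application code runs unchanged on either back end.
y1 = axpy(SerialBackend(), 2.0, [1.0, 2.0], [10.0, 20.0])
y2 = axpy(ThreadBackend(), 2.0, [1.0, 2.0], [10.0, 20.0])
```

Adding a new architecture then means adding one back end, not rewriting every loop in every application—which is the value proposition described above.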

And if you’re interested in what is happening as of now, we are starting to add support for AMD architectures. We are adding support for OpenMP target, with which you can support Intel architectures for the upcoming Aurora system. And we are also evaluating new research projects for programming model needs that are beyond what is immediately needed so that we don’t get left behind when the next new architecture comes up.

Gibson: Let’s turn our attention to the structure of Trilinos. It is composed of about 50 disparate, autonomous packages and has some new interfaces that sit above them to help users. Please describe for us how Trilinos is structured and how it works.

Rajamanickam: Yeah, this was very important to the spirit of Trilinos when [it was] originally designed. The credit goes to a lot of the original developers. I came into Trilinos a lot later. Like I said, it’s been around for 20 years. I’ve been working on Trilinos only for the last 11 years. So, the credit goes to the original designers, who came up with this idea, and Mike Heroux as the lead for Trilinos.

The essential principle of Trilinos is that it is a federated group of packages. The packages have a lot of autonomy in how they want to help our scientific and engineering codes. But there is also a common set of guiding principles that you have to follow to be part of Trilinos, like a certain amount of documentation. You use the same build system. You use the same style for how a user will install your package and integrate it into their code. So, there is a set of guiding principles that everybody follows, but beyond that set of guiding principles, each package also has the autonomy to go and solve the problem it is trying to solve in the best way it knows how. And this is very important because we don’t want to impose the same set of restrictions or the same set of interfaces on, let us say, a graph algorithms library for load balancing and [a] nonlinear solver that is being developed.

So, there is a big group of libraries that come together. They all follow common community guidelines, and together they help our applications, both in the mission space and in the science space. That has been our primary focus for a very long time.

You talked about these new interfaces [and] new areas that are coming on top. We recognize that sometimes this can become quite daunting because there are, like, 50-plus different packages, each doing its own thing. As a user, you sometimes just want to solve a problem, and you don’t care what algorithm you use. You just want to solve a problem. So, what we have done in the recent past—including [with] ECP [funding]—is to put a layer on top of these packages, what we call Trilinos products. There are five different product areas, and this has allowed us to bring together packages that have a common theme. All packages still have the same community guidelines and all the things that I talked about.

The benefit is that, for example, the linear solver product has a common interface. Users don’t need to worry about what type of preconditioner or what type of linear solver [to use], at least in the beginning. Eventually, for performance, you care about all of those things, but to make the user experience easier at the start, we are bringing together, for example, all of the linear solvers into a linear solver product, and all the data structures—the distributed data structures as well as the node-level data structures—into a data services product. And there are five different products like that. So, we hope that will help users have a better experience in using Trilinos. Those are the two levels that we have—about 50 packages and five product areas. Users can choose which level they’re comfortable with when interacting with Trilinos.
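The product layer described here is essentially a facade: one entry point with a sensible default that hides which underlying package and algorithm get used, while still letting experts choose explicitly. A hypothetical sketch of that design follows—the names `LinearSolverProduct`, `register_solver`, and `diagonal_solve` are invented for illustration and are not Trilinos’s real (C++) interfaces.

```python
# Hypothetical facade illustrating the "product" idea: users call one
# interface; the product picks a concrete solver package behind the scenes.

class LinearSolverProduct:
    def __init__(self):
        self._solvers = {}          # name -> solve(A, b) callable
        self._default = None

    def register_solver(self, name, solve_fn, default=False):
        self._solvers[name] = solve_fn
        if default or self._default is None:
            self._default = name

    def solve(self, A, b, method=None):
        # Beginners omit `method` and get the default; experts can still
        # pick a specific algorithm when performance matters.
        return self._solvers[method or self._default](A, b)

def diagonal_solve(A, b):
    # A stand-in "package": only handles diagonal matrices.
    return [b[i] / A[i][i] for i in range(len(b))]

product = LinearSolverProduct()
product.register_solver("diagonal", diagonal_solve, default=True)

A = [[4.0, 0.0], [0.0, 2.0]]
x = product.solve(A, [8.0, 8.0])   # no algorithm choice needed
```

Each registered solver can keep its own internals and conventions—mirroring the package autonomy described above—while users see a single, uniform `solve` call.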

Gibson: Now that we have a handle on Trilinos’ background, how it’s organized, and what its capabilities are, will you share with us about the project’s accomplishments so far within ECP?

Rajamanickam: I didn’t talk about the ECP aspect of Trilinos as deeply, so let me talk a little bit about that. Trilinos, within ECP, is part of several projects. There is some Trilinos funding, say, within a co-design center or within an application itself, to do what is needed immediately for their use case and is only applicable to that co-design center or that application. But the primary Software Technology project that focuses on Trilinos is this new project called Sake. It is in the Software Technology area, so I’m going to talk about the accomplishments of that project. Keep in mind that I’m not talking about Trilinos’ accomplishments in all of those other projects.

Within Sake, there are three components. One is focused on Kokkos Kernels and the linear algebra kernels in it that we are developing for the new architectures that are coming. Number two is Trilinos itself, especially helping Trilinos get to the new Aurora and Frontier architectures. And, finally, there is a long-term research project on linear solver algorithms, especially pipelined Krylov solvers. Okay, let me talk about the first two aspects and the accomplishments there a little bit more.

Take Kokkos Kernels, for example: we had a big software release this year. Kokkos is an ecosystem, so Kokkos Kernels is released together with Kokkos Core. Kokkos Kernels had a 3.4 release this year, and the primary thing that ECP users should care about is that we now have support for AMD architectures through the HIP back end, as well as an OpenMP target back end. So, you can run on AMD architectures. We have some nice results on the Spock system at Oak Ridge [National Laboratory]. Some of these results have been shared within ECP at different meetings, and they are also available in a paper that was published recently. The support for the HIP and OpenMP target back ends is the number-one thing that I [would like to] point out from the Kokkos Kernels perspective.

The next step: we follow a waterfall model where, once Kokkos Kernels is ready—i.e., each node-level kernel is ready—we go to the distributed linear algebra in Trilinos and start porting that. So, a good portion of the Trilinos software stack is running on AMD using the HIP back end, with Kokkos Core and Kokkos Kernels underneath as the foundation. That is something that we are really happy about. And these are two [examples of] Software Technology–focused work that we have done in the recent past.
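A representative "node-level kernel" of the kind that gets ported first in this waterfall model is sparse matrix–vector multiply (SpMV) over a compressed-sparse-row (CSR) matrix, on which the distributed linear algebra is then built. Kokkos Kernels implements such kernels in C++ with per-architecture tuning; the plain-Python sketch below only shows the CSR traversal itself.

```python
# Sparse matrix-vector multiply (y = A*x) over a CSR matrix: a typical
# node-level kernel. CSR stores, per row, the range of its nonzeros
# (row_ptr), their column indices (col_idx), and their values.

def csr_spmv(row_ptr, col_idx, values, x):
    y = []
    for i in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):   # nonzeros in row i
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y

# CSR encoding of [[2, -1, 0], [-1, 2, -1], [0, -1, 2]]
row_ptr = [0, 2, 5, 7]
col_idx = [0, 1, 0, 1, 2, 1, 2]
values  = [2.0, -1.0, -1.0, 2.0, -1.0, -1.0, 2.0]
y = csr_spmv(row_ptr, col_idx, values, [1.0, 1.0, 1.0])
```

Once a kernel like this runs well on a new back end (HIP, OpenMP target, ...), the distributed solvers that call it inherit that portability—which is why the node-level layer is ported first.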

The next thing I would point out is [the] new features that we are adding to Kokkos Kernels and to Trilinos to help ECP applications. There are new features that we have added in order to reduce multigrid setup costs for applications like ExaWind. And there are new features that we have added to help EMPIRE and SPARC, the two ATDM applications, succeed in their L1 milestones. These are all part of our project goals. In order to help these applications, several new features had to be developed, and these features have been delivered in the 3.4 Kokkos Kernels release. They have also been integrated into Trilinos, so they have propagated to our applications as well.

These are some accomplishments that we are really happy about in the recent past. I can go back one more year and find more, but let me stop there.

Gibson: I believe those paint a clear picture. Siva, what are the next challenges for Trilinos within ECP?

Rajamanickam: Oh, interesting question. Within ECP, we are actually very confident of reaching our milestones on AMD and the HIP platforms. If you had asked me two years earlier, I would have been a little worried, but now—working closely with the facilities (a lot of folks from the facilities help us), working closely with vendors, and with our application teams communicating their requirements to us—we are quite confident of achieving the goals that we have set for ourselves in ECP itself. We are starting to look beyond that, and that is where we see new challenges coming up. Let me point out two things from two different directions.

The first one is, we have done all this work on portability for new architectures. What we are starting to see and think about is what happens if the architecture is heterogeneous. How would Trilinos adapt? How would Kokkos adapt? How much would our applications need to change if the compute node is not only something like a GPU but has more than one type of hardware—say, a GPU and data-flow hardware? That is something we have been thinking about, and we think it is an important challenge we need to address as a community from the perspective of Kokkos, Kokkos Kernels, and Trilinos. We are starting to think very carefully about that, and how software has to be designed for this is something that will be a challenge for the next few years.

The second aspect that I would point out is open source itself and how to deliver all the good work that is happening in the community to our users with a lot less pain than they have to go through today—that is, a very easy way to get Trilinos, get the solvers, and use them well. And from the perspective of the developers, [the question is] how to develop this with a lot less pain when you want to add a new capability.

So, these are the two things that we think about constantly. New architectures coming along—how would algorithms change for it? How would software change for it? That is one aspect. The second is delivering that well in an open-source environment to our users.

Gibson: This interview has evolved to a nice place to talk about long-lasting impact. What do you think will be the enduring legacy of the ECP Trilinos work?

Rajamanickam: Okay, this is an interesting question. Let me preface this by saying that this is my personal view on the Trilinos legacy. Trilinos is a large project. We have roughly 50 people who contribute to Trilinos very actively on a weekly basis. And then there are, like, another 50 people who contribute somewhat less frequently. So, it’s a large project, and Trilinos had been around a long time before I started getting involved in it. So, with that caveat, let me say what the legacy of Trilinos would be from my point of view, with a personal bias. I’d say two things. One is: what is the best way to help our mission codes and science codes achieve their goals? We do a lot of work in Trilinos from three different aspects. Trilinos is focused on helping production applications. Trilinos is also focused on research on new algorithms. And Trilinos is also focused on co-design with hardware vendors and complete-system developers—how to co-design a system and the software together. Trilinos has a role to play in all of these.

But eventually, I would say the legacy, from my point of view, is finding the best way to help our science codes and mission codes achieve their goals, because we are a foundational library, and their success is basically our success. So, all of these other activities that we do eventually feed into that goal. And that is the number-one aspect to me.

The second aspect—I’m going back to something I like, again—is the process of developing open-source, community-developed software. That would be a legacy of Trilinos, in my mind, because we’ve been doing this for 21 years, and a lot of the processes that came out of Trilinos are being adopted by the community, which is very nice to see. It is about bringing together this group of people who are all interested in improving science and scientific simulations, and doing that without imposing too many restrictions on them. That would be an enduring legacy of Trilinos, in my mind.

Gibson: Siva, what would you like to say to the technical community about engagement and the open-source environment with respect to Trilinos?

Rajamanickam: Wonderful. This is something I’m passionate about. Help us by contacting us. Reach out. For folks who are listening—if you are application folks, you are looking for new capabilities, and you have a difficult problem to solve and the solvers that you have don’t work—reach out to us and go try Trilinos.

The first stop is our GitHub repository. We have a very active group of users and developers, with nice engagement in our GitHub issues. We have annual meetings, what we call user group meetings. If you can participate in the user group meetings, please do. We have user group meetings both here in the United States and, somewhat less frequently, in Europe. So, if you are on the European side, you can try to attend that. You could also use the software collections developed in ECP, like E4S and xSDK, and get Trilinos from there. You can get it from Spack, of course. Use these, give us feedback, ask us questions, let us know your challenges, and engage with us. The more we talk to each other, [the better] we can solve the needs of our applications. I am looking forward to it. And I say thank you to the ECP Communications team for inviting me here. This was fun.

Gibson: Thank you for being with us. We really appreciate all the great, helpful information.

Rajamanickam: Okay. Thank you.