Ensuring the Exascale Ecosystem Lands Successfully at Energy Department Facilities

Exascale Computing Project · Episode 78: Ensuring the Exascale Ecosystem Lands Successfully at Energy Department Facilities.
Ryan Adamson, lead of the Software Deployment at Facilities effort in the Exascale Computing Project and group leader for ORNL HPC Security & Information Engineering

Ryan Adamson, Oak Ridge National Laboratory

By Scott Gibson

The US Department of Energy’s (DOE) Exascale Computing Project (ECP) has a Software Deployment team that deploys and integrates an exascale software stack and implements a software integration and testing capability at DOE high-performance computing (HPC) facilities. This team supports continuous integration with site environments, including container technologies and software development kits. Containers package up computer code so it can be moved from one computing environment to another and used reliably.

A little over a year ago, ECP’s Let’s Talk Exascale podcast had the pleasure of speaking with the Software Deployment team lead, Ryan Adamson of Oak Ridge National Laboratory. In this episode, he joins us again for an update on the project.

Our topics: ensuring the exascale ecosystem lands successfully at the facilities; the benefits of continuous integration to scientists, users, and systems management teams; providing a base layer of continuous integration smoke testing (or confidence testing); how continuous integration helps avoid the problem of stale software; and more.

[The term “facilities” in the discussion refers to the DOE HPC facilities.]

Interview Transcript

Gibson: Welcome back, Ryan. All right, well, I know ECP’s Software Deployment at the Facilities project automates the building and testing of software for the ECP software ecosystem at various Department of Energy facilities. Let’s start with a broad brush stroke of the activities involved in your work. When you and I last spoke on the podcast just over a year ago, you noted how your team collaborates with the ECP Software Technology research focus area; namely, using Spack and delivering containers for software applications. At that time, you said the effort was about to turn a corner in gaining project momentum. Will you bring us up to date on software deployment?

Adamson: You used the word ecosystem in that question. I think it’s the right place to start because the Exascale Computing Project is really about developing a software ecosystem for these next-generation HPC resources.

And so the software deployment area that I’m leading ultimately is going to take that ecosystem and make sure it lands successfully at the facilities. To do that, we want to make sure that the software is tested and that it works and it’s efficient, and that’s kind of our main mission.

The corner that we turned, in fact, is getting a common interface to take everything in that ecosystem, package it up using Spack specifically so that the facilities know what they need to test and when they need to test it, rather than having lots of individual projects each with their own build and test strategy. We’re asking ECP to package everything up into this common framework so that it’s easy for us at the facilities to test it and then get that information back to the developers.
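[For a concrete picture of that common interface: packaging a product into the ecosystem means contributing a Spack build recipe, a small Python class. The sketch below is hypothetical; the package name, URL, checksum, and dependencies are illustrative, not an actual E4S product.]

```python
# A hypothetical Spack build recipe (package.py); the package, URL,
# and dependencies are illustrative, not a real E4S product.
from spack.package import *


class Examplelib(CMakePackage):
    """An illustrative ECP-style library packaged for Spack/E4S."""

    homepage = "https://example.com/examplelib"
    url = "https://example.com/examplelib-1.2.0.tar.gz"

    version("1.2.0", sha256="0" * 64)  # placeholder checksum

    variant("mpi", default=True, description="Build with MPI support")

    depends_on("mpi", when="+mpi")
    depends_on("hdf5")

    def cmake_args(self):
        # Translate the Spack spec into the project's CMake options.
        return [self.define_from_variant("ENABLE_MPI", "mpi")]
```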

Gibson: Will you explain how continuous integration is a win for scientists, users, and systems management teams?

Adamson: Absolutely. And, again, this goes back to that ecosystem. So we have scientific software users. We have developers and then we have the folks at the facilities, the systems administrators, and software installers, who are really providing this whole stack. So the win for the facility software folks and systems administrators is that, I guess, in the past, facilities have installed a brand-new software stack maybe once every six months or every year because it takes so much time and effort to build all of the tools that you need as part of a single working stack.

If that stack has been tested and issues have been fixed on that particular hardware before the facility starts installing it, many of the problems that they would have run into have already been fixed. So it really speeds up time to deployment. And we’re targeting quarterly deployment now for the E4S software stack at each of the facilities. So that’s been a huge speedup for us at the facilities. That also means that software developers get feedback immediately about issues that they may have introduced into the code. Software developers love having code that’s stable, that’ll run successfully, and that is good-quality code. But they don’t always have access to all of the platforms that their code will run on. So providing that to them is really important.

And then finally, scientific software users, the scientists themselves, benefit from having the latest and greatest features: new ways of speeding up their I/O, new debugging tools, new visualization tools. They don’t have to wait up to a year to get the latest and greatest. They can get the newest things or build them themselves because the recipes make that easy.

Gibson: Is the continuous integration effort in ECP something new to the HPC community? And why is this such an important function to the HPC community?

Adamson: That’s a great question. I would say yes and no to the ‘is it new?’ It’s been a best practice across the software development landscape for decades: you want to test your code before it gets rolled out into production. But if you look at the software stacks that are out there—take Red Hat, for instance. They provide a version of Linux and a bunch of extra libraries and tools that they package up and deliver as kind of a one-stop-shop, easy-to-use software collection. They can go and, with broad strokes, define how they want to test because they own all of that functionality. In an HPC software stack, by contrast, multiple project teams own their own CI infrastructure. So your compilers may have different testing strategies than your debuggers, which may have different testing strategies than your I/O libraries. And they’ve all developed these over time to meet their own project team needs.

So what we’re doing with this ecosystem is we are providing a common kind of first-step base layer of CI smoke testing that we can run at the facilities and in the cloud to apply to really the ECP’s E4S software portfolio. And that, to my knowledge, has never been done before for this community.

Gibson: Ryan, tell us how the continuous aspect of continuous integration helps avoid stale software.

Adamson: Absolutely. What we’ve seen in the past is software that sticks around for a long time at facilities becomes stale. If a new stack hasn’t been installed in a year, that code is the only code that can be used by new users that are coming into the facility. If we are testing and deploying more frequently, scientists will use the fresher software in some sense because it’s easier. And if scientists can hook into our continuous integration framework and know that newer versions of the libraries that we’ve installed don’t impact their scientific results or performance, they’re also more likely to want to move forward and use new libraries and the features that they provide.

Gibson: The automated ECP framework for continuous integration of the Extreme-scale Scientific Software Stack (E4S) across multiple DOE sites is new. What are some of the many pieces of this process?

Adamson: That’s a great question. So E4S is really the culmination of this broader ecosystem that we’ve been talking about. Gosh, I don’t know where to start. It’s such a big effort. There is this ability to define what it means to be a member of E4S. So if you are a software team and you want to become a part of this ecosystem, there are community practices and standards. The team has a build recipe that they put into Spack. They have smoke test recipes that they put into Spack. They document their code in a certain way. They make sure they adhere to certain community guidelines and standards. And so that’s how you gain entrance into this ecosystem as a development team. Then once you’re there, the facilities, along with the other test harnesses that run in Amazon and elsewhere for their uses of E4S, take all of that material, those build recipes and test recipes, and we do our thing. We build them; we test them; and then we get the results back to the developer through a dashboard so that they can understand what’s broken and fix it.
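[The smoke test recipes Adamson mentions live in the same Spack package file as the build recipe. The following is a minimal sketch using the stand-alone test hook Spack exposed in this era ("spack test run"); the executable name and expected output are illustrative, and the build directives are elided (see the earlier sketch).]

```python
# A hypothetical smoke test carried in a Spack package.py, using
# Spack's stand-alone test hook; executable names are illustrative.
from spack.package import *


class Examplelib(CMakePackage):
    """Illustrative package; build directives elided (see earlier sketch)."""

    def test(self):
        # Confirm the installed tool runs and reports the expected version.
        self.run_test(
            "examplelib-config",
            options=["--version"],
            expected=["1.2.0"],
            purpose="check that examplelib-config reports the installed version",
        )
```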

Gibson: Please describe for us the build, test, and verify runs involved in continuous integration.

Adamson: So this is a great question as well. These are great questions! So Spack is where we’re standardizing this work. If you want your stuff to be smoke tested at facilities, you make sure you put your build, run, and test recipes into Spack. There are 500-plus commits to Spack per month, OK? And each of these commits to Spack could potentially trigger one of our build or test pipelines. Usually it would if we’re going to do a test on merge, right? And each of these pipeline runs can spawn up to 100 GitLab jobs because there are different pieces of the software iceberg, I guess, that you need to build. And so that adds up to a large number of software products that need to be built on these facilities. So the scale is huge.
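[The fan-out Adamson describes falls out of the dependency graph: each package in a spec’s graph becomes its own GitLab build job, scheduled after its dependencies. A toy sketch with made-up package names:]

```python
# A toy sketch of why one pipeline spawns many jobs: every package in
# the dependency DAG becomes a GitLab build job, dependencies first.
# The package names here are made up.
from graphlib import TopologicalSorter  # Python 3.9+

deps = {
    "science-app": {"examplelib", "hdf5"},
    "examplelib": {"mpi"},
    "hdf5": {"mpi"},
    "mpi": set(),
}

for pkg in TopologicalSorter(deps).static_order():
    print(f"schedule GitLab job: spack install {pkg}")
```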

On the plus side, HPC teams are in the business of scaling. So we understand what the limits of our systems are and how to turn the dial to get the most out of the tests that are written. What we find is there’s a tradeoff between performance testing and smoke testing. And we really want to exercise the set of cardinal tests that give us the most value per test so that we don’t overload these HPC resources, which at the end of the day are available for open science; they’re scientific instruments. They’re not dedicated build farms. We want to be sensitive to the scientists that are on these systems, but we want to deliver them value through sane tests.

Gibson: ECP’s continuous integration has a cohesive support model to correct software issues. This enables developers to receive immediate feedback so they can prevent errors from propagating throughout the source tree and becoming entrenched in stable releases. What’s the strategy involved in this cohesive support model?

Adamson: That’s right. The support model is a loop; it’s a cycle. So you have features that developers develop. Those get generally tested by them at their build farms or locally on their systems. And then when they’re ready to be rolled into Spack, we pick those up at the facilities and run their smoke tests. And when we see issues, we want to be able to tighten the loop and automatically throw those build errors back onto the merge request that the developers are making into Spack. The tighter that loop is, the more quickly developers can fix the issues, because it’s obvious which change to their code caused a new build error or functionality error to pop up. And so tightening the loop is the most important piece of that cycle.
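[One way to picture that tightened loop, assuming the python-gitlab client: when a facility pipeline fails, a bot can post the failure straight back onto the originating merge request. The GitLab URL, token handling, project path, and merge request number below are hypothetical.]

```python
# A hedged sketch of closing the loop: report a facility build failure
# back onto the originating Spack merge request. The GitLab URL, token,
# project path, and MR number are all hypothetical.
import gitlab

gl = gitlab.Gitlab("https://gitlab.example.gov", private_token="REDACTED")
project = gl.projects.get("ecp/spack")  # hypothetical project path
mr = project.mergerequests.get(4242)    # hypothetical MR number

mr.notes.create({
    "body": "Facility CI: examplelib@1.2.0 failed to build on ppc64le; "
            "see the pipeline log for the first failing job."
})
```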

Gibson: What are the benefits to developers and systems teams of having tested and verified software to start with rather than an ad hoc situation of working around stale software?

Adamson: Absolutely. Complexity is one of the worst things that you can have when you’re designing a system, right? Sometimes it’s necessary, but complexity absolutely kills. And so we identify common interfaces and clean procedures for taking new features, putting those into Spack, running the new Spack tests at the facilities, and then giving the results back to the developers. By identifying a clean, consistent way of doing that, we’re reducing the complexity of that cycle. Ultimately, that means the code that has been delivered has been tested. The libraries have been tested with each other; they’ve been tested with the hardware that they’re going to be installed on; and there are just fewer gotchas because we’re reducing complexity.

Gibson: Ryan, why do you think continuous integration is the path for the future of high-performance computing: academic, enterprise, and commercial/cloud users?

Adamson: Good question. CI is a way to automate a lot of the manual work we’ve had to do in the past, and it just isn’t going to go away, because it’s an industry best practice. When new engineers and scientists graduate and come into the field of computer science, they already have experience with version control and continuous integration. So it’s embedded in developer culture at this point. It’s not going to go away.

One of our tasks moving forward is to make sure that we can sustain this ecosystem and sustain the testing capabilities that we have at the individual laboratories. We’ve built something that’s awesome. But it really took a large project like ECP for us individually to rally around to get a common testing framework in place. And so we’re in discussions—What do we do? How do we support this testing ecosystem after ECP is over?

But it’s pretty clear that it’s got widespread adoption. Spack existed before ECP; it will exist after ECP is over. They do builds and tests in the cloud; that’s going to exist after ECP is over. So it’s the future. It reduces complexity. And I just can’t see us going back to the way things were.

Gibson: Ryan, to close, how about giving us a snapshot of the current activities in ECP’s Software Deployment at Facilities efforts.

Adamson: Yeah. Over the last year, we’ve focused on a couple of different use cases to, again, help reduce the complexity of our mission. The first is providing a stable continuous integration platform using GitLab at each of the facilities where these ECP systems are going to be available to HPC users. So that’s done and is available; people can take advantage of that.

The second focus that we’ve had is in getting the deployment of the E4S software stack more automated and getting the testing and the builds running on our commodity hardware, our last generation of systems: Cori, Summit, and Theta. What we’re pivoting towards this next quarter, these next months, is allowing facilities to add their own flavor to tests. Facilities make changes as well to the software stack that the vendors provide. They might roll versions of kernels. They might update versions of the operating system, which might cause the software stack to fail, and so we need to integrate that into this testing ecosystem. That will be what we look forward to in the next three to six months.
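[One mechanism that could support this facility-specific flavor, sketched under assumptions: Spack lets a site maintain its own package repository that takes precedence over the builtin one, so a facility can layer local patches onto a recipe. The package, base class, and patch file below are illustrative.]

```python
# A hypothetical site-local override in a facility's own Spack repo,
# which takes precedence over the builtin recipe; names are illustrative.
from spack.package import *
from spack.pkg.builtin.examplelib import Examplelib as BuiltinExamplelib


class Examplelib(BuiltinExamplelib):
    # Apply a facility-specific fix needed after a kernel/OS roll.
    patch("site-kernel-fix.patch", when="@1.2.0")
```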

Gibson: Ryan Adamson, thank you. Very insightful stuff.

Adamson: Absolutely! Thank you for having me!
