ECP Communications Manager Mike Bernhardt sat down with Mike Heroux of Sandia National Laboratories and Lois Curfman McInnes of Argonne National Laboratory to discuss the IDEAS (Interoperable Design of Extreme-Scale Application Software)-ECP project.
The role of IDEAS within ECP is to help ease the challenges of software development in this environment, and to help the development teams ensure that US Department of Energy investment in the exascale software ecosystem is as productive and sustainable as possible. The following is an edited transcript of the discussion.
McInnes: Well, our project initially began in 2014 with funding from the Office of Science as a collaboration between researchers in advanced scientific computing and basic energy research in order to address challenges in software productivity and sustainability encountered in next-generation applications in subsurface science. That is, we are motivated to try to help our developers work together more productively and to develop software that enables them to tackle multi-scale, multi-physics challenges, which requires bringing together software developed by independent teams. So, we’re working with the science community in order to come up with strategies for more effective collaboration through software.
Heroux: I see two primary motivators for the project. One is the introduction of highly concurrent node architectures, which requires major refactoring of algorithms and software. So, we have ahead of us a massive redo of our high-performance computing software base. Anything we can do to improve the productivity and the sustainability of that base is a positive, just because of the massive undertaking we have right now. And we have a very large software base that the nation and world depend upon for many, many things.
And the second motivator is that there has been a blossoming of improved software practices and software tools. Software engineering is increasingly an engineering discipline where best practices are well-known, and we need to tap into that. We often can’t just take what the broader community is doing in terms of software engineering practices and just blindly adopt them. They often have to be adapted to our kind of environment, to scientists and the kind of exploratory software that we do; we must adapt software engineering practices before we adopt them. We can also contribute back to the broader community, because we work in some ways in a very extreme software environment, and the experience we have can feed back into the broader software community. So, those are, from my perspective, the two main motivators.
Bernhardt: We’re taking it now from a project that greatly benefits HPC in general at the extreme scale, and very specifically targeting it toward exascale. How does all of this fit within the ECP?
McInnes: Well, we collaborate across the entire ECP community. Our project interacts with application teams directly by engaging them in understanding their most challenging software and productivity bottlenecks and working with them to come up with strategies to overcome them. We also interact with the Software Technology teams, which Mike can talk about more, where we are, again, trying to understand some of their challenges but also learn from their extensive experience and good practices. Many of those teams are experts in developing widely used and sustainable software, and so we’re trying to bring together capabilities and share experiences across that whole community. Of course, we also interact very strongly with the Hardware and Integration area through our partnerships with facilities, many of our outreach activities, and our work toward developing sustainable software that’s used at facilities, so working toward processes that the whole community can rely on.
Bernhardt: Is it a function of the Software Technology focus area within ECP?
Heroux: So, this is a change. Doug Kothe, who’s now the director of ECP, was director of Applications before doing that, and he saw the need for this kind of activity and supported us in our transition from an Office of Science project to becoming an ECP project. And we were part of the Applications portfolio at that time, because he wanted his Application team to have access to this kind of resource.
But then with the restructuring that occurred about a year ago, this project moved into the Hardware and Integration research focus area, where facilities was also moved (or, sorry, Training was also moved), and that was a really nice move overall, because it allowed IDEAS to reach out to both the Software Technology research focus area and the Application Development research focus area, and then also tap into the training resources that the facilities have. That resource has been a tremendous benefit for us in getting the word out. Our HPC Best Practices webinar series is probably our most popular outreach activity. We have literally hundreds of people who dial in to the webinar or watch it afterwards, and we have people all over the country. I hear from [people] who have no connection with ECP, who say, “Yeah, we pay attention to what’s there, because this is really good content.”
Bernhardt: It sounds like collaboration is the real heartbeat of this program.
McInnes: Absolutely, and I think everyone recognizes that extreme-scale computational science requires collaboration. That’s one of the things I love about it. I enjoy the interactions among people with complementary capabilities and skills, but I believe it’s essential for our broader community to really take seriously that software is the practical means by which we do a lot of this collaboration.
Heroux: Yeah. Exactly.
McInnes: The way we encapsulate the expertise of the people with complementary capabilities—for example, in science areas and math areas and various computer science areas—and the way we bring them together to work toward new predictive science, and the way we sustain that and collaborate over the long term. So, we really need to take seriously the ways in which we do our software to make that effective.
Bernhardt: Really well stated, Lois. Let’s move into some of the accomplishments and the interactions that you have with the community.
McInnes: Well, we have a variety of activities going on. In the IDEAS-ECP project, we have four complementary areas of work. We do interviews with various ECP teams to understand their biggest productivity challenges and bottlenecks, and then we use that information in a variety of ways. First of all, we extract common observations or challenges and cross-cutting issues and we work towards developing content with the broader community to help address those needs.
And, secondly, we pursue partnerships with a variety of teams on ways to overcome their challenges through productivity and sustainability planning. Mike can talk more about that, but before we launch into that in more detail, I’ll just mention that the other areas of our work include developing content that specifically addresses the needs of our computational science community. As Mike mentioned, we can’t just use software practices from the mainstream; oftentimes, we need to customize practices to the experiences of our developers in scientific computing. Then we also have a strong outreach component that incorporates not only webinars but also other avenues for outreach, including an online portal for collaborating and sharing our information on better scientific software.
So, Mike might want to talk more about the productivity and sustainability planning.
Heroux: Yes. So, we have—we call it PSIP, Productivity and Sustainability Improvement Planning. Lots of words. That’s why acronyms are useful in this area; but it’s an iterative process and it’s typically our initial way of engaging with a software team.
We start with an interview using simple language. We don’t want to assume that somebody has software engineering lingo in their vocabulary, because often there are fuzzy terms that people tend to use—lifecycle model and all that kind of stuff—so we try to avoid that. We just try to use plain language when we talk with a team.
We ask them, say: “Well, how do you get started, you know, creating a software product or a feature in your product? How do you get the idea? How do you take that idea and mold it and do planning, and then design and implementation and testing?” We kind of walk them through that whole cycle and how to bring it into the existing software base. It’s not meant to be exhaustive, just a few pages of text, notes that we take in this interview process. And then once we’ve gone through that, we have a sense of what might be some low-hanging fruit for improvement, say, a single practice where they said, “Well, we’d like to change this,” or where it’s obvious it might be a really good thing to change.
Maybe it’s unit testing; they really want to get unit testing in, and unit testing is a lightweight kind of testing where you can run something, a test really quickly, and see if you broke something, your existing product, when you’re writing a new feature. So, maybe they work on that.
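[Editor’s illustration: the kind of lightweight unit test described above might look like the following minimal sketch. It is not taken from any ECP code; the `kinetic_energy` routine, the test names, and the `run_quick_checks` helper are hypothetical, chosen only to show a test that runs in well under a second and flags a break in existing behavior.]

```python
import math
import unittest


def kinetic_energy(mass: float, velocity: float) -> float:
    """Hypothetical existing routine: classical kinetic energy, 0.5 * m * v^2."""
    if mass < 0:
        raise ValueError("mass must be non-negative")
    return 0.5 * mass * velocity ** 2


class KineticEnergyTest(unittest.TestCase):
    """Small, fast checks that pin down the routine's current behavior."""

    def test_known_value(self):
        # 0.5 * 2.0 * 3.0**2 == 9.0
        self.assertTrue(math.isclose(kinetic_energy(2.0, 3.0), 9.0))

    def test_zero_velocity(self):
        self.assertEqual(kinetic_energy(5.0, 0.0), 0.0)

    def test_negative_mass_rejected(self):
        with self.assertRaises(ValueError):
            kinetic_energy(-1.0, 1.0)


def run_quick_checks() -> unittest.TestResult:
    """Run the tests in-process and return the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(KineticEnergyTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

A developer adding a feature would call `run_quick_checks()` (or run the file through a test runner) after each change; a failing test signals that existing behavior was broken.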
Then we take a step back and say, “OK, you want to do that, but are there any impediments in the way of getting to that?” And we work on that and set up a plan. We set up what we call a progress tracking chart, so we baseline that so you can see progress in a measurable way. Then we come through and make that change, and then we come back again and we iterate. We look at their sketch again and say, “What’s the next thing?” Or maybe we update it and then we iterate. This kind of process is an effective way, generally speaking, for incrementally improving what people are doing.
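[Editor’s illustration: one way to picture the baseline-and-iterate idea behind a progress tracking chart is the sketch below. The interview does not describe the chart’s actual format, so the `ProgressTrackingCard` class, its fields, and the example steps are all assumptions made for illustration, not the PSIP team’s tooling.]

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProgressTrackingCard:
    """Hypothetical record of one practice a team wants to improve."""

    practice: str
    steps: List[str]  # ordered from starting point (index 0) to target (last)
    history: List[int] = field(default_factory=list)  # recorded step indices

    def __post_init__(self) -> None:
        if len(self.steps) < 2:
            raise ValueError("need at least a starting step and a target step")

    def record(self, step_index: int) -> None:
        """Record where the team stands; the first record is the baseline."""
        if not 0 <= step_index < len(self.steps):
            raise ValueError("step_index is outside the defined steps")
        self.history.append(step_index)

    @property
    def baseline(self) -> int:
        return self.history[0] if self.history else 0

    @property
    def current(self) -> int:
        return self.history[-1] if self.history else 0

    def gain(self) -> int:
        """Number of steps advanced since the baseline."""
        return self.current - self.baseline

    def fraction_complete(self) -> float:
        """Position between the first step (0.0) and the target (1.0)."""
        return self.current / (len(self.steps) - 1)


card = ProgressTrackingCard(
    practice="unit testing",
    steps=[
        "no unit tests",
        "tests written for new code",
        "tests run before each merge",
        "tests run automatically on every change",
    ],
)
card.record(0)  # baseline taken at the first planning session
card.record(2)  # status at a later iteration
```

Recording a score at each iteration gives the team a simple, comparable measure of how far it has moved from its baseline before picking the next practice to improve.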
Bernhardt: Can you cite some specific examples?
McInnes: We have worked with a variety of teams on PSIP. One exciting direction has been a partnership with the EXAALT project led by Art Voter at Los Alamos, where we have worked with them to identify their biggest bottlenecks, which largely included testing and also their build environment. So, through a partnership with our team, we worked to improve both their build processes and their testing using the PSIP methodology, and now they are in a position to keep expanding beyond their initial achievements, including focusing on continuous integration testing. This is an important area of new work that’s in partnership with facilities and working groups that span facilities and software technology. So, that’s one concrete example where we have, I believe, strongly impacted their ability to do more effective science, and the impact is permeating the whole ECP project.
Bernhardt: As I listen to how you’re describing the efforts with IDEAS-ECP, you really are interconnected with every single aspect of the project: the applications, hardware and integration, and software. And you’ve got a multi-lab collaboration going on. I assume a number of universities and research organizations are involved. What’s the biggest challenge that you face with such extensive collaboration?
McInnes: I will say that our team, as you mentioned, has a variety of labs involved, including Oak Ridge, Livermore, Berkeley, and Los Alamos, as well as Sandia and Argonne, along with our university partner, the University of Oregon. And I believe that our teams are really effective in reaching out to the people at their institutions in order to engage them. But one challenge we all face is not specifically a collaboration challenge; it’s the challenge of juggling the tension between investing in better software over the long term versus the pressure to have science achievements quickly and to publish papers quickly. So, we struggle as a whole community to carve out precious time to invest in changing these fundamental aspects of our work.
Bernhardt: That makes sense.
Heroux: Yeah. Each team has specific science or software feature objectives, goals that they signed up for. We can’t put too much overhead of improving practices on top of delivering those goals. Still, they can set aside a portion of time, because it will pay off in the long run. If you integrate over a long enough span of time, these productivity improvements, sustainability improvements, will really pay off, but you still have to get that science and those features done.
Bernhardt: We’re right on the eve of SC18, the big supercomputing conference—probably 11,000 attendees this year from what I hear. [Editor’s note: more than 13,000 people ended up attending the conference.] That is kind of overwhelming, but that’s such a small fraction of the entire HPC community when you look at it this way. What message would you like to have folks walk away with when they listen to this audio or watch this video related to IDEAS-ECP?
McInnes: Well, we believe it’s a great opportunity for the community at large to really take seriously their own software and how they collaborate using their software. And our team would be delighted to engage with anyone who’s interested in talking with us. Of course, we also create resources that people can use independently without talking with us one-on-one, but we’d really like to engage the broader community in thinking very seriously about improving software as a mode of collaboration. And we’re working with a worldwide community that’s looking at this. Already, we are interacting with a number of international groups who are working on complementary aspects of issues in software productivity and sustainability. I believe the time is really right for us across the whole international community to collaborate more effectively.
Heroux: Yeah. A couple of very concrete things: one is a portal that we contribute to, Better Scientific Software, at bssw.io. It’s a website, so you just type in bssw.io, and that portal provides a lot of hooks to information that we and other people from the community have contributed. You can sign up for our mailing list. I think the sign-up is in the upper-right corner of the initial screen when you go there, and if you sign up for that, you get lots of timely information.
The webinars, the Best Practices in HPC webinars that we do, are announced on that list, and they occur once a month. You can see what’s coming, sign up and get notified, and then either see it in real time or watch the video. Again, that’s been one of our more popular opportunities for engaging with the IDEAS project, because it’s our people who often (not always, but often) provide the content and give the talk that’s part of that.
Another one I think is just to give some acknowledgment to people who have helped us: the Software Sustainability Institute in the UK. We have learned a lot from them. They’ve been engaged in this kind of outreach and promotion of better practices for a decade longer than us, I think. So, SSI in the UK is another great organization, and they put out information on a regular basis. So, those are two concrete things.
Bernhardt: Well, hopefully a lot of the folks that we’re connected with through the community will pick up on this call for participation and join in on the fun.
Heroux: Yeah, yeah. Very good.
Bernhardt: All right.
McInnes: Great. Thank you.