A Conversation with Mike Heroux of Sandia National Laboratories (SNL) about ECP Software Development Kits
A senior scientist at SNL, Mike Heroux is director of ECP’s Software Technology focus area. The following is from an interview with him at the ECP 2nd Annual Meeting in Knoxville, Tennessee, in February 2018. This is an edited transcript.
The term software development kit, or SDK, brings to mind such things as software development tools and math libraries. Is this what we’re talking about with respect to ECP’s SDK effort?
Yes, that’s true. However, it’s also a very important element for us because it is an organizational approach to try to reduce the complexity of the project management of ECP Software Technology, ECP ST for short. Introducing an aggregation of similar products allows us to manage coordination and development across that set of projects, and to manage the complete build of the ST software stack by having intermediate collection points for those builds and for the management and policy. So you can think of it as introducing a level in the hierarchy as we do coordinated development and delivery of our capabilities, because ECP ST delivers software: via source, via container technology coming in the future, via open-source tool kits like OpenHPC, and to computing systems vendors. ECP ST also delivers software to the DOE leadership computing facilities themselves in coordination with the hardware and integration component of ECP, which helps in the deployment of the software we deliver. Think of it like the last half mile of providing a software capability that is on a given platform that can then be used by the application team or by other development teams that need that product.
Why is the SDK effort critical to a capable exascale ecosystem?
There are both effectiveness and efficiency improvements that we can see by organizing our efforts in SDKs. The effectiveness is that we get people who are providing similar or complementary capabilities to work together and to provide common versions. For example, in the math libraries area where we have had this effort going on several years, all of us use a package called SuperLU, but we have never coordinated which versions of SuperLU we use. So now we want to have people use all of our math software in a multiphysics or multiscale simulation instance where many of these libraries are needed. We all need to point to the same version of SuperLU to have a consistent build and execution of the suite of tools. So that’s the effectiveness.
The efficiency is that we also learn a set of practices from each other. We understand more deeply how each of us engages in software development, and we learn from each other best practices. This enables us to build a community of people. We also have a collection of what are called community policies that enable us to provide a kind of contract to the user. This is a consistent set of policies in support of how we provide interfaces to our software, and how we work with each other to ensure the compatibility and composability of our independently developed software capabilities.
How do you test your products when the systems they will run on have not been built yet?
Well, so we can’t build or test on something that’s not available. But we do have available to us systems that are precursors of what exascale platforms might be. We know that one common theme on any exascale system that’s built will be so-called massive concurrency. Our clock speeds on processors have stalled at fewer than 10 billion clocks per second. And if you want to get to an exaflop, which is 10^18 operations per second, that extra 10^9 (if you’re good at exponential math, which maybe you aren’t, trust me) means you have to have billion-way concurrency to reach that level of performance. So we know that no matter what the architecture looks like underneath, we need to, first of all, in our algorithmic and problem formulations, and then in the writing of our code, expose that large degree of concurrency no matter what the platform is. Mapping that to a specific system is a lot of effort. There’s no doubt about that, but we can do a lot to prepare for exascale even though we don’t necessarily know the exact architectures of those systems.
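The arithmetic behind "billion-way concurrency" can be sketched in a few lines of Python. This is only a back-of-the-envelope illustration: the clock rate of 10^9 cycles per second and the figure of one operation per cycle per execution stream are round-number assumptions chosen for the example, and the function name is hypothetical.

```python
# Back-of-the-envelope estimate of the concurrency an exaflop requires.
# ASSUMED_CLOCK_HZ and ops_per_clock are illustrative assumptions,
# not figures taken from the interview.
EXAFLOP_OPS_PER_SEC = 10**18   # one exaflop: 10^18 operations per second
ASSUMED_CLOCK_HZ = 10**9       # assumed clock rate: about 1 GHz


def required_concurrency(target_ops_per_sec, clock_hz, ops_per_clock=1):
    """Return how many operations must proceed in parallel each clock cycle."""
    return target_ops_per_sec // (clock_hz * ops_per_clock)


concurrency = required_concurrency(EXAFLOP_OPS_PER_SEC, ASSUMED_CLOCK_HZ)
print(f"{concurrency:,}-way concurrency")  # prints "1,000,000,000-way concurrency"
```

Raising the assumed clock rate or the operations per cycle lowers the figure proportionally, but with clock speeds stalled, the gap has to be closed almost entirely by parallelism.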
Who on your team is leading this effort, and do you have many collaborations?
It’s certainly in the math software area where we prototyped this effort, and we had a few years’ lead time. Essentially, all of the math library teams are either fully engaged or in the process of becoming fully engaged in the math SDK, which we call the extreme-scale scientific software development kit, or xSDK—the xSDK was the original SDK.
All of the PIs [principal investigators] and the teams that are in the ECP math libraries area are engaged in some way in the xSDK, and the PIs are providing a leadership role. And we expect that this will occur in other areas as well. In fact, as we socialize the idea of an SDK in the development tools, programming models, and visualization areas, we find that there are already nascent efforts, or even sometimes quite mature efforts, to provide these SDKs. So what we’re describing here for the purposes of ECP already exists. These ideas make sense because there’s a fundamental value proposition when people work together to more efficiently deliver products of high value at lower cost.
In terms of people who are engaged in this effort, within the math libraries area, Lois McInnes and I started the effort for math libraries with collaboration from people like Ulrike Meier Yang at Lawrence Livermore National Laboratory, Sherri Li at Lawrence Berkeley National Laboratory, and I’m sure I’m missing names. I apologize to those who hear this and should have been listed. But then more broadly, we’re looking to the so-called Level 3 leads of the five technical areas in ECP ST. That would be Rajeev Thakur, Programming Models and Runtimes; Jeffrey Vetter, Development Tools; and then Lois, whom I already mentioned, is the lead of math libraries; Jim Ahrens in Data and Visualization; and Rob Neely in Software Ecosystem and Delivery. So those five people will be coordinating the conversation and the starting plans for SDKs across ECP ST.
How do you go about measuring their progress?
We have started an effort to create what are called impact goals. Impact goals are statements of the effectiveness of the product that you’re trying to deliver—whatever the product is—because we have a wide variety of products. So we have come up with a framework that allows each project to specifically express its value proposition but do so in a way that’s consistent across this broad product suite. Impact goals are a way of allowing PIs to express their value proposition. I’ll take the xSDK as an example. This is a collection of math software libraries that are commonly used. The four original were hypre, PETSc, SuperLU, and Trilinos. The xSDK has made them compatible with each other and allowed them to be built simultaneously with a single command—all of the libraries—where it pulls source code from the repository that lives out in the open Internet and then builds all of them in sequence with consistent versions and a cohesive single unit of built libraries.
So in the case of the xSDK, the impact goal, the key one is a penetration goal. People already access mathematical libraries via other means; the four I mentioned in particular and others as well that we’ve added. But what we want is for them to access libraries through the xSDK because that sets them up to do their multiscale, multiphysics calculations that they need to do. And so a penetration metric is the number of ECP applications that access these mathematical libraries primarily through the xSDK. That’s a tangible measurement of progress, and that’s the metric. We have the impact goal and the impact metric and then we also are asking the PIs to provide what is called the threshold and objective values, to use management-speak. That’s kind of the minimum acceptable level and then we want their maximum expected value, the really good value. Then we also ask them to put in the current value, and so that allows us to track in a very tangible way progress toward accomplishing their impact goal in a well-defined metric that everyone understands.
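The bookkeeping described above (an impact goal, a metric, and threshold, objective, and current values) could be recorded roughly as follows. This is a hypothetical sketch: the class name, field names, and all numeric values are invented for illustration and do not reflect actual ECP tracking tools or data.

```python
from dataclasses import dataclass


@dataclass
class ImpactMetric:
    """One impact goal with its metric and tracked values (hypothetical layout)."""
    goal: str
    metric: str
    threshold: float   # minimum acceptable value
    objective: float   # the "really good" value being aimed for
    current: float     # latest measured value

    def meets_threshold(self) -> bool:
        """True once the current value reaches the minimum acceptable level."""
        return self.current >= self.threshold

    def fraction_of_objective(self) -> float:
        """Current value as a fraction of the objective value."""
        if self.objective <= 0:
            raise ValueError("objective must be positive")
        return self.current / self.objective


# Made-up numbers, for illustration only.
xsdk_penetration = ImpactMetric(
    goal="Applications access math libraries through the xSDK",
    metric="Number of ECP applications using the xSDK as primary access",
    threshold=5,
    objective=15,
    current=8,
)

print(xsdk_penetration.meets_threshold())                 # True
print(f"{xsdk_penetration.fraction_of_objective():.0%}")  # 53%
```

Keeping the three values side by side is what makes progress visible at a glance: the threshold answers "is this acceptable yet?" and the objective answers "how far is there still to go?"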
This is a relatively new effort, but have you been able to overcome any anticipated barriers or are there any significant milestones you can talk about?
With the xSDK and the math libraries, an effort that has been going on for some time now, we’ve had quite a bit of success adding libraries. The xSDK has its history in ASCR [the Advanced Scientific Computing Research program of the US Department of Energy’s Office of Science] research funding. We had some early program managers who had strong foresight and could see the value of this kind of effort, and they funded us to do this. We’ve been able to refocus this effort on the ECP requirements. But in the first year and a half or so of ECP, we’ve been able to incorporate a much broader collection. We more than doubled the number of libraries that are part of the xSDK, primarily because all of the hard work of defining the policies and setting up the structure that’s required to become a member package of the xSDK has been worked out. And so adding new members usually requires an effort on their part to become compliant with the policies that we have, and these are very reasonable policies that make sense technically and commitment-wise: that you answer emails, and that you have a test suite that anybody can run to guarantee things were installed correctly and are behaving properly. They’ve had to step up and make sure that they can check all those items off. We’ve also had interest from the international community and other non-DOE-funded projects in becoming part of this.
What can the community look forward to, say, over the course of the next year with the xSDK effort?
We’re asking all the PIs and other senior people to play an active role in identifying the collection of products that make sense to be part of an SDK. The attributes of products that would belong together in an SDK are things like: are the products interoperable? or where one product uses another, and then things like: are the products interchangeable? If two products provide a similar capability, it should be easy for users to switch between the two technical approaches to solve the same problem. You might think of that as redundancy, but it’s not really. It means that there are strengths in each kind of package and that even though they both satisfy the basic functional requirement, one might do a better job in a particular situation than the other, and then in a different situation, it would be reversed. And so this kind of interoperability is really important. It provides us robustness in our ecosystem. We have more than one tool to do a similar kind of job.
It sounds like the software development kits transcend the exascale effort. Will these have a longer-term positive impact on the broader HPC community?
I hope so, and that’s certainly our intent. The SDK model is not something that’s exclusive to ECP-funded projects. In fact, it’s a technology and policy-based membership model so that, in principle, any package that fits within the domain of a given SDK, say, for example, a third-party math library, could be part of the SDK. We have interest from people at the German aerospace research center, DLR, in contributing their capabilities to the xSDK, the math libraries one. And so there’s no organizational impediment to other people being engaged.
We foresee that if this effort goes as well as we think it will, it will have a lasting cultural and organizational impact on scientific software forever. And maybe that sounds grandiose and bold, but we believe it. I’m not a historian, but I like history. One of the great quotes about the US Civil War was by historian Shelby Foote, who said that one way to describe its meaning was that before the war, we said “the United States are,” and after the war, we said “the United States is,” meaning an integrating effect resulted. My hope is that as we create these SDKs and bring these independently developed products together under a collaborative umbrella, instead of saying that each of these individual products is available independently, we can start to say that an SDK is available.
Is there anything else you’d like to say to the community? Is there any aspect of the SDK relative to this project that might be misunderstood?
The acronym SDK is used to describe a variety of development and product efforts, and so many people may come in with some specific notion of the SDK. Not that it’s wrong, but it might not be identical to what we’re describing here. So the product element of the ECP ST SDK—if that’s not enough acronyms in one sentence—is certainly consistent with what people might think of as an SDK in a commercial setting where the perspective is that of the user or the client. But for us there’s also the community organizational aspect of it in that we’re using this to principally define and execute our organization of independently developed products. And I think that element of it may not necessarily be appreciated as a critical aspect. Maybe it will be, but not for everybody. If you’re accustomed to being a user of an SDK, you think about it only as a suite of products that are available, but you may not necessarily think about how it actually enhances the coordination and development of these independently developed products.