Influencing the Evolution of the MPI Standard for Optimal Exascale Scientific Applications

A Conversation with Pavan Balaji and Ken Raffenetti of Argonne National Laboratory

Pavan Balaji is principal investigator and Ken Raffenetti a co-principal investigator for ECP’s Exascale MPI (MPICH) project. Both are with Argonne National Laboratory. In this episode of the podcast, they describe efforts to help MPI, the de facto programming model for parallel computing, run as efficiently as possible on exascale systems. This is an edited transcript.

What challenges does exascale pose to the use of MPI, and how is ECP’s Exascale MPI project addressing them?

Balaji: Exascale is a completely different ball game compared with what we have today. There will be a lot of challenges going to exascale. We broadly think of the challenges with respect to what new application domains are being studied for that scale. Machine learning has emerged recently as an important domain. Additionally, traditional scientific computing applications have become more irregular than before. That’s part one. And part two is new hardware coming up—new network hardware, new types of processors, heterogeneous processors, and so on. MPI sits between these two—the application and the hardware—and it has to match the requirements and constraints of both and give the best performance it can.

Those are the problems of exascale systems. With respect to how we are handling them, we have various broad areas that we want to tackle. We need to look at a lot of key technical challenges, like performance and scalability, when we go up to this scale of machines. Performance is one of the biggest things that people look at. Aspects with respect to heterogeneity become important. GPUs are a part of today’s heterogeneous systems, but as we go forward, heterogeneity is going to expand to include different types of hardware. Fault tolerance is an important aspect that needs to be addressed because at that scale, faults become more common. Awareness with respect to how we handle all this hardware topology and how MPI works with other software technologies, such as other programming models, is an important aspect, too. These are some of the key technical challenges we are looking at.

Of course there are a lot of other aspects, too, such as working with vendors to make sure that their needs and the hardware they are developing are addressed and taken advantage of in MPI implementations. Working with ECP applications teams is important as well for us because they are our final customers. And, of course, working with the community in developing the next MPI standard is critical so that it’s not just a one-off implementation but users have some confidence that in the future, 10 or 20 years from now, this work is going to survive, something of this model is going to be there and they can rely on that. So we will do all of these things to address the challenges of exascale while working with the vendors to ensure that their hardware is being utilized well and giving the applications developers and the users confidence that what we are doing is sustainable for the foreseeable future.

Pavan, how would you summarize the backgrounds and experiences that your team members bring to bear on the project’s efforts?

Balaji: Ken, do you want to answer that?

Raffenetti: Yeah, I can take that. Our team has a pretty diverse background, I would say, in general. And given that MPI kind of sits between applications and the hardware, that background is probably a good thing. We have people who have architecture backgrounds and know the very low-level details of hardware and implementation. Conversely, we have people with more of an applications background, people who have worked more at the high-level scientific parts of computing in the past, so they know how those applications would use MPI. You know, we have a lot of MPI history in our team, so there are people who go back to the beginning, the early days of MPI 1, MPI 2, all the way through current MPI. One of the things that I bring to the table is a system administration and user support background, so I have a bit of perspective when it comes to actually giving users new software and new features and how these can be improved when delivering new versions of MPI. And we also have some external perspective, not just MPI and HPC, but people from the cloud computing sector as well. So we have pretty broad, diverse backgrounds, covering kind of all parts of the HPC [high-performance computing] stack, which I think helps give us a good perspective on how to best support both the hardware and the software and the actual users of MPI in the long run.

For the project to be successful, the team needs to interact closely with ECP application teams and other software system teams in a co-design type arrangement. Will you describe for us the dynamics of the collaboration?

Raffenetti: A lot of our collaboration started as part of this series of software technology and application interaction webinars that were put on by ECP. And through that, we got an idea of what the interests of applications were in terms of MPI functionality and features. From there, we kind of cherry-picked a few applications that we wanted to follow up on, primarily because they recognized what their needs were, so we had a clearer understanding, and there was an overlap between those needs and the things we were already looking at. I should also say there was overlap in interest among the applications we cherry-picked, so some of the things we wanted to examine would have more impact: we would do it once, and it would be reused by multiple applications.

From there, we followed up again with these applications we had selected on some deep-dive telecons. We got some of the key people from each project and had one-hour discussions to get a better understanding of what their needs were. We talked about communication patterns and other common application use cases. From there, we’ve followed up with a couple of applications on exchanging code, getting together, hacking on bleeding-edge versions of MPI, using them with their codes, and seeing if what we’re working on is giving them any benefit.

Balaji: To add to that, ours being an MPI project, and given that almost all applications use MPI, there was a lot of interest from many applications in collaborating in some way with us, given their requirements and needs. Unfortunately, given the size of our team, we have to cherry-pick. We can’t look at all ECP applications, mainly because of developer bandwidth limitations. So, we pick a few of them. We don’t work with all applications that expressed interest, not because they’re not interesting, but because there’s a limit on how many we can work with.

Who is your team working with?

Raffenetti: We’ve been focusing mostly on the applications teams: the CEED [Center for Efficient Exascale Discretizations] project, specifically the Nek5000 folks. We’ve interacted with them quite a bit on some of the low-overhead, lightweight communication aspects of MPI that we’ve been trying to implement for our newest versions. Also, the NWChemEx team: we’ve been working with them on some of the more, I guess you would call it, irregular applications. So some of the improvements we’ve been making are to irregular application communication patterns in MPI, particularly with respect to remote memory access in MPICH. Those are the most fully formed collaborations at this point. With the others that we’ve also engaged so far (LatticeQCD, ACME, HACC, AMReX), I think all of us agree that there’s definitely potential for improvement and collaboration, but we’re still at the initial stages with those teams.

Have you taken advantage of any of ECP’s allocation of computer time?

Balaji: We have not, but that’s mainly because we have other allocations for the same machines we have been using. Again, being an MPI project, for most supercomputing centers we get discretionary time, director discretionary time, to work on our stuff. Optimizing MPI is important to all supercomputers. They tend to give us time anyway, so we didn’t have to go through the ECP allocation, although in the future we might do that.

What milestone accomplishments would you like to highlight for us?

Balaji: We have completed about six milestones so far. I think two of them are particularly interesting, and I would like to talk about them. I think the lightweight communication is especially notable. I’ll let Ken talk about that one.

Raffenetti: With respect to lightweight communication, this was work that came about as we were studying how MPI would perform on newer architectures, some of them with lighter-weight cores than maybe a traditional heavyweight server processor today, and particularly with respect to strong scaling—applications that are scaling out their problem to the maximum number of CPUs on a system, where the communication overhead becomes the limiting factor in how fast you can go. So what we set out to develop was a very low overhead MPI, essentially reducing everywhere we could the software overhead in the MPI stack. We wanted to make it almost as if the user was directly programming to the network hardware if at all possible. In the lightweight communication milestone, we got it so that MPI was only adding some 40 or so instructions between the application and the actual hardware API. This work was very successful.

We worked, again, with the CEED team because strong scaling is very important to them, and they saw a great benefit in the lightweight work. We actually published a paper together at SC17 detailing the successes and how we achieved what we did. And not just the CEED project, but this work is broadly applicable to any other application, because again, this is literally making MPI faster. It doesn’t change any of the semantics, any of the characteristics of the library; it’s just reducing the overhead and making it go faster.
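The strong-scaling argument above can be illustrated with a minimal sketch, assuming a toy cost model that is not from the interview: compute time per step shrinks as the number of ranks grows, while each rank still pays a fixed software overhead per message. The function names and every number below are hypothetical placeholders, not MPICH measurements.

```python
# Toy strong-scaling model. Every number here is a hypothetical placeholder
# chosen to show the trend; none is a measurement of MPICH or any machine.

def step_time(total_work_s, ranks, msgs_per_step, overhead_per_msg_s):
    """Estimated time for one step when the work is split across `ranks`."""
    compute = total_work_s / ranks                      # shrinks as ranks grow
    communication = msgs_per_step * overhead_per_msg_s  # paid by every rank
    return compute + communication


def comm_share(total_work_s, ranks, msgs_per_step, overhead_per_msg_s):
    """Fraction of the step time spent in per-message overhead."""
    communication = msgs_per_step * overhead_per_msg_s
    total = step_time(total_work_s, ranks, msgs_per_step, overhead_per_msg_s)
    return communication / total


if __name__ == "__main__":
    # Same problem (1 second of total work, 10 messages per step) at
    # increasing scale, with a heavier and a lighter per-message overhead.
    for ranks in (1_000, 10_000, 100_000, 1_000_000):
        heavy = comm_share(1.0, ranks, 10, 2e-6)
        light = comm_share(1.0, ranks, 10, 2e-7)
        print(f"{ranks:>9} ranks: overhead share "
              f"{heavy:6.1%} (heavier) vs {light:6.1%} (lighter)")
```

Under these assumptions the overhead share climbs toward 100 percent as the rank count grows, which is why trimming instructions from the MPI stack matters most for strongly scaled runs.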

Balaji: Apart from the application folks—CEED is a big example of a team that takes advantage of our work—a lot of our vendor partners are really excited about that as well. In particular, Intel, Mellanox, as well as Cray, are working very closely with us because they want to have the solution picked up for their production implementations right now. They don’t even want us to make a full release of that. They just want to pick it up right now. It got a lot of interest from the vendors as well. Again, the same reason: it’s just no change of semantics, nothing; everything is just much more lightweight. And we see a lot of benefit from that, so both applications and vendors are very happy with that particular milestone.

The second accomplishment that I want to highlight is with respect to hybrid programming, particularly with threads, that we have been working on. We actually have several milestones in that area. I think we have a total of five milestones: four primary milestones and one evaluation milestone, an application connection milestone. I’ll talk about the initial milestones for that. The idea is how MPI interacts with OpenMP or other threading models that work on the node. And this we have been optimizing a lot; even in the first milestone we have gotten significant improvements. This has gotten a lot of interest from supercomputing centers. For example, our ALCF [Argonne Leadership Computing Facility] has expressed a lot of interest in that, mainly because we did a study together with them on what our applications are using right now. And especially on our Blue Gene/Q machine, we found a larger than expected fraction of jobs that are using hybrid programming with MPI and OpenMP together, where all of the threads make MPI calls simultaneously. And it’s a very interesting model, a very complex model to optimize, but even if we get a five percent improvement on that, it’s a big deal for many of our applications. In some cases, we are able to improve the performance by a factor of two, so that’s a big deal for many applications. Now, this is just the first of the four milestones. There are three other milestones where we expect the performance to increase at least an order of magnitude more compared with where we are today, so we are very excited about that interaction path of how MPI works with OpenMP in the future.

Will you explain for us what MPICH is?

MPICH is an implementation of the MPI standard. And when we say MPI standard, we’re just referring to a PDF document; that’s all it is. It just says that there will be functions, such as MPI send or MPI receive, and when you call MPI send, this is what will happen. Now you can’t run applications with a PDF document; you need to have actual software that implements the MPI functions. MPICH is one such implementation that implements these functions and internally does lots of magic to make it very fast. And one of the cool things about MPICH is that it has a lot of derivative implementations, so Intel MPI, MVAPICH, Cray MPI, Microsoft MPI, all these are essentially MPICH, 99 percent MPICH. They take MPICH and tune it, optimize it for their specific hardware or their specific network or their specific user requirements, but anything we add in MPICH essentially becomes a part of all these derivative implementations.

Your project aims to enable MPICH and its derivatives to run effectively at exascale. What impact will achieving your objective have on science, industry, and national security?

Balaji: Well, MPI is already used by many applications in these domains. So in some sense, whatever we are doing is essentially allowing applications, just the way they stand, to run effectively on the next generation of supercomputers. So that’s one of the big goals. If they don’t have MPI, they just cannot take advantage of the next-generation supercomputers. There are other aspects that we are helping enable with respect to newer problems, newer science problems that these applications are trying to solve, and they’re just not possible today. NWChem, for example: they have been trying to look at large-scale simulations. I think the largest we have done so far is a water-32 molecule quantum simulation. They are trying to do double buckyball simulations, that is, simulations of two carbon-60 molecules. And they just can’t do it today with today’s machines. They want a larger machine, larger scale, better performance, more memory and so on, and they need a lot of features from MPI to be able to scale effectively on that sort of machine, to take advantage of the full machine, and to improve how well they can do the irregular communication that the application itself needs. So they have pressure from both the architecture and the demands of how fast they can send data across to be able to simulate something like this in a reasonable amount of time. I mean, they don’t want to take two months to simulate that. That’s one example.

Other examples are bioinformatics projects that we have, like the ECP CANDLE [CANcer Distributed Learning Environment] project. They have been looking at very communication-intensive simulations. One of the big things they’re looking at is petascale assembly, assembling about a petabyte of data, again, in a reasonable amount of time. Just to be clear, the state of the art today is about 4 terabytes of data analysis that’s been done for bioinformatics. They want to get about a petabyte of data, and that’s a very difficult problem to solve, mainly because of how much communication, how much data transfer happens, not the number of bytes that are transferred but the number of messages that are sent across the wire. And without a fast MPI that can do this sort of communication really fast, you just can’t solve the problem. That science domain just remains unachievable without faster MPI. These are examples of some of the, I should say, short-term goals, I guess in the next five years, ten years maybe. These are the goals that these applications are looking at, but there are obstacles for them, and many other science objectives that would remain unsolved if we didn’t have a faster MPI.
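The point that message count, rather than byte count, can dominate communication cost can be sketched with a simple latency-bandwidth estimate. This is an assumed textbook-style model, not something described in the interview, and the function names, the per-message cost, and the bandwidth figure are hypothetical placeholders.

```python
# Latency-bandwidth estimate: each message pays a fixed cost, and the payload
# is limited by bandwidth. The values used below are hypothetical, not
# measurements of any network or MPI library.

def message_count(total_bytes, msg_bytes):
    """Number of messages needed to move total_bytes (ceiling division)."""
    return -(-total_bytes // msg_bytes)


def transfer_time(total_bytes, msg_bytes, per_msg_cost_s, bandwidth_bytes_per_s):
    """Estimated time to send total_bytes in messages of msg_bytes each."""
    fixed = message_count(total_bytes, msg_bytes) * per_msg_cost_s
    payload = total_bytes / bandwidth_bytes_per_s
    return fixed + payload


if __name__ == "__main__":
    total = 10**9                  # the same 1 GB of data in both cases
    cost, bandwidth = 1e-6, 10e9   # assumed 1 microsecond per message, 10 GB/s
    for msg_bytes in (64, 10**6):
        n = message_count(total, msg_bytes)
        t = transfer_time(total, msg_bytes, cost, bandwidth)
        print(f"{msg_bytes:>8}-byte messages: {n:>10} messages, about {t:.3f} s")
```

With the byte count held fixed, the small-message case is dominated by the per-message term, which is the part a faster MPI implementation can reduce.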

What’s next for Exascale MPI?

Balaji: I think MPI itself as well as the Exascale MPI project still has a long way to go. There are many things that we just didn’t get the time to address in this three-year project, and we are hoping to address in the next revision of this project. For example, we have many new application domains. We got some hint of that in this round of ECP projects, but there are aspects related to nontraditional scientific computing, like machine learning, for example. The CANDLE project is just one of them, but more and more projects are looking at machine learning. How do we address the needs of these new nontraditional application domains in MPI? That’s one of the big things that always keeps coming up, and that’s something we need to look into seriously in the next step of this project.

New architectures are another big thing. Some of the architectures we are working on are prototypes of what we think might come, but there are some quite disruptive architectures that people are already thinking of, like, for example, memory systems with semi-reliability, that don’t have the full level of reliability. You might have some memory that has a full 3D error coding protection, some memory that has only simple memory protection, so things of that sort people are looking at and how MPI will be used in the context of faults is a big thing to look at as well.

And last, but not least, we can do the research on these issues, but if a supercomputing center has to use the software, it must be hardened and should be validated and verified. So we have already started looking, with our supercomputing center partners, at having some sort of better validation and verification for our software. But we could imagine having a larger test suite, for example, for doing better validation, better coverage of corner cases that applications tend to get to; so hardening is a big part of the software work that we need to get to as well.

To summarize—new application domains, new architectures, and hardened software.