Pushing MPI and Open MPI to Exascale

02/15/18

A conversation with Oak Ridge National Laboratory’s David Bernholdt on the Open MPI for Exascale project

David Bernholdt, group leader for computer science research at the Oak Ridge Leadership Computing Facility at Oak Ridge National Laboratory, spoke with Exascale Computing Project (ECP) Communications at SC17 in Denver. Bernholdt is the principal investigator for the Open MPI for Exascale project, which is focusing on the communication infrastructure of MPI, or message-passing interface, an extremely widely used standard for interprocessor communications for parallel computing. This is an edited transcript of our conversation.

What is Open MPI for Exascale all about?

The MPI standard has been around for about 25 years, and it’s continually evolved to meet the needs of the research community. But for exascale, we need to get more out of it and more out of the multiple implementations of MPI. This project is working specifically with what is called Open MPI, which was developed by a consortium of people from the labs, academia, and a number of vendors that use it as the basis for their commercial products.

How is the project doing with respect to achieving milestones?

We’re performing many fairly long-term tasks. Several new proposals are before the MPI Standards Committee, so we’re working on implementations that we will prototype. Then we’ll collaborate with ECP applications to demonstrate the value of the implementations. That’s a fairly extensive process.

We are also involved in a number of other aspects of the MPI standard and implementations in Open MPI that relate to the ability to monitor performance and MPI library operation so we can gain understanding and make adjustments.

In the near term, we’ve done a fairly extensive survey of the ECP community to gain a more detailed understanding of how they’re using MPI, and we presented a paper at a workshop at SC17 reporting the results of that survey. The research community is very interested in the survey results. The reason is that this is one of the few opportunities for all the people involved in developing MPI implementations and the MPI standard to get to see the big picture of many projects really invested in very high-end computing and how they expect to use something like MPI.

We’re now mining the results of the survey to help identify which of the ECP applications and software technology projects are most likely to provide good demonstrations for the various technologies we’re developing. In this coming year, we’ll be delivering a lot of things related to resilience, a big topic in high-performance computing and, particularly, in the MPI world.

For some years now, there have been discussions and proposed standards related to resilience and how to actually implement that in a way that’s effective for the applications but manageable for the implementations to provide. And we have prototypes that we’re refining and making more robust that we’ll be delivering in this coming year.

How are collaboration and integration important to the Open MPI for Exascale project?

Pretty much every milestone that we have involves some kind of collaboration with applications or software technology projects to demonstrate the value of what we’re doing.

In reference to my previous comments about achievements, I’ll add that another big aspect of our work is improving performance and scalability within the Open MPI library. So when we make those improvements, we have a tremendous opportunity to collaborate with various application projects.

We can consider whether an Application Development or Software Technology project is sensitive to a particular area of the library and then work with them to show how the improvements we’ve made in that area impact the performance of their software.

Because applications may make millions or billions of short calls to the MPI library during the course of an execution, even small performance improvements can have a significant overall impact on the application runtime.

Has your research taken advantage of the ECP’s allocation of computer time?

A little bit. We’re trying to improve the level of testing for the Open MPI library, with a particular interest, not surprisingly, in the systems that are relevant to the ECP. We’ve started to deploy these continuous testing infrastructures, of which Open MPI has one of its own. We’ve begun to employ that infrastructure at several of the computing facilities, but they’ve not been tested.

The tests typically don’t take a huge amount of time, but they’re very valuable in helping ensure everything is working as expected and that we’re not breaking things as we’re doing the development. Periodically, we can do such things as at-scale performance tests, which would require our using more of the system allocations.

What is next on the horizon for the Open MPI project?

As I’ve noted, some of our resilience-related work will be coming out, followed by some of the larger efforts over the next 2 years involving really significant proposals that are before the MPI Forum. Since the MPI Forum sets the standards for MPI, we’ll be striving to demonstrate the viability of what we produce.

The MPI community debates a lot of issues and is somewhat contentious, so our being able to spend time with applications projects that are really trying to take things to the highest level will be quite valuable in helping the community resolve what is the best path forward.