Efficient communication among the compute elements within high-performance computing systems is essential for simulation performance. The Message Passing Interface (MPI) is a community standard developed by the MPI Forum for programming these systems and handling the communication they need. MPI is the de facto programming model for large-scale scientific computing and is available on all large systems. Most of the US Department of Energy's (DOE's) parallel scientific applications that run on pre-exascale systems use MPI. The goal of the Exascale-MPI project is to evolve the MPI standard to fully support the complexity of exascale systems and to deliver MPICH, a reliable, performant implementation of the MPI standard, for these systems.
Supercomputers consist of thousands to tens of thousands of compute nodes, each containing central processing units (CPUs) and graphics processing units (GPUs, accelerators that improve computational efficiency) with their own separate memories. Data often needs to move between the CPUs and GPUs within a node and between CPUs or GPUs across the entire supercomputer. Efficiently supporting this data transfer is a major challenge in high-performance computing (HPC), and scaling it up for new machines is especially difficult because those systems are increasingly complex.
At the same time, the scientists using these machines often do not need to know how the data transfer works. The Message Passing Interface (MPI) provides a standard way to achieve the necessary portability. It is a standard that specifies hundreds of communication functions which, when implemented, allow scientists and developers to write their application once and move it to a different HPC system with good performance.
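To make this concrete, the sketch below shows a minimal MPI program in C in which one process sends an array of values to another. The same source code runs unchanged on a laptop or an exascale system; the MPI library underneath (MPICH, for example) handles the actual data movement.

```c
/* Minimal sketch of an MPI program: rank 0 sends an array of doubles
 * to rank 1. The same source compiles and runs on any system that
 * provides an MPI library such as MPICH. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double buf[4] = {0.0, 1.0, 2.0, 3.0};

    if (rank == 0) {
        /* Send 4 doubles to rank 1 with message tag 0. */
        MPI_Send(buf, 4, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Receive the matching message from rank 0. */
        MPI_Recv(buf, 4, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %g %g %g %g\n", buf[0], buf[1], buf[2], buf[3]);
    }

    MPI_Finalize();
    return 0;
}
```

With an MPICH installation, a program like this is typically compiled with the mpicc wrapper and launched with mpiexec (for example, mpiexec -n 2 ./a.out).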
While MPI defines the standard, the Exascale Computing Project's (ECP's) MPICH library implements it. One of the most popular implementations of MPI, MPICH turns the standard into a library that programmers on an HPC machine can use. MPICH addresses many of the challenges in scaling up data transfer, including performance and memory usage, to ensure that the MPI standard is implemented efficiently on new machines.
Almost all ECP projects rely on MPI for interprocess communication, and all three Department of Energy exascale computing systems—Frontier at Oak Ridge, Aurora at Argonne, and El Capitan at Lawrence Livermore—use an MPICH-based MPI implementation. The vendors of these systems, Intel and HPE/Cray, provide an optimized version of MPICH for their customers.
One of the goals of ECP was to build exascale supercomputers and enable applications to run efficiently on these new systems. MPICH directly supports this goal.
Although MPI will continue to be a viable programming model on exascale systems, the MPI standard and its implementations must address the challenges posed by the increased scale, performance characteristics, evolving architectural features, and complexity of exascale systems, and they must support the capabilities and requirements of the applications that will run on these systems.
Therefore, this project addresses five key challenges to deliver a performant MPICH implementation: (1) scalability and performance on complex architectures that include, for example, high core counts, processor heterogeneity, and heterogeneous memory; (2) interoperability with highly threaded intranode programming models, such as OpenMP, OpenACC, and emerging asynchronous task models (illustrated in the sketch below); (3) software overheads that are exacerbated by lightweight cores and low-latency networks; (4) extensions to the MPI standard based on experience with applications and with high-level libraries and frameworks targeted at exascale; and (5) topics that become more significant on exascale architectures, such as memory usage, power usage, and resilience.
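To illustrate challenge (2), the sketch below shows the standard handshake a hybrid MPI+OpenMP code performs: it requests MPI_THREAD_MULTIPLE so that any OpenMP thread may call MPI and checks the thread-support level the library actually provides.

```c
/* Sketch of MPI/OpenMP interoperability: request full thread support
 * so that any OpenMP thread may make MPI calls. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    if (provided < MPI_THREAD_MULTIPLE) {
        /* The library offers a weaker level (e.g., MPI_THREAD_FUNNELED);
         * in that case, restrict MPI calls to the main thread. */
        printf("MPI_THREAD_MULTIPLE not available; provided = %d\n", provided);
    }

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        /* With MPI_THREAD_MULTIPLE, each thread could issue its own
         * MPI calls from inside this parallel region. */
        printf("rank %d, thread %d\n", rank, omp_get_thread_num());
    }

    MPI_Finalize();
    return 0;
}
```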
The MPICH development effort continues to address several key challenges, such as performance and scalability, heterogeneity, hybrid programming, topology awareness, and fault tolerance. Several additional features are being developed to support the exascale machines being deployed, including (1) support for multiple accelerator models, enabling data transfers between GPU accelerators and the communication network even in cases in which native hardware support is lacking, and (2) offline and online performance tuning based on static and dynamic system configurations, respectively.
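Accelerator support of this kind is commonly exposed to applications as "GPU-aware" MPI: the program passes a device pointer directly to MPI, and the library chooses the transfer path, using native hardware support such as GPU-direct transfers where it exists and staging data through host memory where it does not. The sketch below assumes a CUDA device and a GPU-aware MPICH build; the buffer size and names are illustrative.

```c
/* Sketch of GPU-aware point-to-point communication (assumes a
 * GPU-aware MPICH build with CUDA support). */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;
    double *d_buf;                               /* device (GPU) buffer */
    cudaMalloc((void **)&d_buf, n * sizeof(double));

    if (rank == 0) {
        /* The device pointer is handed directly to MPI; the library
         * decides whether to use GPU-direct transfers or to stage the
         * data through host memory. */
        MPI_Send(d_buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(d_buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```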
The team will also produce a significantly larger test suite to stress test various MPI use cases, as well as a test generation toolkit that automatically profiles an application's MPI usage via the MPI profiling interface and generates a simple test program that reproduces the application's MPI communication pattern, covering basic MPI features, sanitized iterative loops, memory buffer management, and incomplete executions. These activities will help improve the reliability and performance of the MPICH implementation and of other MPI implementations as they evolve.
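The MPI profiling interface that such a toolkit relies on lets a tool intercept MPI calls without modifying the application: every MPI function also has a PMPI_-prefixed entry point, so a profiling layer can define its own MPI_Send, record what it observed, and forward the call. The sketch below is a minimal, hypothetical interception layer of the kind a test generation toolkit could build on.

```c
/* Minimal sketch of the MPI profiling interface (PMPI): intercept
 * MPI_Send, record the message size, and forward to the real
 * implementation. Link this object ahead of the MPI library. */
#include <mpi.h>
#include <stdio.h>

static long long send_calls = 0;

int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    int type_size;
    MPI_Type_size(datatype, &type_size);

    send_calls++;
    fprintf(stderr, "MPI_Send #%lld: %d bytes to rank %d (tag %d)\n",
            send_calls, count * type_size, dest, tag);

    /* Forward to the underlying implementation. */
    return PMPI_Send(buf, count, datatype, dest, tag, comm);
}
```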
The team will continue to engage with the MPI Forum to ensure that future MPI standards meet the needs of ECP and broader DOE applications. To achieve good performance on exascale machines, the team plans to develop new MPI features for application-specific requirements, such as alternative fault tolerance models and reduction neighborhood collectives, either through inclusion in the standard or as extensions to it.
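For context, the neighborhood collectives already in the MPI standard perform a collective operation only among the neighbors defined by a process topology; the reduction variants mentioned above would extend this class. The sketch below uses the existing MPI_Neighbor_allgather on a one-dimensional periodic Cartesian topology.

```c
/* Sketch of an existing MPI-3 neighborhood collective on a 1-D periodic
 * ring topology: each rank gathers one value from each of its two
 * neighbors. Proposed reduction variants would follow the same
 * topology-driven pattern. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int size, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* 1-D periodic ring: each rank has a left and a right neighbor. */
    int dims[1] = {size}, periods[1] = {1};
    MPI_Comm ring;
    MPI_Cart_create(MPI_COMM_WORLD, 1, dims, periods, 0, &ring);

    int sendval = rank;
    int recvvals[2];                 /* one value from each neighbor */
    MPI_Neighbor_allgather(&sendval, 1, MPI_INT,
                           recvvals, 1, MPI_INT, ring);

    printf("rank %d received %d and %d from its neighbors\n",
           rank, recvvals[0], recvvals[1]);

    MPI_Comm_free(&ring);
    MPI_Finalize();
    return 0;
}
```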