An Up-Close View of the Software that Underpins the Exascale Computing Project

When exascale systems become a reality, the Exascale Computing Project (ECP) will bring to those systems both existing high-performance computing (HPC) software and promising emerging research. Accordingly, one of the objectives of the ECP is to create a production-quality base—a software stack—to support the scientific applications that will run on these systems.

Scientists developing applications for exascale systems depend on an intricate set of software that makes the computing system usable and the job of the application developer easier. The broad services this software provides are often collectively referred to as the software stack.

Virtually all of the ECP software stack developed by the US Department of Energy (DOE) is open-source code, which makes the software broadly available and invites other programmers to contribute once the base capabilities are established. Software provided by the platform vendors, however, often consists of a combination of open-source and proprietary code; the vendors tend to focus on value-adding features that exploit the benefits of their systems.

Although the term stack implies layers, a software stack is simply a collection of software used for a variety of tasks. Even so, walking through how the pieces of software “stack up” offers some useful insight. A good approach is to begin with the software closest to the hardware and progress to the software libraries that supply the mathematical algorithms central to the science mission of the applications.

System Software

System software is responsible for instructing the lowest-level hardware components how to function, and it is often the minimal set of software provided by a vendor when a supercomputer is deployed.

A key example of system software is the operating system, which has unique requirements for running at exascale. ECP system software projects also address topics such as the use of low-level threading libraries for optimally managing parallel compute resources, vendor-neutral interfaces for managing a deepening memory hierarchy, resource managers for placing and executing jobs efficiently across the exascale system, and containers for easier deployment of the software stack.

Linux is an example of a widely used open-source operating system in HPC. A broad set of system software built around the Linux ecosystem supports much of the ECP system software strategy.

Tools

Another area of the software stack is tools, which are used alongside applications to help the developers and end users compile their applications and understand the correctness and performance implications of their implementations.

A compiler is a tool that transforms source code written in a specific high-level language such as C++ or Fortran into machine-readable executable files suitable for running on the computer. Choices that users make in how they write their source code can significantly influence how efficiently the compiler can create machine-readable code, and improvements in compilers can enhance all of the applications.

Performance-analysis tools are a standard part of the HPC programmer’s toolkit and provide deep views into where performance bottlenecks occur on the hardware. Ideally, performance-analysis tools can offer insight into how developers can rethink mapping their application requirements onto an increasingly complex underlying set of architectures.

Debuggers are another important type of tool; as the name implies, they help the developer identify bugs in software. Exascale systems demand improvements to existing debugger technology to manage the extreme scale at which defects must be found and resolved. Debuggers must also understand the complex node architectures expected in exascale systems, including heterogeneity and multilevel memory systems, as well as any new programming models adopted for applications.

Programming Models and Runtimes

The next layer in the software stack consists of programming models and runtimes.

In the context of exascale systems, the programming model primarily provides a way for the applications to express how they intend to run in parallel. Such capability is important because the languages that are commonly used in HPC applications—primarily C++ and Fortran—don’t have built-in language features to efficiently convey the abundance of parallelism that must be exploited.

The most common programming model in use today is generally referred to as MPI+X. MPI is the Message Passing Interface, used for internode distributed-memory communication, and “X” refers to any of several on-node programming models, such as OpenMP, OpenACC, OpenCL, and CUDA, that exploit on-node parallelism through fine-grained shared-memory threading and heterogeneous computing devices such as graphics processing units.
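The two levels of decomposition behind MPI+X can be illustrated with a toy sketch. It uses only the Python standard library, not MPI or OpenMP: the "ranks" are simulated by an ordinary loop, the "X" level by a thread pool, and the function names (block_range, rank_partial_sum, global_sum) are hypothetical names chosen for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def block_range(n_items, n_parts, part):
    """Return the half-open [start, stop) block of n_items owned by `part`."""
    base, extra = divmod(n_items, n_parts)
    start = part * base + min(part, extra)
    stop = start + base + (1 if part < extra else 0)
    return start, stop


def rank_partial_sum(data, rank, n_ranks, n_threads):
    """Work of one simulated rank: split its block across on-node threads."""
    lo, hi = block_range(len(data), n_ranks, rank)
    local = data[lo:hi]  # in a real MPI code each rank holds only this slice

    def thread_sum(t):
        t_lo, t_hi = block_range(len(local), n_threads, t)
        return sum(local[t_lo:t_hi])

    # The "X" level: threads share the rank's memory. (Python threads show
    # the structure only; they do not speed up CPU-bound work.)
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return sum(pool.map(thread_sum, range(n_threads)))


def global_sum(data, n_ranks=4, n_threads=2):
    """Combine per-rank results, standing in for an MPI reduction."""
    return sum(rank_partial_sum(data, r, n_ranks, n_threads)
               for r in range(n_ranks))


if __name__ == "__main__":
    print(global_sum(list(range(1000))))
```

In a production code, the outer loop would be replaced by separate processes exchanging messages, and the thread pool by a model such as OpenMP or CUDA; the block decomposition at each level is the part that carries over.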

OpenMP represents a community standard with the ultimate objective of working effectively across a wide variety of node architectures. Other ECP efforts provide language-based libraries that allow the application to select from a palette of programming models most suitable for a particular platform. Both approaches focus on achieving performance portability, or the ability for an application to run effectively on multiple exascale platforms without the need to maintain multiple versions of the source code.

In addition to building on MPI+X, the ECP is exploring newer programming models primarily embodied in the concept of asynchronous many-task (AMT) models.

AMT programming models show early potential for addressing some of the limitations of traditional MPI+X programs, such as low programmer productivity. They are included in the ECP software stack for ambitious application efforts looking to exploit this new approach.
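As a rough illustration of the task-based idea, the sketch below lets a program declare tasks and their data dependencies, and leaves it to a small scheduler to launch each task as soon as its inputs are ready. This is a minimal sketch built on the Python standard library, not any ECP runtime; run_task_graph and the task-dictionary layout are assumptions made for this example.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_task_graph(tasks, max_workers=4):
    """Run `tasks`, a dict mapping name -> (function, [dependency names]).

    Each function is called with the results of its dependencies, in order.
    Returns a dict mapping every task name to its result.
    """
    results = {}
    pending = dict(tasks)  # tasks not yet submitted
    running = {}           # future -> task name
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while pending or running:
            # Launch every task whose inputs are all available.
            ready = [name for name, (_, deps) in pending.items()
                     if all(d in results for d in deps)]
            for name in ready:
                func, deps = pending.pop(name)
                future = pool.submit(func, *(results[d] for d in deps))
                running[future] = name
            if not running:
                raise ValueError("cyclic or unsatisfiable dependencies")
            # Sleep until at least one running task finishes.
            done, _ = wait(list(running), return_when=FIRST_COMPLETED)
            for future in done:
                results[running.pop(future)] = future.result()
    return results


if __name__ == "__main__":
    graph = {
        "a": (lambda: 2, []),
        "b": (lambda: 3, []),
        "c": (lambda x, y: x * y, ["a", "b"]),
        "d": (lambda z: z + 1, ["c"]),
    }
    print(run_task_graph(graph)["d"])
```

Here tasks "a" and "b" have no dependencies and may run concurrently, while "c" and "d" start only when their inputs exist. The programmer states what depends on what, and the ordering is left to the scheduler.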

Data Management and Workflows

The role of data management and workflows, found next on the software stack, is to enable applications to manage the increasingly complex data storage hierarchy and cope with the input/output (I/O) bottleneck.

Data management extends existing I/O storage libraries, which give applications convenient methods to read complex input data and to store the data they generate. It also creates new ways to manage checkpoint and restart that will alleviate the I/O bottleneck these huge but productivity-critical files can create. Furthermore, data management involves emerging compression technologies that can reduce the amount of data that must be transferred between levels of the storage hierarchy.
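The checkpoint, restart, and compression ideas can be sketched in a few lines. The example below is a toy under stated assumptions, not an ECP library: the state is a small JSON-serializable dictionary, zlib (a lossless, general-purpose compressor) stands in for the specialized compression technologies mentioned above, and the names write_checkpoint, read_checkpoint, and simulate are hypothetical.

```python
import json
import os
import tempfile
import zlib


def write_checkpoint(path, state):
    """Compress `state` (a JSON-serializable dict) and write it to `path`."""
    payload = zlib.compress(json.dumps(state).encode("utf-8"))
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as tmp:
        tmp.write(payload)
    # Rename over the old file so a reader sees either the previous
    # checkpoint or the new one, not a partially written file.
    os.replace(tmp_path, path)
    return len(payload)


def read_checkpoint(path):
    """Load and decompress a checkpoint written by write_checkpoint."""
    with open(path, "rb") as f:
        return json.loads(zlib.decompress(f.read()).decode("utf-8"))


def simulate(path, total_steps, checkpoint_every=10):
    """Toy time-stepping loop that resumes from `path` when it exists."""
    if os.path.exists(path):
        state = read_checkpoint(path)      # restart from the last checkpoint
    else:
        state = {"step": 0, "value": 0.0}  # fresh start
    while state["step"] < total_steps:
        state["value"] += 0.5              # stand-in for real computation
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            write_checkpoint(path, state)
    return state
```

If a run is interrupted, calling simulate again with the same path resumes from the most recent checkpoint rather than from step 0, so at most checkpoint_every steps of work are repeated.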

Workflows are designed to increase end-user productivity and decrease the overall time-to-solution. They embody the concept of effectively using the whole storage hierarchy to help end users tie together a complete end-to-end simulation and analysis, which often requires efficiently handing off large and complex data sets between the simulation and analysis stages.

Data Analysis and Visualization

Related to data management is the task of data analysis and visualization to help make sense of the enormous amounts of data that an exascale application will generate. End users answer scientific questions by turning reams of raw data into actionable information and knowledge.

Increasingly, the standard workflow of writing files to disk for post-processing visualization is complicated by bottlenecks in the storage hierarchy. Techniques such as in situ analysis—in which visualization and analysis are built into the applications and done continuously during the run—are becoming more commonly used in the software stack. These techniques take advantage of new hardware features such as large-capacity nonvolatile memory.
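A minimal sketch of the in situ idea: instead of writing every value to disk and analyzing the files afterward, the simulation loop updates summary statistics in memory as each value is produced. The class RunningStats, the function run_simulation, and the sine-wave "simulation" are hypothetical stand-ins for this illustration; the running mean and variance use Welford's one-pass update.

```python
import math


class RunningStats:
    """One-pass summary statistics, updated as each value is produced."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean
        self.minimum = math.inf
        self.maximum = -math.inf

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)
        self.minimum = min(self.minimum, x)
        self.maximum = max(self.maximum, x)

    @property
    def variance(self):
        """Population variance of the values seen so far."""
        return self._m2 / self.count if self.count else 0.0


def run_simulation(n_steps):
    """Toy simulation whose output is analyzed in memory during the run."""
    stats = RunningStats()
    for step in range(n_steps):
        value = math.sin(0.1 * step)  # stand-in for a computed field value
        stats.update(value)           # analyzed here, never written to disk
    return stats
```

Only the handful of numbers in RunningStats needs to be stored, regardless of how many steps the run takes, which is the property that relieves pressure on the storage hierarchy.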

The visualization tools, much like their application counterparts, must be optimized to use compute resources efficiently, and the ECP is developing a common low-level library in support of several higher-level visualization and analysis tools.

Math Libraries and Frameworks

The final layer of the ECP’s software stack is math libraries and frameworks, which are being developed as primary interfaces to the application codes in support of generalized mathematical techniques common in many applications. Although closely connected to the applications they support, the math libraries and frameworks build on large parts of the rest of the software stack. These reusable components and libraries typically embody computationally expensive algorithms and must be highly scalable, general purpose, and well engineered to integrate easily into the host applications.

Exascale and Beyond

The ultimate vision for the ECP software stack is to provide a large set of production-ready, reusable components with which end users, system operators, and applications can achieve exascale. Furthermore, almost everything being developed in the ECP software stack by the DOE labs is being released as open source to leave a legacy of software for the broader HPC community at exascale and beyond.
