Argo

Operating systems provide necessary functionality, such as allocating memory and spawning processes, to libraries and applications, and they manage the resources on the nodes of an exascale system. The Argo project is augmenting and optimizing existing operating system and runtime components and building portable, open-source system software that improves performance and scalability and provides increased functionality to exascale libraries, applications, and runtime systems, with a focus on resource management, memory management, and power management.

Project Details

Many exascale applications have a complex runtime structure, ranging from in situ data analysis and ensembles of largely independent sub-jobs to arbitrarily complex workflow structures. To meet the emerging needs of exascale workloads while providing optimal performance and resilience, compute, memory, and interconnect resources must be managed in cooperation with applications, libraries, and runtime systems. Argo's goal is to augment and optimize low-level system software components for use in production exascale systems, providing portable, open-source, integrated software that improves the performance and scalability of, and offers increased functionality to, exascale applications, libraries, and runtime systems. The project focuses on resource management, memory management, and power management.

The Argo team is delivering resource management infrastructure to coordinate the static allocation and dynamic management of node resources such as processor cores, memory, and caches. The infrastructure supports multiple resource management policies suited to a variety of application workloads. By taking care of system-specific aspects, such as topology mapping and the partitioning of massively parallel resources, it will improve the performance and portability of exascale applications, libraries, and their runtimes.
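As a concrete, simplified illustration of the system-specific work this infrastructure hides from applications, the sketch below uses hwloc, the portable topology library that node-level resource managers commonly build on, to discover a node's cores and bind the calling process to one of them. It illustrates the underlying mechanism only and is not Argo's own interface.

    #include <hwloc.h>
    #include <stdio.h>

    int main(void)
    {
        hwloc_topology_t topo;

        /* Discover the node's hardware topology (packages, cores, caches). */
        hwloc_topology_init(&topo);
        hwloc_topology_load(topo);

        int ncores = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
        printf("cores on this node: %d\n", ncores);

        /* Bind the current process to the first core as a toy "partition";
           a resource manager would instead carve out core/cache/memory sets
           per workload component according to its policy. */
        hwloc_obj_t core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, 0);
        if (core != NULL)
            hwloc_set_cpubind(topo, core->cpuset, HWLOC_CPUBIND_PROCESS);

        hwloc_topology_destroy(topo);
        return 0;
    }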

Memory management libraries are being developed to provide flexible and portable memory management mechanisms that make it easier to obtain high performance. One approach incorporates nonvolatile memory into complex memory hierarchies by using a memory map; another provides explicit, application-aware memory management for deep memory systems. These libraries will directly support new applications that analyze large, distributed datasets and make it easier to program heterogeneous hardware resources.
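The memory-map approach can be pictured with UMap's mmap-like entry points. The following is a minimal sketch: umap() and uunmap() mirror the mmap()/munmap() signatures exposed in UMap's public header, while the UMAP_PRIVATE flag, the UMAP_FAILED sentinel, and the file path are assumptions drawn from typical usage rather than guaranteed constants.

    #include <umap/umap.h>
    #include <sys/mman.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Map a large file (e.g., on node-local nonvolatile memory) and fault
       pages in on demand through UMap's user-space handler instead of the
       kernel's page cache. */
    int main(void)
    {
        const uint64_t length = 1ULL << 30;               /* 1 GiB region */
        int fd = open("/mnt/nvm/dataset.bin", O_RDONLY);  /* illustrative path */
        if (fd < 0) { perror("open"); return 1; }

        /* umap() mirrors mmap(); flag/sentinel names here are assumptions. */
        double *data = (double *)umap(NULL, length, PROT_READ, UMAP_PRIVATE, fd, 0);
        if (data == UMAP_FAILED) { perror("umap"); close(fd); return 1; }

        double sum = 0.0;
        for (uint64_t i = 0; i < length / sizeof(double); i += 4096 / sizeof(double))
            sum += data[i];                /* each first touch triggers the handler */
        printf("sample sum: %f\n", sum);

        uunmap((void *)data, length);
        close(fd);
        return 0;
    }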

The Argo team is also delivering a fully integrated, end-to-end infrastructure for power and performance management, including power-aware plugins for resource managers, workflow managers, and job-level runtimes, as well as a vendor-neutral power control library. This infrastructure directly addresses the challenge of managing the performance of exascale applications on highly power-constrained systems.
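Below is a minimal sketch of vendor-neutral power monitoring and capping, assuming Variorum's node-level entry points variorum_print_power() and variorum_cap_best_effort_node_power_limit(); the 150 W value is arbitrary, and the exact API should be checked against the Variorum release in use.

    #include <variorum.h>
    #include <stdio.h>

    /* Read node power telemetry and apply a best-effort node power cap
       through a vendor-neutral interface. */
    int main(void)
    {
        /* Print current power readings for the sockets/devices on this node. */
        if (variorum_print_power() != 0) {
            fprintf(stderr, "variorum: could not read power\n");
            return 1;
        }

        /* Ask the platform to cap node power at 150 W; "best effort" because
           the hardware enforces the limit in its own control loop. */
        if (variorum_cap_best_effort_node_power_limit(150) != 0) {
            fprintf(stderr, "variorum: could not set power cap\n");
            return 1;
        }
        return 0;
    }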

Principal Investigator(s):

Pete Beckman, Argonne National Laboratory

Collaborators:

Argonne National Laboratory, Lawrence Livermore National Laboratory

Progress to date

  • The Argo team developed an initial version of the Node Resource Manager, which provides high-level control of node resources. It was subsequently enhanced with an interface that enables applications to report their progress, improving resource optimization (a sketch of this reporting interface appears after this list). Multiple resource policies suited to different workload types were developed and are managed using a machine learning-based technique.
  • Multiple versions of UMap, a user-space memory map page fault handler for nonvolatile memory, have been released. Recent progress includes adding a sparse multifile backing store interface, enhancing the handler code to enable the delivery of memory pages over the network, and integrating UMap into graph processing and metagenomics applications.
  • The team developed AML, a memory library for explicitly managing deep memory architectures. Recently, the team implemented application-ready high-level abstractions based on a newly developed interface for querying performance information about the memory topology, then integrated them with a Monte Carlo mini-app.
  • The team released the first version of Variorum, a vendor-neutral library for power monitoring and control. Subsequent releases added expanded architecture support and new features. The team added support for using Variorum with GEOPM, Flux, Kokkos, and SLURM; deployed the resulting power stack on multiple clusters; and tested all components at different scales.
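The progress-reporting interface mentioned in the first bullet can be pictured roughly as follows. The nrm_-prefixed names are illustrative placeholders, stubbed out so the example runs stand-alone, not the library's verbatim API: the application periodically reports completed work units, and the node resource manager uses that feedback to adjust core, cache, or power allocations.

    #include <stdio.h>

    /* Illustrative placeholders: stand-ins for the progress-reporting calls a
       node resource manager exposes to applications (not the real libnrm names). */
    struct nrm_context { int connected; };

    static struct nrm_context *nrm_ctxt_create(void)
    {
        static struct nrm_context c = { 1 };
        return &c;
    }

    static void nrm_send_progress(struct nrm_context *ctxt, double units)
    {
        /* In the real library this would send a message to the node-local
           resource manager daemon; here we just log it. */
        (void)ctxt;
        printf("progress: %.0f unit(s) completed\n", units);
    }

    int main(void)
    {
        struct nrm_context *ctxt = nrm_ctxt_create();
        for (int step = 0; step < 10; step++) {
            /* ... one application timestep would run here ... */
            nrm_send_progress(ctxt, 1);  /* tell the resource manager we advanced */
        }
        return 0;
    }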
