Understanding the performance characteristics of exascale applications is necessary for identifying and addressing the barriers to achieving performance goals. This becomes more difficult as the architectures become more complex. The Performance Application Programming Interface (PAPI) provides library and application developers with generic and portable access to low-level performance counters found across the exascale machine, enabling users to see the relationships between software performance and hardware events. These relationships provide a critical step toward improving performance.

Project Details

The Exascale PAPI (Exa-PAPI) project is developing a new C++ PAPI (PAPI++) software package from the ground up that offers a standard interface and methodology for using low-level performance counters in CPUs, GPUs, on/off-chip memory, interconnects, and the I/O system, including energy/power management. PAPI++ builds on classic PAPI functionality and strengthens its path to exascale with a more efficient and flexible software design that takes advantage of C++’s object-oriented nature while preserving the low-overhead monitoring of performance counters and adding an extensive test suite.

In addition to providing hardware counter-based information, a standardizing layer for monitoring software-defined events (SDEs) is being incorporated that exposes the internal behavior of runtime systems and libraries, such as communication and math libraries, to the applications. As a result, the notion of performance events is broadened from strictly hardware-related events to include software-based information.
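The idea of a software-defined event can be illustrated with a minimal sketch. This is not the PAPI SDE API: the `SdeRegistry` class, the `ToySolver` library, and the `toysolver::ITERATIONS` event name are all hypothetical and exist only to show a library publishing an internal counter that a tool then reads by name, the same way it would read a hardware counter.

```python
from typing import Callable, Dict, List


class SdeRegistry:
    """Hypothetical stand-in for a standardizing SDE layer (not the PAPI API)."""

    def __init__(self) -> None:
        self._events: Dict[str, Callable[[], int]] = {}

    def register(self, library: str, event: str, reader: Callable[[], int]) -> None:
        # A library publishes a named event plus a function returning its current value.
        self._events[f"{library}::{event}"] = reader

    def list_events(self) -> List[str]:
        # Tools can discover which software events are available.
        return sorted(self._events)

    def read(self, name: str) -> int:
        # Tools read a software event by name.
        return self._events[name]()


class ToySolver:
    """Hypothetical library that exposes its internal iteration count."""

    def __init__(self, registry: SdeRegistry) -> None:
        self.iterations = 0
        registry.register("toysolver", "ITERATIONS", lambda: self.iterations)

    def solve(self, steps: int) -> None:
        for _ in range(steps):
            self.iterations += 1  # internal behavior that is normally invisible to callers


registry = SdeRegistry()
solver = ToySolver(registry)

before = registry.read("toysolver::ITERATIONS")
solver.solve(25)
after = registry.read("toysolver::ITERATIONS")
measured = after - before
print(registry.list_events(), measured)
```

The application never touches the solver's internals. It only sees a named event, which is what lets software counters sit alongside hardware counters behind one interface.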

Enabling the monitoring of hardware and software events provides more flexibility to developers when capturing performance information.

In summary, the Exa-PAPI team is preparing PAPI to meet the challenges posed by exascale systems by (1) widening its applicability and providing robust support for exascale hardware resources; (2) supporting finer-grain measurement and control of power, thus offering software developers a basic building block for dynamic application optimization under power constraints; (3) extending PAPI to support SDEs; and (4) applying semantic analysis to hardware counters so that application developers can better make sense of the ever-growing list of raw hardware performance events that can be measured during execution.

The team will channel the monitoring capabilities of hardware counters, power usage, and SDEs into a robust PAPI++ software package. PAPI++ is intended to replace PAPI with a more flexible and sustainable software design.

Principal Investigator(s):

Jack Dongarra, University of Tennessee, Knoxville

Progress to date

  • On the software event front, the team finalized and released a new API for exposing any kind of SDE. This extends PAPI’s role so that it becomes the de facto standard for exposing performance-critical events from different software.
  • Because the concept of SDEs is new to PAPI, the team worked closely with developers of different libraries and runtimes that serve as natural targets for the early adoption of the new SDE API. To date, the team has integrated SDEs into the sparse linear algebra library MAGMA, the tensor algebra library TAMM (NWChemEx), the task-scheduling runtime PaRSEC, the compiler-based performance analysis tool BYFL, and the High Performance Conjugate Gradients benchmark.
  • On the hardware counter front, the team developed several new PAPI components, including (1) “rocm” to support performance counters on AMD GPUs; (2) “rocm_smi” to monitor power usage on AMD GPUs, which is also writeable by users (e.g., to reduce power consumption); (3) a new “io” component to expose I/O statistics exported by the Linux kernel; and (4) “powercap_ppc” to support the monitoring and capping of power usage on IBM PowerPC architectures (Power9 and later).
  • Additionally, the team extended PAPI with CAT, the “Counter_Analysis_Toolkit.” This tool assists with native performance counter disambiguation by running micro-benchmarks that probe important aspects of modern CPUs and help classify native performance events.
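As a rough illustration of the kind of data the “io” component deals with, the sketch below parses counters in the format of the Linux per-process file `/proc/self/io`. Which kernel interface the component actually reads is not stated above, so the choice of that file is an assumption, and the `parse_proc_io` and `delta` helpers and the sample values are hypothetical.

```python
from typing import Dict


def parse_proc_io(text: str) -> Dict[str, int]:
    """Parse 'name: value' lines in the format of Linux /proc/<pid>/io."""
    counters: Dict[str, int] = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        value = value.strip()
        if sep and value.isdigit():
            counters[key.strip()] = int(value)
    return counters


def delta(before: Dict[str, int], after: Dict[str, int]) -> Dict[str, int]:
    """Difference between two snapshots, i.e. the I/O done in between."""
    return {name: after[name] - before[name] for name in after if name in before}


# Illustrative snapshots; on Linux, open("/proc/self/io").read() yields the same format.
SNAPSHOT_START = """rchar: 4292
wchar: 0
syscr: 13
syscw: 0
read_bytes: 0
write_bytes: 0
cancelled_write_bytes: 0
"""

SNAPSHOT_END = """rchar: 8388
wchar: 1024
syscr: 21
syscw: 2
read_bytes: 4096
write_bytes: 4096
cancelled_write_bytes: 0
"""

start = parse_proc_io(SNAPSHOT_START)
end = parse_proc_io(SNAPSHOT_END)
io_delta = delta(start, end)
print(io_delta)
```

Taking two snapshots and subtracting them mirrors the usual start/stop pattern for counters: the absolute values matter less than what changed across the measured region.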
