Exascale machines will be highly complex systems that couple multicore processors with accelerators and share a deep, heterogeneous memory hierarchy. Understanding performance bottlenecks within and across the nodes in extreme-scale computer systems is a first step toward mitigating them to improve library and application performance. The HPCToolkit project is providing a suite of software tools that developers need to measure and analyze the performance of their software as it executes on today’s supercomputers and forthcoming exascale systems.
Today’s fastest supercomputers and new exascale systems all employ GPU-accelerated compute nodes. Almost all the computational power of a GPU-accelerated supercomputer comes from the GPUs rather than CPUs. Efficiently running on these GPU-accelerated systems is challenging because they have complex memory hierarchies that include multiple memory technologies with different bandwidth and latency characteristics. Adding to this complexity in GPU-accelerated systems are the nonuniform connections between the different memory spaces and computational elements (e.g., CPU and GPU devices).
Application developers can employ abstractions to hide some of the complexity of these parallel systems. However, performance tools that can observe how software uses each feature of these heterogeneous systems offer an easier path to increased application performance. The purpose of these tools, and specifically of the Exascale Computing Project (ECP) HPCToolkit, is to provide feedback so developers can improve application performance and efficiency. This objective requires that the performance tool appropriately measure many different hardware devices and provide analysis capabilities so developers can assess how well the individual hardware features are being used. To close the feedback loop, the toolkit must turn the measurements into actionable feedback about the application software and libraries to guide developers as they work to improve the performance, efficiency, and scalability of their applications.
The ECP’s HPCToolkit team needed to deliver a production-ready, vendor-agnostic toolkit that can measure application performance on several exascale platforms. The team focused on adding new capabilities to measure and analyze interactions between the application software and key hardware subsystems in extreme-scale platforms, including the GPUs and their complex memory hierarchies in GPU-accelerated compute nodes.
This effort required that the HPCToolkit team enhance their software to incorporate emerging hardware and software interfaces for monitoring code performance on both CPUs and GPUs, thereby extending the capabilities of the HPCToolkit software to better measure and analyze computation, data movement, communication, and I/O as an application executes. The additional specificity provides application developers with more information to pinpoint scalability bottlenecks, quantify resource consumption, and assess inefficiencies.
The team also worked to improve performance attribution inside already-optimized codes and to support large collections of complex node-level programming models. This work included providing information for the vendor-specific programming models used on US Department of Energy (DOE) exascale systems. The team also added support for analyzing GPU binaries to the Dyninst binary analysis toolkit, which other ECP tools also use.
To meet application developer needs, the project team has been working with various ECP teams to ensure that they can leverage HPCToolkit’s capabilities to measure, analyze, attribute, and diagnose performance issues on ECP test beds and forthcoming exascale systems.
HPCToolkit enables application teams to assess performance on GPU-accelerated systems via a robust, vendor-agnostic tool that supports production-level, exascale ECP applications. The success of these ECP efforts will benefit the general scientific community and help increase application performance on other systems as well.
To provide a sustainable foundation for performance measurement and analysis, the project team worked with community stakeholders, including standards committees, vendors, and open-source developers. This work involved improving hardware and software support for measurement and attribution of application performance on extreme-scale parallel systems.
To develop a sustainable, platform-agnostic performance-monitoring toolkit, the project team engaged with various vendors of DOE hardware to improve support for performance measurement in next-generation GPUs.
The team also worked with a variety of software teams to design and integrate new capabilities into operating systems, runtime systems, communication libraries, and application frameworks. These efforts ensure that HPCToolkit can accurately measure and attribute code performance on extreme-scale and other parallel systems.
All these activities and the general availability of the software—including via the Extreme-Scale Scientific Software Stack (E4S)—ensure a robust and vibrant user base and user community.
In recent years, the complexity and diversity of architectures for extreme-scale parallelism have dramatically increased. At the same time, the complexity of applications is also increasing
as developers struggle to exploit billion-way parallelism, map computation onto heterogeneous computing elements, and cope with the growing complexity of memory hierarchies. While library and application developers can employ abstractions to hide some of the complexity of emerging parallel systems, performance tools must assess how software interacts with each hardware component of these systems.
The HPCToolkit project is working to develop performance measurement and analysis tools to enable application, library, runtime, and tool developers to understand where and why their software does not fully exploit hardware resources within and across nodes of current and future parallel systems. To provide a foundation for performance measurement and analysis, the project team is working with community stakeholders, including standards committees, vendors, and open-source developers, to improve hardware and software support for measurement and attribution of application performance on extreme-scale parallel systems.
The HPCToolkit team is focused on influencing the development of hardware and software interfaces for performance measurement and attribution by community stakeholders; developing new capabilities to measure, analyze, and understand the performance of software running on extreme-scale parallel systems; producing a suite of software tools that developers can use to measure and analyze the performance of parallel software as it executes; and working with developers to ensure that HPCToolkit’s capabilities meet their needs. Using emerging hardware and software interfaces for monitoring code performance, the team is working to extend capabilities to measure computation, data movement, communication, and I/O as a program executes to pinpoint scalability bottlenecks, quantify resource consumption, and assess inefficiencies, enabling developers to target sections of their code for performance improvement.
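One concrete way a developer targets a section of code is HPCToolkit’s sampling start/stop hooks, hpctoolkit_sampling_start and hpctoolkit_sampling_stop, which confine measurement to a region of interest. The following is a minimal sketch of that pattern; the stand-in kernel and loop counts are illustrative, and it assumes the program is linked against HPCToolkit’s stub library and launched under hpcrun so that the start/stop calls take effect.

```cpp
#include <hpctoolkit.h>   // hpctoolkit_sampling_start/stop; link with -lhpctoolkit
#include <cstdio>
#include <vector>

// Illustrative stand-in for the computation being studied.
static double compute_step(std::vector<double> &field) {
  double sum = 0.0;
  for (std::size_t i = 0; i < field.size(); ++i) {
    field[i] = 0.5 * (field[i] + sum);
    sum += field[i];
  }
  return sum;
}

int main() {
  std::vector<double> field(1 << 20, 1.0);

  hpctoolkit_sampling_stop();      // exclude setup and warm-up from the profile
  compute_step(field);             // warm-up pass, not measured

  hpctoolkit_sampling_start();     // measure only the main time-step loop
  double result = 0.0;
  for (int step = 0; step < 100; ++step) {
    result = compute_step(field);
  }
  hpctoolkit_sampling_stop();

  std::printf("result = %f\n", result);
  return 0;
}
```

Profiles collected this way are typically attributed back to source code with hpcstruct and hpcprof and then examined in hpcviewer.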
Dyninst is a standalone set of toolkits with platform-agnostic interfaces that deliver high-performance, portable, and feature-rich analysis of binary files across a wide range of heterogeneous CPU and GPU hardware architectures. Tools that leverage Dyninst can analyze binaries on any supported platform without the need for modification or recompilation.
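As a sketch of what those platform-agnostic interfaces look like in practice, the short program below uses Dyninst’s ParseAPI to parse a host binary and walk the functions it recovers, reporting each function’s entry address and basic-block count. It assumes a Dyninst installation; the variable names and output format are illustrative rather than drawn from any particular tool.

```cpp
#include <cstdio>
#include <iterator>

#include "CodeObject.h"   // Dyninst ParseAPI: CodeObject, Function, Block
#include "CodeSource.h"   // SymtabCodeSource
#include "CFG.h"

using namespace Dyninst::ParseAPI;

int main(int argc, char **argv) {
  if (argc != 2) {
    std::fprintf(stderr, "usage: %s <binary>\n", argv[0]);
    return 1;
  }

  // SymtabCodeSource reads the binary's symbol and section information;
  // CodeObject drives control-flow recovery over that code source.
  SymtabCodeSource *source = new SymtabCodeSource(argv[1]);
  CodeObject *co = new CodeObject(source);
  co->parse();   // recover functions, basic blocks, and edges

  // Walk every function the parser found.
  for (auto *func : co->funcs()) {
    long nblocks = std::distance(func->blocks().begin(), func->blocks().end());
    std::printf("%-40s entry=0x%lx blocks=%ld\n",
                func->name().c_str(),
                (unsigned long)func->addr(),
                nblocks);
  }
  return 0;
}
```

Because the analysis operates on the binary itself, the same program works even when source code is not available.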
The ECP effort focused on adding support for analyzing GPU binaries to enhance this binary analysis toolkit.
The project team created platform-independent representations of GPU binaries from AMD, Intel, and NVIDIA and built tooling that reduces analysis time for even the largest applications. In particular, ECP performance analysis tools can now use Dyninst to instrument binaries and attribute CPU and GPU performance measurements back to detailed source code contexts, giving authors of scientific software accurate, actionable guidance for their development efforts.
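The attribution step can be sketched with Dyninst’s SymtabAPI: given a binary and an instruction offset of the kind a profiler records, the lookup below recovers the corresponding source file and line. This is a minimal illustration assuming a Dyninst installation; the command-line handling is a stand-in for however a real tool obtains its addresses.

```cpp
#include <cstdio>
#include <cstdlib>
#include <vector>

#include "Symtab.h"            // Dyninst SymtabAPI
#include "LineInformation.h"   // Statement: file/line records

using namespace Dyninst;
using namespace Dyninst::SymtabAPI;

int main(int argc, char **argv) {
  if (argc != 3) {
    std::fprintf(stderr, "usage: %s <binary> <hex-offset>\n", argv[0]);
    return 1;
  }

  Symtab *symtab = nullptr;
  if (!Symtab::openFile(symtab, argv[1])) {
    std::fprintf(stderr, "cannot open %s\n", argv[1]);
    return 1;
  }

  // Map an instruction offset (e.g., one recorded by a profiler)
  // back to the source statements it came from.
  Offset offset = std::strtoul(argv[2], nullptr, 16);
  std::vector<Statement::Ptr> statements;
  if (symtab->getSourceLines(statements, offset)) {
    for (auto &st : statements) {
      std::printf("0x%lx -> %s:%u\n",
                  (unsigned long)offset,
                  st->getFile().c_str(),
                  st->getLine());
    }
  } else {
    std::printf("no line information for offset 0x%lx\n", (unsigned long)offset);
  }
  return 0;
}
```

A performance tool combines lookups like this with its measurement data to report hot functions, loops, and lines rather than raw addresses.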
The project team worked with community stakeholders, including standards committees, vendors, and open-source developers, to improve hardware and software support for building performance analysis tools. This work resulted in the first set of binary analysis techniques to support accelerator hardware from multiple vendors.
The Dyninst toolkit is foundational to all of the ECP integration demonstrations, which is why the ECP invested significant funding and software development time.
The project team delivered a portable toolkit for building low-overhead, full-featured tools across a wide variety of post-exascale hardware; it is distributed via E4S and vendor-provided software stacks and as an open-source project used by dozens of teams across the world. This also makes Dyninst available to the global HPC, cloud, and industry communities.
Dyninst has been integrated with multiple ECP client tools, HPCToolkit among them. Because the toolkit’s interfaces do not require users to understand the underlying hardware, a tool that leverages Dyninst can analyze binaries on any supported platform without modification or recompilation. This ensures continued and expanded use of Dyninst on supported GPU accelerators by anyone who wants to understand application performance on their systems, even without access to source code, a capability that is particularly valuable to commercial users and to those who rely on vendor libraries.
As a demonstration of the success of the ECP effort, Dyninst builds and runs successfully on early-access exascale systems. HPCToolkit, for example, uses Dyninst to analyze AMD CPU binaries and attribute performance measurements to source code contexts, including functions, loops, and lines. Such cross-platform binary analysis is essential to understanding and optimizing application performance, ensuring continued use of and need for this toolset on new hardware platforms.