Scientific applications rely on efficient and scalable implementations of numerical operations, such as matrix-vector products and Fourier transforms, to simulate their phenomena of interest. Software libraries are a powerful way to share verified, optimized numerical algorithms and their implementations. The CLOVER project is delivering scalable, portable numerical algorithms to facilitate efficient simulations. The team evolves implementations to run effectively on pre-exascale and exascale systems and adds new capabilities that applications might need.
Mathematical libraries encapsulate the latest results from the mathematics and computer science communities, and many exascale applications rely on these numerical libraries to incorporate the most advanced technologies available in their simulations. Advances in mathematical libraries are necessary for enabling computational science on exascale systems because exascale architectures introduce new complexities that algorithms and their implementations must address to be scalable, efficient, and robust. The CLOVER project is ensuring the healthy functionality of the mathematical libraries on which these applications depend. The libraries supported by the CLOVER project (SLATE, heFFTe, and Ginkgo) span the range from lightweight collections of subroutines with simple application programming interfaces (APIs) to more “end-to-end” integrated environments and provide access to a wide range of algorithms for complex problems.
SLATE provides dense linear algebra operations for large-scale machines with multiple GPU accelerators per node. The team focuses on adding support to SLATE for the most critical workloads required by exascale applications: BLAS, linear systems, least squares, matrix inverses, singular value problems, and eigenvalue problems.
HeFFTe delivers highly efficient fast Fourier transforms (FFTs) for exascale computing. Applications include molecular dynamics, spectrum estimation, fast convolution and correlation, signal modulation, and wireless multimedia applications. HeFFTe implements fast and robust multidimensional FFTs and FFT specializations that target large-scale heterogeneous systems with multicore processors and hardware accelerators.
Ginkgo is an accelerator-focused, production-ready, next-generation sparse linear algebra library that provides scalable preconditioned iterative solvers. To ease adoption and usage, the library employs a uniform interface to all functionality. Separating the algorithms from architecture-specific kernels provides a high level of platform portability and enables Ginkgo to run on all Exascale Computing Project (ECP) exascale systems.
SLATE (Software for Linear Algebra Targeting Exascale) provides fundamental dense linear algebra capabilities including parallel BLAS (basic linear algebra subprograms), norms, linear system, least squares, singular value, and eigenvalue solvers. These dense linear algebra routines are a foundational capability used in many science and engineering applications and are used as building blocks in other math libraries. As such, SLATE seeks to provide a complete linear algebra library offering many capabilities that diverse clients can build upon.
Recognizing the inherent challenges of designing a software package from the ground up, the SLATE project started with (1) a careful analysis of existing and emerging implementation technologies, followed by (2) an initial design that (3) has since solidified. The team will continue to reevaluate and refactor the software as needed.
This analysis motivated the design, implementation, and deployment of software with the objectives described below.
The ultimate objective of SLATE is to replace the venerable ScaLAPACK (Scalable Linear Algebra PACKage) library, which has become the industry standard for dense linear algebra operations in distributed memory environments but is past its end of life and can’t be readily retrofitted to support GPUs. Primarily, SLATE aims to extract the full performance potential and maximum scalability from modern multi-node HPC machines with many cores and multiple GPUs per node. This is accomplished in a portable manner by relying on standards such as MPI and OpenMP.
SLATE also seeks to deliver dense linear algebra capabilities beyond the capabilities of ScaLAPACK, including new features such as communication-avoiding and randomized algorithms, as well as the potential to support variable size tiles and block low-rank compressed tiles.
To be as widely available as possible, SLATE provides several interfaces, including a native C++ interface, C and Fortran interfaces, and a ScaLAPACK-compatible wrapper. All the libraries have Spack, CMake, and makefile builds and are in the E4S and xSDK distributions to ease integration with applications and facilities.
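As an illustration of the native C++ interface, the minimal sketch below creates block-cyclically distributed matrices on a process grid and solves a linear system. The constructor and routine names (slate::Matrix, insertLocalTiles, slate::lu_solve) are taken from SLATE's documented C++ API, but the exact signatures and options should be verified against the current SLATE release.

```cpp
#include <slate/slate.hh>
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    const int64_t n = 10000, nrhs = 100, nb = 256;  // matrix size, right-hand sides, tile size
    const int p = 2, q = 2;                         // 2 x 2 MPI process grid

    // Tiled matrices distributed block-cyclically over the p x q grid.
    slate::Matrix<double> A(n, n,    nb, p, q, MPI_COMM_WORLD);
    slate::Matrix<double> B(n, nrhs, nb, p, q, MPI_COMM_WORLD);
    A.insertLocalTiles();
    B.insertLocalTiles();
    // ... fill local tiles of A and B with application data ...

    // Solve A X = B via LU factorization with partial pivoting; X overwrites B.
    slate::lu_solve(A, B);

    MPI_Finalize();
    return 0;
}
```

The same program structure applies to the other drivers (Cholesky, least squares, eigenvalue, and singular value solvers); only the driver call and matrix types change.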
SLATE improved the performance of several major routines, including the Cholesky, QR, eigenvalue, and singular value decompositions. More components of Cholesky and QR now run on the GPU. For the eigenvalue problem, the team parallelized a major component and accelerated it using GPUs, resulting in significant improvements. The team also added computation of eigenvectors to the eigenvalue solver and is working on the divide-and-conquer algorithm, which exhibits better performance than the QR iteration algorithm for computing eigenvectors.
SLATE has been integrated with several client applications and libraries.
Sustainability is achieved through a flexible design and community engagement. The SLATE team interacts on a regular basis with the OpenMP community, represented in ECP by the SOLLVE project, and with the MPI community, represented in ECP by the OMPI-X project and the Exascale MPI project. The SLATE team also engages the vendor community through contacts at HPE Cray, IBM, Intel, NVIDIA, AMD, and ARM.
The well-established community of ScaLAPACK users effectively guarantees an extensive user base for SLATE. The SLATE team also packaged the BLAS++ and LAPACK++ portability layers as stand-alone libraries for other applications to leverage for portability and modern C++ semantics.
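For example, a dense matrix multiply through the BLAS++ layer might look like the sketch below; the blas::gemm call shown follows the BLAS++ interface, though argument details should be confirmed against the library's documentation.

```cpp
#include <blas.hh>   // BLAS++ header
#include <vector>

int main() {
    // Compute C = alpha * A * B + beta * C with column-major storage.
    const int64_t m = 4, n = 3, k = 5;
    std::vector<double> A(m * k, 1.0), B(k * n, 2.0), C(m * n, 0.0);

    blas::gemm(blas::Layout::ColMajor,
               blas::Op::NoTrans, blas::Op::NoTrans,
               m, n, k,
               1.0, A.data(), m,    // lda = m
                    B.data(), k,    // ldb = k
               0.0, C.data(), m);   // ldc = m
    return 0;
}
```

The same source compiles against any vendor BLAS that BLAS++ wraps, which is what makes the layer useful as a portability building block.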
HeFFTe (highly efficient FFTs for Exascale, pronounced “hefty”) enables multinode and GPU-based multidimensional fast Fourier transform (FFT) capabilities in single- and double-precision.
The ECP heFFTe project developed sustainable, high-performance multidimensional fast Fourier transforms (FFTs) for exascale platforms. Multidimensional FFTs can be implemented as a sequence of low-dimensional FFT operations, an approach whose overall scalability is excellent (linear) when running at large node counts.
The need for scalable multidimensional FFTs motivated the development and main objectives of the heFFTe project to: (1) Collect existing FFT capabilities from ECP application teams; (2) Assess gaps, extend, and make available various FFT capabilities as a sustainable math library; (3) Explore opportunities to build multidimensional FFTs while leveraging on-node concurrency from batched FFT formulations; and (4) Focus on capabilities for exascale platforms.
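As a rough sketch of how an application might call heFFTe, the example below configures a distributed forward 3D transform with the FFTW backend. The box3d and fft3d names follow heFFTe's documented C++ interface; the slab decomposition and problem size are illustrative assumptions, and GPU backends such as cufft or rocfft can be substituted for the backend tag.

```cpp
#include "heffte.h"
#include <mpi.h>
#include <complex>
#include <vector>

// Forward 3D FFT on a 128^3 grid distributed as one z-slab per rank
// (assumes 128 is divisible by the number of ranks).
void forward_fft(MPI_Comm comm) {
    int me, nranks;
    MPI_Comm_rank(comm, &me);
    MPI_Comm_size(comm, &nranks);

    const int nz_local = 128 / nranks;
    heffte::box3d<> inbox({0, 0, me * nz_local}, {127, 127, (me + 1) * nz_local - 1});
    heffte::box3d<> outbox = inbox;  // keep the same layout for input and output

    // The backend tag selects the node-level FFT engine (fftw, cufft, rocfft, onemkl).
    heffte::fft3d<heffte::backend::fftw> fft(inbox, outbox, comm);

    std::vector<std::complex<double>> input(fft.size_inbox());
    std::vector<std::complex<double>> output(fft.size_outbox());
    // ... fill input with the rank-local portion of the data ...

    fft.forward(input.data(), output.data());
}
```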
HeFFTe leverages established but ad hoc software tools that have traditionally been part of application codes but were never extracted as independent, supported libraries. Doing so required mitigating challenges in several areas.
Considered to be one of the top 10 algorithms of the 20th century, the FFT is widely used by the scientific and high-performance computing (HPC) communities. Over the years, this demand has motivated its use in many applications, including molecular dynamics, spectrum estimation, fast convolution and correlation, signal modulation, and many wireless multimedia applications. For example, the distributed 3D FFT is one of the most important routines used in molecular dynamics (MD) computations, and its performance can affect MD scalability.
The importance of the FFT to scientific computing, its broad utilization, and the need for heFFTe's multinode, GPU-based multidimensional FFT capabilities ensure broad and continued use by the HPC, cloud, commercial, and academic computing communities.
Numerical software has an enormous impact on scientific computing because it acts as the gateway middleware that enables many applications to run on state-of-the-art hardware.
The Exascale Computing Project (ECP) Production-ready, Exascale-Enabled Krylov Solvers (PEEKS) effort focused on communication-minimizing Krylov solvers, parallel incomplete factorization routines, and parallel preconditioning techniques, as these building blocks form the numerical core of many complex application codes.
Those looking to solve sparse linear systems will be interested in the separate ECP Ginkgo project, which brought GPU acceleration to Ginkgo, a modern, high-performance sparse linear algebra library for manycore systems.
To provide performance portability, the PEEKS project relied heavily on the Kokkos and Kokkos Kernels libraries as they provide kernels that are performance portable across a variety of platforms, including CPU and GPU (NVIDIA, AMD, Intel). To increase portability, the project worked to reduce reliance on the NVIDIA UVM (Unified Virtual Memory), which is not widely supported.
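The sketch below is not code from the PEEKS packages themselves; it simply illustrates the single-source portability model that Kokkos provides, in which the same vector update compiles to CPU threads or GPU kernels depending on the build configuration.

```cpp
#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
    Kokkos::initialize(argc, argv);
    {
        const int n = 1 << 20;
        // Views are allocated in the default execution space's memory:
        // host memory for an OpenMP build, device memory for CUDA/HIP/SYCL builds.
        Kokkos::View<double*> x("x", n), y("y", n);
        Kokkos::deep_copy(x, 1.0);
        Kokkos::deep_copy(y, 2.0);

        const double a = 0.5;
        // One source expression of y = y + a*x; the backend chosen at build
        // time maps it to CPU threads or a GPU kernel.
        Kokkos::parallel_for("axpy", n, KOKKOS_LAMBDA(const int i) {
            y(i) += a * x(i);
        });
        Kokkos::fence();
    }
    Kokkos::finalize();
    return 0;
}
```

Kokkos Kernels builds on the same abstractions to provide portable dense, sparse, and graph kernels that the PEEKS solver stack can call without architecture-specific code.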
The Ginkgo library design was guided by combining ecosystem extensibility with heavy, architecture-specific kernel optimization using the platform-native languages CUDA (NVIDIA GPUs), HIP (AMD GPUs), DPC++ (Intel GPUs), and OpenMP (Intel/AMD/Arm multicore).
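The following sketch shows how this design surfaces to users, assuming a recent Ginkgo release: only the executor line changes when moving between backends, while the solver construction is identical. The class and builder names follow Ginkgo's documented interface, though stopping-criterion names have varied between versions; file names here are placeholders.

```cpp
#include <ginkgo/ginkgo.hpp>
#include <fstream>

int main() {
    // The executor selects the backend: ReferenceExecutor, OmpExecutor,
    // CudaExecutor, HipExecutor, and DpcppExecutor share one interface,
    // so only this line changes when moving between platforms.
    auto exec = gko::OmpExecutor::create();

    using mtx = gko::matrix::Csr<double, int>;
    using vec = gko::matrix::Dense<double>;

    // Read the system from Matrix Market files (placeholder file names).
    auto A = gko::share(gko::read<mtx>(std::ifstream("A.mtx"), exec));
    auto b = gko::read<vec>(std::ifstream("b.mtx"), exec);
    auto x = gko::read<vec>(std::ifstream("x0.mtx"), exec);

    // Build a CG Krylov solver with an iteration limit and a residual-norm
    // reduction criterion.
    auto solver =
        gko::solver::Cg<double>::build()
            .with_criteria(
                gko::stop::Iteration::build().with_max_iters(1000u).on(exec),
                gko::stop::ResidualNorm<double>::build()
                    .with_reduction_factor(1e-10)
                    .on(exec))
            .on(exec)
            ->generate(A);

    // Older Ginkgo releases pass gko::lend(b), gko::lend(x) here instead.
    solver->apply(b, x);  // x <- approximate solution of A x = b
    return 0;
}
```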
GPU acceleration in these production-quality, scalable packages means that applications can exploit the scalability and performance of newer, exascale-capable hardware. More specifically:
PEEKS provides exascale-enabled capabilities in a robust, production-quality software package, thus ensuring continued use by applications that rely on Tpetra, Belos, Ifpack2, and Amesos2.
Ginkgo provides native support for NVIDIA GPUs, AMD GPUs, and Intel GPUs to ensure successful delivery of scalable Krylov solvers in robust, production-quality software that can be relied on by ECP applications. Ginkgo is part of the Extreme-scale Scientific Software Development Kit (xSDK), which is part of E4S. This ensures both accessibility and exposure to a large user base.