Scientific applications rely on efficient and scalable implementations of numerical operations, such as matrix-vector products and Fourier transforms, to simulate their phenomena of interest. Software libraries are a powerful way to share verified, optimized numerical algorithms and their implementations. The CLOVER project is delivering scalable, portable numerical algorithms to facilitate efficient simulations. The team evolves implementations to run effectively on pre-exascale and exascale systems and adds new capabilities that applications might need.
Mathematical libraries encapsulate the latest results from the mathematics and computer science communities, and many exascale applications rely on these numerical libraries to incorporate the most advanced technologies available in their simulations. Advances in mathematical libraries are necessary for enabling computational science on exascale systems because exascale architectures introduce new complexities that algorithms and their implementations must address to be scalable, efficient, and robust. The CLOVER project is ensuring the healthy functionality of the mathematical libraries on which these applications depend. The libraries supported by the CLOVER project (SLATE, heFFTe, and Ginkgo) span the range from lightweight collections of subroutines with simple application programming interfaces (APIs) to more “end-to-end” integrated environments and provide access to a wide range of algorithms for complex problems.
SLATE provides dense linear algebra operations for large-scale machines with multiple GPU accelerators per node. The team focuses on adding support to SLATE for the most critical workloads required by exascale applications: BLAS, linear systems, least squares, matrix inverses, singular value problems, and eigenvalue problems.
HeFFTe delivers highly efficient fast Fourier transforms (FFTs) for exascale computing. Applications include molecular dynamics, spectrum estimation, fast convolution and correlation, signal modulation, and wireless multimedia applications. HeFFTe implements fast and robust multidimensional FFTs and FFT specializations that target large-scale heterogeneous systems with multicore processors and hardware accelerators.
Ginkgo is an accelerator-focused, production-ready, next-generation sparse linear algebra library that provides scalable preconditioned iterative solvers. To ease adoption and usage, the library employs a uniform interface to all functionality. Separating the algorithms from architecture-specific kernels provides a high level of platform portability and enables Ginkgo to run on all Exascale Computing Project (ECP) exascale systems.
SLATE (Software for Linear Algebra Targeting Exascale) provides fundamental dense linear algebra capabilities including parallel BLAS (basic linear algebra subprograms), norms, linear system, least squares, singular value, and eigenvalue solvers. These dense linear algebra routines are a foundational capability used in many science and engineering applications and are used as building blocks in other math libraries. As such, SLATE seeks to provide a complete linear algebra library offering many capabilities that diverse clients can build upon.
Recognizing the inherent challenges of designing a software package from the ground up, the SLATE project started with (1) a careful analysis of existing and emerging implementation technologies, followed by (2) an initial design that (3) has since solidified. The team will continue to reevaluate and refactor the software as needed.
This analysis motivated the design, implementation, and deployment of software with the objectives described below.
The ultimate objective of SLATE is to replace the venerable ScaLAPACK (Scalable Linear Algebra PACKage) library, which has become the industry standard for dense linear algebra operations in distributed memory environments but is past its end of life and can’t be readily retrofitted to support GPUs. Primarily, SLATE aims to extract the full performance potential and maximum scalability from modern multi-node HPC machines with many cores and multiple GPUs per node. This is accomplished in a portable manner by relying on standards such as MPI and OpenMP.
SLATE also seeks to deliver dense linear algebra capabilities beyond the capabilities of ScaLAPACK, including new features such as communication-avoiding and randomized algorithms, as well as the potential to support variable size tiles and block low-rank compressed tiles.
To be as widely available as possible, SLATE provides several interfaces, including a native C++ interface, C and Fortran interfaces, and a ScaLAPACK-compatible wrapper. All the libraries have Spack, CMake, and makefile builds and are in the E4S and xSDK distributions to ease integration with applications and facilities.
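As an illustration of the native C++ interface, the minimal sketch below creates block-cyclically distributed matrices on a process grid and solves a linear system. The constructor and routine names (slate::Matrix, insertLocalTiles, slate::lu_solve) are taken from SLATE's documented C++ API, but the exact signatures and options should be verified against the current SLATE release.

```cpp
#include <slate/slate.hh>
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    const int64_t n = 10000, nrhs = 100, nb = 256;  // matrix size, right-hand sides, tile size
    const int p = 2, q = 2;                         // 2 x 2 MPI process grid

    // Tiled matrices distributed block-cyclically over the p x q grid.
    slate::Matrix<double> A(n, n,    nb, p, q, MPI_COMM_WORLD);
    slate::Matrix<double> B(n, nrhs, nb, p, q, MPI_COMM_WORLD);
    A.insertLocalTiles();
    B.insertLocalTiles();
    // ... fill local tiles of A and B with application data ...

    // Solve A X = B via LU factorization with partial pivoting; X overwrites B.
    slate::lu_solve(A, B);

    MPI_Finalize();
    return 0;
}
```

The same program structure applies to the other drivers (Cholesky, least squares, eigenvalue, and singular value solvers); only the driver call and matrix types change.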
SLATE improved the performance of several major routines, including the Cholesky, QR, eigenvalue, and singular value decompositions. More components of Cholesky and QR now run on the GPU. For the eigenvalue problem, the team parallelized a major component and accelerated it using GPUs, resulting in significant improvements. The team also added computation of eigenvectors to the eigenvalue solver and is working on the divide-and-conquer algorithm, which exhibits better performance than the QR iteration algorithm for computing eigenvectors.
SLATE has been integrated with several client applications and libraries.
Sustainability is achieved through a flexible design and community engagement. The SLATE team interacts on a regular basis with the OpenMP community, represented in ECP by the SOLLVE project, and with the MPI community, represented in ECP by the OMPI-X project and the Exascale MPI project. The SLATE team also engages the vendor community through contacts at HPE Cray, IBM, Intel, NVIDIA, AMD, and ARM.
The well-established community of ScaLAPACK users effectively guarantees an extensive user base for SLATE. The SLATE team also packaged the BLAS++ and LAPACK++ portability layers as stand-alone libraries for other applications to leverage for portability and modern C++ semantics.
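For example, a dense matrix multiply through the BLAS++ layer might look like the sketch below; the blas::gemm call shown follows the BLAS++ interface, though argument details should be confirmed against the library's documentation.

```cpp
#include <blas.hh>   // BLAS++ header
#include <vector>

int main() {
    // Compute C = alpha * A * B + beta * C with column-major storage.
    const int64_t m = 4, n = 3, k = 5;
    std::vector<double> A(m * k, 1.0), B(k * n, 2.0), C(m * n, 0.0);

    blas::gemm(blas::Layout::ColMajor,
               blas::Op::NoTrans, blas::Op::NoTrans,
               m, n, k,
               1.0, A.data(), m,    // lda = m
                    B.data(), k,    // ldb = k
               0.0, C.data(), m);   // ldc = m
    return 0;
}
```

The same source compiles against any vendor BLAS that BLAS++ wraps, which is what makes the layer useful as a portability building block.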
HeFFTe (highly efficient FFTs for Exascale, pronounced “hefty”) enables multinode and GPU-based multidimensional fast Fourier transform (FFT) capabilities in single- and double-precision.
The ECP heFFTe project developed sustainable, high-performance multidimensional fast Fourier transforms (FFTs) for exascale platforms. Multidimensional FFTs can be implemented as a sequence of low-dimensional FFT operations, an approach whose overall scalability is excellent (linear) when running at large node counts.
The need for scalable multidimensional FFTs motivated the development and main objectives of the heFFTe project to: (1) Collect existing FFT capabilities from ECP application teams; (2) Assess gaps, extend, and make available various FFT capabilities as a sustainable math library; (3) Explore opportunities to build multidimensional FFTs while leveraging on-node concurrency from batched FFT formulations; and (4) Focus on capabilities for exascale platforms.
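As a rough sketch of how an application might call heFFTe, the example below configures a distributed forward 3D transform with the FFTW backend. The box3d and fft3d names follow heFFTe's documented C++ interface; the slab decomposition and problem size are illustrative assumptions, and GPU backends such as cufft or rocfft can be substituted for the backend tag.

```cpp
#include "heffte.h"
#include <mpi.h>
#include <complex>
#include <vector>

// Forward 3D FFT on a 128^3 grid distributed as one z-slab per rank
// (assumes 128 is divisible by the number of ranks).
void forward_fft(MPI_Comm comm) {
    int me, nranks;
    MPI_Comm_rank(comm, &me);
    MPI_Comm_size(comm, &nranks);

    const int nz_local = 128 / nranks;
    heffte::box3d<> inbox({0, 0, me * nz_local}, {127, 127, (me + 1) * nz_local - 1});
    heffte::box3d<> outbox = inbox;  // keep the same layout for input and output

    // The backend tag selects the node-level FFT engine (fftw, cufft, rocfft, onemkl).
    heffte::fft3d<heffte::backend::fftw> fft(inbox, outbox, comm);

    std::vector<std::complex<double>> input(fft.size_inbox());
    std::vector<std::complex<double>> output(fft.size_outbox());
    // ... fill input with the rank-local portion of the data ...

    fft.forward(input.data(), output.data());
}
```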
HeFFTe leverages established but ad hoc software tools that have traditionally been part of application codes but were never extracted as independent, supported libraries. Doing so required mitigating challenges in several areas.
Considered to be one of the top 10 algorithms of the 20th century, the FFT is widely used by the scientific and high-performance computing (HPC) communities. Over the years, this demand has motivated its use in many applications, including molecular dynamics, spectrum estimation, fast convolution and correlation, signal modulation, and many wireless multimedia applications. For example, the distributed 3D FFT is one of the most important routines used in molecular dynamics (MD) computations, and its performance can affect MD scalability.
The importance of the FFT to scientific computing, its broad utilization, and the need for heFFTe's multinode, GPU-based multidimensional FFT capabilities ensure broad and continued use by the HPC, cloud, commercial, and academic computing communities.
Numerical software has an enormous impact on scientific computing because it acts as the gateway middleware that enables many applications to run on state-of-the-art hardware.
The Exascale Computing Project (ECP) Production-ready, Exascale-Enabled Krylov Solvers (PEEKS) effort focused on communication-minimizing Krylov solvers, parallel incomplete factorization routines, and parallel preconditioning techniques, as these building blocks form the numerical core of many complex application codes.
Those looking to solve sparse linear systems will be interested in the separate ECP Ginkgo project, which brought GPU acceleration to Ginkgo, a modern, high-performance sparse linear algebra library for manycore systems.
To provide performance portability, the PEEKS project relied heavily on the Kokkos and Kokkos Kernels libraries as they provide kernels that are performance portable across a variety of platforms, including CPU and GPU (NVIDIA, AMD, Intel). To increase portability, the project worked to reduce reliance on the NVIDIA UVM (Unified Virtual Memory), which is not widely supported.
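The sketch below is not code from the PEEKS packages themselves; it simply illustrates the single-source portability model that Kokkos provides, in which the same vector update compiles to CPU threads or GPU kernels depending on the build configuration.

```cpp
#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
    Kokkos::initialize(argc, argv);
    {
        const int n = 1 << 20;
        // Views are allocated in the default execution space's memory:
        // host memory for an OpenMP build, device memory for CUDA/HIP/SYCL builds.
        Kokkos::View<double*> x("x", n), y("y", n);
        Kokkos::deep_copy(x, 1.0);
        Kokkos::deep_copy(y, 2.0);

        const double a = 0.5;
        // One source expression of y = y + a*x; the backend chosen at build
        // time maps it to CPU threads or a GPU kernel.
        Kokkos::parallel_for("axpy", n, KOKKOS_LAMBDA(const int i) {
            y(i) += a * x(i);
        });
        Kokkos::fence();
    }
    Kokkos::finalize();
    return 0;
}
```

Kokkos Kernels builds on the same abstractions to provide portable dense, sparse, and graph kernels that the PEEKS solver stack can call without architecture-specific code.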
The Ginkgo library design was guided by combining ecosystem extensibility with heavy, architecture-specific kernel optimization using the platform-native languages CUDA (NVIDIA GPUs), HIP (AMD GPUs), DPC++ (Intel GPUs), and OpenMP (Intel/AMD/Arm multicore).
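The following sketch shows how this design surfaces to users, assuming a recent Ginkgo release: only the executor line changes when moving between backends, while the solver construction is identical. The class and builder names follow Ginkgo's documented interface, though stopping-criterion names have varied between versions; file names here are placeholders.

```cpp
#include <ginkgo/ginkgo.hpp>
#include <fstream>

int main() {
    // The executor selects the backend: ReferenceExecutor, OmpExecutor,
    // CudaExecutor, HipExecutor, and DpcppExecutor share one interface,
    // so only this line changes when moving between platforms.
    auto exec = gko::OmpExecutor::create();

    using mtx = gko::matrix::Csr<double, int>;
    using vec = gko::matrix::Dense<double>;

    // Read the system from Matrix Market files (placeholder file names).
    auto A = gko::share(gko::read<mtx>(std::ifstream("A.mtx"), exec));
    auto b = gko::read<vec>(std::ifstream("b.mtx"), exec);
    auto x = gko::read<vec>(std::ifstream("x0.mtx"), exec);

    // Build a CG Krylov solver with an iteration limit and a residual-norm
    // reduction criterion.
    auto solver =
        gko::solver::Cg<double>::build()
            .with_criteria(
                gko::stop::Iteration::build().with_max_iters(1000u).on(exec),
                gko::stop::ResidualNorm<double>::build()
                    .with_reduction_factor(1e-10)
                    .on(exec))
            .on(exec)
            ->generate(A);

    // Older Ginkgo releases pass gko::lend(b), gko::lend(x) here instead.
    solver->apply(b, x);  // x <- approximate solution of A x = b
    return 0;
}
```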
GPU acceleration in these production-quality, scalable packages means that applications can exploit the scalability and performance of newer, exascale-capable hardware. More specifically:
PEEKS provides exascale-enabled capabilities in a robust, production-quality software package, thus ensuring continued use by applications that rely on Tpetra, Belos, Ifpack2, and Amesos2.
Ginkgo provides native support for NVIDIA GPUs, AMD GPUs, and Intel GPUs to ensure successful delivery of scalable Krylov solvers in robust, production-quality software that can be relied on by ECP applications. Ginkgo is part of the Extreme-scale Scientific Software Development Kit (xSDK), which is part of E4S. This ensures both accessibility and exposure to a large user base.