LLNL NNSA

Project Details

The National Nuclear Security Administration (NNSA) supports the development of open-source software technologies that are important to the success of national security applications and are externally impactful for the rest of the Exascale Computing Project (ECP) and the broader community. These software technologies are managed as part of a larger Advanced Simulation and Computing (ASC) portfolio, which provides resources to develop and apply these technologies to issues important to national security. The software technologies at Lawrence Livermore National Laboratory (LLNL) span programming models and runtimes (RAJA/Umpire/CHAI), development tools (Debugging @ Scale), mathematical libraries (MFEM), productivity technologies (DevRAMP), and workflow scheduling (Flux/Power).

The RAJA team is providing software libraries that enable application and library developers to meet advanced architecture portability challenges. The project goals are to enable developers to write performance-portable computational kernels and to coordinate complex heterogeneous memory resources among components in large integrated applications. The software products provided by this project are three complementary and interoperable libraries: RAJA provides software abstractions that enable C++ developers to write performance-portable numerical kernels; Umpire is a portable memory resource management library; and CHAI contains C++ “managed array” abstractions that enable transparent, on-demand data migration.
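
To give a feel for how these libraries fit together, the sketch below pairs an Umpire allocation with a RAJA kernel. It is a minimal illustration assuming RAJA and Umpire are installed; the kernel, array sizes, and names are made up, and only the sequential execution policy is shown, but swapping the policy retargets the same loop body to OpenMP or GPU backends.

    #include "RAJA/RAJA.hpp"
    #include "umpire/ResourceManager.hpp"

    int main()
    {
        constexpr int N = 1000;

        // Umpire hands out memory from a named resource ("HOST" here;
        // "DEVICE" or "UM" would target GPU or unified memory).
        auto& rm = umpire::ResourceManager::getInstance();
        umpire::Allocator alloc = rm.getAllocator("HOST");
        double* x = static_cast<double*>(alloc.allocate(N * sizeof(double)));
        double* y = static_cast<double*>(alloc.allocate(N * sizeof(double)));

        // RAJA kernels: the execution policy template parameter decides
        // where and how the loop runs. RAJA::seq_exec could become
        // RAJA::omp_parallel_for_exec or RAJA::cuda_exec<256> without
        // touching the loop body.
        RAJA::forall<RAJA::seq_exec>(RAJA::RangeSegment(0, N), [=](int i) {
            x[i] = 1.0;
            y[i] = 2.0;
        });
        RAJA::forall<RAJA::seq_exec>(RAJA::RangeSegment(0, N), [=](int i) {
            y[i] += 2.0 * x[i];   // simple daxpy-style update
        });

        alloc.deallocate(x);
        alloc.deallocate(y);
        return 0;
    }

When kernels move between host and device, CHAI’s chai::ManagedArray would replace the raw pointers above, migrating the data on demand.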

Flux, LLNL’s next-generation resource manager, is a key enabler for many science workflows. Flux provides key scheduling capabilities for complex application workflows, such as MuMMI, which is used in cancer research; uncertainty quantification pipelines; Merlin, which is used for large-scale machine learning; recent COVID-19 drug design workflows; ECP ExaAM; and others. Flux is also a critical technology behind the Rabbit I/O technology planned for El Capitan. Traditional resource managers, such as SLURM, lack the scalability and flexible resource model these workflows require.
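
Workflows typically drive Flux through its command-line tools or Python bindings, but flux-core also exposes a C API. The minimal sketch below, written as C++ against that C API, simply connects to an enclosing Flux instance and reads a broker attribute; it assumes a flux-core installation and must run inside a Flux instance.

    #include <cstdio>
    #include <flux/core.h>

    int main()
    {
        // Connect to the enclosing Flux instance.
        flux_t* h = flux_open(NULL, 0);
        if (!h) {
            std::perror("flux_open");
            return 1;
        }
        // "size" is the number of brokers in this instance
        // (typically one broker per node).
        const char* size = flux_attr_get(h, "size");
        std::printf("Flux instance size: %s\n", size ? size : "unknown");
        flux_close(h);
        return 0;
    }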

The Debugging @ Scale project provides an advanced debugging, code-correctness, and testing tool set for exascale. The current capabilities include STAT, a highly scalable lightweight debugging tool; Archer, a low-overhead OpenMP data race detector; ReMPI/NINJA, a scalable record-and-replay tool and smart noise injector for the Message Passing Interface (MPI); and FLiT/FPChecker, a tool suite for checking floating-point correctness.
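
As an illustration of the class of bug these tools target, the snippet below contains a textbook OpenMP data race of the kind Archer is designed to detect; the code is illustrative, not drawn from the project.

    #include <cstdio>

    int main()
    {
        int sum = 0;
        // Data race: every thread performs an unsynchronized
        // read-modify-write on 'sum'. A race detector such as Archer
        // flags the conflicting accesses; adding reduction(+:sum) to
        // the pragma is the fix.
        #pragma omp parallel for
        for (int i = 0; i < 1000; ++i) {
            sum += i;
        }
        std::printf("sum = %d\n", sum);   // value can vary run to run
        return 0;
    }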

The MFEM library is focused on providing high-performance mathematical algorithms and finite element discretizations to next-generation, high-order applications. This effort includes the development of physics enhancements in the finite element algorithms of MFEM and the MFEM-based BLAST Arbitrary Lagrangian-Eulerian code to support ASC mission applications, as well as the development of unique unstructured adaptive mesh refinement algorithms that focus on generality, parallel scalability, and ease of integration into unstructured mesh applications.
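
For a flavor of the MFEM API, the sketch below assembles and solves a small Poisson problem, loosely following MFEM’s ex1 example; the mesh size, element order, and solver settings are arbitrary choices for illustration.

    #include "mfem.hpp"
    using namespace mfem;

    int main()
    {
        // Small built-in quadrilateral mesh; MFEM works on general
        // unstructured meshes, this just keeps the example self-contained.
        Mesh mesh(16, 16, Element::QUADRILATERAL, true);

        // Order-2 H1 space (going high-order is a one-parameter change).
        H1_FECollection fec(2, mesh.Dimension());
        FiniteElementSpace fespace(&mesh, &fec);

        // Homogeneous Dirichlet conditions on the whole boundary.
        Array<int> ess_tdof_list;
        Array<int> ess_bdr(mesh.bdr_attributes.Max());
        ess_bdr = 1;
        fespace.GetEssentialTrueDofs(ess_bdr, ess_tdof_list);

        // Right-hand side (1, v) and stiffness form (grad u, grad v).
        ConstantCoefficient one(1.0);
        LinearForm b(&fespace);
        b.AddDomainIntegrator(new DomainLFIntegrator(one));
        b.Assemble();

        GridFunction x(&fespace);
        x = 0.0;

        BilinearForm a(&fespace);
        a.AddDomainIntegrator(new DiffusionIntegrator(one));
        a.Assemble();

        OperatorPtr A;
        Vector B, X;
        a.FormLinearSystem(ess_tdof_list, x, b, A, X, B);

        // Conjugate gradients with a Gauss-Seidel smoother.
        GSSmoother M((SparseMatrix&)(*A));
        PCG(*A, M, B, X, 1, 400, 1e-12, 0.0);
        a.RecoverFEMSolution(X, b, x);
        return 0;
    }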

The DevRAMP team is creating tools and services that multiply the productivity of developers through automation. The capabilities include Spack, a package manager for high-performance systems that automates the process of downloading, building, and installing different versions of software packages and their dependencies, and Sonar, a software stack for performance monitoring and analysis that enables developers to understand how high-performance computers and applications interact. To deal with the complexity of packaging software for accelerated architectures, the Spack team has focused on enhancing robustness through testing and has completely reworked the concretizer, the NP-complete dependency solver at the core of Spack. The new concretizer is based on answer set programming, which allows Spack to solve complex systems of first-order logic constraints to optimize users’ build configurations. Spack is the foundation of the ECP’s Extreme-Scale Scientific Software Stack (E4S) and the delivery mechanism for all software in the ECP.
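
As a toy illustration of the shape of that optimization problem (this is not Spack code; the packages, versions, and constraint are hypothetical), the sketch below enumerates version assignments, discards those violating a dependency constraint, and keeps the feasible assignment that a “prefer newest versions” objective scores best. Spack’s concretizer hands a vastly larger version of this problem, encoded as facts and rules, to an answer set programming solver.

    #include <iostream>
    #include <vector>

    int main()
    {
        // Hypothetical candidate versions for two packages.
        const std::vector<int> app_versions = {1, 2};
        const std::vector<int> lib_versions = {3, 4, 5};

        // Hypothetical constraint: app@2 requires lib@4 or newer.
        auto satisfies = [](int app, int lib) { return app != 2 || lib >= 4; };

        // Brute-force search for the best feasible assignment. Real
        // concretization spans thousands of packages, variants,
        // compilers, and targets, which is why Spack uses a logic
        // solver instead of enumeration.
        int best_app = -1, best_lib = -1, best_score = -1;
        for (int app : app_versions)
            for (int lib : lib_versions)
                if (satisfies(app, lib) && app + lib > best_score) {
                    best_score = app + lib;   // "prefer newest" objective
                    best_app = app;
                    best_lib = lib;
                }

        std::cout << "concretized: app@" << best_app
                  << " ^lib@" << best_lib << "\n";   // app@2 ^lib@5
        return 0;
    }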

Principal Investigator(s):

Becky Springmeyer, Lawrence Livermore National Laboratory

Progress to date

  • The RAJA/Umpire/CHAI team released RAJA v0.13.0, CHAI v2.3.0, and Umpire v5.0.0. These releases added robust support for AMD GPUs via the HIP programming model.
  • The Debugging @ Scale team integrated the STAT and TotalView debugging tools with Flux in preparation for exascale systems and demonstrated that new floating-point debugging tools, such as FLiT and FPChecker, can identify difficult-to-find floating-point precision issues in production high-performance computing codes. This work was published in Communications of the ACM.
  • The MFEM team released MFEM v4.2 with many new features, including vectorization, HIP, and CUDA improvements; algebraic multigrid preconditioning on GPUs via NVIDIA’s AmgX; element and full assembly on GPUs; improved mesh optimization and discretization algorithms; CVODES, MKL CPardiso, SLEPc, and ADIOS2 support; libCEED, GSLIB-FindPoints, KINSOL, Gmsh, Gecko, and ParaView improvements; 18 new examples and miniapps; and much more.
  • Spack adoption has grown rapidly around the world; the tool is used on many TOP500 systems, including the number-one Fugaku supercomputer. This year, the project grew to more than 5,300 software packages, and the team completed the concretizer rework described above.
  • The Flux team continues to push resource management forward by exploring integration with the Kubernetes orchestrator, as well as flexible power management through Variorum. Flux has become a founding technology for the ECP’s ExaWorks, which aims to curate a scalable, robust software development kit for workflows.
