VeloC/SZ

Long-running large-scale simulations and high-resolution, high-frequency instrument detectors are generating extremely large volumes of data at a high rate. While reliable scientific computing is routinely achieved at small scale, it becomes remarkably difficult at exascale for two reasons: the number of disruptions (for example, from component failures) increases as the machines become larger and more complex, and the volume of data poses a big data challenge. The VeloC/SZ project addresses these challenges by ensuring high reliability for long-running exascale simulations and by reducing the data while keeping important scientific outcomes intact.

Summary

The Motivation

The VeloC/SZ project is extending and improving the SZ error-bounded lossy compressor for structured and unstructured scientific datasets. SZ offers an excellent compression ratio with very low distortion and compression time. This additional work was motivated by the need to improve the SZ lossy compressor for Exascale Computing Project (ECP) scientific datasets: ensuring that user-set error controls are respected, integrating SZ into the leading I/O libraries, and delivering production-quality software.

The Solution

Specifically, ECP contributions to SZ include (1) optimized SZ compression ratios (improved by up to a factor of 6 compared with the initial algorithm) and accuracy based on end-user needs; (2) improved compression speed (up to 3 orders of magnitude) when using the GPU implementation of SZ, which supports multiple supercomputers with different architectures (e.g., Aurora, Frontier, Summit); (3) refactoring of SZ in C++ to support a composable compression framework and all data types used in ECP applications (the resultant SZ3.0 received an R&D 100 award in 2021); (4) integration of SZ into HDF5 and ADIOS; and (5) improved SZ code robustness and testability to make it production ready.

The Impact

SZ takes a multifaceted approach to integration, targeting both science applications and I/O libraries such as HDF5 and ADIOS so that it is available to a broader range of potential clients after the ECP. Direct client integrations of SZ include the ECP applications HACC (cosmology), Nyx (cosmology), LAMMPS (molecular dynamics), and LCLS crystallography (x-ray light source). The integration demonstrations span both application development (AD) clients and I/O library use cases.

See SZ products and their utilization on the SZ homepage. SZ received an R&D 100 award in 2021.

Sustainability

SZ is also available through the DAV software development kit and through E4S, enabling users to use SZ on an HPC platform.

Technical Discussion

Big data challenges need to be addressed at exascale for applications to achieve their performance and science goals. The VeloC/SZ project addresses two of these big data challenges. First, the ability to run scientific simulations to completion despite disruptions along the way is critical. VeloC provides a highly reliable environment for exascale applications at minimal cost, letting them handle the extreme volume and velocity of data they produce with low overhead. Second, data reduction is necessary to shrink the data written to the storage system because of bandwidth and storage space limitations. SZ addresses this challenge by enabling application scientists to reduce their scientific data while keeping scientific outcomes intact.

Most large-scale scientific applications use execution state recording techniques to make sure the execution finishes, despite disruptions. If a disruption occurs, the execution state can be restored and the application can be restarted from this state. This technique is known as checkpoint/restart. At exascale, this technique is difficult to implement at low cost for the applications due to an extremely large volume/velocity of data, complex disruption modes, and limited bandwidth to the storage system. Moreover, the diversity and complexity of the storage hierarchy in exascale systems make it very difficult for application developers to implement checkpoint/restart at low cost. VeloC leverages application developer knowledge about state preservation to provide a solution optimizing the performance of checkpoint/restart while masking the complexity and diversity of the storage hierarchies. An existing application can be adapted for VeloC in minimal time. Once adapted, the application can run in a highly reliable way on pre-exascale and exascale machines.
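The checkpoint/restart pattern described above can be illustrated with a minimal, generic sketch. It does not use the VeloC API: the function names (save_checkpoint, load_checkpoint, run_simulation), the JSON file format, and the toy "simulation" are all assumptions made for illustration. A real exascale application would rely on VeloC to manage the storage hierarchy instead of writing a single local file.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the state to a temporary file, then rename it over the target.

    The rename keeps the previous checkpoint intact if a disruption
    happens in the middle of the write.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the last saved state, or None if no checkpoint exists."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)


def run_simulation(path, total_steps, checkpoint_every, fail_at=None):
    """Toy simulation that accumulates a running sum of step indices.

    It resumes from the checkpoint at `path` when one is present.
    `fail_at` is a hypothetical knob that injects a disruption at a step.
    """
    state = load_checkpoint(path) or {"step": 0, "value": 0}
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated disruption")
        state["value"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["value"]
```

Calling run_simulation a second time with the same path after a disruption resumes from the last saved step rather than from step 0, so only the work done since that checkpoint is repeated.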

As data sizes increase with exascale systems and updated scientific instruments, lossy compression of scientific data becomes a necessity. Lossy compression reduces the data by removing non-useful information. Lossy compression for scientific data needs to satisfy three main requirements: it should remove only information that does not impact scientific discovery; compression and decompression must be very fast to avoid becoming a performance bottleneck; and it must provide data reduction much higher than lossless compression. The SZ software provides lossy compression for scientific datasets satisfying these three requirements. To keep information relevant for scientific discovery, SZ users set constraints in terms of compression quality. To control the information loss for each data point, SZ provides point-wise error-bound controls whose values the user supplies. To reach extremely high performance, the SZ software has a parallel implementation that benefits from GPU acceleration. The advanced compression pipelines used in SZ provide very high compression ratios compared with lossless compression, enabling SZ to overcome the big data challenges at the exascale.
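To make the idea of a point-wise error bound concrete, here is a deliberately simplified sketch. It is not the SZ algorithm and does not use the SZ API: it only applies uniform quantization with a bin width of twice the absolute error bound, then compresses the bin indices losslessly with zlib. The function names (compress, decompress) and the sample data are assumptions for illustration; SZ's actual pipelines are considerably more sophisticated.

```python
import math
import struct
import zlib


def compress(values, abs_error_bound):
    """Quantize each value to the nearest multiple of 2*eb, then zlib the indices.

    Rounding to the nearest bin center keeps every reconstructed value
    within abs_error_bound of the original, up to floating-point rounding.
    """
    step = 2.0 * abs_error_bound
    codes = [round(v / step) for v in values]
    payload = struct.pack(f"<{len(codes)}q", *codes)  # signed 64-bit integers
    return zlib.compress(payload)


def decompress(blob, abs_error_bound):
    """Invert compress(): recover the bin indices and map them to bin centers."""
    step = 2.0 * abs_error_bound
    payload = zlib.decompress(blob)
    codes = struct.unpack(f"<{len(payload) // 8}q", payload)
    return [c * step for c in codes]


# Usage: a smooth signal compressed with an absolute error bound of 1e-3.
error_bound = 1e-3
values = [math.sin(i * 0.01) for i in range(1000)]
blob = compress(values, error_bound)
restored = decompress(blob, error_bound)
max_error = max(abs(a - b) for a, b in zip(values, restored))
```

In this sketch the user-supplied bound directly sets the quantization step, which shows the trade-off: a looser bound produces fewer distinct bin indices and therefore a smaller compressed output, while max_error stays within the requested bound.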

For more information and references:

VeloC/SZ can be installed through E4S binaries or containers, or through custom source code builds with Spack: https://e4s.io.

SZ lossy compression website, which provides numerous links to SZ products and their utilization.