The EZ Project Focuses on Providing Fast, Effective Exascale Lossy Compression for Scientific Data

Franck Cappello is a senior computer scientist at Argonne National Laboratory and an adjunct research professor at the University of Illinois at Urbana-Champaign. He is also the principal investigator for the Exascale Computing Project (ECP) efforts called EZ and VeloC.

EZ is endeavoring to provide a production-quality lossy compressor for scientific data sets, while VeloC is centered on supplying an optimized checkpoint/restart library for applications and workflows. In addition to his work on EZ and VeloC, Cappello is the data reduction lead for ECP’s CODAR Co-Design Center. This interview specifically delves into the EZ project.

The motivation for the EZ project is to meet the need of several ECP applications to reduce the data sets they produce. Current simulations and instruments generate more data than can be properly stored, analyzed, or transferred. There are different approaches to the problem. One is lossless compression, a data-reduction technique that loses no information and introduces no noise. The drawback of lossless compression, however, is that the floating-point values scientific applications produce are very difficult to compress: the best efforts reduce the data by only a factor of about two. In contrast, ECP applications seek data reduction factors of 10, 30, or even more.

One difficulty with the lossy compressors described in the literature is that they cannot compress one- or two-dimensional data sets well. The EZ project targets that type of data set as well as data sets of higher dimensions. In some cases, SZ, the compressor the project develops, achieves better compression on very large, high-dimensional data sets than on lower-dimensional ones. For exascale applications, a reduction of at least one order of magnitude is required. Some loss of information is acceptable, but the user must be able to set limits on the accuracy that must be preserved. The SZ compressor gives the user that control.
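To make that accuracy control concrete, below is a minimal sketch in C of the error-bounded contract such a compressor honors: the user chooses an absolute error bound, and every reconstructed value must differ from the original by no more than that bound. It is only a toy illustration based on uniform quantization, not the SZ algorithm or its API; the synthetic data and variable names are assumptions made for the example.

```c
/*
 * Toy illustration of error-bounded lossy compression: the user supplies
 * an absolute error bound, and every reconstructed value must stay within
 * that bound of the original.  This is NOT the SZ algorithm; it only
 * demonstrates the accuracy guarantee described above.
 */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const size_t n = 1000000;
    const double error_bound = 1e-3;     /* user-chosen absolute error bound */
    double *data  = malloc(n * sizeof *data);
    double *recon = malloc(n * sizeof *recon);

    /* Synthetic smooth field standing in for simulation output. */
    for (size_t i = 0; i < n; i++)
        data[i] = sin(i * 1e-4) + 0.5 * cos(i * 3e-4);

    /* "Compress": map each value to an integer bin of width 2*error_bound;
     * "decompress": take the bin center.  The bins, not the raw values,
     * would be what an entropy coder shrinks in a real compressor.       */
    double max_err = 0.0;
    for (size_t i = 0; i < n; i++) {
        long long bin = llround(data[i] / (2.0 * error_bound));
        recon[i] = bin * 2.0 * error_bound;
        double err = fabs(recon[i] - data[i]);
        if (err > max_err) max_err = err;
    }

    printf("max point-wise error = %g (bound = %g)\n", max_err, error_bound);
    free(data);
    free(recon);
    return 0;
}
```

Production compressors such as SZ combine prediction, quantization, and entropy coding to reach much higher reduction factors while still respecting this kind of user-specified bound.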

Video Chat Notes

The motivation for the EZ project [1:24]

Strict accuracy control is needed for lossy compression [3:22]

Compression for more than images [4:20]

Saving on storage footprint or bandwidth? [5:07]

Four use cases in addition to visualization: (1) reduction of the footprint on the storage system [5:20], (2) the NYX application and reduction of I/O time [6:10], (3) the NWChem application and lossy checkpointing [6:58], and (4) acceleration of the GAMESS application with lossy compression [8:24]

A multi-algorithm compressor [10:25]

Automatic configuration by analyzing the data set while compressing [11:30]

Deep learning is too slow to be integrated into fast lossy compressors [12:39]

Different uses, different needs [13:46]

Co-design, working with applications developers [14:31]

How the work of the EZ project is benefiting ECP [14:56]

Metrics for lossy compression quality and tools for assessing errors [16:09]

Why compressing floating-point data multiple times would be undesirable [19:49]

The results have been encouraging [21:00]

Testing on Theta and other systems [21:31]

HPC community involvement encouraged via the Scientific Data Reduction Benchmark web site: https://sdrbench.github.io [22:01]