Science at Exascale: The Art of Dealing with the Tension Between More and Less Data

By Franck Cappello1 with Bogdan Nicolae1, Sheng Di1, Kathryn Mohror2, Adam Moody2 and Greg Kosinovsky2
1Argonne National Laboratory
2Lawrence Livermore National Laboratory


Exascale will undoubtedly mark a historic milestone in scientific computing. However, exascale scientific computing will also be marked by the need to deal with an important tension. On one hand, users need to generate more data, not only to increase the accuracy of their results but also to improve the explainability and reproducibility of those results, a challenge that grows with the increasing convergence of scientific computing and machine/deep learning. On the other hand, the immense volumes of data that will be produced or consumed by scientific simulations, AI analytics, and experiments cannot be analyzed, stored, and communicated in their entirety because of limited resources.

This tension forces users to think about the value of data as it is produced and consumed during its life cycle and to confront important trade-offs: What data is essential to keep? When should it be captured? And at what precision? The ECP VeloC-SZ project is developing two software technologies to support user needs concerning their scientific data: VeloC and SZ.

VeloC is a modular data state management framework developed at Argonne National Laboratory in collaboration with Lawrence Livermore National Laboratory. It was initially designed for checkpointing and is used in several ECP applications. Through its API, VeloC enables users to select and declare the datasets that form a data state. Based on these declarations, users decide when to capture, save, and restore data states, potentially keeping multiple versions as checkpoints during the application run. VeloC performs data state management efficiently and at scale by masking the complexity and heterogeneity of storage hierarchies from the user. It optimizes data state management according to user requirements, meeting the performance and/or resource utilization trade-offs imposed by the heterogeneous storage hierarchy. Thanks to its advanced mechanisms, users can also rely on VeloC to support in situ analysis, ensemble computation, and AI explainability.
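To illustrate this declare/capture/restore workflow, the sketch below shows how an iterative solver might protect its main arrays and checkpoint them periodically with VeloC. It is a minimal sketch based on VeloC's memory-based C API; exact function signatures, return codes, and configuration options may differ between VeloC releases, and the configuration file name, checkpoint label, array names, and checkpoint interval are hypothetical.

    #include <mpi.h>
    #include <veloc.h>

    #define N 1000000

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        /* Initialize VeloC with a configuration file describing the storage hierarchy
           (file name is hypothetical). */
        VELOC_Init(MPI_COMM_WORLD, "heatdis.cfg");

        static double temperature[N];   /* solver state (hypothetical) */
        int iteration = 0;

        /* Declare the memory regions that form the data state. */
        VELOC_Mem_protect(0, &iteration, 1, sizeof(int));
        VELOC_Mem_protect(1, temperature, N, sizeof(double));

        /* If a previous version of the "heatdis" state exists, restore the
           most recent one (assumes a negative return value means "none found"). */
        int v = VELOC_Restart_test("heatdis", 0);
        if (v >= 0)
            VELOC_Restart("heatdis", v);

        for (; iteration < 10000; iteration++) {
            /* ... compute one solver step ... */

            /* Capture a new version of the data state every 100 iterations. */
            if (iteration % 100 == 0)
                VELOC_Checkpoint("heatdis", iteration);
        }

        /* Wait for any checkpoints still being flushed in the background
           (assumption about the meaning of the flag). */
        VELOC_Finalize(1);
        MPI_Finalize();
        return 0;
    }

In this sketch, VeloC decides where each checkpoint version lands in the storage hierarchy (node-local storage, burst buffers, parallel file system) according to the configuration file, so the application code stays the same across machines.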

SZ, developed at Argonne National Laboratory, is a modular scientific lossy compression framework used by several ECP applications. It addresses three important user constraints on data reduction: speed, compression ratio, and accuracy. In exascale scenarios, data reduction is performed online, in situ, so it needs to be fast; SZ offers multiple implementations, including GPU versions that exceed 25 GB/s of compression throughput. Depending on the application and the required accuracy, SZ can reduce scientific data by one or more orders of magnitude. More importantly, SZ respects user-defined accuracy constraints, preserving the potential for scientific analysis and discovery.
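As an example of such an accuracy constraint, an application might compress a 2D field with an absolute error bound before writing it to storage. The sketch below follows the style of the SZ 2.x C interface; parameter lists, constants, and headers vary across SZ versions (SZ2 vs. SZ3), so treat it as an illustration of error-bounded lossy compression rather than a definitive API reference. The grid dimensions and the 1e-3 error bound are hypothetical.

    #include <stdlib.h>
    #include "sz.h"

    int main(void) {
        size_t nx = 1024, ny = 1024;          /* hypothetical 2D field dimensions */
        float *field = malloc(nx * ny * sizeof(float));
        /* ... fill 'field' with simulation output ... */

        SZ_Init(NULL);                        /* default configuration */

        /* Compress under a user-defined absolute error bound of 1e-3:
           every decompressed value stays within 1e-3 of the original. */
        size_t outSize = 0;
        unsigned char *compressed = SZ_compress_args(SZ_FLOAT, field, &outSize,
                                                     ABS, 1e-3, 0.0, 0.0,
                                                     0, 0, 0, ny, nx);

        /* ... write 'compressed' (outSize bytes) to storage or send it downstream ... */

        /* Later, recover an approximation that honors the same error bound. */
        float *restored = (float *)SZ_decompress(SZ_FLOAT, compressed, outSize,
                                                 0, 0, 0, ny, nx);

        SZ_Finalize();
        free(field); free(compressed); free(restored);
        return 0;
    }

The key point is that the error bound is chosen by the user from the requirements of the downstream analysis, so the compression ratio adapts to the data while the accuracy guarantee stays fixed.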

The needs for data state management and reduction are likely to expand during the exascale era, pushing users, application developers, and software developers to progressively refine data selection and reduction, potentially by leveraging AI and by exploiting advanced spatial and temporal features of the data.