DataLib

Exascale applications generate massive amounts of data that must be analyzed and stored to achieve their science goals. The speed at which the data can be written to and retrieved from the storage system is a critical factor for achieving these goals. As exascale architectures become more complex, with multiple compute nodes, accelerators, and heterogeneous memory systems, storage technologies must evolve to support these architectural features. The DataLib project is focused on three distinct and critical aspects of successful storage and I/O technologies for exascale applications: (1) enhancing and enabling traditional I/O libraries on pre-exascale and exascale architectures, (2) supporting a new paradigm of data services specialized for exascale codes, and (3) working closely with Facilities to ensure the successful deployment of their tools.

Project Details

The ability to efficiently store data to the file system is a key requirement for all scientific applications. The DataLib project is providing both standards-based and custom storage and I/O solutions for exascale applications on upcoming platforms. The primary goals of this effort are to enable users of the Hierarchical Data Format 5 (HDF5) standard to achieve the levels of performance seen from custom codes and tools; to facilitate the productization and porting of data services and I/O middleware using Mochi technologies; and to continue supporting application and Facility interactions through the DataLib technologies Darshan, Parallel Network Common Data Form (Parallel netCDF, or PnetCDF), and ROMIO.

HDF5 is the most popular high-level application programming interface (API) for interacting with the storage system on high-performance computers. The DataLib team is undertaking a systematic software development activity to deliver an HDF5 API implementation that achieves the highest possible performance on exascale platforms. By adopting the HDF5 API, the team is able to support the I/O needs of all the exascale applications that already use this standard.

The Mochi software framework provides building blocks for user-level distributed data services and addresses performance, programmability, and portability. Mochi components are being used by multiple exascale library and application developers, and the team is engaging with them to customize data services for their needs.

Darshan, Parallel netCDF, and ROMIO also continue to be important storage system software components. DataLib is extending Darshan to cover emerging underlying storage, such as the Intel Distributed Asynchronous Object Store (DAOS); enhancing Parallel netCDF to meet Exascale Computing Project (ECP) application needs; and making fundamental improvements in ROMIO to improve performance and address new requirements from underlying storage technologies, such as UnifyFS.

Principal Investigator(s):

Rob Ross, Argonne National Laboratory

Collaborators:

Argonne National Laboratory, Los Alamos National Laboratory, Northwestern University

Progress to date

  • Developed in-depth instrumentation within Darshan for a range of important HPC I/O libraries: HDF5, PnetCDF, and DAOS
  • Refined PyDarshan analysis framework and developed new log analysis utilities
  • Reduced the performance overhead of the Darshan runtime library by improving its timing and locking methods
  • Added extensive regression and unit testing for Darshan
  • Refined Darshan’s detailed tracing functionality and added support for automatic trace triggering (based on observed I/O patterns, file names, MPI ranks, etc.)
  • Continued performance optimization of ROMIO and PnetCDF for current and future platforms:
    • Reduced synchronization in ROMIO to benefit UnifyFS and HDF5 users
    • Improved collective I/O through intra-node aggregation and pipelining
  • Developed an I/O abstraction that enables new backends for PnetCDF
  • Added a new PnetCDF backend for use in systems with burst buffers
  • HDF5 Log VOL brings an important new capability to the HDF5 I/O library
  • HDF5 Log VOL passes the majority of HDF5 correctness tests
  • Log VOL outperforms ADIOS in all but one of the E3SM-IO production run cases
  • Log VOL can seamlessly interoperate with the Cache and Asynchronous I/O VOLs developed by the ExaIO team
  • An E3SM I/O case study prompted the HDF5 group to develop the new multi-dataset I/O APIs in version 1.13.3, released on Oct. 31, 2022
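
As a rough illustration of the intra-node aggregation idea mentioned in the list above, the following Python sketch simulates how write requests from ranks on the same node might be gathered and coalesced before any inter-node I/O takes place. This is not ROMIO or PnetCDF code: the function name `aggregate_per_node`, the `(rank, offset, data)` request tuples, and the assumption that ranks are assigned to nodes in contiguous blocks are all hypothetical and chosen only for illustration. Pipelining is not modeled.

```python
from collections import defaultdict


def aggregate_per_node(requests, ranks_per_node):
    """Group write requests by node and coalesce contiguous ones.

    requests: iterable of (rank, offset, data) tuples (hypothetical format).
    ranks_per_node: ranks are assumed to map to nodes in contiguous blocks.
    Returns {node_id: [(offset, data), ...]} with adjacent requests merged.
    """
    # Step 1: each rank hands its request to the aggregator on its own node.
    by_node = defaultdict(list)
    for rank, offset, data in requests:
        by_node[rank // ranks_per_node].append((offset, data))

    # Step 2: each node-local aggregator sorts by file offset and merges
    # requests that touch back-to-back byte ranges.
    merged = {}
    for node, reqs in by_node.items():
        reqs.sort(key=lambda r: r[0])
        out = []
        for offset, data in reqs:
            if out and out[-1][0] + len(out[-1][1]) == offset:
                prev_offset, prev_data = out[-1]
                out[-1] = (prev_offset, prev_data + data)
            else:
                out.append((offset, data))
        merged[node] = out
    return merged


if __name__ == "__main__":
    # Four ranks, two per node; ranks 0 and 1 write adjacent byte ranges.
    reqs = [(0, 0, b"aa"), (1, 2, b"bb"), (2, 4, b"cc"), (3, 8, b"dd")]
    for node, writes in sorted(aggregate_per_node(reqs, 2).items()):
        print(node, writes)
```

In this toy model, fewer and larger requests leave each node, which is the general motivation for aggregating within a node before the inter-node phase of collective I/O.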
