Addressing Fault Tolerance and Providing Data Reduction at Exascale

A conversation with Franck Cappello of Argonne National Laboratory

Franck Cappello of Argonne National Laboratory leads two ECP projects, VeloC and EZ. He discussed them with ECP Communications at SC17. This is an edited transcript.

Please describe your projects.

I’m leading two ECP software projects. The first one is called VeloC (Very Low Overhead transparent multilevel Checkpoint/restart), and it’s related to fault tolerance. The second one is EZ (Fast, effective, parallel error-bounded exascale lossy compression for scientific data), and it involves data reduction. The VeloC project focuses on an important aspect of scientific applications running on extreme-scale systems. We have to anticipate that exascale systems will likely be prone to failures, simply because of the number of components they will use. VeloC will provide ECP applications with a fault tolerance environment that is effective, transparent, and efficient. The idea is to bring scientific simulations to completion without significant changes in the application codes, while dramatically reducing the fault tolerance overhead perceived by the applications.

For the second project, EZ, the idea is to provide data reduction. This is already critical for many ECP applications in today’s environment, and it will be even more so at exascale. A high compression ratio is needed to address the growing gap between memory size and the bandwidth and capacity of the storage system in exascale machines. The EZ project is developing the SZ lossy compressor, which targets a compression ratio of 10 or more; in other words, we want to reduce the data size by 90 percent or more. The problem is that we need to strictly respect error bounds set by the users. Think of something like JPEG or MP3, but for scientific data with very strong accuracy requirements.
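To make the error-bound requirement concrete, here is a minimal sketch in Python of absolute-error-bounded quantization. It only illustrates what a user-set bound guarantees; the actual SZ compressor also uses prediction and entropy coding, and the function names and data here are hypothetical.

```python
import numpy as np

def compress_abs_bound(values, err_bound):
    # Map each value to an integer bucket of width 2*err_bound. SZ itself
    # predicts values and entropy-codes the residuals; this only shows the
    # error-bound guarantee.
    return np.round(values / (2.0 * err_bound)).astype(np.int64)

def decompress_abs_bound(codes, err_bound):
    # Reconstruct the midpoint of each bucket.
    return codes * (2.0 * err_bound)

rng = np.random.default_rng(seed=0)
field = rng.normal(size=1_000_000)   # toy stand-in for a simulation field

bound = 1e-3                         # user-set absolute error bound
codes = compress_abs_bound(field, bound)
recon = decompress_abs_bound(codes, bound)

# Every reconstructed value stays within the requested bound of the original.
print("max point-wise error:", np.max(np.abs(field - recon)))
# A compression ratio of 10 on 64-bit values corresponds to storing roughly
# 6.4 bits per value once the integer codes are entropy-coded.
```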

Why is your research important to the overall effort to build a capable exascale ecosystem?

Considering fault tolerance and the VeloC project, we have to understand that failures happen in supercomputers, and if we do not protect the execution of the application, many executions will not complete. And the users will complain because they will not get their results. So fault tolerance is needed, and it is already used in current environments. But the complexity of exascale systems, with their deeper and more diverse memory and storage hierarchies, makes it difficult for application developers to build optimized fault tolerance mechanisms for their applications. The VeloC environment will provide a checkpoint/restart solution to this problem with minimal code modification. VeloC will give application developers a rich API that takes advantage of the multiple levels of the storage hierarchy, thus maximizing application performance and reliability. Code developers also need to adapt their codes only once, which is an important gain in productivity because there are different exascale systems.
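As a rough illustration of what multilevel checkpoint/restart means, here is a minimal Python sketch, assuming two hypothetical storage levels. It is not VeloC’s API; the paths and function names are placeholders for the idea of writing fast and falling back to the most resilient level on restart.

```python
import os
import pickle
import shutil

# Illustrative storage levels, fastest to most resilient. The real VeloC
# back-end manages levels such as node-local memory/SSD and the parallel
# file system; these paths are placeholders, not VeloC's interface.
LOCAL = "/tmp/ckpt"            # node-local: fast, but lost if the node fails
PFS = "/pfs/project/ckpt"      # parallel file system: slower, survives node loss

def checkpoint(state, step):
    # Write to the fast level first so the application resumes computing quickly.
    os.makedirs(LOCAL, exist_ok=True)
    path = os.path.join(LOCAL, f"state.{step}.pkl")
    with open(path, "wb") as f:
        pickle.dump(state, f)
    return path

def drain_to_pfs(path):
    # In VeloC this movement is asynchronous; here it is a simple blocking copy.
    os.makedirs(PFS, exist_ok=True)
    shutil.copy(path, PFS)

def restart():
    # Prefer the fastest level that still holds a checkpoint; fall back to the PFS.
    for level in (LOCAL, PFS):
        if os.path.isdir(level):
            paths = [os.path.join(level, name) for name in os.listdir(level)]
            if paths:
                newest = max(paths, key=os.path.getmtime)
                with open(newest, "rb") as f:
                    return pickle.load(f)
    return None  # no checkpoint anywhere: start from the beginning
```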

For the EZ project, the problem that we are addressing is really data reduction. At extreme scale, scientific applications and experiments are already generating more data than can be stored, transmitted, or analyzed, so the data sets need to be reduced significantly. Unfortunately, the compression that lossless data reduction techniques can provide is not enough, so we need to look at lossy* data reduction techniques, and this is really needed and critical. But the difficulty with lossy data reduction and lossy compression is the notion of accuracy and of respecting the accuracy that the users want for their data. Lossy compressors like SZ will produce an error, but this error needs to be controlled and understood by the user, and those are the really important points.

I’m not just talking about the amount of error. I’m also talking about the nature of the error, such as its distribution, spectral alteration, autocorrelation alteration, or the distortion of derivatives. That’s a much more complicated story than simply applying the existing compressors we have for images, movies, or music to scientific data.
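One way to picture these concerns is a small diagnostic routine that looks at the nature of the compression error rather than only its maximum size. This is a sketch of the kind of check one might run, not a tool from the EZ project.

```python
import numpy as np

def error_diagnostics(original, reconstructed):
    # Beyond the maximum point-wise error, inspect the *nature* of the
    # compression error: bias, spread, and whether neighboring errors are
    # correlated (correlated errors can distort spectra and derivatives
    # computed from the decompressed field).
    err = (reconstructed - original).ravel()
    centered = err - err.mean()
    lag1 = float(np.corrcoef(centered[:-1], centered[1:])[0, 1])
    return {
        "max_abs_error": float(np.max(np.abs(err))),
        "mean_error": float(err.mean()),        # systematic bias
        "std_error": float(err.std()),
        "lag1_autocorrelation": lag1,           # ideally near 0 (noise-like)
    }
```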

Why are your research areas important to advancing scientific discovery, industrial research, and national security?

We build exascale systems to run larger simulations at higher resolution or to analyze larger or more complex data sets. This is needed to make progress in science, industry, and national security. That’s clear. And these domains depend on our capability to run the simulation and analytics workloads to completion. If we cannot complete the execution, we don’t get the result, and the users don’t benefit. It’s important to bring executions to completion, and that’s what the VeloC software will guarantee.

Concerning the SZ compressor in the EZ project, the increase in the resolution and size of these simulations and experiments has a direct impact on the volume of data that is produced. The current lossy compressors for scientific data are effective only on a limited class of applications or data sets. So for many ECP applications, we need to find new solutions, new techniques to effectively compress these data sets. The objective of the EZ project is to improve SZ and develop new compression techniques for that.

Your projects have been going for about a year now. What milestones have you reached?

Concerning the fault tolerance environment, one important milestone was the completion of the programming interface design. We held discussions with many ECP application teams, and we developed a prototype environment to check that the programming interface in fact covered their different needs and ran correctly. So that was the first important milestone in our project: making sure that we have the right programming interface.

Our next milestone will be the development of what we call the asynchronous back-end, which is a separate piece of software that goes with the interface I have described. It will handle all checkpoint movement across the storage hierarchy, so it needs to be efficient and effective in protecting the application’s execution. That is for the VeloC project.
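The idea of an asynchronous back-end can be sketched as a background worker that drains local checkpoints to slower, more resilient storage while the application keeps computing. This is only a conceptual sketch; the real VeloC back-end is a separate service with its own transfer engine, and the names and paths below are placeholders.

```python
import queue
import shutil
import threading

# The slow copy to resilient storage runs in a background worker so the
# application perceives only the cost of the fast local write.
pending = queue.Queue()

def drain_worker(destination):
    while True:
        path = pending.get()
        if path is None:                 # sentinel value: shut the worker down
            break
        shutil.copy(path, destination)   # slow transfer, off the critical path
        pending.task_done()

worker = threading.Thread(
    target=drain_worker, args=("/pfs/project/ckpt",), daemon=True
)
worker.start()

def on_local_checkpoint_written(local_path):
    # Called right after the fast node-local checkpoint completes; it returns
    # immediately, and the background worker moves the file later.
    pending.put(local_path)
```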

For the EZ project, the important milestone was the release of the latest version of the SZ compressor last September, which covers more ECP applications than before. We are getting good results in terms of compression ratios for these applications, and also in respecting the error controls, across different domains: cosmology, climate, quantum chemistry, and instrument data. For each of these applications, we study the characteristics of the data set and optimize the compression for it. The optimization covers identifying the right compression parameters, but for some applications we need to develop new compression techniques, and that is what we are doing for the next set of applications we are dealing with.

What new collaboration or integration activities have you been involved with, and what new working relationships have resulted from your ECP research?

We have collaborations with several ECP applications, as I said, in cosmology, molecular dynamics simulation, computational chemistry, and x-ray laser imaging. We are also working closely with the software technology projects that develop I/O libraries such as ADIOS, HDF5, and PnetCDF. And we work with the vendors to optimize the code of VeloC and SZ for their systems. So really, it’s a large set of collaborations, and ECP is unique in providing this potential for collaboration with the applications, the software projects, the facilities, and the vendors.

How important are ECP collaboration and integration activities to the overall success of the project?

I think that the ECP collaboration and integration efforts are critical. It’s important to work closely with the application code developers and with the users to understand their needs and their concerns. Similarly, collaboration with the software projects is critical to keep the software stack consistent. We need to get the integration of the different software components with the applications right. Of course, ultimately the software has to run on the actual systems, so continuous collaboration with the vendors and facilities is important.

Has your research taken advantage of any of the ECP’s allocation of computer time?

Yes. We need to run a lot of tests on the current systems that we can access, so Titan, Mira, and Theta are used to test all the software that we are designing and also to test how the applications work with our software.

What’s next with this research activity?

The next year is important for us. Concerning the fault tolerance project, VeloC, our main effort will be to deliver the first release of the VeloC software and to integrate more ECP applications with it. The goal is to demonstrate gains in performance and reliability on current systems for these ECP applications. For EZ, the data reduction project, our objective over the next 12 months is to improve the compression performance. The first thing we’ll do is provide a parallel version of SZ, which will use OpenMP. The other objective is to leverage the time dimension for compression: the current version of our compressor compresses each snapshot individually, and the next version will exploit temporal correlations.
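To illustrate what exploiting the time dimension can mean, here is a minimal Python sketch, building on the error-bounded quantization shown earlier: each snapshot is predicted from the previously reconstructed one and only the residual is quantized. This is an assumption-laden illustration of temporal prediction, not the SZ temporal algorithm.

```python
import numpy as np

def quantize(values, err_bound):
    # Same absolute-error-bounded quantization sketched earlier.
    return np.round(values / (2.0 * err_bound)).astype(np.int64)

def dequantize(codes, err_bound):
    return codes * (2.0 * err_bound)

def compress_time_series(snapshots, err_bound):
    # Predict each snapshot from the previously *reconstructed* snapshot and
    # quantize only the residual. Smoothly evolving fields give small
    # residuals, so the integer codes compress well, and using the
    # reconstructed predecessor keeps the point-wise error from accumulating
    # across time steps.
    compressed = []
    previous = None
    for snap in snapshots:
        residual = snap if previous is None else snap - previous
        codes = quantize(residual, err_bound)
        compressed.append(codes)
        recon_residual = dequantize(codes, err_bound)
        previous = recon_residual if previous is None else previous + recon_residual
    return compressed
```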

 

* Per Wikipedia: lossy compression or irreversible compression is the class of data encoding methods that uses inexact approximations and partial data discarding to represent the content. These techniques are used to reduce data size for storage, handling, and transmitting content.
