Understanding, Storing, and Curating Scientific Data at Exascale

A conversation with Jim Ahrens of Los Alamos National Laboratory

Jim Ahrens, a senior scientist at Los Alamos National Laboratory, spoke with Exascale Computing Project (ECP) Communications at SC17 in Denver. Ahrens is principal investigator for the Data and Visualization project, which is responsible for the storage and visualization aspects of the ECP and for helping its researchers understand, store, and curate scientific data. This is an edited transcript of our conversation.

Why is the data management and visualization project important to the overall efforts to build a capable exascale ecosystem?

It is important that scientists can extract knowledge from the exascale data they produce. We need to help scientists both store and understand that data. One particular challenge we’re facing is that the bandwidth and capacity of storage systems are not growing at the same pace as the rate at which the exascale machine can calculate results. Our challenge is to save much less data, yet still be able to glean scientific insight.

Why is your research important to advancing scientific discovery, industrial research, and national security?

Our research is important to all three application areas because they are all doing new science through simulation and validating it against experimental science as well. By pairing simulation models with their associated experimental results, researchers are able to get a handle on the parameter space of the science they’re doing and to understand which parameters drive that science.

What milestones has your project hit so far?

One particular milestone I’m excited about is automated in-situ analysis. Traditionally, scientists would save their simulation data off to storage and then, after the simulation had finished, interactively visualize and analyze their data. Under the current constraints of the hardware system, it’s going to be more and more difficult to save entire data sets that way. We are going to need to do more automatic analysis while the simulation is running. We are thinking about how to select the right parameters for our operators to extract interesting science. We have a project that’s doing topological analysis of the scientific simulations to find “insightful” contours. At a recent ECP industry council meeting, we showed industry members the difference between selecting ten contours in a linear sequence and selecting ten topologically interesting contours for the WarpX accelerator code. It was very clear that the topologically interesting contours represented more “interesting” science than the ten contours selected from a linear sequence.
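To make the contrast between the two selection strategies concrete, here is a minimal toy sketch in Python. It is not the project's method and has nothing to do with WarpX data: it assumes a one-dimensional sampled signal and treats "topology" simply as the number of connected pieces of the region where the signal is at or above an isovalue. All function names and the sample signal are hypothetical.

```python
def count_components(values, threshold):
    """Count connected runs of samples whose value is >= threshold."""
    count = 0
    inside = False
    for v in values:
        if v >= threshold:
            if not inside:
                count += 1  # entering a new connected piece
            inside = True
        else:
            inside = False
    return count


def linear_isovalues(values, n):
    """Return n isovalues evenly spaced strictly between min and max."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / (n + 1)
    return [lo + step * (i + 1) for i in range(n)]


def topology_bands(values):
    """Split the value range into bands over which the component count is constant."""
    levels = sorted(set(values))
    bands = []  # each band is [low, high, component_count]
    for low, high in zip(levels, levels[1:]):
        count = count_components(values, (low + high) / 2)
        if bands and bands[-1][2] == count:
            bands[-1][1] = high  # same topology: extend the previous band
        else:
            bands.append([low, high, count])
    return bands


def topological_isovalues(values, n):
    """Return up to n isovalues, one per band, preferring the widest bands."""
    bands = topology_bands(values)
    widest = sorted(bands, key=lambda b: b[1] - b[0], reverse=True)[:n]
    return sorted((b[0] + b[1]) / 2 for b in widest)


# Hypothetical signal: a small bump near the bottom and a split peak near the top.
signal = [0, 10, 2, 90, 80, 95, 1]

for name, isovalues in (("linear", linear_isovalues(signal, 5)),
                        ("topological", topological_isovalues(signal, 5))):
    counts = [count_components(signal, t) for t in isovalues]
    print(name, counts)
```

On this toy signal, the five evenly spaced isovalues all land in the one wide band where the region is a single piece, while the band-based selection places an isovalue in every band, including the narrow ones where the region splits in two. Real topological analysis of 2D or 3D simulation fields is far more involved; this only illustrates why evenly spaced contours can miss structure.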

What collaboration or integration activities have you been involved in within the ECP, and what new working relationships have resulted from the ECP collaboration on this research activity?

There are a number of new ECP applications. We meet new scientists doing things like wind projects, accelerator projects, et cetera. There is the whole application side of the project, and we need to make sure we are meeting their needs. We also have a number of industry partnerships. We’ve been working with Kitware for years to open-source all the visualization and analysis tools that we generate: tools like ParaView and VisIt, and new projects like Cinema. The ECP ADIOS project, which is a data management project, is also working with Kitware. We’re working with Intel on open-source ray tracers, and there’s a lot of interest from the industry community in both using and helping to deliver the suite of exascale tools.

How would you describe the importance of collaboration and integration to the overall ECP effort?

I think it’s key. The Exascale Computing Project is very large, and it’s critical to partner with applications, industry, and facilities. All those players have to be at the table. Software technology has to be there too, all working together to achieve this difficult but, hopefully, doable project.

Has your research taken advantage of any of the ECP’s allocation of computer time?

To a limited extent. We are starting to build our tools. For example, the idea of the Cinema project is to render “every image” you need for your post-hoc visualization. We want to prototype this approach on some of this allocation time so we can ensure it’s all working out, show scientists our results, and then start incorporating this approach into their codes.
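The "every image" idea can be pictured as an image database: while the simulation runs, one image is rendered for each combination of parameters a scientist might later want to browse, and an index records which file belongs to which combination. The sketch below is only an illustration in Python under stated assumptions. The parameter names, the file name pattern, the `FILE` column, and all function names are hypothetical, and no rendering is performed; a real workflow would call a renderer where the file name is generated.

```python
import csv
import io
from itertools import product


def build_image_index(parameters, name_pattern):
    """Return one row per parameter combination, each with a FILE entry."""
    names = list(parameters)
    rows = []
    for combo in product(*(parameters[n] for n in names)):
        row = dict(zip(names, combo))
        # A real in-situ pipeline would render and save the image here.
        row["FILE"] = name_pattern.format(**row)
        rows.append(row)
    return rows


def index_to_csv(rows):
    """Serialize a non-empty index as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def lookup(rows, **wanted):
    """Post-hoc query: return the files whose parameters match the request."""
    return [r["FILE"] for r in rows
            if all(r[key] == value for key, value in wanted.items())]


# Hypothetical parameter space: two time steps and three camera angles.
parameters = {"time": [0, 10], "phi": [0, 90, 180], "theta": [45]}
rows = build_image_index(parameters, "t{time}_phi{phi}_theta{theta}.png")

print(index_to_csv(rows))
print(lookup(rows, time=10, phi=90))
```

Post-hoc exploration then becomes a lookup over pre-rendered images rather than a reload of the full simulation data, which is the trade the approach makes: many small images are stored in place of the raw data set.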

What’s next for the ECP Data Management and Visualization project?

Within the data management part of the portfolio, there are compression algorithm projects, so scientists can compress their data. There are checkpoint restart projects to save simulation state. The checkpoint restart process is accelerated via hardware burst buffers. We need to make sure all of that’s ready to go as well. All these pieces combine to ensure that we’re managing data appropriately, understanding it, and getting the data to and from these storage devices in a way that lets scientists get their work done at exascale.