By Scott Gibson
A software product called the Scalable Checkpoint/Restart (SCR) Framework 2.0 recently won an R&D 100 Award. Two researchers from the Exascale Computing Project (ECP) involved in the development of SCR, Elsa Gonsiorowski and Kathryn Mohror of Lawrence Livermore National Laboratory (LLNL), are guests on the Let’s Talk Exascale podcast to discuss what SCR is and does, the challenges involved in creating it, and the impact it is expected to have in high-performance computing (HPC). The episode was recorded in Denver at SC19: The International Conference for High Performance Computing, Networking, Storage, and Analysis.
SCR enables HPC simulations to take advantage of hierarchical storage systems, without complex code modifications.
“With SCR, the input/output (I/O) performance of scientific simulations can be improved by orders of magnitude,” said Gonsiorowski, an HPC system software developer and I/O support specialist at Livermore Computing, the HPC facility at LLNL. “The results are produced in significantly less time than they could be with traditional methods.”
SCR development began around 2007, when an LLNL code team was trying to run its application on a new system. “Because the system was new, it still had a lot of bugs to shake out and failed frequently,” said Mohror, leader of the Data Analysis Group at the Center for Applied Scientific Computing at LLNL. “The applications failed so frequently the team struggled to write checkpoints to the parallel file system. This problem stalled progress in the development of the application.”
Adam Moody, a colleague of Gonsiorowski and Mohror, had the idea to cache checkpoints on the compute nodes, an approach that is much faster than writing them out to the parallel file system. He implemented his idea in SCR and worked with the code team to integrate it into their code.
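The idea of caching checkpoints on the compute nodes can be illustrated with a minimal Python sketch. It is not SCR's implementation or API: the function name `write_checkpoint`, the `flush_every` parameter, and the two directories standing in for node-local storage and the parallel file system are all assumptions made for illustration. Every checkpoint goes to the fast local cache, and only every few steps is a copy pushed to the slower shared storage.

```python
import shutil
import tempfile
from pathlib import Path


def write_checkpoint(step: int, state: bytes, cache_dir: Path,
                     pfs_dir: Path, flush_every: int) -> Path:
    """Write a checkpoint to the local cache; occasionally copy it to shared storage.

    Hypothetical helper for illustration only, not part of SCR.
    """
    name = f"ckpt_{step:06d}.bin"
    local_path = cache_dir / name
    local_path.write_bytes(state)          # fast path: node-local cache
    if step % flush_every == 0:
        shutil.copy2(local_path, pfs_dir / name)  # slow path: shared storage
    return local_path


# Demo: two temporary directories stand in for the cache and the parallel file system.
with tempfile.TemporaryDirectory() as cache, tempfile.TemporaryDirectory() as pfs:
    cache_dir, pfs_dir = Path(cache), Path(pfs)
    for step in range(1, 7):
        write_checkpoint(step, f"state at step {step}".encode(),
                         cache_dir, pfs_dir, flush_every=3)
    cached_count = len(list(cache_dir.iterdir()))
    flushed_names = sorted(p.name for p in pfs_dir.iterdir())
```

In this sketch, six checkpoints land in the cache while only steps 3 and 6 reach the shared directory, which is where the savings in slow I/O would come from.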
“Using SCR, the code team was not only able to finally make progress but was also able to achieve an I/O improvement of 48X,” Mohror said. “Now, years later, we are collaborating with Argonne National Laboratory and NCSA [the National Center for Supercomputing Applications] on SCR development to continue to improve SCR for applications.”
SCR has a long history of development that has led to a number of features that support fault tolerance and performance-portable I/O, particularly in complex storage hierarchies. “We have decided to make these features available to the community by splitting the functional pieces into separate component libraries,” Gonsiorowski said. “We are currently collaborating with the ECP VeloC and UnifyFS teams, who plan to use some of these components. [UnifyFS intends to apply the encoding/decoding pieces as well as a library called AXL that manages asynchronous data transfers.]”
Ready for the Big Time
In 2007, burst buffers didn’t exist, so the only storage available for caching checkpoints was main memory. “The good thing about using main memory is that it is fast, but there are a few downsides too,” Mohror said. “One is that if we use memory for checkpoints, less is available to the application. Another downside is that if a compute node fails, the data in memory is lost, which means we had to develop checkpoint protection schemes so that if a compute node went down, we could still recover the checkpoints stored on that node.”
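One well-known way to protect cached data against the loss of a single node is XOR parity, the same idea used in RAID 5. The sketch below shows the principle only; the article does not describe which schemes SCR uses or how they are implemented, and the function names and sample data here are hypothetical. It also assumes all checkpoints have the same length and that the parity block is stored somewhere other than the node that fails.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("blocks must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(checkpoints: list[bytes]) -> bytes:
    """Parity block: the XOR of every node's checkpoint."""
    return reduce(xor_bytes, checkpoints)


def recover_lost(survivors: list[bytes], parity: bytes) -> bytes:
    """Rebuild the one missing checkpoint from the survivors and the parity."""
    return reduce(xor_bytes, survivors, parity)


# Demo with made-up data: three nodes, one of which fails.
node_data = [b"node0-state", b"node1-state", b"node2-state"]
parity = make_parity(node_data)

lost_index = 1
survivors = [d for i, d in enumerate(node_data) if i != lost_index]
recovered = recover_lost(survivors, parity)
```

The cost of this kind of scheme is one extra block of storage per group of nodes, and it tolerates only one failure per group, which is why failure modes that take out several nodes at once need separate treatment.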
The researchers also had to understand the different failure modes that could happen when applications are running. “These range from the whole compute center going down to failure of a process on a single node, and everywhere in between,” Mohror said. “For example, you can lose multiple nodes if they are connected to the same power supply that fails. Determining how to protect applications from those scenarios was challenging, and a lot of work went into SCR to protect applications and recover from these different failure modes.”
All the efforts have paid off in very tangible ways. “The true testament to SCR’s success comes from our users,” Gonsiorowski said. “We have documented some of their successes over the years, and I believe that contributed strongly to our submission [for the R&D 100 Award]. LLNL staff also helped us spread the word by creating an outstanding and informative video about SCR. Moreover, in the SCR 2.0 release we added a key feature that helps applications improve general I/O performance. With this generalized capability, SCR was ready for the big time with R&D 100.”
Benefits to Users
SCR helps users in two ways: it provides I/O performance portability, and it lets them get the most out of their allocation without having to monitor their jobs constantly.
“For I/O performance portability, application developers just need to integrate SCR, and then their I/O code is portable to different systems without needing to make changes,” Mohror said. “Also, they can get orders of magnitude better performance over the parallel file system. For example, SCR is 300 times faster than the parallel file system on LLNL’s Lassen machine.”
SCR enables codes to keep making progress. “Scientific codes can hang unexpectedly or encounter errors even if the system is running perfectly, so users will typically keep checking on their code—maybe seeing if output is still being generated—to make sure it is still running properly,” Mohror said. “One of our users compared this to being woken up every hour all night by a newborn baby when you need to ensure the results of your application come out as fast as possible. However, SCR can keep tabs on your application, detect if there is a problem, and automatically restart your job so that you don’t lose much of your allocation time.”
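The monitor-and-restart behavior described above can be sketched in a few lines of Python. This is an assumption-laden illustration, not SCR's mechanism: `seems_hung`, `run_with_restarts`, the timeout rule, and the fake job are all hypothetical names and choices. The sketch treats a job as hung when it has produced no output for longer than a timeout, and reruns a failed job until it succeeds or a restart budget runs out.

```python
from typing import Callable


def seems_hung(last_output_time: float, now: float, timeout_s: float) -> bool:
    """Treat the job as hung if it has produced no output for more than timeout_s."""
    return (now - last_output_time) > timeout_s


def run_with_restarts(run_once: Callable[[int], bool], max_restarts: int) -> int:
    """Run the job, restarting after each failure; return the number of attempts used.

    run_once(attempt) should return True on success and False on failure or hang.
    """
    for attempt in range(max_restarts + 1):
        if run_once(attempt):
            return attempt + 1
    raise RuntimeError("job did not complete within the restart budget")


# Demo: a fake job that fails twice, then succeeds on the third attempt.
outcomes = [False, False, True]


def fake_job(attempt: int) -> bool:
    return outcomes[attempt]


attempts_used = run_with_restarts(fake_job, max_restarts=3)
stalled = seems_hung(last_output_time=100.0, now=500.0, timeout_s=300.0)
healthy = seems_hung(last_output_time=100.0, now=200.0, timeout_s=300.0)
```

In a real workflow, each restart would resume from the most recent checkpoint rather than from the beginning, which is what keeps the lost allocation time small.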
As new storage systems with different characteristics and new capabilities are regularly being deployed, one of the best things SCR will do for users of HPC systems is enable them to roll with the rapid pace of change.
“SCR has promised to provide portable I/O performance for our users, so we stay on top of the latest developments and newest systems,” Gonsiorowski said. “Coupled with our move to develop component libraries, we hope that our efforts to improve I/O performance can be leveraged by other ECP projects.”