Principal Investigator: Michael Lang, Los Alamos National Laboratory
This project focuses on containers for HPC parallel applications. It provides technology that eases the deployment of new applications and software via containerization, a flexible runtime that enables containers to run across a wide variety of HPC platforms, and support for deploying containers in producer-consumer workflows and coupled jobs.
The BEE project is essential for ECP to maintain a modern, relevant mechanism for releasing and deploying software, particularly as the number of third-party libraries and platforms proliferates. At present, BEE is the only ECP project that bridges the portability gap between the container solutions that individual facilities have adopted to meet their own security requirements (Charliecloud, Shifter, and Singularity). To make broad use of emerging analysis toolkits such as Jupyter Notebooks, R, and Plotly, HPC facilities must deploy container execution environments that allow scientists to bring their own containerized analysis stacks. Additionally, it appears increasingly likely that even application teams will prefer to distribute containers with libraries configured and built specifically for their applications, rather than rely on the generic, lowest-common-denominator configurations that facilities are forced to provide.
LANL Resilience work is essential for ECP because it makes it possible to investigate the fault characteristics of large-scale systems and then subject applications and runtimes to those fault conditions for resilience testing. For example, FleCSALE is a complex hydrodynamics application that can be evaluated using parallel fault injection with P-FSEFI to find the areas of the application that are more or less vulnerable to silent data corruption. With this knowledge, fault tolerance techniques can be applied to the more vulnerable areas that are critical to obtaining a high-fidelity answer.
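
The idea behind this kind of vulnerability study can be illustrated with a minimal sketch. It is not P-FSEFI and does not use its interface: the two-region kernel, the function names (flip_bit, toy_kernel, run_campaign), and the tolerance are all assumptions made for illustration. The sketch flips a single bit in one value as it is read in a chosen code region, compares the output against a fault-free run, and tallies how often each region produces a silently wrong answer.

```python
import random
import struct


def flip_bit(value: float, bit: int) -> float:
    """Flip one bit (0 = least significant, 63 = sign) of an IEEE 754 double."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", value))
    (corrupted,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))
    return corrupted


def toy_kernel(values, fault=None):
    """Hypothetical two-region kernel; fault is (region, index, bit) or None."""
    total = 0.0
    for i, v in enumerate(values):  # region "sum": value is added into the result
        if fault is not None and fault[0] == "sum" and fault[1] == i:
            v = flip_bit(v, fault[2])
        total += v
    above = 0
    for i, v in enumerate(values):  # region "count": value only feeds a threshold test
        if fault is not None and fault[0] == "count" and fault[1] == i:
            v = flip_bit(v, fault[2])
        if v > 0.5:
            above += 1
    return total + above


def run_campaign(values, trials=200, tol=1e-9, seed=0):
    """Inject one random bit flip per trial and classify each outcome per region."""
    rng = random.Random(seed)
    golden = toy_kernel(values)  # fault-free reference answer
    tally = {region: {"benign": 0, "sdc": 0} for region in ("sum", "count")}
    for region in tally:
        for _ in range(trials):
            fault = (region, rng.randrange(len(values)), rng.randrange(64))
            result = toy_kernel(values, fault)
            # NaN or infinite results fail this comparison and are counted as "sdc".
            benign = abs(result - golden) <= tol * abs(golden)
            tally[region]["benign" if benign else "sdc"] += 1
    return tally


if __name__ == "__main__":
    data_rng = random.Random(1)
    data = [data_rng.random() for _ in range(32)]
    for region, counts in run_campaign(data).items():
        print(region, counts)
```

In a sketch like this, a region whose corrupted values are often masked (here, by the threshold test) will tend to show fewer silent corruptions than a region whose values flow directly into the answer, which is the kind of per-region difference that would guide where fault tolerance techniques are applied.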