The Flux Software Framework Manages and Schedules Modern Supercomputing Workflows

09/13/21

Exascale Computing Project · Episode 84: The Flux Software Framework Manages and Schedules Modern Supercomputing Workflows

By Scott Gibson

Clockwise from top left: Dong Ahn, Stephen Herbein, Dan Milroy, and Tapasya Patki of LLNL and the Flux project. Credit: LLNL

Hi, in this podcast we explore the efforts of the US Department of Energy’s (DOE’s) Exascale Computing Project (ECP)—from the development challenges and achievements to the ultimate expected impact of exascale computing on society.

This time we delve into a software framework developed at Lawrence Livermore National Laboratory (LLNL), called Flux, which is widely used around the world. Flux enables science and engineering work that couldn’t be done before.

For the discussion, we’re joined by Dong Ahn, Stephen Herbein, Dan Milroy, and Tapasya Patki of LLNL and the Flux team.

Our topics: Flux’s grassroots origin, benefits, importance to science and engineering, and more.

Interview Transcript

Gibson: Flux is a software framework that manages and schedules scientific workflows to make the most of computing and other resources and enables applications to run faster and more efficiently. Will you get us going by sharing why Flux was created and by outlining a bit of its history?

Ahn: Sure. I would speak to how our Flux project started because this can actually tell a lot about the problems that it is good at solving.

In the beginning, Flux was a very grassroots effort. It started when Livermore’s in-house system software experts realized that traditional workload managers in use at Livermore were a bit too brittle for future use. We should be modest, but we were pretty qualified to call it. In fact, Livermore computing has a long history of designing workload managers, such as LCRM [Livermore Computing Resource Management] and Slurm, which have had a worldwide adoption. Some of the founding members of Flux are actually some of those principal engineers behind these solutions.

It goes back to nearly eight to nine years ago. Looking forward to the next decades to come, we realized that two of the strongly emerging trends that had just begun to affect HPC [high-performance computing] workload managers would continue into the decades to come and would negatively affect us to a greater degree.

One trend was what I call the resource challenge. Back then, hardware vendors had begun to incorporate various specialized hardware beyond the CPU, which included things like GPUs and burst buffers. The traditional solutions with homogenous compute node- and compute core-centric assumptions were already being increasingly hard-pressed in coping with the so-called heterogeneous hardware environments. The problem would never become any easier if left to chance.

The other trend was users’ workloads running on HPC systems. They were also quickly changing from single MPI [Message Passing Interface] jobs to rather complex interdependent jobs and tasks that needed to be managed all together in a highly sophisticated fashion. This pattern was created as researchers increasingly embraced much more complex scientific workflows.

So, indicatively, we already saw back then so many domain-specific workflow managers with a lot of redundant business logic implemented across them to support those emerging workflows. And this produced a rather unwieldy software ecosystem workflow. One system cannot interoperate with another. A large part of this problem was that the system’s workload managers were rather simple and didn’t provide the necessary capabilities.

These resource and workflow challenges really called for a fundamental rethinking in software design. The amount of effort to bend an existing solution to meet these emerging requirements would be much greater than that of a new software design and rewrite.

By all means, we all knew that our goal was a giant undertaking. But our vision for next-gen workload manager software was compelling enough and well received by our stakeholders within the DOE [Advanced Simulation and Computing] program and then later ECP. And born out of that was Flux.

Gibson: Who can gain the most from using Flux?

Ahn: My answer would be both computing facilities that operate large HPC systems and users who run their workflows on them. But if you ask who is gaining the most from using Flux today, it is a slightly different story. So let me begin there.

Anyone who understands the amount of effort needed to replace a system’s workload manager would agree that it takes a multiyear, phased approach to deploy a new system’s solution—like replacing Slurm with Flux for all the systems in a large center. However, Flux has just two modes of operation, one of which is called single-user mode, and it turned out this drastically lowers the barrier of entry for end users to use Flux and gain from it before the center adopts its multiuser mode as its system’s workload manager. You can think of it as an overlay workload manager on top of the system-level workload manager, like Slurm, IBM LSF—you name it.

This is analogous to the concept of an overlay network, which is also called a software-defined network in computing. Just like they introduce an additional overlay of network on top of a native network to provide networking flexibility and manageability, flux allows users to do this on top of the native system’s workload manager. In fact, we matured that mode of operation first and have allowed those workflow users to do the science that wasn’t possible before using that.

For example, a cancer research workflow that had to couple applications at different scales on pre-exascale systems, made a pretty powerful testimony that they couldn’t do this without Flux on Livermore’s Sierra and Oak Ridge’s Summit. They used Flux’s single-user mode.

Another powerful recent demonstration within the ECP community was how the Exascale Additive Manufacturing [ExaAM] team was able to demonstrate a 4× job throughput performance improvement by incorporating Flux into a portion of their workflow called ExaConstit. Flux allowed them to bundle their small jobs into a large IBM LSF resource allocation, and they were able to run them all together with only about five lines of changes in their script. And they also used Flux’s single-user mode.

In these use cases, one thing we shouldn’t forget is, though, that when users use Flux like this, all of the interactions that the user’s workflows will have to do are completely isolated within Flux—that is, without having to affect the system-level workload manager’s shared resources. If massive job records coming from such a workflow every second hit a center-wide shared Slurm job database, this can lead to a significant performance degradation of the workflow itself as well as of all the jobs running on the entire center.

In fact, the UIUC [University of Illinois at Urbana-Champaign] PSAAP [Predictive Science Academic Alliance Program] center’s Flux use case of two years actually provided a good insight about this. A user in that program used Flux for their optimization workflow and immediately gained a 2× job throughput improvement on a Livermore system. But it also turned out that once this was done, their workflow drastically reduced the load to our Slurm centralized database too.

The large computing facilities that will ultimately adopt Flux’s multiuser mode as the system’s workload manager will routinely gain benefits like this from Flux’s divide-and-conquer approach at the center level. But also, don’t forget our traditional products are expected to increasingly give up on properly managing and scheduling extremely heterogenous resources like GPUs, multitiered storage, AI [artificial intelligence] accelerators, quantum computing of various configurations, and cloud computing resources. As next-gen facilities will increasingly embrace this so-called extreme heterogeneity, their gain from Flux will only become greater as well.

Gibson: What do you think is the broad importance of Flux—that is, it’s impact on science and engineering?

Ahn: A supercomputer is a tool. It is like a modern-day telescope and microscope. It serves as a scientific instrument to advance modern science and engineering. And it is an unprecedented tool. It provides insight into physical processes across huge spatial scales, from the smallest to the largest objects in the universe and generates a model to predict the future for the scientists and engineers.

And it is becoming a new pillar for various science and engineering disciplines as well. As a result, the scientific applications that use supercomputers are growing each and every year. Today, these capabilities range from solving physics calculations, to making better drugs for diseases like cancer and COVID-19, to making better use of additive manufacturing, while at the same time incorporating new data science approaches like machine learning and AI. I already mentioned a few of those examples.

Generally speaking, as supercomputers are used for a wider array of problems, researchers’ workloads are going through fundamental changes. Until just a few years ago, many users’ workloads remained quite simple. Users submitted scripts containing simple sets of instructions for running a single simulation or application. The scripts then worked with an HPC workload manager, such as Flux or Slurm, which assigned the computing resources to complete the jobs.

Nowadays, many science and engineering disciplines require far more applications than ever before in their scientific workflows. In many cases, a single job needs to run multiple simulation applications at different scales along with data analysis, in situ visualization, machine learning, and AI.

Users found again and again that the existing solutions significantly limit what they can do to do their science. Flux provides innovative solutions for them to do their science much more easily and without having to worry about the workload manager itself slowing things down.

I’m sure we’ll talk more specifics as to how Flux enables science and engineering that couldn’t be done before or could be done but, in a way, that was too expensive and not sustainable.

Gibson: Thanks, Dong. Stephen, please share some specifics as to how Flux helps users.

Herbein: There are three big advantages of Flux to users that we like to highlight. The first is higher job throughput, which means being able to push more work through the system quicker. Higher throughput enables users to run larger ensembles of jobs, which ultimately means more simulation data points and better scientific insight. And it’s not hard to achieve these speedups under Flux. As one data point, the ECP ExaAM team saw a 4× throughput improvement under Flux with only five lines of code changed.

The second advantage of Flux is better specialization of the scheduler. Traditionally, users have to make do with whatever scheduler comes with the system that they’re running on, and this includes that scheduler’s specific configuration. With Flux, users can spin up their own personal Flux instance on a system, meaning they can tweak and tune the Flux scheduler to their heart’s content. We’ve seen several large workflows take advantage of this customizability to eek out every last ounce of performance from the system.

Finally, the third major benefit of Flux to users is its portability to different computing environments. As I mentioned, it’s really easy for users to spin up their own personal Flux instance on a system. So, workflows that need to run at multiple sites no longer need to code to each site-specific scheduler. They can instead code to Flux and rely on Flux to handle the nuances of each individual site. We saw this come into play when COVID-19 hit. The National Virtual Biotechnology Laboratory effort, a consortium of several DOE labs [LLNL, Los Alamos National Laboratory, and the National Energy Research Scientific Computing Center (NERSC)], was using simulations to model the spread of COVID-19. And they needed to use whatever computing cycles they could get from wherever they could get them, but they didn’t have the time to target the different schedulers at each site. So, instead, they programmed their workflow against Flux and then let Flux manage the intricacies of running jobs at each individual computing center.

Gibson: Thanks, Stephen. Now let’s discuss a couple of Flux use cases. I understand you have examples from computational biology—for example, the RAS [from “Rat sarcoma virus”] protein, which is responsible for a third of cancers; you mentioned COVID-19; and there’s the COVID-19 small molecule anti-viral project. What are the highlights from some of the examples?

Ahn: The RAS protein simulation workflow for cancer research is one that I typically use as a real-world case that illustrates how complex users’ workflows are becoming and how quickly the capabilities of even the best-in-class workload managers are being outpaced by that. So let me begin there.

It is one of the successful workflows that ran on Sierra during its early science phase. It is from the MuMMI [Multiscale Machine-Learned Modeling Infrastructure] project, in which life science and biomedicine researchers are developing a new capability to study the RAS protein because RAS when it’s mutated is implicated in one third of human cancers.

Ideally, if they can simulate these RAS proteins at the micromolecular-dynamics level, then they’ll get what they need.

But the space and time requirements of MD [molecular dynamics] simulations are such that you cannot really do that at biologically meaningful scales even on pre-exascale machines or even the exascale machines to come. So, they use a workflow approach as a solution, combining simulations at different scales.

They use a continuum-based application to simulate these genes at the membrane level. And they ran an MD application [ddcMD multi-physics particle dynamics code] to simulate the most interesting regions on the membrane at a fine-grained MD level. Then, they used a machine-learned model in the middle to determine which regions are most interesting to do the fine-grained simulation on.

This is a highly innovative approach, but when they met Sierra for the first time, it wasn’t pretty at all. In particular, scheduling all these different simulations and other components, each with vastly different time and resource requirements, was particularly challenging for the system’s workload managers, such as IBM LSF, to handle. So, they came to Flux to solve the problem.

Here is the gist of their solution.

Their domain workflow manager of choice was Maestro, and they developed a Flux adapter for it to manage the resources using Flux as a back end. This solution allowed them to chop up the resources allocated by IBM LSF and to run heterogenous components across different sets of resources, such as running macro simulation at about 500 nodes using a subset of the resources—like 20 cores per node—and using every GPU to run MD simulations.

To do this, they specialize the scheduling policies of Flux—using its flexibility—in many different ways to get the performance they need and the resource match behavior they need.

In fact, if you crack open a compute node that is running the MuMMI workflow, you will quickly find at least five distinct tasks are running with a pretty sophisticated layout. Like 10 cores per socket for a macro model and one GPU per MD simulation along with a core closest to the GPU—very complicated.

Overall, doing this without Flux was a very difficult thing to do for the team. Flux was essential for them to be successful—and they succeeded. As one indication of their success, they won the best paper award at Supercomputing 19 [SC19] for the innovative scale-bridging computational workflow approach.

In terms of fast, machine learning–based COVID-19 antiviral drug molecule design workflow, I should say this is part of a large multidisciplinary effort at Livermore to combat the COVID-19 pandemic.

In May 2020, a [multidisciplinary] team was formed at Livermore with a goal to develop a new highly scalable, end-to-end drug design workflow that can expediently produce potential COVID-19 drug molecules for further clinical testing. This team brought together multiple scientific experts like the LBANN [Livermore Big Artificial Neural Network] scalable AI toolkit [team], ATOM’s [the Accelerating Therapeutics for Opportunities in Medicine consortium’s] generative molecular design [GMD] pipeline, and the ConveyorLC docking simulation.

After a few meetings, though, this team quickly discovered that creating such an end-to-end solution based on the existing components could present very difficult and challenging problems—workflow problems especially—and, thankfully, they found that Flux could comprehensively solve them.

In the interest of time, I’ll skip the technical details, but I’ll say that Flux’s fully hierarchical scheme was nearly ideal to overcome the scalability limitations of two of the major components that this workflow needed to couple, namely the ConveyorLC docking simulation application and GMD pipeline. And Flux’s fully hierarchical approach allowed the GMD to be able to manage a large ensemble of small docking simulation jobs as one unit in such a way that both the docking jobs and the GMD work well within their scalability limitation envelops.

Based on the Flux-enabled, end-to-end workflow blueprint, this team submitted the first paper focusing on large machine learning training to the COVID-19 special category of the [Association for Computing Machinery] Gordon Bell Prize at last year’s Supercomputing [SC20], and they were one of the four finalists. This team plans to submit the workflow-based results to the same award for Supercomputing this year [SC21] as well.

Other than that, I already mentioned the ECP ExaAM use case and the UIUC PSAAP center use case. We have a lot more, but we are at a point where we can’t really track all of the use cases. So, at this point, let me stop there. But I’ll say that one of the common threads from these uses cases is that Flux makes workflow jobs stand on equal footing with the traditional, large, long-running MPI jobs on large HPC systems enabling modern science.

Gibson: Stephen, in straightforward, layperson language, will you describe for us how Flux works? And if you don’t mind, please help us understand the terminology along the way—for example, divide and conquer in the context of Flux.

Herbein: So, have you ever been under a tight deadline and thought, “Man, I wish I could clone myself to get this work done quicker?” Well, that’s essentially what Flux can do. It can clone itself and then use that divide-and-conquer approach that you mentioned to get through the work much quicker, where each Flux instance is taking a different chunk of work and resources to schedule. With enough clones, though, things can quickly get messy when organizing and managing who’s working on what, so Flux enforces a strict hierarchy in which higher-level Flux instances help manage and coordinate the lower level ones, something like what you might see in a business structure. These hierarchies exist because each Flux instance is empowered to create its own nested child instance that manages a subset of the jobs and resources of the parent, and each one of these children can be customized and tuned for the work it’s going to handle. So, in a business setting, you imagine every manager is empowered to hire their own workers that they can then delegate tasks to.

This nesting capability comes in handy for more than just divvying up work. It’s also the reason why Flux is so portable across systems. When a child Flux instance is spinning up underneath a parent Flux instance, it has to have a way to learn about the resources that it inherited from its parent. There’s already a well-known standard for this. It’s called the Process Management Interface [PMI], which is the same interface that the very popular MPI runtime uses when it spins up.

So, child Flux instances know how to speak PMI to get information from their parents. This is great because every HPC scheduler out there also speaks PMI, which means that Flux can easily talk to every HPC scheduler to discover what resources are available for scheduling jobs. This standard use of interfaces is why Flux can be so portable across so many systems.

So, in summary, Flux’s ability to nest inside itself and create this hierarchy gives you sort of a big bang for your buck because you get better throughput because you can divvy up the work across many Flux instances, and you can specialize each one of those, and then it also makes it easy to spin up Flux underneath other schedulers.

Gibson: Thanks. Dan, A key feature of Flux is its ability to model system resources as a directed graph. Please explain the components and dynamics involved.

Milroy: Basically, a directed graph is a mathematical object with a corresponding data structure representation in computer science that associates objects, also known as vertices, through directed relationships, which are considered the edges. For example, a social media network is a directed graph: users are vertices, and communications between two users are edges where the direction itself can be defined by the user that initiates contact.

In the case of Flux, a vertex can be a hardware resource—such as a CPU or compute node—and an edge can indicate containment; in other words, for example, a server contains a CPU inside of it. Relying on a directed graph to represent resources allows Flux to manage and schedule larger, more heterogeneous and dynamic systems where resources can be disused, fail, or change price. Managing and scheduling complex combinations of resources that change over time requires elevating resource relationships to parity with resources themselves, which favors a representation that encodes this sort of information in both vertices and edges.

Now, using a directed graph as a foundation for resource representation provides Flux with several key capabilities. The first one is flexibility. Any type of resource—such as hardware, software, power distribution units—can be a vertex. Fully hierarchical scheduling, as Stephen was mentioning, assumes—actually—a really elegant form when based on a directed graph model. Each Flux instance manages and schedules a subgraph, which is a subset of the vertices and edges of the resource graph, where a child instance’s resources are a subgraph of its parent. Beyond that, a tremendous number of algorithmic techniques and optimized software libraries exist for fast operations on directed graphs. By basing its resource model on a directed graph, Flux can really quickly check resource states and schedule allocations, transform resource representations, add and remove dynamic resources, and all sorts of other useful operations.

Scheduling the operations themselves in the context of directed graphs requires really basic procedures. To request an allocation, a user specifies their needs, and then Flux actually transforms those needs into a directed graph, which it uses as a template to find matching resources in the system resource model. Finding the resources amounts to checking resource vertices for availability, which Flux performs with its highly optimized implementation of depth-first search.

Flux enhances the directed graph with additional data structures that enable sophisticated search. In Flux’s directed graph model, each resource vertex embeds something called a planner, which is an efficient data structure that allows for fast lookup for the next resource availability. Instead of performing a depth-first search of the entire resource graph, Flux actually uses a filtration mechanism that bypasses portions or subgraphs of the resource graph that aren’t available. High-level vertices, for example, containing pruning filters can track the counts of available lower-level resources. And if the depth-first search encounters [insufficient] low-level counts or resources at the pruning filter, the entire subgraph can just be skipped. Once Flux selects a matching resource subgraph, it updates those pruning filters to reflect the new job allocation. Updating the filters immediately accelerates further searches, which reduces resource allocation time and improves the scheduler throughput.

Future computing is going to need more flexibility and dynamism, and the ability to change the graph model in any way at any time is another one of Flux’s primary features. Adding or removing resources is a straightforward matter of well-known graph editing techniques for just basically inserting or adding or deleting subgraphs. And unlike existing schedulers and resource managers, Flux allows for dynamic transformation of its resource model without manual reconfiguration and restart of the scheduler, which further facilitates automated changes in resource relationships and addition or removal of resources.

Gibson: Tapasya, will you speak to the flexibility and open-source aspects of Flux?

Patki: Sure, Scott. Flux can easily run on a single user’s laptop or at an entire supercomputing center, and anything in between. So, that addresses the flexibility question that you just brought up. Users can kick off a Flux-managed workflow and monitor its progress with just a few commands. And Flux is flexible enough to be integrated with system schedulers and workflow managers if that’s what a user needs.

Broad applicability is only meaningful if other users and facilities can benefit from Flux’s capabilities, and that is why we’ve made Flux. Users outside of our lab can take advantage of its fully hierarchical resource management as well as its graph modeling capabilities in their own scientific computing workflows. Flux is already making a big impact across the world with its flexibility and capabilities.

To support the open-source community, our development team regularly gives tutorials, presentations, and workshops to help users get started with Flux and to answer their questions. Our documentation website provides clear instructions and examples along with several presentations and related publications. We also engage with users and announce Flux releases on GitHub as well as Twitter.

Gibson: Thanks, Tapasya. Dan, what is your perspective on the degree to which users are adopting Flux, including evidence you’ve seen of worldwide use?

Milroy: As Dong mentioned when he was talking about the single-user mode of Flux, that [mode] has actually really facilitated adoption of Flux, and, as a result, we’re seeing it being adopted at a worldwide scale. Flux is being used at over 50 institutions worldwide, including our collaborators in both United States and European academic institutions, US national labs, the US military and federal agencies, and domestic and international scientific computing and HPC centers, such as NERSC in California and Riken in Japan, which is home to the currently top-ranked Fugaku supercomputer.

Now, Dong also mentioned the center-level adoption. The center-level adoption for multiuser mode is going to require more time and effort, and we’re doing kind of last-mile development for this to extend our success. And as Tapasya was mentioning, the Flux team has been working really hard to provide a lot of documentation, whole tutorials, and engage with developers early on to help them design their workflows on top of Flux. So, this is all kind of contributing to the adoption as well.

We’re currently testing Flux on a test bed machine and plan to deploy it on a CTS-2 system at the laboratory early next year. [We are] also [testing] on the upcoming exascale system, El Capitan, for which we’re actually the plan-of-record scheduler, and our success is really expected to be replicated there.

Gibson: OK. Thanks, Dan. Now to Stephen. What’s in store in the near future for Flux development?

Herbein: We’ve really been focusing our development around those two system milestones that Dan just mentioned. In particular, deploying on our commodity Linux clusters, which we call CTS-2, and the upcoming exascale system, El Capitan.

For the CTS-2 systems, we’re focusing on improving the resiliency of Flux to node failures, particularly nodes that are intermediates in our tree-based overlay network. In addition, we’re adding features to the privileged component of Flux—that is, the component of Flux that uses root permissions. It’s the most vulnerable to exploitation, so we want to make sure that it’s very secure and that we have it right. So, we’re adding features for feature parity with other HPC schedulers.

And for El Capitan, we’re focusing on interfacing with HPE [Hewlett Packard Enterprise]—the vendor providing that system—and on integrating with their system software. That includes their interface for bootstrapping, HPE Cray MPI, their common tools interface for launching debuggers and tools, and then their near node I/O subsystem known as Rabbit.

This Rabbit I/O system actually presents some interesting resource challenges that Flux’s graph scheduler that Dan was describing earlier is uniquely qualified to handle. In particular, there are SSDs [solid-state drives] at the top of each rack in El Capitan, and those SSDs will be configurable at schedule time to either be accessible to the nodes within the same rack via PCI express, and so that means those nodes get really fast, very low-latency connections, or the SSDs could be configured to be accessible to nodes anywhere in the cluster via the network. But it can’t have both accessibility modes at the same time. And so, this ability to dynamically switch the access pattern and the associated locality so nodes must be in the same rack as the SSDs that they’re trying to access via PCI express—that requires extra intelligence and constraint satisfaction by the scheduler. And Flux’s graph scheduler is uniquely capable of doing that.

Gibson: And finally, how is Flux getting set for the post-exascale era?

Ahn: We are preparing for even the post-exascale era, and one of the key features of that era will be a convergence of [HPC] with cloud computing. So, I’d like to have Dan cover some of those aspects, because he’s the subject-matter expert on that topic.

Milroy: Thanks, Dong. HPC is really seeing an increasing demand for cloud technologies at this point, and cloud is one of, or really the dominant market player, in computing now. And we’re expecting to see greater integration of its research and development in HPC in the future. We’re seeing a future of what’s called converge computing, which we consider to be an environment that combines the best features of HPC, such as performance and efficiency and sophisticated scheduling, with those of the cloud, which are resiliency, elasticity, portability, and manageability.

Now, cloud bursting might be the most well-known type of converge computing, and we’ve done quite a bit of research on how Flux can actually facilitate cloud bursting in a very flexible way by combining the fully hierarchical scheduling with its directed graph resource model.

As we’ve kind of been talking about a little bit, we have a parent and child relationship with nested instances where if a child requires more instances, it can ask its parent for more. And, in this case, if the parent doesn’t have more resources, it becomes the child in this hierarchy and can ask its parent whether it has more resources—and kind of so on up this hierarchy.

Cloud can be considered kind of just another parent in this relationship, and if the top-level parent doesn’t have satisfactory resources, it can then ask the cloud. So, those resources can be delivered by the cloud and then [be] integrated at each level all the way down the hierarchy until it gets to the requesting leaf. That’s one of the things that we’ve put a lot of work into researching from the scheduling side, at least.

We’re noticing other trends as well in the converged workflows, at the laboratory and at other major institutions and within industry as well. As applications are starting to demand tighter integration with cloud orchestrations, such as Kubernetes, a major challenge that we’re seeing is actually scheduling and managing resources that kind of use both HPC resource managers and orchestrators at the same time, and we’ve forged a collaboration with IBM T. J. Watson and Red Hat OpenShift to tackle a lot of these research-related challenges with scheduling and managing converge computing resources.

Gibson: Thanks to each of you for being here.

Related Links