Revisiting the CANcer Distributed Learning Environment (CANDLE) Project
By Scott Gibson
This is episode 103 of the Let’s Talk Exascale podcast.
[Scott] Hi. In this episode, we focus on a subproject of the US Department of Energy’s Exascale Computing Project called the CANcer Distributed Learning Environment, or CANDLE. It is an open-source collaboratively developed software platform that provides deep-learning methodologies for accelerating cancer research. My guests will unpack what all of that means. And I will also note here that I’ll provide links to previous interviews we’ve done with members of the CANDLE team in case you want to take in that content as well.
Our guests from the CANDLE team this time are John Gounley and Heidi Hanson from Oak Ridge National Laboratory.
John is a computational scientist in the Biostatistics and Multiscale Systems Group within the Computational Sciences and Engineering Division, or CSED, at ORNL. His research focuses on scalability of algorithms for biomedical simulations and data. And Heidi is the Lead of Biostatistics and Multiscale Systems Modeling Group within CSED at ORNL. Additionally, she is the technical lead of Modeling Outcomes using Surveillance data and Scalable Artificial Intelligence for Cancer, also known as the MOSSAIC project. MOSSAIC is a sub-project of the Joint Design of Advanced Computing Solutions for Cancer and a partner with CANDLE.
Concerning the CANDLE team, Argonne National Laboratory spearheads the effort, and the principal investigator is Rick Stevens of Argonne.
John, will you tell us about the team roster and the various forms of expertise the members bring to the work?
[John] Sure. So CANDLE’s a pretty large and distributed project. In addition to the group you mentioned at Argonne, and then our group here at Oak Ridge, there are also groups at Los Alamos, Lawrence Livermore, Pacific Northwest, and Brookhaven national labs. And there’s varying expertise and focus from those groups.
The Lawrence Livermore group has focused a lot on the computational side of things. Los Alamos has been a little more focused on the uncertainty quantification. For Brookhaven, a little more focused on workflows, etcetera. So we’ve distributed not only some of the work but the particular expertise as well.
[Scott] All right, what is the elevator speech for CANDLE, John?
[John] So for the DOE–NCI [National Cancer Institute] partnership that CANDLE and a lot of the application projects grew out of, we really had a set of problems that NCI had come to us with, where they were looking for a solution that came from deep learning and from high-performance computing. And then that was sort of really falling into what ECP wanted to focus on.
And the idea of CANDLE was that, for a lot of those different problems, we were going to end up needing to build very similar tools and workflows. So the goal of CANDLE is really to build one unified thing so that we’re not all rebuilding and kind of reinventing the wheel. But we’re doing one thing in an integrated fashion and kind of learning from what each other is doing.
[Scott] Let’s go back to the inception of CANDLE and set the scene for the rationale behind its creation. What was the vision for CANDLE: the problems to be solved, the proposed strategies, and desired outcomes?
[John] Sure. So NCI came to us with kind of a set of three problems: one looking at cancer cell responses to drugs, one looking at cancer biology, and in particular, proteins and protein mutations within cancer cells. And then finally, looking at cancer at the population level. And each of these was really a deep learning question that we were looking to solve. I think Heidi can speak more to the population health one that we focused on.
[Heidi] Yeah, so the MOSSAIC project focuses on the National Cancer Institute’s surveillance program, which is the Surveillance, Epidemiology, and End Results, or SEER Program. What that program does is collect data on approximately 50% of the US population. And it’s cancer-specific data so that we can track the trends in cancer incidence and mortality across the nation.
A lot of these different types of surveillance programs generally operate in the background of what happens on a daily basis here in the US, but they become very, very important for identifying when there is something that we should be concerned about on a health or population health basis.
So, for example, COVID should bring to mind why we would need national upper-level surveillance and why that’s important to national security. Really, human health security affects economics, and SEER is something that brings that to us in the cancer space. They’re collecting over 800,000 reports on cancer in a year. And that’s a lot of data. And so, their normal processes are to manually code all of this information so that they can create a nice common structure that people can analyze.
And DOE and the CANDLE project came to the room and said, ‘Hey, let us make a deep-learning model that can help automate some of that work and speed it up.’ Currently, when the registry is manually coded, it takes about 27 months to go from a report to a cancer statistic that can identify what’s happening within the population with respect to cancer. The automated process speeds that up to a matter of months.
So going back to COVID, one example of how that’s used in what happened with COVID is cancer screening rates went down because people couldn’t go to the doctor, and SEER or NCI wanted to identify, ‘How do we know …?’ Excuse me. ‘Are people getting diagnosed at later stages?’ So basically they were able to use the model that had been developed to answer that question, whereas 3 or 4 years ago it would have taken them 2 years to answer that question.
So that’s kind of what the NCI part of the project does and why it’s important to have the CANDLE and ECP program.
[Scott] How would you summarize the CANDLE journey? That is to say, the problems solved, lessons learned, and accomplishments made.
[John] I think there’s probably sort of three categories that those things would fall into.
For a lot of ECP projects, the big challenge was getting the compute on the GPUs and really accelerating things that way. And for us, it was a little different in that a lot of deep-learning frameworks when we started were pretty much already running on GPUs. However, that didn’t take away the challenge of needing to optimize stuff and get this done in a high-performance fashion. So over the course of the project, we’ve accelerated our core workflows by 1 to 200 times in terms of where we started on OLCF [Oak Ridge Leadership Computing Facility] Titan and where we are right now on OLCF Frontier.
The second thing that I think was a challenge for the project was a lot of the infrastructure and workflows that would underlie solving the questions that Heidi talked about. To the extent that it existed, when we started the project, it was really optimized for a cloud environment or like a local workstation. It wasn’t something you could run on a big, you know, Department of Energy supercomputer. So a lot of that really needed to be built by the project. And we’ve been using those workflows for the past couple of years.
The final challenge, and maybe the most interesting, was that, relative to a lot of science spaces, deep learning, and even cancer research for that matter, it evolved pretty quickly over the course of this project.
So we’ve needed to be agile in terms of kind of following the science and then having our development and compute work follow those new application functions as they arrived. So the big example of that is, over the last couple of years, large language models have become very important. And when we started seeing that motivation coming from the application projects like MOSSAIC, then CANDLE has pivoted to make sure that that’s part of our focus as well.
And that’s also brought us sort of a new space from the computing side, in that when we started the project, we didn’t really have deep-learning models that would fill an entire supercomputer. And so that wasn’t really something that we were optimizing for. But as kind of the science shifted and that became a thing that was important, we have really focused our efforts in that direction.
[Scott] Heidi, will you explain at a basic level the deep learning and neural network computing aspects of the project?
[Heidi] Basically, as we start to understand this question, one of the first things you need to understand is the data that we’re using to train the models.
This is all cancer-related patient information, which means there’s information there that is identifiable. So in order to handle this data appropriately, we need to handle it within a secure computing environment, where there’s no data leakage and individuals can’t be identified from the models that we’re training. So luckily, here at Oak Ridge and the Department of Energy, we have infrastructure that allows us to do that, so air-gapped computers, where we can run everything to scale and come up with the models that we need to predict on the reports accurately.
We have a couple of models that we are using right now.
First, we have a production-level model. That one is a hierarchical self-attention model, and essentially what that model is doing is it’s taking all the words that are within the text and it’s going to first encode them into a numerical expression so that we can actually make meaning of those words. Words can’t be measured, so we have to code them into that numerical representation. And then we’re going to use that numerical representation to really come up with a prediction of the outcome. And so, there are going to be several layers that go into that.
First, we look at the token- or word-level information and create embeddings for that. And then we also look at information at the sentence level, combine all that information together and it allows us to get pretty accurate predictions on the different types of tasks that we’re interested in. Within that there’s a deep abstaining classifier that allows us to predict at high accuracy.
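The two-level idea described above (pool word-level embeddings into sentence vectors, then pool sentence vectors into a document representation) can be sketched roughly as follows. This is a minimal illustration and not the MOSSAIC production model: it uses simple attention pooling with a fixed query vector instead of learned self-attention layers, and the embedding table, query vectors, and example report are all hypothetical toy values.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(vectors, query):
    """Weight each vector by softmax(dot(vector, query)); return the weighted sum."""
    scores = [sum(v_i * q_i for v_i, q_i in zip(v, query)) for v in vectors]
    weights = softmax(scores)
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]

# Hypothetical 2-dimensional word embeddings (toy values, not learned).
EMBEDDINGS = {
    "tumor": [1.0, 0.0],
    "in": [0.0, 0.1],
    "left": [0.2, 0.8],
    "lung": [0.9, 0.4],
    "no": [0.1, 0.1],
    "metastasis": [0.7, 0.7],
}

def encode_document(sentences, word_query, sentence_query):
    # Level 1: pool word embeddings into one vector per sentence.
    sentence_vectors = [
        attention_pool([EMBEDDINGS[w] for w in sentence], word_query)
        for sentence in sentences
    ]
    # Level 2: pool sentence vectors into one document vector.
    return attention_pool(sentence_vectors, sentence_query)

report = [["tumor", "in", "left", "lung"], ["no", "metastasis"]]
doc_vector = encode_document(report, word_query=[1.0, 1.0], sentence_query=[0.5, 0.5])
print(doc_vector)
```

In a trained model the document vector would feed a classification layer for each coding task; here it only shows how information flows from words to sentences to the whole report.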
So you can imagine the health field is high risk, and we don’t want to make predictions that are wrong. So we have this uncertainty quantification built into the algorithm that allows us to say, ‘Hey, I’m not so certain about this prediction, so I’m not going to make the prediction and abstain on it, don’t predict.’ And that leaves us with a set of data that we’re really confident is correct. And then that data is used as an auto-prediction so that essentially the registrar doesn’t have to go through and code the information, but that’s something that the computer tells them and we accept that as truth.
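The trade-off behind abstaining can be illustrated with a small sketch. Note the assumption: this is plain confidence thresholding (abstain when the top class probability is too low), a simpler stand-in for the deep abstaining classifier, whose internals are not described in this conversation. The labels, probabilities, and threshold values are hypothetical.

```python
def predict_or_abstain(probabilities, threshold):
    """Return the most likely label, or None when confidence is below threshold."""
    label, confidence = max(probabilities.items(), key=lambda item: item[1])
    return label if confidence >= threshold else None

def evaluate(predictions, truths):
    """Return (accuracy on retained reports, fraction of reports retained)."""
    kept = [(p, t) for p, t in zip(predictions, truths) if p is not None]
    if not kept:
        return 0.0, 0.0
    accuracy = sum(p == t for p, t in kept) / len(kept)
    coverage = len(kept) / len(truths)
    return accuracy, coverage

# Hypothetical model outputs for four reports, with the true codes.
outputs = [
    {"lung": 0.99, "breast": 0.01},
    {"lung": 0.55, "breast": 0.45},
    {"lung": 0.02, "breast": 0.98},
    {"lung": 0.60, "breast": 0.40},
]
truths = ["lung", "breast", "breast", "lung"]

for threshold in (0.9, 0.5):
    predictions = [predict_or_abstain(o, threshold) for o in outputs]
    print(threshold, evaluate(predictions, truths))
```

With the stricter threshold the sketch auto-codes fewer reports but gets all of the retained ones right; with the looser threshold it codes every report and makes one mistake. The reports it abstains on would go to a human registrar.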
Outside of that production-level model we are using, we’re currently developing in our research a transformers-based model, which is an improvement for us.
Number one: it allows us to have more-accurate predictions. And it’s just a slight boost in accuracy right now. But to us, that translates to thousands of reports, so that’s a very good thing.
Outside of having that slight boost in accuracy, it also allows us to be more agile, kind of, as John mentioned before. So a lot of times, the needs of the NCI change or we need to have something that we want to predict that’s outside of what we have normally trained a model on. In the past, or with the production-level model, we have to train one model per task, and so that’s computationally intensive. And we have to train a new model every time the sponsor wants something new.
What we’re able to do with the transformers models is we have one pre-trained embedding layer, where we take all the information from all the reports, we identify the patterns or features that are within that data. And then we can use that information to fine-tune on the different tasks that we have in mind. And that fine-tuning step is less computationally intensive and allows us to be a little bit more agile, which is kind of necessary in this environment. So, that is, the transformers model is a BigBird model that allows us to really digest that long textual information rather than the short texts. And it’s a new step up for us and something that we hope to have in production very soon.
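The "one shared representation, many small task-specific steps" pattern can be sketched as follows. This is a toy illustration under stated assumptions, not the team's BigBird pipeline: the shared encoder is a fixed bag-of-words featurizer standing in for a pre-trained transformer, it is kept frozen (the simplest variant of fine-tuning), and the vocabulary, example texts, and the two tasks are hypothetical.

```python
import math

# Stand-in for the shared, pre-trained encoder: a fixed bag-of-words featurizer.
# Hypothetical vocabulary; a real system would use learned embeddings.
VOCAB = ["lung", "breast", "left", "right"]

def encode(text):
    words = text.lower().split()
    return [float(word in words) for word in VOCAB]

def train_head(examples, epochs=200, lr=0.5):
    """Fit a small logistic-regression head on top of the frozen encoder."""
    weights = [0.0] * len(VOCAB)
    bias = 0.0
    for _ in range(epochs):
        for text, label in examples:
            features = encode(text)  # the encoder is reused, never updated
            z = bias + sum(w * x for w, x in zip(weights, features))
            p = 1.0 / (1.0 + math.exp(-z))
            error = p - label
            weights = [w - lr * error * x for w, x in zip(weights, features)]
            bias -= lr * error
    return weights, bias

def predict(head, text):
    weights, bias = head
    z = bias + sum(w * x for w, x in zip(weights, encode(text)))
    return 1 if z > 0 else 0

# Two tasks share one encoder; only the small heads are trained per task.
# Task 1: is the site lung (1) or breast (0)? Task 2: is the side left (1) or right (0)?
site_head = train_head([("left lung mass", 1), ("right breast mass", 0),
                        ("right lung nodule", 1), ("left breast lesion", 0)])
side_head = train_head([("left lung mass", 1), ("right breast mass", 0),
                        ("right lung nodule", 0), ("left breast lesion", 1)])

print(predict(site_head, "lung tumor right"), predict(side_head, "lung tumor right"))
```

Adding a third task in this sketch means training one more small head, while the expensive shared encoder is left alone, which is the source of the agility described above.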
[Scott] What level of reliability do you aim for, like 90%?
[Heidi] Oh, yeah, so this is a hot topic. You hit upon something very hot. So 97% is what the NCI has asked us to do.
Now, the beautiful thing about the deep abstaining classifier is that that is not a hard threshold, so you can modify that for your sponsor’s need or for your own research need. So that can be dropped to a level that is, say, 80%, if it meets your own criteria. But because we want to be very sure that the predictions are correct, it’s right now set at 97%.
[Scott] High standard.
[Heidi] Yeah, yeah. Again, you know, it’s a high-risk space, so predicting incorrectly could not be good.
[Scott] Yeah, makes sense.
[Scott] All right. You know, CANDLE is open source, which means it may be developed in a collaborative, public way. So will you talk to us about ways in which it’s already being used and about the spin-offs?
[John] Yeah, maybe to start with what you mentioned about it being sort of developed in a collaborative and open way is definitely an important thing for us, because as Heidi mentioned, a lot of the development that goes into this ends up being started on an air-gapped system, where it’s very much in a closed environment. And we need to be very careful how we move stuff in and out of there.
So what CANDLE has really been important for … and having that open repository has been a place that we can distribute stuff and that we can more easily work with partners and provide code to them in a way that we wouldn’t necessarily in our current projects.
A couple of examples of what that looks like: we’ve been working recently with the HPCToolkit team out of Rice [University] that’s part of ECP and helping them work with some of their new profiling tools on deep-learning frameworks, where we’ve been providing example sets, problem sets for them, from the MOSSAIC project.
Another example of some of that collaboration has been with the FlexFlow project in ECP, and that’s with Stanford and Los Alamos. And in that project they’re developing a low-level runtime that would sit underneath a deep-learning framework like PyTorch, and that’s going to provide us hopefully with improved scalability and performance. And so, they’ve been doing some initial testing with some of our benchmarks as well.
[Heidi] And then to speak a little bit to how it is helping outside of just the normal CANDLE project but within NCI, all of the models that we’ve developed are translational. So essentially, what’s happening is that development happened, we have the models that predict, and that is something that generally in the past has taken 10 years to get into the clinic and really make a change in the patient’s life or in the surveillance of different diseases.
What we are saying now is that our models have been deployed. The first time they were deployed was in August of 2021. They scaled up to 18 cancer registries, and we are predicting or auto-coding about 25% of records that are seen on a yearly basis. So it’s really, truly impacting the health of the population in real time.
[Scott] That’s impressive. So how has ECP accelerated CANDLE’s development—this ECP framework for collaboration and all the different things that take place under the auspices of Exascale Computing Project?
[John] I think within the project, one of the biggest things—and this is going to sound silly—but when we’ve had meetings to ask this question, this always came back to the same answer. We ended up having hackathons quite a bit over the last few years. Obviously, that was disrupted a little in terms of being in person during COVID but getting the disparate teams from all these labs into one room and working together for a few days has been one of the biggest aspects that made a difference for us, just in terms of avoiding reinventing the wheel and being able to see a little more what other teams were working on and to be able to leverage some of that work.
Beyond that and kind of going out to the broader part of ECP, some of the collaborations have been extremely instrumental. I mentioned some of the stuff with HPCToolkit and with FlexFlow, but there are definitely others: CODAR, or the Co-Design Center for Online Data Analysis and Reduction, would be another great example of tools, where if that wasn’t something that we really had a connection to, we would have been building a lot of that stuff from scratch. But it was something that we could not only take from them but really partner with them on. And I think that was very important.
[Scott] What actions do you think will promote the sustainability of CANDLE when ECP ends?
[John] So CANDLE’s an interesting project in that respect because it is a very general software library. And yet, it’s also very tied to its applications and that stuff. So I think as we look toward the end of ECP and where it goes beyond, where we’re really going to see CANDLE really being sustained is within these projects that are pulling from it. And that’s going to be some of the NCI projects that are already involved with it but also future work that’s coming out in the context of bio-preparedness or other applications in the health or artificial intelligence space are going to be building on a lot of the tools that have been developed here.
[Scott] OK. Well, this one’s for both of you: I want to throw this question in and ask you your opinion, just your general thoughts about the value-add of ECP, perhaps not only to CANDLE but to other projects that you’ve interacted with. What are your thoughts?
[John] So I think one of the things that was really important about the way ECP set this up was that this wasn’t just a ‘do deep learning on a supercomputer project,’ and just kind of do that as its own thing, but that they explicitly made those connections to the projects. I think that was helpful for us in terms of having that driving motivation.
But then I also think it ended up helping in terms of us being more, like, translational in applying that stuff. Maybe you can pick that up, Heidi.
[Heidi] Yeah, so just to build off of what I mentioned earlier, it’s really allowed there to be a change in what’s happening on the cancer surveillance front. So this is a real change in the way that we track cancer. And we’re able to see things that happen in more real time.
One example of that is during the COVID pandemic, there was a drop in the number of cancer screenings that were happening. The NCI was interested in finding out how that affected cancer incidence and mortality. If someone’s diagnosed at a later stage, it may lead to more severe cancer and more adverse outcomes. And so, because we had developed this algorithm, they were able to study that in a more real-time fashion than waiting two years, as that’s the normal time it takes to go from seeing a cancer diagnosis to someone coding it and then to having it turned into a statistic.
So that rapid time from seeing a disease diagnosis to statistic to figuring out what are the next steps forward from a population health perspective, I think, is very important. And those are ways that I see the CANDLE project continuing in things like bio-preparedness or other National Institutes of Health–based programs. But I do think the Exascale Computing Project has been really pivotal to moving that forward.
[Scott] Heidi, John, any closing comments?
[John] I think what Heidi said is really important and something I don’t think we really planned on, but it kind of ended up being pretty cool for the project that when COVID came along, there was a lot of stuff that we were able to repurpose reasonably quickly for stuff that was, you know, very timely. And I think that was both a very interesting opportunity for the team and also a nice use case of the capabilities of the project.
[Scott] All right. Well, thank you both so much for taking the time to talk with us.
[John] Thank you.
[Heidi] All right. Thank you.
[Scott] Yes, and to you the listener, thanks as well. Visit exascaleproject.org. Subscribe to ECP’s YouTube channel; our handle is Exascale Computing Project. Additionally, follow ECP on Twitter; we’re @exascaleproject. The Exascale Computing Project is a US Department of Energy multi-lab collaboration to develop a capable and enduring exascale ecosystem for the nation.
CANDLE description (National Cancer Institute website)
CANDLE description (Exascale Computing Project website)
NCI Surveillance, Epidemiology, and End Results Program (SEER)
Article: “Researchers honored for innovative use of AI to fight cancer”
Let’s Talk Exascale Code Development: CANcer Distributed Learning Environment (CANDLE)
The Cancer Distributed Learning Environment (CANDLE): An Interview with Rick Stevens
Lighting the Way to Exascale Precision Medicine
Scott Gibson is a communications professional who has been creating content about high-performance computing for over a decade.