The US Department of Energy (DOE) has entered into a partnership with the National Cancer Institute (NCI) of the National Institutes of Health (NIH). This partnership has identified three key science challenges that the combined resources of DOE and NCI can accelerate. The first challenge, called the drug response problem, is to develop predictive models for drug response that can be used to optimize preclinical drug screening and drive precision medicine-based treatments for cancer patients. The second challenge, called the RAS pathway problem, is to understand the molecular basis of key protein interactions in the RAS/RAF pathway that is present in 30% of cancers. The Ras-Raf-MEK-ERK pathway is a ubiquitously expressed signaling module that regulates the proliferation, differentiation and survival of cells.

The third challenge, called the treatment strategy problem, is to automate the analysis and extraction of information from millions of cancer patient records to determine optimal cancer treatment strategies across a range of patient lifestyles, environmental exposures, cancer types, and health care systems. Although these challenges are at different scales and have specific scientific teams collaborating on the data acquisition, data analysis, model formulation, and scientific runs of simulations, they also share several common threads. The CANDLE (Cancer Distributed Learning Environment) project focuses on the machine learning aspect of the challenges and in particular builds on a single scalable deep neural network (DNN) code called CANDLE.

Project Details

The CANDLE challenge problem is to solve large-scale machine learning problems for three cancer-related pilot applications: the drug response problem, RAS pathway problem, and treatment strategy problem. For the drug response problem, unsupervised machine learning methods are used to capture the complex, nonlinear relationships between the properties of drugs and the properties of tumors to predict treatment response with the goal of developing a model that can provide treatment recommendations for a given tumor. For the RAS pathway problem, multiscale MD (Molecular Dynamics) runs are guided through a large-scale state-space search by using unsupervised learning to determine the scope and scale of the next series of simulations based on the history of previous simulations. For the treatment strategy problem, semi-supervised machine learning is used to automatically read and encode millions of clinical reports into a form that can be computed upon. Each problem requires a different approach to the embedded learning problem, but all approaches are supported with the same scalable deep learning code in CANDLE.

The CANDLE software suite broadly consists of two distinct, interoperating levels: the DNN codes and the Supervisor portion, which handles work distribution across a distributed network. At the DNN level, the CANDLE utility library provides a series of utility functions that streamline the process of writing CANDLE-compliant code. This enables the essential functionality for network hyperparameters to be set either from a default model file or from the command line. This in turn enables experiments to be designed that efficiently sweep across a range of network hyperparameters. The Supervisor framework provides a set of modules to enable various hyperparameter optimization (HPO) schemes and to automatically distribute the workload across available computing resources. Together, these capabilities allow users to efficiently perform HPO on the large compute resources available across the DOE complex, as well as on any local compute resources.
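
The pattern described above, defaults from a model file with command-line overrides and a sweep over hyperparameters, can be sketched in a few lines of standard-library Python. This is a minimal illustration and not the CANDLE utility library's actual API: the section name `Global_Params`, the three hyperparameters, and the helpers `load_defaults`, `resolve_params`, and `grid_sweep` are all hypothetical names chosen for the example.

```python
import argparse
import configparser
import itertools

# Hypothetical default model file contents (assumption for this sketch);
# a real project would read this text from a file on disk.
DEFAULT_MODEL_TEXT = """
[Global_Params]
learning_rate = 0.001
batch_size = 32
epochs = 10
"""


def load_defaults(text):
    """Parse the [Global_Params] section into a dict of typed values."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["Global_Params"]
    return {
        "learning_rate": section.getfloat("learning_rate"),
        "batch_size": section.getint("batch_size"),
        "epochs": section.getint("epochs"),
    }


def resolve_params(argv, defaults):
    """Start from the file defaults; any flag given on the command line wins."""
    cli = argparse.ArgumentParser()
    cli.add_argument("--learning_rate", type=float)
    cli.add_argument("--batch_size", type=int)
    cli.add_argument("--epochs", type=int)
    args = vars(cli.parse_args(argv))
    overrides = {name: value for name, value in args.items() if value is not None}
    return {**defaults, **overrides}


def grid_sweep(defaults, grid):
    """Yield one full parameter set per point in the Cartesian product of `grid`."""
    names = sorted(grid)
    for values in itertools.product(*(grid[name] for name in names)):
        yield {**defaults, **dict(zip(names, values))}


defaults = load_defaults(DEFAULT_MODEL_TEXT)

# A single run: override only the batch size from the "command line".
params = resolve_params(["--batch_size", "64"], defaults)

# A sweep: 2 learning rates x 3 batch sizes = 6 parameter sets.
sweep = list(
    grid_sweep(defaults, {"learning_rate": [1e-4, 1e-3], "batch_size": [32, 64, 128]})
)
```

In a setup like this, each dictionary produced by `grid_sweep` describes one independent training run, which is what makes the workload straightforward for a supervisor layer to distribute across compute resources.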

The challenge for exascale manifests in the need to train large numbers of models. Each pilot application inherently requires high-resolution models that cover the space of specific predictions (i.e., individualized in the precision medicine sense), such as a model that is specific to a certain drug and an individual cancer.

Starting with 1,000 different cancer cell lines and 1,000 different drugs, a leave-one-out strategy that creates a high-resolution model for each drug and cancer combination requires approximately 1 million models (1,000 × 1,000). These models are similar enough that we can use a transfer learning strategy, where weights are shared during training in a way that avoids information leakage, which significantly reduces the time needed to train a large set of models.
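
To make the model count concrete, the sketch below enumerates leave-one-out splits over a toy grid of cell lines and drugs. The exact hold-out scheme is an assumption made for illustration (one model per cell line and drug pair, trained with that pair withheld); the text above only gives the overall counts. The helper names `leave_one_out_splits` and `count_models` are hypothetical.

```python
from itertools import product


def leave_one_out_splits(cell_lines, drugs):
    """Yield (held_out_pair, training_pairs) for every (cell line, drug) pair.

    The held-out pair never appears in its own training set, which is the
    basic condition for avoiding information leakage in this toy setup.
    """
    all_pairs = list(product(cell_lines, drugs))
    for held_out in all_pairs:
        training = [pair for pair in all_pairs if pair != held_out]
        yield held_out, training


def count_models(n_cell_lines, n_drugs):
    """Number of models needed when one model is trained per pair."""
    return n_cell_lines * n_drugs


# Toy grid: 3 cell lines x 2 drugs gives 6 splits, each with 5 training pairs.
toy_splits = list(leave_one_out_splits(["c1", "c2", "c3"], ["dA", "dB"]))

# Full-scale count from the text: 1,000 x 1,000 pairs.
full_scale = count_models(1000, 1000)
```

At full scale the same enumeration yields one million training runs, which is why sharing weights between closely related models matters for total training time.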

Principal Investigator(s):

Rick Stevens, Argonne National Laboratory


Argonne National Laboratory; Oak Ridge National Laboratory; Lawrence Livermore National Laboratory; Los Alamos National Laboratory; Frederick National Laboratory for Cancer Research; National Cancer Institute (NCI)

Progress to date

CANDLE, a partnership between DOE and NCI, developed highly efficient DNNs optimized for the unique architectures provided by next-generation exascale platforms to address three significant science challenge problems in cancer research.

Overall, the CANDLE software suite is a unique and powerful platform that brings together machine learning, deep learning, and cancer research to accelerate the discovery of new cancer therapies and treatments while providing benchmarks to the hardware industry to aid in the development of novel hardware solutions.

The following technical highlights represent recent achievements and accomplishments by the collaborative research teams to apply artificial intelligence to problems in cancer, COVID-19, and beyond.

  • Work on transformers led to the 2022 ACM Gordon Bell Special Prize for High Performance Computing-Based COVID-19 Research for “GenSLMs: Genome-Scale Language Models Reveal SARS-CoV-2 Evolutionary Dynamics,” which ran on ALCF’s Polaris system and two other major U.S. supercomputers. The prize recognizes outstanding research achievement toward understanding the COVID-19 pandemic using high-performance computing.
  • CANDLE benchmarks are currently being used by several new AI hardware companies to test their hardware and software, including Cerebras, SambaNova, Groq, Intel, and AMD, among others.
  • With CANDLE’s 2023 release, the combined performance of the Uno and P3B1 benchmarks on Frontier, Oak Ridge National Laboratory’s next-generation supercomputer, showed a 264x improvement relative to Titan, the largest DOE supercomputer when the ECP-CANDLE project began. This is well beyond the 50x goal that the ECP set in 2018 as a key performance parameter (KPP) for the new Frontier and Aurora (Argonne National Laboratory) supercomputers.
  • The CANDLE library is being used by the DOE-NCI collaboration projects MOSAIC and IMPROVE to automate the extraction of clinical data from medical records, enabling future studies focused on cancer recurrence, and to compare deep learning drug response prediction models that provide options for new animal studies on novel anticancer compounds.
  • The ability to identify and understand low-confidence predictions of DNNs was significantly increased by training several thousand models on the DOE leadership computers and applying statistical methods to the outputs.
  • Combining molecular simulation and artificial intelligence on leadership-scale supercomputers is resulting in promising new insights into future COVID-19 therapeutics.
  • The CANDLE collaborative research team applied the latest deep learning techniques for information extraction from COVID-19 and cancer-related literature. Several hundred thousand scientific reports and clinical records can be quickly and accurately scanned for relationships that shed new light on the underlying basis of diseases and provide insights toward new therapeutics.
  • Implemented functions for uncertainty quantification, outlier detection, estimation of low-confidence predictions, and empirical calibration, including support for data partitioning, data subsampling, dropout, and regularization in training and dropout in inferencing. These functions generate confidence intervals over predictions and produce empirically calibrated uncertainty, allowing further analysis of the performance of these models.
  • Provided a prototype capability for integrated binding free energy computations with adaptive sampling. Initial developments of both the DeepDriveMD and ESMACS workflows have been (i) generalized to support different deep learning (DL) frameworks, (ii) extended to support different DL models, and (iii) optimized to achieve resource utilization upwards of 95%.
  • Demonstrated the development of deep learning models as surrogates to the more expensive physics-based docking calculations and showed the impact of using 3-D drug descriptors in the feature set. These models were subsequently used as part of the National Virtual Biotechnology Laboratory (NVBL) efforts to search a much larger space of compounds and provided input into the selection of compounds for whole cell viral inhibition assays.
  • Presented a semi-supervised BERT-based approach for information extraction from COVID-19-related literature. With this new approach, datasets such as the COVID-19 Open Research Dataset (CORD-19) from the White House and institutional partners can be analyzed in an automated, efficient manner.
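
As an illustration of the "dropout in inferencing" idea from the uncertainty quantification highlight above, the sketch below uses Monte Carlo dropout, a common technique in which dropout stays active at prediction time and the spread over repeated stochastic passes is read as an uncertainty estimate. It is not the CANDLE implementation: the toy linear model, its weights, and the helpers `dropout_forward` and `mc_dropout_predict` are hypothetical, and the interval is a simple empirical percentile range rather than a calibrated one.

```python
import random
import statistics


def dropout_forward(x, weights, rate, rng):
    """One stochastic pass of a toy linear model with inverted dropout on inputs."""
    keep = 1.0 - rate
    total = 0.0
    for xi, wi in zip(x, weights):
        if rng.random() < keep:
            # Rescale kept inputs so the expected output matches the no-dropout model.
            total += (xi / keep) * wi
    return total


def mc_dropout_predict(x, weights, rate=0.2, passes=500, seed=0):
    """Return (mean, lower, upper) over repeated dropout passes.

    lower/upper are the empirical 2.5th and 97.5th percentiles of the samples.
    """
    rng = random.Random(seed)
    samples = sorted(dropout_forward(x, weights, rate, rng) for _ in range(passes))
    lower = samples[int(0.025 * (passes - 1))]
    upper = samples[int(0.975 * (passes - 1))]
    return statistics.fmean(samples), lower, upper


# Toy inputs and weights (assumptions for the example).
x = [1.0, 2.0, 3.0, 4.0]
weights = [0.5, -0.25, 0.1, 0.3]

# With dropout active, repeated passes disagree and the interval has width.
mean, lower, upper = mc_dropout_predict(x, weights)

# With rate=0.0 every pass is identical, so the interval collapses to a point.
det_mean, det_lower, det_upper = mc_dropout_predict(x, weights, rate=0.0)
```

A wide interval for a given input flags a low-confidence prediction. Turning such raw intervals into empirically calibrated ones requires comparing them against held-out observations, which this sketch does not attempt.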
