The US Department of Energy (DOE) has entered into a partnership with the National Cancer Institute (NCI) of the National Institutes of Health (NIH). This partnership has identified three key science challenges on which the combined resources of DOE and NCI can accelerate progress. The first challenge, called the drug response problem, is to develop predictive models for drug response that can be used to optimize preclinical drug screening and drive precision medicine-based treatments for cancer patients. The second challenge, called the RAS pathway problem, is to understand the molecular basis of key protein interactions in the RAS/RAF pathway, which is implicated in 30% of cancers. The Ras-Raf-MEK-ERK pathway is a ubiquitously expressed signaling module that regulates the proliferation, differentiation, and survival of cells.

The third challenge, called the treatment strategy problem, is to automate the analysis and extraction of information from millions of cancer patient records to determine optimal cancer treatment strategies across a range of patient lifestyles, environmental exposures, cancer types, and health care systems. Although these challenges operate at different scales and each has its own scientific team collaborating on data acquisition, data analysis, model formulation, and simulation runs, they share several common threads. The CANDLE (Cancer Distributed Learning Environment) project focuses on the machine learning aspect of the challenges and, in particular, builds on a single scalable deep neural network (DNN) code called CANDLE.

Project Details

The CANDLE challenge problem is to solve large-scale machine learning problems for three cancer-related pilot applications: the drug response problem, the RAS pathway problem, and the treatment strategy problem. For the drug response problem, unsupervised machine learning methods are used to capture the complex, nonlinear relationships between the properties of drugs and the properties of tumors to predict treatment response, with the goal of developing a model that can provide treatment recommendations for a given tumor. For the RAS pathway problem, multiscale molecular dynamics (MD) runs are guided through a large-scale state-space search by using unsupervised learning to determine the scope and scale of the next series of simulations based on the history of previous simulations. For the treatment strategy problem, semi-supervised machine learning is used to automatically read and encode millions of clinical reports into a form that can be computed upon. Each problem requires a different approach to the embedded learning problem, but all approaches are supported by the same scalable deep learning code in CANDLE.

The CANDLE software suite broadly consists of two distinct, interoperating levels: the DNN codes and the Supervisor portion, which handles work distribution across a distributed network. At the DNN level, the CANDLE utility library provides a series of utility functions that streamline the process of writing CANDLE-compliant code. These functions allow network hyperparameters to be set either from a default model file or from the command line, which in turn makes it possible to design experiments that efficiently sweep across a range of network hyperparameters. The Supervisor framework provides a set of modules that enable various hyperparameter optimization (HPO) schemes and automatically distribute the workload across available computing resources. Together, these capabilities allow users to efficiently perform HPO on the large compute resources available across the DOE complex, as well as on any local compute resources.
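
The pattern described above (defaults read from a model file, optional command-line overrides, then a sweep over hyperparameter values) can be sketched in plain Python. This is a minimal illustration using only the standard library, not the CANDLE utility library itself; the file layout, the `[model]` section name, the hyperparameter names, and the helper functions `load_defaults`, `resolve_hyperparameters`, and `grid_sweep` are all assumptions made for the example.

```python
import argparse
import configparser
import itertools

# Hypothetical default model file contents. The section name and keys are
# illustrative assumptions, not CANDLE's actual file format.
DEFAULT_MODEL_TEXT = """
[model]
learning_rate = 0.001
batch_size = 32
epochs = 10
"""


def load_defaults(text):
    """Parse default hyperparameters from an INI-style string."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["model"]
    return {
        "learning_rate": section.getfloat("learning_rate"),
        "batch_size": section.getint("batch_size"),
        "epochs": section.getint("epochs"),
    }


def resolve_hyperparameters(defaults, argv):
    """Let command-line flags override values from the default file."""
    cli = argparse.ArgumentParser()
    cli.add_argument("--learning_rate", type=float)
    cli.add_argument("--batch_size", type=int)
    cli.add_argument("--epochs", type=int)
    args = vars(cli.parse_args(argv))
    # Flags that were not given stay None and do not override anything.
    overrides = {k: v for k, v in args.items() if v is not None}
    return {**defaults, **overrides}


def grid_sweep(base, grid):
    """Yield one hyperparameter dict per point of a Cartesian grid."""
    names = sorted(grid)
    for values in itertools.product(*(grid[n] for n in names)):
        yield {**base, **dict(zip(names, values))}


defaults = load_defaults(DEFAULT_MODEL_TEXT)
params = resolve_hyperparameters(defaults, ["--batch_size", "64"])
sweep = list(
    grid_sweep(params, {"learning_rate": [1e-4, 1e-3, 1e-2], "epochs": [5, 10]})
)
```

In this sketch each dict in `sweep` is one candidate configuration. In a real workflow, a scheduler such as the Supervisor described above would hand each configuration to a separate training run.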

The exascale challenge manifests in the need to train large numbers of models. Each pilot application inherently requires high-resolution models that cover the space of specific predictions (i.e., individualized in the precision medicine sense), such as a model that is specific to a certain drug and an individual cancer.

Starting with 1,000 different cancer cell lines and 1,000 different drugs, a leave-one-out strategy that creates a high-resolution model for each drug-cancer pair requires approximately 1 million models (1,000 × 1,000). These models are similar enough that a transfer learning strategy can be used, in which weights are shared during training in a way that avoids information leakage; this significantly reduces the time needed to train a large set of models.
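
The model count can be checked with a short sketch. It assumes one reading of the leave-one-out strategy, in which each model holds out a single drug and cell line pair and trains on all the others; the text does not spell out the exact split, and the helper names `pair_models` and `leave_one_out_split` are hypothetical.

```python
from itertools import product


def pair_models(drugs, cell_lines):
    """Enumerate one model per (drug, cell line) pair."""
    return list(product(drugs, cell_lines))


def leave_one_out_split(target, all_pairs):
    """Hold out the target pair and train on every other pair.

    Assumed interpretation of leave-one-out for this illustration only.
    """
    train = [pair for pair in all_pairs if pair != target]
    return train, [target]


# Small demonstration: 3 drugs and 4 cell lines give 12 models.
drugs = ["drug_a", "drug_b", "drug_c"]
cell_lines = ["line_1", "line_2", "line_3", "line_4"]
pairs = pair_models(drugs, cell_lines)
train, held_out = leave_one_out_split(pairs[0], pairs)

# At the scale quoted in the text the count is the same product.
n_models_full_scale = 1000 * 1000
```

Because every split differs from its neighbors by a single pair, the resulting models overlap heavily in training data, which is why sharing weights between them can save so much training time.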

Principal Investigator(s):

Rick Stevens, Argonne National Laboratory


Collaborators:

Argonne National Laboratory; Oak Ridge National Laboratory; Lawrence Livermore National Laboratory; Los Alamos National Laboratory; Frederick National Laboratory for Cancer Research; National Cancer Institute (NCI)

Progress to date

CANDLE, a partnership between DOE and NCI, is developing highly efficient DNNs optimized for the unique architectures provided by next-generation exascale platforms to address three significant science challenge problems in cancer research.

In early 2020, CANDLE temporarily shifted its focus from cancer to COVID-19. This push was instrumental in accelerating DOE’s response through what soon became the Office of Science’s National Virtual Biotechnology Laboratory.

The following technical highlights represent recent achievements and accomplishments by the collaborative research teams to apply artificial intelligence to problems in cancer, COVID-19, and beyond.

  • The ability to identify and understand low-confidence predictions of DNNs was significantly increased by training several thousand models on the DOE leadership computers and applying statistical methods to the outputs.
  • Combining molecular simulation and artificial intelligence on leadership-scale supercomputers (including the DOE supercomputers Summit and Theta, as well as the National Science Foundation Frontera supercomputer) is yielding promising new insights into future COVID-19 therapeutics.
  • The CANDLE collaborative research team is applying the latest deep learning techniques for information extraction from COVID-19 and cancer-related literature. Several hundred thousand scientific reports and clinical records can be quickly and accurately scanned for relationships that shed new light on the underlying basis of diseases and provide insights toward new therapeutics.
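
As a rough illustration of the first highlight, one simple way to surface low-confidence predictions from many trained models is to measure how much the models disagree on each sample. The source does not name the statistical methods actually used, so the spread-based criterion, the threshold, the toy numbers, and the function names below are all assumptions.

```python
from statistics import mean, pstdev


def ensemble_summary(predictions_per_model):
    """Return (mean, spread) per sample across an ensemble of models.

    predictions_per_model holds one list of predictions per model,
    with all lists ordered by the same samples.
    """
    per_sample = zip(*predictions_per_model)
    return [(mean(values), pstdev(values)) for values in per_sample]


def flag_low_confidence(summary, max_spread):
    """Indices of samples whose spread across models exceeds max_spread."""
    return [i for i, (_, spread) in enumerate(summary) if spread > max_spread]


# Toy ensemble: three models, three samples (made-up numbers).
ensemble = [
    [0.90, 0.20, 0.55],
    [0.88, 0.25, 0.10],
    [0.92, 0.22, 0.95],
]
summary = ensemble_summary(ensemble)
low_confidence = flag_low_confidence(summary, max_spread=0.1)
```

Here the third sample is flagged because the three models disagree strongly on it, while the first two samples receive consistent predictions.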
