CANDLE

The US Department of Energy (DOE) has entered into a partnership with the National Cancer Institute (NCI) of the National Institutes of Health (NIH). This partnership has identified three key science challenges that the combined resources of DOE and NCI can accelerate. The first challenge, called the drug response problem, is to develop predictive models for drug response that can be used to optimize preclinical drug screening and drive precision medicine-based treatments for cancer patients. The second challenge, called the RAS pathway problem, is to understand the molecular basis of key protein interactions in the RAS/RAF pathway that is present in 30% of cancers. The Ras-Raf-MEK-ERK pathway is a ubiquitously expressed signaling module that regulates the proliferation, differentiation and survival of cells.

The third challenge, called the treatment strategy problem, is to automate the analysis and extraction of information from millions of cancer patient records to determine optimal cancer treatment strategies across a range of patient lifestyles, environmental exposures, cancer types, and health care systems. Although these challenges are at different scales and have specific scientific teams collaborating on the data acquisition, data analysis, model formulation, and scientific runs of simulations, they also share several common threads. The CANDLE (Cancer Distributed Learning Environment) project focuses on the machine learning aspect of the challenges and in particular builds on a single scalable deep neural network (DNN) code called CANDLE.

Summary

Cancer is the second most common cause of death in the United States. According to the NIH, roughly two million Americans were diagnosed with cancer in 2023, and more than 600,000 of those people died from the illness. In addition to cancer’s staggering incidence and mortality rate, it imposes severe economic burdens on affected families, totaling more than $170 billion in direct medical costs alone in 2020. High performance computers and advanced software solutions such as artificial intelligence and machine learning are among the most critical new tools being employed in the fight against cancer. These new technologies can greatly accelerate the pace of research in drug design, treatment optimization and analysis, and fundamental cancer biology, all with the goal of improving patient prognoses and quality of life.

The Cancer Distributed Learning Environment (CANDLE) application is a software platform sponsored by the Exascale Computing Project (ECP) and built to combine and apply high performance computing and machine learning to key areas of cancer research and care. The application uses a deep neural network, a type of machine learning algorithm capable of recognizing patterns and classifying information with minimal human supervision, to rapidly complete large and complex data analysis and modeling tasks when accelerated with the power of modern supercomputers.

Using CANDLE, researchers can quickly create and train large numbers of computational models for three key tasks: predicting drug interactions with various cancer types to optimize treatment and drive precision medicine tailored to individual patients; understanding the molecular dynamics in the RAS/RAF pathway, a ubiquitous network of cellular interaction involved in 30% of cancers; and automating the analysis of millions of patient treatment records to find patterns which are not recognizable to human analysts across factors such as patient lifestyles, cancer types, and environmental exposures.

Machine learning cannot be applied to cancer research and clinical intervention without tools to quickly train effective computational models for highly specialized research and clinical tasks. Creating and training the millions of unique machine learning models required to adequately address sweeping research questions, such as interactions between thousands of different drugs and cancer types, requires an enormous amount of computational power, as a single model may take several hours to train with traditional methods. This computational demand requires improvements in hardware capabilities and innovation in machine learning training approaches.

The CANDLE application team has leveraged the power of exascale computation to dramatically improve researchers’ ability to create machine learning models for cancer research. Using the CANDLE tool, researchers can create models at unprecedented speeds (almost 500 times more quickly than on previous petascale machines) and with greatly improved versatility. The application includes features that allow researchers to view models’ confidence in the validity of their predictions and to screen predictions for minimum acceptable accuracy. Furthermore, CANDLE’s versatility in training machine learning models allows the application to be used in research on other illnesses, such as COVID-19.

Exascale-accelerated machine learning is positioned to transform cancer research and care in the US. With support from the CANDLE application, researchers will accelerate development of new cancer treatments with improved clinical outcomes, optimize healthcare providers’ decision-making by providing large-scale analysis across nationwide databases, and support basic research by elucidating the fundamental networks of cellular interaction that cause cancers.

Technical Discussion

The CANDLE challenge problem is to solve large-scale machine learning problems for three cancer-related pilot applications: the drug response problem, the RAS pathway problem, and the treatment strategy problem. For the drug response problem, unsupervised machine learning methods are used to capture the complex, nonlinear relationships between the properties of drugs and the properties of tumors to predict treatment response, with the goal of developing a model that can provide treatment recommendations for a given tumor. For the RAS pathway problem, multiscale molecular dynamics (MD) runs are guided through a large-scale state-space search by using unsupervised learning to determine the scope and scale of the next series of simulations based on the history of previous simulations. For the treatment strategy problem, semi-supervised machine learning is used to automatically read and encode millions of clinical reports into a form that can be computed upon. Each problem requires a different approach to the embedded learning problem, but all approaches are supported with the same scalable deep learning code in CANDLE.

The CANDLE software suite broadly consists of two distinct, interoperating levels: the DNN codes and the Supervisor portion, which handles work distribution across a distributed network. At the DNN level, the CANDLE utility library provides a series of utility functions that streamline the process of writing CANDLE-compliant code. This enables the essential functionality for network hyperparameters to be set either from a default model file or from the command line. This in turn enables experiments to be designed that efficiently sweep across a range of network hyperparameters. The Supervisor framework provides a set of modules to enable various hyperparameter optimization (HPO) schemes and to automatically distribute the workload across available computing resources. Together, these capabilities allow users to efficiently perform HPO on the large compute resources available across the DOE complex, as well as on any local compute resources.
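The layering of default-file and command-line hyperparameters, and the sweep it enables, can be sketched in plain Python. This is a minimal illustration that uses only the standard library; it is not the CANDLE utility library's API. The file contents, the section name `Global_Params`, the parameter names (`learning_rate`, `batch_size`, `epochs`), and both helper functions are hypothetical.

```python
import argparse
import configparser
import itertools

# Hypothetical default model file contents (illustrative, not a real CANDLE file).
DEFAULT_MODEL_TEXT = """
[Global_Params]
learning_rate = 0.001
batch_size = 32
epochs = 10
"""

# Type converter for each hypothetical hyperparameter.
PARAM_TYPES = {"learning_rate": float, "batch_size": int, "epochs": int}


def load_hyperparameters(argv, default_text=DEFAULT_MODEL_TEXT):
    """Read defaults from the model file text, then apply command-line overrides."""
    config = configparser.ConfigParser()
    config.read_string(default_text)
    params = {name: PARAM_TYPES[name](value)
              for name, value in config["Global_Params"].items()}

    parser = argparse.ArgumentParser()
    for name, cast in PARAM_TYPES.items():
        parser.add_argument(f"--{name}", type=cast, default=None)
    args = parser.parse_args(argv)

    # Only flags the user actually supplied replace the file defaults.
    for name, value in vars(args).items():
        if value is not None:
            params[name] = value
    return params


def grid_sweep(base_params, search_space):
    """Yield one full hyperparameter dict per point in a Cartesian grid."""
    names = sorted(search_space)
    for values in itertools.product(*(search_space[n] for n in names)):
        trial = dict(base_params)
        trial.update(zip(names, values))
        yield trial


if __name__ == "__main__":
    base = load_hyperparameters(["--batch_size", "64"])
    space = {"learning_rate": [1e-4, 1e-3], "epochs": [10, 20]}
    for trial in grid_sweep(base, space):
        print(trial)
```

In this sketch each trial is only printed. In a real workflow, a distribution layer such as the Supervisor framework would dispatch each trial to available compute resources, and a smarter HPO scheme could replace the exhaustive grid.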

The challenge for exascale manifests in the need to train large numbers of models. Each pilot application inherently requires high-resolution models that cover the space of specific predictions (i.e., individualized in the precision medicine sense), such as a model that is specific to a certain drug and an individual cancer.

Starting with 1,000 different cancer cell lines and 1,000 different drugs, a leave-one-out strategy to create a high-resolution model for each pairing of a drug with a cancer cell line requires approximately 1 million models (1,000 × 1,000). These models are similar enough that we can use a transfer learning strategy, where weights are shared during training in a way that avoids information leakage, which significantly reduces the time needed to train a large set of models.
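The model count follows from simple arithmetic: one model per pairing of a drug with a cell line gives 1,000 × 1,000 = 1,000,000. A short Python sketch makes the scale concrete. The per-model training time and the speedup attributed to weight sharing are placeholder assumptions chosen for illustration, not measured CANDLE figures, and the function names are hypothetical.

```python
def count_pair_models(n_cell_lines, n_drugs):
    """One high-resolution model per (drug, cell line) pair."""
    return n_cell_lines * n_drugs


def total_training_hours(n_models, hours_per_model, transfer_speedup=1.0):
    """Aggregate training time; transfer_speedup > 1 models savings from shared weights."""
    if transfer_speedup <= 0:
        raise ValueError("transfer_speedup must be positive")
    return n_models * hours_per_model / transfer_speedup


if __name__ == "__main__":
    n_models = count_pair_models(1_000, 1_000)
    print(f"models to train: {n_models:,}")
    # Assumed values for illustration only: 3 hours per model, 10x savings.
    print(f"from scratch: {total_training_hours(n_models, 3.0):,.0f} hours")
    print(f"with transfer: {total_training_hours(n_models, 3.0, 10.0):,.0f} hours")
```

Even under optimistic assumptions, the aggregate cost stays in the hundreds of thousands of training hours, which is why the workload is spread across many nodes rather than run serially.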
