Developing a Codebase for Deep Learning on Supercomputers to Fight Cancer

By Scott Gibson

The Exascale Deep Learning–Enabled Precision Medicine for Cancer (CANDLE) project is directed at developing pre-clinical response models, predicting mechanisms of RAS/RAF–driven cancers, and developing treatment strategies. CANDLE is a partnership of Argonne (ANL), Lawrence Livermore, Los Alamos, and Oak Ridge (ORNL) National Laboratories, with Rick Stevens of ANL as principal investigator (PI).

Gina Tourassi of Oak Ridge National Laboratory

Gina Tourassi, who was recently named director of the National Center for Computational Sciences (NCCS), a division of the Computing and Computational Sciences Directorate at ORNL, is PI of the ORNL effort within CANDLE. The NCCS is home to the Summit supercomputer and the Oak Ridge Leadership Computing Facility (OLCF), a Department of Energy (DOE) Office of Science User Facility. She joined the Let’s Talk Exascale podcast for an interview in Denver at SC19: The International Conference for High Performance Computing, Networking, Storage, and Analysis, which took place in November 2019.

Within the CANDLE project, ORNL’s “end goal is to develop a general CANDLE library so that anyone with a deep learning code and a dataset can train their model at scale on a big HPC [high-performance computing] system without significantly modifying their code,” Tourassi said. “We’re also doing this in the context of DOE’s partnership with the National Cancer Institute, so specific deep learning models that we are developing are focused on cancer research and precision medicine challenges. The CANDLE effort at ORNL focuses on deep learning for natural language processing [NLP], specifically information extraction from unstructured cancer pathology data to semi-automate reporting processes within the national cancer surveillance program.”

CANDLE is part of the Application Development (AD) research focus area of DOE’s Exascale Computing Project (ECP). “Most AD projects involve taking existing highly scalable applications and optimizing them for new architectures; specifically, those target architectures are the heterogeneous nodes of OLCF Summit and of the CORAL-2 machines like Frontier where the majority of the compute power is on the GPUs,” Tourassi said.

However, CANDLE is a bit unusual compared with the other AD projects. “Deep learning packages like TensorFlow or PyTorch ran on GPUs almost from day one, so that’s less of an issue for us,” Tourassi said. “But scalability of deep learning codes in a distributed memory environment is still a rapidly developing field. Just in the last 2 years, we’ve seen the Horovod package from Uber revolutionize how communication is performed for deep learning codes. Yet data parallelism isn’t the only aspect of scalability for deep learning: workflows like hyperparameter optimization and neural architecture search can also be significantly accelerated if they are performed in parallel.” CANDLE’s role is to enable those capabilities.
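To make the data-parallel idea concrete, here is a minimal sketch of the pattern that packages such as Horovod support: each worker computes a gradient on its own shard of the data, the gradients are averaged across workers (the allreduce step), and every worker applies the same update. This is an illustration only. It does not use Horovod or CANDLE code, the workers are simulated in a single process, and the names `local_gradient` and `allreduce_mean`, along with the toy regression problem, are assumptions made for the example.

```python
def local_gradient(w, shard):
    """Gradient of mean squared error for the model y = w * x on one shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)


def allreduce_mean(values):
    """Stand-in for an allreduce: every worker would receive this average."""
    return sum(values) / len(values)


# Toy dataset generated from y = 3 * x (hypothetical, for illustration only).
data = [(float(x), 3.0 * x) for x in range(1, 9)]

# Split the data into equal-sized shards, one per simulated worker.
n_workers = 4
shards = [data[i::n_workers] for i in range(n_workers)]

w = 0.0               # model parameter, replicated on every worker
learning_rate = 0.01

for step in range(200):
    # Each worker computes a gradient on its own shard only.
    grads = [local_gradient(w, shard) for shard in shards]
    # Averaging the per-worker gradients is the only communication step.
    w -= learning_rate * allreduce_mean(grads)

print(f"learned w = {w:.6f}")
```

Because the shards are the same size, the averaged gradient equals the gradient over the full dataset, so the simulated workers follow the same trajectory as single-process training while each touches only a quarter of the data per step.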

“CANDLE seeks to provide a consistent API to which application developers can write or modify their deep learning codes, and we hope to provide highly optimized workflows that allow users to scale efficiently on big HPC systems,” Tourassi said.

Through ECP, CANDLE benefits from close interaction with other projects. “First, we have a major collaboration with the ExaLearn [Co-Design Center for Exascale Machine Learning Technologies] project, which is using CANDLE in support of several deep learning problems motivated by domain science,” Tourassi said. “In return, they are providing us feedback about additional functionality and how we can improve usability. Second, we’re also working with other ECP projects that we can incorporate into our software stack, such as CODAR [the Center for Online Data Analysis and Reduction at the Exascale].”

The CANDLE effort at ORNL is directed at extracting information from medical text data, and developing deep learning models for that purpose requires overcoming a number of obstacles. “One challenge is the need to identify the hyperparameters and neural architecture that deliver the best clinical performance—that is, they perform the classification tasks accurately, and they are robust with respect to the high variability of medical text data,” Tourassi said. “Since the parameter spaces can be quite large, this demands scalable methods for something like hyperparameter optimization. We want to be able to train hundreds of models in parallel to obtain convergence to an optimal set of hyperparameters. To address this, we developed a hyperparameter optimization scheme called HyperSpace, which performs distributed Bayesian optimization. To make this run efficiently in parallel, we borrowed a standard practice from modeling and simulation—domain decomposition—and assigned different regions of the hyperparameter space to different nodes. We are currently working on developing similar solutions for performing distributed neural architecture search.”
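The domain-decomposition idea Tourassi describes can be sketched as follows. This is not the HyperSpace implementation: it assumes each hyperparameter range is split into two overlapping halves, it uses plain random search as a stand-in for Bayesian optimization, and it visits the sub-regions in a loop rather than on separate nodes. The function names, the `overlap` parameter, and the toy objective are all hypothetical.

```python
import itertools
import random


def decompose(bounds, overlap=0.25):
    """Split every (low, high) range into two overlapping halves and
    return all combinations, giving 2**D sub-regions for D dimensions."""
    halves = []
    for low, high in bounds:
        mid = (low + high) / 2.0
        pad = overlap * (high - low) / 2.0
        halves.append([(low, mid + pad), (mid - pad, high)])
    return [list(region) for region in itertools.product(*halves)]


def search_region(objective, region, n_trials, seed):
    """Random search inside one sub-region (stand-in for a Bayesian optimizer)."""
    rng = random.Random(seed)
    best_x, best_y = None, float("inf")
    for _ in range(n_trials):
        x = [rng.uniform(lo, hi) for lo, hi in region]
        y = objective(x)
        if y < best_y:
            best_x, best_y = x, y
    return best_x, best_y


def toy_objective(x):
    """Hypothetical validation loss with its minimum at (0.3, -1.0)."""
    return (x[0] - 0.3) ** 2 + (x[1] + 1.0) ** 2


# Two hypothetical hyperparameters and their search ranges.
bounds = [(0.0, 1.0), (-2.0, 2.0)]
regions = decompose(bounds)

# On an HPC system each region would be assigned to a different node;
# here the regions are searched one after another.
results = [
    search_region(toy_objective, region, n_trials=200, seed=i)
    for i, region in enumerate(regions)
]

# Combine the per-region results by keeping the overall best.
best_x, best_y = min(results, key=lambda item: item[1])
print(f"{len(regions)} regions searched, best loss {best_y:.4f} at {best_x}")
```

The searches are independent, so no communication is needed until the final reduction that picks the best result. The number of sub-regions doubles with each added hyperparameter, which is one reason this style of search maps naturally onto a machine with many nodes.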

Another challenge is balancing improvements in clinical performance with increases in computational cost. “An N% improvement in clinical performance sounds great if computational cost increases by 5%, but much less so if the increase is 500%,” Tourassi said. “Consequently, as we improve our deep learning architectures, we continually look to see if we can deliver these advancements with acceptable computational performance. Moreover, we’re focused on developing architectures that make better use of the hardware, such as maintaining high GPU utilization or employing the FP16 arithmetic available on Summit’s V100 GPUs.”

CANDLE must support deep learning for sensitive text processing that maintains model and data privacy. “While de-identification is fairly straightforward with medical images, that’s not the case with clinical text; existing off-the-shelf products do not yet meet the requirements of the national surveillance program,” Tourassi said. “Developing scalable algorithmic solutions for deep learning NLP that preserve model and data privacy during training and inference is an area of interest with broad applications beyond the cancer research space.”

The CANDLE project promises to have a lasting impact in a couple of ways. “Basically, as we are leveraging supercomputing and artificial intelligence to accelerate cancer research, we are also seeing how we can drive the next generation of supercomputing,” Tourassi said. “But at a higher level, I believe this particular partnership exemplifies the importance of bringing two diverse communities together to advance their respective missions: one agency has both unique computing resources coupled with the requisite outstanding brain power, while the other excellent agency has unique domain challenges and close interactions with domain experts. So I would say this is a paradigm shift that needs to be scaled.”