Novel method combining machine learning and data partitioning benefits cancer records data extraction

A team of cancer researchers and computer scientists has applied machine learning (ML) ensemble techniques to reduce training time, mitigate task complexity, and improve the accuracy and classification performance of information extraction from cancer pathology reports. Their work improves automated methods for harvesting information useful for cancer surveillance from today’s vast amounts of clinical data of varying quality. Their findings were published in the September 2020 issue of the Journal of Biomedical Informatics.

Cancer surveillance allows for assessments of progress against cancer around the world and guides cancer control policies and interventions. Population-based cancer registries depend on hospitals, doctors’ offices, and other institutions to provide case-level data for reporting on cancer characteristics such as tumor type, stage at diagnosis, and type of surgery received. Unstructured cancer pathology reports (i.e., narratives) present a range of challenges, from inconsistent terminology to ungrammatical, fragmented notations that contain abbreviations and typographical errors. These problems necessitate heavy manual effort and slow data gathering at scale.

Leveraging high-performance computing (HPC) resources at the Oak Ridge Leadership Computing Facility, the team applied two novel ML and deep learning approaches, bootstrap aggregation and data partitioning, to information extraction by designing and training classifiers to read clinical reports, extract features, and understand their contents. Bootstrap aggregation, or “bagging,” is an ML technique that builds an ensemble of models, each trained on resampled cases, to achieve stability and boost task performance. Combining data partitioning with bootstrapping addresses the training time and hardware resource (i.e., accelerator) challenges associated with large datasets. The team’s work showed that using HPC decreases the time needed to build bagging classifiers and that the partitioned bagging model is well suited to HPC and supercomputers. Future work includes better ensembling of partitioned models, optimal data partitioning for higher scalability and better task performance, and evaluating the portability of the team’s method to exascale supercomputers.
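
The combination of data partitioning and bagging described above can be illustrated with a minimal Python sketch. It is not the team's implementation. It assumes that the training cases are split into disjoint partitions, that one model is trained on a bootstrap resample of each partition, and that predictions are combined by majority vote. A toy nearest-centroid classifier on a single numeric feature stands in for the deep learning classifiers, and all names (partition, bootstrap, train_centroid, partitioned_bagging, predict, reports) are hypothetical.

```python
import random
from collections import Counter

def partition(cases, k):
    """Split cases into k disjoint, near-equal partitions by striding.
    Assumes the input order is already shuffled or interleaved."""
    return [cases[i::k] for i in range(k)]

def bootstrap(cases, rng):
    """Resample len(cases) cases with replacement (the 'bagging' step)."""
    return [rng.choice(cases) for _ in cases]

def train_centroid(cases):
    """Toy stand-in for a real classifier: mean feature value per label."""
    sums, counts = {}, Counter()
    for x, label in cases:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] += 1
    return {label: sums[label] / counts[label] for label in sums}

def predict_one(model, x):
    """Return the label whose centroid is closest to x."""
    return min(model, key=lambda label: abs(model[label] - x))

def partitioned_bagging(cases, k, seed=0):
    """Train one model per partition, each on a bootstrap resample.
    Each model sees only 1/k of the data, so the k training jobs are
    smaller and independent (they could run on separate accelerators)."""
    rng = random.Random(seed)
    return [train_centroid(bootstrap(part, rng)) for part in partition(cases, k)]

def predict(models, x):
    """Aggregate the ensemble by majority vote."""
    votes = Counter(predict_one(m, x) for m in models)
    return votes.most_common(1)[0][0]

# Synthetic, interleaved toy data: label "a" has features 0..4,
# label "b" has features 10..14.
reports = [
    (float(i % 5) + (10.0 if i % 2 else 0.0), "b" if i % 2 else "a")
    for i in range(100)
]
models = partitioned_bagging(reports, k=5)
print(predict(models, 1.0), predict(models, 12.0))
```

Because each model in this sketch trains on only its own partition, the individual training jobs shrink as the number of partitions grows. That is the intuition behind the reduced training time and accelerator demands noted above, at the cost of each model seeing less data.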

Yoon, H.J., H.B. Klasky, J.P. Gounley, M. Alawad, S. Gao, E.B. Durbin, X.-C. Wu, A. Stroup, J. Doherty, L. Coyle, L. Penberthy, J.B. Christian, G.D. Tourassi. “Accelerated Training of Bootstrap Aggregation-based Deep Information Extraction Systems from Cancer Pathology Reports.” Journal of Biomedical Informatics 110 (September 2020): 103564.