Joeky Senders

Chapter 7

Abstract

Introduction
Although the volume of patient-generated health data is increasing exponentially, its use is impeded because most of it comes in an unstructured format, namely free-text clinical reports. A variety of natural language processing (NLP) methods, ranging from statistical to deep learning-based models, have emerged to automate the processing of free text; however, the optimal approach for medical text analysis remains to be determined. The aim of this study was to provide a head-to-head comparison of novel NLP techniques and to inform future studies about their utility for automated medical text analysis.

Methods
Magnetic resonance imaging reports of patients with brain metastases treated in two tertiary centers were retrieved and manually annotated with a binary label (single metastasis versus two or more metastases). Multiple bag-of-words and sequence-based NLP models were developed and compared after randomly splitting the annotated reports into a training set and a test set in an 80:20 ratio.

Results
A total of 1479 radiology reports of patients diagnosed with brain metastases were retrieved. The LASSO regression model demonstrated the best overall performance on the hold-out test set, with an area under the receiver operating characteristic curve of 0.92 (95% CI 0.89–0.94), an accuracy of 83% (95% CI 80–87%), a calibration intercept of -0.06 (95% CI -0.14 to 0.01), and a calibration slope of 1.06 (95% CI 0.95–1.17).

Conclusion
Among the NLP techniques compared, the bag-of-words approach combined with a LASSO regression model demonstrated the best overall performance in extracting binary outcomes from free-text clinical reports. This study provides a framework for the development of machine learning-based NLP models, as well as a clinical vignette in patients diagnosed with brain metastasis.
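As a rough illustration of the best-performing approach named above (bag-of-words features fed to a LASSO, i.e. L1-penalized, logistic regression after an 80:20 split), the following minimal sketch may help. It is not the study's implementation: the report snippets are invented toy text, and the tokenizer, the penalty strength `lam`, the learning rate, and the proximal gradient fitting routine are all assumptions chosen to keep the example dependency-free. In practice a library implementation with proper preprocessing and hyperparameter tuning would likely be used.

```python
import math
import random

# Invented toy snippets for illustration only (NOT data from the study).
# Label 0 = single metastasis, label 1 = two or more metastases.
TOY_REPORTS = [
    ("solitary enhancing lesion in the left frontal lobe", 0),
    ("single metastasis in the right cerebellum", 0),
    ("one enhancing lesion without additional lesions", 0),
    ("solitary metastasis with surrounding edema", 0),
    ("multiple enhancing lesions in both hemispheres", 1),
    ("numerous metastases throughout the brain", 1),
    ("several lesions in the cerebellum and frontal lobe", 1),
    ("multiple metastases with surrounding edema", 1),
] * 5  # repeated so the 80:20 split leaves a small test set


def split_80_20(samples, seed=0):
    """Shuffle and split into 80% training and 20% test samples."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = (4 * len(shuffled)) // 5
    return shuffled[:cut], shuffled[cut:]


def build_vocab(texts):
    """Map each distinct training token to a column index."""
    vocab = {}
    for text in texts:
        for token in text.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def vectorize(text, vocab):
    """Binary bag-of-words vector; tokens unseen in training are ignored."""
    vec = [0.0] * len(vocab)
    for token in text.lower().split():
        if token in vocab:
            vec[vocab[token]] = 1.0
    return vec


def soft_threshold(value, amount):
    """Shrink value toward zero by amount (proximal step of the L1 penalty)."""
    if value > amount:
        return value - amount
    if value < -amount:
        return value + amount
    return 0.0


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_lasso_logistic(X, y, lam=0.01, lr=0.1, epochs=500):
    """L1-penalized logistic regression fitted by proximal gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        # Prediction errors are computed once per epoch, before any update.
        errors = [sigmoid(b + sum(wj * xj for wj, xj in zip(w, row))) - yi
                  for row, yi in zip(X, y)]
        for j in range(d):
            grad_j = sum(e * row[j] for e, row in zip(errors, X)) / n
            w[j] = soft_threshold(w[j] - lr * grad_j, lr * lam)
        b -= lr * sum(errors) / n  # the intercept is not penalized
    return w, b


def predict_proba(text, vocab, w, b):
    x = vectorize(text, vocab)
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))


train, test = split_80_20(TOY_REPORTS)
vocab = build_vocab(text for text, _ in train)  # vocabulary from training only
X_train = [vectorize(text, vocab) for text, _ in train]
y_train = [label for _, label in train]
weights, bias = fit_lasso_logistic(X_train, y_train)

correct = sum((predict_proba(text, vocab, weights, bias) >= 0.5) == bool(label)
              for text, label in test)
accuracy = correct / len(test)
kept = [tok for tok, j in vocab.items() if weights[j] != 0.0]
print(f"test accuracy on toy data: {accuracy:.2f}")
print(f"tokens with non-zero weight: {len(kept)} of {len(vocab)}")
```

The L1 penalty drives the weights of uninformative tokens to exactly zero, which is one reason a LASSO model pairs naturally with high-dimensional bag-of-words features. Because the toy snippets are duplicated, the test set here overlaps with the training text, so the printed accuracy says nothing about real-world performance.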