Joeky Senders

138 Chapter 7 LASSO regression demonstrated superior overall performance among the bag- of-words models and 1D-convolutional neural networks among the sequence- based models. Although their preprocessing and analytical approaches differ, both algorithms provide strong methods for regularization to avoid overfitting. 31, 32 LASSO regression encourages simple models by penalizing the use of many coefficients, and convolutional layers extract higher-level features by applying filters on local regions of the input. Regularization is a key concept in machine learning and appears to be vital for both bag-of-words and sequence-based approaches in the current NLP task as well. 29 Although sequence-based approaches harnessed with recurrent and convolutional neural network architectures demonstrated higher overall performance than most bag-of-word approaches, their resultant models lacked the interpretability of regression-based algorithms, demanded longer training and prediction times, and required more careful hyperparameter tuning. When constructing an NLP model, the choice of algorithm should be guided by the nature of the NLP task. If the NLP model should be fast, interpretable, and effective on a range of problems without tedious hyperparameter tuning, a bag-of-words approach based on a LASSO regression algorithm can be ideal. 31 If the order of the words is important, as with follow-up notes over time or higher-level relationships across distinct paragraphs, sequence-based approaches might be preferential. 33,34 Similarly, the metric of performance should align the overarching goal as close as possible. For example, sensitivity can be the metric of choice when comprehensiveness is the goal, and false-positives are more acceptable. On the other hand, specificity might be preferred when predicted cases should not be diluted with non-cases, and when false- negatives are more acceptable. Future research should externally validate the current findings, thereby exploring and comparing the utility of bag-of-words and sequence-based NLP modeling in various patient populations, clinical reports, and outcome measures. In the current study, supervised learning methods were evaluated to investigate the utility of NLP for data extraction of unambiguous outcomes; however, future studies can also focus on extracting higher-lever concepts, such as the patient’s survival probability or perception of quality of life. Although it remains questionable to what extent NLP can extract this information from clinical reports, it has the potential to pick up undetected patterns related to these outcomes. Furthermore, exploring the use of unsupervised learning in the absence of a prespecified outcome of interest might help in identifying natural, yet unknown clusters within the data. Lastly, future studies should consider the implications of automated medical text analysis parallel to the development of these techniques. NLP has the potential to increase the scale and velocity at which data