review can be done, but also allows for the assembly of large-scale prospective data registries. Both are currently assembled by manual chart review, which is expensive in terms of time and human resources.2 Significant intra- and interrater variability can be introduced because human coding is subject to fatigue, personal interpretations, biased preconceptions, and progressive insight. Furthermore, inconsistent data collection could even propagate into biased study results and interpretations. The use of consensus labels could attenuate this variation in human coding and support the development of NLP algorithms that are fast, deterministic, and reproducible by nature.

This study also provides insight into the feasibility of automated extraction of medical information by investigating the correlation between model performance and the statistical properties of the variables to be extracted. These findings suggest that small sample sizes (i.e., as few as 50–100 observations in the minority group) and relatively unbalanced outcomes (i.e., class imbalance up to a 9:1 ratio) should not be considered limitations or absolute contraindications for NLP modeling. The strong correlation with interrater agreement underlines that a predictive model is only as good as the examples it learns from. Interrater agreement might therefore serve as a useful screening tool for the feasibility of text mining on the variables of interest (see the screening sketch below). Furthermore, it might also reflect the lexical complexity of the NLP task at hand. Clinical assertions on left- or right-sided tumor involvement might, for example, be less subject to interpretation than higher-level, abstract concepts, such as a patient's perception of quality of life.

Limitations

Several limitations of this study should be mentioned. This pipeline was developed on a text corpus of radiology reports from a homogeneous patient cohort at a single institution, which limits its generalizability to other clinical reports, patient cohorts, and institutions. The underlying code pipeline, rather than the resultant models, was therefore made publicly accessible in order to promote the reproducibility and external generalizability of the current work. Preserving a residual hold-out test set would have been the most rigorous method for assessing model performance. However, the relatively small text corpus (n = 562 reports) increases the risk of selecting a non-representative sample for final evaluation. To avoid systematic over- or underestimation of model performance due to a non-representative hold-out test set, model performance was instead evaluated by means of 10-fold cross-validation, which provided a pooled estimate across all validation folds (see the cross-validation sketch below). Although equivalence in the frequency distribution was not significantly associated with model performance in the current feasibility analysis, an association above the examined thresholds (i.e., class imbalance above a 9:1 ratio and fewer than 50 observations in the minority group) cannot
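The screening criteria discussed above (minority-class sample size, class imbalance, and interrater agreement) lend themselves to a simple pre-modeling check. Below is a minimal sketch in Python, assuming scikit-learn is available; the function name screen_feasibility, its arguments, and the thresholds' exact form are illustrative and not part of the published pipeline.

```python
from collections import Counter

from sklearn.metrics import cohen_kappa_score


def screen_feasibility(rater_a, rater_b, min_minority=50, max_imbalance=9.0):
    """Screen one candidate variable before attempting NLP modeling.

    Applies the two thresholds discussed in the text (roughly 50
    observations in the minority class, class imbalance no worse than
    9:1) and reports Cohen's kappa between two human raters as a
    proxy for label quality.
    """
    counts = Counter(rater_a)
    minority = min(counts.values())
    imbalance = max(counts.values()) / minority
    return {
        "minority_count": minority,
        "imbalance_ratio": imbalance,
        "interrater_kappa": cohen_kappa_score(rater_a, rater_b),
        "passes_sample_size": minority >= min_minority,
        "passes_imbalance": imbalance <= max_imbalance,
    }


if __name__ == "__main__":
    # Hypothetical codings of tumor laterality by two raters.
    rater_a = ["left"] * 90 + ["right"] * 60
    rater_b = ["left"] * 88 + ["right"] * 2 + ["right"] * 60
    print(screen_feasibility(rater_a, rater_b))
```

A variable that fails either threshold, or shows low kappa, would warrant caution before investing in annotation and model development.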
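The pooled cross-validated estimate can likewise be sketched in a few lines. The snippet below assumes a generic scikit-learn text pipeline (TF-IDF features with logistic regression) purely as a stand-in for the models used in this study, and the synthetic reports and labels only make the example self-contained. Because cross_val_predict scores every report exactly once while it is held out, a single metric computed on these out-of-fold predictions is pooled across all ten validation folds.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.pipeline import make_pipeline

# Synthetic stand-ins for the radiology reports and one coded variable.
reports = [
    f"tumor involving the {'left' if i % 3 else 'right'} hemisphere"
    for i in range(60)
]
labels = np.array([0 if i % 3 else 1 for i in range(60)])

# Generic text-classification pipeline; the study's actual models may differ.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))

# 10-fold stratified CV: each report receives one out-of-fold probability,
# so the AUC below is a pooled estimate across all validation folds.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
oof_probs = cross_val_predict(
    model, reports, labels, cv=cv, method="predict_proba"
)[:, 1]
print(f"pooled 10-fold AUC: {roc_auc_score(labels, oof_probs):.3f}")
```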
