
Training and evaluation

The total data set was divided into a training and a hold-out test set in an 80:20 ratio. Five-fold cross-validation was performed on the training set to optimize the hyperparameter settings. The final models were evaluated on the hold-out test set, which had not been used for preprocessing or hyperparameter tuning in any form. The output of the NLP models can be a predicted probability (between 0 and 1) or a binary prediction (yes or no). Depending on the type of output, classification performance was captured in several parameters, including the area under the receiver operating characteristic curve (AUC), accuracy, and calibration.7 The AUC is a measure of discrimination and represents the probability that an algorithm will rank cases higher than non-cases when two observations are chosen at random. Accuracy represents the percentage of reports classified correctly when the output of the model is binary. Logistic regression served as a benchmark for comparison with all other algorithms. The agreement between the predicted probabilities and the observed prevalence was assessed visually in a calibration plot and numerically by the calibration intercept and slope. A calibration intercept of 0 and a slope of 1 is considered perfect calibration.

The NLP models were developed and evaluated in Python version 3.6 (Python Software Foundation, http://www.python.org) using the Keras and Scikit-learn libraries.8,9 The difference in AUC was evaluated by means of the DeLong test and the difference in accuracy by means of the chi-square test in R version 3.3.3 (R Core Team, Vienna, Austria, https://cran.r-project.org). The Benjamini-Hochberg procedure was used to correct for multiple testing.

To promote the transparency and reproducibility of our work, we have deployed the source code in a publicly accessible GitHub repository (https://github.com/jtsenders/nlp_brain_metastasis). Additionally, pseudocode is provided in Supplementary Table S1, which can be used to guide similar work in other clinical applications. The datasets generated and analyzed in the current study are available from the corresponding author on request.

Results

A total of 1479 reports of patients treated in one of the two Partners hospitals were extracted by the RPDR query and eligible for inclusion in the current study. The annotated reports were divided into a training and hold-out test set of 1179 (79.7%) and 300 (20.3%) patients, respectively. The mean discordance rate between individual reviewers was 36.2%. The AUCs on the hold-out test set of all six algorithms ranged between 0.87 and 0.93 (Figure 1), and the overall accuracies ranged between 64% and 87% (Table 2). By AUC,
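
To make the evaluation procedure described in the Methods above concrete, the following is a minimal sketch in Python with Scikit-learn of the 80:20 split, five-fold cross-validation for hyperparameter tuning, and hold-out evaluation by AUC and accuracy, using the logistic regression benchmark. The toy reports, labels, and TF-IDF bag-of-words representation are illustrative assumptions, not the exact preprocessing used in the study.

```python
# Sketch of the evaluation pipeline: 80:20 train/hold-out split, five-fold
# cross-validation for hyperparameter tuning, and hold-out evaluation of
# discrimination (AUC) and accuracy.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Toy placeholder data; in the study these are free-text radiology reports
# annotated for the presence or absence of brain metastases.
reports = (["enhancing lesion suspicious for metastatic disease"] * 10
           + ["no evidence of intracranial metastasis"] * 10)
labels = [1] * 10 + [0] * 10

# 80:20 split into a training and a hold-out test set.
X_train, X_test, y_train, y_test = train_test_split(
    reports, labels, test_size=0.20, random_state=42, stratify=labels
)

# Logistic regression on a TF-IDF bag-of-words representation serves as the
# benchmark classifier (the text representation here is an assumption).
model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Five-fold cross-validation on the training set to tune hyperparameters.
search = GridSearchCV(
    model,
    param_grid={"clf__C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
    scoring="roc_auc",
)
search.fit(X_train, y_train)

# Evaluate on the hold-out test set, untouched during tuning.
probabilities = search.predict_proba(X_test)[:, 1]
predictions = search.predict(X_test)
print("AUC:", roc_auc_score(y_test, probabilities))
print("Accuracy:", accuracy_score(y_test, predictions))
```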
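
The calibration intercept and slope can be estimated with a logistic recalibration model, regressing the observed outcome on the log-odds of the predicted probability. The sketch below uses statsmodels and fixes the slope at 1 when estimating the intercept; this is a common convention, not necessarily the authors' documented implementation, and the simulated data are purely illustrative.

```python
# Sketch of estimating the calibration intercept and slope from predicted
# probabilities via logistic recalibration.
import numpy as np
import statsmodels.api as sm

def calibration_intercept_slope(y_true, y_prob, eps=1e-6):
    """Return (intercept, slope) of the logistic recalibration model."""
    p = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    logit_p = np.log(p / (1 - p))

    # Calibration slope: logistic regression of the observed outcome on the
    # log-odds of the predicted probability.
    slope_fit = sm.GLM(y_true, sm.add_constant(logit_p),
                       family=sm.families.Binomial()).fit()
    slope = slope_fit.params[1]

    # Calibration intercept: the same regression with the log-odds entered as
    # an offset, i.e. with the slope fixed at 1.
    intercept_fit = sm.GLM(y_true, np.ones_like(logit_p),
                           family=sm.families.Binomial(),
                           offset=logit_p).fit()
    intercept = intercept_fit.params[0]
    return intercept, slope

# Perfectly calibrated predictions should yield an intercept near 0 and a
# slope near 1, matching the definition of perfect calibration given above.
rng = np.random.default_rng(0)
y_prob = rng.uniform(0.05, 0.95, size=5000)
y_true = rng.binomial(1, y_prob)
print(calibration_intercept_slope(y_true, y_prob))
```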
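
The paper performed the pairwise significance tests and the Benjamini-Hochberg correction in R; an equivalent correction in Python could look like the following, where the p-values are hypothetical placeholders rather than results from the study.

```python
# Sketch of a Benjamini-Hochberg correction for multiple pairwise comparisons.
from statsmodels.stats.multitest import multipletests

p_values = [0.003, 0.020, 0.045, 0.310, 0.760]  # hypothetical placeholders
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
print(p_adjusted)  # Benjamini-Hochberg adjusted p-values
print(reject)      # which comparisons remain significant at the 5% level
```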

RkJQdWJsaXNoZXIy ODAyMDc0