
TABLE 1. Pseudocode of the current study in a generalizable format.

Step 1: Data importation and general preprocessing
A. Import a data frame with three columns containing the patient identifier, group label, and original clinical report.
B. In the original report column, subsequently
   a. remove all redundant information (date, time, physician's signature, white spaces between sections, and punctuation between letters) and transform all letters to lower case;
   b. remove all English stop words except 'no' and 'not';
   c. apply the Porter stemming algorithm;
   d. apply preprocessing steps.
C. Divide the stemmed reports at the patient level into a training, validation, and test set in a 2:1:1 ratio.

Step 2: Hyperparameter optimization
A. Optimize the hyperparameters by means of weighted bootstrapping on the training and validation set. The hyperparameter settings are further explained in Supplementary Table S1.

Step 3: Evaluate model performance on the hold-out test set
A. Train the final models with the optimal hyperparameter settings on the training set, with 100 bootstraps for each model and each training fraction.
B. Compute the predicted probabilities in the residual hold-out test set.
C. Calculate the pooled mean AUC with standard deviation for each model and training fraction.
D. Plot the performance in AUC against the training sample size to visualize the resultant learning curve for each model.

Abbreviations: AUC = area under the receiver operating characteristic curve

Development of a natural language processing model

To compare the learning curves of the distinct approaches, model performance was evaluated for each diagnosis with training samples ranging between 25 and 3000 reports. A bootstrapping procedure was utilized to optimize the hyperparameter settings. We used bootstrapping with replacement to draw random samples from the parent training set and trained 'naïve' algorithms with every iteration.23 As such, all resultant models were trained solely on the randomly drawn sample while ignoring the rest of the parent training set. This bootstrapping procedure provided an estimate of performance for each hyperparameter setting by pooling the performance estimates of the distinct, sample-based models. To account for the higher variability in performance in the smaller samples and to preserve the computational reproducibility of this study, the number of bootstraps was inversely weighted according to the sample size: it comprised the integer division of the total training set size (n=3000) by the size of the training sample. For example, training with a sample of 25 reports was bootstrapped 120 times (3000 ÷ 25; see the weighted bootstrapping sketch below). Logistic regression, least absolute shrinkage and selection operator (LASSO) regression, and deep learning models were developed and compared as classifiers.24 For the regression-based models, the use of mono-, bi-, and trigrams, the size of the vocabulary
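As an illustration of the preprocessing in Step 1B, the following is a minimal sketch assuming NLTK's English stop-word list and Porter stemmer; the regular expression and the example report are illustrative and not drawn from the study.

```python
# Sketch of Step 1B: lower-casing, stop-word removal (keeping 'no'/'not'),
# and Porter stemming. Requires: pip install nltk; nltk.download("stopwords").
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

STOP_WORDS = set(stopwords.words("english")) - {"no", "not"}  # keep negations
STEMMER = PorterStemmer()

def preprocess(report: str) -> str:
    text = report.lower()                    # transform to lower case
    text = re.sub(r"[^a-z\s]", " ", text)    # drop punctuation and digits
    tokens = text.split()                    # also collapses extra whitespace
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return " ".join(STEMMER.stem(t) for t in tokens)

print(preprocess("No evidence of tumor recurrence."))  # -> "no evid tumor recurr"
```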
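For the patient-level 2:1:1 split in Step 1C, one way to keep all reports of a patient in the same partition is scikit-learn's GroupShuffleSplit; the data frame layout and the `patient_id` column name below are assumptions, not taken from the paper.

```python
# Sketch of Step 1C: a 2:1:1 train/validation/test split grouped by patient,
# so no patient contributes reports to more than one partition.
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

def split_by_patient(df: pd.DataFrame, seed: int = 0):
    # First cut: half of the patients form the training set (the "2").
    gss = GroupShuffleSplit(n_splits=1, train_size=0.5, random_state=seed)
    train_idx, rest_idx = next(gss.split(df, groups=df["patient_id"]))
    rest = df.iloc[rest_idx]
    # Second cut: the remaining patients split evenly into validation and test.
    gss = GroupShuffleSplit(n_splits=1, train_size=0.5, random_state=seed)
    val_idx, test_idx = next(gss.split(rest, groups=rest["patient_id"]))
    return df.iloc[train_idx], rest.iloc[val_idx], rest.iloc[test_idx]
```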
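The inversely weighted bootstrapping of Step 2 can be sketched as follows; the model factory, feature matrices, and validation scoring are placeholders for whichever hyperparameter configuration is being evaluated.

```python
# Sketch of Step 2: the number of bootstraps is the integer division of the
# parent training set size (n=3000) by the sample size, e.g. 3000 // 25 = 120.
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.utils import resample

def weighted_bootstrap_auc(X_train, y_train, X_val, y_val, make_model,
                           sample_size, parent_size=3000, seed=0):
    n_boots = parent_size // sample_size      # inverse weighting by sample size
    rng = np.random.RandomState(seed)
    aucs = []
    for _ in range(n_boots):
        # Draw with replacement and fit a fresh, 'naive' model that never
        # sees the rest of the parent training set.
        Xb, yb = resample(X_train, y_train, replace=True,
                          n_samples=sample_size,
                          random_state=rng.randint(2**31))
        model = make_model().fit(Xb, yb)
        aucs.append(roc_auc_score(y_val, model.predict_proba(X_val)[:, 1]))
    # Pool the sample-based estimates into one performance estimate.
    return float(np.mean(aucs)), float(np.std(aucs))
```

In practice a small draw containing only one class would need to be redrawn or stratified before fitting; that guard is omitted here for brevity.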
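Step 3 can be sketched analogously: refit each final model 100 times per training fraction, score on the hold-out test set, and plot the pooled mean AUC against sample size. The grid of sample sizes and the plotting details below are assumptions for illustration.

```python
# Sketch of Step 3: pooled mean AUC (with SD) on the hold-out test set,
# plotted against the training sample size to form a learning curve.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.utils import resample

def learning_curve(X_train, y_train, X_test, y_test, make_model,
                   sizes=(25, 50, 100, 250, 500, 1000, 2000, 3000),
                   n_boots=100, seed=0):
    rng = np.random.RandomState(seed)
    means, sds = [], []
    for n in sizes:
        aucs = []
        for _ in range(n_boots):
            Xb, yb = resample(X_train, y_train, replace=True, n_samples=n,
                              random_state=rng.randint(2**31))
            model = make_model().fit(Xb, yb)
            aucs.append(roc_auc_score(y_test,
                                      model.predict_proba(X_test)[:, 1]))
        means.append(np.mean(aucs))   # pooled mean AUC for this fraction
        sds.append(np.std(aucs))      # spread across the 100 bootstraps
    plt.errorbar(sizes, means, yerr=sds, capsize=3)
    plt.xlabel("Training sample size")
    plt.ylabel("AUC on hold-out test set")
    plt.show()
```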
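Finally, for the regression-based classifiers, the mono-, bi-, and trigram representation and the vocabulary size can be expressed as vectorizer hyperparameters. The vectorizer choice, the cap of 5000 features, and the regularization strength below are assumptions for illustration, not the study's tuned values.

```python
# Sketch of the n-gram features for the regression-based models:
# ngram_range=(1, 3) covers mono-, bi-, and trigrams; max_features caps
# the vocabulary size. L1-penalized logistic regression acts as a
# LASSO-style classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

lasso_clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 3), max_features=5000),
    LogisticRegression(penalty="l1", solver="liblinear", C=1.0),
)
# lasso_clf.fit(stemmed_reports, labels)   # reports as a list of strings
```

Both ngram_range and max_features would themselves be tuned by the weighted bootstrapping procedure described above.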
