Joeky Senders

Automating clinical chart review

TABLE 1. Pseudocode utilized in the current study in a format that is generalizable to other NLP applications in clinical research.

Phase / Steps

Phase 1: Data import and preprocessing
  A. Import a dataframe with the report ID, original report, and binary labels per outcome of interest.
  B. Randomly shuffle all observations.
  C. In the original report column, sequentially:
     a. remove all redundant information (date, time, physician's signature, white space between sections, and punctuation between letters) and transform all letters to lower case;
     b. remove all English stop words except 'no' and 'not';
     c. apply the Porter stemmer algorithm.
  D. Tokenize all reports.

Phase 2: Hyperparameter tuning
  A. Load the preprocessed reports.
  B. Construct a hyperparameter grid including the following hyperparameters for:
     a. TF-IDF vectorization
        i. maximal number of features
        ii. n-gram range
     b. LASSO regression algorithm
        i. L1 regularization strength
  C. For each grid point (i.e., unique hyperparameter setting), sequentially:
     a. apply the TF-IDF vectorizer to the total text corpus;
     b. perform k-fold cross-validation;
     c. calculate the mean performance and standard deviation across all folds.

Phase 3: Compute final results
  A. For each outcome, extract the optimal hyperparameter settings based on a single or composite performance metric of interest.
  B. Compute the final cross-validated results using the optimal hyperparameter settings.
  C. Compute cross-validated ROC plots using the optimal hyperparameter settings.

Abbreviations: LASSO=Least Absolute Shrinkage and Selection Operator; NLP=natural language processing; ROC=receiver operating characteristic curve; TF-IDF=term frequency–inverse document frequency
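As an illustration of Phase 1 in Table 1, the preprocessing steps could be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not the code used in the study: the function names (`clean`, `preprocess`, `prepare`), the regular expressions for dates and times, and the short stop-word list are all hypothetical. Removal of the physician's signature is omitted because its format is site-specific. Stemming is left as a pluggable function that defaults to the identity; a Porter stemmer such as NLTK's `PorterStemmer().stem` could be passed in instead.

```python
import random
import re

# Small illustrative stop-word list (an assumption; a full English list would be used in practice).
STOP_WORDS = {"the", "a", "an", "of", "is", "was", "and", "in", "on", "at", "to", "with", "no", "not"}
KEEP = {"no", "not"}  # negations are retained (step C.b)


def clean(report):
    """Step C.a: lower-case the text and strip dates, times, and punctuation."""
    text = report.lower()
    text = re.sub(r"\d{1,2}[/-]\d{1,2}[/-]\d{2,4}", " ", text)  # dates such as 01/02/2019
    text = re.sub(r"\d{1,2}:\d{2}(:\d{2})?", " ", text)  # times such as 14:30
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # remaining punctuation
    return re.sub(r"\s+", " ", text).strip()  # collapse white space


def preprocess(report, stem=lambda word: word):
    """Steps C.a to D: clean, drop stop words except negations, stem, and tokenize."""
    tokens = clean(report).split()
    tokens = [t for t in tokens if t not in STOP_WORDS or t in KEEP]
    return [stem(t) for t in tokens]


def prepare(rows, seed=0):
    """Steps A and B: take (report_id, report, label) rows, shuffle, and preprocess."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # seeded so the shuffle is reproducible
    return [(rid, preprocess(text), label) for rid, text, label in rows]
```

For Phases 2 and 3, a library such as scikit-learn offers building blocks that map onto the table (for example `TfidfVectorizer`, `LogisticRegression` with `penalty="l1"`, and `GridSearchCV`); whether the study used these exact components is not stated in the table.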