Joeky Senders

113 Automating clinical chart review into the 2-gram ‘ring_enhancement’. Lastly, all text documents were converted to a numeric format by means of the term frequency-inverse document frequency (TF-IDF) vectorizer. 8 This vectorizer converts each text document into a 1-dimensional array of numbers, each of which represent the relative frequency of specific n-grams in a document compared to their prevalence in the total text corpus. Hyperparameter tuning The preprocessed radiology reports were used as input for the NLP algorithm and the consensus-based ground truth labels as the associated outcomes. A least absolute shrinkage and selection operator (LASSO) regression algorithm was used as final classifier due to its speed and regularizing capacity. 9 Because the preprocessing and classification centers around the relative frequency of word or word combinations rather than the order of these words, this NLP approach is considered a bag-of-words approach. In a recent comparative study, we demonstrated that a bag-of-words approach harnessed with a LASSO-regression algorithm outperforms other competing statistical, classical machine learning, and deep learning approaches in classifying free- text radiology reports. 10 Therefore, this approach was utilized in the current study as well. The use of mono-, bi-, tri-, and tetra-grams, size of the vocabulary in the TF-IDF vectorizer, and l2-regularization were presented to the algorithm as hyperparameters, which were optimized by means of 10-fold cross-validation. Model evaluation The NLP model can compute predicted probabilities (number between 0 and 1) representing the probability of an observation belonging to a certain class or directly compute the predicted class in a binary fashion (0 or 1). Predicted probabilities can be used to measure model performance according to the area under the receiver operating characteristic curve (AUC). 11 The AUC is a measure of discrimination and equal to the probability that the classifier will rank a randomly chosen positive example higher than a randomly chosen negative example (i.e., the predicted probabilities are higher for reports with the clinical assertions of interest versus reports without). The predicted class can be used to calculate classification accuracy; the percentage of reports classified correctly when the output of the model is binary. Final model performance was calculated as the pooled mean performance and standard deviation of performance across all validation folds.