Joeky Senders

identified through a departmental database that registers all neurosurgical patients who undergo an operation within our department. To retrieve the associated free-text pathology reports through which the diagnosis was made, we cross-linked the patient identification number and date of surgery with the pathology reports in our centralized institutional clinical data registry. These free-text pathology reports were used as input data for the natural language processing model. Manual annotations of the histopathological diagnosis were provided by a clinical reviewer with over 20 years of experience (R.W.). These annotations were used as binary labels for the target outcome (diagnosis of interest versus other diagnoses). Some patients underwent multiple operations, thereby contributing multiple pathology reports and associated diagnosis labels to the analysis.

The total text corpus was split at the patient level into a training, validation, and hold-out test set according to a 2:1:1 ratio. The test set was kept separate until the final performance evaluation. Differences in baseline characteristics between the training, validation, and hold-out test sets were compared by means of the Chi-square test, analysis of variance (ANOVA), or Kruskal-Wallis test, depending on the nature and distribution of the baseline characteristics.

Preprocessing

The algorithms used for this classification task can be divided into two broad categories: regression-based and neural network-based algorithms.16,17 The regression-based algorithms used a bag-of-words/n-grams approach, thereby considering the relative frequency of words or adjacent word combinations in a document while ignoring their order.17 The deep learning-based algorithms, on the other hand, also modeled the order of the words and the semantic relationships among them.16,18 The analysis of free-text pathology reports required both generic and approach-specific preprocessing steps.
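The patient-level 2:1:1 split described above could be sketched as follows; the function and variable names are illustrative assumptions, not the authors' actual code. Splitting on patient identifiers rather than on individual reports keeps all reports from one patient in a single partition, preventing leakage between the training, validation, and test sets.

```python
import random

def patient_level_split(reports, train=2, val=1, test=1, seed=42):
    """Split (patient_id, report_text, label) tuples at the patient level.

    All reports from the same patient land in the same partition,
    in proportions given by the train:val:test ratio (here 2:1:1).
    """
    patient_ids = sorted({pid for pid, _, _ in reports})
    rng = random.Random(seed)
    rng.shuffle(patient_ids)
    total = train + val + test
    n_train = len(patient_ids) * train // total
    n_val = len(patient_ids) * val // total
    train_ids = set(patient_ids[:n_train])
    val_ids = set(patient_ids[n_train:n_train + n_val])
    splits = {"train": [], "val": [], "test": []}
    for pid, text, label in reports:
        if pid in train_ids:
            splits["train"].append((pid, text, label))
        elif pid in val_ids:
            splits["val"].append((pid, text, label))
        else:
            splits["test"].append((pid, text, label))
    return splits
```

Because patients with multiple operations contribute multiple reports, a report-level split would let reports from the same patient appear in both the training and test sets; the patient-level split avoids this.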
Pseudocode in a generalizable format is provided in Table 1. These preprocessing steps were required to compress the lexical content of the free-text pathology reports into the most parsimonious representation and to convert the reports into a numeric format that could be processed by a classification algorithm. Redundant or duplicate text (time, date, pathologist's signature, unnecessary white space, etc.) was removed, and stemming was used to merge words with a similar lexical root.19 For the regression-based algorithms, we used n-grams to assign unique value and meaning to adjacent word combinations.20 Term frequency-inverse document frequency (TF-IDF) vectorization was used to convert each document into an array of numbers reflecting the relative frequency of these words or word combinations.21 For the deep learning models, we tokenized and zero-padded the documents to convert them into a numeric format of uniform length.22
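The two approach-specific encodings described above can be illustrated with a minimal pure-Python sketch. In practice these steps are typically delegated to libraries such as scikit-learn (TF-IDF with n-grams) and Keras (tokenization and padding); the function names below are illustrative assumptions, not the authors' implementation.

```python
import math
import re
from collections import Counter

def ngrams(tokens, n_max=2):
    """Unigrams plus adjacent word combinations up to length n_max."""
    out = list(tokens)
    for n in range(2, n_max + 1):
        out += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return out

def tfidf_vectorize(docs, n_max=2):
    """Convert each document into a TF-IDF weighted array over the n-gram vocabulary."""
    tokenized = [ngrams(re.findall(r"[a-z]+", d.lower()), n_max) for d in docs]
    vocab = sorted({g for doc in tokenized for g in doc})
    # IDF down-weights n-grams that occur in many documents
    idf = {g: math.log(len(docs) / sum(g in doc for doc in tokenized)) for g in vocab}
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([counts[g] / len(doc) * idf[g] for g in vocab])
    return vocab, vectors

def tokenize_and_pad(docs, max_len):
    """Integer-encode documents and zero-pad them to a uniform length,
    as done for the deep learning models (index 0 is reserved for padding)."""
    words = sorted({w for d in docs for w in d.lower().split()})
    index = {w: i + 1 for i, w in enumerate(words)}
    seqs = [[index[w] for w in d.lower().split()][:max_len] for d in docs]
    return [s + [0] * (max_len - len(s)) for s in seqs]
```

Note that a term appearing in every document receives an IDF of zero and therefore carries no weight in the TF-IDF representation, which is exactly the intended down-weighting of uninformative boilerplate vocabulary.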
