Joeky Senders

included in this study. Glioblastoma constitutes the most prevalent type of primary malignant brain tumor.5 Patients with glioblastoma generally undergo thorough radiological workup, which is used for diagnostic purposes as well as neurosurgical planning. The free-text brain MRI reports therefore contain a variety of radiological entities, which makes them well suited for developing an NLP pipeline for clinical text mining.

Patients were identified through a departmental database that registers all neurosurgical patients undergoing surgery at our institution. All unique, complete reports of the preoperative brain magnetic resonance imaging (MRI) studies were retrieved by cross-linking the patient identification number with the radiology reports in our centralized institutional data registry. Reports were excluded if the patient underwent any form of oncological treatment (i.e., surgical resection, chemotherapy, or radiotherapy) prior to the date of the MRI study or if the reports described lesions suspicious for a diagnosis other than a malignant brain tumor.

Ground truth labels

Ground truth labels of the radiological characteristics of interest were provided manually by clinical reviewers. The total text corpus was divided into two blocks that were each labeled by three independent raters (I.S., J.A., J.M., K.A., L.C., P.C.) for assertions of specific radiological characteristics in a binary fashion (i.e., reported to be present or not). Because each report was labeled by three independent raters, the ground truth for each characteristic was defined as the label agreed on by at least two of the three raters.
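The consensus rule described above amounts to a majority vote over three binary ratings. The following is a minimal sketch of that rule, not the study's actual code; the function names, the report identifier, and the example ratings are hypothetical and serve only as illustration.

```python
def consensus_label(votes):
    """Majority vote over three binary ratings (1 = present, 0 = absent).

    Assumes exactly three raters per report, as described in the text.
    """
    if len(votes) != 3 or any(v not in (0, 1) for v in votes):
        raise ValueError("expected three binary ratings")
    # With three raters, agreement between two or more is a simple majority.
    return int(sum(votes) >= 2)


def consensus_labels(ratings):
    """Apply the majority vote to every characteristic of every report.

    `ratings` maps report id -> characteristic -> tuple of three ratings.
    """
    return {
        report_id: {name: consensus_label(votes) for name, votes in chars.items()}
        for report_id, chars in ratings.items()
    }


# Hypothetical ratings for one report (illustrative values only).
example_ratings = {
    "report_001": {
        "necrosis": (1, 1, 0),
        "hemorrhage": (0, 0, 1),
        "edema": (1, 1, 1),
    }
}
ground_truth = consensus_labels(example_ratings)
```

With an odd number of raters and binary labels, ties cannot occur, so no separate adjudication step is needed in this sketch.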
The radiological characteristics of interest included laterality (left-sided involvement, right-sided involvement, multifocality), location (involvement of the frontal lobe, temporal lobe, parietal lobe, occipital lobe, and corpus callosum), tumor aspect (necrosis, cystic, ring enhancement, heterogeneous enhancement), and the presence of other radiological characteristics (hemorrhage, edema, mass effect).

Preprocessing

Several preprocessing steps were required to convert the brain MRI reports to a numeric format that can be processed by an NLP algorithm. Furthermore, these steps allow for the most parsimonious representation of the lexical content, thereby reducing the feature space and thus the likelihood of overfitting to the training data. Redundant and duplicate text (e.g., date, time, the physician’s signature, stop words, etc.) was removed, and a Porter stemming algorithm was used to reduce words with a similar lexical root to a common stem.6 For example, ‘necrosis’ and ‘necrotic’ can both be converted to ‘necro’. After splitting the stemmed reports into individual words (i.e., tokenization), n-grams were constructed to assign unique value and meaning to adjacent combinations of words.7 For example, the adjacent words ‘ring’ and ‘enhancement’ can be combined