Joeky Senders

Chapter 7

TABLE 1. Generic and algorithm-specific preprocessing steps.

Generic preprocessing

Raw text report
Explanation: The unprocessed raw text report.
Example: "…Exam is somewhat limited secondary to motion artifact. There is a 3.5 x 3.1 x 3.1 cm (TV by AP by CC) heterogeneously, predominantly peripherally enhancing mass centered within the right frontal lobe (series 13 image 87, series 14 image 9), which corresponds to the mass lesion identified on the recent CT 1/22/2010…"

Cleaning
Explanation: Removal of redundant information (e.g., date, time, radiologist's signature, white space between sections, punctuation between letters, and stop words) and conversion to lower case.
Example: "…exam somewhat limited secondary motion artifact 3.5 x 3.1 x 3.1 cm tv ap cc heterogeneously predominantly peripherally enhancing mass centered within right frontal lobe series 13 image 87 series 14 image 9 corresponds mass lesion identified recent ct…"

Stemming
Explanation: Words sharing a lexical root are reduced to the same stem; for example, 'heterogeneously' and 'heterogeneity' are both reduced to 'heterogen'.
Example: "…exam somewhat limit secondari motion artifact 3.5 x 3.1 x 3.1 cm tv ap cc heterogen predominantli peripher enhanc mass center within right frontal lobe seri 13 imag 87 seri 14 imag 9 correspond mass lesion identifi recent ct…"

Preprocessing for bag-of-words models*

N-gram construction
Explanation: Adjacent individual word tokens were combined into mono-, bi-, and/or trigrams. In the example, the stemmed report is converted to mono- and bigrams.
Example: "…exam exam_somewhat somewhat somewhat_limit limit limit_secondari secondari secondari_motion motion motion_artifact artifact artifact_3.5 3.5 3.5_x x x_3.1 3.1 3.1_x x x_3.1 3.1 3.1_cm cm cm_tv tv tv_ap ap ap_cc cc cc_heterogen heterogen…"

TF-IDF word vectorization
Explanation: The relative frequency of each word token in each document was calculated. Each document is represented by a vector in which each number corresponds to the relative frequency of a particular gram in that document.
Example: [0.08497, 0.06189, 0.06895, 0.06642, 0.05214, 0.05105, 0.08855, 0.11227, 0.15729, 0.06813, 0.06677, 0.05419, 0.05193, 0.06535, 0.06875, 0.07164, 0.13677, 0.08250, 0.06798, 0.09174, …]

Preprocessing for sequence-based models**

Embedding layer
Explanation: An 8-dimensional embedding layer was trained and added as the first layer of each model. Each word in the document is represented by an 8-dimensional vector.
Example: [[0.12, 0.28, 0.14, 0.48, 0.98, 0.77, 0.21, 0.87], [0.79, 0.66, 0.49, 0.49, 0.56, 0.39, 0.32, 0.51], [0.54, 0.33, 0.84, 0.72, 0.34, 0.47, 0.12, 0.42], …]

Abbreviations: 1D = one-dimensional; GRU = gated recurrent unit; LASSO = least absolute shrinkage and selection operator; LSTM = long short-term memory; TF-IDF = term frequency-inverse document frequency.
*Logistic regression, LASSO regression, and multi-layer perceptron.
**1D convolutional neural networks, LSTM, and GRU.
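The cleaning step in the table can be sketched in a few lines of Python. The stop-word list and the regular expressions below are illustrative assumptions, not the thesis's exact implementation; a full pipeline would use a complete English stop-word list and handle dates and signatures as well.

```python
import re

# Illustrative stop-word subset only; the original pipeline presumably
# used a full English list (assumption).
STOP_WORDS = {"is", "a", "an", "the", "to", "there", "in", "on", "of", "and", "which"}

def clean(report: str) -> str:
    """Lower-case a report, strip punctuation (keeping decimal points in
    measurements such as 3.5), and drop stop words."""
    text = report.lower()
    text = re.sub(r"[(),:;]", " ", text)   # punctuation between letters
    text = re.sub(r"\.(?!\d)", "", text)   # periods, except inside numbers
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)
```

For example, `clean("There is a 3.5 x 3.1 cm mass in the right frontal lobe.")` yields `"3.5 x 3.1 cm mass right frontal lobe"`, preserving the measurement while removing stop words and punctuation, as in the cleaned example above.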
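The n-gram construction and TF-IDF weighting for the bag-of-words models can be sketched as follows. The interleaved gram ordering mirrors the table's example; the specific TF-IDF weighting (tf x log(N/df)) is one common formulation and is an assumption here, since the exact variant is not stated.

```python
from collections import Counter
import math

def ngrams(tokens, n_max=2):
    """Interleave mono- up to n_max-grams, joining words with '_',
    mirroring the ordering in the table's n-gram example."""
    grams = []
    for i in range(len(tokens)):
        for n in range(1, n_max + 1):
            if i + n <= len(tokens):
                grams.append("_".join(tokens[i:i + n]))
    return grams

def tfidf(docs):
    """One common TF-IDF formulation, tf * log(N / df); the exact
    weighting variant used in the study is an assumption here."""
    n_docs = len(docs)
    df = Counter()                 # document frequency of each gram
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)      # term frequency within this document
        vectors.append({g: (c / len(doc)) * math.log(n_docs / df[g])
                        for g, c in counts.items()})
    return vectors
```

Note that under this weighting a gram occurring in every document receives weight zero, which is why published vectors like the one in the table depend on the corpus as a whole, not on a single report.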
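The embedding lookup for the sequence-based models amounts to replacing each word token with its learned 8-dimensional vector. A minimal stand-in is sketched below; in the actual models the vectors are trained jointly with the network, so the random initialization here is purely illustrative.

```python
import random

def build_embedding(vocab, dim=8, seed=0):
    """Toy stand-in for a trained embedding layer: map each vocabulary
    word to a dim-dimensional vector. Random values are illustrative;
    real embeddings are learned during model training."""
    rng = random.Random(seed)
    return {word: [rng.random() for _ in range(dim)] for word in vocab}

def embed(tokens, table):
    """Replace each word token by its vector, turning a report into the
    sequence-of-vectors input that the sequence models consume."""
    return [table[t] for t in tokens]
```

Applying `embed` to a cleaned, tokenized report produces a list of 8-dimensional vectors of the same shape as the embedding example in the table.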
