Joeky Senders

Comparing NLP methods

Preprocessing

The analysis of free-text reports required both generic and approach-specific preprocessing steps, as described in Table 1. Free-text reports were stripped of redundant or duplicate information (e.g., time, date, radiologist's signature, and white space between paragraphs), and stemming was used to teach the algorithms the equivalence of words sharing a lexical root and further reduce the vocabulary. These steps yield the most parsimonious representation of the lexical meaning of a text report.

Additional preprocessing steps for the bag-of-words approach included the n-gram technique and term frequency-inverse document frequency (TF-IDF) vectorization.4,5 Because the bag-of-words approach ignores word order, important word combinations can be missed. N-grams were therefore constructed to join adjacent words and assign these combinations a unique value and meaning. Distinct words such as 'midline' and 'shift' can, for example, be combined into the bigram 'midline_shift'. The use of mono-, bi-, and trigrams was included as a hyperparameter during cross-validation. TF-IDF vectorization converts a text document into an array of numbers reflecting the frequency of words in that document relative to their frequency across all documents.

An embedding layer was created for all sequence-based algorithms. In the embedding layer, a word is represented by a vector of numbers instead of a single number. These numbers represent the coordinates of the word in the word embedding space. The words 'man', 'woman', 'boy', and 'girl' could, for example, be located in the same plane of the word embedding space but separated along dimensions related to gender and age. Word embedding therefore allows for the mapping of lexical relationships between individual words, and thus of the statistical properties of a language.
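The bag-of-words preprocessing described above can be sketched in a few lines. The corpus, the joining of n-grams with underscores, and the exact TF-IDF weighting below are illustrative assumptions, not the authors' implementation:

```python
import math
from collections import Counter

def ngrams(tokens, n_max):
    """Uni- up to n_max-grams, joined with '_' (e.g. 'midline_shift')."""
    return ["_".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def tfidf_vectors(docs, n_max=2):
    """Map each document to {term: tf * idf}, with idf = log(N / df)."""
    tokenized = [ngrams(d.split(), n_max) for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    n_docs = len(docs)
    df = {t: sum(t in doc for doc in tokenized) for t in vocab}
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)
        vectors.append({t: (tf[t] / len(doc)) * math.log(n_docs / df[t])
                        for t in vocab})
    return vectors

reports = ["midline shift present", "no midline shift", "left frontal contusion"]
vecs = tfidf_vectors(reports)
# The bigram 'midline_shift' becomes its own feature, so the combination is
# not lost when word order is discarded; a term occurring in every document
# gets idf = log(1) = 0 and is weighted down to zero.
```

In practice a library vectorizer (e.g., scikit-learn's TfidfVectorizer, which accepts an ngram_range argument) would typically replace this hand-rolled version.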
The embedding layer was trained on the training set in a supervised fashion, using a single perceptron as the output node.6
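A toy sketch of such a supervised embedding layer with a single logistic perceptron as output node follows; the corpus, dimensions, pooling by averaging, and training details are illustrative assumptions rather than the study's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: reports as fixed-length sequences of word indices, with a
# hypothetical binary label (e.g., finding present vs. absent).
X = np.array([[1, 2, 3], [4, 2, 3], [5, 6, 7], [8, 6, 7]])
y = np.array([1.0, 1.0, 0.0, 0.0])
vocab_size, embed_dim = 10, 4

# Embedding matrix: one coordinate vector per word index.
E = rng.normal(0.0, 0.1, (vocab_size, embed_dim))
# Single perceptron (sigmoid output) on the averaged word vectors.
w = rng.normal(0.0, 0.1, embed_dim)
b = 0.0
lr = 0.5

for _ in range(500):
    for xi, yi in zip(X, y):
        h = E[xi].mean(axis=0)                  # pooled word embeddings
        p = 1.0 / (1.0 + np.exp(-(h @ w + b)))  # perceptron output
        g = p - yi                              # d(log-loss)/d(logit)
        grad_w, grad_E = g * h, g * w / len(xi)
        w -= lr * grad_w
        b -= lr * g
        E[xi] -= lr * grad_E                    # the word vectors learn too

preds = [1.0 / (1.0 + np.exp(-(E[xi].mean(axis=0) @ w + b))) for xi in X]
```

In a deep-learning framework this corresponds to an embedding layer followed by a pooling step and a single sigmoid output unit, trained end to end so that the word coordinates are shaped by the classification task itself.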
