Feature selection was performed to eliminate features that were not useful to distinguish between the classes, e.g., DTF vs. non-DTF. These included: 1) a variance threshold, in which features with a low variance (<0.01) are removed. This method was always used, as it serves as a feature sanity check with almost zero risk of removing relevant features; 2) optionally, a group-wise search, in which specific groups of features (i.e., intensity, shape, and the subgroups of texture features as defined in Supplemental material 1) are selected or deleted. To this end, each feature group had an on/off variable that is randomly activated or deactivated; these variables were all included as hyperparameters in the optimization; 3) optionally, individual feature selection through univariate testing. To this end, for each feature, a Mann-Whitney U test is performed to test for significant differences in distribution between the labels (e.g., DTF vs. non-DTF). Afterwards, only features with a p-value below a certain threshold are selected. A Mann-Whitney U test was chosen as features may not be normally distributed and the samples (i.e., patients) were independent; and 4) optionally, principal component analysis (PCA), in which either only those linear combinations of features were kept that explained 95% of the variance in the features, or a limited number of components (between 10 and 50) was retained. These feature selection methods may be combined by WORC, but only in the mentioned order.

Oversampling was used to ensure the classes were balanced in the training dataset. The included methods were: 1) random oversampling, which randomly repeats patients of the minority class; and 2) the synthetic minority oversampling technique (SMOTE)10, which creates new synthetic “patients” using a combination of the features in the minority class. Randomly, either one of these methods or no oversampling method was used.

Lastly, machine learning methods were used to determine a decision rule to distinguish the classes. These included: 1) logistic regression; 2) support vector machines; 3) random forests; 4) naive Bayes; and 5) linear and quadratic discriminant analysis. Most of the included methods require specific settings or parameters to be set, which may have a large impact on the performance. As these parameters have to be determined before executing the workflow, they are so-called “hyperparameters”. In WORC, all parameters of all mentioned methods are treated as hyperparameters, since they may all influence the creation of the decision model. WORC simultaneously estimates which combination of algorithms and hyperparameters performs best. A comprehensive overview of all parameters is provided in the WORC documentation7.
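To make these steps concrete, the following is a minimal sketch of a comparable pipeline expressed with scikit-learn, imbalanced-learn, and SciPy; it is not the WORC implementation itself. The MannWhitneySelector class, all thresholds, the assumption of a binary label vector, the search space, and the features/labels placeholders at the end are illustrative choices, and the group-wise selection of feature types is omitted for brevity.

# Illustrative sketch only: a WORC-like workflow built from standard open-source
# components. All names, thresholds, and search ranges below are assumptions.
import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                            QuadraticDiscriminantAnalysis)
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.pipeline import Pipeline  # pipeline variant that allows resampling steps


class MannWhitneySelector(BaseEstimator, TransformerMixin):
    """Keep features whose Mann-Whitney U p-value between the two classes is below alpha."""

    def __init__(self, alpha=0.05):
        self.alpha = alpha

    def fit(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        classes = np.unique(y)  # assumes a binary label, e.g., DTF vs. non-DTF
        pvals = np.array([
            mannwhitneyu(X[y == classes[0], j], X[y == classes[1], j],
                         alternative="two-sided").pvalue
            for j in range(X.shape[1])
        ])
        self.support_ = pvals < self.alpha
        if not self.support_.any():  # keep all features if none differs significantly
            self.support_[:] = True
        return self

    def transform(self, X):
        return np.asarray(X)[:, self.support_]


pipeline = Pipeline([
    ("variance", VarianceThreshold(threshold=0.01)),   # always applied
    ("univariate", MannWhitneySelector()),             # optional univariate selection
    ("pca", PCA(n_components=0.95)),                   # optional: keep 95% explained variance
    ("oversample", SMOTE()),                           # optional: balance the training classes
    ("classifier", LogisticRegression(max_iter=1000)),
])

# Treat the choice of algorithms and their settings as hyperparameters and sample
# them jointly; "passthrough" disables an optional step during the random search.
search_space = {
    "univariate__alpha": [0.01, 0.05, 0.10],
    "pca": [PCA(n_components=0.95), PCA(n_components=10), "passthrough"],
    "oversample": [RandomOverSampler(), SMOTE(), "passthrough"],
    "classifier": [
        LogisticRegression(max_iter=1000),
        SVC(),
        RandomForestClassifier(),
        GaussianNB(),
        LinearDiscriminantAnalysis(),
        QuadraticDiscriminantAnalysis(),
    ],
}
search = RandomizedSearchCV(pipeline, search_space, n_iter=50, cv=5,
                            scoring="roc_auc", random_state=42)
# search.fit(features, labels)  # features: patients x radiomics features; labels: DTF vs. non-DTF

As in the workflow described above, the algorithm choices and their settings are optimized jointly: enabling or disabling an optional step and switching classifiers are sampled within the same random search rather than tuned separately.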
