Milea Timbergen

Performance was evaluated using the Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) curve, balanced classification accuracy (BCA), sensitivity, specificity, negative predictive value (NPV), and positive predictive value (PPV). For the multiclass models, we reported the multiclass AUC 25 and overall BCA 26. The positive class was DTF in the differential diagnosis, and presence of the mutation in the mutation analysis. The 95% confidence intervals were constructed using the corrected resampled t-test, thereby taking into account that the samples in the cross-validation splits are not statistically independent 27. Both the mean and the confidence intervals are reported. ROC confidence bands were constructed using fixed-width bands 28.

To assess the predictive value of the various features, models were trained based on: 1) volume; 2) age and sex; 3) T1w-MRI imaging; 4) T1w-MRI imaging, age and sex. Model 1 was created to verify that the imaging models were not solely based on volume. Model 2 was created to evaluate potential age and sex biases. In model 4, the imaging and clinical characteristics are combined by using the imaging features together with age and sex, for a total of 413 features. This allows WORC to optimize how the imaging and clinical characteristics are combined. Additionally, a model was made for each combination of T1w-MRI and one of the other included MRI sequences (e.g., based on T1w-MRI and T2w-MRI) to evaluate the added value of these other sequences. When a sequence was missing for a patient, feature imputation was used to estimate the missing values. The code for the feature extraction, model creation and evaluation has been published open-source 29.

Model insight

To explore the predictive value of individual features, the Mann-Whitney U univariate statistical test was used.
P-values were corrected for multiple testing using the Bonferroni correction, and were considered statistically significant at a p-value < 0.05. Feature robustness to variations in the segmentations was assessed on the subset of 30 DTF segmented by two observers using the intra-class correlation coefficient (ICC), where an ICC > 0.75 indicated good reliability 30. To evaluate model reliability, a separate model was trained using only the features with good reliability. To gain insight into the models, the patients were ranked based on the consistency of the model predictions. Typical examples for each class consisted of the patients that were correctly classified in all cross-validation iterations; atypical examples consisted of the patients that were incorrectly classified in all cross-validation iterations.
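
The univariate feature analysis described above can be illustrated with a short sketch. It is not the published code: it assumes a two-sided Mann-Whitney U test computed with the normal approximation (tie-corrected, no continuity correction), and the function names, feature names and feature values are hypothetical and chosen only for illustration.

```python
import math
from statistics import NormalDist


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test via the normal approximation.

    Returns (U statistic of sample x, p-value). Ties receive average ranks
    and the variance is tie-corrected; no continuity correction is applied.
    """
    n1, n2 = len(x), len(y)
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    tie_term = 0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        # Positions i..j-1 hold equal values and share ranks i+1..j.
        avg_rank = (i + 1 + j) / 2
        for k in range(i, j):
            ranks[k] = avg_rank
        t = j - i
        tie_term += t ** 3 - t
        i = j
    rank_sum_x = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    n = n1 + n2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:  # all values identical: no evidence of a difference
        return u_x, 1.0
    z = (u_x - mean_u) / math.sqrt(var_u)
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return u_x, min(p, 1.0)


def bonferroni(p_values):
    """Bonferroni correction: multiply each p-value by the number of tests."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical feature values per class (illustration only, not study data).
features = {
    "feature_a": ([1.0, 2.0, 3.0, 4.0, 5.0], [6.0, 7.0, 8.0, 9.0, 10.0]),
    "feature_b": ([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]),
}
raw_p = [mann_whitney_u(pos, neg)[1] for pos, neg in features.values()]
adjusted_p = bonferroni(raw_p)
# A feature is called significant when its corrected p-value is below 0.05.
significant = [name for name, p in zip(features, adjusted_p) if p < 0.05]
```

With many features, the Bonferroni factor equals the number of tested features, so only features with very small uncorrected p-values remain below the 0.05 threshold after correction.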
