Milea Timbergen

Table 4. Performance of the two radiologists and the radiomics models in differentiating between DTF (n = 20) and non-DTF (n = 20) in the location-matched cohort. Outcomes are presented with the 95% confidence interval.

| Metric | Model 2: Age + Sex | Model 3: T1w | Model 4: T1w + Age + Sex | Rad 1 | Rad 2 |
|---|---|---|---|---|---|
| AUC | 0.93 [0.84, >1] | 0.87 [0.73, >1] | 0.98 [0.92, >1] | 0.80 | 0.88 |
| BCA | 0.85 [0.71, 1.00] | 0.71 [0.56, 0.87] | 0.88 [0.77, 0.99] | 0.75 | 0.90 |
| Sensitivity | 0.79 [0.57, >1] | 0.49 [0.21, 0.77] | 0.78 [0.57, 1.00] | 0.65 | 0.90 |
| Specificity | 0.90 [0.71, >1] | 0.93 [0.78, >1] | 0.98 [0.91, >1] | 0.85 | 0.89 |
| NPV | 0.82 [0.61, >1] | 0.65 [0.43, 0.76] | 0.82 [0.64, >1] | 0.71 | 0.89 |
| PPV | 0.91 [0.72, >1] | 0.81 [0.47, >1] | 0.98 [0.91, >1] | 0.81 | 0.90 |

T1w: T1-weighted; AUC: area under the receiver operator characteristic curve; BCA: balanced classification accuracy; PPV: positive predictive value; NPV: negative predictive value

CTNNB1 mutation status stratification

Table 5 depicts the performance of the radiomics models for the CTNNB1 mutation stratification. Model 4, using T1w-MRI, age, and sex, had a high specificity (S45F: 0.83, T41A: 0.59 and WT: 0.72), but a sensitivity similar to guessing (S45F: 0.15, T41A: 0.49 and WT: 0.56). This indicates a strong bias in the models towards the negative classes, i.e. not-S45F, not-T41A and not-WT. As model 4 did not perform well, models 1, 2, and 3 were omitted from the results, as these contain a subset of these features. Adding the T2w or T1w post-contrast imaging, i.e. models 5 and 6, did not improve the performance. Hence, the models using either only non-FatSat or FatSat scans were omitted, as these contain subsets of the scans from models 5 and 6.

Model insight

As the CTNNB1 mutation status stratification models did not perform well, the model insight analysis was only conducted for the differential diagnosis. The p-values from the Mann-Whitney U test between the DTF and non-DTF patients of all features are shown in Supplemental Table 3.
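As an illustration of the per-feature comparison described above, the following is a minimal sketch of a two-sided Mann-Whitney U test for one feature. It is not the implementation used in this study: the function names (`average_ranks`, `mann_whitney_u`) and the toy feature values are hypothetical, and the p-value uses the normal approximation with tie correction and no continuity correction, which is an assumption that may differ from the software actually used.

```python
import math
from collections import Counter


def average_ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected).

    Returns (U statistic of the first group, p-value).
    """
    n1, n2 = len(x), len(y)
    combined = list(x) + list(y)
    ranks = average_ranks(combined)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    n = n1 + n2
    tie_term = sum(t ** 3 - t for t in Counter(combined).values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:  # all observations identical: no evidence of a shift
        return u1, 1.0
    z = (u1 - n1 * n2 / 2) / math.sqrt(variance)
    return u1, math.erfc(abs(z) / math.sqrt(2))


# Hypothetical values of a single feature in each group (not study data).
dtf_values = [0.61, 0.72, 0.58, 0.80, 0.69]
non_dtf_values = [0.41, 0.55, 0.39, 0.47, 0.52]
u_stat, p_value = mann_whitney_u(dtf_values, non_dtf_values)
print(f"U = {u_stat:.1f}, p = {p_value:.4f}")
```

Applied to every feature in turn, a routine of this kind yields one p-value per feature. For small groups an exact test may be preferable to the normal approximation.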
In the feature importance analysis, 76 T1w-MRI features had significant p-values (5.4 × 10⁻⁸ to 4.8 × 10⁻²). These included two intensity features (entropy and peak), two shape features (radial distance and volume), and 72 texture features. The p-value of age (1 × 10⁻¹¹) was lower than that of all imaging features. The ICC values of all T1w-MRI features are shown in Supplemental Table 4. Of the 411 features, 270 (66%) had an ICC > 0.75 and thus good reliability. Using only these features with good reliability in model 3 did not alter the performance.
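As a reading aid for the threshold-based metrics reported in Table 4 (sensitivity, specificity, BCA, PPV, and NPV), the sketch below derives them from a 2×2 confusion matrix. It assumes the standard binary definition of BCA as the mean of sensitivity and specificity. The function name `binary_metrics` is hypothetical, and the counts for Rad 1 are inferred from the reported sensitivity (0.65) and specificity (0.85) with 20 DTF and 20 non-DTF patients rather than taken from the source.

```python
def binary_metrics(tp, fn, tn, fp):
    """Compute threshold-based metrics from confusion-matrix counts.

    tp/fn: DTF patients classified as DTF / non-DTF.
    tn/fp: non-DTF patients classified as non-DTF / DTF.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        # Assumed definition: balanced classification accuracy is the
        # mean of sensitivity and specificity.
        "bca": (sensitivity + specificity) / 2,
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


# Inferred counts for Rad 1 (an assumption, see the text above):
# 13 of 20 DTF and 17 of 20 non-DTF patients classified correctly.
rad1 = binary_metrics(tp=13, fn=7, tn=17, fp=3)
for name, value in rad1.items():
    print(f"{name}: {value:.2f}")
```

With these inferred counts, the rounded values agree with the Rad 1 column of Table 4 (sensitivity 0.65, specificity 0.85, BCA 0.75, PPV 0.81, NPV 0.71).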