Joeky Senders

Chapter 6

TABLE 2. Descriptive table presenting the prevalence of all radiographic characteristics in the total data set (n = 562 brain MRI reports), as well as the associated interrater agreement for the manually provided labels.

Domain                Subdomain                    n     %      κ*
Laterality            left-sided involvement       302   53.7   0.868
                      right-sided involvement      281   50.0   0.874
                      multifocality                174   31.0   0.297
Location              frontal lobe                 235   41.8   0.847
                      temporal lobe                250   44.5   0.831
                      parietal lobe                175   31.1   0.813
                      occipital lobe               73    13.0   0.821
                      corpus callosum              59    10.5   0.574
Tumor aspect          necrosis                     165   29.4   0.734
                      cystic                       85    15.1   0.625
                      ring enhancement             122   21.7   0.379
                      heterogeneous enhancement    232   41.3   0.225
Other characteristic  hemorrhage                   151   26.9   0.620
                      edema                        236   42.0   0.610
                      mass effect                  288   51.2   0.493

Abbreviations: κ = Fleiss' kappa statistic.
* The interrater agreement for the consensus labels was calculated by means of the Fleiss' kappa statistic. The strength of the interrater agreement can be categorized according to this score as less than chance (<0), slight (0.01-0.20), fair (0.21-0.40), moderate (0.41-0.60), substantial (0.61-0.80), and near perfect (0.81-1).

The overall interrater agreement was substantial (κ = 0.670) and ranged from fair agreement (κ = 0.225 for heterogeneous enhancement) to near perfect agreement (κ = 0.874 for right-sided tumor involvement). The cross-validated AUC ranged from 0.816 for multifocality to 0.984 for left-sided tumor involvement (Table 3, Figure 1), and the binary classification accuracy ranged from 78.6% for multifocality to 96.6% for tumor involvement of the occipital lobe. The feasibility analysis revealed that the frequency distribution of the variables of interest was not correlated with model performance (rho = 0.179, p = 0.52) (Figure 2a).
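The Fleiss' kappa values reported in Table 2 are computed from the per-report label counts of the multiple raters. A minimal sketch of that calculation is given below; the example counts are made up for illustration and are not the study's data.

```python
# Minimal sketch of Fleiss' kappa for multi-rater categorical labels.
# Input: one row per report, one column per category, each cell holding the
# number of raters who assigned that category; rows sum to the rater count.

def fleiss_kappa(ratings):
    n_subjects = len(ratings)
    n_raters = sum(ratings[0])
    n_categories = len(ratings[0])

    # Proportion of all assignments that fell in each category
    p = [sum(row[j] for row in ratings) / (n_subjects * n_raters)
         for j in range(n_categories)]

    # Observed agreement for each report
    P_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in ratings]

    P_bar = sum(P_i) / n_subjects   # mean observed agreement
    P_e = sum(pj * pj for pj in p)  # agreement expected by chance
    return (P_bar - P_e) / (1 - P_e)

# Hypothetical example: 3 raters labeling 5 reports as absent/present
example = [[3, 0], [0, 3], [2, 1], [3, 0], [1, 2]]
print(round(fleiss_kappa(example), 3))  # prints 0.444
```

By the categorization used in Table 2, this hypothetical value of 0.444 would indicate moderate agreement.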
Excellent model performance (i.e., AUCs > 0.95) could even be achieved for variables with small sample sizes (i.e., as low as 50-100 observations in the minority group) and relatively unbalanced outcomes (i.e., class imbalance up to a 9:1 ratio). In contrast, model performance was strongly correlated with the interrater agreement of the consensus labels (rho = 0.904, p < 0.001) (Figure 2b). As the strength of the interrater
