
DISCUSSION

In this study, we assessed the performance of a nomogram predicting side-specific EPE, using data from an external patient population undergoing RARP. The discriminative ability of this nomogram observed in our study was fair, with an AUC of 0.78. However, calibration of the nomogram was poor, as it systematically underestimated side-specific EPE risk. Although model updating improved the agreement between the overall mean predicted and observed risk, substantial and unsystematic disagreement between predicted and observed probabilities persisted on the calibration plot (illustrated in the recalibration sketch at the end of this section). Use of this model for individualized patient EPE risk prediction is therefore not recommended. Based on its superior net benefit compared with the use of MRI alone, the nomogram could nevertheless be of value if risk thresholds were used to determine the presence of side-specific EPE. Given the high net benefit and acceptable sensitivity of 82%, the best results would be achieved with a risk threshold of 25% (see the worked net-benefit example at the end of this section). However, clinicians who aim to use this model in daily practice should realize that the specificity at this threshold is low (58%). Using a 25% threshold would thus lead to overtreatment, as the nomogram would advise against nerve sparing in a large number of cases with ipsilateral organ-confined disease.

Our findings concerning the calibration (underestimation) of the original nomogram were consistent with those reported by Sighinolfi et al.19 These authors also observed poor model fit of the Martini nomogram in their external validation study.19 Findings regarding discrimination were inconsistent, as their reported AUC was substantially lower (0.68 versus 0.78). It should be noted that this previous external validation study consisted of a retrospective series of only 106 patients, accounting for a total of 137 biopsy-positive lobes, of which 40 contained EPE. Given this relatively small population, with a low number of events and a correspondingly low number of events per variable (EPV <7), the sample size may have been too small for a reliable validation.

The generalisability of a prediction model strongly depends on a number of factors, of which case mix is crucial. When comparing this validation cohort with the derivation cohort, a number of important differences can be identified that may explain the poor fit. First, the prevalence of EPE on final pathology was much higher in the validation cohort than in the development cohort (250/792 [32%] vs. 142/829 [17%]). This may partially explain the systematic underestimation of the predicted probabilities on initial calibration. In addition, evaluation of the distribution of the predictors revealed that EPE on mpMRI was more common in the external validation cohort than in the development cohort (20% vs. 14%), as was the proportion of cases with maximum percentage core involvement >50% (44% vs. 34%). The prevalence of the predictors in the study sample is important for the generalisability of the model, since it determines the total variance explained by the model. Compared with the validation cohort, the prevalence of the predictors was lower in the development cohort. This relatively low prevalence of the predictors also explains the relatively wide 95% CI around the odds ratios of the original model,
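To make the model updating referred to above more concrete, the sketch below illustrates one common updating approach, logistic recalibration, in which the intercept (and optionally the slope) is re-estimated on the linear predictor of the original nomogram. The data, sample size, and effect sizes in the sketch are simulated and purely illustrative; this is a minimal sketch of the general technique, not the exact updating procedure or data used in this study.

```python
# Minimal sketch of logistic recalibration of an existing risk model
# (simulated data and hypothetical variable names; illustrative only).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
p_orig = rng.uniform(0.05, 0.60, size=792)            # nomogram-predicted EPE risks per lobe (simulated)
y = rng.binomial(1, np.clip(p_orig * 1.8, 0, 0.95))   # simulated outcomes with systematic underestimation

logit = np.log(p_orig / (1 - p_orig))                  # linear predictor of the original model

# Recalibration-in-the-large: re-estimate the intercept only, slope fixed at 1 (via offset).
fit_intercept = sm.GLM(y, np.ones_like(logit), offset=logit,
                       family=sm.families.Binomial()).fit()

# Logistic recalibration: re-estimate both intercept and slope on the linear predictor.
fit_slope = sm.GLM(y, sm.add_constant(logit),
                   family=sm.families.Binomial()).fit()

p_updated = fit_slope.predict(sm.add_constant(logit))
print("calibration intercept only:", fit_intercept.params)
print("calibration intercept and slope:", fit_slope.params)
print("mean predicted (original / updated) vs observed:",
      round(p_orig.mean(), 3), round(p_updated.mean(), 3), round(y.mean(), 3))
```

As in our validation, an intercept (or intercept-and-slope) update of this kind can restore agreement between the mean predicted and observed risk, while leaving residual, unsystematic miscalibration across the range of predictions untouched.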
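The net-benefit reasoning behind the 25% threshold can likewise be made explicit. In decision curve analysis, net benefit at a risk threshold pt equals TP/n − (FP/n) × pt/(1 − pt). The sketch below plugs in the reported prevalence, sensitivity, and specificity at the 25% threshold to approximate the nomogram's net benefit and compares it with a generic "treat-all" strategy (advising against nerve sparing in every lobe); the resulting figures are rough illustrations under these assumptions, not the study's actual decision-curve estimates, and the treat-all comparator stands in for, rather than reproduces, the MRI-alone comparison reported above.

```python
# Minimal sketch of the net-benefit calculation used in decision curve analysis
# (approximate, illustrative inputs; not the study's decision-curve output).
def net_benefit(tp, fp, n, threshold):
    """Net benefit = TP/n - FP/n * pt / (1 - pt) at risk threshold pt."""
    return tp / n - fp / n * threshold / (1 - threshold)

n = 792                  # biopsy-positive lobes in the validation cohort
prevalence = 0.32        # observed side-specific EPE prevalence
sens, spec = 0.82, 0.58  # reported at the 25% risk threshold
pt = 0.25

events = prevalence * n
tp = sens * events                # EPE lobes correctly flagged
fp = (1 - spec) * (n - events)    # organ-confined lobes flagged as EPE

nb_nomogram = net_benefit(tp, fp, n, pt)
nb_treat_all = net_benefit(events, n - events, n, pt)  # flag every lobe as EPE

print(f"net benefit, nomogram at 25% threshold: {nb_nomogram:.3f}")
print(f"net benefit, treat-all at 25% threshold: {nb_treat_all:.3f}")
```

The calculation also makes the overtreatment trade-off visible: at a 25% threshold the false-positive term is driven by the low specificity, so a sizeable fraction of organ-confined lobes would be denied nerve sparing.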
