
Variable importance with OOB-observations, including sensitivity analyses

Variable importance was measured with RF in the following way. RF draws a bootstrap sample for every tree it constructs. The observations that are not included in a tree's bootstrap sample are called the out-of-bag (OOB) observations. Each tree predicts the outcome for its OOB observations, which yields an OOB error. Next, to determine the tree-specific importance of a variable, the values of that variable are randomly shuffled (permuted) among the tree's OOB observations, and the same tree predicts these permuted data, yielding a new OOB error. This new OOB error is then compared with the original OOB error. If permuting a variable increases the error, the variable is considered important, as the tree relied on it for prediction. Thus, by permuting a variable and comparing the OOB error rates before and after permutation (6), we obtain a tree-specific importance measure for each variable. The increase in OOB error for each variable is averaged over all trees, and these averages yield the ranking of the variables for the model (7).

As sensitivity analyses, we also checked the permutation importance and Partial Dependence Plots (PDP) (8-10). Permutation importance permutes the values of a specific variable in the full dataset (rather than within individual trees) to measure the impact on the model's overall performance. The PDP of each variable provides insight into the direction and strength of its relationship with the dependent variable while all other predictors are held constant. We checked whether the direction of the effect of the important variables aligned with their categorization as risk or protective factors. An illustrative code sketch of these procedures follows the reference list.

References

1. Chen T, Guestrin C. XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2016. p. 785-94.
2. Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software 2010; 33(1): 1-22.
3. Wright MN, Wager S, Probst P. Package ‘ranger’. 2023. Available from: https://mirror.las.iastate.edu/CRAN/web/packages/ranger/ranger.pdf.
4. Friedman J, Hastie T, Tibshirani R, Narasimhan B, Tay K, Simon N, Qian J. Package ‘glmnet’. 2023.
5. Chen T, He T, Benesty M, Khotilovich V. Package ‘xgboost’. 2023. Available from: https://cran.utstat.utoronto.ca/web/packages/xgboost/xgboost.pdf.
6. Breiman L. Random forests. Machine Learning 2001; 45: 5-32.
7. Janitza S, Celik E, Boulesteix A-L. A computationally fast variable importance test for random forests for high-dimensional data. Advances in Data Analysis and Classification 2018; 12: 885-915.
8. Molnar C. Interpretable Machine Learning. Lulu.com; 2020.
9. Fisher A, Rudin C, Dominici F. All models are wrong, but many are useful: learning a variable's importance by studying an entire class of prediction models simultaneously. Journal of Machine Learning Research 2019; 20(177): 1-81.
10. Friedman JH. Greedy function approximation: a gradient boosting machine. Annals of Statistics 2001; 29(5): 1189-232.
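To make the two importance measures concrete, the following R sketch using the ranger package (3) shows (a) the per-tree OOB permutation importance and (b) a simple full-dataset permutation importance. The data frame dat, the outcome y, and the predictors x1-x3 are simulated for illustration only; they are not the study data or the study's actual code.

    ## Illustrative simulated data; not the study data.
    set.seed(1)
    n  <- 200
    x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
    dat <- data.frame(y = factor(ifelse(x1 - x2 + rnorm(n) > 0, "yes", "no")),
                      x1 = x1, x2 = x2, x3 = x3)

    library(ranger)

    ## (a) OOB permutation importance (6): within each tree, a variable is
    ## permuted among that tree's OOB observations, the same tree re-predicts
    ## them, and the increases in OOB error are averaged over all trees.
    fit <- ranger(y ~ ., data = dat, num.trees = 500,
                  importance = "permutation", seed = 1)
    sort(fit$variable.importance, decreasing = TRUE)

    ## (b) Model-level permutation importance: permute a variable in the full
    ## dataset and measure the loss in accuracy of the forest as a whole.
    acc <- function(d) mean(predict(fit, data = d)$predictions == dat$y)
    baseline <- acc(dat)
    perm_imp <- sapply(c("x1", "x2", "x3"), function(v) {
      d <- dat
      d[[v]] <- sample(d[[v]])   # break the association with the outcome
      baseline - acc(d)
    })
    sort(perm_imp, decreasing = TRUE)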
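A partial dependence curve can likewise be sketched by fixing one predictor at each value of a grid in every row of the data, averaging the model's predictions, and plotting the averages against the grid; the variable names again refer to the simulated data above.

    ## Partial dependence of the predicted probability on x1, holding the
    ## other predictors at their observed values.
    pfit <- ranger(y ~ ., data = dat, num.trees = 500,
                   probability = TRUE, seed = 1)
    grid <- seq(min(dat$x1), max(dat$x1), length.out = 20)
    pd <- sapply(grid, function(g) {
      d <- dat
      d$x1 <- g                                   # fix x1 at the grid value
      mean(predict(pfit, data = d)$predictions[, "yes"])
    })
    plot(grid, pd, type = "l",
         xlab = "x1", ylab = "Mean predicted P(y = yes)")

An upward-sloping curve would mark x1 as a risk factor and a downward-sloping curve as a protective factor, which is how the directions were checked against the variables' categorization.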
