
Roberts et al., 2017). In addition, a sensitivity analysis for the within-subject correlation is available in the multimedia appendix for all r-values between 0 and 1. Sample sizes lower than n = 10 were not included in the analysis, because this could lead to inaccurate estimates of the standardized mean gain (Morris, 2000).

In some interventions, the main treatment programme was followed by follow-up activities to enhance maintenance. In these situations, we considered the end of treatment to be the moment that the main treatment programme (i.e., the part covering the core treatment procedures) ended. Hence, follow-up meetings, booster sessions, or reinforcement sessions were not considered part of the main treatment and could continue after post-treatment assessments. All assessments within 1 month after the end of treatment were considered 'post' measures. We used the last available time point for the follow-up contrast.

We calculated standard deviations from standard errors by multiplying them by the square root of the corresponding sample size (Higgins et al., 2019). If medians and ranges were provided, we used the formula of Hozo and colleagues (2005) to estimate the mean and corresponding SD. For studies that presented change scores, we calculated final-value mean scores and imputed the baseline standard deviation. If the latter was not available, the study was not included in the meta-analysis. For medians and interquartile ranges, we estimated means and SDs using the assumption that the interquartile range (IQR) width equals 1.35 SD (Higgins et al., 2019). In case of missing measures of variability at follow-up, we imputed the baseline value or otherwise used the mean SD of the remaining trials that reported on that outcome. If data of a cohort were presented for different subgroups, we calculated one composite mean and SD. For data that were only presented in figures (e.g., boxplots), we measured the central tendency and measure of dispersion if the figure was of sufficient quality.

Subsequently, we summarized the effect sizes per outcome by describing the direction of effect for each of the included cohorts over time. We decided a priori not to perform any pooling, because this was not in line with our study aims and we expected substantial heterogeneity among the included studies. To facilitate interpretation of the effect sizes, we re-expressed the median pre-post effect size on the most commonly used measurement instrument, using the weighted standard deviation of all available post-intervention scores of that instrument.

To assess the statistical heterogeneity of the study outcomes, we also calculated the I² index and the Q test for each outcome domain at every time point. A statistically significant Q test rejects the hypothesis that all effect sizes are equal (Huedo-Medina et al., 2006). In addition, the I² index estimates the proportion of variability in observed effects that is due to between-study variability rather than within-study variability (i.e., sampling error) (Borenstein et al., 2017). This analysis was performed with the R metafor package in RStudio (R Core Team, 2013; RStudio Team, 2020; Viechtbauer, 2010). Illustrative sketches of these computations are given below.
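The following is a minimal sketch, not the review's actual script, of the standardized mean gain and the sensitivity analysis over the within-subject correlation, using metafor's escalc() with raw-score standardization (measure "SMCR"). The data frame dat and all cohort numbers in it are hypothetical; note that under raw-score standardization the assumed correlation r changes the sampling variances, not the point estimates.

```r
# Standardized mean gain with a sensitivity analysis over the assumed
# within-subject correlation r. All cohort numbers are hypothetical.
library(metafor)

dat <- data.frame(
  m_pre  = c(6.8, 7.2, 6.1),  # hypothetical pre-treatment means
  m_post = c(5.1, 5.9, 4.8),  # hypothetical post-treatment means
  sd_pre = c(1.9, 2.1, 1.7),  # hypothetical pre-treatment SDs
  n      = c(54, 120, 33)     # cohort sizes (cohorts with n < 10 excluded)
)

for (r in c(0, 0.25, 0.50, 0.75, 1)) {
  es <- escalc(measure = "SMCR",   # raw-score standardization (Morris, 2000)
               m1i = m_pre, m2i = m_post, sd1i = sd_pre,
               ni = n, ri = rep(r, nrow(dat)), data = dat)
  cat(sprintf("r = %.2f: median ES = %.2f, median sampling variance = %.4f\n",
              r, median(es$yi), median(es$vi)))
}
```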
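The summary-statistic conversions can be expressed as small helper functions; again a sketch, with the cut-offs in the range-based SD rule following the recommendations of Hozo et al. (2005) and all example inputs invented.

```r
# SD from a standard error: SD = SE * sqrt(n) (Higgins et al., 2019).
sd_from_se <- function(se, n) se * sqrt(n)

# Mean and SD from a median (m) and range (a = min, b = max), following
# Hozo et al. (2005): mean ~ (a + 2m + b) / 4; the SD estimate depends
# on sample size.
hozo <- function(a, m, b, n) {
  mean_est <- (a + 2 * m + b) / 4
  sd_est <- if (n <= 15) {
    sqrt(((a - 2 * m + b)^2 / 4 + (b - a)^2) / 12)
  } else if (n <= 70) {
    (b - a) / 4
  } else {
    (b - a) / 6
  }
  c(mean = mean_est, sd = sd_est)
}

# SD from an interquartile range, assuming IQR ~ 1.35 * SD
# (Higgins et al., 2019).
sd_from_iqr <- function(q1, q3) (q3 - q1) / 1.35

sd_from_se(0.45, 64)               # 3.6
hozo(a = 2, m = 5, b = 11, n = 40) # mean 5.75, sd 2.25
sd_from_iqr(q1 = 3.1, q3 = 6.2)    # ~2.30
```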
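For cohorts reported as separate subgroups, one composite mean and SD can be obtained with the standard formula for combining groups (see Higgins et al., 2019); the subgroup numbers below are made up for illustration.

```r
# Collapse subgroup means and SDs into one composite mean and SD.
combine_groups <- function(n, m, s) {
  N      <- sum(n)
  m_comb <- sum(n * m) / N
  # total sum of squares = within-subgroup + between-subgroup parts
  ss <- sum((n - 1) * s^2) + sum(n * (m - m_comb)^2)
  c(n = N, mean = m_comb, sd = sqrt(ss / (N - 1)))
}

combine_groups(n = c(28, 35), m = c(6.4, 5.8), s = c(1.8, 2.2))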
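Re-expressing the median effect size in the units of the most commonly used instrument amounts to multiplying it by a weighted SD of the available post-intervention scores. The sketch below assumes one reading of "weighted": each cohort's post-intervention SD weighted by its degrees of freedom (i.e., a pooled SD); all numbers are hypothetical.

```r
post_sd <- c(12.4, 10.8, 13.1)  # post-intervention SDs on the instrument
post_n  <- c(54, 120, 33)       # corresponding cohort sizes

# degrees-of-freedom weighted (pooled) SD of post-intervention scores
w_sd <- sqrt(sum((post_n - 1) * post_sd^2) / (sum(post_n) - length(post_n)))

median_es <- 0.62               # hypothetical median pre-post effect size
median_es * w_sd                # change expressed in instrument points
```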
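Finally, the Q test and I² index per outcome domain and time point can be obtained from metafor's rma(); fitting a random-effects model is the usual route to these statistics even when the pooled estimate itself is not reported. This sketch reuses the hypothetical dat from the first example, fixing r at 0.5.

```r
library(metafor)

es  <- escalc(measure = "SMCR", m1i = m_pre, m2i = m_post,
              sd1i = sd_pre, ni = n, ri = rep(0.5, nrow(dat)), data = dat)
fit <- rma(yi, vi, data = es, method = "REML")  # fit used only for Q and I^2

fit$QE   # Cochran's Q statistic
fit$QEp  # p-value of the Q test
fit$I2   # I^2: % of variability due to between-study heterogeneity
```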
