Marga Hoogendoorn

of development of a scoring system, methods used to measure the reliability and validity of the system, results regarding reliability and validity, and methods used to translate the workload measurement into needed nursing time.

Assessment of validity and reliability of scoring systems

For all included full papers, the validity and reliability of the scoring systems were assessed using the following criteria.

Content validity: we considered a scoring system content-valid when nursing professionals participated in the selection of interventions and activities included in the scoring system, and when expert consensus in focus groups or Delphi rounds was used, or when the Content Validity Index for the overall system was at least 0.9 [10, 11].

Reliability: we assessed data on inter-rater reliability (level of agreement between the scores of different nurses scoring the nursing interventions of the same patient) and intra-rater reliability (level of agreement between assessment and reassessment of the nursing intervention scores of a patient by the same nurse). The following statistical tests and cut-off values were used for the assessment of reliability: Cohen’s kappa and the Intraclass Correlation Coefficient (ICC). For kappa, we used the ranges of Landis and Koch, in which a value of 0.41–0.60 indicates moderate, 0.61–0.80 substantial, and 0.81–1 almost perfect agreement [12]. For evaluation of the ICC we used Cronbach’s alpha with a cut-off point of 0.70 for acceptable reliability [13].

Validity: we defined validity as the extent to which the interventions or activities of a scoring system actually measured the true outcome, i.e. needed nursing time. We distinguished two methods to assess validity:
1. By comparing the results of a scoring system with the ‘gold standard’, observed time measurements.
2. By comparing a newly developed scoring system with an already existing system.
We considered method 2 a weaker method for validation.
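To make the kappa ranges in the reliability criteria above concrete, the following is a minimal sketch of how Cohen's kappa could be computed for two raters and then mapped onto the Landis and Koch labels. The function names `cohen_kappa` and `interpret_kappa`, the example scores, and the label "below moderate" for values under 0.41 are illustrative assumptions and are not taken from the reviewed papers. Treating the published ranges as continuous (for example, any value above 0.60 up to 0.80 counts as substantial) is also an assumption.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who scored the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same score by both raters.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement, derived from each rater's marginal score frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_chance = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_chance == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_observed - p_chance) / (1 - p_chance)


def interpret_kappa(kappa):
    """Map kappa onto the Landis and Koch ranges quoted in the text.

    Assumption: the ranges are treated as continuous intervals, and values
    below 0.41 are grouped as 'below moderate' because the text does not
    name those categories.
    """
    if kappa > 0.80:
        return "almost perfect"
    if kappa > 0.60:
        return "substantial"
    if kappa > 0.40:
        return "moderate"
    return "below moderate"


# Hypothetical example: two nurses score one binary item for eight patients.
nurse_1 = [1, 1, 1, 1, 0, 0, 0, 0]
nurse_2 = [1, 1, 1, 1, 0, 0, 0, 1]
kappa = cohen_kappa(nurse_1, nurse_2)
print(f"kappa = {kappa:.2f} ({interpret_kappa(kappa)})")  # kappa = 0.75 (substantial)
```

The same two functions apply to intra-rater reliability if the two lists hold the assessment and reassessment by the same nurse.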
The following statistical methods were used for the assessment of validity: the linear regression equation (r²) and the correlation coefficient (Pearson’s r or Spearman’s r_s). For interpretation, we categorized the results as a weak (r/r² < 0.25), moderate (0.25 ≤ r/r² < 0.75) or strong (r/r² ≥ 0.75) correlation [14]. We used the same methods to assess the validity of the translation of the measurement of nursing time into the need for nursing staff, often expressed as a Nurse:Patient ratio.
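As an illustration of the validity cut-offs above, the sketch below computes Pearson's r between workload scores and observed nursing time and assigns the weak, moderate or strong label. The function names `pearson_r` and `categorize_correlation` and the data are hypothetical and serve only as an example. Applying the cut-offs to the absolute value of the coefficient is an assumption, since the text does not state how negative correlations were handled.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal length with at least 2 values")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sums of cross-products and squared deviations around the means.
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    if s_xx == 0 or s_yy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return s_xy / math.sqrt(s_xx * s_yy)


def categorize_correlation(value):
    """Apply the cut-offs from the text to r or r squared.

    Assumption: the absolute value is used, so a strongly negative
    coefficient is also labelled 'strong'.
    """
    magnitude = abs(value)
    if magnitude < 0.25:
        return "weak"
    if magnitude < 0.75:
        return "moderate"
    return "strong"


# Hypothetical data: workload score per shift and observed nursing minutes.
workload_score = [10, 20, 30, 40, 50]
observed_minutes = [35, 50, 80, 95, 130]
r = pearson_r(workload_score, observed_minutes)
# For simple linear regression with one predictor, r squared equals r ** 2.
print(f"r = {r:.2f} ({categorize_correlation(r)})")
print(f"r^2 = {r * r:.2f} ({categorize_correlation(r * r)})")
```

Because the same cut-offs are applied to both r and r², a given association can receive a different label depending on which statistic a paper reports. For example, r = 0.80 is strong while the corresponding r² = 0.64 is moderate.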