
Klein Klouwenberg and colleagues investigated inter-observer agreement in 168 patients who experienced an infectious episode in the intensive care unit [5]. Each case was independently assessed by two research physicians who had worked on the project for at least six months and who scored the source of infection using a composite reference standard. Agreement was 89% for a partial diagnostic match and 69% for a complete match. In addition, the authors found that agreement varied from 35% to 97% across specific diagnostic subgroups, with 89% concordance for CAP [5]. A study that investigated the effects of an imperfect reference standard on study outcomes suggests that even an almost perfect reference standard can lead to estimates with considerable error [12].

In the present study, total agreement after the first assessment by medical students was observed for 132 of the 240 participants (55%). Agreement among prevalent diagnoses varied from 49% (LRTI other than CAP) to 81% (CAP), indicating that assessment of predefined cases by medical students may yield agreement rates similar to those of the well-trained physicians in Klein Klouwenberg's study [5]. Notably, after individual assessment of the paper vignettes by the internist, the pulmonologist and, if necessary, the cardiologist, the members of the expert panel reached mutual agreement in only 34 of 60 cases (57% concordance; 95% CI 44–69%). Furthermore, inter-observer agreement between members of the expert panel also varied (e.g., concordance for LRTI other than CAP was 55%, for cardiac failure 75%, and for CAP 100%). Our results underscore the necessity of consensus diagnosis, as individual assessment of cases leads to considerable disagreement, not only among students or residents but also among fully trained and experienced medical specialists. Our study differs from previous work in that this additional assessment was performed only if there was disagreement after the first assessment.

The validation process of the study has the intrinsic difficulty that the students and residents followed the guidelines presented in the structured handbook, whereas the expert panel also drew on its members' years of experience. The qualitative evaluation of the 60 validation cases showed that differences mainly occurred in less-severe diagnoses, such as a URTI. For this specific diagnostic label, students may conclude, based on the strict guidelines of the handbook, that a patient suffered from a URTI, whereas the medical experts may attach little importance to this finding if the more clinically relevant diagnosis of heart failure is also present. As shown, comparing the classification of 60 patients by medical students and residents with that of the panel of medical specialists resulted in agreement on the clinical diagnosis for 50 of the 60 patients. If the URTIs are not counted as disagreement, there would have been agreement for 55 of the 60 cases (92% concordance; 95% CI 85–99%). We conclude that students formally classify more less-severe diagnoses, such as URTI, that a medical specialist would set aside. What is indisputable is that the presented method is efficient, with classification by medical experts limited to 24% of the study population and an estimated reduction of 703 hours of work by medical specialists in this RCT.
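The confidence intervals reported above are consistent with a simple normal-approximation (Wald) interval for a binomial proportion. The sketch below is a hypothetical illustration, not the study's own analysis code; under that assumption it reproduces the reported figures (34/60 giving 57%, 95% CI 44–69%, and 55/60 giving 92%, 95% CI 85–99%).

```python
# Minimal sketch: normal-approximation (Wald) 95% CI for an agreement
# proportion, as commonly reported for concordance rates.
# Hypothetical illustration; not the study's own analysis code.
import math

def agreement_ci(agreed: int, total: int, z: float = 1.96):
    """Return (proportion, lower, upper) for a Wald 95% CI."""
    p = agreed / total
    se = math.sqrt(p * (1 - p) / total)  # standard error of the proportion
    return p, max(0.0, p - z * se), min(1.0, p + z * se)

for agreed, total in [(34, 60), (55, 60), (132, 240)]:
    p, lo, hi = agreement_ci(agreed, total)
    print(f"{agreed}/{total}: {p:.0%} (95% CI {lo:.0%}-{hi:.0%})")
# 34/60: 57% (95% CI 44%-69%)
# 55/60: 92% (95% CI 85%-99%)
# 132/240: 55% (95% CI 49%-61%)
```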
The development and use of the diagnostic handbook can be considered one of the strengths of our approach. All members of the classification teams (i.e., students, residents and the expert panel) used the same structured diagnostic labels as a guideline for their assessment. The provision of such a reference adds to the consistency of disease classification in all steps of the process. In addition, we structured the consensus meetings, including blinding the attendees to each other's assessments,
