Mariska Tuut


Guideline development on healthcare related testing Mariska Tuut

The research presented in this thesis was conducted at CAPHRI Care and Public Health Research Institute, Department of Family Practice, Maastricht University. CAPHRI participates in the Netherlands School of Public Health and Care Research (CaRe).

Colophon
Lay-out: Mariska Tuut
Cover design: Ridderprint | www.ridderprint.nl
Printing: Ridderprint | www.ridderprint.nl
© Copyright: Mariska Tuut, Maastricht 2024
ISBN: 978-94-6506-258-7

Guideline development on healthcare related testing

Dissertation

to obtain the degree of Doctor at Maastricht University, on the authority of the Rector Magnificus, Prof. dr. Pamela Habibović, in accordance with the decision of the Board of Deans, to be defended in public on Tuesday 8 October 2024 at 13.00 hours

by

Margaretha Klaasje Tuut
born on 26 January 1975 in Groningen

Supervisors
Prof. dr. T. van der Weijden
Prof. dr. J.S. Burgers

Co-supervisor
Dr. M.W. Langendam (University of Amsterdam)

Assessment committee
Prof. dr. M.H.J.M. Majoie (chair)
Prof. dr. P.J. van der Wees (Radboud Universiteit Nijmegen)
Prof. dr. S.M.A.A. Evers
Dr. M.M. Tabbers (University of Amsterdam)

Table of contents

Preface
Chapter 1. General introduction
Chapter 2. Applying GRADE for diagnosis revealed methodological challenges: an illustrative example for guideline developers
Chapter 3. Do clinical practice guidelines consider evidence about diagnostic test consequences on patient-relevant outcomes? A critical document analysis
Chapter 4. Required knowledge for guideline panel members to develop healthcare related testing recommendations – a developmental study
Chapter 5. Developing guideline recommendations about tests: educational examples of test-management pathways
Chapter 6. Co-creation of a step-by-step guide for specifying the test-management pathway to formulate focused guideline questions about healthcare related tests
Chapter 7. General discussion
Impact
Summary
Samenvatting (summary in Dutch)
Publiekssamenvatting (lay summary in Dutch)
Bibliography
Over de auteur / About the author
Dankwoord (acknowledgements)

Preface

‘Good guidelines can only make you better’ [1].

‘The challenge of scientific research is not to find answers, but to formulate the question.’ [2].

‘Guideline development reveals the dilemmas and uncertainties associated with the application of medical knowledge. The guideline should not cover this up, but make it transparent, and link patient decision aids to preference sensitive recommendations.’ [3].

The three propositions above, cited from my supervisors, were published decades ago but still underpin the urgency of this thesis. These statements not only show confidence in the profession of guideline development, but also illuminate ongoing challenges in guideline development methods. Above all, these propositions inspire. They align with my experience as a guideline methodologist, in which I had, and still have, the opportunity to work with many dedicated healthcare professionals, patients and patient representatives, methodologists, process leads and guideline panel chairs. In them I saw enthusiasm and expertise, but also the struggle to use the right ingredients in the right way to ‘cook the right guidelines’.

It is my personal ambition to improve and facilitate guideline development methods, especially in the area of recommendations about healthcare related testing, and thereby to contribute to the improvement of healthcare quality. Therefore, this thesis focuses on knowledge and tools that can help guideline developers (in the broadest sense) in appropriately developing recommendations about healthcare related testing.

References
1. Burgers JS. Quality of clinical practice guidelines. Nijmegen: Catholic University Nijmegen; 2002.
2. Langendam MW. The impact of harm reduction-based methadone treatment on HIV infection and mortality. Amsterdam: University of Amsterdam; 2000.
3. Van der Weijden T. Richtlijnen in de spreekkamer, van dogma naar dans. Maastricht: Maastricht University; 2010.

Chapter 1. General introduction


General introduction

This introductory chapter guides the reader through the pillars that are essential for addressing challenges in guideline development and healthcare related testing in practice. It sets out the rationale for this thesis by outlining and bringing together the worlds of guideline development, testing in practice, and test evaluation in research, and finally arrives at the aim of this thesis and the research questions.

Guidelines

Guidelines, including clinical practice guidelines and public health guidelines, are documents providing recommendations intended to optimize patient care. They are developed using a systematic review of the available evidence and an analysis of benefits and harms of alternative care options. To be regarded as trustworthy according to the Institute of Medicine, guidelines should:
- be based on a systematic review of the existing evidence;
- be developed by a knowledgeable, multidisciplinary panel of experts and representatives from key affected groups;
- consider important patient subgroups and patient preferences, as appropriate;
- be based on an explicit and transparent process that minimizes distortions, biases, and conflicts of interest;
- provide a clear explanation of the logical relationships between alternative care options and health outcomes;
- provide ratings of both the quality of evidence and the strength of recommendations; and
- be reconsidered and revised as appropriate when important new evidence warrants modifications of recommendations [1].

Guideline development follows a clear process, which is crucial for acceptance and implementation. The first step is an analysis of the problems to be addressed and the identification of the specific topic, target group(s) and target population of the guideline.
Next, a guideline panel (also known as a guideline development group/committee) is established, consisting of representatives from all relevant professional groups, patient/consumer/people representatives and methodologists. Following that, the scope of the guideline is defined, including the formulation of key questions that need to be addressed. After that, a draft guideline is developed. This process includes a series of steps, in which available guidelines are reviewed, scientific evidence is identified and critically assessed, and relevant expertise and experience are considered, after which draft recommendations are formulated. Next, the draft guideline is disseminated to all relevant stakeholders and target groups for comments and feedback. This step may include pilot testing of the draft guideline to identify barriers to implementation. Then, the final version of the guideline can be submitted for endorsement or authorisation. Finally, the guideline and any related materials, such as summaries, patient versions and decision support tools, are published. The guideline also specifies criteria for its review and update [2].

Note that endorsement and authorisation are not universal in guideline development worldwide. In the Netherlands, authorised guidelines become part of the professional standard for healthcare providers. This guarantees legal embedding of guidelines in the healthcare process and fosters their implementation.

Several manuals and guides are available for the development of guidelines [3-5]. The Grades of Recommendation, Assessment, Development and Evaluation (GRADE) Working Group was established in 2000 to provide assistance for the process of guideline development. The GRADE approach highlights the importance of evaluating the certainty of evidence in the development of recommendations, for example by assessing risk of bias and indirectness [6]. Another crucial aspect of this methodology is its emphasis on clinically relevant differences in outcomes that are regarded as important by patients and consumers, so-called people-important outcomes [7]. The GRADE evidence-to-decision framework systematically considers relevant issues such as the balance of benefits and harms, values, resources, and acceptability [8, 9]. The GRADE Working Group has produced and continues to produce comprehensive guides for guideline development [7, 10-25]. The GRADE approach has been adopted by many organisations worldwide, including organisations in the Netherlands [26].
In the GRADE approach, special attention is given to the development of guideline recommendations on testing, as the link between testing and the impact on people-important outcomes is indirect and requires a specific approach [27-30]. This includes consideration of the consequences of false positive, false negative, and inconclusive test results, specific risk of bias assessment, moving from test results to people-important outcomes (so-called linked evidence), and the need for formal or informal modelling.

Competencies needed for guideline development

While the essential steps for guideline development have been outlined [31-33], there is limited understanding of the competencies required for the appropriate development of guidelines, particularly those that feature recommendations about testing. Some research has been conducted in this area: Sultan et al. provided a theoretical framework for competencies and educational milestones that guideline developers should acquire, for example through training. The authors identified three core competencies:
1. Facilitate the development of guideline structure and setup
2. Make judgments about the quality or certainty of the evidence
3. Transform evidence to a recommendation

These core competencies are divided into subcompetencies and milestones. Additionally, the authors acknowledge that a guideline panel includes various roles, i.e. chair, methodologist, and panel members, with different competencies [34]. The specific knowledge and competencies needed for creating guidelines on testing are not explicitly incorporated in this framework.

Testing and people-important outcomes

A test refers to any procedure performed on a person to detect, diagnose or monitor a condition. This includes testing of a person’s fluids, cells, tissue, functioning and subjective experience. The final objective of testing is to improve people-important outcomes (and/or to prevent deterioration of people-important outcomes). Additional objectives may include offering other benefits (such as simplifying healthcare organisation or reducing expenses) without worsening people-important outcomes.

People-important outcomes, also known as people relevant outcomes, patient important outcomes, patient relevant outcomes or patient-centered outcomes, are components of people’s (health) status following an intervention. These outcomes serve to evaluate the effectiveness of the intervention [35]. People-important outcomes may differ depending on the condition and the individual. Common examples include mortality, morbidity, quality of life, and quality of life subscales such as functioning capacity and societal participation.

When assessing the effectiveness of a specific treatment, the link between treatment and change in people-important outcomes is usually clear.
For example, antibiotic treatment is related to curing bacterial pneumonia (and reducing mortality), radiotherapy is linked to reducing pain in patients with bone metastases, and hip replacement surgery to improved walking function (although side effects and complications should be considered in all cases). Unlike treatment, testing itself typically has no immediate effect on people-important outcomes. Common exceptions are reassurance when a serious illness is ruled out and serious burden (such as serious adverse events) caused by testing. In general, to progress from testing to people-important outcomes, a series of essential steps, such as treatment of a certain condition, should be taken.

Testing in clinical practice

Clinical decision-making with the use of a test or testing strategy is daily practice. Healthcare professionals may consider the use of tests after history taking and physical examination. Patients may also request tests for various reasons, such as a family history of disease, concern about physical conditions, or the need for regular testing. Most patients have high expectations regarding the value of tests: they do not expect false positive or false negative test results and believe that test results are reliable. In other words, they expect test results to give them certainty about their health status and to reassure them when results are in the normal range [36].

Testing is frequently used for diagnostic purposes. In clinical practice, the diagnostic process is an empirical iterative process [37]. It has inductive and deductive elements, based on Bayes’ theorem [38]. Bayes’ theorem, also known as Bayes’ rule, states that the a posteriori probability of an event (such as a disease or condition) is conditional and depends on the a priori probability of that event and on test results. Taking the medical history (anamnesis), conducting a physical examination and routine medical testing (such as routine laboratory tests) are generally inductive processes for making a general diagnosis (‘rough selection’). Clinicians use signs and symptoms and combine them inductively to move in a diagnostic direction. This can be seen as hypothesis generation. In addition, specific tests (such as spirometry or a dementia test) can be conducted as part of deductive processes. These are targeted tests, intended to confirm or rule out a specific diagnosis. This can be seen as hypothesis testing. The entire diagnostic process in clinical practice is called the hypothetico-deductive method [39-41].
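The dependence of the a posteriori probability on the a priori probability and the test result can be illustrated with a small calculation. The following is a minimal sketch and not part of the original text: it assumes that the test is summarised by its sensitivity and specificity, and the function name and all numbers are hypothetical.

```python
def post_test_probability(pre_test: float, sensitivity: float,
                          specificity: float, positive: bool = True) -> float:
    """Posterior probability of the condition given a test result (Bayes' theorem).

    All inputs are probabilities between 0 and 1. The function name and the
    example values below are illustrative assumptions.
    """
    for value in (pre_test, sensitivity, specificity):
        if not 0.0 <= value <= 1.0:
            raise ValueError("probabilities must lie between 0 and 1")
    if positive:
        p_result_if_condition = sensitivity           # P(T+ | condition)
        p_result_if_no_condition = 1.0 - specificity  # P(T+ | no condition)
    else:
        p_result_if_condition = 1.0 - sensitivity     # P(T- | condition)
        p_result_if_no_condition = specificity        # P(T- | no condition)
    numerator = p_result_if_condition * pre_test
    denominator = numerator + p_result_if_no_condition * (1.0 - pre_test)
    if denominator == 0.0:
        raise ValueError("this test result has probability zero under the inputs")
    return numerator / denominator


# Hypothetical example: a priori probability 10%, sensitivity 90%, specificity 80%.
after_positive = post_test_probability(0.10, 0.90, 0.80, positive=True)
after_negative = post_test_probability(0.10, 0.90, 0.80, positive=False)
print(f"after a positive result: {after_positive:.3f}")
print(f"after a negative result: {after_negative:.3f}")
```

With these hypothetical values, a positive result raises the probability from 10% to about one in three, while a negative result lowers it to roughly 1%. The same test therefore supports different conclusions depending on the a priori probability, which is why targeted testing follows hypothesis generation.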
The diagnostic process includes both sense (including clinical reasoning, understanding, experience and common sense) and science (including evidence, theory and testing) [42]. Clinical experience, which includes gut feelings (‘pluis/niet pluis’, a sense of reassurance or alarm), is a crucial element of patient care during consultations [43]. Accordingly, tests serve as complementary tools in clinical practice.

Testing in healthcare

In this thesis, a test or testing refers to all healthcare related tests and testing strategies that are used for different purposes and roles [44]. Thus, this thesis extends beyond the use of tests for diagnostic purposes by healthcare providers in the consultation room to encompass healthcare as a whole, including public health.

Healthcare related tests can be used for several purposes: screening, surveillance, risk classification, diagnosis, staging, treatment triage, determination of prognosis and monitoring/follow-up [44, 45]. Examples of these purposes are shown in table 1. A single test can serve multiple purposes. For example, MRI in women at increased risk of, suspected of having, or diagnosed with breast cancer can be used for screening, risk classification, diagnosis, staging, and monitoring/follow-up.

Table 1. Testing purposes and examples

Screening
- Faecal occult blood testing in people aged 55-75 years to screen for colorectal cancer
- Anoscopy in people with HIV to screen for anal intraepithelial neoplasia to reduce the risk of anal cancer-related mortality
- Hip examination in youth care to select infants at high risk of having hip dysplasia

Surveillance
- Influenza surveillance to gain insight into the spread and typology of influenza viruses, and their impact
- Antimicrobial surveillance to understand antibiotic resistance patterns

Risk classification
- Measurement of blood cholesterol levels and blood pressure in primary care patients to stratify the risk of a cardiovascular event
- Bone mineral density measurement using DEXA scanning to determine the risk of an osteoporotic fracture

Diagnosis
- Urine dipstick to diagnose urinary tract infection in primary care
- Amniocentesis including chromosomal testing to rule out trisomy 21 (Down’s syndrome)
- X-ray to diagnose bone fracture
- Vision test to detect visual impairment

Staging
- Histology to stage cancer disease
- CT scanning in patients with breast cancer to detect metastases
- Beck Depression Inventory to assess level of depression

Treatment triage
- Allergen testing in patients with asthma to guide asthma management
- Bacteriological test to guide antibiotic treatment

Prognosis
- 6-minute walk distance test (6MWD) to estimate risk of death in patients with heart failure
- Advanced Dementia Prognostic Tool (ADEPT) to estimate survival in people with dementia

Monitoring/follow-up
- Blood glucose monitoring to monitor diabetes mellitus
- Weight measurement to monitor weight loss therapy
- Spirometry to monitor COPD
- Cardiac ultrasound to follow up patients with heart failure

As illustrated in table 1, there is a variety of tests, including self-tests, laboratory tests, imaging, functional tests, and questionnaires, as well as a variety of settings in which testing can be performed, such as public health, primary care, secondary care and long-term care.

History taking and physical examination can also be considered as tests but are outside the focus of this thesis due to their general nature and routine application. Tests unrelated to healthcare are also outside the scope of this thesis. Such tests include, for example, weight and muscle measurements in gyms, or genealogy tests to trace one’s ancestors.

Scientific evaluation of a test

To assess the value of a healthcare related test, different aspects should be taken into account [46]:
- Analytical performance
- Clinical performance
- Clinical effectiveness
- Cost-effectiveness
- Broader impact

These concepts are elaborated on in box 1.

Box 1. Components of test evaluation

Analytical performance: this refers to the ability of the test to accurately detect or measure a particular measurand. Parameters of analytical performance include:
- trueness: the determination whether the test measures the variable of interest
- precision: the assessment of the reproducibility of the test
- detection limits: a test might not detect a measurand below or above a certain level or might not be specific enough
- cross-reactivity: the influence of factors on the test result beyond the measurand of interest

Clinical performance: this refers to the ability of a test to correctly classify individuals with and without the target condition (such as a disease). This is also called the diagnostic accuracy of a test. Parameters of diagnostic accuracy can be established by comparing the index and reference tests. The index test is the test of interest, while the reference test (also known as the reference standard) is the test to which the index test is compared. The reference test can be the gold standard, but other options (such as the test in usual care/practice) are also used. Clinical performance measures can be obtained by categorizing people with and without the target condition according to their test results in a 2x2 table (table 2):

Table 2. Clinical performance of a test in a 2x2 table

                        People with the     People without the    Total
                        target condition    target condition
Positive test result    TP                  FP                    TP+FP
Negative test result    FN                  TN                    FN+TN
Total                   TP+FN               FP+TN                 Total

TP: true positives, FP: false positives, FN: false negatives, TN: true negatives

Such a table provides insight into the numbers of true positive (TP), true negative (TN), false positive (FP) and false negative (FN) test results. A test can have an inconclusive result as well. Other frequently used parameters of the clinical performance of a test include:
- sensitivity: the probability of getting a positive test result in people with the target condition (TP/(TP+FN))
- specificity: the probability of getting a negative test result in people without the target condition (TN/(TN+FP))
- positive predictive value (PPV): the probability of having the target condition in people with a positive test result (TP/(TP+FP))
- negative predictive value (NPV): the probability of not having the target condition in people with a negative test result (TN/(TN+FN))

Clinical effectiveness (also known as clinical utility): this refers to the ability of a test to improve people-important outcomes.

Cost-effectiveness: this refers to the assessment of changes in costs and people relevant outcomes resulting from the introduction of a test. There are several perspectives from which costs can be determined, such as the individual patient perspective (e.g. costs that patients have to pay to undergo a test), the healthcare perspective (e.g. costs because of time invested by healthcare professionals and other resources needed for performing a test) and the societal perspective (e.g. costs of testing covered by health insurance). In all perspectives, direct costs (such as costs of the tests) and indirect costs can be taken into account. Indirect costs could include travel expenses and costs for childcare for the patient while travelling to the hospital, loss of income, and social security expenses due to absence from work.

Broader impact: this refers to consequences of the test beyond clinical effectiveness and cost-effectiveness, such as acceptability, implementability, and consequences for legal, ethical, and organisational issues.

In addition to the components of test evaluation described above, it is essential to define the role of a new test in comparison to the existing test, as this influences the interpretation of the new test’s value. Various roles are acknowledged [47, 48]:
- Triage
- Replacement of a reference standard or an existing test
- Add-on
- Parallel/combined

These roles are described in detail in table 2.
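The clinical performance measures defined in box 1 follow directly from the four cells of the 2x2 table. The following is a minimal sketch and not part of the original text: the class name, function name and all counts are hypothetical. It also shows the reverse step, in which sensitivity, specificity and an assumed prevalence are expressed as expected numbers of test results per 1000 people tested.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoByTwo:
    """Counts from a 2x2 table of index test result versus target condition."""
    tp: int  # positive test result, target condition present
    fp: int  # positive test result, target condition absent
    fn: int  # negative test result, target condition present
    tn: int  # negative test result, target condition absent

    # Each measure divides by a row or column total, so that total must be non-zero.
    def sensitivity(self) -> float:
        return self.tp / (self.tp + self.fn)

    def specificity(self) -> float:
        return self.tn / (self.tn + self.fp)

    def ppv(self) -> float:
        return self.tp / (self.tp + self.fp)

    def npv(self) -> float:
        return self.tn / (self.tn + self.fn)


def expected_counts(sensitivity: float, specificity: float,
                    prevalence: float, n: int = 1000) -> TwoByTwo:
    """Expected 2x2 counts among n people tested, rounded to whole people."""
    with_condition = round(n * prevalence)
    without_condition = n - with_condition
    tp = round(sensitivity * with_condition)
    tn = round(specificity * without_condition)
    return TwoByTwo(tp=tp, fp=without_condition - tn,
                    fn=with_condition - tp, tn=tn)


# Hypothetical study: 1000 people tested, 100 of whom have the target condition.
table = TwoByTwo(tp=90, fp=180, fn=10, tn=720)
print(f"sensitivity: {table.sensitivity():.2f}")
print(f"specificity: {table.specificity():.2f}")
print(f"PPV: {table.ppv():.2f}")
print(f"NPV: {table.npv():.2f}")

# Reverse direction: from accuracy measures and an assumed prevalence of 10%
# back to expected counts per 1000 people tested.
print(expected_counts(sensitivity=0.90, specificity=0.80, prevalence=0.10))
```

In this hypothetical example, sensitivity and specificity are properties of the test, whereas the predictive values also depend on how common the target condition is in the tested population. Inconclusive test results are not modelled in this sketch.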
All of the above factors can be relevant when considering the benefits and harms of testing in specific circumstances and for specific populations.

Impact of inappropriate testing

There is considerable variation in test usage in practice, with both underuse and overuse of tests being common [49, 50]. Sullivan et al. conducted a systematic review on over- and undertesting in primary care, in which they explored the frequency of inappropriate ordering of 103 diagnostic tests in relation to their respective guidelines. The results showed a wide range of non-compliance with the testing recommendations in guidelines (median: 40.0%; range: 0.2-100%). Examples of underuse (tests that were inappropriately not performed) include echocardiography for heart failure (89% underuse) or atrial fibrillation (56% underuse), and pulmonary function tests for COPD (73% underuse). Examples of overuse (tests that were inappropriately performed) include echocardiography in people with no symptoms or signs of cardiovascular disease (77-92% overuse), urine cultures (77% overuse), upper gastrointestinal endoscopy (37-54% overuse) and colonoscopy (52% overuse). In addition, an increase in overuse of CT and MRI scans for headaches was seen in the United States [51].

Table 2. Roles of a new test compared to an existing test

Triage
Explanation: The new test is intended to be used before the existing test, and the existing test is then solely offered to patients with a specific result on the new test. The new test may have reduced accuracy compared to the existing test, but it can offer other advantages such as less burden or costs.
Examples:
- Screening all persons aged 55-75 years for faecal occult blood. Only those who have a positive test will receive colonoscopy

Replacement
Explanation: The new test is intended to replace the existing test when it is more accurate or offers other advantages (such as reduced burden or costs) compared to the existing test.
Examples:
- Magnetic resonance imaging (MRI) instead of mammography in women suspected of having familial breast cancer
- Polymerase chain reaction testing to detect herpes simplex virus instead of viral culture

Add-on
Explanation: The new test is intended to be performed after the existing test, which restricts the test’s application to a subset of people, for instance those who test positive on the existing test. Implementing the new test may increase the accuracy of the testing pathway, but it could also have drawbacks such as increased burden and costs.
Examples:
- Positron emission tomography (PET) in patients with cancer after having a negative computed tomography (CT) scan for metastases

Parallel/combined
Explanation: The new test is intended to be used together with an existing test.
Examples:
- Determination of eGFR and albumin creatinine ratio to diagnose chronic kidney disease

Healthcare spending on laboratory diagnostics among both American and German oncologists and cardiologists was investigated by Rohr et al.
They found that laboratory diagnostics accounted for 2.3% and 1.4% of healthcare spending in the United States and Germany respectively, while influencing 64% and 67% of clinical decisions [52].

Incorrect testing can result in high healthcare costs and in unnecessary test burden and anxiety [53]. Physicians acknowledge that unnecessary testing is a significant problem. Reasons for unnecessary test ordering include concerns about liability, providing reassurance, patient demands, keeping patients satisfied, and insufficient time to consult with patients. Most physicians feel a responsibility to prevent unnecessary testing. A majority of physicians also state that providing evidence-based recommendations in a format intended for patient communication (e.g. with icon arrays or graphs) would be effective in reducing unnecessary testing [54].

Challenges in guideline development about testing

Guideline panel members face challenges when interpreting test accuracy measures, such as sensitivity and specificity. Recalculating these measures into the numbers of true positives, true negatives, false positives and false negatives per 1000 people tested makes them easier to understand [55]. Formulating key questions about testing that include people-important outcomes can be challenging as well. Moreover, there are barriers in searching and synthesizing the evidence, such as a lack of valid search filters, complex meta-analysis methods and the inclusion of outcomes beyond diagnostic accuracy. Interpreting and applying GRADE criteria for the evaluation of the clinical performance of a test can be difficult because the assessment of inconsistency and imprecision differs from that in intervention studies evaluating the effectiveness of a treatment [56]. Formulating recommendations about testing is challenging due to a lack of evidence, conflicting expert opinions, and insufficient knowledge and competencies [57].

Given the numerous challenges, it is suspected that consequences of testing on people-important outcomes are hardly considered when developing recommendations on healthcare related testing.

Aim and research questions

Developing guidelines comes with various issues, particularly when focusing on developing recommendations about testing, as described in the previous sections. There are indications from evidence and from the experience of guideline methodologists that the process of guideline development related to testing is suboptimal, which may lead to inaccurate consideration of the benefits and harms of testing. It is not yet known which knowledge or tools are necessary and/or helpful in appropriately developing guideline recommendations about testing.
Therefore, this thesis focuses on barriers and solutions in the development of guideline recommendations about testing, with specific attention to the required expertise for developing these recommendations and tools to facilitate this process. The aim of this thesis is to facilitate and improve guideline development concerning healthcare related testing. The first objective is to identify problems by exploring current practice and challenges in developing guidelines for healthcare related testing. The second objective is to improve this process by identifying the knowledge needed to develop testing recommendations in guidelines. The third objective is to facilitate the guideline development process by developing and testing a tool to support the formulation of appropriate guideline questions on healthcare related testing.

This has led to the following research questions:
1. What are challenges and possible solutions when assessing the certainty of evidence of a test-management pathway?
2. Which types of evidence (diagnostic accuracy, burden of the test, natural course, treatment effectiveness, link between test result and administration of treatment) are used to support guideline recommendations about testing?
3. What is the minimum knowledge required for guideline panel members involved in developing recommendations about testing?
4. Can a step-by-step guide aid guideline developers in formulating key questions about testing?

Outline of the thesis

After this introduction chapter, chapter 2 presents findings from a case study on the application of GRADE for tests and test strategies, including the identification of methodological challenges and suggestions for solutions to these challenges (research question 1). This study evaluated the full test-management pathway for the net benefit of IgE (immunoglobulin E) in the diagnosis of allergic rhinitis.

Chapter 3 presents a systematic document analysis including quality assessment of publicly available guidelines on three diagnostic tests: C-reactive protein, colonoscopy, and fractional exhaled nitric oxide. This study evaluated the incorporation of the various components of the test-management pathway in the evidence base for the guideline recommendation, including factors contributing to the comprehensiveness of the evidence as well as explanations for possible differences between the guidelines (research question 2).

Chapter 4 presents the results of a developmental study with the aim of defining the minimum knowledge required by guideline panel members who are involved in developing recommendations about testing. This study used a literature review and expert interviews to formulate a list of required knowledge components (research question 3).
During the development and presentation of the required knowledge components, it became clear that practical examples of test-management pathways were needed. Chapter 5 provides detailed examples that can aid in the understanding and implementation of the test-management pathway concept. Chapter 6 presents the outcomes of developing and testing a step-by-step guide for guideline developers. The guide’s objective was to assist guideline panel members in formulating key questions regarding testing (research question 4). Finally, chapter 7 offers a general discussion summarising the results of the studies, reflecting on these results, and outlining implications for practice. Additionally, it provides suggestions for further research.


Chapter 2. Applying GRADE for diagnosis revealed methodological challenges: an illustrative example for guideline developers

Mariska Tuut, Hans de Beer, Jako Burgers, Erik-Jonas van de Griendt, Trudy van der Weijden, Miranda Langendam

Journal of Clinical Epidemiology 2021; https://doi.org/10.1016/j.jclinepi.2020.11.021

Abstract

Objective: To identify challenges in the application of GRADE for diagnosis when assessing the certainty of evidence in the test-treatment strategy (diagnostic accuracy, test burden, management effectiveness, natural course, linked evidence) in an illustrative example and to propose solutions to these challenges.

Study design: A case study in applying GRADE for diagnosis that looked at the added value of IgE for diagnosing allergic rhinitis.

Results: Evaluation of the full test-treatment strategy showed a lack of (high-quality) evidence for all elements. In our example, we found a lack of evidence for test burden, natural course and the link between test result and clinical management. Overall, systematically reviewing the evidence for all elements of a test-treatment strategy is more time-consuming than considering only test accuracy results and management effectiveness. To increase efficiency, the guideline panel could determine which critical elements of the test-treatment strategy need a systematic review of the evidence. For less critical elements, a guideline panel can rely on grey literature and professional expertise.

Conclusion: A lack of high-quality evidence and the time investment required to assess the full test-treatment strategy create challenges in applying GRADE for diagnosis. Discussion within guideline panels about critical elements that need to be reviewed might help.

Keywords: GRADE, diagnosis, guidelines, evidence, medical tests, systematic review

Introduction

Clinicians use tests to ascertain or reject a clinical diagnosis [1]. The clinical value of a test depends on various elements: the patient population characteristics (e.g. prevalence of the disease), test characteristics (e.g. sensitivity and specificity) and its downstream consequences on patient-relevant outcomes (e.g. test burden, natural course of the disease and management following the test results) [2]. Since direct evidence evaluating the impact of tests on patient-important outcomes (diagnostic randomised trials) is scarce, different types of evidence (e.g. for diagnostic accuracy and management effectiveness) need to be assessed and linked. Clinicians often have a limited ability to assess the value of a test in clinical practice [3, 4]. Therefore, clinical practice guidelines (CPGs) have been developed to provide decision support to clinicians and patients [5]. The GRADE approach for diagnostic tests and test strategies facilitates this process by linking the elements of a test-treatment strategy and assessing the certainty of the evidence for each element [6-8]. It is challenging to evaluate diagnostic tests appropriately (e.g. assessing the certainty of the evidence, including patient-important outcomes in evaluating test accuracy) [9, 10]. In this study, we aimed to identify the challenges of applying GRADE for diagnosis for all elements of the test-treatment strategy. We assessed the certainty of the evidence in an illustrative example and proposed solutions to overcome the barriers. This study may serve as an example for systematic reviewers and guideline developers.
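The way a test's clinical value depends on prevalence, sensitivity and specificity can be made concrete with a small calculation. The Python sketch below is an illustration only and is not part of the study: the function name, the cohort size of 1,000 and the input values are hypothetical, and they are not estimates for sIgE testing. It converts the three inputs into the expected numbers of true positive, false negative, true negative and false positive results, which are the categories whose downstream consequences a guideline panel would then consider.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExpectedResults:
    """Expected number of patients in each test result category."""
    true_positives: float
    false_negatives: float
    true_negatives: float
    false_positives: float


def expected_results(prevalence, sensitivity, specificity, cohort=1000):
    """Split a hypothetical cohort into the four test result categories.

    All three proportions must lie between 0 and 1.
    """
    for name, value in (("prevalence", prevalence),
                        ("sensitivity", sensitivity),
                        ("specificity", specificity)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")

    with_condition = prevalence * cohort
    without_condition = cohort - with_condition

    # Sensitivity applies to patients with the condition,
    # specificity to patients without it.
    true_positives = sensitivity * with_condition
    true_negatives = specificity * without_condition

    return ExpectedResults(
        true_positives=true_positives,
        false_negatives=with_condition - true_positives,
        true_negatives=true_negatives,
        false_positives=without_condition - true_negatives,
    )


# Hypothetical inputs chosen for illustration only.
example = expected_results(prevalence=0.5, sensitivity=0.8, specificity=0.9)
print(example)
```

Expressing the results per 1,000 patients keeps the four categories on one scale, so that the same sensitivity and specificity can be compared across settings with different prevalence.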
Methods

Clinical question

The illustrative example is the clinical question: what is the value of specific immunoglobulin E (sIgE) blood testing as an add-on test to history taking (I) compared to history taking alone (C) in patients suspected of having allergic rhinitis (AR) in primary care (P), with relief of nasal or ocular symptoms as critical outcomes (O) [8, 11]? Concentration, sleep problems, work/school absence and quality of life (QoL) were considered important outcomes [12]. Consequences of true positive, true negative, false positive, false negative, and failed test results were discussed. We formulated PICOs for each element of the test-treatment strategy (see table 1).

Search strategy

Detailed methods for searching and assessing evidence for each evidence element are presented in table 2. We searched the Medline and Embase databases to retrieve relevant evidence (Appendix 1). We searched for publications from 1998 to 11 January 2019 (because sIgE testing and non-sedating antihistamines have been in use since then) [12]. We used combinations of MeSH (medical subject headings) terms and key words. We did not restrict the search by setting, but limited it to publications in English, German or Dutch.

Table 1. PICOs per sub question

| Element | Patient (P) | Intervention (I) | Control (C) | Outcome (O) |
|---|---|---|---|---|
| Diagnostic accuracy | Patients suspected of having allergic rhinitis in primary care | sIgE-test for at least one of the allergens: grass pollen; birch/tree pollen; herb pollen (any); house dust mite (any Dermatophagoides); mould; cat epithelium; dog epithelium | Nasal provocation of allergens | Accuracy measures (sensitivity, specificity); the target condition is allergic rhinitis, measured with nasal provocation (nasal challenge) |
| Test burden | Adults/children in general | Any venipuncture for diagnostic or screening purposes | - | Complications of testing (vasovagal reactions, pain, nerve injuries, haematoma) |
| Management | Patients with confirmed allergic rhinitis (doctor diagnosed/sIgE-testing/provocation); exclusion: self-diagnosed allergic rhinitis | Allergen avoidance measures; antihistamines; nasal corticosteroids | Other treatment; no treatment; placebo | Relief of nasal symptoms; relief of ocular symptoms; concentration; sleep problems; work/school absence; quality of life (QoL) |
| Natural course | Patients with confirmed allergic rhinitis (doctor diagnosed/sIgE-testing/provocation); exclusion: self-diagnosed allergic rhinitis | - | - | Relief of nasal symptoms; relief of ocular symptoms; concentration; sleep problems; work/school absence; quality of life (QoL) |
| Link between test and management | Patients with a positive sIgE-test result | - | - | Allergen avoidance; use of corticosteroids; use of antihistamines; compliance; treatment difficulties |
