Aernoud Fiolet

Text-mining in electronic healthcare records can be used for screening and data collection

INTRODUCTION

Clinical research requires highly detailed information on large numbers of subjects, often acquired by many investigators and supporting staff. In particular, prospective research, such as registries and randomized clinical trials (RCTs), needs to comply with high standards of data validity. 1,2 Scientific and regulatory requirements make such endeavors laborious and increase costs to a level only large companies are able to meet. Cardiovascular outcome trials with moderate to low absolute risks nowadays require over 10,000 participants and are estimated to cost between 35,000 and 45,000 US dollars per participant, with total costs up to half a billion US dollars for conduct. 3,4 A major part of these costs is attributable to participant recruitment and follow-up, which consists largely of data collection. 5,6

Standing practice for clinical trials is that dedicated personnel enter source data in distinct (electronic) clinical report forms (CRFs). These data, however, are generally already collected in clinical care and available in electronic healthcare records (EHRs), so that trial data entry creates duplicate copies of data that already exist (Fig. 1A). Automated EHR data-mining may provide a valuable method to complement or even substitute current data collection methods 7, which could save up to one-third of recruitment costs. 8 In recent years, several supervised patient-diagnosis registries with labeled clinical data emerged to improve trial efficiency. 9 The use of automatically collected EHR data in trials, however, is still very limited. 10

Conventional data collection methods generally involve retrieving information through researcher-patient interviews and manual data extraction. After retrieval, data are entered manually in electronic data capture (EDC) systems as part of CRFs.
Data quality is guaranteed up to a certain level by automated control processes and internal and external monitoring. 11 If EHR data are to be used to identify participants or as an alternative data source, these data should be of sufficient quality. High data quality is paramount, yet the required level will differ per objective and depends on the nature of the data. Outcome data that are used to estimate a treatment effect require higher fidelity than baseline data. 12

First, we hypothesized that patients eligible for trial participation can be effectively identified from information already present in EHRs using automated text-mining. Second, we hypothesized that the majority of data collected for the purpose of the trial is also already available in EHRs. If extracted automatically with acceptable
