Using pooled electronic health records data to conduct pharmacoepidemiology safety studies: Challenges and lessons learned

We assessed the suitability of pooled electronic health record (EHR) data from clinical research networks (CRNs) of the patient‐centered outcomes research network to conduct studies of the association between tumor necrosis factor inhibitors (TNFi) and infections.

• Only a minority of hospitalized infection outcomes identified in linked claims data were observed in EHR data.
• Extreme care should be exercised when using EHR data alone for pharmacoepidemiology studies of biologic agents with a new user design. For both exposure and outcome ascertainment, EHR data would benefit from supplementation by other sources (e.g., linked administrative claims, patient self-reported data).

Plain Language Summary
We assessed the suitability of pooling electronic health record (EHR) data from clinical research networks (CRNs) of the patient-centered outcomes research network to conduct studies of hospitalized infection among patients starting treatment with tumor necrosis factor inhibitors (TNFi). For a subset of patients, we linked EHR data to Centers for Medicare and Medicaid Services (CMS) administrative data to assess the accuracy and completeness of EHR records. The study included 45 483 patients prescribed TNFi, and 1416 of these patients were linked to their CMS data. When a new prescription for TNFi appeared in the EHR, 44% of the patients did not actually receive the medication according to CMS data. Patients identified as newly starting a TNFi in EHR data were correctly identified as having not received the medication before 85%-95% of the time according to CMS data. Information about TNFi prescription refills in the EHR data was frequently missing and not usable. Compared to EHR data, there were 2-8 times as many infections identified in CMS data, presumably because hospitalizations occurred at institutions outside the EHR systems. Overall, EHR data are challenging to use for medication safety studies and would benefit from supplementation by other data sources.

| INTRODUCTION
The advent of targeted therapies over the past two decades has revolutionized the care of autoimmune diseases, such as rheumatoid arthritis (RA), psoriasis (PSO), psoriatic arthritis (PsA), ankylosing spondylitis (AS), Crohn's disease (CD), ulcerative colitis (UC), and juvenile idiopathic arthritis (JIA). While all therapeutic agents demonstrate efficacy and safety in clinical trials prior to approval, much remains to be learned after approval, especially with respect to safety events that are uncommon, have a long latency period, or differentially affect patients with specific characteristics. Therefore, the safety evaluation of therapeutics in routine clinical practice is an important and informative undertaking.
There are many potential data sources to assess the real-world safety of therapeutic agents. Administrative claims data from large health plans have been used extensively. These data have the distinct advantage of capturing all clinical encounters and medications dispensed but contain very little clinical information (e.g., disease activity assessments). Given the strong incentives to adopt electronic health records (EHR) in the United States, 1 real-world data derived from EHRs are a large potential source of existing data to assess medication safety. With this recognition, the Patient-Centered Outcomes Research Institute (PCORI) developed the patient-centered outcomes research network (PCORnet) in which clinical research networks (CRNs) pool EHR data from multiple health systems and clinical networks representing about 66 million people in the United States. 2 The data from these health systems are standardized within a common data model (CDM). The CDM creates common variable definitions and organizational standards for data that allow for consistent data use across CRNs. 2 PCORnet is still early in its development, and PCORnet data have not previously been used to assess medication safety in autoimmune diseases.
Two of the challenges in using EHR data to assess medication safety are the ascertainment of medication exposures and safety outcomes. EHRs contain prescription data, but new medication initiations (which are preferred for most safety analyses) may not be easily identified. Information about continued medication exposure after initiation may not be reliable due to inaccurate or missing prescription refill information and the absence of medication discontinuation documentation in the EHR. Unlike claims data, which capture all payable services provided, EHR data offer uncertain completeness of safety outcome ascertainment. For example, patients may receive care from a specialist inside a large health system, but for an acute event such as a hospitalized infection, they may receive care at a hospital outside that clinical network.
Person-level linkage of EHR data to claims data could provide a more complete record of prescriptions filled and hospitalizations and serve as a "gold standard" for comparison with EHR data alone. Previously published studies have used administrative claims data and national registers to evaluate the rate of hospitalized infection for patients initiating therapy with tumor necrosis factor inhibitors (TNFi). 3,4 These estimates from claims data and registers can also be used as an external benchmark to assess the completeness of capturing infections using PCORnet. Herein, we report the results of a demonstration project to assess the suitability of CRN data to study the safety of biologic therapy, specifically the association between TNFi and subsequent hospitalized infection.

Table S1 for list of diagnosis codes). The primary study medications (exposures) of interest were TNFi, including adalimumab, certolizumab, etanercept, golimumab, and infliximab.

| Linkage of CMS data
We performed a person-specific linkage between patients in the CRN data and their Centers for Medicare and Medicaid Services (CMS) Chronic Conditions Warehouse fee-for-service administrative claims data from 2014 to 2019 in a subset of the cohort where a limited dataset was shared and linkage was permitted. Patients from Duke, MUSC, UCSD, UNC, and VUMC were linked to their CMS data using a rules-based matching algorithm that included date of birth, sex, healthcare encounter dates and types, and autoimmune disease diagnoses. 5 The total number of exact matches for encounter dates was used to break any ties.
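The tie-breaking step can be illustrated with a minimal sketch. It is not the study's algorithm: it assumes that date of birth and sex must agree exactly, it omits encounter types and diagnoses, and it leaves unresolved ties unlinked. The names `PatientRecord` and `link_to_claims` are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class PatientRecord:
    person_id: str
    birth_date: date
    sex: str
    encounter_dates: FrozenSet[date]


def link_to_claims(ehr: PatientRecord,
                   candidates: List[PatientRecord]) -> Optional[str]:
    """Return the person_id of the best claims match, or None if unresolved."""
    # Assumed blocking rule: date of birth and sex must agree exactly.
    eligible = [c for c in candidates
                if c.birth_date == ehr.birth_date and c.sex == ehr.sex]
    # Score each candidate by the number of exactly matching encounter dates.
    scored = sorted(
        ((len(ehr.encounter_dates & c.encounter_dates), c.person_id)
         for c in eligible),
        reverse=True,
    )
    if not scored or scored[0][0] == 0:
        return None  # no eligible candidate shares any encounter date
    if len(scored) > 1 and scored[1][0] == scored[0][0]:
        return None  # assumption: a tie that cannot be broken stays unlinked
    return scored[0][1]
```

Leaving ambiguous candidates unlinked favors specificity over linkage yield, which is a design choice of this sketch rather than something reported in the study.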

| New user definitions in EHR data
Generally, a new user design (i.e., restricted to patients with incident medication use) is preferred in medication safety studies. 6 This is particularly true in studies of infection associated with biologics because the risk of infection is known to be higher during the first 6 months following initiation of treatment with biologics. 7 Because the most appropriate method of identifying new medication users in EHR data is uncertain, we assessed five definitions that required different previous medical encounters or prescriptions to occur at least 6 months prior to the first observed prescription for a specific medication ("index EHR prescription"). The intent was to avoid misclassifying prevalent medication use as incident medication use due to left censoring (i.e., the specific medication was first prescribed to the patient prior to the index EHR prescription observed in the dataset). In increasing order of anticipated specificity, the definitions were labeled Definition 0 through Definition 4. Definition 0 had no additional requirements (i.e., the index EHR prescription was assumed to be a new medication initiation). Definition 1 required ≥1 medical encounter of any type occurring >6 months prior to the index prescription. Definition 2 required ≥1 prescription for any other medication occurring >6 months prior to the index prescription. Definition 3 required ≥1 medical encounter for the patient's autoimmune disease >6 months prior to the index prescription. Definition 4 required ≥1 prescription for a different medication to treat autoimmune disease (see Table S2) occurring >6 months prior to the index prescription.
When the first observed prescription for a specific medication (index EHR prescription) did not meet the definition, the patient was not considered to be a new user of that specific medication but could subsequently be a new user of other medications if the definition was satisfied. If the definition was met, the date of the index prescription was the start of follow-up (index date).
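The five definitions can be expressed as a simple lookback check. The sketch below is illustrative only: it approximates "6 months" as 183 days, and the `Event` structure and `meets_definition` function are hypothetical names rather than part of the PCORnet CDM.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Iterable

LOOKBACK = timedelta(days=183)  # assumption: "6 months" approximated as 183 days


@dataclass(frozen=True)
class Event:
    when: date
    kind: str         # "encounter", or "prescription" for a non-index medication
    autoimmune: bool  # encounter for, or medication treating, the autoimmune disease


def meets_definition(definition: int, index_date: date,
                     events: Iterable[Event]) -> bool:
    """Check whether an index EHR prescription satisfies Definition 0 to 4."""
    if definition == 0:
        return True  # Definition 0 has no additional requirement
    rules = {
        1: lambda e: e.kind == "encounter",
        2: lambda e: e.kind == "prescription",
        3: lambda e: e.kind == "encounter" and e.autoimmune,
        4: lambda e: e.kind == "prescription" and e.autoimmune,
    }
    cutoff = index_date - LOOKBACK
    # Only events strictly earlier than the cutoff count as ">6 months prior".
    return any(rules[definition](e) for e in events if e.when < cutoff)
```

For example, a patient whose only earlier record is a general medical encounter 9 months before the index prescription satisfies Definitions 0 and 1 but not Definitions 2 through 4.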

| Completeness of EHR prescribing refill information
The completeness of data on medication prescribing within the various partner PCORnet data systems was previously unknown and was assessed during the design phase of this study. The presence and quality of prescription refill data were very poor; greater than 80% of all medication prescriptions from all CRNs had either zero refills or missing refill information. Hence, refill data were deemed unreliable for determining continuing medication use or medication discontinuation dates, making as-exposed or as-observed analyses infeasible. Therefore, all medication exposure episodes were censored 365 days after the index date (the maximum duration for a single prescription in the United States) irrespective of any refill information, consistent with an intention-to-treat analysis.

| Claims assessment of index EHR prescriptions
To assess the accuracy of each of the new user definitions, we assessed patients with linked CMS claims data for at least 12 months prior to the index EHR prescription. New user misclassification was defined by any dispensing or infusion of the medication observed in the claims data either in the 12 months prior to or at any time prior to the index prescription (i.e., using all available data).
To assess for true medication use following index EHR TNFi prescriptions that satisfied new user Definition 1, among patients with Part D (prescription medication coverage) we ascertained the presence of corresponding CMS claims between 90 days prior to and 365 days after the index EHR TNFi prescription. Among patients with at least one corresponding claim during this time period, we assessed the proportion of patients with continuous claims (defined as a <30-day gap between the days supplied by the previous claim and each subsequent claim) from the first claim to 365 days after the index EHR TNFi prescription (consistent with our intention-to-treat analysis).
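The continuous-claims criterion can be sketched as a gap check over sorted dispensing claims. This is a minimal illustration, not the study code. It assumes that early refills extend coverage, and that the interval between the end of the last supply and day 365 is also held to the <30-day rule. The function name and input layout are hypothetical.

```python
from datetime import date, timedelta
from typing import List, Tuple


def has_continuous_claims(claims: List[Tuple[date, int]], index_date: date,
                          max_gap_days: int = 30,
                          horizon_days: int = 365) -> bool:
    """claims: (fill_date, days_supplied) pairs in any order."""
    if not claims:
        return False
    ordered = sorted(claims)
    first_fill, first_supply = ordered[0]
    covered_until = first_fill + timedelta(days=first_supply)
    for fill_date, days_supplied in ordered[1:]:
        # Gap between the end of the previous supply and the next fill.
        if (fill_date - covered_until).days >= max_gap_days:
            return False
        # Assumption: an early refill extends coverage instead of shortening it.
        covered_until = max(covered_until,
                            fill_date + timedelta(days=days_supplied))
    # Assumption: the tail up to day 365 must also satisfy the gap rule.
    horizon = index_date + timedelta(days=horizon_days)
    return (horizon - covered_until).days < max_gap_days
```

Under these assumptions, a patient with 28-day fills every 28 days through day 365 is classified as continuous, whereas a patient with one fill in month 1 and the next in month 4 is not.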

| Outcome assessment
Hospitalized infection outcomes were defined by inpatient hospital encounters with an associated diagnosis code for infection, including all infectious agents (e.g., bacterial, viral, and fungal). We determined crude rates of hospitalized infection following new use of a specific TNFi (i.e., not restricted to TNFi-naïve patients), stratified by autoimmune disease. Follow-up for a specific TNFi was censored for any of the following: end of observation in the data (e.g., most recent medical encounter); new prescription for a clinically incompatible concomitant medication (e.g., concurrent use of >1 biologic) because medication stop dates in the EHR were uncertain; or experiencing the outcome (i.e., only one outcome event per treatment initiation was allowed).
More than one new TNFi exposure episode per patient was allowed.
Infection rates for specific TNFi were pooled together to assess new TNFi exposure overall.
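The censoring rules and the crude rate calculation can be sketched as follows. The code assumes a 365-day maximum follow-up (the intention-to-treat window described above) and expresses rates per 100 person-years using 365.25 days per year. Both the function names and the `Episode` structure are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Iterable, Optional

MAX_FOLLOW_UP = timedelta(days=365)  # exposure episodes censored at 365 days


@dataclass(frozen=True)
class Episode:
    index_date: date
    last_encounter: date                    # end of observation in the data
    incompatible_rx: Optional[date] = None  # e.g., a second biologic prescribed
    outcome: Optional[date] = None          # first hospitalized infection


def follow_up_end(ep: Episode) -> date:
    """Earliest of: day 365, end of observation, incompatible Rx, outcome."""
    stops = [ep.index_date + MAX_FOLLOW_UP, ep.last_encounter]
    stops += [d for d in (ep.incompatible_rx, ep.outcome)
              if d is not None and d >= ep.index_date]
    return min(stops)


def crude_rate_per_100py(episodes: Iterable[Episode]) -> float:
    """Events per 100 person-years, at most one event per exposure episode."""
    events, person_days = 0, 0
    for ep in episodes:
        end = follow_up_end(ep)
        person_days += (end - ep.index_date).days
        # The outcome counts only if it falls within follow-up.
        if ep.outcome is not None and ep.index_date <= ep.outcome <= end:
            events += 1
    return 100.0 * events / (person_days / 365.25)
```

Because a patient may contribute more than one TNFi exposure episode, each episode is passed separately; pooling across specific TNFi amounts to concatenating the episode lists before calling `crude_rate_per_100py`.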
Restricted to patients with successful linkage of CRN and CMS data, we identified all hospitalized infections in either data source in the 12 months following the EHR index TNFi prescriptions (without censoring rules) to assess the completeness of outcome ascertainment in the CRN data.
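Completeness of outcome ascertainment can be summarized by cross-classifying each linked patient by data source. The sketch below simplifies the analysis to one flag per patient per source (any hospitalized infection within 12 months of the index prescription); the function name and the returned keys are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def ascertainment_summary(flags: Iterable[Tuple[bool, bool]]) -> Dict[str, float]:
    """flags: (found_in_ehr, found_in_cms) for each linked patient."""
    counts: Counter = Counter()
    for in_ehr, in_cms in flags:
        if in_ehr and in_cms:
            counts["both"] += 1
        elif in_ehr:
            counts["ehr_only"] += 1
        elif in_cms:
            counts["cms_only"] += 1
    either = counts["both"] + counts["ehr_only"] + counts["cms_only"]
    ehr_total = counts["both"] + counts["ehr_only"]
    return {
        "both": counts["both"],
        "ehr_only": counts["ehr_only"],
        "cms_only": counts["cms_only"],
        # Share of infections seen in either source that the EHR captured.
        "ehr_capture": ehr_total / either if either else float("nan"),
    }
```

A low `ehr_capture` value corresponds to the situation described in this study, where most hospitalized infections appeared only in the claims data.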

| RESULTS
Overall, 45 483 patients were identified as TNFi users and included. Their characteristics are shown in Table 1. Using a linkage algorithm, we identified 1416 patients with a match between their CRN data and CMS data and who had an index TNFi prescription in the CRN data. Table 1 shows that patients with successful linkage were older compared to the study patients overall.

| Assessment of new user definitions in EHR data
We assessed the proportion of patients who were included by the dif-

| Claims assessment of index EHR prescriptions
Results from linked claims data showed varying degrees of misclassification among the different medications and new user definitions (Table 2). We assessed 499 index EHR prescriptions for TNFi that met new user Definition 1 and had corresponding linked CMS claims (Table 3).
Overall, the proportion of patients with at least one claim following the index EHR prescription was 56% and ranged from a low of 42% for adalimumab to a high of 82% for infliximab. Of patients with observed claims for the index EHR TNFi, 22% had continuous claims

| DISCUSSION
This demonstration project was designed to assess the feasibility of conducting comparative effectiveness and safety research using the   was anticipated to be uncommon and was not assessed in this study.
EHR medication prescription refill data were very incomplete. Greater than 80% of all observed medication prescriptions had either zero refills or missing refill information. Among patients with claims following the index EHR TNFi prescriptions, only 22% had uninterrupted use for at least 12 months. Therefore, our use of an intention-to-treat approach that assumed use for 12 months in the absence of refill data led to substantial misclassification of true exposure during the follow-up period.
We observed varying amounts of misclassification of new medication users depending on the definition applied. Our least specific definition resulted in 3%-24% misclassification across medications, and the most specific definition resulted in 4%-16% misclassification across medications but with substantially decreased sensitivity to identify true new users. We did not observe substantial differences in crude hospital-

TABLE 4 Challenges and points to consider when using EHR data to conduct pharmacoepidemiology safety studies.
Biologics frequently require prior authorization and are dispensed from specialty pharmacies in the US. These medications often cannot be e-prescribed, and EHR records of prescriptions are therefore not accurate with respect to true medication initiations and refills of ongoing medications.
Updates to a CDM may create difficulties in pooling data if all data partners do not update at the same time or in a standardized and coordinated fashion. This may create substantial mapping problems within each CRN, which often cannot be adequately resolved. Data partners may have difficulty trouble-shooting issues from prior CDM versions if an update occurs after dataset creation.
Data elements contained within a CDM may not be exhaustively mapped for all potential variables (e.g., hemoglobin values are mapped to CDM laboratory test results, but rheumatoid factor tests are not). Data elements not included within a CDM may be extremely difficult to pool, even if stored as discrete data elements within each data network (e.g., physician identifiers such as the National Provider Identifier (NPI) number or physician specialty).
Patients receiving subspecialty care for a chronic illness are very likely to receive additional medical care outside of the healthcare system of their subspecialty provider. This may be especially true for patients seen at university medical centers, which were the source of the majority of the EHR data used in this study. Similarly, these patients may be very likely to present at the nearest hospital if they have an acute event (e.g., serious infection) rather than at the hospital associated with the healthcare system where they receive their subspecialty care.
The limitations of left censoring may not always be obvious. Patients may appear in the EHR data, but only for a limited number of data elements (i.e., not fully observable). Modifications to the EHR itself for the delivery of care may create inaccuracies in the interpretation of observed data (e.g., data entry dates of historical information).
Although not attempted in this study, accurate ascertainment of the use of systemic glucocorticoids in the treatment of chronic autoimmune diseases is anticipated to be challenging owing in part to "as-needed" physician prescribing and patient self-administration.
Abbreviations: CDM, common data model; EHR, electronic health record; US, United States.