A process to deduplicate individuals for regional chronic disease prevalence estimates using a distributed data network of electronic health records

Abstract

Introduction: Learning health systems can help estimate chronic disease prevalence through distributed data networks (DDNs). Concerns remain about bias introduced to DDN prevalence estimates when individuals seeking care across systems are counted multiple times. This paper describes a process to deduplicate individuals for DDN prevalence estimates.

Methods: We operationalized a two‐step deduplication process, leveraging health information exchange (HIE)‐assigned network identifiers, within the Colorado Health Observation Regional Data Service (CHORDS) DDN. We generated prevalence estimates for type 1 and type 2 diabetes among pediatric patients (0‐17 years) with at least one 2017 encounter in one of two geographically‐proximate DDN partners. We assessed the extent of cross‐system duplication and its effect on prevalence estimates.

Results: We identified 218 437 unique pediatric patients seen across systems during 2017, including 7628 (3.5%) seen in both. We found no measurable difference in prevalence after deduplication. The number of cases we identified differed slightly by data reconciliation strategy. Concordance of linked patients' demographic attributes varied by attribute.

Conclusions: We implemented an HIE‐dependent, extensible process that deduplicates individuals for less biased prevalence estimates in a DDN. Our null pilot findings have limited generalizability. Overlap was small and likely insufficient to influence prevalence estimates. Other factors, including the number and size of partners, the matching algorithm, and the electronic phenotype may influence the degree of deduplication bias. Additional use cases may help improve understanding of duplication bias and reveal other principles and insights. This study informed how DDNs could support learning health systems' response to public health challenges and improve regional health.

electronic health records, medical record linkage, network, public health informatics, public health surveillance

| INTRODUCTION
Learning health systems (LHSs) that leverage data for rapid, continuous improvement operate amid broader secular epidemics of chronic disease and substance use that exceed any one healthcare system's ability to address. [1][2][3][4] Public health problems freely transcend county boundaries and provider networks. A nationwide LHS, based on a federated data sharing model, 5 proposes to combine LHS concepts with established public health strategies, such as estimating disease prevalence. 6 One common LHS challenge is reconciling fragmented data collected across an ecosystem of electronic health records (EHRs).
Record fragmentation limits a single system's ability to learn from patients' experiences and outcomes at the individual or population level. A treatment (or exposure) may be recorded in one healthcare system, while related outcomes may be recorded in another. Conceptually, this identity management (IM) challenge includes several component activities: (a) uniquely identifying and linking individuals across multiple data sources and distinct healthcare organizations, (b) aggregating individual-level health data from multiple sources, and (c) reconciling data and discrepancies across sources (eg, removing duplicates, resolving changing residence data over time). For example, while healthcare organizations identify unique individuals and assign medical record numbers using internal IM tools, health information exchanges (HIEs) facilitate Health Insurance Portability and Accountability Act (HIPAA)-compliant data sharing from one covered entity to another using cross-entity IM (eg, master patient index). 7,8 EHR distributed data networks (DDNs) can leverage federal investments for research, quality improvement, and public health, all valued domains for any LHS. Federated data sharing, recognized by funding agencies 9 and adopted by clinical data research networks, 10,11 can preserve privacy and security because data remain behind the firewalls of DDN-participating healthcare organizations until queried for specific approved uses. The importance of IM in a DDN is likely influenced by the specific use case and the geographic proximity of participating organizations. For example, whereas some PCORnet clinical data research networks may have limited geographic overlap and duplication of patients, others have implemented IM solutions. 12,13 Regional DDNs designed for quality improvement or public health surveillance in a defined region may be especially likely to experience patient duplication.
Risks of duplication bias in public health DDNs have been recognized, but data to inform decisions are lacking. 14 Prevalence estimates from DDNs can be biased when individuals who access multiple health systems are represented more than once; 15 however, the degree of bias may differ by use case.
Efforts to define, scope, and address problems caused by duplication for a variety of public health use cases are needed. 16 17 This study focuses on an IM approach that could scale to deduplicating prevalence estimates in the seven-county Denver metropolitan area, which includes over 50% of Colorado's 5.8 million residents.

| Populations
Two health systems contributing data to the CHORDS Network ("data partners") participated in this study. Data partner 1 (DP1) is a large, integrated safety-net health system recognized as an LHS 9 that provides care for the majority of low-income individuals in the City and County of Denver (~30% of Denver's population). Data partner 2 (DP2) is a large pediatric tertiary care facility that participates in a network LHS. 10 Children seen in primary care at DP1 are routinely referred to DP2 for many types of specialty care. Patients with T1DM are referred to a specialty diabetes program affiliated with DP2 that did not contribute data to this study. Both data partners have multiple locations throughout the Denver metropolitan area. DP1 and DP2 operate locations within 3 miles of one another.
The eligible population (denominator) for this evaluation was children (less than 18 years of age on the date of the encounter) who had at least one 2017 healthcare encounter at either data partner and resided in the seven-county Denver metropolitan area. Individuals with incomplete information for unique identification by the HIE were excluded (n = 47 347, 2% of all records; see Figure 1).
For its chronic disease surveillance mission, the CHORDS Network leverages numerous case definitions drawn from the Centers for Medicare and Medicaid Services Chronic Conditions Data Warehouse. 21 Cases were identified as individuals with at least one International Classification of Diseases (ICD) code for a billing or problem list diagnosis of T1DM or T2DM. Individuals with at least one T1DM diagnosis, at either data partner, were classified as T1DM cases. Likewise, a single T2DM diagnosis resulted in a person being classified as a T2DM case. We did not distinguish cases with both T1DM and T2DM diagnosis codes.
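The "at least one diagnosis code" rule can be illustrated with a minimal sketch. It assumes ICD-10-CM coding, in which category E10 denotes type 1 and E11 denotes type 2 diabetes mellitus; the function name, record layout, and sample identifiers are hypothetical, and the sketch does not reproduce the Chronic Conditions Data Warehouse definitions used by the network.

```python
from collections import defaultdict

# Assumption: ICD-10-CM category prefixes (E10 = type 1, E11 = type 2).
# The study's actual code lists are not given in the text.
CONDITION_PREFIXES = {"T1DM": ("E10",), "T2DM": ("E11",)}


def classify_cases(diagnoses):
    """Map each person to the set of conditions with at least one matching code.

    `diagnoses` is an iterable of (person_id, icd_code) pairs pooled from
    billing and problem-list sources (hypothetical layout).
    """
    cases = defaultdict(set)
    for person_id, code in diagnoses:
        normalized = code.strip().upper()
        for condition, prefixes in CONDITION_PREFIXES.items():
            # A single matching code is enough to classify the person as a case.
            if normalized.startswith(prefixes):
                cases[person_id].add(condition)
    return dict(cases)


# Hypothetical records: p1 carries both code types, p3 has no diabetes code.
records = [("p1", "E10.9"), ("p1", "E11.9"), ("p2", "E11.65"), ("p3", "J45.909")]
case_map = classify_cases(records)
```

In this sketch a person with both code types (p1) appears under both conditions, mirroring the decision not to distinguish such cases.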

| Data sources and distributed network
Data Governance: Data partners executed a Data Use Agreement (DUA) to share a record-level limited dataset and business associate agreements (BAA) to share personally identifiable information (PII) with the HIE. The HIE has participation agreements with each data partner and routinely manages patients' identities as part of its core business functions. The HIE assigned a unique network-wide identifier where one or more rows were associated with each LINK_ID. More than one row was required when the patient matching algorithm identified duplicates within a data partner and assigned the same network-wide identifier (LINK_ID) to several different individuals (eg, PERSON_ID).

FIGURE 1 STROBE flow diagram representing the number of unique patients across two distributed data network partners participating in a study of identity management's influence on type 1 and type 2 diabetes prevalence

Distributed Query Logic:
We developed a two-step query process: (a) select a cohort of unique individuals across data partners and (b) classify those individuals into cases (yes/no). Limited exchange of associated PII permitted demographic (age, gender, and race/ethnicity) and geographic (census tract) stratification. We designed an automated process to reconcile medical, demographic, and geographic information that conflicted across partners (eg, an individual may be diagnosed at one data partner and not another). Manual reconciliation was resource-intensive and infeasible. Automated reconciliation required decisions to identify which value(s) to use in estimating prevalence. Individual-level data exchange was limited to three variables: LINK_ID, diagnosis status and final 2017 visit date. Individual demographic and geographic data were selected from the system whose data were used in Step 2. Below, we describe the two-step (ie, cohort selection and stratification) query processes:

FIGURE 2 Generating stratified, deduplicated estimates of diabetes prevalence through a distributed query process that minimizes exchange of PHI

Cohort Reconciliation: For each condition (T1DM or T2DM) a cohort was selected by generating lists of all eligible individuals (see above) with an initial query. Query results contained three fields: the LINK_ID, a binary indicator for case status and last visit date. With these fields a single data partner was selected to contribute a given patient's data (medical, demographic, geographic) to prevalence estimates. Rules for data partner selection are described below (mock results are represented in Table 1).
• Decision 1: When an individual (CID1) was seen by both data partners, yet only one identified the patient as a case, select the record from the data partner identifying the patient as a case.
• Decision 2: When an individual (CID2 or CID3) was seen by both data partners, and has the same case status, select the record from the data partner with the most recent 2017 visit.
• Decision 3: When an individual (CID4 or CID5) was seen by both data partners, and has the same case status and the same most recent 2017 visit date, select the data partner at random.
TABLE 1 Example of reconciliation process for selecting data partners to contribute demographic and geographic data for individuals seen in multiple health care systems (selection criteria are highlighted)

The query tool applied these logical rules to produce two mutually exclusive lists of LINK_IDs, one for each data partner.
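The three selection decisions can be sketched in code as follows. This is a minimal illustration under stated assumptions, not the query tool used in the network: the function names, the (case status, last visit date) tuple layout, and the mock records (including the identifiers CID6 and CID7) are hypothetical.

```python
import random
from datetime import date


def select_partner(dp1, dp2, rng=random):
    """Return "DP1" or "DP2" for one LINK_ID seen by both partners.

    Each argument is a (is_case, last_visit_date) tuple from the initial query.
    """
    case1, visit1 = dp1
    case2, visit2 = dp2
    # Decision 1: only one partner identifies the patient as a case.
    if case1 != case2:
        return "DP1" if case1 else "DP2"
    # Decision 2: same case status, so prefer the most recent 2017 visit.
    if visit1 != visit2:
        return "DP1" if visit1 > visit2 else "DP2"
    # Decision 3: same case status and same visit date, so choose at random.
    return rng.choice(["DP1", "DP2"])


def reconcile(results_dp1, results_dp2, rng=random):
    """Build two mutually exclusive LINK_ID lists, one per data partner."""
    chosen = {"DP1": [], "DP2": []}
    for link_id in sorted(set(results_dp1) | set(results_dp2)):
        if link_id not in results_dp2:
            chosen["DP1"].append(link_id)  # seen only at DP1
        elif link_id not in results_dp1:
            chosen["DP2"].append(link_id)  # seen only at DP2
        else:
            partner = select_partner(results_dp1[link_id], results_dp2[link_id], rng)
            chosen[partner].append(link_id)
    return chosen


# Hypothetical mock query results keyed by LINK_ID.
dp1 = {
    "CID1": (True, date(2017, 3, 1)),    # case at DP1 only (Decision 1)
    "CID2": (False, date(2017, 9, 15)),  # more recent visit at DP1 (Decision 2)
    "CID4": (False, date(2017, 6, 1)),   # full tie (Decision 3)
    "CID6": (False, date(2017, 1, 10)),  # seen only at DP1
}
dp2 = {
    "CID1": (False, date(2017, 11, 20)),
    "CID2": (False, date(2017, 4, 2)),
    "CID4": (False, date(2017, 6, 1)),
    "CID7": (True, date(2017, 8, 8)),    # seen only at DP2
}
lists = reconcile(dp1, dp2, rng=random.Random(0))
```

Passing a seeded random generator makes the Decision 3 tie-break repeatable between runs, which may be useful when results need to be audited.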

| RESULTS
Among 58 351 eligible children seen at DP1 and 167 569 seen at DP2 (Table 2), the DP2 population had a higher T1DM prevalence and a lower T2DM prevalence compared to the DP1 population.
DP2's population was younger, had a greater proportion of male and white patients, and a smaller proportion of patients of Hispanic ethnicity than the DP1 population.
Aggregation across data partners, without deduplication, would have estimated 226 100 children from the Denver region seen by these two data partners. We identified 218 437 unique individuals after deduplication, with 7628 (3.5%) seen in both systems (Table 3).
Individuals seen by both data partners had a higher prevalence of both T1DM and T2DM than individuals seen in a single system. Compared to individuals seen in only one system, duplicates were more likely to be identified as Hispanic, non-Hispanic black, or non-Hispanic Asian, and were substantially more likely to reside in a higher poverty neighborhood.
The prevalence estimates of T1DM and T2DM before and after deduplication are presented in Table 4. Prevalence did not change after IM processes for either condition. There was no observed change in prevalence for any demographic or geographic subgroup after deduplication, even for the subgroups that were most affected by deduplication (eg, Hispanic patients).

| DISCUSSION
This study describes a process we designed to link and deduplicate individuals for prevalence estimate activities using a regional DDN. To our knowledge, this is one of the very few studies to report implementing HIPAA-compliant IM across a DDN to generate deduplicated prevalence estimates. 24 The process we designed and tested generated and stored

Importantly, unlike many population-based surveys used for public health surveillance, DDN-based prevalence estimates integrated with the LHS mindset provide a powerful approach to evaluating interventions in a given region. Prevalence is a metric that can help health systems, county health departments and others continuously learn how to best respond to pressing public health challenges, including but not limited to diabetes.
In this initial pilot test of our process, involving only two data partners, deduplication had no measurable effect on pediatric diabetes prevalence estimates. Very low disease prevalence estimates may have resulted in fewer opportunities for cross utilization. Analyses were limited to only two data partners; neither was a referral center for diabetes. The relatively small degree of overlap between the data partners was unexpected, given referring relationships and geographic proximity, and likely contributed to the null finding.

Duplication Bias and Potential Impact on Prevalence: Duplication, if unaccounted for, has the potential to bias both prevalence estimate components (number of cases and people in the underlying population). Using simulated data, we represent potential overlap and bias among two hypothetical DDN data partners in Figure 3 and Table 5.

Note: Aside from the top row, which uses set theory notation to represent the meaning of each column, each row illustrates how different combinations of conditions impact duplication bias in the prevalence estimate. Conditions that influence the degree of bias include: the prevalence of a condition (eg, low or high), the degree of overlap in the overall population [none, high (complete among non-cases), or complete], and the degree of overlap in the case population (none or complete). Bias is introduced when overlap is disproportionate among cases and non-cases.
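The effect of disproportionate overlap on both prevalence components can be made concrete with a short calculation. The sketch below is an illustration under stated assumptions: the function name and all counts are hypothetical and are not taken from Table 5. The aggregated estimate sums each partner's cases and patients, whereas the deduplicated estimate subtracts the shared patients and the cases identified at both partners.

```python
def prevalence_estimates(n_a, n_b, n_both, cases_a, cases_b, cases_both):
    """Return (aggregated, deduplicated) prevalence for two data partners.

    n_a, n_b:          patients seen at partner A and partner B
    n_both:            patients seen at both partners
    cases_a, cases_b:  cases identified at each partner
    cases_both:        cases identified at both partners
    """
    aggregated = (cases_a + cases_b) / (n_a + n_b)
    deduplicated = (cases_a + cases_b - cases_both) / (n_a + n_b - n_both)
    return aggregated, deduplicated


# Hypothetical scenarios: each partner sees 1000 patients, 100 of them cases.
# Overlap proportionate among cases and non-cases: no bias.
proportionate = prevalence_estimates(1000, 1000, 200, 100, 100, 20)
# Overlap only among non-cases: aggregation underestimates prevalence.
non_cases_only = prevalence_estimates(1000, 1000, 200, 100, 100, 0)
# Overlap only among cases: aggregation overestimates prevalence.
cases_only = prevalence_estimates(1000, 1000, 100, 100, 100, 100)
```

In the proportionate scenario both estimates equal 10%, consistent with the observation that bias arises only when overlap differs between cases and non-cases.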
prevalence estimates within a regional DDN. The process leveraged a master patient index at an HIE and limited exchange of protected health information to the minimum necessary. The process should be extensible to deduplicate data for specific patients from any number of data partners, and adaptable to other linkage methods, health conditions and cohort selection criteria. Results from this evaluation suggest that several factors influence the effect of duplication bias on cross-institution prevalence estimates, including the relative size of patient populations, the representativeness of patients among participating data partners, organizational or patient self-referral patterns, and shared patient populations. This process has informed how DDN prevalence estimates might be used to help learning health systems respond to public health challenges, track patients across a healthcare ecosystem and improve population health. Additional use cases (beyond diabetes) may refine our efforts to reduce bias and reveal other principles and insights.