We thank Robert W. Carlson, MD, Wei-Nchih Lee, MD, PhD, and Sandra Wilson, PhD for their critical review of this article.
Understanding of cancer outcomes is limited by data fragmentation. In the current study, the authors analyzed the information yielded by integrating breast cancer data from 3 sources: electronic medical records (EMRs) from 2 health care systems and the state registry.
Diagnostic test and treatment data were extracted from the EMRs of all patients with breast cancer treated between 2000 and 2010 in 2 independent California institutions: a community-based practice (Palo Alto Medical Foundation; “Community”) and an academic medical center (Stanford University; “University”). The authors incorporated records from the population-based California Cancer Registry and then linked EMR-California Cancer Registry data sets of Community and University patients.
The authors initially identified 8210 University patients and 5770 Community patients; linked data sets revealed a 16% patient overlap, yielding 12,109 unique patients. The percentage of all Community patients, but not University patients, treated at both institutions increased with worsening cancer prognostic factors. Before linking the data sets, Community patients appeared to receive less intervention than University patients (mastectomy: 37.6% vs 43.2%; chemotherapy: 35% vs 41.7%; magnetic resonance imaging: 10% vs 29.3%; and genetic testing: 2.5% vs 9.2%). Linked Community and University data sets revealed that patients treated at both institutions received substantially more interventions (mastectomy: 55.8%; chemotherapy: 47.2%; magnetic resonance imaging: 38.9%; and genetic testing: 10.9% [P < .001 for each 3-way institutional comparison]).
Advances in breast cancer diagnosis and treatment[1-4] offer many effective options, and raise questions about the comparative effectiveness of different care pathways.[5-7] National initiatives prioritize comparing the effectiveness of treatments in diverse practice settings,[8-10] requiring demographic and long-term follow-up data from their populations.[11-13] Studies of real-world cancer outcomes, outside of clinical trials, have been limited by the fragmentation and lack of detail in available data. Population-based registries such as the Surveillance, Epidemiology, and End Results (SEER) Program excel at tracking demographics and incidence, but lack essential details regarding treatments and diagnostic tests.[14, 15] Institutional electronic medical records (EMR) contain extensive treatment information; however, they are subject to a measurement bias of unknown magnitude, namely the underreporting of care delivered outside the institution and its outcomes.
Linking EMR-derived data across health care systems offers the promise of more complete information, as well as the challenge of disagreement between institutions, which may require laborious review of patients' charts for resolution. We linked data from the EMRs of an academic medical center and a multisite community practice in the same catchment region. To provide a gold standard for patient identification and treatment summaries, we also linked to the statewide population-based California Cancer Registry (CCR; a SEER component). Our hypothesis was that this 3-way data linkage would offer a practical and scalable approach for identifying patients treated in more than 1 health care system, and would provide information concerning variability in cancer care that could not be obtained otherwise.
MATERIALS AND METHODS
Data Resource Environment
Our project (Oncoshare) began in 2009 to integrate data from EMRs of Stanford University Hospital (SU) and the Palo Alto Medical Foundation (PAMF). SU is an academic medical center; PAMF is a multisite community practice located in Alameda, San Mateo, Santa Clara, and Santa Cruz counties in California. SU (“University”) is within 1 mile of the nearest PAMF (“Community”) site. Community patients have health maintenance organization and fee-for-service insurance; University patients have various insurance plans, including Medicaid. Although inpatient care provided by Community physicians sometimes occurs in University facilities, the institutions are legally and financially separate, with nonoverlapping staff. All research was approved by University and Community Institutional Review Boards and the State of California Institutional Review Board (for use of CCR data).
Clinical Data Extraction
We extracted data from University and Community EMRs (Epic Systems, Verona, Wis) and from a University warehouse for clinical data collected before Epic implementation in 2007. All University clinical systems data since the mid-1990s reside in the Stanford Translational Research Integrated Database Environment (STRIDE), a warehouse and integration platform for research data extraction and analysis. Real-time electronic data feeds supply clinical information to STRIDE via Health Level Seven International (HL7) technology; extract, transform, and load processes out of Epic and into STRIDE occur daily. STRIDE contains 1 terabyte of data in the form of transcribed dictations and physicians' text notes, billing codes, laboratory and pharmacy orders, medication and radiotherapy administration records, laboratory results, and radiology and pathology reports. University chemotherapy data have been available from the Epic Beacon provider order entry system since 2008. Community clinical data are housed in 3 EMR systems: Epic for everything except chemotherapy orders; IDX medical billing software (GE Healthcare, Pittsburgh, Pa) for billing information; and IntelliDose (IntrinsiQ, AmerisourceBergen Specialty Group, Burlington, Mass), an ancillary computer system dedicated to chemotherapy and used since 2000. To ensure uniform coding, chemotherapy data elements in each EMR were mapped to RxNorm, a standardized drug lexicon, and diagnostic test data elements were mapped to National Cancer Institute codes. We identified clinically important interventions, including surgery, chemotherapy, and radiotherapy, and emerging diagnostic tests (breast magnetic resonance imaging [MRI], positron emission tomography [PET], and genetic testing for BRCA1 and BRCA2 [BRCA1/2] mutations). We excluded interventions occurring > 90 days before cancer diagnosis.
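The timing rule at the end of this paragraph can be sketched in a few lines of Python. This is only an illustration: the function names and the (name, date) record layout are assumptions, not the Oncoshare extraction code.

```python
from datetime import date, timedelta

# Window taken from the text: interventions dated more than 90 days
# before the cancer diagnosis are excluded.
LOOKBACK = timedelta(days=90)


def keep_intervention(diagnosis_date: date, intervention_date: date) -> bool:
    """Return True unless the intervention precedes diagnosis by more than 90 days."""
    return intervention_date >= diagnosis_date - LOOKBACK


def filter_interventions(diagnosis_date, interventions):
    """Keep the (name, date) pairs that fall inside the allowed window.

    `interventions` is a hypothetical list of (name, date) tuples.
    """
    return [
        (name, when)
        for name, when in interventions
        if keep_intervention(diagnosis_date, when)
    ]
```

An intervention dated exactly 90 days before diagnosis is kept under this reading of "> 90 days"; one dated 91 days before is dropped.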
CCR Data Addition
We requested CCR records, with all data fields including age, race/ethnicity, tumor stage, tumor grade, histology, receptors (estrogen receptor [ER], progesterone receptor [PR], and human epidermal growth factor receptor 2 [HER2]), and treatment summaries (comprising reports from any California institution of a patient having received surgery, chemotherapy, and/or radiotherapy) for all patients with breast cancer treated at University and/or Community facilities from 2000 through 2010. Census block groups were geocoded based on patients' residential addresses at the time of diagnosis. The 3% of individuals whose addresses could not be precisely geocoded were assigned to a census block group within their county of residence. We assigned neighborhood socioeconomic status (SES) using a previously developed and widely used index that incorporates 2000 US Census data regarding education, income, occupation, and housing costs, based on selection via principal components analysis. We categorized this measure by quintiles based on the distribution of the composite SES index across California. CCR and EMR records were linked using names, social security numbers, medical record numbers, and birthdates. All personal identifying information was removed, and clinical encounter dates were randomly offset by 30 days before research use of the data.
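To make the linkage step concrete, the following Python sketch shows one simple deterministic matching rule. It is an assumption made for illustration and not the validated algorithm cited in this article: the record layout, the name normalization, and the rule itself (birthdates must agree, then social security numbers when both are present, otherwise normalized names) are hypothetical, and medical record numbers are omitted for brevity.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class PatientRecord:
    source: str                 # e.g. "CCR" or "EMR" (illustrative labels)
    first_name: str
    last_name: str
    birthdate: date
    ssn: Optional[str] = None   # may be missing in either source


def _norm(text: str) -> str:
    """Lowercase and drop non-letters so that O'Brien matches OBRIEN."""
    return "".join(ch for ch in text.lower() if ch.isalpha())


def same_patient(a: PatientRecord, b: PatientRecord) -> bool:
    """Assumed rule: birthdate must agree; then SSN if both have one, else names."""
    if a.birthdate != b.birthdate:
        return False
    if a.ssn and b.ssn:
        return a.ssn == b.ssn
    return (_norm(a.first_name), _norm(a.last_name)) == (
        _norm(b.first_name),
        _norm(b.last_name),
    )


def link(registry, emr):
    """Return (registry_index, emr_index) pairs judged to be the same patient."""
    return [
        (i, j)
        for i, r in enumerate(registry)
        for j, e in enumerate(emr)
        if same_patient(r, e)
    ]
```

A production linkage would typically add blocking, tolerance for typographical variation, and manual review of uncertain pairs; none of that is attempted here.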
Patient Cohort Identification
We defined cohorts representing all patients treated for breast cancer at Community and/or University facilities from January 1, 2000 through January 1, 2010. Eligible patients were female, aged ≥ 18 years, and met at least 1 of the following criteria within the period: 1) the CCR reported a breast cancer diagnosis and/or treatment at Community and/or University facilities; and 2) University and/or Community billing records included a diagnostic code for breast cancer or ductal carcinoma in situ (International Classification of Diseases, 9th revision [ICD-9] codes 174.9 or 233.0), billed by a breast cancer specialist (defined as a surgeon, medical oncologist, or radiation oncologist). The treating institution was based on clinician affiliation, not location; a Community surgeon operating at the University was coded as Community. Institution was determined first by EMR-based billing records. Patients who had University records of undergoing breast cancer-specific interventions (surgery, chemotherapy, and radiotherapy) were coded as University, and likewise for Community, as confirmed by the CCR. For patients lacking treatment records, the institution was defined by billing records for cancer-related diagnostic tests including PET and genetic testing, and if there were no such records, by presence in University or Community internal tumor registries, which report to the CCR. MRI was not used to determine treating institution because before 2006 some Community patients visited the University for MRI only. After generating separate University and Community cohorts (defined hereafter as “EMR-CCR cohorts”), we linked these 2 EMR-CCR cohorts to identify patients treated at both institutions.
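The tiered assignment of treating institution described above might be sketched as follows. This is a hedged reading of the text, not the study code: the dictionary layout, the field names `billed` and `in_tumor_registry`, and the choice to apply the tiers per patient are assumptions, and billed items are presumed to be already attributed by clinician affiliation.

```python
# Evidence tiers taken from the text, in order of priority.
TREATMENTS = {"surgery", "chemotherapy", "radiotherapy"}
DIAGNOSTIC_TESTS = {"PET", "genetic_testing"}  # MRI is deliberately not listed


def treating_institutions(records):
    """Return the set of institutions credited with treating one patient.

    `records` is a hypothetical dict mapping an institution name to
    {"billed": set of billed items, "in_tumor_registry": bool}.
    """
    for evidence in (TREATMENTS, DIAGNOSTIC_TESTS):
        found = {inst for inst, rec in records.items() if rec["billed"] & evidence}
        if found:
            return found
    # Last resort: presence in an institution's internal tumor registry.
    return {inst for inst, rec in records.items() if rec["in_tumor_registry"]}


def cohort_label(records):
    """Label a patient as one institution, "Both", or None if no evidence exists."""
    found = treating_institutions(records)
    if len(found) > 1:
        return "Both"
    return next(iter(found), None)
```

Under these assumptions, a patient with University surgery and only a Community MRI is labeled University, because MRI never contributes to the assignment.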
Quality Assurance and Analytical Cohort Development
We validated and applied an algorithm to link records across data sources.[21, 22] To ensure subjects' eligibility, we developed analytical cohorts, from which we excluded patients lacking data regarding all of the following (considered essential for analyzing breast cancer care): stage of disease, tumor receptors (ER, PR, and HER2), and any diagnostic or treatment intervention. We applied more stringent inclusion criteria for patients identified in EMRs only but not in the CCR, because review of physicians' notes and pathology reports in EMRs revealed that many such patients had received breast cancer ICD-9 codes erroneously, often coincident with prophylactic mastectomy or tamoxifen used for breast cancer risk reduction. These stringent inclusion criteria were cancer-specific pathology data (stage and/or tumor receptors) and treatments (chemotherapy and/or radiotherapy). This algorithm was applied within each institution before linking EMR-CCR cohorts, and to the overall cohort after linkage.
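The two tiers of inclusion criteria can be expressed compactly. The sketch below is illustrative only; the flag names are hypothetical, and it assumes each patient has already been summarized into Boolean indicators.

```python
from dataclasses import dataclass


@dataclass
class PatientSummary:
    # Hypothetical per-patient flags derived from the linked EMR-CCR data.
    in_ccr: bool
    has_stage: bool
    has_receptors: bool          # any of ER, PR, or HER2 recorded
    has_any_intervention: bool   # any diagnostic test or treatment
    has_chemotherapy: bool
    has_radiotherapy: bool


def in_analytical_cohort(p: PatientSummary) -> bool:
    """Apply the inclusion rules described in the text."""
    has_pathology = p.has_stage or p.has_receptors
    if p.in_ccr:
        # Excluded only when stage, receptors, and interventions are all missing.
        return has_pathology or p.has_any_intervention
    # EMR-only patients: require pathology data and a cancer-specific treatment.
    return has_pathology and (p.has_chemotherapy or p.has_radiotherapy)
```

For example, an EMR-only patient with a recorded stage but no chemotherapy or radiotherapy is excluded, whereas the same data for a patient found in the CCR would be retained.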
Patient characteristics, receipt of treatments, and diagnostic tests were tabulated before and after linkage of University and Community EMR-CCR cohorts. After linkage, measures for patients treated at University, Community, and both institutions (“Both”) were compared using the chi-square statistic. All P values were 2-sided.
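For readers who wish to reproduce a 3-way comparison of this kind, a minimal Python sketch of the Pearson chi-square statistic follows. It uses only the standard library; the function names are illustrative, and the closed-form P value shown applies only to 2 degrees of freedom (for example, 3 groups by a yes/no outcome). It is not the statistical software used in the study.

```python
import math


def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(col_totals) - 1)
    return stat, df


def p_value_df2(stat):
    """Upper-tail P value for 2 degrees of freedom, where it equals exp(-x / 2)."""
    return math.exp(-stat / 2)
```

Each row of `table` would hold the counts for one group (University-only, Community-only, Both) and each column one outcome level (for example, intervention received or not). The sketch assumes every row and column total is nonzero.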
RESULTS
We identified a maximally inclusive University cohort of 8892 patients. Applying our eligibility criteria left 8210 patients (92.3%) in the University analytical cohort. Repeating these steps, we identified a maximally inclusive Community cohort of 6304 patients, and retained 5770 (91.5%) in the Community analytical cohort; adding these cohorts produced an apparent total of 13,980 patients. Linked records from the University and Community EMR-CCR cohorts yielded a maximally inclusive cohort of 13,238 unique patients, of whom we retained 12,109 (91.5%) in the combined analytical cohort (Fig. 1).
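The retention percentages and the apparent total reported in this paragraph can be checked with a short calculation. The counts are those given in the text; the variable and function names are illustrative.

```python
def retained_pct(retained: int, identified: int) -> float:
    """Percentage of a maximally inclusive cohort kept in the analytical cohort."""
    return round(100 * retained / identified, 1)


# (retained, maximally inclusive) counts as reported in the text.
cohorts = {
    "University": (8210, 8892),
    "Community": (5770, 6304),
    "Linked": (12109, 13238),
}

retention = {name: retained_pct(*counts) for name, counts in cohorts.items()}
# retention == {"University": 92.3, "Community": 91.5, "Linked": 91.5}

# Sum of the two analytical cohorts before linkage removes double counting.
apparent_total = cohorts["University"][0] + cohorts["Community"][0]  # 13980
```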
Patient Characteristics Before and After EMR-CCR Cohort Linkage
Before linking University and Community EMR-CCR cohorts, University patients appeared to be younger, with a lower SES and worse cancer prognostic factors than Community patients (Table 1). Linked EMR-CCR cohorts identified a third group of patients who were treated at both institutions (defined hereafter as “Both”). Both patients were significantly more likely to be Asian (University-only, 14%; Community-only, 13.9%; and Both, 17.2%) and of highest-quintile SES (University-only, 49.2%; Community-only, 64.6%; and Both, 75.2%). Both patients had intermediate prognostic factors, including age (< 40 years: University-only, 10.9%; Community-only, 3.7%; and Both, 10%), stage (III or IV: University-only, 13.6%; Community-only, 6.8%; and Both, 10.2%), tumor receptor subtype (for the poor-prognosis subtypes, HER2-positive or ER-, PR-, and HER2 negative: University-only, 29.1%; Community-only, 14.5%; and Both, 25.9%), and grade (grade 3: University-only, 32.3%; Community-only, 19.8%; and Both, 29.5% [P < .001 for each reported 3-way comparison]). As prognostic factors worsened, including decreasing age, increasing stage of disease, increasing grade, and less favorable receptor subtype,[24-26] an increasing percentage of Community patients (but not University patients) fell into the Both category.
Table 1. Patient Characteristics, Ascertained Before and After Linking University and Community EMR Data
Column headers: Before Linking Data; After Linking Data; Percentage in Both.
Abbreviations: Community, community-based practices; EMR, electronic medical record; HER2, human epidermal growth factor receptor 2; HR, hormone receptor (estrogen receptor [ER] and progesterone receptor [PR]); University, university-based practices.
P value was derived using the chi-square statistic (<.001) for comparison between patients from university-based practices, community-based practices, and both after EMR data linkage.
HR-positive tumors were positive for both ER and PR and HR-negative tumors were negative for both ER and PR. Receptor subtype was not available for patients with stage 0 disease, because HER2 was not tested.
Treatments and Diagnostic Tests, Before and After EMR-CCR Cohort Linkage
Treatment information was most often available from the CCR, but diagnostic test information was available only from EMRs, through providers' notes and billing (Table 2). For example, CCR data identified approximately 95% of all women with evidence from any source of having undergone mastectomy, but institution-specific data identified only 25% to 50% of these cases. For women in the Both category, the institution-specific data performed better, reflecting a greater yield from combining EMR-derived data from 2 institutions. For chemotherapy, Community billing data offered somewhat more complete case finding than those from the University. Linked University and Community EMR-CCR cohorts revealed that the use of all interventions was highest among the Both patients. For example, mastectomy use was 39.7% for University-only, 30.5% for Community-only, and 55.8% for Both, and these data were similar for bilateral mastectomy (University-only, 8%; Community-only, 5.2%; and Both, 13.2%). Figure 2 illustrates the differential use of MRI among patients in the University-only (32.9%), Community-only (32.8%), and Both (66%) categories by 2009 (P < .001 for each 3-way comparison).
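The case-finding comparison in this paragraph amounts to asking what share of all known users of an intervention each data source identified. A minimal Python sketch, assuming each source has been reduced to a set of patient identifiers (the function name and input layout are hypothetical):

```python
def yield_by_source(users_by_source):
    """Share of all identified users (union across sources) found by each source.

    `users_by_source` maps a source name, such as "CCR" or "University EMR",
    to the set of patient identifiers that source shows as having received
    the intervention.
    """
    all_users = set().union(*users_by_source.values())
    if not all_users:
        return {}
    return {
        source: len(ids) / len(all_users)
        for source, ids in users_by_source.items()
    }
```

With made-up identifiers, `yield_by_source({"CCR": {1, 2, 3, 4}, "EMR": {3, 4, 5}})` gives 0.8 for the CCR and 0.6 for the EMR, because the union contains 5 patients.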
Table 2. Diagnostic Test and Treatment Use, Ascertained Before and After Linking University and Community EMR Data
Column headers: Before University-Community EMR Data Linkage; After University-Community EMR Data Linkage; Users Identified by Data Source.
Abbreviations: CCR, California Cancer Registry; Community, community-based practices; EMR, electronic medical record; MRI, magnetic resonance imaging; PET, positron emission tomography; University, university-based practices.
P value <.001 for comparison between patients from university-only practices, community-only practices, and both after EMR data linkage.
DISCUSSION
To study breast cancer care beyond the walls of a single institution, we linked state registry records with data extracted from the EMRs of 2 health care systems, one of which was community-based and one of which was university-affiliated. This 3-way data linkage generated unique insights. We found a 16% patient overlap between nearby health care systems, which enables an estimate of the magnitude of missing treatment information in single-institution studies. We discovered a striking care pattern, with Community patients increasingly likely to be treated at both institutions as their cancer prognosis worsened, and with Both patients receiving the most intensive intervention despite having intermediate cancer prognostic factors. These findings illustrate how efforts to compare outcomes across real-world settings must account for measured and unmeasured risk factors and patient preferences.
Previous studies have integrated complementary databases, supplementing SEER-derived data with treatment details from Medicare claims[27, 28] and health maintenance organizations.[29, 30] This study's novelty lies in linking data from the EMRs of nearby yet independent health care systems, anchored by data from the CCR, a SEER component. We assessed data quality by reviewing several hundred deidentified patient records and evaluating agreement between all sources; rare conflicts were adjudicated by physician review.[21, 22] The 3-way linkage identified the most informative source for each variable, with the CCR being most informative regarding treatment use, and EMRs the only source of diagnostic test data. Missing data were reduced by the 3-way linkage, with Both patients having the most data available.
We encountered limitations in extracting research data from EMRs. We extracted structured data from billing, drug ordering, and administration records, and performed simple natural language processing of diagnostic reports, but many important concepts remain buried in the unstructured paragraphs of the clinicians' notes. These include nuances of decision-making that lack representation elsewhere, notably physician recommendations and patient preferences. EMRs also promise a wealth of clinical detail that cannot be obtained from administrative databases or registries, including the images and reports of radiologic examinations and genomic sequencing tests. Some of this information can be extracted and encoded as discrete data elements (for example, Breast Imaging-Reporting and Data System [BI-RADS] scores for mammogram and breast MRI), whereas identifying the determinants of treatment choices may require advances in natural language processing. The accurate retrieval of such specific patient information from unstructured, free-text EMR notes remains an active area of research.[31, 32] Given the unique potential of EMRs to enhance the understanding of cancer outcomes, studies to optimize the clinical and research uses of EMRs should remain a high priority.[33, 34] Some limitations may be addressed through EMR changes, with structured fields facilitating data extraction; others require new data sources, including patient-reported information.[8, 35] Bridging such gaps should be a priority of emerging data integration initiatives.[36, 37] Health information technology is developing rapidly, and the decade between 2000 and 2010 witnessed the implementation of EMRs and complementary databases. EMR modules for clinical data exchange between University and Community (Care Everywhere network; Epic Systems) and between patients and physicians (Patient Portal; Epic Systems) were activated in 2012, and should enhance both clinical care and research. 
In the future, standardized data representation models will facilitate the interoperability of digital health data between institutions.
Patients in the Both category offer an intriguing glimpse across health care systems. This category comprised 16% of patients, disproportionately representing those with top-quintile SES and intermediate cancer prognostic factors. Without information regarding physician referrals and patient preferences, we do not know why patients accessed both systems, but the overrepresentation of sicker Community patients in the Both category suggests tertiary center consultation on challenging cases. The Both patients are remarkable for their significantly greater use of every intervention studied, including mastectomy, chemotherapy, radiotherapy, MRI, PET, and genetic testing. One explanation might be that University-only and Community-only patients actually accessed other health care systems, leading us to underestimate their test use; however, such potential underascertainment cannot explain treatment differences recorded in the CCR, which aggregates statewide cancer data comprehensively because of mandated reporting. Previous studies reported rising mastectomy rates[38-42] despite a lack of survival benefit,[4, 43, 44] and found correlations with an increase in diagnostic testing.[39, 45, 46] The high SES noted among patients in the Both category might explain their greater use of interventions that are usually considered optional, such as MRI and bilateral mastectomy,[25, 47-50] but we lack information regarding other factors that may drive use of these interventions, including family cancer history and clinical trial participation. Assessing the value added by specific interventions[51-53] will require a deeper understanding of the patient, physician, and health care factors that shape the care patterns we observed.
Integrating breast cancer data from 2 EMRs and the state registry proved feasible and informative, and broadened our understanding of care beyond what could be achieved from just one or two data sources. This approach offers insight regarding real-world treatment across health care systems, which can advance comparative effectiveness and outcomes research in oncology.
Supported by the Susan and Richard Levy Gift Fund; Regents of the University of California's California Breast Cancer Research Program (#16OB-0149); the Stanford University Developmental Research Fund; and the National Cancer Institute's Surveillance, Epidemiology, and End Results Program under contract HHSN261201000140C awarded to the Cancer Prevention Institute of California. The collection of cancer incidence data used in the current study was supported by the California Department of Health Services as part of the statewide cancer reporting program mandated by California Health and Safety Code Section 103885; the National Cancer Institute's Surveillance, Epidemiology, and End Results Program under contract HHSN261201000140C awarded to the Cancer Prevention Institute of California, contract HHSN261201000035C awarded to the University of Southern California, and contract HHSN261201000034C awarded to the Public Health Institute; and the Centers for Disease Control and Prevention's National Program of Cancer Registries, under agreement #1U58 DP000807-01 awarded to the Public Health Institute. The ideas and opinions expressed herein are those of the authors and endorsement by the University or State of California, the California Department of Health Services, the National Cancer Institute, or the Centers for Disease Control and Prevention or their contractors and subcontractors is not intended nor should be inferred.
CONFLICT OF INTEREST DISCLOSURES
Dr. Belkora has acted as a consultant for the Palo Alto Medical Foundation Research Institute. Dr. Blayney has received a grant from Blue Cross/Blue Shield of Michigan and honoraria and travel expenses from the American Society of Clinical Oncology, Saudi Cancer Foundation, Cancer Centers of Excellence, Physician Resource Management, Oregon Health Sciences University, and United HealthCare. He also owns stock in IBM, Oracle Corporation, and Google Inc. Dr. Luft has received a grant from the Richard and Susan Levy Foundation.