Linkage of routinely collected oncology clinical data with health insurance claims data—an example with aromatase inhibitors, tamoxifen, and all-cause mortality




Studies of cancer based solely on health insurance claims data typically lack information on cancer clinical characteristics that are strong predictors of treatment and prognosis. Our objective was to evaluate routinely collected cancer clinical data for adjustment of confounding using an example evaluation of mortality associated with aromatase inhibitors and tamoxifen.


This cohort study identified women with breast cancer from 2008 through 2010 using health insurance claims data linked to clinical information on stage at diagnosis, current clinical status, histology, and other clinical markers. Estimated mortality rates (MRs) and 95% confidence intervals (CI) were compared between users of aromatase inhibitors or tamoxifen, adjusted for claims-identified covariates and additionally for the clinical variables using propensity scores and proportional hazards regression models.


The overall (n = 7974) estimated MR was 69/1000 person-years (95%CI = 62–76 person-years), 308/1000 person-years (95% CI = 273–345 person-years) for women with metastasis, and 12/1000 person-years (95%CI = 8–16 person-years) for women without active cancer. Propensity score matching of aromatase inhibitor users (n = 777) with tamoxifen users (n = 535) removed many, but not all, covariate imbalances. The hazard ratios (HRs) of all-cause mortality comparing users of aromatase inhibitors with tamoxifen users ranged from 1.0 to 1.6, with the HR most similar to previous clinical trials (0.87) coming from the claims-only analysis.


We were able to address potential unmeasured confounders by linking clinical information to the claims data; however, there was no apparent improvement in confounding control in the chosen example. Conditioning eligibility on the clinical data restricted the sample size substantially. Copyright © 2012 John Wiley & Sons, Ltd.


Pharmacoepidemiologic studies must often address confounding arising from factors related to the decision to initiate a particular therapy that are also determinants of the study outcome. Although pharmacoepidemiologists may employ many tools to address confounding, the issue can be exacerbated when the hypothesized confounders are poorly defined in data. Health insurance claims databases offer advantages related to their routine collection in electronic form of drug exposures, patient outcomes, and numerous patient characteristics. However, characteristics that define good candidates for certain therapies may not be recorded in the data or by prescribers so that confounding by indication may represent a combination of confounding and missing data.[1]

Analytic approaches that model the joint distribution of dozens of proxies (e.g., claims information) have emerged to address “unmeasured” confounding, where patient characteristics not measured in any single claim might be sufficiently correlated with proxies (or multivariable proxies) to allow adjustment for the desired variables.[2, 3] This approach must be used carefully because, in the absence of data on the “missing” variable, the extent to which the proxies account for the unmeasured confounding is unknown.

An illustrative example is the setting of oncology medications, where studies based on claims data typically lack information on cancer clinical characteristics that are strong predictors of treatment and prognosis. In this article, we describe an analysis in which routinely collected cancer clinical data are linked to claims data, providing insight into confounding in an example evaluating the effect of aromatase inhibitors and tamoxifen on the rate of all-cause mortality. The 10-year analysis of the Arimidex, Tamoxifen, Alone or in Combination trial found that patients randomized to 5 years of anastrozole treatment had a similar or slightly lower rate of all-cause mortality compared with patients randomized to 5 years of tamoxifen therapy (hazard ratio [HR] = 0.87; 95%CI = 0.74–1.02).[4] Yet, in practice, the drug choice may be related to detailed cancer characteristics that are not identifiable in claims and might confound the association unless accounted for. The aim of this article is to present an analysis grounded in a clinical example; it is not intended to inform or alter prescribing practice.


Data sources

The study population came from the Normative Health Information (NHI) Database, a claims database of a large US commercial health insurer. Diagnoses associated with the claims are recorded using the International Classification of Diseases, 9th revision (ICD-9) coding system. Procedures map to ICD-9, Current Procedural Terminology, and the Centers for Medicare and Medicaid Services Common Procedure Coding System codes. National Drug Codes identify medications.

The cancer clinical information came from the Impact Intelligence Oncology Management (IIOM) database. IIOM is a program coordinated by the health insurer for purposes of health plan administration that includes data collected every 6 months for patients in the NHI database who have two or more diagnoses of breast, lung, colon, or rectal cancer associated with a physician visit. Patients' physicians are sent a one-page facsimile form that is prepopulated with patient information and are requested to complete information on the patient's clinical status at diagnosis and currently. The data used for this analysis come from a pilot program and represent a sample of patients meeting the IIOM eligibility criteria. For breast cancer (ICD-9 174.xx), the facsimile form requests information on date of diagnosis, stage at diagnosis (0, I, II, III, and IV), human epidermal growth factor receptor-2 (HER2) status (overexpressed or underexpressed), hormone receptor status (positive or negative for either estrogen or progesterone receptors), and histology (invasive adenocarcinoma, ductal carcinoma in situ, or other). The current clinical status variable is common across cancer types in IIOM and contains the following options:

  • receiving adjuvant or neoadjuvant therapy;
  • clinically disease-free;
  • metastatic, locally recurrent, or progressive disease (with date of relapse as applicable);
  • not seen here since prior report (last 6 months);
  • patient no longer at this practice;
  • patient deceased.

Completed forms are sent back to the IIOM office, with a response fraction of approximately 60% on the first request for information for each patient, so that at least one record is available on that fraction of patients. Because the IIOM data were derived from the NHI database, linkage between the IIOM and the claims information is straightforward. The data are missing for about 1.5% of each of the breast cancer items, and because this fraction is small, we excluded women with missing data on their baseline IIOM assessment. The validity of the data has not been evaluated, but because these data are collected from treating physicians, they can be conceptualized similarly to medical records, but with an additional opportunity for recording errors.

Cohort formation

We initially identified women who had at least one IIOM record for breast cancer, at least one pharmacy dispensing for an aromatase inhibitor (anastrozole, letrozole, or exemestane) or tamoxifen, and at least one period of health plan enrollment lasting 6 months or longer between 1 January 2008 and 31 December 2010 (Figure 1). Because the USA has a multipayer healthcare system, women may have more than one period of health plan enrollment as a result of switching in and out of coverage by the health insurer.

Figure 1. Formation of study cohorts from the Normative Health Information and the Impact Intelligence Oncology Management Databases, 1 January 2008–31 December 2010

The NHI database includes enrollees from a range of health insurance plans, and approximately 40% of the IIOM population had no pharmacy claims in the NHI database (IIOM eligibility is not contingent on enrollment in an affiliated pharmacy plan). As a result, the linked IIOM and NHI data were too few to study new users of an aromatase inhibitor or tamoxifen.[5] Therefore, in an effort to maintain an adequate sample size, we identified all users of an aromatase inhibitor or tamoxifen who had at least one breast cancer record in the IIOM data (prevalent users). These women were eligible for cohort entry upon the date of their first IIOM record that was preceded by 6 months of continuous enrollment and a dispensing of an aromatase inhibitor or tamoxifen in the 90 days before or 30 days after the IIOM record. Women who had a dispensing of more than one agent in this time frame were classified according to the dispensing closest in time to the IIOM record. This baseline exposure categorization was maintained for the duration of follow-up.
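The eligibility window and the closest-dispensing rule described above can be sketched as follows. This is a minimal illustration rather than the study's actual code: the function name `classify_exposure`, the labels "AI" and "TAM", and the tie-breaking behavior are assumptions made for the example.

```python
from datetime import date

# Window around the IIOM record date, as described in the text.
DAYS_BEFORE = 90
DAYS_AFTER = 30

def classify_exposure(iiom_date, dispensings):
    """Return the drug class of the dispensing closest to the IIOM record.

    `dispensings` is a list of (dispensing_date, drug_class) tuples, where
    drug_class is "AI" or "TAM" (hypothetical labels). Returns None when no
    dispensing falls inside the window. Ties keep the earliest-listed
    dispensing (an assumption; the text does not say how ties were handled).
    """
    in_window = [
        (abs((d - iiom_date).days), drug)
        for d, drug in dispensings
        if -DAYS_BEFORE <= (d - iiom_date).days <= DAYS_AFTER
    ]
    if not in_window:
        return None
    return min(in_window, key=lambda pair: pair[0])[1]

# Hypothetical patient with three dispensings around one IIOM record.
record = date(2009, 6, 1)
fills = [
    (date(2009, 3, 20), "TAM"),  # 73 days before: inside the window
    (date(2009, 6, 15), "AI"),   # 14 days after: inside the window, closest
    (date(2009, 8, 1), "AI"),    # 61 days after: outside the window
]
print(classify_exposure(record, fills))  # AI
```

Because the baseline category is carried through follow-up, a function of this kind would be called once per patient at cohort entry.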

Follow-up occurred from the date of cohort entry (the later of the qualifying IIOM record or the first aromatase inhibitor or tamoxifen dispensing in the 90 days before to 30 days after the IIOM record date) to the earliest of death, disenrollment from the health plan, or the end of the study period.
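As a rough sketch of this follow-up definition, the person-time contributed by one patient could be computed as below. The function name `follow_up_years`, the use of 365.25 days per year, and the example dates are assumptions for illustration; only the entry and exit rules come from the text.

```python
from datetime import date

STUDY_END = date(2010, 12, 31)  # end of the study period stated in the text

def follow_up_years(iiom_date, qualifying_dispensing, death=None, disenrollment=None):
    """Return (person-years, died) for one patient.

    Entry is the later of the IIOM record and the qualifying dispensing.
    Exit is the earliest of death, disenrollment, or the end of the study;
    None means the event did not occur.
    """
    entry = max(iiom_date, qualifying_dispensing)
    candidates = [d for d in (death, disenrollment, STUDY_END) if d is not None]
    exit_date = min(candidates)
    died = death is not None and death == exit_date
    years = max((exit_date - entry).days, 0) / 365.25
    return years, died

# Hypothetical patient: dispensing 14 days after the IIOM record,
# disenrolled one year after cohort entry without dying.
years, died = follow_up_years(
    date(2009, 6, 1), date(2009, 6, 15), disenrollment=date(2010, 6, 15)
)
print(round(years, 3), died)
```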

Covariates and propensity score estimation

We estimated propensity scores of aromatase inhibitor exposure relative to tamoxifen use among the eligible patients with exposure to these drugs (n = 1312) using a logistic regression model. The model included covariates defined from claims in the 6 months prior to cohort entry. These covariates consisted of all present episode treatment groups (ETGs; OptumInsight, Waltham, MA) and all pharmacy dispensings grouped into therapeutic drug classes and include information on other forms of treatment for breast cancer. Similar to diagnosis-related groups (DRGs) used to characterize inpatient care,[6] ETGs group treatment episodes using the claims record of patients and incorporate inpatient, outpatient, and ancillary services, including pharmacy services.[7-9] ETGs therefore use the trio of code types typically used for variable definition in administrative data (diagnosis, procedure, and medications) but in contrast to DRGs use all available claims.

For modeling, we first forced into the model variables judged to be potentially clinically relevant and prevalent: age, geographic region, other disorders of the breast, benign neoplasm of the breast, diabetes, hypo-functioning thyroid gland, malignant bone metastases, disorders of the lymphatic channels, and use of narcotic analgesics, serotonin-specific reuptake inhibitors, serotonin–norepinephrine reuptake inhibitors, or antiemetic/antivertigo agents.

Next, we used a stepwise selection procedure to identify predictors from among the remaining ETG and drug categories, with a p-value cutoff of 0.3 for model entry and 0.2 for retention. This final propensity score model contained 35 variables and discriminated well (c-statistic = 0.81).

The propensity score estimated in this manner was used as the reference case in that it was computed using only healthcare claims data. We estimated an additional propensity score using the same procedure, but additionally forcing the following variables from the IIOM data: stage at diagnosis, histology, baseline current clinical status, and HER2 status. This model included 41 variables with c = 0.82.
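The c-statistic quoted for each propensity score model can be read as the probability that a randomly chosen aromatase inhibitor user has a higher estimated score than a randomly chosen tamoxifen user. A minimal sketch of that calculation follows; the scores and exposure labels are invented for illustration and the function name `c_statistic` is an assumption.

```python
def c_statistic(scores, treated):
    """Concordance between scores and a binary exposure indicator.

    Counts every (treated, comparator) pair; a pair is concordant when the
    treated patient has the higher score, and ties count as one half.
    """
    pos = [s for s, t in zip(scores, treated) if t == 1]
    neg = [s for s, t in zip(scores, treated) if t == 0]
    concordant = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                concordant += 1.0
            elif p == n:
                concordant += 0.5
    return concordant / (len(pos) * len(neg))

# Hypothetical propensity scores: 1 = aromatase inhibitor, 0 = tamoxifen.
scores = [0.9, 0.8, 0.6, 0.55, 0.4, 0.3]
treated = [1, 1, 0, 1, 0, 0]
print(round(c_statistic(scores, treated), 3))  # 0.889
```

A value of 0.5 indicates no discrimination and 1.0 indicates perfect separation of the two exposure groups.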

Assessment of mortality

Information on the fact and date of death was obtained from linkage with the Social Security Death Master File, a compilation of mortality information derived from the US Social Security Administration payment records.[10]


We first estimated mortality rates (MR) and 95% confidence intervals (CI) among all women with at least one IIOM breast cancer record who were alive at the time of their first record (n = 7974). These rates were estimated overall and within strata of clinical characteristics from IIOM, with the aim of evaluating whether the stratum-specific rates are consistent with clinical expectation (e.g., that patients with metastatic cancer have a higher MR than that of other patients) and with greater precision than would be possible after restriction to the exposure cohorts. The remaining analyses applied to the exposure cohorts.
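A mortality rate of this kind is the number of deaths divided by the accumulated person-time. The sketch below uses the all-deaths row of Table 1 (414 deaths over 6028 person-years). The interval method shown, a Wald interval on the log scale under a Poisson assumption, is an assumption of this example; the text does not state which method was used, so the limits may differ slightly from the published ones.

```python
import math

def mortality_rate(events, person_years, per=1000, z=1.96):
    """Rate per `per` person-years with a log-scale Wald interval."""
    rate = events / person_years * per
    se_log = 1 / math.sqrt(events)  # SE of log(rate) under a Poisson model
    factor = math.exp(z * se_log)
    return rate, rate / factor, rate * factor

rate, lower, upper = mortality_rate(414, 6028)
print(f"{rate:.1f} per 1000 person-years ({lower:.1f} to {upper:.1f})")
```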

Second, we compared the prevalence of selected covariates among eligible women exposed to an aromatase inhibitor or tamoxifen (women with an IIOM record) with that of all women with at least two physician-visit-associated diagnoses of breast cancer within 6 months and exposure to an aromatase inhibitor or tamoxifen (the source population). The aim of this comparison was to determine whether women with an IIOM record differed from women in the source population. We ascertained the tabulated covariates from the 6 months before the second qualifying diagnosis of breast cancer.

Next, we tabulated baseline characteristics of the cohorts overall and among subcohorts matched on each of the propensity scores where the comparisons were made:

  • all users of aromatase inhibitors versus all users of tamoxifen (n = 1312);
  • users of aromatase inhibitors versus users of tamoxifen, matched on the claims-based propensity score (n = 630);
  • users of aromatase inhibitors versus users of tamoxifen, matched on propensity score estimated from claims and IIOM data (n = 632).
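The matched cohort sizes (315 and 316 patients per arm) are consistent with 1:1 matching, but the matching algorithm itself is not described. The sketch below shows one common choice, greedy nearest-neighbor matching without replacement within a caliper, purely as an assumption; the function name `greedy_match`, the caliper of 0.05, and the scores are hypothetical.

```python
def greedy_match(treated, comparators, caliper=0.05):
    """1:1 nearest-neighbor matching without replacement on a propensity score.

    `treated` and `comparators` map patient id -> propensity score.
    Returns a list of (treated_id, comparator_id) pairs.
    """
    available = dict(comparators)
    pairs = []
    # Process treated patients in score order so the result is reproducible.
    for t_id, t_ps in sorted(treated.items(), key=lambda kv: kv[1]):
        if not available:
            break
        c_id = min(available, key=lambda k: abs(available[k] - t_ps))
        if abs(available[c_id] - t_ps) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]  # each comparator is used at most once
    return pairs

# Hypothetical scores for three aromatase inhibitor and three tamoxifen users.
ai = {"a1": 0.82, "a2": 0.55, "a3": 0.97}
tam = {"t1": 0.80, "t2": 0.52, "t3": 0.30}
print(greedy_match(ai, tam))  # [('a2', 't2'), ('a1', 't1')]
```

Patient a3 stays unmatched because no remaining comparator lies within the caliper, which mirrors how matching shrinks the analyzable cohort.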

We estimated exposure-specific MRs and survival curves within each of the sets of exposure cohorts. HRs and 95%CIs comparing the rates of mortality among aromatase inhibitor users and tamoxifen users were estimated from proportional hazards regression models. In the analysis of all patients (i.e., without matching), we estimated crude and adjusted HRs, with adjustment via inclusion of the (claims-based) propensity score as a linear covariate in the model. The models for the matched cohorts provided crude HRs.
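Survival curves such as those in Figure 2 are typically Kaplan–Meier estimates. A minimal product-limit sketch is shown below under that assumption; the follow-up times and event indicators are invented and the function name `kaplan_meier` is hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(time, survival)] at each distinct time with at least one death.

    `times` are follow-up durations; `events` are 1 for death, 0 for censoring.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients whose follow-up ends at this time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve

# Hypothetical follow-up in months for six patients.
times = [2, 5, 5, 8, 12, 12]
events = [1, 0, 1, 1, 0, 0]
for t, s in kaplan_meier(times, events):
    print(t, round(s, 3))
```

Running the function separately for each exposure group gives the pair of curves that would be compared visually.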


The Impact Intelligence Oncology Management database contained at least one record on 11 536 women with breast cancer. After exclusion of women who were deceased at the time of the first IIOM assessment, 7974 women entered follow-up for the stratified MR analysis (Table 1). The overall estimated MR was 69/1000 person-years (95%CI = 62–76 person-years). Women who were clinically disease-free had an MR of 12/1000 person-years (95%CI = 8–16 person-years) compared with 308/1000 person-years (95%CI = 273–345 person-years) for women with metastatic cancer. Similar differences were evident with stage at diagnosis.

Table 1. All-cause mortality rates, full breast cancer cohort, Impact Intelligence Oncology Management Database, 1 January 2008–31 December 2010

| | Patients (n) | Person-years | Events (n) | Mortality rate* | 95% Confidence interval |
|---|---|---|---|---|---|
| All deaths | 7974 | 6028 | 414 | 68.7 | 62.3–75.5 |
| Current clinical status | | | | | |
| Clinically disease-free | 4096 | 2968 | 34 | 11.5 | 8.1–15.8 |
| Stage at diagnosis I and II | 5298 | 4092 | 177 | 43.3 | 37.2–50.0 |
| Human epidermal growth factor receptor-2 | | | | | |
| Hormone receptor status | | | | | |

*Rate per 1000 person-years.

Women who were eligible for the example analysis—and therefore had at least one IIOM record—had a higher prevalence of malignant disease and exposure to chemotherapy or radiation and a lower recorded prevalence of diabetes mellitus, mammography, and hyperlipidemia relative to the source population (Table 2).

Table 2. Comparison of prevalent diagnoses among women with breast cancer exposed to aromatase inhibitors or tamoxifen between the Impact Intelligence Oncology Management Database and the source claims data, 1 January 2008–31 December 2010

| Diagnosis | IIOM (n = 1312), % | Source population (n = 58 732), % |
|---|---|---|
| Non-malignant neoplasm of breast | 24.2 | 37.4 |
| Other disorders of breast | 12.7 | 10.6 |
| Hypo-functioning thyroid gland | 12.2 | 12.4 |
| Malignant bone metastases | 8.3 | 3.2 |
| Disorders of lymphatic channel | 7.9 | 2.1 |
| Radiation therapy | 23.3 | 5.8 |
| Gynecological signs and symptoms | 21.9 | 32.7 |
| Cardiovascular diseases signs and symptoms | 20.4 | 19.2 |
| Non-malignant neoplasm of female genital tract | 7.2 | 4.2 |
| Postmenopausal status | 5.3 | 3.2 |
| Endocrine disease signs and symptoms | 2.6 | 2.9 |
| Migraine headache | 2.0 | 1.6 |
| Congestive heart failure | 1.5 | 2.4 |

Aromatase inhibitor users were more likely than were tamoxifen users to be ≥55 years of age, have had advanced disease upon diagnosis, and have a diagnosis of cardiovascular risk factors (Table 3). Propensity score matching removed many covariate imbalances, but others (invasive histology and metastatic clinical status) were only partially attenuated, even after adjustment for the variables from IIOM.
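One way to quantify the residual imbalance described above is the standardized difference between two proportions. This diagnostic is not reported in the article and is shown only as an illustrative assumption, applied to the metastatic clinical status percentages in Table 3; the function name `standardized_difference` is hypothetical.

```python
import math

def standardized_difference(p1, p2):
    """Standardized difference between two proportions (given as fractions)."""
    pooled_sd = math.sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / 2)
    return (p1 - p2) / pooled_sd

# Metastatic clinical status, aromatase inhibitor vs tamoxifen (Table 3).
before = standardized_difference(0.120, 0.051)        # all users
after_claims = standardized_difference(0.092, 0.076)  # matched, claims only
after_both = standardized_difference(0.104, 0.082)    # matched, claims + IIOM

for label, value in [("all users", before),
                     ("claims-only match", after_claims),
                     ("claims + IIOM match", after_both)]:
    print(f"{label}: {value:.3f}")
```

A commonly used rule of thumb treats absolute values below about 0.1 as negligible, although that threshold is a convention rather than a result from this study.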

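Table 3. Selected characteristics of all and propensity-score-matched aromatase inhibitor and tamoxifen users in the 6 months prior to cohort entry

Values are n (%). AI, aromatase inhibitors; IIOM, Impact Intelligence Oncology Management Database.

| Characteristic | Total (n = 1312): AI (n = 777) | Total (n = 1312): Tamoxifen (n = 535) | Matched, claims only (n = 630): AI (n = 315) | Matched, claims only (n = 630): Tamoxifen (n = 315) | Matched, claims + IIOM (n = 632): AI (n = 316) | Matched, claims + IIOM (n = 632): Tamoxifen (n = 316) |
|---|---|---|---|---|---|---|
| Age (years) | | | | | | |
| IIOM clinical variables | | | | | | |
| Stage (III, IV) | 153 (19.7) | 77 (14.4) | 55 (17.5) | 46 (14.6) | 53 (16.8) | 50 (15.8) |
| Histology (invasive) | 422 (54.3) | 272 (50.8) | 165 (52.4) | 150 (47.6) | 159 (50.3) | 149 (47.2) |
| Current clinical status (metastasis) | 93 (12.0) | 27 (5.1) | 29 (9.2) | 24 (7.6) | 33 (10.4) | 26 (8.2) |
| Human epidermal growth factor receptor-2 (overexpressed) | 160 (20.6) | 107 (20.0) | 73 (23.2) | 62 (19.7) | 56 (17.7) | 65 (20.6) |
| Drug dispensings | | | | | | |
| Narcotic analgesics | 295 (38.0) | 203 (37.9) | 125 (39.7) | 115 (36.5) | 109 (34.5) | 112 (35.4) |
| Serotonin-specific reuptake inhibitors | 127 (16.3) | 94 (17.6) | 51 (16.2) | 55 (17.5) | 61 (19.3) | 53 (16.8) |
| Serotonin–norepinephrine reuptake inhibitors | 100 (12.9) | 85 (15.9) | 49 (15.6) | 53 (16.8) | 44 (13.9) | 55 (17.4) |
| Antiemetic/Antivertigo agents | 96 (12.4) | 71 (13.3) | 36 (11.4) | 37 (11.8) | 39 (12.3) | 36 (11.4) |
| Non-barbiturate sedative hypnotics | 114 (14.7) | 68 (12.7) | 35 (11.1) | 41 (13.0) | 36 (11.4) | 40 (12.7) |
| Cephalosporins (first generation) | 87 (11.2) | 83 (15.5) | 42 (13.3) | 39 (12.4) | 42 (13.3) | 39 (12.3) |
| Alpha-/Beta-adrenergic blocking agents | 15 (1.9) | 5 (0.9) | 2 (0.6) | 3 (1.0) | 3 (1.0) | 3 (1.0) |
| Bone resorption suppression agents | 124 (16.0) | 23 (4.3) | 19 (6.0) | 21 (6.7) | 25 (7.9) | 22 (7.0) |
| Episode treatment groups | | | | | | |
| Non-malignant neoplasm of breast | 176 (22.7) | 141 (26.4) | 65 (20.6) | 77 (24.4) | 67 (21.2) | 74 (23.4) |
| Other disorders of breast | 82 (10.6) | 85 (15.9) | 37 (11.8) | 40 (12.7) | 34 (10.8) | 36 (11.4) |
| Hypo-functioning thyroid | 103 (13.3) | 57 (10.7) | 33 (10.5) | 37 (11.8) | 43 (13.6) | 40 (12.7) |
| Malignant bone metastases | 84 (10.8) | 25 (4.7) | 24 (7.6) | 23 (7.3) | 27 (8.5) | 23 (7.3) |
| Disorders of lymphatic channel | 68 (8.8) | 35 (6.5) | 25 (7.9) | 24 (7.6) | 29 (9.2) | 23 (7.3) |
| Other hyperlipidemia | 234 (30.1) | 75 (14.0) | 61 (19.4) | 62 (19.7) | 59 (18.7) | 57 (18.0) |
| Radiation therapy | 166 (21.4) | 139 (26.0) | 71 (22.5) | 75 (23.8) | 81 (25.6) | 72 (22.8) |
| Gynecological signs and symptoms | 145 (18.7) | 142 (26.5) | 64 (20.3) | 69 (21.9) | 66 (20.9) | 70 (22.2) |
| Cardiovascular diseases signs and symptoms | 175 (22.5) | 93 (17.4) | 55 (17.5) | 59 (18.7) | 59 (18.7) | 57 (18.0) |
| Non-malignant neoplasm of female genital tract | 33 (4.3) | 62 (11.6) | 21 (6.7) | 14 (4.4) | 19 (6.0) | 20 (6.3) |
| Postmenopausal status | 53 (6.8) | 17 (3.2) | 17 (5.4) | 15 (4.8) | 14 (4.4) | 16 (5.1) |
| Endocrine disease signs and symptoms | 18 (2.3) | 16 (3.0) | 7 (2.2) | 12 (3.8) | 8 (2.5) | 9 (2.9) |
| Migraine headache | 14 (1.8) | 12 (2.2) | 4 (1.3) | 10 (3.2) | 8 (2.5) | 9 (2.9) |
| Congestive heart failure | 11 (1.4) | 9 (1.7) | 5 (1.6) | 4 (1.3) | 5 (1.6) | 6 (1.9) |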

The estimated MR among users of aromatase inhibitors was 72/1000 person-years, similar to the overall MR for IIOM breast cancer patients (Table 4). The rate among tamoxifen users was lower (27/1000 person-years), a difference that became apparent after about 6 months of follow-up (Figure 2). The crude HR was 2.5 (95%CI = 1.3–4.9) and was attenuated with each of the confounding adjustment techniques, with HRs ranging from 1.0 to 1.6 and with 95%CIs consistent with chance. The lowest HR came from the claims-only propensity score analyses, whereas the addition of the IIOM data appeared to provide no benefit. The HR from the cohorts matched on the propensity score estimated using claims and IIOM data was 1.6 (95%CI = 0.7–3.6).
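As a rough consistency check on the matched analysis, an incidence rate ratio can be computed directly from the events and person-years in Table 4 for the cohorts matched on claims and IIOM data (15 deaths over 241.4 person-years versus 9 deaths over 243.3 person-years). A rate ratio is not the hazard ratio from the proportional hazards model, and the log-scale Wald interval below is an assumption of this sketch, so only approximate agreement with the reported HR of 1.6 (95%CI = 0.7–3.6) should be expected.

```python
import math

def rate_ratio(events_a, py_a, events_b, py_b, z=1.96):
    """Incidence rate ratio (a vs b) with a log-scale Wald interval."""
    rr = (events_a / py_a) / (events_b / py_b)
    se_log = math.sqrt(1 / events_a + 1 / events_b)
    return rr, rr * math.exp(-z * se_log), rr * math.exp(z * se_log)

# Aromatase inhibitors vs tamoxifen, matched on claims + IIOM (Table 4).
rr, lower, upper = rate_ratio(15, 241.4, 9, 243.3)
print(f"rate ratio {rr:.2f} ({lower:.2f} to {upper:.2f})")
```

With only 24 deaths in the matched cohorts, the interval is wide and includes 1, which matches the description of the adjusted estimates as consistent with chance.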

Table 4. Rates of all-cause mortality comparing use of aromatase inhibitors with use of tamoxifen with and without additional adjustment for clinical variables from the IIOM database

| Analysis | Exposure cohort | Events/Patients | Person-years | Mortality rate* | Claims data only: hazard ratio | Claims data only: 95%CI | Claims and IIOM data: hazard ratio | Claims and IIOM data: 95%CI |
|---|---|---|---|---|---|---|---|---|
| Crude | Aromatase inhibitors | 44/777 | 614.… | … | … | …–4.9 | | |
| Propensity score adjusted | Aromatase inhibitors | 44/777 | 614.… | … | … | …–2.6 | 1.1 | 0.5–2.3 |
| Propensity score matched | Aromatase inhibitors | 11/315 | 241.… | … | … | …–2.4 | | |
| Propensity score matched | Aromatase inhibitors | 15/316 | 241.4 | 62.1 | | | 1.6 | 0.7–3.6 |
| | Tamoxifen | 9/316 | 243.3 | 37.0 | | | Ref. | |

IIOM, Impact Intelligence Oncology Management Database; CI, confidence interval.

*Rate per 1000 person-years.
Figure 2. Kaplan–Meier curves of time to death comparing aromatase inhibitor users with users of tamoxifen, with and without adjustment for clinical variables from the Impact Intelligence Oncology Management Database


Although the cancer clinical data appeared to provide no benefit in the adjustments for confounding in this example, perhaps because the claims data accounted for important confounding by proxy,[2, 11] these data were nevertheless useful. The IIOM data allowed for an informed interpretation of the role of residual confounding in the analysis. Through the linkage to the IIOM data, we were able to estimate the prevalence of otherwise unmeasured potential confounders and their associations with exposure and outcomes—three parameters that determine the degree of confounding.[12]

Having complete data on these variables also allows for a direct adjustment for these potential confounders. Contrasting the approach used in this article with sensitivity analysis highlights this advantage. The commonly used forms of sensitivity analysis aim to evaluate whether an observed estimate of effect is robust across a range of plausible values for the prevalence of the unmeasured confounders and their associations with exposure and outcomes.[13] This approach can be informed through data collection[11] but often at a high cost and therefore on only a small sample of study subjects. External data can also be used but can require strong assumptions about the representativeness of the sampled data vis-à-vis the study population. The approach presented in this article includes data on each study member, thereby avoiding assumptions about the representativeness of the clinical data across the full study membership. It also includes clinical data measured for each person (internal information) that was gathered at little additional expense. Indeed, the fundamental advantage of requiring that all eligible women have at least one IIOM record was that it avoided the assumption, required by the sampled approaches, that the distributions of the clinical variables in the IIOM subpopulation are similar to the distributions in the reference population.
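The kind of sensitivity analysis contrasted here can be sketched with the standard external-adjustment formula for a single binary confounder, in which the multiplicative bias depends on the confounder's prevalence in each exposure group and its association with the outcome. The example below plugs in the metastatic clinical status prevalences from Table 3 and the crude HR from the text; the confounder-outcome rate ratios swept over are hypothetical values chosen for illustration, and the function name `confounding_bias` is an assumption.

```python
def confounding_bias(p_exposed, p_unexposed, rr_confounder):
    """Multiplicative bias from one unmeasured binary confounder.

    p_exposed / p_unexposed: confounder prevalence in each exposure group.
    rr_confounder: rate ratio relating the confounder to the outcome.
    """
    return (p_exposed * (rr_confounder - 1) + 1) / (
        p_unexposed * (rr_confounder - 1) + 1
    )

observed_hr = 2.5           # crude HR reported in the text
p_ai, p_tam = 0.120, 0.051  # metastatic status prevalence (Table 3, all users)

for rr_cd in (2, 5, 10, 25):  # hypothetical confounder-outcome rate ratios
    bias = confounding_bias(p_ai, p_tam, rr_cd)
    print(rr_cd, round(bias, 2), round(observed_hr / bias, 2))
```

With linked clinical data, the prevalences are measured rather than assumed, which leaves the strength of the confounder-outcome association as the main quantity to vary.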

However, there are also drawbacks in this example analysis related to the requirement that each woman have at least one IIOM record. This requirement reduced the sample size substantially in two ways. First, the study was restricted to the time frame covered by the IIOM data collection (January 2008 onward), whereas aromatase inhibitors and tamoxifen have been available in the USA for over a decade.[14] Second, the response fraction for IIOM data requests is approximately 60%, so that 40% of otherwise eligible patients are automatically excluded during the period covered by IIOM.

Where sample size remains an issue, sampled approaches with the IIOM data appear feasible. Several such approaches exist,[13, 15-20] ranging from using the external data to refine heuristic thinking about the magnitude of residual bias[21] to formally adjusting for a multitude of variables ascertained from the sampled data source (i.e., IIOM) using cohort (e.g., propensity score)[22] or nested case–control[15] methods. With these approaches, the IIOM data could be restricted to use in sensitivity analyses only (whether formal or informal[15, 22]), and the main inference could be drawn from the claims-based study. One difficulty with using the IIOM data in this manner is that we observed some differences in important clinical variables between the eligible women and the source population. These differences appear to reflect response bias, whereby patients in active treatment are more likely to have had their physician's office return the IIOM data collection form. Conditioning eligibility on having at least one IIOM record avoided these complications, but sampled uses of the IIOM data would need to account for the observed differences.[23]

Other limitations of our study warrant mention. Our exposure measure is based on claims for aromatase inhibitor or tamoxifen at the time of the first IIOM evaluation. Because this proxy exposure measure is likely misclassified to some extent, the findings with respect to mortality reduction may in part arise from exposure misclassification. Identification of clinical constructs in claims data can be difficult, which was a principal rationale for this study. However, the potential inaccuracies in the data extend beyond the cancer clinical variables in IIOM. For instance, the linkage between pharmacy submission of claims and patients' receipt and consumption of the medication is assumed and not directly measured, although prior work suggests that medication exposure measures can be accurately derived from pharmacy claims.[24] Because we included prevalent users and women who might have switched between the study drugs prior to the start of follow-up, the values of covariates used for adjustment and matching might have been affected by previous exposure. If any of these covariates were causal intermediaries, then the matching and adjustment would have biased the effect estimates toward the null.[25] In addition, the IIOM variables have unknown measurement characteristics, and data on 40% of patients eligible for IIOM were not obtained because of provider nonresponse.

Because this analysis aimed to evaluate the IIOM data for confounder control in comparative effectiveness studies of oncology, the eligibility criteria for the analysis were broad, resulting in inclusion of a heterogeneous population of patients. This was intentional, as we aimed to evaluate the data with respect to strong confounders, such as metastasis. One result is that the presented data may be of limited utility to clinical practice, but the IIOM data could easily be used to identify clinical subgroups for stratification or restriction in the study, resulting in the identification of subgroups that are not readily identifiable in claims data.[26]

In summary, we were able to address potential unmeasured confounders by linking clinical information to the claims data; however, there was no apparent improvement in confounding control in the chosen example. Conditioning eligibility on these data restricted the sample size substantially.


The authors were employees of OptumInsight, a division of UnitedHealth Group, at the time this research was conducted. Both of these organizations have a stake in the usability of the data presented for research purposes.


  • Pharmacoepidemiologic studies of cancer therapies may be confounded by characteristics of the cancer, including stage and histology, which are strong predictors of treatment and prognosis.
  • These clinically important variables may be unavailable in health insurance claims data, but they have recently become available through routine data collection.
  • In this evaluation of mortality associated with aromatase inhibitor or tamoxifen use, adjustment for cancer staging and histology variables did not affect point estimates but allowed for estimation of the prevalence of otherwise unmeasured potential confounders as well as their associations with exposure and outcomes—the three parameters that determine the degree of confounding.


This research was funded by OptumInsight, Epidemiology Division.