## INTRODUCTION

Electronic healthcare databases are widely used to assess the comparative effectiveness and safety of therapeutics in real-world settings.[1, 2] However, as these databases are not created for research purposes, data on certain important confounders may not be recorded. For example, administrative claims databases rarely, if ever, include direct measures of cigarette smoking, left ventricular ejection fraction, or depression severity.

Confounder data are sometimes available for a subset of patients. For example, measures of cigarette smoking or left ventricular ejection fraction may be recorded for some patients with electronic health records. Administrative claims databases and electronic health records can be linked to each other,[3] or to device and disease registries,[4-7] birth certificates,[8, 9] or survey data[10] to provide additional confounder information that is otherwise not available. However, supplemental information is available only for patients whose records can be found in both data sources and linked successfully.

With a rapid increase in database linkage and the use of electronic health records in comparative effectiveness and safety research, many researchers are now dealing with partial missingness of confounder information. Methods that can handle missing data have been described.[11-13] Here, we discuss and compare several analytic approaches to handle partially missing confounder data in studies that use electronic healthcare databases. We used the relation between non-steroidal anti-inflammatory drugs (NSAIDs) and upper gastrointestinal bleeding (UGIB) as our example, because the relation is well known—randomized trials reported a 40%–60% lower risk of UGIB for selective cyclo-oxygenase-2 inhibitors (coxibs) compared with traditional NSAIDs (tNSAIDs)[14, 15]—and because severe confounding is expected in observational studies as coxibs are likely to be preferentially given to patients who have a higher risk of UGIB.

## EXAMPLE STUDY

### Data source

We analyzed data from The Health Improvement Network database in the United Kingdom,[16, 17] a primary care database of nearly 4 million individuals whose clinical information is recorded by their general practitioner. Available information includes patient demographics; medical diagnoses; free-text comments; referral letters from consultants and hospitalizations; a record of all prescriptions issued; results from clinical examinations and laboratory tests; and additional information such as weight, height, smoking, and alcohol consumption. The Health Improvement Network database uses Read Codes to register medical diagnoses and procedures and a coded drug dictionary based on the Prescription Pricing Authority dictionary to record medications prescribed. The current study was approved by a Multicentre Research Ethics Committee in the UK.

### Study population

The source population included 1 810 442 individuals aged 40–84 years between 1 January 2000 and 31 December 2008 with at least 5 years of enrollment with the general practitioner, at least 1 year of prospectively recorded information after the first recorded prescription in the database, and at least 1 record (e.g., diagnosis) in the year prior to the first day in the study period during which they met all the previously mentioned criteria (*entry date*).

From the source population, we identified all patients with a first prescription of either a tNSAID or a coxib between the entry date and 31 December 2008. We refer to the date of first NSAID prescription as the *index date*. We required eligible patients to have no evidence of any NSAID prescription in the 18 months preceding the index date and no recorded history of cancer (excluding non-melanoma skin cancer), chronic liver disease, Mallory–Weiss syndrome, coagulopathy, esophageal varices, chronic alcoholism, or bariatric or other surgery resulting in gastrojejunal anastomosis any time before the index date. We further excluded patients who initiated both a coxib and a tNSAID on the index date. The remaining 43 569 coxib and 411 616 tNSAID initiators formed our study cohort. We represent the treatment variable by *A* (1: coxib initiation, 0: tNSAID initiation).

Each patient in the cohort was followed from the index date until the earliest occurrence of UGIB, 85 years of age, death, 180 days of follow-up, or 31 December 2008. We selected a short follow-up of up to 180 days to minimize exposure misclassification.
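The censoring rule above can be sketched as taking the earliest of the candidate dates. This is a minimal illustration, not the study's actual code: the function and argument names are hypothetical, and the handling of a 29 February birthday is an assumption.

```python
from datetime import date, timedelta

STUDY_END = date(2008, 12, 31)        # administrative end of the study period
MAX_FOLLOW_UP = timedelta(days=180)   # maximum follow-up after the index date


def birthday(birth_date, age):
    """Date on which a person reaches `age` years (assumption: 29 Feb maps to 1 Mar)."""
    try:
        return birth_date.replace(year=birth_date.year + age)
    except ValueError:  # born on 29 February and the target year is not a leap year
        return date(birth_date.year + age, 3, 1)


def end_of_follow_up(index_date, birth_date, ugib_date=None, death_date=None):
    """Earliest of UGIB, 85th birthday, death, 180 days of follow-up, or study end."""
    candidates = [index_date + MAX_FOLLOW_UP, birthday(birth_date, 85), STUDY_END]
    # Event dates are optional; None means the event was not observed.
    candidates += [d for d in (ugib_date, death_date) if d is not None]
    return min(candidates)
```

For a hypothetical patient with an index date of 1 October 2008 and no events, `end_of_follow_up(date(2008, 10, 1), date(1950, 5, 20))` returns 31 December 2008, because the study end precedes day 180.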

### Outcome

The validation process of potential UGIB cases has been described previously.[18] Briefly, we first searched for Read Codes that suggest UGIB during the follow-up period and then reviewed the computerized medical records (after including free-text comments) to confirm the diagnosis. Our initial computer search identified 468 potential cases of UGIB (73 among coxib initiators) during follow-up, of which 183 (25 among coxib initiators) were confirmed as cases after manual review and included in the analysis. The incidence rate of confirmed UGIB per 1000 person-years was 1.2 for coxib initiators and 0.9 for tNSAID initiators, which was consistent with previous studies.[19-22] We represent the outcome variable by *Y* (1: UGIB, 0: no UGIB).

### Potential confounders

We identified the following potential confounders recorded in the entire study cohort during the 12-month period preceding the index date:[19-22] age; sex; calendar year of treatment initiation; Charlson comorbidity score; use of gastroprotective drugs, anticoagulants, antiplatelets, and oral steroids; diagnosis of osteoarthritis, rheumatoid arthritis, dyspepsia, complicated and uncomplicated peptic ulcer disease, hypertension, congestive heart failure, and coronary artery disease; and three measures of healthcare utilization (numbers of distinct drugs prescribed, physician visits, and hospitalizations in the prior year). We represent these confounders by the vector **X**.

We further considered three supplemental variables—smoking, alcohol consumption, and body mass index—that were recorded in only a subset of the study population. About 78% of the study cohort had information for all three of these lifestyle variables, which are represented by the vector **L** = (*L*_{1}, *L*_{2}, *L*_{3}). Both **X** and **L** include only baseline variables measured before treatment initiation.

### Propensity score analysis

We used propensity scores (PS)[23, 24] to adjust for potential confounders. If the values of the **L** variables were known for all patients, we could fit: (i) a logistic model for Pr[*A* = 1|**X**, **L**] to estimate each patient's PS, that is, the probability of initiating a coxib conditional on their covariate values; and (ii) a logistic model for Pr[*Y* = 1|*A*, *PS*] to estimate the odds ratio (OR) of UGIB for coxib versus tNSAID initiators conditional on the PS (in deciles). This PS analysis would estimate an *intention-to-treat* effect of coxib initiation on the risk of UGIB (conditional on the PS) compared with tNSAID initiation over the study's follow-up in the entire study population. In this study, we had to handle partial missingness in **L** before performing the PS analysis.
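As an illustration of the "PS in deciles" step only, the sketch below assigns each patient a decile index from already-estimated propensity scores. It does not fit either logistic model. The function name and the example scores are hypothetical, and the cut points follow the default (exclusive) method of Python's `statistics.quantiles`, which may differ slightly from the rule used in the study.

```python
from bisect import bisect_right
from statistics import quantiles


def ps_deciles(ps_values):
    """Return a decile index (0 = lowest, 9 = highest) for each propensity score."""
    cut_points = quantiles(ps_values, n=10)  # nine cut points between the deciles
    # bisect_right counts how many cut points lie at or below each score.
    return [bisect_right(cut_points, ps) for ps in ps_values]


# Hypothetical estimated propensity scores for 100 patients.
example_ps = [i / 100 for i in range(1, 101)]
example_deciles = ps_deciles(example_ps)
```

The outcome model would then include the treatment indicator *A* together with indicator variables for nine of the ten decile groups.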

## METHODS TO DEAL WITH PARTIALLY MISSING CONFOUNDERS

We describe two ways to handle missing data for **L**: (i) restricting the outcome model to patients without missing values (complete-case analysis) or (ii) fitting the outcome model to all patients after assigning a value to either **L** or the PS for those with missing values (imputation methods). We analyzed our data under two versions of the complete-case analysis and four versions of imputation.

### Complete-case analysis; unweighted

We defined a missingness indicator *M* (1: if any of the **L** variables is missing, 0: otherwise) and performed the PS analysis described earlier but only among patients with no missing values (*M* = 0). For comparison purposes, we performed an analysis adjusting only for **X** separately in patients with and without missing values in **L**.

### Complete-case analysis; inverse probability weighted

The inverse probability (IP) weight 1/Pr[*M* = 0|**X**] is the inverse (reciprocal) of the probability of *M* = 0 conditional on **X**.[25, 26] To estimate this weight, we (i) defined a missingness indicator *M*_{j} (1 if missing and 0 otherwise, *j* = 1, 2, 3) for each of the three variables smoking *L*_{1}, alcohol *L*_{2}, and body mass index *L*_{3} (smoking had the lowest proportion of missing values and body mass index had the highest; alternative orderings of the three variables did not materially affect the results); (ii) fit three logistic models for Pr[*M*_{1} = 0|**X**], Pr[*M*_{2} = 0|**X**, *M*_{1} = 0], and Pr[*M*_{3} = 0|**X**, *M*_{1} = *M*_{2} = 0]; (iii) calculated the three predicted conditional probabilities for each patient; and (iv) multiplied the three predicted probabilities and used the reciprocal of the product as the IP weight for each patient. We then performed a complete-case analysis identical to the one described earlier, except that each patient was weighted by their estimated IP weight. We used a robust variance estimator to calculate a conservative 95% confidence interval (CI).[27] Using either bootstrapping or a variance estimator that explicitly incorporates how the weights were estimated would produce a narrower 95%CI.[28]

This method attempts to reconstruct the study population without missing values by re-weighting patients with complete information. For example, if a patient had a conditional probability of 0.25 of having no missing values in **L**, the patient would be assigned a weight of 4 (1/0.25). That is, the patient would represent three other patients with similar **X** values but whose data are not included in the outcome model because of a missing value in **L**.
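Steps (iii) and (iv) amount to multiplying the three predicted conditional probabilities and taking the reciprocal. A minimal sketch, assuming the predicted probabilities have already been obtained from the three logistic models (the function and argument names are hypothetical):

```python
from math import prod


def ip_weight(p_m1, p_m2_given_m1, p_m3_given_m1_m2):
    """IP weight for a patient with complete data on all three L variables.

    Each argument is a predicted probability of being non-missing:
    Pr[M1 = 0 | X], Pr[M2 = 0 | X, M1 = 0], and Pr[M3 = 0 | X, M1 = M2 = 0].
    """
    probabilities = (p_m1, p_m2_given_m1, p_m3_given_m1_m2)
    if not all(0.0 < p <= 1.0 for p in probabilities):
        raise ValueError("each probability must lie in (0, 1]")
    # The product estimates Pr[M = 0 | X]; its reciprocal is the weight.
    return 1.0 / prod(probabilities)
```

With hypothetical probabilities of 0.5, 0.5, and 1.0, the product is 0.25 and `ip_weight(0.5, 0.5, 1.0)` returns 4.0, matching the example in the text.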

### Missing-category and missing-indicator approach

In the missing-category approach, we created an additional category within each variable in **L** for patients with missing values. We then conducted the PS analysis described earlier. In the missing-indicator approach,[11, 29] we estimated the PS via a logistic model for coxib initiation that included the **X** variables, the missing indicators *M*_{j}, and the product terms *L*_{j}(1*−M*_{j}).
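The missing-indicator coding can be sketched for one categorical lifestyle variable as follows. This is an illustration under assumptions: the category labels are hypothetical, missing values are represented by `None`, and *L*_{j} is dummy-coded against a reference category, which the text does not specify.

```python
def missing_indicator_terms(value, levels):
    """Return [M_j, L_j(1 - M_j) dummies] for one patient and one variable.

    value  -- observed category, or None if missing
    levels -- non-reference categories, one dummy variable per level
    """
    m = 1 if value is None else 0
    # Each product term equals the dummy when observed and 0 when missing.
    products = [(1 - m) * (1 if value == level else 0) for level in levels]
    return [m] + products


smoking_levels = ["current", "past"]  # hypothetical; "non-smoker" is the reference
```

Here `missing_indicator_terms("current", smoking_levels)` gives `[0, 1, 0]`, while a patient with missing smoking status gives `[1, 0, 0]`. These terms would enter the PS model alongside the **X** variables.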

### Single imputation

We replaced the missing values for each **L** variable by the value of its most common category and conducted the PS analysis described previously.
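A minimal sketch of this single imputation for one variable, assuming missing values are stored as `None` (the function name and category labels are hypothetical). If two categories tie for most common, `Counter.most_common` returns the one encountered first.

```python
from collections import Counter


def impute_most_common(values):
    """Replace None with the most common observed category."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to impute from")
    most_common_value, _count = Counter(observed).most_common(1)[0]
    return [most_common_value if v is None else v for v in values]


smoking = ["non-smoker", None, "current", "non-smoker", None]
smoking_imputed = impute_most_common(smoking)
```

In this hypothetical example, both missing entries are set to `"non-smoker"`.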

### Multiple imputation by chained equations[30, 31]

*Iteration 1*: We fit a multinomial logistic regression model for Pr[*L*_{1} = *l*|**X**, *A*, *Y*, *M*_{1} = 0].[32, 33] The parameter estimates from the model defined a conditional multinomial distribution, from which we drew values of *L*_{1} to impute them to patients with missing *L*_{1}. We then fit a second multinomial model for Pr[*L*_{2} = *l*|**X**, *A*, *Y*, *L*_{1}^{*}, *M*_{2} = 0], where *L*_{1}^{*} is the partially imputed smoking status, to impute the missing values for *L*_{2}. Finally, we fit a multinomial model for Pr[*L*_{3} = *l*|**X**, *A*, *Y*, *L*_{1}^{*}, *L*_{2}^{*}, *M*_{3} = 0] to impute the missing values for *L*_{3}. This first iteration of the procedure resulted in a dataset in which all missing values for variables **L** were imputed.

*Iteration 2*: We repeated the previously mentioned procedure using the imputed data set from the first iteration. We removed the imputed values of *L*_{1} and re-imputed them with a model conditional on *A*, **X**, *Y*, *L*_{2}^{*}, and *L*_{3}^{*}. We then re-imputed the missing values of *L*_{2} and *L*_{3}.

The procedure was repeated until 10 iterations were completed or until stable imputed values were obtained. We then repeated the iterative procedure 10 times to create 10 imputed datasets, in each of which we conducted the PS analysis described earlier. We then combined the OR estimates from the imputed datasets.[34] We performed this multiple imputation analysis using the IVEWare package for SAS software developed by the Survey Research Center, Institute for Social Research, at the University of Michigan (http://www.isr.umich.edu/src/smp/ive/).
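The text does not spell out how the estimates were combined. A common choice is Rubin's rules applied on the log OR scale, and the sketch below assumes that choice. The function name is hypothetical, and the inputs are the log OR and its standard error from each imputed dataset.

```python
from math import sqrt
from statistics import mean, variance


def pool_log_odds_ratios(log_ors, std_errors):
    """Pool per-dataset log ORs with Rubin's rules; returns (pooled log OR, pooled SE)."""
    m = len(log_ors)
    if m < 2 or len(std_errors) != m:
        raise ValueError("need matching estimates from at least two imputed datasets")
    pooled = mean(log_ors)
    within = mean([se ** 2 for se in std_errors])  # average within-imputation variance
    between = variance(log_ors)                    # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, sqrt(total)
```

The pooled OR is the exponential of the pooled log OR. A confidence interval would normally use a *t* reference distribution whose degrees of freedom depend on the within and between variances, which this sketch omits.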

### Propensity score calibration

This method imputes the value of the PS, rather than the value of the variables **L**, in patients with missing values in **L**.[35] Imputing the PS can be conceptualized as a measurement error issue, which can be corrected using regression calibration if the true or *gold-standard* PS, *PS*_{gs}, can be correctly estimated from an internal or external validation sample.[36, 37] We attempted to create an internal validation sample by randomly selecting 300 000 patients among those with complete information on both **X** and **L** with the same age and sex joint distribution as the entire study cohort.

In the entire study cohort, we estimated an *error-prone* PS or *PS*_{ep} via the logistic model for Pr[*A* = 1|**X**] and then included the estimated *PS*_{ep} as a linear continuous covariate in the logistic outcome model $\operatorname{logit}\Pr[Y = 1 \mid A, \widehat{PS}_{ep}] = \beta_0 + \beta_1 A + \beta_2 \widehat{PS}_{ep}$.

In the validation sample, we estimated *PS*_{gs} via the logistic model for Pr[*A* = 1|**X**, **L**] and fit the linear regression model $E[PS_{gs} \mid A, \widehat{PS}_{ep}] = \delta_0 + \delta_1 A + \delta_2 \widehat{PS}_{ep}$.

The regression calibration-adjusted estimator for the treatment effect was $\hat{\eta}_1 = \hat{\beta}_1 - \hat{\delta}_1 \hat{\beta}_2 / \hat{\delta}_2$. We conducted the analysis by using a SAS macro by Spiegelman and Logan, which is publicly available at http://www.hsph.harvard.edu/faculty/donna-spiegelman/software/blinplus-macro/index.html.
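The form of the adjusted estimator can be motivated by substituting the measurement error model into a hypothetical outcome model that conditions on the gold-standard PS. The coefficients $\eta_0$ and $\eta_2$ below are notation introduced for this sketch only, and the third line is the usual regression calibration approximation rather than an exact identity.

```latex
\begin{align*}
\operatorname{logit}\Pr[Y = 1 \mid A, PS_{gs}]
  &= \eta_0 + \eta_1 A + \eta_2 PS_{gs} \\
E[PS_{gs} \mid A, \widehat{PS}_{ep}]
  &= \delta_0 + \delta_1 A + \delta_2 \widehat{PS}_{ep} \\
\operatorname{logit}\Pr[Y = 1 \mid A, \widehat{PS}_{ep}]
  &\approx (\eta_0 + \eta_2 \delta_0) + (\eta_1 + \eta_2 \delta_1) A
     + \eta_2 \delta_2 \widehat{PS}_{ep} \\
\beta_1 = \eta_1 + \eta_2 \delta_1, \quad \beta_2 = \eta_2 \delta_2
  &\;\Longrightarrow\; \eta_1 = \beta_1 - \frac{\delta_1 \beta_2}{\delta_2}
\end{align*}
```

Matching coefficients between the error-prone outcome model and the substituted model gives the correction: the coefficient of *A* in the error-prone model is reduced by the part of the PS effect that the measurement error model attributes to *A*.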

## RESULTS

Table 1 shows the distribution of baseline characteristics of initiators of coxibs and tNSAIDs ascertained during the 12-month period before the first NSAID prescription. The crude OR of UGIB for coxib initiators versus tNSAID initiators was 1.50 (95%CI: 0.98, 2.28). The OR was 1.04 (0.68, 1.59) after adjustment for age and sex, 0.98 (0.63, 1.52) upon further adjustment for calendar year of treatment initiation, and 0.84 (0.54, 1.31) after further adjustment for measures of healthcare utilization. When we further adjusted for all remaining confounders in **X**, the OR was 0.81 (0.52, 1.27) for the entire study cohort, 0.64 (0.38, 1.07) for the 78% of patients in the cohort with complete information on all three lifestyle variables in **L**, and 1.93 (0.78, 4.74) for patients with missing values on any of the three lifestyle variables.
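As a check on the crude estimate, the OR can be recomputed from the counts reported earlier (25 confirmed cases among 43 569 coxib initiators and 183 − 25 = 158 among 411 616 tNSAID initiators). The sketch assumes a simple 2 × 2 table and a Wald (Woolf) interval on the log scale; the text does not state how the reported interval was computed, and the function name is hypothetical.

```python
from math import exp, log, sqrt


def odds_ratio_wald(a, b, c, d, z=1.96):
    """OR and Wald 95% CI for a 2x2 table.

    a, b -- cases and non-cases among the exposed (coxib)
    c, d -- cases and non-cases among the unexposed (tNSAID)
    """
    odds_ratio = (a * d) / (b * c)
    se_log_or = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = exp(log(odds_ratio) - z * se_log_or)
    upper = exp(log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


coxib_cases, coxib_total = 25, 43_569
tnsaid_cases, tnsaid_total = 183 - 25, 411_616
crude = odds_ratio_wald(
    coxib_cases, coxib_total - coxib_cases,
    tnsaid_cases, tnsaid_total - tnsaid_cases,
)
```

Rounded to two decimals, `crude` is (1.50, 0.98, 2.28), which agrees with the crude OR and 95%CI reported above.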

Table 1. Baseline characteristics of initiators of selective cyclo-oxygenase-2 inhibitors (coxibs) or non-selective (*traditional*) non-steroidal anti-inflammatory drugs (tNSAIDs) ascertained during the 12-month period before the first NSAID prescription

| Characteristic | Coxib, complete **L** (%) | tNSAID, complete **L** (%) | Coxib, missing **L** (%) | tNSAID, missing **L** (%) |
| --- | --- | --- | --- | --- |
| Age (years) | | | | |
| 40–44 | 6.7 | 18.9 | 7.9 | 20.4 |
| 45–49 | 8.2 | 14.3 | 9.4 | 15.9 |
| 50–54 | 10.5 | 13.9 | 11.4 | 14.8 |
| 55–59 | 13.2 | 13.7 | 11.8 | 12.9 |
| 60–64 | 12.6 | 12.6 | 11.7 | 11.0 |
| 65–69 | 13.4 | 9.7 | 11.6 | 8.5 |
| 70–74 | 14.3 | 8.0 | 12.9 | 7.0 |
| 75–79 | 12.6 | 5.7 | 12.7 | 5.7 |
| 80–84 | 8.6 | 3.3 | 10.7 | 3.9 |
| Male sex | 35.0 | 41.3 | 39.5 | 50.2 |
| Calendar year of treatment initiation | | | | |
| 2000 | 5.0 | 9.7 | 5.7 | 12.0 |
| 2001 | 10.9 | 11.5 | 13.5 | 14.1 |
| 2002 | 19.3 | 10.9 | 22.1 | 13.5 |
| 2003 | 24.6 | 10.3 | 25.6 | 12.0 |
| 2004 | 27.3 | 10.6 | 23.0 | 11.0 |
| 2005 | 2.3 | 11.9 | 2.3 | 11.0 |
| 2006 | 3.4 | 11.5 | 2.7 | 9.6 |
| 2007 | 3.6 | 11.9 | 2.7 | 8.9 |
| 2008 | 3.7 | 11.7 | 2.3 | 7.9 |
| No. of distinct drugs in the prior year | | | | |
| 0–2 | 22.7 | 42.2 | 32.0 | 53.3 |
| 3–4 | 17.7 | 20.5 | 19.5 | 19.4 |
| 5–7 | 22.2 | 18.5 | 20.0 | 15.0 |
| ≥8 | 37.3 | 18.8 | 28.5 | 12.4 |
| No. of outpatient visits in the prior year | | | | |
| 0–3 | 19.5 | 29.9 | 34.3 | 46.1 |
| 4–6 | 20.7 | 23.5 | 23.0 | 23.3 |
| 7–10 | 22.3 | 21.0 | 19.1 | 16.0 |
| ≥11 | 37.5 | 25.6 | 23.7 | 14.6 |
| Hospitalized in the prior year | 9.8 | 8.1 | 7.2 | 6.1 |
| Charlson comorbidity score ≥1 | 41.4 | 27.9 | 33.5 | 20.4 |
| Prior use of | | | | |
| Gastroprotective drugs | 29.2 | 11.8 | 26.2 | 9.4 |
| Anticoagulants | 2.0 | 0.8 | 1.7 | 0.7 |
| Antiplatelets | 21.1 | 12.7 | 14.4 | 7.7 |
| Oral steroids | 9.8 | 5.0 | 9.2 | 4.2 |
| Diagnosis of | | | | |
| Osteoarthritis | 39.8 | 21.3 | 31.8 | 16.5 |
| Rheumatoid arthritis | 3.3 | 1.2 | 3.2 | 1.2 |
| Dyspepsia | 4.6 | 2.1 | 3.2 | 1.4 |
| Peptic ulcer disease | 0.4 | 0.1 | 0.3 | 0.1 |
| Hypertension | 40.5 | 29.9 | 26.5 | 17.0 |
| Congestive heart failure | 2.9 | 1.1 | 2.5 | 0.9 |
| Coronary artery disease | 15.9 | 8.3 | 10.1 | 4.3 |
| Smoking | | | | |
| Non-smoker | 55.9 | 55.2 | 35.8 | 37.6 |
| Current smoker | 19.9 | 21.7 | 16.4 | 18.1 |
| Past smoker | 24.3 | 23.1 | 12.8 | 12.6 |
| Unknown | -- | -- | 35.0 | 31.7 |
| Alcohol consumption (drinks/week) | | | | |
| None | 49.6 | 44.4 | 14.9 | 13.6 |
| 1–9 | 33.4 | 34.6 | 7.9 | 8.8 |
| 10–19 | 10.7 | 12.9 | 2.5 | 3.6 |
| ≥20 | 6.3 | 8.1 | 1.8 | 3.0 |
| Unknown | -- | -- | 73.0 | 71.0 |
| Body mass index (kg/m^{2}) | | | | |
| <18.5 | 1.3 | 1.1 | 0.5 | 0.4 |
| 18.5–24.9 | 33.9 | 36.7 | 8.0 | 8.6 |
| 25.0–29.9 | 39.8 | 38.8 | 9.2 | 9.2 |
| 30.0–34.9 | 17.6 | 16.0 | 4.9 | 4.6 |
| ≥35 | 7.4 | 7.4 | 2.8 | 2.4 |
| Unknown | -- | -- | 74.7 | 74.9 |

Table 2 shows the results from different analytic approaches to deal with missing confounder data. The adjusted ORs were 0.65 and 0.67 for the unweighted and IP-weighted complete-case analyses, respectively. In the IP-weighted analysis, the weight had a mean of 1.28 (standard deviation 0.15) and ranged from 1.04 to 2.49. The adjusted ORs ranged between 0.80 and 0.83 for the imputation methods. The 95%CIs from the different methods were overlapping; the 95%CI for any estimate included all other point estimates.

Table 2. Odds ratios of upper gastrointestinal bleeding during the first 180 days following initiation of selective cyclo-oxygenase-2 inhibitors versus non-selective non-steroidal anti-inflammatory drugs, by different analytic approaches to incorporate supplemental confounder data available in a subset of the study cohort

| Analytic approach | No. of patients | OR (95%CI) | Standard error of log OR |
| --- | --- | --- | --- |
| Complete-case analysis; unweighted | 354 426 | 0.65 (0.39, 1.09) | 0.27 |
| Complete-case analysis; inverse probability weighted | 354 426 (outcome/PS model); 455 185 (weight model) | 0.67 (0.38, 1.16) | 0.28 |
| Missing-category approach | 455 185 | 0.81 (0.51, 1.26) | 0.23 |
| Missing-indicator approach | 455 185 | 0.80 (0.51, 1.25) | 0.23 |
| Single imputation | 455 185 | 0.83 (0.53, 1.30) | 0.23 |
| Multiple imputation | 455 185 | 0.82 (0.52, 1.29) | 0.23 |
| PS calibration | 455 185 (error-prone PS model); 300 000 (gold-standard PS model) | 0.80 (0.50, 1.27) | 0.24 |

Results from all approaches did not materially change when the PS was included as a continuous variable instead of deciles in the outcome model (as was necessary for the PS calibration approach). The *c*-statistic for the PS model was around 0.80 for all analyses, and the covariates were overall well balanced within PS strata (data not shown).

## DISCUSSION

We have reviewed and compared several approaches to deal with partially missing confounder information in electronic healthcare databases. We used the NSAID–UGIB example to illustrate their application to comparative effectiveness and safety research of therapeutics. All these methods require the assumptions of no unmeasured confounding for the effect of treatment on the outcome and no misspecification of the outcome and PS models.

The missing-category and missing-indicator approaches and single imputation by the most common category require additional assumptions that are generally implausible. In essence, they all assume that patients with missing information on certain variables are unconditionally exchangeable and can be grouped together for analysis. Single imputation by the most common category goes a step further and assumes that patients with missing data are comparable not only with each other but also with patients who have a certain (often arbitrarily chosen) covariate value. Although these methods are easy to implement, they have been shown to produce biased estimates even when patients with and without missing data are unconditionally exchangeable (i.e., data *missing completely at random*).[11, 38-40]

Multiple imputation requires that missingness be unassociated with the outcome conditional on the measured confounders or the corresponding PS (i.e., data *missing at random*) and that the imputation model for each covariate with missing data be correctly specified.[34] The approach has been shown to provide more valid estimates than the missing-indicator approach and single imputation when these assumptions are true.[11, 34, 39-42] A recent study that used The Health Improvement Network database[43] found that patients with missing information on smoking, alcohol consumption, weight, or height differ systematically from the others in terms of comorbidities such as cardiovascular disease and chronic obstructive pulmonary disease. Our estimate from multiple imputation would be incorrect if missingness was associated with other prognostic factors that were not included in the analysis. We used a version of multiple imputation that does not require the often unrealistic assumption of joint multivariate normality.[30, 31]

The PS calibration approach is valid under the assumptions that there is an appropriate internal or external validation sample, the linear measurement error model is correctly specified, and the error-prone PS is an appropriate surrogate for the gold-standard PS.[35, 44] The last assumption may be violated if the direction of confounding from the unmeasured or partially measured confounders is in the opposite direction to that from the measured covariates.[44] This approach may be combined with single imputation of the gold-standard PS based on the parameters of the measurement error model to do away with the need to specify the outcome model through matching or stratifying on the imputed gold-standard PS.[45]

Yet, despite all these differences in the conditions required for valid estimates, we found only small differences across different imputation methods. The reasons might be that the proportion of missingness was relatively low and that the three variables with missing values might not be strong confounders after conditioning on other measured variables. Indeed, the OR adjusted for all potential confounders available in the entire study cohort (0.81) was similar to the ORs that were further adjusted for the three lifestyle variables by using different imputation approaches (0.80–0.83).

Like the imputation methods, the IP-weighted complete-case analysis estimates the effect in the entire study population.[25, 46] It is valid under an additional assumption that the weight models are correctly specified. The unweighted complete-case analysis estimates the effect only among patients without missing values; its results cannot be applied to the entire study population unless the data are missing completely at random. The unweighted complete-case analysis has been shown to produce more biased estimates than other approaches, such as multiple imputation.[11, 39, 40]

The point estimates of complete-case analyses and imputation methods were somewhat different, which may be due to random variability (wide and overlapping 95%CIs) or to real differences between patients with complete and incomplete confounder information beyond the information recorded in the database. For example, general practitioners who record patient lifestyle factors—and patients who respond to these questions—might have certain unmeasured characteristics that are associated with the outcome risk. Also, the effect of NSAIDs on UGIB might be modified by certain patient characteristics for which *missingness* is a proxy.

In conclusion, a number of methods are available to deal with missing data in comparative effectiveness and safety studies of therapeutics that analyze electronic healthcare databases. Researchers need to be aware of the underlying assumptions of various methods when choosing among them.

## ACKNOWLEDGEMENTS

The authors would like to thank Ken Kleinman, ScD from Harvard Medical School and Harvard Pilgrim Health Care Institute for his thoughtful comments on an earlier draft of this paper. Dr. Toh is partially supported by R03HS019024. Dr. Hernán is partially supported by R01HL080644.