Analyzing partially missing confounder information in comparative effectiveness and safety research of therapeutics




Electronic healthcare databases are commonly used in comparative effectiveness and safety research of therapeutics. Many databases now include additional confounder information in a subset of the study population through data linkage or data collection. We described and compared existing methods for analyzing such datasets.


Using data from The Health Improvement Network and the relation between non-steroidal anti-inflammatory drugs and upper gastrointestinal bleeding as an example, we employed several methods to handle partially missing confounder information.


The crude odds ratio (OR) of upper gastrointestinal bleeding was 1.50 (95% confidence interval: 0.98, 2.28) among selective cyclo-oxygenase-2 inhibitor initiators (n = 43 569) compared with traditional non-steroidal anti-inflammatory drug initiators (n = 411 616). The OR dropped to 0.81 (0.52, 1.27) upon adjustment for confounders recorded for all patients. When further considering three additional variables missing in 22% of the study population (smoking, alcohol consumption, body mass index), the OR was between 0.80 and 0.83 for the missing-category approach, the missing-indicator approach, single imputation by the most common category, multiple imputation by chained equations, and propensity score calibration. The OR was 0.65 (0.39, 1.09) and 0.67 (0.38, 1.16) for the unweighted and the inverse probability weighted complete-case analysis, respectively.


Existing methods for handling partially missing confounder data require different assumptions and may produce different results. The unweighted complete-case analysis, the missing-category/indicator approach, and single imputation require assumptions that are often unrealistic and should be avoided. In this study, differences across methods were not substantial, likely because of the relatively low proportion of missingness and the weak confounding effect of the three additional variables after adjustment for other variables. Copyright © 2012 John Wiley & Sons, Ltd.


Electronic healthcare databases are widely used to assess the comparative effectiveness and safety of therapeutics in real-world settings.[1, 2] However, as these databases are not created for research purposes, data on certain important confounders may not be recorded. For example, administrative claims databases rarely, if ever, include direct measures of cigarette smoking, left ventricular ejection fraction, or depression severity.

Confounder data are sometimes available for a subset of patients. For example, measures of cigarette smoking or left ventricular ejection fraction may be recorded for some patients with electronic health records. Administrative claims databases and electronic health records can be linked to each other,[3] or to device and disease registries,[4-7] birth certificates,[8, 9] or survey data[10] to provide additional confounder information that is otherwise not available. However, supplemental information is available only for patients whose records can be found in both data sources and linked successfully.

With a rapid increase in database linkage and the use of electronic health records in comparative effectiveness and safety research, many researchers are now dealing with partial missingness of confounder information. Methods that can handle missing data have been described.[11-13] Here, we discuss and compare several analytic approaches to handle partially missing confounder data in studies that use electronic healthcare databases. We used the relation between non-steroidal anti-inflammatory drugs (NSAIDs) and upper gastrointestinal bleeding (UGIB) as our example, because the relation is well known—randomized trials reported a 40%–60% lower risk of UGIB for selective cyclo-oxygenase-2 inhibitors (coxibs) compared with traditional NSAIDs (tNSAIDs)[14, 15]—and because severe confounding is expected in observational studies as coxibs are likely to be preferentially given to patients who have a higher risk of UGIB.


Data source

We analyzed data from The Health Improvement Network database in the United Kingdom,[16, 17] a primary care database of nearly 4 million individuals whose clinical information is recorded by their general practitioner. Available information includes patient demographics; medical diagnoses; free-text comments; referral letters from consultants and hospitalizations; a record of all prescriptions issued; results from clinical examinations and laboratory tests; and additional information such as weight, height, smoking, and alcohol consumption. The Health Improvement Network database uses Read Codes to register medical diagnoses and procedures and a coded drug dictionary based on the Prescription Pricing Authority dictionary to record medications prescribed. The current study was approved by a Multicentre Research Ethics Committee in the UK.

Study population

The source population included 1 810 442 individuals aged 40–84 years between 1 January 2000 and 31 December 2008 with at least 5 years of enrollment with the general practitioner, at least 1 year of prospectively recorded information after the first recorded prescription in the database, and at least 1 record (e.g., diagnosis) in the year prior to the first day in the study period during which they met all the previously mentioned criteria (entry date).

From the source population, we identified all patients with a first prescription of either a tNSAID or a coxib between the entry date and 31 December 2008. We refer to the date of first NSAID prescription as the index date. We required eligible patients to have no evidence of any NSAID prescription in the 18 months preceding the index date and no recorded history of cancer (excluding non-melanoma skin cancer), chronic liver disease, Mallory–Weiss syndrome, coagulopathy, esophageal varices, chronic alcoholism, or bariatric or other surgery resulting in gastrojejunal anastomosis any time before the index date. We further excluded patients who initiated both a coxib and a tNSAID on the index date. The remaining 43 569 coxib and 411 616 tNSAID initiators formed our study cohort. We represent the treatment variable by A (1: coxib initiation, 0: tNSAID initiation).

Each patient in the cohort was followed from the index date until the earliest occurrence of UGIB, 85 years of age, death, 180 days of follow-up, or 31 December 2008. We selected a short follow-up of up to 180 days to minimize exposure misclassification.
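The censoring rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function and argument names are hypothetical, "180 days of follow-up" is taken to mean the index date plus 180 calendar days, and a 29 February birthday is mapped to 1 March in non-leap years.

```python
from datetime import date, timedelta
from typing import Optional

STUDY_END = date(2008, 12, 31)        # administrative end of the study period
MAX_FOLLOW_UP = timedelta(days=180)   # maximum follow-up after the index date


def birthday_at_age(birth_date: date, age: int) -> date:
    """Date on which a person reaches `age` (29 Feb maps to 1 Mar; an assumption)."""
    try:
        return birth_date.replace(year=birth_date.year + age)
    except ValueError:  # born on 29 February and the target year is not a leap year
        return date(birth_date.year + age, 3, 1)


def end_of_follow_up(index_date: date,
                     birth_date: date,
                     ugib_date: Optional[date] = None,
                     death_date: Optional[date] = None) -> date:
    """Earliest of UGIB, 85th birthday, death, 180 days of follow-up, or study end."""
    candidates = [index_date + MAX_FOLLOW_UP,
                  birthday_at_age(birth_date, 85),
                  STUDY_END]
    # Event dates are optional: most patients have neither UGIB nor death recorded.
    candidates += [d for d in (ugib_date, death_date) if d is not None]
    return min(candidates)
```

For a hypothetical patient with an index date of 1 March 2005 and no events, follow-up would end 180 days later; a patient starting treatment in November 2008 would be censored at the end of the study period.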


The validation process of potential UGIB cases has been described previously.[18] Briefly, we first searched for Read Codes that suggest UGIB during the follow-up period and then reviewed the computerized medical records (after including free-text comments) to confirm the diagnosis. Our initial computer search identified 468 potential cases of UGIB (73 among coxib initiators) during follow-up, of which 183 (25 among coxib initiators) were confirmed as cases after manual review and included in the analysis. The incidence rate of confirmed UGIB per 1000 person-years was 1.2 for coxib initiators and 0.9 for tNSAID initiators, which was consistent with previous studies.[19-22] We represent the outcome variable by Y (1: UGIB, 0: no UGIB).

Potential confounders

We identified the following potential confounders recorded in the entire study cohort during the 12-month period preceding the index date:[19-22] age; sex; calendar year of treatment initiation; Charlson comorbidity score; use of gastroprotective drugs, anticoagulants, antiplatelets, and oral steroids; diagnosis of osteoarthritis, rheumatoid arthritis, dyspepsia, complicated and uncomplicated peptic ulcer disease, hypertension, congestive heart failure, and coronary artery disease; and three measures of healthcare utilization (numbers of distinct drugs prescribed, physician visits, and hospitalizations in the prior year). We represent these confounders by the vector X.

We further considered three supplemental variables—smoking, alcohol consumption, and body mass index—that were recorded in only a subset of the study population. About 78% of the study cohort had information for all three of these lifestyle variables, which are represented by the vector L = (L1, L2, L3). Both X and L include only baseline variables measured before treatment initiation.

Propensity score analysis

We used the propensity score (PS)[23, 24] to adjust for potential confounders. If the values of the L variables were known for all patients, we could fit: (i) a logistic model for Pr[A = 1|X, L] to estimate each patient's PS, that is, the probability of initiating a coxib conditional on his or her covariate values; and (ii) a logistic model for Pr[Y = 1|A, PS] to estimate the odds ratio (OR) of UGIB for coxib versus tNSAID initiators conditional on the PS (in deciles). This PS analysis would estimate an intention-to-treat effect of coxib initiation on the risk of UGIB (conditional on the PS) compared with tNSAID initiation over the study's follow-up in the entire study population. In this study, we had to handle partial missingness in L before performing the PS analysis.
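The decile step of the outcome model can be illustrated as follows. This is a minimal sketch, not the analysis code used in the study: it assumes the PS values have already been estimated from a fitted logistic model, the function name is hypothetical, and the cut points use the default interpolation of Python's `statistics.quantiles`, which may differ slightly from the convention of other software.

```python
from bisect import bisect_right
from statistics import quantiles


def ps_deciles(ps_values):
    """Map each propensity score to a decile index (0 = lowest, 9 = highest).

    Cut points are computed once on the whole cohort, so the same
    boundaries apply to coxib and tNSAID initiators.
    """
    cuts = quantiles(ps_values, n=10)  # nine cut points separating ten groups
    return [bisect_right(cuts, ps) for ps in ps_values]
```

The resulting decile index would then enter the outcome model for Pr[Y = 1|A, PS] as a categorical covariate alongside the treatment indicator A.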


We describe two ways to handle missing data for L: (i) restricting the outcome model to patients without missing values (complete-case analysis) or (ii) fitting the outcome model to all patients after assigning a value to either L or the PS for those with missing values (imputation methods). We analyzed our data under two versions of the complete-case analysis and four versions of imputation.

Complete-case analysis; unweighted

We defined a missingness indicator M (1: if any of the L variables is missing, 0: otherwise) and performed the PS analysis described earlier but only among patients with no missing values (M = 0). For comparison purposes, we performed an analysis adjusting only for X separately in patients with and without missing values in L.

Complete-case analysis; inverse probability weighted

The inverse probability (IP) weight 1/Pr[M = 0|X] is the inverse (reciprocal) of the probability of M = 0 conditional on X.[25, 26] To estimate this weight, we (i) defined a missingness indicator Mj (1 if missing and 0 otherwise, j = 1, 2, 3) for each of the three variables smoking L1, alcohol L2, and body mass index L3, ordered from the lowest proportion of missing values (smoking) to the highest (body mass index); alternative orderings of the three variables did not materially affect the results; (ii) fit three logistic models for Pr[M1 = 0|X], Pr[M2 = 0|X, M1 = 0], and Pr[M3 = 0|X, M1 = 0, M2 = 0]; (iii) calculated the three predicted conditional probabilities for each patient; and (iv) multiplied the three predicted probabilities and used the reciprocal of the product as the IP weight for each patient. We then performed a complete-case analysis identical to the one described earlier, except that each patient was weighted by his or her estimated IP weight. We used a robust variance estimator to calculate a conservative 95% confidence interval (CI).[27] Using either bootstrapping or a variance estimator that explicitly incorporates how the weights were estimated would produce a narrower 95%CI.[28]

This method attempts to reconstruct the study population without missing values by re-weighting patients with complete information. For example, if a patient had a conditional probability of 0.25 of having no missing values in L, the patient would be assigned a weight of 4 (1/0.25). That is, the patient would represent three other patients with similar X values but whose data are not included in the outcome model because of a missing value in L.
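The weight calculation in steps (iii) and (iv) can be sketched as below. The function name is hypothetical, and the three inputs stand for a patient's predicted probabilities of having smoking, alcohol consumption, and body mass index observed, each conditional on X and on the preceding variables being observed; fitting the logistic models that produce them is not shown.

```python
from math import prod


def ip_weight(conditional_probs):
    """Inverse probability weight: reciprocal of the product of the
    predicted conditional probabilities of each L variable being observed."""
    if any(not 0.0 < p <= 1.0 for p in conditional_probs):
        raise ValueError("each probability must lie in (0, 1]")
    return 1.0 / prod(conditional_probs)
```

For example, hypothetical predicted probabilities of 0.5, 0.5, and 1.0 give a probability of 0.25 of having no missing values and therefore a weight of 4, matching the example in the text.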

Missing-category and missing-indicator approach

In the missing-category approach, we created an additional category within each variable in L for patients with missing values. We then conducted the PS analysis described earlier. In the missing-indicator approach,[11, 29] we estimated the PS via a logistic model for coxib initiation that included the X variables, the missing indicators Mj, and the product terms Lj(1−Mj).
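The two codings can be sketched as follows. This is an illustration only: the helper names are hypothetical, `None` marks a missing value, and each L variable is treated as a single numeric code for brevity (a categorical variable would contribute one product term per dummy variable).

```python
def missing_category(value, label="missing"):
    """Missing-category coding: missing values form their own category."""
    return label if value is None else value


def missing_indicator_terms(l_values):
    """Missing-indicator coding for one patient.

    Returns [M1, L1*(1 - M1), M2, L2*(1 - M2), ...], so a missing Lj
    contributes Mj = 1 and a product term of 0.
    """
    terms = []
    for value in l_values:
        m = 1 if value is None else 0
        terms.extend([m, 0.0 if m else float(value)])
    return terms
```

These terms would be added to the X variables in the logistic model for coxib initiation.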

Single imputation

We replaced the missing values for each L variable by the value of its most common category and conducted the PS analysis described previously.
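A minimal sketch of this imputation for a single categorical variable is shown below. The function name and category labels are hypothetical, and `None` marks a missing value. If two categories are tied, `Counter.most_common` returns the one encountered first.

```python
from collections import Counter


def impute_most_common(values):
    """Replace missing values (None) with the most common observed category."""
    observed = [v for v in values if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]
```

Every patient with a missing value receives the same category, which is why the approach treats them as comparable with patients who actually have that value.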

Multiple imputation by chained equations[30, 31]

Iteration 1: We fit a multinomial logistic regression model for Pr[L1 = l|X, A, Y, M1 = 0].[32, 33] The parameter estimates from the model defined a conditional multinomial distribution, from which we drew values of L1 to impute for patients with missing L1. We then fit a second multinomial model for Pr[L2 = l|X, A, Y, L1*, M2 = 0], where L1* is the partially imputed smoking status, to impute the missing values for L2. Finally, we fit a multinomial model for Pr[L3 = l|X, A, Y, L1*, L2*, M3 = 0], where L2* is the partially imputed alcohol consumption, to impute the missing values for L3. This first iteration of the procedure resulted in a dataset in which all missing values for variables L were imputed.

Iteration 2: We repeated the previously mentioned procedure using the imputed dataset from the first iteration. We removed the imputed values of L1 and re-imputed them with a model conditional on A, X, Y, the partially imputed alcohol consumption L2*, and the partially imputed body mass index L3*. We then re-imputed the missing values of L2 and L3.

The procedure was repeated until 10 iterations were completed or until stable imputed values were obtained. We then repeated the iterative procedure 10 times to create 10 imputed datasets, in each of which we conducted the PS analysis described earlier. We then combined the OR estimates from the imputed datasets.[34] We performed this multiple imputation analysis using the IVEWare package for SAS software developed by the Survey Research Center, Institute for Social Research, at the University of Michigan.
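The combination step can be illustrated with Rubin's rules, the standard way to pool estimates across multiply imputed datasets. We assume, without confirmation from the text, that this is the rule behind reference [34]. The function name is hypothetical, and the interval uses a normal approximation rather than the t-based reference distribution, which is a simplification.

```python
from math import exp, sqrt
from statistics import mean, variance


def pool_log_odds_ratios(log_ors, std_errors, z=1.96):
    """Pool log odds ratios from m imputed datasets with Rubin's rules.

    Returns (pooled OR, lower limit, upper limit, pooled SE of the log OR).
    """
    m = len(log_ors)
    pooled = mean(log_ors)                      # average point estimate
    within = mean([se ** 2 for se in std_errors])  # average within-imputation variance
    between = variance(log_ors)                 # between-imputation variance (divisor m - 1)
    total_se = sqrt(within + (1 + 1 / m) * between)
    return (exp(pooled),
            exp(pooled - z * total_se),
            exp(pooled + z * total_se),
            total_se)
```

The between-imputation term inflates the standard error to reflect uncertainty about the imputed values, which single imputation ignores.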

Propensity score calibration

This method imputes the value of the PS, rather than the value of the variables L, in patients with missing values in L.[35] Imputing the PS can be conceptualized as a measurement error issue, which can be corrected using regression calibration if the true or gold-standard PS, PSgs, can be correctly estimated from an internal or external validation sample.[36, 37] We attempted to create an internal validation sample by randomly selecting 300 000 patients among those with complete information on both X and L with the same age and sex joint distribution as the entire study cohort.

In the entire study cohort, we estimated an error-prone PS, PSep, via the logistic model for Pr[A = 1|X] and then included the estimated PSep as a linear continuous covariate in the logistic model for the outcome, logit Pr[Y = 1|A, PSep] = β0 + βA·A + βPS·PSep.

In the validation sample, we estimated PSgs via the logistic model for Pr[A = 1|X, L] and fit the linear regression model E[PSgs|A, PSep] = λ0 + λA·A + λPS·PSep.

The regression calibration-adjusted estimator for the treatment effect was βA − λA·βPS/λPS. We conducted the analysis by using a publicly available SAS macro by Spiegelman and Logan.
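The adjustment can be sketched with the usual regression calibration algebra. The symbols are our own labels and an assumption, not notation confirmed by the text: beta_a and beta_ps are the coefficients of A and the error-prone PS in the outcome model fitted to the entire cohort, and lambda_a and lambda_ps are the coefficients of A and the error-prone PS in the linear model for the gold-standard PS fitted to the validation sample. Variance estimation, which the SAS macro handles, is not shown.

```python
from math import exp


def calibrated_treatment_log_or(beta_a, beta_ps, lambda_a, lambda_ps):
    """Regression calibration-adjusted log odds ratio for treatment.

    Substituting E[PSgs | A, PSep] = lambda_0 + lambda_a*A + lambda_ps*PSep
    into the outcome model and matching coefficients gives
    beta_a_adjusted = beta_a - lambda_a * beta_ps / lambda_ps.
    """
    if lambda_ps == 0:
        raise ValueError("lambda_ps must be non-zero: the error-prone PS "
                         "carries no information about the gold-standard PS")
    return beta_a - lambda_a * beta_ps / lambda_ps


# Hypothetical coefficients, for illustration only.
adjusted_or = exp(calibrated_treatment_log_or(-0.20, 1.5, 0.01, 0.9))
```

When lambda_a is zero, that is, when treatment carries no information about the gold-standard PS beyond the error-prone PS, the adjusted and unadjusted treatment estimates coincide.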


Table 1 shows the distribution of baseline characteristics of initiators of coxibs and tNSAIDs ascertained during the 12-month period before the first NSAID prescription. The crude OR of UGIB for coxib initiators versus tNSAID initiators was 1.50 (95%CI: 0.98, 2.28). The OR was 1.04 (0.68, 1.59) after adjustment for age and sex, 0.98 (0.63, 1.52) upon further adjustment for calendar year of treatment initiation, and 0.84 (0.54, 1.31) after further adjustment for measures of healthcare utilization. When we further adjusted for all remaining confounders in X, the OR was 0.81 (0.52, 1.27) for the entire study cohort, 0.64 (0.38, 1.07) for the 78% of the cohort with complete information on all three lifestyle variables in L, and 1.93 (0.78, 4.74) for patients with missing values on any of the three lifestyle variables.
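The crude OR can be checked from the counts reported in the text (25 confirmed cases among 43 569 coxib initiators and 183 − 25 = 158 among 411 616 tNSAID initiators). The sketch below uses the standard 2 × 2 table formula with a Wald interval on the log scale; the function name is hypothetical, and we assume this is numerically equivalent to the unadjusted logistic model rather than claiming it is the code that was run.

```python
from math import exp, log, sqrt


def crude_odds_ratio(cases_exp, n_exp, cases_unexp, n_unexp, z=1.96):
    """Odds ratio and Wald 95% CI from a 2 x 2 table of cases and group sizes."""
    a, b = cases_exp, n_exp - cases_exp        # exposed: cases, non-cases
    c, d = cases_unexp, n_unexp - cases_unexp  # unexposed: cases, non-cases
    log_or = log((a * d) / (b * c))
    se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)   # standard error of the log OR
    return exp(log_or), exp(log_or - z * se), exp(log_or + z * se)


# Counts taken from the study cohort and outcome validation described above.
or_, lower, upper = crude_odds_ratio(25, 43_569, 183 - 25, 411_616)
```

Rounded to two decimals, this reproduces the reported crude estimate of 1.50 (0.98, 2.28).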

Table 1. Baseline characteristics of initiators of selective cyclo-oxygenase-2 inhibitors (coxibs) or non-selective (traditional) non-steroidal anti-inflammatory drugs (tNSAIDs) ascertained during the 12-month period before the first NSAID prescription

                                              Patients with no missing             Patients with missing
                                              supplemental confounder data*        supplemental confounder data*
Characteristics                               Coxib initiators  tNSAID initiators  Coxib initiators  tNSAID initiators
                                              (n = 33 693)      (n = 320 733)      (n = 9876)        (n = 90 883)
Age (years)
Male sex                                      35.0              41.3               39.5              50.2
Calendar year of treatment initiation
No. of distinct drugs in the prior year
No. of outpatient visits in the prior year
Hospitalized in the prior year                9.
Charlson comorbidity score ≥1                 41.4              27.9               33.5              20.4
Prior use of
  Gastroprotective drugs                      29.2              11.8               26.2              9.4
  Oral steroids                               9.
Diagnosis of
  Rheumatoid arthritis                        3.
  Peptic ulcer disease                        0.
  Congestive heart failure                    2.
  Coronary artery disease                     15.9              8.3                10.1              4.3
Current smoker                                19.9              21.7               16.4              18.1
Past smoker                                   24.3              23.1               12.8              12.6
Alcohol consumption (drinks/week)
Body mass index (kg/m²)

* Supplemental confounder data include information on smoking, alcohol consumption, and body mass index.

Table 2 shows the results from different analytic approaches to deal with missing confounder data. The adjusted ORs were 0.65 and 0.67 for the unweighted and IP-weighted complete-case analyses, respectively. In the IP-weighted analysis, the weight had a mean of 1.28 (standard deviation 0.15) and ranged from 1.04 to 2.49. The adjusted ORs ranged between 0.80 and 0.83 for the imputation methods. The 95%CIs from the different methods were overlapping; the 95%CI for any estimate included all other point estimates.
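The statement that each 95%CI includes all other point estimates can be verified directly from the values reported in Table 2. The short check below is only an illustration; the dictionary keys are abbreviated method labels and the function name is hypothetical.

```python
# (odds ratio, lower 95% limit, upper 95% limit) as reported in Table 2
TABLE_2 = {
    "complete-case, unweighted": (0.65, 0.39, 1.09),
    "complete-case, IP weighted": (0.67, 0.38, 1.16),
    "missing-category": (0.81, 0.51, 1.26),
    "missing-indicator": (0.80, 0.51, 1.25),
    "single imputation": (0.83, 0.53, 1.30),
    "multiple imputation": (0.82, 0.52, 1.29),
    "PS calibration": (0.80, 0.50, 1.27),
}


def every_ci_covers_all_estimates(results):
    """True if each method's interval contains every method's point estimate."""
    estimates = [est for est, _, _ in results.values()]
    return all(lo <= est <= hi
               for _, lo, hi in results.values()
               for est in estimates)


all_covered = every_ci_covers_all_estimates(TABLE_2)
```

The check holds because the largest lower limit (0.53) is below the smallest point estimate (0.65) and the smallest upper limit (1.09) is above the largest point estimate (0.83).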

Table 2. Odds ratios of upper gastrointestinal bleeding during the first 180 days following initiation of selective cyclo-oxygenase-2 inhibitors versus non-selective non-steroidal anti-inflammatory drugs, by different analytic approaches to incorporate supplemental confounder data available in a subset of the study cohort

Analytic methods                                      Number of patients included in the analysis                    Adjusted odds ratio*        Standard error of
                                                                                                                     (95% confidence interval)   log odds ratio
Complete-case analysis; unweighted                    354 426                                                        0.65 (0.39, 1.09)           0.27
Complete-case analysis; inverse probability weighted  354 426 (outcome/PS model); 455 185 (weight model)             0.67 (0.38, 1.16)           0.28
Missing-category approach                             455 185                                                        0.81 (0.51, 1.26)           0.23
Missing-indicator approach                            455 185                                                        0.80 (0.51, 1.25)           0.23
Single imputation                                     455 185                                                        0.83 (0.53, 1.30)           0.23
Multiple imputation                                   455 185                                                        0.82 (0.52, 1.29)           0.23
PS calibration                                        455 185 (error-prone PS model); 300 000 (gold-standard PS model)  0.80 (0.50, 1.27)        0.24

* Adjusted via propensity score (PS) for the following variables recorded in the entire study cohort: age; sex; calendar year; Charlson comorbidity score; use of gastroprotective drugs, anticoagulants, antiplatelets, and oral steroids; diagnosis of osteoarthritis, rheumatoid arthritis, upper gastrointestinal symptoms, dyspepsia, complicated or uncomplicated peptic ulcer disease, hypertension, congestive heart failure, and coronary artery disease; and the number of distinct drugs, physician visits, and hospitalizations in the prior year. Supplemental variables recorded only in a subset of the study cohort included smoking, alcohol consumption, and body mass index.

Results from all approaches did not materially change when the PS was included as a continuous variable instead of deciles in the outcome model (as was necessary for the PS calibration approach). The c-statistic for the PS model was around 0.80 for all analyses, and the covariates were overall well balanced within PS strata (data not shown).


We have reviewed and compared several approaches to deal with partially missing confounder information in electronic healthcare databases. We used the NSAID–UGIB example to illustrate their application to comparative effectiveness and safety research of therapeutics. All these methods require the assumptions of no unmeasured confounding for the effect of treatment on the outcome and no misspecification of the outcome and PS models.

The missing-category/indicator approach and single imputation by the most common category further require additional assumptions that are generally implausible. In essence, they all assume that patients with missing information on certain variables are unconditionally exchangeable and can be grouped together for analysis. Single imputation by the most common category goes a step further and assumes that patients with missing data are not only comparable with each other but also with patients with a certain (often arbitrarily chosen) covariate value. Although these methods are easy to implement, they have been shown to produce biased estimates even when patients with and without missing data are unconditionally exchangeable (i.e., data missing completely at random).[11, 38-40]

Multiple imputation requires that missingness be unassociated with the outcome conditional on the measured confounders or the corresponding PS (i.e., data missing at random) and that the imputation model for each covariate with missing data be correctly specified.[34] The approach has been shown to provide more valid estimates than the missing-indicator approach and single imputation when these assumptions are true.[11, 34, 39-42] A recent study that used The Health Improvement Network database[43] found that patients with missing information on smoking, alcohol consumption, weight, or height differ systematically from the others in terms of comorbidities such as cardiovascular disease and chronic obstructive pulmonary disease. Our estimate from multiple imputation would be incorrect if missingness was associated with other prognostic factors that were not included in the analysis. We used a version of multiple imputation that does not require the often unrealistic assumption of joint multivariate normality.[30, 31]

The PS calibration approach is valid under the assumptions that there is an appropriate internal or external validation sample, the linear measurement error model is correctly specified, and the error-prone PS is an appropriate surrogate for the gold-standard PS.[35, 44] The last assumption may be violated if confounding by the unmeasured or partially measured confounders acts in the opposite direction to confounding by the measured covariates.[44] This approach may be combined with single imputation of the gold-standard PS based on the parameters of the measurement error model; matching or stratifying on the imputed gold-standard PS then removes the need to specify the outcome model.[45]

Yet, despite all these differences in the conditions required for valid estimates, we found only small differences across different imputation methods. The reasons might be that the proportion of missingness was relatively low and that the three variables with missing values might not be strong confounders after conditioning on other measured variables. Indeed, the OR adjusted for all potential confounders available in the entire study cohort (0.81) was similar to the ORs that were further adjusted for the three lifestyle variables by using different imputation approaches (0.80–0.83).

Like the imputation methods, the IP-weighted complete-case analysis estimates the effect in the entire study population.[25, 46] It is valid under the additional assumption that the weight models are correctly specified. The unweighted complete-case analysis estimates the effect only among patients without missing values; its results cannot be applied to the entire study population unless the data are missing completely at random. The unweighted complete-case analysis has been shown to produce more biased estimates than other approaches, such as multiple imputation.[11, 39, 40]

The point estimates of complete-case analyses and imputation methods were somewhat different, which may be due to random variability (wide and overlapping 95%CIs) or to real differences between patients with complete and incomplete confounder information beyond the information recorded in the database. For example, general practitioners who record patient lifestyle factors—and patients who respond to these questions—might have certain unmeasured characteristics that are associated with the outcome risk. Also, the effect of NSAIDs on UGIB might be modified by certain patient characteristics for which missingness is a proxy.

In conclusion, a number of methods are available to deal with missing data in comparative effectiveness and safety studies of therapeutics that analyze electronic healthcare databases. Researchers need to be aware of the underlying assumptions of various methods when choosing among them.




  • Partially missing confounder information is common in comparative effectiveness and safety research of therapeutics.
  • We applied several methods to deal with missing confounder information using data from a primary care electronic medical records database from the United Kingdom.
  • Researchers should be aware of the underlying assumptions of various methods for handling missing data when choosing among them.


The authors would like to thank Ken Kleinman, ScD from Harvard Medical School and Harvard Pilgrim Health Care Institute for his thoughtful comments on an earlier draft of this paper. Dr. Toh is partially supported by R03HS019024. Dr. Hernán is partially supported by R01HL080644.