Estimating a marginal causal odds ratio in a case-control design: analyzing the effect of low birth weight on the risk of type 1 diabetes mellitus
Correspondence to: Emma Persson, Department of Statistics, USBE, Umeå University, SE-90187 Umeå, Sweden.
Estimation of marginal causal effects from case-control data has two complications: (i) confounding due to the fact that the exposure under study is not randomized, and (ii) bias from the case-control sampling scheme. In this paper, we study estimators of the marginal causal odds ratio, addressing these issues for matched and unmatched case-control designs when utilizing the knowledge of the known prevalence of being a case. The estimators are implemented in simulations where their finite sample properties are studied and approximations of their variances are derived with the delta method. Also, we illustrate the methods by analyzing the effect of low birth weight on the risk of type 1 diabetes mellitus using data from the Swedish Childhood Diabetes Register, a nationwide population-based incidence register. Copyright © 2013 John Wiley & Sons, Ltd.
Case-control designs are often used when investigating the effect of a treatment (exposure) on an outcome, e.g., a disease. The design is an efficient alternative in the event of rare diseases, since then the size of a random sample may need to be very large in order to include sufficiently many cases for the subsequent analysis. In the case-control setting, the controls can be sampled in different ways, e.g., independently sampled or matched on one or several variables believed to be confounders.
In case-control designs, odds ratios are commonly used to measure the effect of a binary treatment, T, on a binary outcome, Y, since the odds ratio expressed in terms of the outcome conditional on the treatment is equivalent to the odds ratio defined in terms of the treatment conditional on the outcome. A causal odds ratio is defined by the distribution of the potential outcomes under each treatment; see Holland for the definition of a causal effect in the potential outcome framework. The causal effect of the treatment on the disease is identified if all confounding pretreatment variables, referred to as covariates henceforth, are observed. Depending on whether we condition on or marginalize over the confounding covariates, an odds ratio is conditional or marginal. A conditional odds ratio can help a clinician decide whether a treatment is beneficial for a patient with particular characteristics, while a marginal odds ratio can be used to assess the effect of the treatment in the population as a whole. A conditional estimate will usually differ from its marginal counterpart due to noncollapsibility of the odds ratio; see Greenland, Robins, and Pearl for a discussion of confounding and collapsibility in causal inference. Noncollapsibility can occur even in the absence of confounding. For instance, when a covariate is associated with the outcome but unassociated with the treatment and the marginal odds ratio is not equal to 1, the conditional odds ratio will be farther from 1 than the marginal one. Statistical development has to a large extent focused on the estimation of conditional odds ratios; see the review by Breslow and the references therein.
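The noncollapsibility described above can be illustrated numerically. The following is a minimal sketch under assumed values that do not come from the paper: a single binary covariate X with P(X = 1) = 0.5 that is independent of the treatment, and a logistic outcome model with made-up coefficients `b0`, `b_t`, and `b_x`.

```python
import math

def expit(z):
    return 1.0 / (1.0 + math.exp(-z))

def odds(p):
    return p / (1.0 - p)

# Hypothetical model: logit P(Y=1 | T, X) = b0 + b_t*T + b_x*X
b0, b_t, b_x = -1.0, math.log(3.0), 2.0
p_x = 0.5  # P(X = 1); X is independent of T, so there is no confounding

def marginal_risk(t):
    # Average the conditional risk over the distribution of X
    return (1 - p_x) * expit(b0 + b_t * t) + p_x * expit(b0 + b_t * t + b_x)

conditional_or = math.exp(b_t)  # the same in both strata of X
marginal_or = odds(marginal_risk(1)) / odds(marginal_risk(0))
print(conditional_or, marginal_or)
```

Although X is not a confounder here, the marginal odds ratio lies between 1 and the common conditional odds ratio.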
Estimators of the marginal causal odds ratio have been investigated. Zhang  proposes an estimator based on a logistic regression model for the outcome conditional on the treatment and the covariates. In addition, an estimator stratifying on the propensity score, the probability of treatment given the covariates, has also been introduced . In contrast, the inverse of the propensity score can also be used as a weight in an inverse probability of treatment weighted estimator of the marginal causal odds ratio, described by Robins . In Vansteelandt, Bowden, Babanezhad, and Goetghebeur , estimators of both conditional and marginal causal odds ratios are proposed using an instrumental variable, i.e., a variable associated with the exposure but not the response.
In this paper, we study and compare the performance of estimators for the marginal causal odds ratio within matched and unmatched case-control designs when information on the outcome probability (the prevalence), P (Y = 1), is known in the population under study. The prevalence can be used to adjust the conditional outcome model in an unmatched case-control design either with intercept adjustment [10, 11] or weighted maximum likelihood (ML) . For a matched design, we implement the theory described in van der Laan  and apply intercept adjustment and weighting. Here, knowledge of the prevalence conditional on the matching variables is required. After adjusting the conditional models, we marginalize over a weighted distribution of the cases and controls to obtain estimates of the marginal causal odds ratio. An approximation of the variances of the estimators is derived using the delta method. The estimators described are compared to case-control weighted targeted maximum likelihood estimators (TMLE) [13, 14]. For a TMLE, the conditional model is updated using a fitted model for the propensity score. The finite sample properties of all the estimators are studied in simulations and compared to the unadjusted sample odds ratio and the ML estimator of the conditional causal odds ratio.
An example of a case-control design where the prevalence is known is a study based on individuals from a disease register where all incident cases of the disease are registered and a sample of disease-free controls are collected from the population giving rise to the cases. In this paper, we use data from the Swedish Childhood Diabetes Register (SCDR), a population-based incidence register providing information on the prevalence of type 1 diabetes mellitus (T1DM) in children 0-15 years old. The estimators described above are implemented and we estimate the causal effect of low birth weight on the risk of T1DM. Here, we show that neglecting the confounding or considering the conditional estimate results in a different conclusion than when considering the marginal estimate when adjusting for confounding.
We proceed as follows: In Section 2, the theoretical framework and notation are presented. Section 3 describes estimators of the marginal causal odds ratio in matched and unmatched case-control designs. In a simulation study described in Section 4, we investigate the finite sample properties of the estimators. In Section 5, we use data from the SCDR to estimate the effect of low birth weight on childhood onset insulin-dependent diabetes. Section 6 concludes with a discussion.
2 Causal inference in a case-control framework
In the following, we consider studies where the objective is to estimate the causal effect of a binary treatment on a binary outcome. The treatment variable, T, equals 1 if the individual has received the treatment and 0 if the individual is untreated. Let $Y^1$ denote the binary response variable that would be observed under the treatment and $Y^0$ under no treatment. The two variables are called potential outcomes [15-17]. For each individual, we observe only one of the potential outcomes and the observed response Y is defined as $Y = TY^1 + (1 - T)Y^0$. Let X denote a vector of covariates. We assume that we have a case-control sampling design, i.e., we sample N1 individuals from (Y = 1,X,T) and N0 individuals from (Y = 0,X,T), although interest lies in estimating parameters for the process generating (Y,X,T).
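A parameter of interest describing the causal effect of the treatment is the marginal causal odds ratio
$$\theta = \frac{P(Y^1 = 1)/\{1 - P(Y^1 = 1)\}}{P(Y^0 = 1)/\{1 - P(Y^0 = 1)\}}.$$
Interest can also lie in a conditional causal odds ratio over a subpopulation with certain values of the covariates, X,
$$\theta(x) = \frac{P(Y^1 = 1 \mid X = x)/\{1 - P(Y^1 = 1 \mid X = x)\}}{P(Y^0 = 1 \mid X = x)/\{1 - P(Y^0 = 1 \mid X = x)\}}.$$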
To identify causal parameters in nonrandomized studies, we need to control for all possible confounders, X. The following assumption,
$$\text{A.1:}\quad (Y^1, Y^0) \perp\!\!\!\perp T \mid X,$$
states that the potential outcomes are independent of the treatment assignment conditional on X. To ensure that there exists an overlap in the covariate distribution for the treated and untreated, a second condition,
$$\text{A.2:}\quad 0 < P(T = 1 \mid X) < 1,$$
is also required. A.1 and A.2 together are called strong ignorability. Under strong ignorability, the conditional and the marginal odds ratios can be identified since, for t = 0,1, we have
$$P(Y^t = 1) = E\{P(Y^t = 1 \mid X)\} = E\{P(Y = 1 \mid T = t, X)\}. \tag{1}$$
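Throughout the paper, we also assume that the observed outcome, Y, can be described with the following model
$$P(Y = 1 \mid T, X) = \frac{\exp\{h(T, X; \beta)\}}{1 + \exp\{h(T, X; \beta)\}}, \tag{2}$$
where h is some known function that is linear in β = (β0,β1,...,βp), a finite-dimensional vector of unknown parameters. If we had data from a random sample of size N, the ML estimate, $\hat{\beta}$, is given by the solution to the likelihood equation
$$\sum_{i=1}^{N} h'(t_i, x_i; \beta)\left[y_i - \frac{\exp\{h(t_i, x_i; \beta)\}}{1 + \exp\{h(t_i, x_i; \beta)\}}\right] = 0, \tag{3}$$
where h′(t,x; β) is the partial derivative of the function h with respect to β.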
3.1 Design A
Consider a design in which we first sample N1 cases, e.g., individuals with a certain disease in a defined population. From the same population, N0 controls are then sampled from the group of individuals without the disease, resulting in a sample consisting of N = N1 + N0 individuals. This design, also called unmatched or independent case-control sampling, will henceforth be referred to as Design A.
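An estimator of θ was introduced by Zhang under the assumption of an iid sample from (Y,T,X). Here, the outcome is assumed to follow the logistic regression model (2) and the unknown parameters in the model are obtained by ML estimation solving (3). In this setting and motivated by (1), $P(Y^t = 1)$ can be estimated by averaging the fitted probabilities over the empirical covariate distribution, and causal parameters such as the risk difference, the relative risk, and the marginal odds ratio can be identified. Thus, an estimator of the marginal causal odds ratio is defined by
$$\hat{\theta} = \frac{\hat{P}(Y^1 = 1)/\{1 - \hat{P}(Y^1 = 1)\}}{\hat{P}(Y^0 = 1)/\{1 - \hat{P}(Y^0 = 1)\}}, \qquad \hat{P}(Y^t = 1) = \frac{1}{N}\sum_{i=1}^{N} \frac{\exp\{h(t, x_i; \hat{\beta})\}}{1 + \exp\{h(t, x_i; \hat{\beta})\}}.$$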
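This estimator can be adjusted for Design A. If the case-control design is ignored when fitting the logistic regression model, the ML estimates obtained for the β-coefficients in (2) are identical except for the intercept, β0. If the prevalence, P(Y = 1), is known, the intercept can be adjusted by adding
$$\alpha = \log\left[\frac{N_0\,P(Y = 1)}{N_1\,\{1 - P(Y = 1)\}}\right] \tag{4}$$
to the fitted regression function, yielding an unbiased estimate of logit[P(Y = 1 ∣ T,X)]. Subsequently, $P(Y^t = 1)$, t = 0,1, can be estimated by averaging over the case-control weighted distribution of the covariates,
$$\hat{P}(Y^t = 1) = \sum_{i=1}^{N} w_i\,\hat{p}_{ti}, \qquad \hat{p}_{ti} = \frac{\exp\{h(t, x_i; \hat{\beta}^*)\}}{1 + \exp\{h(t, x_i; \hat{\beta}^*)\}}, \tag{5}$$
where $\hat{\beta}_0^* = \hat{\beta}_0 + \alpha$ and $\hat{\beta}_j^* = \hat{\beta}_j$ ∀ j ≥ 1, with $\hat{\beta}$ satisfying (3). Thus, from (5), an estimator of the marginal causal odds ratio can be expressed in terms of weighted sums,
$$\frac{\sum_{i} w_i\,\hat{p}_{1i}\;\sum_{i} w_i\,(1 - \hat{p}_{0i})}{\sum_{i} w_i\,(1 - \hat{p}_{1i})\;\sum_{i} w_i\,\hat{p}_{0i}},$$
where we use the normalized weights,
$$w_i = \begin{cases} P(Y = 1)/N_1 & \text{if } y_i = 1, \\ \{1 - P(Y = 1)\}/N_0 & \text{if } y_i = 0. \end{cases} \tag{6}$$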
In Appendix A, the delta method is used to derive a variance estimator for the intercept adjusted estimator on the log scale.
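The intercept adjustment and the weighted averaging can be sketched in a few lines. This is an illustrative sketch only: the offset is written in the standard form for correcting a case-control intercept with a known prevalence (population log odds minus sample log odds), the weights give the cases total weight P(Y = 1) and the controls total weight 1 − P(Y = 1), and the sample, the coefficients `b0`, `b_t`, `b_x`, and all function names are hypothetical.

```python
import numpy as np

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

def intercept_offset(n_cases, n_controls, prevalence):
    # Population log odds of being a case minus the log odds in the sample
    return np.log(prevalence / (1.0 - prevalence)) - np.log(n_cases / n_controls)

def case_control_weights(y, prevalence):
    # Cases share total weight q, controls share 1 - q, so the weights sum to 1
    y = np.asarray(y)
    n_cases, n_controls = np.sum(y == 1), np.sum(y == 0)
    return np.where(y == 1, prevalence / n_cases, (1.0 - prevalence) / n_controls)

def marginal_odds_ratio(linpred_t1, linpred_t0, weights, offset=0.0):
    # Weighted average of fitted risks under T = 1 and T = 0, then the odds ratio
    p1 = np.sum(weights * expit(linpred_t1 + offset))
    p0 = np.sum(weights * expit(linpred_t0 + offset))
    return (p1 / (1.0 - p1)) / (p0 / (1.0 - p0))

# Hypothetical sample: 2 cases, 8 controls, one covariate, made-up coefficients
y = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 0])
x = np.linspace(0.0, 1.0, 10)
b0, b_t, b_x = -0.5, 1.0, 0.8   # as if fitted to the case-control sample
q = 0.07                        # known prevalence P(Y = 1)

alpha = intercept_offset(2, 8, q)
w = case_control_weights(y, q)
theta_hat = marginal_odds_ratio(b0 + b_t + b_x * x, b0 + b_x * x, w, offset=alpha)
print(alpha, w.sum(), theta_hat)
```

When the case fraction in the sample equals the prevalence, the offset is zero and no intercept correction is made.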
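Instead of adjusting the intercept, P(Y = 1) can be used for weighting of the likelihood equations in (3), yielding
$$\sum_{i=1}^{N} w_i\,h'(t_i, x_i; \beta)\left[y_i - \frac{\exp\{h(t_i, x_i; \beta)\}}{1 + \exp\{h(t_i, x_i; \beta)\}}\right] = 0, \tag{7}$$
where wi is defined in (6). The estimate of β that satisfies (7) thereafter replaces the intercept adjusted estimate in (5), and a corresponding estimator of the marginal causal odds ratio is defined by the odds ratio formed from the resulting estimates of $P(Y^1 = 1)$ and $P(Y^0 = 1)$.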
As for the intercept adjusted estimator, a variance approximation for the weighted ML estimator on the log scale, obtained with the delta method, is presented in Appendix A.
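As an illustration of the weighted ML approach, the sketch below solves a weighted logistic likelihood equation by Newton-Raphson and plugs the fit into a weighted average of the fitted risks. It is not the authors' code: the simulated population, its coefficients, the sample sizes, and the function name `weighted_logit_fit` are assumptions made for the example, and the prevalence is taken from the simulated population as if it were known.

```python
import numpy as np

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

def weighted_logit_fit(design, y, weights, n_iter=50, tol=1e-10):
    """Newton-Raphson for sum_i w_i * x_i * (y_i - expit(x_i' beta)) = 0."""
    beta = np.zeros(design.shape[1])
    for _ in range(n_iter):
        p = expit(design @ beta)
        score = design.T @ (weights * (y - p))
        info = (design * (weights * p * (1.0 - p))[:, None]).T @ design
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Simulated population with made-up coefficients (not the paper's design)
rng = np.random.default_rng(0)
n_pop = 200_000
x = rng.uniform(0.0, 1.0, n_pop)
t = rng.binomial(1, expit(x - 0.5))
y = rng.binomial(1, expit(-3.0 + 1.0 * t + 1.0 * x))
q = y.mean()  # treated as the known prevalence P(Y = 1)

# Unmatched case-control sample: 1000 cases and 4000 controls
cases = rng.choice(np.flatnonzero(y == 1), 1000, replace=False)
controls = rng.choice(np.flatnonzero(y == 0), 4000, replace=False)
idx = np.concatenate([cases, controls])
ys, ts, xs = y[idx].astype(float), t[idx].astype(float), x[idx]

# Normalized case-control weights and the weighted ML fit
w = np.where(ys == 1, q / 1000, (1.0 - q) / 4000)
design = np.column_stack([np.ones(len(idx)), ts, xs])
beta_w = weighted_logit_fit(design, ys, w)

def weighted_risk(t_value):
    # Weighted average of the fitted risk with treatment set to t_value
    d = np.column_stack([np.ones(len(idx)), np.full(len(idx), t_value), xs])
    return np.sum(w * expit(d @ beta_w))

p1, p0 = weighted_risk(1.0), weighted_risk(0.0)
theta_w = (p1 / (1.0 - p1)) / (p0 / (1.0 - p0))
print(beta_w, theta_w)
```

Because the weights restore the population proportions of cases and controls, the fitted intercept needs no further correction in this sketch.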
Targeted maximum likelihood estimation is a procedure introduced by van der Laan . In the procedure, the distribution from which the data is generated is estimated and then the parameter of interest is targeted by updating the initial fit to achieve an estimate with lower bias. The approach can be used with different models and can also be applied to various sampling designs . Below, we describe a case-control weighted TMLE of the marginal causal odds ratio for Design A . Here, the outcome is assumed to be generated from a logistic regression model (2). In the updating step of the TMLE procedure, a logistic regression model for the propensity score, e(X) = P(T = 1 ∣ X), is assumed throughout the paper. A TMLE of the marginal causal odds ratio can be obtained through the following steps:
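1. Define weights, wi, as in (6).
2. Estimate P(Y = 1 ∣ T,X) using weighted ML estimation, i.e., solving (7).
3. Estimate the propensity score, e(X), using weighted ML estimation corresponding to that in Step 2.
4. Update the initial fit, $\hat{P}(Y = 1 \mid T, X)$, by including a two-dimensional variable g(T,X) = (g0,g1) defined as
$$g_t(T, X) = \frac{I(T = t)}{\hat{P}(T = t \mid X)}, \qquad t = 0, 1,$$
where $\hat{P}(T = 1 \mid X) = \hat{e}(X)$ is the estimated propensity score, $\hat{P}(T = 0 \mid X) = 1 - \hat{e}(X)$, and I(T = t) = 1 if T = t, and 0 otherwise. The update is achieved by regressing Y on g(T,X) in a model without intercept and including the logit of the initial fit as an offset using weighted ML estimation as in Step 2. The updated fit is
$$\operatorname{logit} \hat{P}^*(Y = 1 \mid T, X) = \operatorname{logit} \hat{P}(Y = 1 \mid T, X) + \hat{\epsilon}^\top g(T, X), \tag{8}$$
where $\hat{\epsilon}$ is the estimated parameter corresponding to g(T,X).
5. Estimate the marginal causal odds ratio by averaging the updated fit over the weighted distribution of the covariates, $\hat{P}^*(Y^t = 1) = \sum_{i=1}^{N} w_i\,\hat{P}^*(Y = 1 \mid T = t, x_i)$ for t = 0,1, and forming the corresponding odds ratio.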
If the odds ratio is estimated without the updating step, i.e., by skipping Steps 3 and 4 above and replacing the fit in (8) with the initial fit in Step 5, the estimator is equal to the weighted ML estimator. For the weighted ML estimator, a larger bias may be expected than for the TMLE when the model is misspecified, due to the so-called doubly robust property of the TMLE. This theoretical property assures consistency of the TMLE if the model in either Step 2 or Step 3 is correctly specified. The variance of the TMLE depends on the derivative (score) of the distribution of Y given T and X, and the score of the covariate distribution. For variance estimation, bootstrap methods can also be used. A detailed description of variance estimation is available in van der Laan and Rose. In addition, an implementation is available in the tmle package for the statistical software R.
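The five steps can be sketched as follows for a single covariate. This is a simplified illustration under stated assumptions, not the authors' implementation or the R `tmle` package: the clever covariates are taken to be the treatment indicators divided by the estimated treatment probabilities, no propensity score truncation is applied, and the simulated data, coefficients, and function names (`weighted_logit_fit`, `cc_tmle`) are hypothetical.

```python
import numpy as np

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

def weighted_logit_fit(design, y, weights, offset=None, n_iter=50, tol=1e-10):
    """Newton-Raphson for a weighted logistic regression with optional offset."""
    offset = np.zeros(len(y)) if offset is None else offset
    beta = np.zeros(design.shape[1])
    for _ in range(n_iter):
        p = expit(design @ beta + offset)
        score = design.T @ (weights * (y - p))
        info = (design * (weights * p * (1.0 - p))[:, None]).T @ design
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

def odds_ratio(p1, p0):
    return (p1 / (1.0 - p1)) / (p0 / (1.0 - p0))

def cc_tmle(y, t, x, q):
    ones = np.ones(len(y))
    # Step 1: normalized case-control weights
    w = np.where(y == 1, q / np.sum(y == 1), (1.0 - q) / np.sum(y == 0))
    # Step 2: initial outcome fit by weighted ML
    beta = weighted_logit_fit(np.column_stack([ones, t, x]), y, w)
    lp1 = np.column_stack([ones, ones, x]) @ beta       # linear predictor, T = 1
    lp0 = np.column_stack([ones, 0.0 * ones, x]) @ beta  # linear predictor, T = 0
    # Step 3: propensity score fit by weighted ML
    gamma = weighted_logit_fit(np.column_stack([ones, x]), t, w)
    e = expit(np.column_stack([ones, x]) @ gamma)
    # Step 4: regress Y on (g0, g1) without intercept, initial fit as offset
    g = np.column_stack([(t == 0) / (1.0 - e), (t == 1) / e])
    eps = weighted_logit_fit(g, y, w, offset=np.where(t == 1, lp1, lp0))
    # Step 5: weighted averages of the initial and the updated fit
    initial = odds_ratio(np.sum(w * expit(lp1)), np.sum(w * expit(lp0)))
    updated = odds_ratio(np.sum(w * expit(lp1 + eps[1] / e)),
                         np.sum(w * expit(lp0 + eps[0] / (1.0 - e))))
    return initial, updated

# Simulated population with made-up coefficients (not the paper's design)
rng = np.random.default_rng(0)
x_pop = rng.uniform(0.0, 1.0, 200_000)
t_pop = rng.binomial(1, expit(x_pop - 0.5))
y_pop = rng.binomial(1, expit(-3.0 + 1.0 * t_pop + 1.0 * x_pop))
theta_true = odds_ratio(expit(-2.0 + x_pop).mean(), expit(-3.0 + x_pop).mean())

cases = rng.choice(np.flatnonzero(y_pop == 1), 1000, replace=False)
controls = rng.choice(np.flatnonzero(y_pop == 0), 4000, replace=False)
idx = np.concatenate([cases, controls])
theta_init, theta_tmle = cc_tmle(y_pop[idx].astype(float),
                                 t_pop[idx].astype(float),
                                 x_pop[idx], y_pop.mean())
print(theta_true, theta_init, theta_tmle)
```

The first returned value skips the update and therefore corresponds to the weighted ML estimator, which makes the two estimators easy to compare on the same sample.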
3.2 Design B
In a matched case-control design, N1 cases are first sampled. We then stratify the population by a confounding variable, and for each selected case, we randomly sample matched controls from the same stratum as the corresponding case. We denote the matching variable by M, and in the sequel it is assumed to be categorical and a subset of X. The matched case-control design is, henceforth, referred to as Design B. In Design B, the covariate distribution of the controls depends on the covariate distribution of the cases. To achieve unbiased estimates of the marginal causal odds ratio using the previously described methods, additional information about the prevalence within the levels of the matching variable, P(Y = 1 ∣ M), needs to be incorporated. For Design B, it is possible to use an intercept adjustment method to estimate θ, though correcting the intercept in the model is insufficient to eliminate the bias, so weighting is also needed. An estimator that incorporates weighting together with intercept adjustment can be constructed using results in , by applying a corrected version of Theorem 2 (to appear in the International Journal of Biostatistics).
Thus, the estimator for Design B is obtained by first weighting the likelihood equation (7) with weights defined as
Consequently, additional information about the probability of being a case within the levels of the matching variable, P(Y = 1 ∣ M), is incorporated. Thereafter, $P(Y^t = 1)$ can be estimated, for t = 0, 1, by
where $\hat{\beta}$ denotes the weighted ML estimates and α is defined in (4). Thus, using the intercept adjusted approach, an estimator of the marginal causal odds ratio can be constructed,
Another estimator, not requiring the intercept to be adjusted, can be achieved by modifying the weights used in the likelihood Equation (7). Defining the weights as in (10), the estimator for Design B is
Approximate variances for the intercept adjusted and weighted ML estimators on the log scale for Design B are derived in Appendix A.
Both estimators described above can be seen as special cases of the case-control weighted TMLE when the updating step is ignored. To apply TMLE to Design B , the weights are redefined as in (10). The procedure now follows the same steps as in Design A, in Section 3.1, although with the newly defined weights.
To illustrate the performance of the intercept adjusted estimator, the weighted ML estimator, and the TMLE in finite samples, a simulation study is conducted for both Designs A and B. In addition, the estimators are compared to conventional methods commonly used within the specific designs to illustrate confounding and noncollapsibility.
Within Design A, the most commonly used method for estimating the conditional odds ratio is logistic regression. The estimate is obtained by taking the exponential function of the estimated β-parameter in (2) corresponding to the treatment. This yields a conditional odds ratio estimate that can be interpreted as causal if h(T,X; β) is correctly specified and X includes all confounding covariates. For Design B, conditional logistic regression is the predominant method of analysis. Breslow and Day proposed a conditional ML approach for estimating the β-parameters. The exponential function is then applied to the estimated β-parameter corresponding to the treatment. The matching variables are not, on their own, to be included in the model since this would violate the underlying assumption that being sampled is independent of X given Y. We refer to these as the conditional estimates for Designs A and B, respectively.
The 2 × 2 contingency table can be used to construct an estimate of the marginal odds ratio. The cross product of the cell counts in the table relating T with Y constitutes the unadjusted sample odds ratio. This estimator ignores X and consequently any confounding, and therefore may be biased for θ.
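For concreteness, the cross-product ratio can be computed as below. The cell counts are hypothetical, and the standard error shown is the usual large-sample (Woolf) formula on the log scale, included as an assumption for illustration rather than as the variance estimator used in the paper.

```python
import math

def sample_odds_ratio(n11, n10, n01, n00):
    """Cross-product ratio; n_ty is the count with treatment t and outcome y."""
    return (n11 * n00) / (n10 * n01)

def woolf_log_se(n11, n10, n01, n00):
    # Large-sample standard error of the log odds ratio
    return math.sqrt(1 / n11 + 1 / n10 + 1 / n01 + 1 / n00)

# Hypothetical counts: treated cases, treated controls,
# untreated cases, untreated controls
or_hat = sample_odds_ratio(n11=30, n10=70, n01=10, n00=90)
se_log = woolf_log_se(n11=30, n10=70, n01=10, n00=90)
print(or_hat, se_log)
```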
4.1 Simulation design
Five uniformly distributed independent covariates, X = (X1,X2,X3,X4,X5), are generated. X1 has a discrete uniform distribution on the interval [1, 5], and X2 ∼ Uniform(0, 2), X3 ∼ Uniform(0, 5) and X4,X5 ∼ Uniform(0, 1) are continuous random variables. The treatment variable, T, is generated from N Bernoulli trials with treatment probability
where the expected probability of being treated is 0.5. The potential outcomes also follow a logistic regression model with probabilities
and the observed outcome is $Y = TY^1 + (1 - T)Y^0$. The prevalence, i.e., P(Y = 1), is 0.07 and the marginal causal odds ratio is 2.97.
Following the above rules, we generate datasets and sample N1 = (300,600,1200,2400) cases. For Design A, four times as many controls are independently sampled, and for Design B, four controls are matched, on X1, to each case. For each sample size, 1000 replicates are generated and the software R  is used to conduct the simulations.
For the methods where modeling of the outcome, Y, is needed, i.e., all except the sample odds ratio, a correctly specified model is fitted. In addition, we use the misspecified model,
where Φ denotes the standard normal distribution function. For the TMLE, where a model for the treatment is also needed, we fit a model according to (11) and also make the same misspecification as for the outcome model, i.e., we use the probit link function instead of the logit. Truncation of the propensity score from below has been recommended for the individuals with the lowest values (< 0.01). However, in our experience, the TMLE is sensitive to small or large values of the propensity score; thus, truncation from above and below is performed when needed to ensure convergence. In some cases, truncation of the propensity score, by as much as 0.25 from below and above, is needed to achieve a sufficiently small ϵ in the smallest sample size.
4.2 Simulation results
The bias, standard deviation, and mean squared error (MSE) for the marginal causal odds ratio can be seen in Tables 1 and 2 for Designs A and B, respectively. In both designs, we see that the unadjusted sample odds ratio is biased for all sample sizes because it does not control for the confounding. However, in Design B, where controls are matched to cases on X1, the estimator is less biased than in Design A since X1 is to a large extent controlled for by the matching. The conditional estimates, which control for the covariates, are also biased for the marginal effect, illustrating that a conditional estimator is inappropriate to use when we are interested in marginal effects.
Table 1. Results from Design A.
Columns give the bias, standard deviation (SD), and MSE for each of the four sample sizes, in increasing order.
|Estimator||Model||Bias||SD||MSE||Bias||SD||MSE||Bias||SD||MSE||Bias||SD||MSE|
| ||M||−0.32||0.73||0.63||−0.35||0.50||0.37||−0.42||0.31||0.28||−0.44||0.23||0.25|
| ||M||−0.47||0.67||0.67||−0.50||0.46||0.46||−0.57||0.28||0.41||−0.59||0.21||0.39|
| ||M/C||−0.01||1.15||1.32||−0.08||0.67||0.46||−0.18||0.39||0.18||−0.21||0.29||0.13|
| ||M/M||−0.04||1.14||1.30||−0.11||0.73||0.54||−0.19||0.41||0.21||−0.22||0.30||0.14|
Table 2. Results from Design B.
Columns give the bias, standard deviation (SD), and MSE for each of the four sample sizes, in increasing order.
|Estimator||Model||Bias||SD||MSE||Bias||SD||MSE||Bias||SD||MSE||Bias||SD||MSE|
| ||M||−0.29||0.75||0.65||−0.33||0.53||0.39||−0.40||0.33||0.27||−0.43||0.24||0.24|
| ||M||−0.45||0.69||0.67||−0.49||0.48||0.48||−0.56||0.30||0.40||−0.58||0.22||0.38|
| ||M/C||0.07||1.06||1.14||−0.01||0.79||0.62||−0.10||0.44||0.21||−0.15||0.31||0.12|
| ||M/M||0.08||1.16||1.35||−0.01||0.87||0.76||−0.10||0.51||0.27||−0.16||0.31||0.12|
Under a correctly specified model, the smallest MSE is seen with the intercept adjusted approach and the weighted ML method, where the bias and variance decrease with larger sample sizes. Model misspecification in these estimators, however, leads to an increase in bias, which becomes more pronounced as the sample size grows, and a decrease in variance, resulting in a larger MSE in the larger sample sizes.
The TMLE displays similar results to the intercept adjusted and weighted ML estimators for the largest sample size under correctly specified models as well as when the propensity score model is misspecified. This is in keeping with the double robustness property of the TMLE. However, the MSE is larger for the smaller sample sizes. When the outcome model is misspecified, the MSE of the TMLE is smaller than under the correctly specified model. This is largely due to a smaller variance, which is best seen in the largest sample size. We see that, as the sample size increases, the TMLE underestimates the true parameter. This is not seen for the smaller sample sizes since here the sampling distribution of the TMLE is more positively skewed. For all methods, only very small differences are seen between Design A and Design B.
To evaluate the performance of the derived variance estimators for the intercept adjusted and weighted ML estimators, the empirical standard deviation of the estimates on the log scale is compared to the average of the estimated standard deviations, and the results are found in Table 3. The two quantities exhibit only small differences, which decrease with larger sample sizes, indicating relatively accurate variance estimators.
Table 3. Empirical standard deviations and averaged estimated standard deviations on the log scale from Designs A and B, N denoting the sample size.
5 Effect of low birth weight on type 1 diabetes mellitus
Childhood onset insulin-dependent diabetes (T1DM) is a complex autoimmune disease, in which multiple factors are thought to be contributing to the onset of the illness. One possible risk factor for T1DM is intrauterine conditions that affect prenatal growth, and previous findings indicate that being small for gestational age is a protective factor for T1DM . Other studies have also shown a protective effect of low birth weight on T1DM . The objective of this study is to investigate how low birth weight in full-term children may affect the onset of T1DM in the population as a whole.
In the Swedish Childhood Diabetes Register (SCDR), all cases of T1DM in Sweden have been registered since 1977 . For each case, four controls from the Swedish population without diabetes have been sampled so that they matched the case unit for age and municipality of residence. By linkage to the Longitudinal Integration Database for Health Insurance and Labour Market Studies, the Swedish Medical Birth Register (MBR), and the Multigenerational Register, a whole range of medical, demographic, and other variables are available (post-matching) on cases and controls as well as their parents.
The study population was the population giving rise to the cases; children diagnosed with T1DM with onset before 15 years of age, with a gestational age of at least 38 weeks and who were born in Sweden from 1983 to 2007 and were still alive in 2007. There were 34332 subjects (7945 cases) in the sample, and due to missing information in MBR, 731 subjects (143 cases) were excluded. There was missing information on birth weight for 97 subjects (21 cases) and there were 1992 (424 cases) missing information on the mother's smoking habit. These values were assumed to be missing at random and imputation was used in the analysis. For birth weight, a random number, from a normal distribution with parameters estimated from the data, was imputed for the subjects with missing information. A random number from the Bernoulli distribution with the parameter estimated by the proportion of maternal smokers in the sample was used to impute a value where it was absent. The treatment variable was binary and equal to 1 if the subject had a birth weight less than 2650 g ‡ , and 0 otherwise. Table 4 describes the treatment variable and possible confounders.
Table 4. Descriptive statistics.
|Mean birth weight||3539.7 g||3517.0 g|
|Proportion of mothers born in Sweden||91.2%||87.3%|
|Proportion of mothers smoking at time of delivery||17.3%||21.0%|
|Proportion of mothers with diabetes||1.41%||0.36%|
|Proportion of births with cesarean section||10.9%||9.79%|
|Mean maternal age||29.1 years||28.9 years|
The three methods for Design B described in Section 3.2 as well as the sample odds ratio and conditional logistic regression were used to analyze the data, and the software R  was used to conduct all analyses. Knowledge of the probability of having T1DM with respect to municipality was unavailable and therefore weights for the relevant methods were constructed utilizing knowledge of the prevalence of T1DM conditional on age.
The results, in Table 5, show that the sample odds ratio indicates a significant effect with an estimate of 0.76. This would suggest that, for full-term children, the odds of T1DM if born with low birth weight are approximately 24% smaller than if born heavier. However, the estimate only has a causal interpretation if we assume that there are no confounders. To control for credible confounders, we assumed a linear logistic model with covariates, maternal age, and binary indicator variables for maternal smoking at the time of delivery, maternal diabetes, cesarean section, and maternal birth country (equal to 1 if born in Sweden and 0 otherwise). A t-test was used to test if the parameters in the model were zero given that all other variables were included in the model and the chosen model contained all variables that were significant on the 5% level. Based on this model, the conditional logistic regression estimate, 0.79, was slightly higher and still displayed a significant protective effect. Under the assumption of no unmeasured confounders, this estimate could still be biased for the marginal causal effect. On the other hand, all three methods for estimating the marginal causal odds ratio (the intercept adjusted method, the weighted ML method, and the TMLE), applied under the model where age and municipality were also included, generated nonsignificant estimates, presenting no established effect of low birth weight on T1DM, again under the assumption of unconfoundedness.
Table 5. Estimated odds ratios, standard deviations (SD) on the log scale, and 95% confidence intervals (CI) for the effect of low birth weight on T1DM using data from SCDR.
|Method||Odds ratio||SD (log scale)||95% CI|
|Sample odds ratio||0.757||0.102||(0.615; 0.924)|
|Conditional logistic regression||0.788||0.105||(0.641; 0.967)|
|Intercept adjusted method||0.824||0.150||(0.615; 1.105)|
|Weighted ML method||0.827||0.149||(0.617; 1.108)|
|TMLE||0.917||0.116||(0.731; 1.150)|
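The intervals in Table 5 are consistent with intervals formed on the log scale and transformed back. The sketch below assumes a normal-approximation interval with multiplier 1.96, which is an assumption rather than something stated in the text; the function name `log_scale_ci` is hypothetical. Applied to the conditional logistic regression row, it agrees with the reported interval up to rounding.

```python
import math

def log_scale_ci(or_hat, sd_log, z=1.96):
    """Interval for an odds ratio built on the log scale, then exponentiated."""
    centre = math.log(or_hat)
    return math.exp(centre - z * sd_log), math.exp(centre + z * sd_log)

# Conditional logistic regression row of Table 5: estimate 0.788, SD 0.105
lower, upper = log_scale_ci(0.788, 0.105)
print(round(lower, 3), round(upper, 3))
```

An interval that contains 1 corresponds to a nonsignificant estimate at the 5% level, which is how the marginal estimates are read in the text.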
In this paper, we studied estimators of marginal causal odds ratios within independent and matched case-control designs. We investigated three estimators adjusting both for confounding and the case-control sampling scheme. In simulations, the methods were compared to the sample odds ratio, ignoring confounding, and the common approach of estimating the conditional odds ratio with conditional ML. For policy makers, the conclusion is straightforward: studying a conditional causal odds ratio is not necessarily informative for its marginal counterpart.
In the simulations, under a correctly specified model, the weighted ML estimator and the intercept adjusted estimator have the smallest MSE among the estimators studied in this paper. However, when the model is misspecified, the TMLE estimator shows the smallest MSE in both designs due to its double robustness property. The two former estimators can both be implemented in standard statistical software and they, as well as the TMLE, can be used to construct other parameters such as the causal risk difference and causal relative risk.
For the estimators studied in this paper, it is assumed that the prevalence is known. In practice, the prevalence might be estimated. The researcher may in that case perform a sensitivity analysis using a range of plausible values for the prevalence to study the impact on the estimate of the marginal causal odds ratio. The variance estimation described for the estimators in this study does not take into account the uncertainty of an estimated prevalence. To study the possible impact of an estimated prevalence, the simulations were repeated using an estimate from a population of 30 000 and no essential changes in the results appeared.
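A sensitivity analysis of this kind can be sketched by recomputing the estimate over a grid of plausible prevalence values. The example assumes the intercept adjusted approach for an unmatched design, where the fitted slope coefficients do not depend on the assumed prevalence, so only the offset and the weights change. The sample, the linear predictors, the grid, and the function name are all hypothetical.

```python
import numpy as np

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

def intercept_adjusted_or(lp1, lp0, y, q):
    # lp1, lp0: fitted linear predictors with T set to 1 and 0 (case-control fit)
    n1, n0 = np.sum(y == 1), np.sum(y == 0)
    alpha = np.log(q / (1.0 - q)) - np.log(n1 / n0)      # intercept offset
    w = np.where(y == 1, q / n1, (1.0 - q) / n0)         # normalized weights
    p1 = np.sum(w * expit(lp1 + alpha))
    p0 = np.sum(w * expit(lp0 + alpha))
    return (p1 / (1.0 - p1)) / (p0 / (1.0 - p0))

# Hypothetical case-control sample: 100 cases followed by 400 controls
rng = np.random.default_rng(1)
y = np.repeat([1, 0], [100, 400])
x = rng.uniform(0.0, 1.0, 500) + 0.3 * y
lp0 = -0.5 + 0.8 * x        # made-up fitted linear predictor under T = 0
lp1 = lp0 + 1.0             # made-up treatment coefficient of 1.0

grid = [0.03, 0.05, 0.07, 0.10]
results = {q: intercept_adjusted_or(lp1, lp0, y, q) for q in grid}
for q, value in results.items():
    print(q, round(value, 3))
```

If the estimates change little over the grid, uncertainty about the prevalence is unlikely to drive the conclusion; for the weighted ML estimator or the TMLE, the models would need to be refitted for each value.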
Appendix A: Variance estimation
We want to thank the Swedish Childhood Diabetes Study Group (co-ordinator Gisela Dahlquist). We are grateful to Xavier de Luna for his valuable advice and discussion. In addition, we would like to thank the reviewers for their helpful comments. The authors have received financial support from the Swedish Research Council through grant 0735, Riksbankens Jubileumsfond P11-0814:1, and the Umeå node of the Swedish Initiative for Research on Microdata in the Social and Medical Sciences.
‡ This is approximately two standard deviations less than the mean birth weight of the study population.