Handling Missing Data in Matched Case-Control Studies using Multiple Imputation

Analysis of matched case-control studies is often complicated by missing data on covariates. Analysis can be restricted to individuals with complete data, but this is inefficient and may be biased. Multiple imputation (MI) is an efficient and flexible alternative. We describe two MI approaches. The first uses a model for the data on an individual and includes matching variables; the second uses a model for the data on a whole matched set and avoids the need to model the matching variables. Within each approach, we consider three methods: full-conditional specification (FCS), joint model MI using a normal model, and joint model MI using a latent normal model. We show that FCS MI is asymptotically equivalent to joint model MI using a restricted general location model that is compatible with the conditional logistic regression analysis model. The normal and latent normal imputation models are not compatible with this analysis model. All methods allow for multiple partially-observed covariates, non-monotone missingness and multiple controls per case. They can be easily applied in standard statistical software and valid variance estimates obtained using Rubin’s Rules. We compare the methods in a simulation study. The approach of including the matching variables is most efficient. Within each approach, the FCS MI method generally yields the least biased odds ratio estimates, but normal or latent normal joint model MI is sometimes more efficient. All methods have good confidence interval coverage. Data on colorectal cancer and fibre intake from the EPIC-Norfolk study are used to illustrate the methods, in particular showing how efficiency is gained relative to just using individuals with complete data.


Introduction
Case-control studies are used to investigate associations between disease and putative risk factors. Confounding of observed associations can be handled at the design stage by matching cases and controls on confounders, at the analysis stage by adjusting for confounders using a regression model, or by a combination of these. In matched case-control studies each case is individually matched with one or more controls on a subset of confounders and the (usual) analysis uses conditional logistic regression (CLR) to control for the remaining confounders.
Often, the analysis is complicated by missing data on covariates (i.e. exposures and remaining confounders). A common solution is to restrict analysis to individuals with complete data.
Although appealing for its simplicity, this 'complete-case analysis' ('case' here means any individual, rather than an individual with disease) is inefficient and may be biased. In particular, where exclusion of a case or control leaves a matched set in which remaining members are either all cases or all controls, the whole set ceases to contribute information to the CLR estimating equations.
To improve efficiency and reduce bias, several alternatives have been proposed. Lipsitz et al. (1998) allow for one partially observed covariate. They assume data are missing at random (MAR) and fit a missingness model, i.e. a model for the probability that an individual is a complete case. Functions of the fitted probabilities are then used as offsets in CLR. This consistently estimates odds ratios (ORs) when the missingness model is correctly specified, but is inefficient as it only uses data on complete cases. Paik and Sacco (2000) also allow for just one partially observed covariate and assume MAR. They assume a model for the distribution of the partially observed covariate given the other covariates, matching variables and binary disease status. When this covariate model is correctly specified, consistent estimation of the ORs can be achieved by CLR after imputing the missing covariate as its fitted value when the disease status variable is set to 0.5. Rathouz (2003) notes that this method implicitly assumes missingness does not depend on disease status, and generalises it to allow for such dependence, as well as for multiple missing covariates. His method assumes the partially observed covariates are all observed or all missing on each individual. Sinha and Wang (2009) take a similar approach, but instead of a parametric covariate model, kernel density estimation is used for those functions in the estimating equations that depend on the distribution of the partially observed covariate. They find their OR estimator is less biased than that of Paik and Sacco (2000) when the latter's covariate model is misspecified.
A drawback is that categorical variables are handled by stratifying individuals on these variables and performing kernel density estimation separately in each stratum, which limits the feasible number of categorical variables (and categories). Paik (2004) extends Paik and Sacco's (2000) method to allow for data missing not at random (MNAR).
The aforementioned methods all reduce to standard CLR when there are no missing data: the assumed missingness or covariate model then becomes irrelevant. Other methods for missing data derive information from an assumed covariate model even when data are complete.
These methods may be more efficient but at the cost of possible bias when the covariate model is misspecified. Satten and Carroll (2000) propose such a method. This allows for multiple partially observed covariates, but assumes these are all observed or all missing on each individual. Ahn et al. (2011) generalise it to allow for MNAR and multiple disease states. Rathouz et al. (2002) elaborate Lipsitz et al.'s (1998) method to use a covariate model and so gain efficiency. The resulting estimator is doubly robust but difficult to implement. They also propose a more practical approximation which, though not doubly robust, still gains efficiency. Liu et al. (2013) use empirical likelihood to develop a semiparametric-efficient competitor to Rathouz et al.'s (2002) estimator. Gebregziabher and DeSantis (2010) assume all covariates are categorical and carry out multiple imputation (MI) using a latent-class model. A drawback is that imputation of an individual's missing value makes no use of data on matching variables, covariate values of other individuals in the same matched set, or disease status, which may cause bias (Moons et al., 2006) and inefficiency.
The methods described so far assume the distribution of the partially observed covariate(s) given fully observed covariates, disease status and matching variables can be modelled parametrically. Sometimes this is not feasible. For example, if cases are matched with controls from the same family, from the same postcode area or from the set of patients attending the same general practice, it could be difficult to model parametrically the matching via explicit matching variables, while the alternative of allowing a separate nuisance parameter for each matched set may cause problems with model fitting and induce bias and even inconsistency of estimators. Even when matching could, in principle, be modelled parametrically, this is only possible if the analyst has data on matching variables, which is not always so, and some analysts may prefer to avoid modelling effects of matching variables, since CLR makes no assumptions about the association between disease and matching variables. One solution, adopted by Sinha et al. (2005), is to allow each matched set to have its own parameter in the covariate model but treat these as random effects. They assume a single partially observed covariate and that the random effects are generated by a Dirichlet process. They fit their Bayesian model using a Hastings-Metropolis algorithm with specially written computer code.
Though useful, these methods have limitations. Many assume only one partially observed covariate or that partially observed covariates are collectively observed or missing on each individual. Many require bespoke computer code. Most require parametric modelling of matching variables. In this article we advocate the use of MI, proposing and comparing six MI methods suitable for matched case-control data that can be easily implemented in commonly used statistical packages. MI has several advantages. First, it is increasingly being used to handle missing data and many researchers are familiar with the technique. Second, MI software is readily available and easy to use. Third, MI allows for multiple partially observed covariates without needing them to be collectively observed or missing. Fourth, MI can easily incorporate information on variables that are not included in the CLR model but are predictive of missing covariates in that model. This can increase efficiency and can also reduce bias when these extra variables are required to make the MAR assumption more plausible. Fifth, we propose both methods that parametrically model matching variables and methods in which this is not required. Arguably, a sixth advantage is that, unlike some of the methods proposed earlier, MI reduces to standard CLR when there are no missing data.
Although this means MI does not offer the potential efficiency gain associated with methods that make use of a covariate model even when data are complete, it should make it more robust to misspecification of that model.
We illustrate the use of MI for matched case-control data on a study of the association between fibre intake and colorectal cancer nested within the European Prospective Investigation of Cancer (EPIC) Norfolk cohort. This is one of the studies in the UK Dietary Cohort Consortium, which combines case-control studies nested within several cohorts. Results from this study have been described elsewhere (Dahm et al., 2010). Cases were individuals in the EPIC Norfolk cohort diagnosed with colorectal cancer between recruitment to the cohort (1993-1998) and the end of 2006. Seven-day diet diaries were completed by participants shortly after recruitment to the underlying cohort and stored for later use. The diet diaries were processed for individuals selected for the case-control sample to obtain measures of average daily intake of foods and nutrients (Dahm et al., 2010). There were 318 colorectal cancer cases and each was matched with 4 controls on sex, age within 3 years, and date of diary completion within 3 months. Controls had to be alive and not have been diagnosed with colorectal cancer by the end of 2006. In the original analysis, the association between fibre intake and colorectal cancer was adjusted for several potential confounders using CLR: smoking status (3 categories), education (4 categories), social class (6 categories) and physical activity level (4 categories), and height, weight, exact age, alcohol intake, folate intake, intake of energy from fat and intake of energy from non-fat (all continuous). We wished also to adjust for aspirin use (2 categories). Many other studies have adjusted for aspirin use (Aune et al., 2011), which is known to be associated with reduced risk of colorectal cancer (Asano and McLeod, 2004; National Cancer Institute, 2014). It was omitted from the original analysis (Dahm et al., 2010) because it was not measured in some of the contributing studies.
Of the 1590 individuals in the study, 328 (78 cases and 250 controls) were missing one or more adjustment variables, most commonly aspirin use or social class; the main exposure, fibre intake, and the matching variables were fully observed. A complete-case analysis uses only 240 (75%) matched sets and 1012 (64%) individuals.
The article is structured as follows. Section 2 discusses CLR with complete data. Section 3 describes MI in general. For matched case-control studies, Section 4 proposes three MI methods that parametrically model the matching variables, and Section 5 three analogous methods that avoid this. Section 6 contains a simulation study comparing the methods.
Section 7 describes their application to the EPIC study. We end with a discussion in Section 8.

Analysis of Matched Case-Control Studies with Complete Data
For each individual in the population, let D = 1 if he/she has disease and D = 0 otherwise. So, D = 1 for cases and D = 0 for controls. Let S denote the variables used to match controls with cases. Let X cat and X con denote categorical and continuous covariates, respectively. A categorical variable with m > 2 levels is coded as m − 1 dummy variables. Assume

logit P (D = 1 | X cat = x cat , X con = x con , S) = q(S) + β cat⊤ x cat + β con⊤ x con ,  (1)

where q(S) = logit P (D = 1 | X cat = 0, X con = 0, S). Let M denote the number of controls matched with each case. We use subscript j (j = 1, . . . , M + 1) to index individuals within a set and assume cases and controls have been ordered so that D 1 = 1 and D 2 = . . . = D M +1 = 0.
In ordinary logistic regression the log ORs β cat and β con are estimated by maximising the likelihood based on expression (1) and the data on the sampled individuals. This requires that either q(S) is modelled or a separate baseline parameter is included for each matched set.
The former corresponds to breaking the matching and adjusting for S, which there is often a reluctance to do, because it requires a functional form to be specified for the effect of matching variables on disease risk. The alternative of including a baseline parameter for each set yields inconsistent maximum likelihood estimates (Breslow and Day, 1980). For this reason, CLR is often used instead. CLR includes a baseline parameter for each set, but then eliminates these from the likelihood by conditioning on the number of cases and controls in each set.
Let G(x cat 1 , x con 1 , . . . , x cat M +1 , x con M +1 ) denote the conditional density of (X cat 1 , X con 1 , . . . , X cat M +1 , X con M +1 ) given that D 1 = 1 and D 2 = . . . = D M +1 = 0 and S 1 = . . . = S M +1 . Equation (1) implies that the conditional probability that individual 1 is the case, given the covariate values in the set and that the set contains exactly one case, is

exp(β cat⊤ x cat 1 + β con⊤ x con 1 ) / Σ_{j=1}^{M+1} exp(β cat⊤ x cat j + β con⊤ x con j ),  (2)

and vice versa (Web Appendix A). CLR finds the values of β cat and β con that maximise the product of expression (2) over the matched sets; these consistently estimate the log ORs. If the conditional models used in FCS MI are together compatible with a joint model, FCS MI is asymptotically equivalent to joint model MI using that joint model (Hughes et al., 2014). Otherwise, FCS is less theoretically justified, but there is much evidence that it works well in terms of approximate unbiasedness of parameter and variance estimates and coverage of confidence intervals (van Buuren, 2012; Hughes et al., 2014; Lee and Carlin, 2010). An important theoretical result was given by Liu et al. (2014).
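For illustration only (a sketch of ours, not code from the paper), expression (2), the contribution of one matched set with a single case to the conditional likelihood, can be computed as follows. The function also makes explicit why a set left with only controls (or only cases) after complete-case exclusions carries no information about the odds ratios:

```python
import numpy as np

def clr_set_contribution(beta, x, d):
    """Conditional likelihood contribution of one matched set: the
    probability that the observed members are the cases, given the
    covariate values in the set and the number of cases it contains."""
    d = np.asarray(d)
    n_cases = d.sum()
    if n_cases == 0 or n_cases == len(d):
        # a set that is all controls (or all cases) has only one possible
        # case assignment, so its contribution is 1 for every beta:
        # such sets contribute no information to the CLR estimating equations
        return 1.0
    eta = x @ beta  # linear predictors beta' x_j
    # with one case per set this is exactly expression (2):
    # exp(beta' x_case) / sum_j exp(beta' x_j)
    return np.exp(eta[d == 1].sum()) / np.exp(eta).sum()

beta = np.array([0.4, 0.3])
x = np.array([[1.0, 0.2], [0.0, -0.1], [0.5, 0.3]])  # case first, then controls
d = np.array([1, 0, 0])
contrib = clr_set_contribution(beta, x, d)
```

CLR maximises the product of these contributions over matched sets, so sets returning the constant 1 simply drop out.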

MI using matching variables
Let R denote the missingness pattern in (X cat⊤ , X con⊤ ) ⊤ . Assume D and S are fully observed and the data are MAR. In this section, we propose multiply imputing missing (X cat⊤ , X con⊤ ) ⊤ from its conditional distribution given S and D. We call this 'MI using matching variables'. It is analogous to breaking the matching and adjusting for the matching variables. However, matching is broken only to impute missing data; matching is then restored and the imputed data analysed using CLR. Most methods reviewed in Section 1 effectively break the matching for the individuals with missing data. In Section 5 we describe an alternative ('MI using matched set'), which imputes without breaking the matching. We now propose three ways of modelling the distribution of (X cat⊤ , X con⊤ ) ⊤ given S and D.
The first model for (X cat⊤ , X con⊤ ) ⊤ given S and D is a restricted general location model (Schafer, 1997). This has a log-linear model for X cat and a normal model for X con given X cat :

P (X cat = x cat | S, D) ∝ exp{a(x cat , S; ζ) + D λ⊤ x cat },  (3)

X con | X cat , S, D ∼ N(α + γ X cat + δ S + φ D, Σ),  (4)

where a(x cat , S; ζ) includes a main effect for X cat and all pairwise interactions between X cat and S and between pairs of elements of X cat . Vectors λ, ζ, α and φ and matrices γ, δ and Σ are unknown parameters. In Web Appendix C, we prove that (3)-(4) imply that equation (2) holds with β cat = λ − γ ⊤ Σ −1 φ and β con = Σ −1 φ. Hence, this model is compatible with the CLR analysis model.
Bayesian modelling software, such as WinBUGS (Lunn et al., 2000), can be used to impute missing (X cat⊤ , X con⊤ ) ⊤ from its posterior predictive distribution implied by joint model (3)-(4). However, such software requires specialist programming skills. Instead we propose using FCS MI with a set of conditional models that is compatible with this joint model, and hence is asymptotically equivalent to joint model MI. FCS MI is widely available in general-purpose statistical packages, e.g. Stata, R and SAS. In Web Appendix D, we show that a compatible conditional model for a partially observed continuous covariate (an element of X con ) is a linear regression of this covariate on S, D, X cat and the remaining elements of X con . Likewise, a compatible conditional model for one of the partially observed categorical covariates making up X cat is a multinomial logistic regression of this categorical covariate on S, D, X con and those elements of X cat that are not dummy indicators for this categorical covariate.
Conveniently, these conditional models are the default options in many MI packages.
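To make the FCS step concrete, the following is a minimal sketch of ours (not the authors' implementation) of one conditional-imputation step for a partially observed continuous covariate, regressed on S, D and the other covariates as described above. For brevity it is an 'improper' sketch: it fixes the regression coefficients at their estimates rather than drawing them from their posterior, as proper MI software such as ice or mice would.

```python
import numpy as np

def impute_continuous(y, Z, miss, rng):
    """One FCS step for a partially observed continuous covariate y:
    fit a linear regression of y on the predictors Z (here: S, D and the
    other covariates) among individuals with y observed, then impute the
    missing values from the fitted normal model."""
    Zc, yc = Z[~miss], y[~miss]
    X = np.column_stack([np.ones(len(Zc)), Zc])
    coef, *_ = np.linalg.lstsq(X, yc, rcond=None)   # OLS fit on observed y
    resid = yc - X @ coef
    sigma = np.sqrt(resid @ resid / max(len(yc) - X.shape[1], 1))
    Xm = np.column_stack([np.ones(miss.sum()), Z[miss]])
    y = y.copy()
    y[miss] = Xm @ coef + rng.normal(0.0, sigma, miss.sum())  # draw imputations
    return y

rng = np.random.default_rng(1)
n = 40
S = rng.normal(size=n)                          # matching variable
D = np.tile([1.0, 0.0], n // 2)                 # case/control indicator
Xb = rng.normal(size=n)                         # a fully observed covariate
y = 0.5 * S + 0.3 * D + 0.2 * Xb + rng.normal(size=n)
miss = rng.random(n) < 0.3
y_obs = np.where(miss, np.nan, y)
y_imp = impute_continuous(y_obs, np.column_stack([S, D, Xb]), miss, rng)
```

In a full FCS cycle, an analogous multinomial logistic step would be applied to each partially observed categorical covariate, and the cycle iterated.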
Although asymptotically equivalent, in finite samples this FCS MI method may be inefficient compared to joint model MI, because it estimates the parameters of the conditional model for X cat using only part of the available data on X con (Hughes et al., 2014). Our second proposed model for (X cat⊤ , X con⊤ ) ⊤ given S and D is a latent normal model (Carpenter and Kenward, 2013). This is not compatible with the CLR analysis model, but it has the advantage that it can be used for joint model MI without needing specialist Bayesian software. For simplicity, suppose that all the categorical covariates are binary (see Carpenter and Kenward (2013) for the general case). The latent normal model is

(W cat⊤ , X con⊤ ) ⊤ | S, D ∼ N(α LN + φ LN D + δ LN S, Σ LN ),  (5)

where W cat is a vector of latent variables (each with unit variance), one for each element of X cat , and such that an element of X cat equals 1 if its corresponding element of W cat is positive and 0 otherwise. α LN , φ LN , δ LN and Σ LN are unknown parameters. Joint model MI using (5) can be done using the software REALCOM-MI or the jomo package in R. The realcomImpute program provides an interface between Stata and REALCOM-MI.
When all partially observed covariates are continuous, our FCS MI method and joint model MI using (5) both reduce to joint model MI using the normal model (4). Use of this normal model for MI even when some partially observed variables are categorical was originally promoted by Schafer (1997) and has become common. Although the model is obviously misspecified, this method has been found to work well in many situations and software is widely available, e.g. mi impute mvn in Stata and norm in R. Thus, our third proposed model for (X cat⊤ , X con⊤ ) ⊤ given S and D is expression (5) with W cat replaced by X cat .
Following Bernaards et al. (2007), we use 'adaptive rounding' after imputation to handle non-integer imputed values of X cat .
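A minimal sketch (ours, not the authors' code) of adaptive rounding for a binary variable imputed on a continuous scale; the threshold formula below is one common statement of the Bernaards et al. (2007) rule, which moves the cut-off away from 0.5 according to the mean of the imputed values:

```python
import numpy as np
from statistics import NormalDist

def adaptive_round(y_imp):
    """Adaptive rounding: round continuous imputations of a binary
    variable using the data-driven threshold
    c = omega - Phi^{-1}(omega) * sqrt(omega * (1 - omega)),
    where omega is the mean of the imputed values."""
    y_imp = np.asarray(y_imp, float)
    omega = np.clip(y_imp.mean(), 1e-6, 1 - 1e-6)   # guard against 0 or 1
    c = omega - NormalDist().inv_cdf(omega) * np.sqrt(omega * (1 - omega))
    return (y_imp > c).astype(int)

rounded = adaptive_round(np.array([0.1, 0.45, 0.8]))
```

When omega = 0.5 the threshold reduces to the naive 0.5; otherwise it shifts to better preserve the marginal proportion of 1s.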

MI using matched set
Now we propose three models for X set = (X cat⊤ 1 , X con⊤ 1 , . . . , X cat⊤ M +1 , X con⊤ M +1 ) ⊤ given D 1 , . . . , D M +1 that do not condition on S, thus allowing imputation without using the matching variables. These are analogous to the models in Section 4 but involve a matched-set-specific random effect, u.
The first is a restricted general location model. Assume that, for each matched set, X cat 1 , . . . , X cat M +1 jointly follow a log-linear model given u and D 1 , . . . , D M +1 , in which b 1 (x cat j ; ν) includes a main effect of each element of X cat j and an interaction between each pair of these elements, and b 2 (x cat j , x cat k ; ν) includes all pairwise interactions between one element of X cat j and one element of X cat k . This allows correlation between X cat of members of the same matched set. Also assume that, for j = 1, . . . , M + 1 independently, X con j is normally distributed given X cat 1 , . . . , X cat M +1 , u and D 1 , . . . , D M +1 , where X̄ cat = (M + 1) −1 Σ_{j=1}^{M+1} X cat j . Note that ψ and Ω allow correlation between one individual's X con and the X cat and X con of other members of the same matched set. In Web Appendix E, we show this model implies equation (2) holds with β cat = τ − ρ ⊤ (F − C)ξ and an analogous expression for β con . As in Section 4, we propose using FCS MI with conditional models compatible with this joint model. In Web Appendix F, we show that a compatible conditional model for a partially observed element of X con j is a linear regression of that element on X cat j , Σ_{k≠j} X cat k , Σ_{k≠j} X con k and all the remaining elements of X con j . Likewise, a compatible conditional model for one of the partially observed categorical variables making up X cat j is a multinomial logistic regression of this categorical variable on X con j , Σ_{k≠j} X con k , Σ_{k≠j} X cat k , and those elements of X cat j that are not dummy indicators for this categorical variable. These conditional models are not the default options in MI software, because some predictors in the regression are sums of conditioning variables, e.g. Σ_{k≠j} X cat k . However, specification of non-default conditional models is straightforward (see Web Appendix H).
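The compatible conditional models above use, as predictors, sums of each covariate over the other members of the matched set. A small illustrative helper (ours, not from the paper) constructing these leave-one-out sums for one set:

```python
import numpy as np

def leave_one_out_sums(X):
    """For covariate values X of shape (M+1, p) in one matched set,
    return for each member j the sum of X over the other members,
    i.e. the predictor sum over k != j of X_k used in the compatible
    conditional imputation models."""
    total = X.sum(axis=0, keepdims=True)   # sum over all M+1 members
    return total - X                       # row j: total minus member j's row

X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
loo = leave_one_out_sums(X)
```

These derived columns would then be passed to the imputation software as additional (non-default) predictors.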
As with the FCS method in Section 4, this method is asymptotically equivalent to joint model MI, but in finite samples may be inefficient. Our second proposed model for X set is a latent normal model with random effects (Carpenter and Kenward, 2013). Like the latent normal model of Section 4, this is not compatible with equation (2), but its use may improve efficiency. The latent normal model is the same as model (5), but with δ LN S replaced by u and now conditioning on all of D 1 , . . . , D M +1 :

(W cat⊤ j , X con⊤ j ) ⊤ | u, D 1 , . . . , D M +1 ∼ N(α LN + φ LN D j + u, Σ LN ), for j = 1, . . . , M + 1 independently,  (10)

where u is normally distributed with mean zero and unstructured variance matrix given D 1 = 1, D 2 = . . . = D M +1 = 0. Again, joint model MI can be done using REALCOM-MI or jomo.
As in Section 4, there is a normal version of this model. This assumes model (10) but with W cat j replaced by X cat j . Joint model MI with this model can be done using the pan package in R.

Simulation Study
One thousand datasets were generated for each of 24 scenarios resulting from considering two sample sizes (N = 100 or 500 matched sets), two numbers of matching controls (M = 1 or M = 4), three missingness mechanisms, and two proportions of missing data. Each dataset was generated using the model defined by expressions (3)-(4). Specifically, there were two matching variables, one binary (S cat ) and one continuous (S con ), and three covariates, one categorical (X cat ) and two continuous (X conA and X conB ). We assumed P (S cat = 1 | D = 1) = 0.6 and S con | S cat , D = 1 ∼ N(0, 1). These could represent, for example, sex and standardised age. Among cases, the sex with greater risk would be more common, while age might be approximately normal if risk increases with age but total population size diminishes due to all-cause mortality. We assumed logit P (X cat = 1 | S cat , S con , D) = −2.5 + 0.5S cat + 0.5S con + 0.75D (so about 10% of controls and 20% of cases have X cat = 1), and (X conA , X conB ) given X cat , S cat , S con , D is bivariate normal with univariate marginal distributions N(0.5X cat + 0.5S cat + 0.5S con + 0.5D, 1) and covariance 0.5. From β cat = λ − γ ⊤ Σ −1 φ and β con = Σ −1 φ, the true log ORs of X cat , X conA and X conB are β cat = 5/12, β conA = 1/3 and β conB = 1/3.
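The stated true log ORs can be verified numerically from β cat = λ − γ ⊤ Σ −1 φ and β con = Σ −1 φ (a quick check of ours, not part of the simulation code):

```python
import numpy as np

# parameters of the data-generating model in the simulation study
lam = 0.75                      # coefficient of D in the model for X_cat
gamma = np.array([0.5, 0.5])    # effect of X_cat on (X_conA, X_conB)
phi = np.array([0.5, 0.5])      # effect of D on (X_conA, X_conB)
Sigma = np.array([[1.0, 0.5],
                  [0.5, 1.0]])  # residual covariance of (X_conA, X_conB)

beta_con = np.linalg.solve(Sigma, phi)   # Sigma^{-1} phi = (1/3, 1/3)
beta_cat = lam - gamma @ beta_con        # lambda - gamma' Sigma^{-1} phi = 5/12
```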
Missingness was imposed on X cat and X conA assuming either missing completely at random (MCAR) or one of two MAR mechanisms. For MCAR data, each individual's X cat and X conA variables were independently missing with probability p miss . Two values, p miss = 0.1 and p miss = 0.25, were considered. Thus, either 19% or 44% of individuals had at least one missing variable. For the first MAR mechanism (MAR-A), each individual's X cat and X conA variables were independently missing with logit probability c miss + 0.25(X conB + S cat + S con + D). For the second (MAR-B), it was c miss + 0.25(X conB + S cat + S con + D + X conB D). In both cases, c miss was chosen to give p miss = 0.1 or p miss = 0.25 missingness in each of X cat and X conA .
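The three missingness mechanisms can be sketched as follows (illustrative Python of ours; the function and argument names are our own, and in practice c miss was calibrated so the marginal missingness rate equals p miss, which here is simply passed in):

```python
import numpy as np

rng = np.random.default_rng(2024)

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def impose_missingness(X_conB, S_cat, S_con, D, mechanism,
                       c_miss=0.0, p_miss=0.25):
    """Boolean missingness indicator for one covariate under MCAR,
    MAR-A (logit probability linear in X_conB, S_cat, S_con, D) or
    MAR-B (additionally depends on the interaction X_conB * D)."""
    n = len(D)
    if mechanism == "MCAR":
        return rng.random(n) < p_miss
    lin = c_miss + 0.25 * (X_conB + S_cat + S_con + D)
    if mechanism == "MAR-B":
        lin = lin + 0.25 * X_conB * D   # extra dependence on X_conB * D
    return rng.random(n) < expit(lin)

n = 20000
X_conB = rng.normal(size=n)
S_cat = rng.integers(0, 2, n).astype(float)
S_con = rng.normal(size=n)
D = rng.integers(0, 2, n).astype(float)
miss_mcar = impose_missingness(X_conB, S_cat, S_con, D, "MCAR", p_miss=0.25)
miss_marA = impose_missingness(X_conB, S_cat, S_con, D, "MAR-A", c_miss=-1.5)
```

The same indicator would be drawn independently for X cat and X conA.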
Each dataset was analysed using CLR with X cat , X conA and X conB as covariates. Missing data were handled in seven ways: complete-case analysis; FCS MI using matching variables or matched set (using ice in Stata); latent normal MI using matching variables or matched set (jomo in R); and normal MI using matching variables (mi impute mvn in Stata) or matched set (pan in R). We used 25 imputed datasets when p miss = 0.1, and 50 when p miss = 0.25. In addition, the complete data were analysed before imposing missingness on the covariates.
Tables 1 and 2 show results for the MCAR mechanism with 1:1 matching (M = 1) and 1:4 matching (M = 4) when N = 500. The results from these scenarios also give a good indication of the general patterns observed for MAR-A, MAR-B and N = 100 (see Web Tables 1-10). We shall focus on β cat and β conA , since for β conB (the fully observed covariate), differences between the three MI methods using matching variables were small, as were differences between those using matched set. To ease comparison of the six MI methods, Tables 3 and 4 show, for p miss = 0.25, the bias, ratio of empirical SEs, ratio of mean estimated SEs, and relative efficiency (i.e. ratio of mean squared errors, MSE) of each method, averaged over the three missingness mechanisms, separately for β cat and β conA , for N = 100 and 500, and for M = 1 and M = 4. Unsurprisingly, differences between methods were smaller when p miss = 0.1; we focus on p miss = 0.25 below.
The two FCS methods are approximately unbiased when N = 500 and usually when N = 100. Exceptions are when N = 100 and M = 1, where the complete-data method is also biased (with biases similar to those of FCS MI), and when N = 100 and M = 4, where there is bias for β cat when using matched set. Normal MI has some negative bias for β cat , especially when using matching variables (except when N = 100 and M = 1, where its negative bias cancels out the positive bias of the complete-data estimator). Latent normal MI has some positive bias for β cat when M = 1; latent normal MI using matched set also has negative bias for β conA . The complete-case estimators are generally approximately unbiased, but note that the estimator of β conB is severely biased under MAR-B (Web Tables 2, 4, 7 and 10). Normal MI is more efficient than FCS MI when M = 1, but is the least efficient of all the methods when N = 500 and M = 4, where its bias dominates its smaller SE. All MI methods are more efficient than the complete-case analysis.
The MI methods show a tendency to slightly overestimate SEs. Mostly, this is fairly mild, but is more severe for normal MI with β cat when M = 1, and for latent normal MI with β conA when M = 1 or 4. Thus, although normal and latent normal MI are most efficient for β cat and β conA , respectively, this advantage is not apparent in the width of the estimated confidence intervals.
Indeed, the average estimated SEs of the three MI methods using matching variables were generally rather similar; the same was true of the methods using matched set. Coverage of 95% confidence intervals was between 93% and 97% for all methods.
We also performed two simulation studies using modified data-generating mechanisms that make our imputation models misspecified. In the first, there was an interaction between S cat and S con ; in the second, X conA and X conB were log-normally distributed. See Web Appendix G and Web Tables 11-16 for details and results. Briefly, none of the MI methods showed considerable bias for either of these data-generating mechanisms, and all MI methods were much more efficient than the complete-case analysis.
In summary, all the MI methods appear to work well. Using matching variables is more efficient than using matched set. If using matching variables, normal and latent normal MI appear preferable to FCS MI, which is less efficient; normal MI is more efficient for β cat , and latent normal MI more efficient for β conA . Of these, one might prefer latent normal MI, because of the bias in β cat for normal MI. If using matched set, FCS MI might be preferred when M = 4, on bias and efficiency grounds; no method was obviously best or worst for 1:1 matching (M = 1).

Analysis of EPIC-Norfolk Data
A drawback of FCS MI using matched set when M > 1 is that the estimates may depend on the arbitrary order chosen for the M controls in each matched set. Any order produces valid imputations, but one could avoid this dependence by randomly permuting indices of controls within matched sets before generating each imputed dataset. However, that is only likely to be worthwhile if the sample size is small and there is a lot of missing data. Normal MI was, in general, the most biased of the three methods, but even its biases were fairly modest. A slight drawback of normal MI is the need manually to post-process imputed values of categorical variables, e.g. using adaptive rounding. None of the MI methods was uniformly superior to the others in simulations, and we regard use of any of them as entirely acceptable.
All methods can handle the situation where the number of matched controls, M, varies between cases, although this is slightly more complicated for FCS MI using matched set. For this method, extra controls with completely missing data would have to be added to those matched sets with fewer than the maximum number of controls, before performing MI, and then deleted again before analysing the imputed datasets.
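The padding device just described can be sketched as follows (a hypothetical helper of ours, assuming each matched set is stored as a case-first array of covariate values):

```python
import numpy as np

def pad_sets(sets, M_max):
    """Pad matched sets with fewer than M_max controls with phantom
    controls whose covariates are entirely missing (NaN), so that FCS MI
    using matched set can treat every set as 1:M_max. Each set is an
    array of shape (1 + M, p): the case first, then its M controls.
    Returns the padded sets and, per set, a flag marking the phantom
    rows, which would be dropped again after imputation and before
    fitting the CLR model."""
    padded, phantom = [], []
    for s in sets:
        k = (1 + M_max) - s.shape[0]             # phantom controls needed
        pad = np.full((k, s.shape[1]), np.nan)   # all covariates missing
        padded.append(np.vstack([s, pad]))
        phantom.append(np.r_[np.zeros(s.shape[0], bool), np.ones(k, bool)])
    return padded, phantom

sets = [np.ones((3, 2)), np.ones((5, 2))]        # a 1:2 set and a 1:4 set
padded, phantom = pad_sets(sets, M_max=4)
```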
Another method, which merits further research, is joint model MI using the restricted general location model. This requires specialist Bayesian software and more advanced programming skills, and the focus of this article is on methods that are easy to implement in standard packages. Nevertheless, it would be worth investigating whether this method is significantly more efficient than our FCS MI method based on the same model. The mix package in R (Schafer, 1997) may also be of interest. This uses a model similar to (3)-(4), but additionally assumes the continuous part of S is normally distributed given D, X cat and the rest of S. The MI methods considered in this article, like the CLR analysis model, assume nothing about the distribution of the matching variables. mix cannot be used for MI using matched set.
Finally, we note that, as always with missing data methods, it is important to consider the plausibility of the assumption about the missing data mechanism. Often, the MAR assumption can be made more plausible by including in the imputation model additional variables that are associated with the partially observed covariates.
Web Appendix A: Equivalence of Equation (1) and Expression (2)
Since this is true for any x cat , x con and any value of S, equation (1) holds with q(S) = logit P (D = 1 | X cat = 0, X con = 0, S).

Web Appendix B: Joint Model MI and Full-Conditional Specification (FCS) MI
In this appendix, we provide a more technical description of joint model MI and FCS MI than that given in the main text of the paper. We begin with joint model MI, and then discuss FCS MI and its relation to joint model MI.
Consider a generic dataset with n units and K random variables Y 1 , . . . , Y K , some of which are potentially incomplete. Let Y •k = (Y 1k , . . . , Y nk ) ⊤ denote the values of Y k for the n units, and let Y •−k denote the values of the other K − 1 variables. Let Y obs •k and Y mis •k denote, respectively, the observed and missing parts of Y •k , and let Y obs •−k and Y mis •−k denote the observed and missing parts of Y •−k .
In joint model MI, a model f (Y 1 , . . . , Y K | θ) is specified for Y 1 , . . . , Y K , with prior distribution π(θ) on the parameters θ of this model. Random vectors (Y i1 , . . . , Y iK ) and (Y j1 , . . . , Y jK ) (i ≠ j) are assumed to be conditionally independent given θ. Let f k (Y k | Y −k ; θ) denote the full-conditional distribution of Y k , i.e. the distribution of Y k given Y −k and θ implied by f (Y 1 , . . . , Y K | θ). Imputed values for Y mis •1 , . . . , Y mis •K are drawn from their posterior predictive distribution

f (Y mis •1 , . . . , Y mis •K | Y obs •1 , . . . , Y obs •K ) = ∫ f (Y mis •1 , . . . , Y mis •K | Y obs •1 , . . . , Y obs •K , θ) π(θ | Y obs •1 , . . . , Y obs •K ) dθ

(Little and Rubin, 2002). This can be achieved by using a Gibbs sampler algorithm. One form of the Gibbs sampler is as follows (Liu et al., 2014). Initially, the missing values of Y k (k = 1, . . . , K) are replaced by the mean, or a random sample, of the observed values of Y k .
A single iteration of the Gibbs sampler involves K steps, in the kth of which the missing values of Y k are updated. The kth step (k = 1, . . . , K) of the algorithm is to sample θ from its posterior distribution given Y obs •k and Y •−k (using the current value of Y mis •−k ) and then sample the missing values of Y k from f k (Y k | Y −k ; θ), using the sampled value of θ.
Step k is omitted if Y_k is fully observed. The Gibbs sampler is iterated until convergence, at which point Y^mis_{•1}, . . . , Y^mis_{•K} are drawn from their posterior predictive distribution. This algorithm is applied repeatedly to the original dataset, in order to create multiple imputed datasets. The analysis model is then fitted to each imputed dataset separately and the resulting parameter and variance estimates combined using Rubin's Rules (Little and Rubin, 2002). The imputation model is said to be compatible with the analysis model if there exists a model for the joint distribution of all the variables that implies the analysis and imputation models as submodels. When the imputation model is correctly specified and compatible with the analysis model, and data are MAR, joint model MI gives consistent parameter and variance estimates for the analysis model (Little and Rubin, 2002).
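The Rubin's Rules combination step just mentioned can be sketched as follows. This Python fragment is purely illustrative (the paper's own analyses use Stata and R), and the example estimates are made up.

```python
import statistics

def rubin_pool(estimates, variances):
    """Combine M point estimates and their estimated variances,
    one pair per imputed dataset, using Rubin's Rules."""
    m = len(estimates)
    qbar = sum(estimates) / m           # pooled point estimate
    ubar = sum(variances) / m           # average within-imputation variance
    b = statistics.variance(estimates)  # between-imputation variance
    t = ubar + (1 + 1 / m) * b          # total variance of the pooled estimate
    return qbar, t

# hypothetical log odds ratio estimates from M = 3 imputed datasets
est, var = rubin_pool([0.40, 0.50, 0.45], [0.040, 0.050, 0.045])
```

The total variance t feeds into Rubin's degrees-of-freedom formula when confidence intervals are constructed; here only the pooled estimate and variance are shown.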
An alternative to joint model MI is FCS MI (also known as MI by chained equations) (van Buuren, 2012). FCS MI avoids the need to specify a model for the joint distribution of Y_1, . . . , Y_K, and hence the need to sample θ from its posterior given Y^obs_{•k} and Y_{•−k} and to sample from the full-conditionals f_k. Instead, for each k = 1, . . . , K, a model g_k(Y_k | Y_{−k}; θ_k) is specified for the conditional distribution of Y_k given Y_{−k}, and a non-informative prior π_k(θ_k) is specified for the parameters θ_k of this model. FCS MI proceeds as per the Gibbs sampler described above, except that at step k, θ_k is sampled from the posterior distribution proportional to the likelihood formed by the product of g_k(Y_{ik} | Y_{i,−k}; θ_k) over units i with observed Y_{ik}, multiplied by the prior π_k(θ_k); the missing values of Y_{ik} are then sampled from g_k(Y_{ik} | Y_{i,−k}; θ_k).
An important theoretical result about the asymptotic relation between joint model MI and FCS MI was provided by Liu et al. (2014). They defined the set of conditional models g_1, . . . , g_K to be compatible with a joint model f if the following condition holds for each k = 1, . . . , K: for each value of θ_k in its parameter space, there exists at least one value of θ in its parameter space for which g_k(Y_k | Y_{−k}; θ_k) = f_k(Y_k | Y_{−k}; θ). (Note that there is no need for there to exist a value of θ for which this is true for all k = 1, . . . , K simultaneously.) Liu et al. (2014) showed that when the set of conditional models is compatible with a joint model, the distribution of the imputed data converges, as the sample size tends to infinity, to the posterior predictive distribution of the missing data under that joint model. Hence, the use of FCS MI is asymptotically valid in this case.
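As a concrete toy illustration of the FCS cycling just described, the following Python sketch alternates between two univariate normal linear conditional models for two continuous variables. It is for intuition only: it omits the Bayesian draw of θ_k from its posterior, so it is not a valid MI procedure, and none of it is the paper's own code.

```python
import random

def ols(x, y):
    """Least-squares intercept and slope for the line y ~ a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b

def obs_mean(v):
    """Mean of the non-missing (non-None) entries of v."""
    obs = [u for u in v if u is not None]
    return sum(obs) / len(obs)

def fcs_impute(y1, y2, n_iter=10, seed=1):
    """Toy FCS imputation for two continuous variables (None = missing).
    Each cycle re-fits a normal linear model for each variable given the
    other and redraws that variable's missing values."""
    rng = random.Random(seed)
    mis1 = {i for i, v in enumerate(y1) if v is None}
    mis2 = {i for i, v in enumerate(y2) if v is None}
    y1 = [obs_mean(y1) if v is None else v for v in y1]  # initialise with means
    y2 = [obs_mean(y2) if v is None else v for v in y2]
    for _ in range(n_iter):
        for mis, dep, pred in ((mis1, y1, y2), (mis2, y2, y1)):
            keep = [i for i in range(len(dep)) if i not in mis]
            a, b = ols([pred[i] for i in keep], [dep[i] for i in keep])
            for i in mis:  # redraw missing values from the fitted model
                dep[i] = a + b * pred[i] + rng.gauss(0, 1)
    return y1, y2
```

Note that, as in the algorithm above, the current imputed values of one variable are used as predictors when imputing the other.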
For more information about joint model MI and FCS MI, refer to, for example, Carpenter and Kenward (2013).

It follows from equation (3) that
log P(X^cat = x^cat | S, D = 1) − log P(X^cat = x^cat | S, D = 0) = λ⊤x^cat + terms that do not depend on x^cat.
Therefore log p(X^con_1, X^cat_1 | S, D_1 = 1) − log p(X^con_1, X^cat_1 | S, D_1 = 0) does not depend on S, and hence the required result follows.

Web Appendix D: Proof that linear regression and multinomial logistic regression conditional models are compatible with the general location model of equations (3) and (4)

It is obvious from expression (4) that the conditional distribution of any element of X^con given X^cat, S, D and the remaining elements of X^con is a normal distribution with mean equal to a linear function of those variables and with constant variance. Since the parameters of the joint normal distribution in expression (4) are unconstrained, so are the parameters describing the mean and variance of this conditional distribution. Therefore, a linear regression of a partially observed element of X^con on X^cat, S, D and the remaining elements of X^con (with main effects only and no interactions) constitutes a conditional model that is compatible (in the sense of Definition 1 of Liu et al., 2014) with the general location model of equations (3) and (4).

Now consider partially observed categorical covariates. By Bayes' Theorem, log p(X^cat | X^con, S, D) = log p(X^con | X^cat, S, D) + log P(X^cat | S, D) − log p(X^con | S, D); substituting expressions (3) and (4) into the right-hand side gives equation (14). Now write X^cat = (X^cat⊤_A, X^cat⊤_B)⊤, where X^cat_A represents one categorical variable (coded as a vector of dummy indicators), and X^cat_B represents the remaining variables in X^cat; note that X^cat_A is a vector whose elements all equal zero except for at most one element. To obtain the distribution of X^cat_A given X^con, S, D and X^cat_B, we take equation (14) and ignore all terms not involving X^cat_A. The result is equation (15), where γ_A, γ_B, λ_A and λ_B are given by the corresponding partitions of γ and λ.
It follows from equation (15) that the distribution of X^cat_A given X^con, S, D and X^cat_B follows a multinomial logistic regression with main effects for S, X^cat_B, D and X^con. It can also be seen that, because there are no constraints on the parameters of the general location model of expressions (3) and (4) or on ζ_1, ζ_2 and ζ_3, there are also no constraints on the main-effect parameters in this multinomial logistic regression model. Hence, a multinomial logistic regression of a partially observed categorical variable on X^con, S, D and the elements of X^cat that are not dummy indicators for this categorical variable (with main effects only) constitutes a conditional model that is compatible (in the sense of Definition 1 of Liu et al., 2014) with the general location model of equations (3) and (4).

Web Appendix E: A lemma
If Λ and Ω are invertible, symmetric n × n matrices and M ≥ 1, then the inverse of the (M + 1) × (M + 1) block matrix with diagonal blocks Λ and off-diagonal blocks Ω is the (M + 1) × (M + 1) block matrix with diagonal blocks C and off-diagonal blocks F, for matrices C and F determined by Λ, Ω and M.
This result is easily verified. Note that C and F are symmetric.
Web Appendix F: Proof that linear regression and multinomial logistic regression conditional models are compatible with the general location model of equations (6)–(9)

First, consider a continuous covariate. It follows from expressions (8)–(9) that the conditional distribution of X^con_j given X^cat_1, . . . , X^cat_{M+1}, X^con_1, . . . , X^con_{j−1}, X^con_{j+1}, . . . , X^con_{M+1}, D_1 = 1 and D_2 = . . . = D_{M+1} = 0 is normal (expression (18)), with mean given by equation (19) and variance Λ′, where the M × M matrices C′ and F′ are the same as the (M + 1) × (M + 1) matrices C and F but with M replaced by M − 1 and the final row and final column removed. Line (19) follows because D_1, . . . , D_{M+1} sum to one. Note that the right-hand side of equation (19) is a linear combination of X^cat_j, Σ_{k≠j} X^cat_k and Σ_{k≠j} X^con_k, and that Λ′ does not depend on any of the variables.
Now partition X^con_j as X^con_j = (X^con⊤_jA, X^con⊤_jB)⊤, where X^con_jA denotes a single element of X^con_j. It follows from expression (18) that the conditional distribution of X^con_jA given X^con_jB, X^cat_1, . . . , X^cat_{M+1}, X^con_1, . . . , X^con_{j−1}, X^con_{j+1}, . . . , X^con_{M+1}, D_1 = 1 and D_2 = . . . = D_{M+1} = 0 is normal with mean μ_jA and variance Λ′′, where μ_jA is a linear combination of X^cat_j, Σ_{k≠j} X^cat_k, Σ_{k≠j} X^con_k and X^con_jB, and where Λ′′, like Λ′, is not a function of the variables.
Since the parameters in equation (8) are unconstrained, it can be seen from expression (18) that the parameters relating the mean of X^con_j to X^cat_j, Σ_{k≠j} X^cat_k and Σ_{k≠j} X^con_k are unconstrained. Thus, the parameters relating the mean μ_jA to X^cat_j, Σ_{k≠j} X^cat_k, Σ_{k≠j} X^con_k and X^con_jB are also unconstrained. Therefore a linear regression of a partially observed element of X^con_j on X^cat_j, Σ_{k≠j} X^cat_k, Σ_{k≠j} X^con_k and the remaining elements of X^con_j (with main effects only and no interactions) constitutes a conditional model that is compatible (in the sense of Definition 1 of Liu et al., 2014) with the general location model of expressions (6)–(8).
Second, consider a categorical covariate. Using equation (9.28) of Schafer (1997), it follows from expressions (6), (7) and (8) that the log probability of X^cat_1, . . . , X^cat_{M+1} given X^con_1, . . . , X^con_{M+1}, D_1 = 1 and D_2 = . . . = D_{M+1} = 0 has the form of equation (20), with coefficients given by equation (21). Now partition X^cat_j as X^cat_j = (X^cat⊤_jA, X^cat⊤_jB)⊤, where X^cat_jA denotes the subvector of X^cat_j that consists of the dummy indicators for a single categorical covariate, and partition x^cat_j analogously. To obtain the distribution of X^cat_jA given X^cat_jB, X^cat_1, . . . , X^cat_{j−1}, X^cat_{j+1}, . . . , X^cat_{M+1}, X^con_1, . . . , X^con_{M+1}, D_1 = 1 and D_2 = . . . = D_{M+1} = 0, we can take equation (20) and ignore terms not involving x^cat_jA. Using equation (21), it can be shown that this yields equation (22), where u_{1j} is a vector function of ν and the parameters in equation (21), and U_2, . . . , U_5 are matrix functions of the same set of parameters. From equations (20) and (22) it can be seen that the distribution of X^cat_jA given X^cat_jB, X^cat_1, . . . , X^cat_{j−1}, X^cat_{j+1}, . . . , X^cat_{M+1}, X^con_1, . . . , X^con_{M+1}, D_1 = 1 and D_2 = . . . = D_{M+1} = 0 has the form of a multinomial logistic regression with main effects for X^con_j, Σ_{k≠j} X^con_k, Σ_{k≠j} X^cat_k and X^cat_jB.
The inclusion of b(x^cat_1, . . . , x^cat_{M+1}; ν) with unconstrained ν ensures that u_{1j}, U_2 and U_4 are unconstrained. Therefore, it only remains to check that U_3 and U_5 are unconstrained. It can be shown that U_3 and U_5 are determined by ρ_A and ψ_A, where ρ_A and ψ_A are given by the partitions ρ = [ρ_A ρ_B] and ψ = [ψ_A ψ_B]. Since ρ_A and ψ_A are unconstrained, so are U_3 and U_5. So, none of the parameters in the multinomial logistic regression is constrained. Hence, this multinomial logistic regression is compatible (in the sense of Definition 1 of Liu et al., 2014) with the general location model of expressions (6)–(9).

Web Appendix G: Additional simulation studies with misspecified models
We simulated datasets using data-generating mechanisms under which the imputation models of all our MI methods were misspecified. In the first, data were simulated with an interaction between S^cat and S^con; in the second, X^conA and X^conB were log-normally distributed. We now detail these two data-generating mechanisms, before giving the results.

Sensitivity analysis 1: interaction between S^cat and S^con

The data-generating process used in the main simulations (Section 6) assumed that the conditional distribution of (X^conA, X^conB) given X^cat, S^cat, S^con and D was bivariate normal with univariate marginal distributions N(0.5X^cat + 0.5S^cat + 0.5S^con + 0.5D, 1) and covariance 0.5. We modified this data-generating process by changing the univariate marginal distributions to N(0.5X^cat + 0.5S^cat + 0.5S^con + 0.5S^cat S^con + 0.5D, 1), thus including an interaction term that was not present in the imputation models. The rest of the data-generating process was the same as that used in the main simulations, including the application of the MCAR, MAR-A and MAR-B missingness mechanisms with 10% or 25% missingness. One thousand simulated datasets, each with N = 500 cases and 500 matched controls (M = 1), were generated.
Sensitivity analysis 2: log-normally distributed X^conA and X^conB

We simulated data for a matched case-control study as follows. First, data on a population of 15000 independent individuals were generated. For each individual in this population, variables S^cat, S^con, X^cat, X^conA, X^conB and D were generated. Binary variable S^cat and continuous variable S^con were generated independently using P(S^cat = 1) = 0.5 and S^con ∼ N(0, 1). Binary variable X^cat was generated with logit P(X^cat = 1 | S^cat, S^con) = −2.5 + 0.5S^cat + 0.5S^con. Continuous variables X^conA and X^conB were generated using X^conA = 0.5X^cat + 0.5S^cat + 0.5S^con + exp(e_A) and X^conB = 0.5X^cat + 0.5S^cat + 0.5S^con + exp(e_B), where (e_A, e_B) was bivariate normally distributed with zero means and unit variances, independently of S^cat, S^con and X^cat. The covariance of (e_A, e_B) was chosen so that the covariance between X^conA and X^conB conditional on X^cat, S^cat and S^con was approximately 0.5, as it was in the main simulations. Binary variable D was generated using logit P(D = 1 | S^cat, S^con, X^cat, X^conA, X^conB) = −4.68 + 0.25S^cat + 0.25S^con + (5/12)X^cat + (1/3)X^conA + (1/3)X^conB.
The intercept of −4.68 was chosen to give P (D = 1) = 0.05, i.e. a 5% population prevalence of disease.
Having generated data on this population of 15000 individuals, 500 individuals with D = 1 were randomly drawn from it. These are the 500 cases. One individual with D = 0 was then matched with each case on S^cat and S^con. These are the 500 matched controls. Finally, missingness was randomly imposed on this matched case-control sample using the same MCAR, MAR-A and MAR-B mechanisms with 10% or 25% missingness that were used in the main simulation study.
This entire process (i.e. simulation of population and sampling of cases and controls) was repeated 1000 times to generate 1000 matched case-control study datasets.
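The population-then-matched-sampling procedure above can be sketched as follows. This is illustrative Python, not the authors' code: for brevity e_A and e_B are drawn independently (rather than with the correlation chosen in the paper), and, since the paper does not state how the continuous matching variable S^con was matched, matching within a caliper is an assumption of this sketch.

```python
import math
import random

def simulate_population(n, rng):
    """Generate one population as in sensitivity analysis 2 (with
    independent log-normal errors as a simplification)."""
    pop = []
    for _ in range(n):
        scat = 1 if rng.random() < 0.5 else 0
        scon = rng.gauss(0, 1)
        p_xcat = 1 / (1 + math.exp(-(-2.5 + 0.5 * scat + 0.5 * scon)))
        xcat = 1 if rng.random() < p_xcat else 0
        base = 0.5 * xcat + 0.5 * scat + 0.5 * scon
        xconA = base + math.exp(rng.gauss(0, 1))   # log-normal error
        xconB = base + math.exp(rng.gauss(0, 1))
        lp = (-4.68 + 0.25 * scat + 0.25 * scon
              + (5 / 12) * xcat + xconA / 3 + xconB / 3)
        d = 1 if rng.random() < 1 / (1 + math.exp(-lp)) else 0
        pop.append({"scat": scat, "scon": scon, "xcat": xcat,
                    "xconA": xconA, "xconB": xconB, "d": d})
    return pop

def draw_matched_sample(pop, n_cases, caliper, rng):
    """Sample n_cases cases and match one unused control per case on
    scat (exactly) and scon (within the assumed caliper)."""
    cases = rng.sample([p for p in pop if p["d"] == 1], n_cases)
    controls = [p for p in pop if p["d"] == 0]
    used, pairs = set(), []
    for case in cases:
        for j, ctl in enumerate(controls):
            if (j not in used and ctl["scat"] == case["scat"]
                    and abs(ctl["scon"] - case["scon"]) <= caliper):
                used.add(j)
                pairs.append((case, ctl))
                break
    return pairs
```

Missingness would then be imposed on the resulting matched sample, as described above.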

Results
Web Tables 11-13 and 14-16 show the results for the first and second sensitivity analyses, respectively. It can be seen that the MI methods are reasonably robust to the imputation model misspecifications.

Web Appendix H: Application of MI methods in Stata and R
Here we provide the Stata or R code that we used to perform MI in the simulation study when there were M = 4 controls per case, for each of the methods outlined in the paper, along with a specimen simulated dataset. Text files of the Stata and R code given below and the files containing the simulated dataset are available as a web supplement at the Biometrics website. The dataset is provided as a Stata file (simulated_data.dta) and as a comma-separated-values file (simulated_data.csv).
In the dataset the three covariates X cat , X conA and X conB are denoted xcat, xconA and xconB, respectively. The categorical and continuous matching variables S cat and S con are denoted respectively as scat and scon, and the case or control status is denoted d. There are missing values in X conA and X conB , which were generated completely at random to give approximately 25% of the values missing in each variable.
We specify a seed at the start of each block of code, so that the results are reproducible. The results of applying the code to the simulated dataset provided are shown in Web Table 17.
The results from the complete case analysis are also shown; this analysis can be performed in, for example, Stata using clogit d xcat xconA xconB, group(set).
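For intuition, the conditional likelihood that clogit maximises can be written down directly. The Python sketch below (illustrative only, not the paper's code; the data values are hypothetical) computes the conditional log-likelihood for matched sets containing one case each.

```python
import math

def clogit_loglik(beta, matched_sets):
    """Conditional logistic regression log-likelihood.  matched_sets is a
    list of matched sets; each set is a list of (covariates, is_case)
    pairs with exactly one case.  A set contributes
    eta_case - log(sum_j exp(eta_j)), where eta = beta' x."""
    ll = 0.0
    for members in matched_sets:
        etas = [sum(b * x for b, x in zip(beta, xv)) for xv, _ in members]
        eta_case = next(e for e, (_, d) in zip(etas, members) if d)
        ll += eta_case - math.log(sum(math.exp(e) for e in etas))
    return ll

# one 1:1 matched set with a single covariate (hypothetical values)
sets = [[([1.0], True), ([0.0], False)]]
```

Maximising this function over beta gives the conditional logistic regression estimates; fitting software such as clogit does exactly this, with a Newton-type optimiser.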

FCS MI using matching variables
This analysis was performed in Stata using the ice command, as shown below. If the imputed values of all the dummy variables are less than 0.5, then the most common category is assigned; otherwise, the category with the highest imputed value is assigned. The categories of the categorical variable are first ordered so that the most frequently occurring category is the reference category. This was the approach used in the EPIC example.
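The category-assignment rule just described can be sketched in a few lines. This is illustrative Python (the paper's implementation uses Stata), and the category labels are hypothetical.

```python
def assign_category(imputed_dummies, most_common):
    """Rounding rule described above for a categorical variable imputed
    via its dummy indicators: if every imputed dummy value is below 0.5,
    assign the most common (reference) category; otherwise assign the
    category whose dummy has the largest imputed value.
    imputed_dummies maps each non-reference category to its imputed value."""
    if all(v < 0.5 for v in imputed_dummies.values()):
        return most_common
    return max(imputed_dummies, key=imputed_dummies.get)

# all dummies below 0.5: fall back to the most common category
assert assign_category({"b": 0.21, "c": 0.34}, most_common="a") == "a"
# otherwise: the category with the largest imputed dummy value wins
assert assign_category({"b": 0.62, "c": 0.81}, most_common="a") == "c"
```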

Latent normal MI using matching variables
This analysis was performed in R using the jomo1mix function in the 'jomo' package.

FCS MI using matched set
This analysis was performed in Stata using the ice command, as shown below. First, the dataset was transformed to 'wide' format using the reshape command, i.e. so that there is one row per matched set. Then, within each matched set, Σ_{k≠j} X^cat_k, Σ_{k≠j} X^conA_k and Σ_{k≠j} X^conB_k were calculated for j = 1, . . . , 5. These are denoted 'xcatsum1', . . . , 'xcatsum5', 'xconAsum1', . . . , 'xconAsum5', and 'xconBsum1', . . . , 'xconBsum5'.

Normal MI using matched set
This analysis was performed in R using the pan function in the 'pan' package. The 'pan', 'survival' and 'mice' libraries are required. The pan function requires a data frame containing the model covariates (y), a data frame containing the outcome and a vector of 1s (x), and a data frame containing the matched set variable (set). The model covariates without missing values could alternatively be included in x. It is necessary to set up an array that stores the imputed datasets (y.imp) and also to specify the random seeds used to perform the imputations (seeds.n.imp). Adaptive rounding is used post-imputation to assign values 0 or 1 for missing values in xcat (note that a different approach is required for non-binary categorical variables, as described in the section on Normal MI using matching variables). A loop is used to fit the conditional logistic regression to each imputed dataset and the stored results are combined using the pool.scalar function, which is in the 'mice' package. In the output from pool.scalar the pooled estimate is given by qbar and its variance is given by t.

data <- read.table("simulated_data.csv", sep=",", header=T)
set.seed(780423)
N <- 500        # number of matched sets
setsize <- 5    # number of individuals in a matched set
n.imp <- 50     # number of imputations
nvar <- 3       # number of covariates (xcat, xconA and xconB in this case)
set.seed(10)
seeds <- floor(runif(1000) * 1e6)
y <- as.matrix(data[, c("xcat", "xconA", "xconB")])
set <- data[, "set"]
x <- cbind(1, data[, "d"])
y.imp <- array(0, c(N*setsize, nvar, n.imp))  # set up array to store imputations

Table 3: Biases ('bias'), ratios of empirical SEs ('ratio empSE'), ratios of mean estimated SEs ('ratio SE'), and relative efficiencies (%, 'rel. eff.') of six MI methods when N = 500 and p_miss = 0.25. Ratios and relative efficiencies are calculated relative to the corresponding complete-data estimators.
Each reported ratio or relative efficiency is the average over three ratios or relative efficiencies: one from each of the MCAR, MAR-A and MAR-B scenarios. Reported biases are the signed average absolute bias over these three scenarios.

Table 4: Biases ('bias'), ratios of empirical SEs ('ratio empSE'), ratios of mean estimated SEs ('ratio SE'), and relative efficiencies (%, 'rel. eff.') of six MI methods when N = 100 and p_miss = 0.25. Ratios and relative efficiencies are calculated relative to the corresponding complete-data estimators. Each reported ratio or relative efficiency is the average over three ratios or relative efficiencies: one from each of the MCAR, MAR-A and MAR-B scenarios. Reported biases are the signed average absolute bias over these three scenarios.

Table 5: Association between fibre intake and colorectal cancer estimated from EPIC-Norfolk. Log odds ratio is for a six-gram per day increase in fibre intake, conditional on smoking, education, social class, physical activity, height, weight, age, alcohol intake, folate intake, intake of energy from fat and non-fat sources, aspirin use and the matching variables. Missing data are handled by restriction to complete cases or by MI.