Multiple imputation for an incomplete covariate that is a ratio

We are concerned with multiple imputation of the ratio of two variables, which is to be used as a covariate in a regression analysis. If the numerator and denominator are not missing simultaneously, it seems sensible to make use of the observed variable in the imputation model. One such strategy is to impute missing values for the numerator and denominator, or the log-transformed numerator and denominator, and then calculate the ratio of interest; we call this ‘passive’ imputation. Alternatively, missing ratio values might be imputed directly, with or without the numerator and/or the denominator in the imputation model; we call this ‘active’ imputation. In two motivating datasets, one involving body mass index as a covariate and the other involving the ratio of total to high-density lipoprotein cholesterol, we assess the sensitivity of results to the choice of imputation model and, as an alternative, explore fully Bayesian joint models for the outcome and incomplete ratio. Fully Bayesian approaches using Winbugs were unusable in both datasets because of computational problems. In our first dataset, multiple imputation results are similar regardless of the imputation model; in the second, results are sensitive to the choice of imputation model. Sensitivity depends strongly on the coefficient of variation of the ratio's denominator. A simulation study demonstrates that passive imputation without transformation is risky because it can lead to downward bias when the coefficient of variation of the ratio's denominator is larger than about 0.1. Active imputation or passive imputation after log-transformation is preferable. © 2013 The Authors. Statistics in Medicine published by John Wiley & Sons, Ltd.


Introduction
Missing values of covariates are a common problem in regression analyses. Missing data are classified as being missing completely at random (MCAR) if missingness does not depend on observed or unobserved data, missing at random (MAR) if missingness does not depend on unobserved data given observed data, or missing not at random if missingness depends on missing data even given the observed data [1]. Amongst methods that attempt to deal with missing data, rather than discarding them, multiple imputation (MI) can provide valid inference under MAR and has become popular in practice since its inception over 30 years ago [2].
Briefly, MI works as follows. Missing values are replaced with imputed values, drawn from their posterior predictive distribution under a model given the observed data. We term this model the imputation model. The process is repeated M > 1 times, giving M imputed datasets with no missing values. Each imputed dataset is analysed using the model that would have been used had the missing values been observed. We call this model the analysis model. The M estimates of each parameter of interest are then combined using 'Rubin's rules' [3]. When the imputation model is correctly specified, Rubin's rules can provide standard errors and confidence intervals that fully incorporate uncertainty due to missing data.
For both 1 and 2, the ratio is semi-missing rather than fully missing; that is, one of the two components is observed. Ratio missingness due to more than one of these reasons for different observations in the same dataset means it is not obvious how best to impute the ratio. A mixture of reasons 1 and 2 is particularly awkward.
One reasonable question at this stage is, 'Why use a ratio covariate?' There are mathematical arguments against their use [8]. Senn and Julious claim that ratios are always poor candidates for parametric analysis unless the components, and therefore the ratio, follow a lognormal distribution or the ratio's coefficient of variation (CV) is small [9]. We make three points. First, applied researchers do use ratios, and we are unlikely to persuade them to stop, especially because the use of certain ratios is well established; we should be pragmatic and try to guide practitioners on how to analyse datasets involving incomplete ratio covariates. Second, arguments against ratios assume that a ratio is not the correct functional form for a covariate, but it may be. Third, ratios are not used by accident: A ratio may be of genuine substantive interest when its separate components are not. For example, BMI is widely used because it measures weight-for-height and as such is regarded as a proxy measure of body fat. Substantive interest is in the influence of body fat on outcome, not weight or height. Weight alone may be considered a measure of body fat, but BMI is measured with less error because it aims to remove the effect of height (although it may not do so completely or accurately). It is our opinion that when researchers propose a relationship they believe, such as the influence of a ratio on outcome, this should not be cast aside lightly. The substantive question should not be altered for statistical convenience unless we have little choice.
We assume that the aims of analysis are unbiased estimation of a parameter describing the association between a ratio and some outcome, confidence intervals with the ascribed coverage and fully efficient parameter estimation. There may be other covariates in the analysis model, and primary interest may be in one of these, but the properties of the ratio parameter estimator are important nonetheless. There has been no previous methodological work on MI for a ratio covariate, although White et al. [10] and Bartlett et al. [11] allude to the issue, but practitioners are imputing ratio covariates nonetheless [12]. We aim to highlight issues with imputing an incomplete ratio covariate and to identify imputation strategies that are practicable for applied statisticians.
Despite the positive features listed previously, MI is neither the only approach to dealing with missing covariates, nor necessarily the best approach for any given analysis. Joint models for the outcome and covariates may be superior because they make use of the full likelihood in a coherent way. In this paper, we also investigate results for fully Bayesian joint models.
The remainder of this paper is as follows. In Section 2, we introduce and describe our two motivating datasets; in Section 3, we consider candidate models for imputing incomplete ratios. Section 4 presents two case studies, contrasting the different imputation models (for the datasets introduced in Section 2). Section 5 presents a simulation study in a simpler setting than our case studies; and Section 6 is a discussion.

Datasets: Aurum and EPIC-Norfolk
For both of our datasets, regression analyses involving a ratio as a covariate have previously been published [4,7]. The analysis models used in our example analyses are not the same as the original articles because of the following: (i) we want to keep the analysis models and imputation models relatively simple, and (ii) we do not wish to make any substantive claims about these data. Therefore, we have chosen to use analysis models resembling but not matching those used in the earlier publications [4,7].
For both datasets, the analysis model is the Cox model, where h 0 .t / is the nonparametric baseline hazard function at time t , h i .t j x i / is the hazard for the ith individual and x ci is the value of the cth covariate in the ith individual. Survival (or censoring) times are assumed to be fully observed.

The Aurum cohort
The Aurum dataset comes from a South African cohort study of 1350 HIV-infected participants starting antiretroviral therapy. Participants were recruited from 27 centres in five provinces between February 2005 and June 2006 and followed to March 2007. Information was recorded on a range of baseline characteristics, and participants were followed up for death. The aim of the work by Russell et al. [4] was to estimate the influence of haemoglobin on mortality using a Cox model. Of the participants, 1348 had a recorded time of death/censoring, with 185 deaths occurring within the follow-up time. We restrict our analysis to these 1348 individuals. The analysis model is (1) with p D 6, where x 1 ; : : : ; x 6 are age in years, sex, haemoglobin in g/mL, viral load in copies per mL, CD4 count in cells per L and BMI. Table I provides a summary of these  covariates and of weight and height. We give transformations of the covariates used in the analysis model, and summarise the transformed measure in the final column. Note that 381 (28%) patients are missing a weight and/or height measurement, but only five of these have height missing when weight is observed. Five of the covariates are continuous, and one (sex, which is complete) is categorical. Haemoglobin, weight, height 2 and BMI appear to be approximately normal on the transformed scale, while (log) viral load and (square root of) CD4 count do not. We focus on the estimation ofˇ3 andˇ6, the log hazard ratios for haemoglobin and BMI, respectively, (haemoglobin was the focus of the original publication [4]).

The EPIC-Norfolk cohort
The European Prospective Investigation Into Cancer and Nutrition (EPIC)-Norfolk study is a large cohort study designed to investigate the link between dietary factors and cancer. Dietary and nondietary factors were collected at baseline, and participants were followed up for cancer and non-cancer outcomes. We use some of the non-dietary characteristics as covariates and time to death as the outcome. The analysis model is (1) with p D 6, where x 1 ; : : : ; x 6 are age, sex, smoking status, systolic blood pressure, diastolic blood pressure and cholesterol ratio. We summarise these six covariates and total cholesterol and HDL in Table II; none are transformed. In total, 2155 (9%) participants are missing a total cholesterol and/or HDL measurement. Total cholesterol is always missing when HDL is missing. Incomplete covariates are all continuous and appear approximately normal, except for HDL, which is positively skewed. We focus on the estimation ofˇ6, the log hazard ratio for cholesterol ratio.

Model for analysis
The analysis model is the Cox model (1) with p covariates .x 1 ; : : : ; x p / made up of the ratio x p D a 1 =a 2 and p 1 other covariates .x 1 ; : : : ; x p 1 /, which we denote .z; w/ where z are incomplete and w are complete (in both example datasets, we have z and w).

Models for missing data
We list candidate models for the covariates in Table III (note the Label column, which we henceforth use to refer to models). For MI, the outcome must be explicitly included as a covariate in the imputation model [13]. In Table III, we denote outcome by f .y i /. For the Cox model, f .y i / involves a censoring indicator and the Nelson-Aalen estimate of the cumulative hazard function to the survival time (an approximation to the cumulative baseline hazard function H 0 .t / [14]), included as separate covariates in the imputation model. When the analysis model is linear or logistic regression, f .y i / D y i .

Compatibility in relation to active and passive imputation
Multiple imputation can provide an approximation to fitting a joint model if the models for imputation and analysis are compatible [15], where a joint model may be either maximum likelihood or Bayesian (if the joint model is Bayesian, compatibility also requires that priors are non-zero over the entire parameter is required. Passive imputation of x pi D expOEln.a 1i / ln.a 2i / is required. space). Considering whether or not the models M1-M6 are compatible with the analysis model helps us to formulate hypotheses and understand future results.
By 'compatible', we mean that a joint model exists that implies both the imputation model and the analysis model as conditional models. This does not mean that the joint model is correct, but that the analysis model and imputation model are both implied by it, and so the MI procedure is coherent. Appendix A describes how to tell if models are compatible and works through two examples of imputation models where one is compatible and the other is not (A.1 and A.2, respectively).
Compatibility is related the concept of 'congeniality', and the term congeniality is often used to mean compatibility [10,16,17]. Congeniality requires that the joint model from which the imputation and analysis models can be derived is Bayesian. Further, the researcher's incomplete and complete data procedures must be specified, and the inferences must be asymptotically equivalent to a Bayesian model. We refer interested readers to Meng [18].
Non-compatibility of models is not always problematic; Meng [18] and Rubin [19] have both shown that there can be some benefit to using imputation models that correctly draw on information not used by the analysis model. Collins, Schafer and Kam demonstrate via simulation that auxiliary variables (i.e. variables that are in the imputation model but not the analysis model) are unlikely to be harmful, and may be of benefit by making the MAR assumption more plausible, while 'restrictive' imputation strategies can lead to problems [20].
We therefore distinguish between two types of non-compatibility: If there is a special case of the imputation model that is compatible with the analysis model, as when it includes auxiliary variables, then the imputation model is termed 'semi-compatible' (following Liu et al. [21]); otherwise, the imputation model is simply termed 'incompatible'. In previous work, imputation models that are compatible or semi-compatible appear to perform well even when misspecified [22,23], but this is not necessarily true for imputation models that are incompatible [20,23]. We hypothesise that imputation models that are compatible or semi-compatible will be more robust to modest degrees of misspecification than models that are incompatible.
Imputation of a ratio is performed either actively or passively. Of the imputation models listed in Table III, only M1 is compatible with the analysis model. Of the remaining models M2-M4, which use active imputation, are semi-compatible because they include a 1 and/or a 2 , which do not appear in the analysis model, as auxiliary variables in the imputation model; models M5 and M6, which use passive imputation, are incompatible with the analysis model because x p is present in the analysis model but not in the imputation model, while a 1 and a 2 are present in the imputation model but not in the analysis model. We expect models M5 and M6 to be prone to bias and poor coverage, despite making use of all the observed data when imputing the ratio.

Motivation for missing data models
The choice of a model listed in Table III might be motivated by the way it makes use of observed information in a 1 ; a 2 , which will depend on the pattern of missingness.
Model M1 may be a good approach when a 1 ; a 2 are missing simultaneously. If a 1 is only missing when a 2 is missing, M2 may be used because model M2 makes use of observed a 1 values when imputing the ratio, and there is no information in a 2 about missing values of a 1 that might be used to improve imputation of x p . (Conversely, if a 2 is only missing when a 1 is missing, M3 may be attractive.) Note that M2 and M3 do not respect the deterministic relationship x p D a 1 =a 2 .
Model M4 makes use of information on a 1 ; a 2 by imputing both alongside x p ; this may be motivated by having a 1 , a 2 or both missing. This is similar to the approach advocated by von Hippel [22], which has been termed just another variable [10,23]. As with M2 and M3, the model ignores the deterministic relationship x p D a 1 =a 2 and assumes multivariate normality. This will appear a bizarre assumption; it is clearly wrong because the distributions of two of these variables must define the distribution of the third, yet software does not know this and will sample without complaint. If the assumption made by M4 is uncomfortable, we may be attracted to M5 or M6.
Model M5 is incompatible with the analysis model (Appendix A.1) and requires x p to be imputed passively from imputed values of a 1 =a 2 . The components a 1 ; a 2 are not auxiliary but completely determine the values of x p . The ratio of a 1 and a 2 , which are both normal, is expected to be heavy tailed.
M6 alters the problem by transforming x p into a linear function of its logged components and passively imputing it. Model M6 guarantees that imputed values of a 1 ; a 2 are positive, as with all observed ratios. While this may be desirable, it is important to remember that our primary goal is valid inference, and we are not trying to recreate the missing values [19,24]. The cosmetics of this model should therefore be a secondary consideration.
We have omitted from Table III the imputation model .z i ; ln.x pi / j f .y i // MVN. We do not consider this because ln.x p / D ln.a 1 / ln.a 2 / where ln.a 1 / and ln.a 2 / are normal, and the sum of two normal distributions is normal. Model M6 is therefore equivalent to imputing ln.x p / but makes more use of the observed data when components are not simultaneously missing. The only setting where modelling ln.x p / alone is appropriate is if (a 1 ; a 2 ) are always either both observed or both missing. In this case, the model would then be equivalent to M6.
To summarise our discussion of the models in Table III, there are conceptual problems with each one: Model M1 is compatible with the analysis model but does not use information on observed a 1 or a 2 when the other component is missing; M2-M4 are likely to be misspecified; and M5 and M6, the two models that make use of all the observed information on a 1 and a 2 and respect the relationship x p D a 1 =a 2 , are incompatible with the analysis model.

Software and details of imputation
We used Stata 12's mi suite for MI in our case studies and simulations in Section 5 [25,26]. We performed multiple imputation using mi impute mvn, and implemented Rubin's rules using mi estimate.
Advice on the number of imputations typically suggests that a small number (fewer than 10) is sufficient [16]. This idea comes from comparing the length of confidence intervals based on M imputations to intervals based on 1 imputations. Our view on choosing the number of imputations, described in White et al. [10], is slightly different, being based on the reproducibility of analyses. To achieve negligible Monte Carlo error from our MI analyses, we use M D 300 imputations for the Aurum case study and M D 100 for EPIC-Norfolk. Note that we are not advocating such large values of M in general.
Our imputation models, all of which are based on a multivariate normal model, used a burn-in of 1000 iterations of the MCMC chain. Thereafter, we stored imputed datasets at every 10th iteration of the chain.

Case studies
This section presents the results for MI. However, in analyses with missing data, Bayesian models are widely regarded as a sensible alternative if there is reason to be suspicious of MI results. We outline and present Bayesian analyses of the Aurum and Epic datasets, corresponding to the MI approaches presented in this section, in Appendix B.

Imputing body mass index in the Aurum cohort
The MI procedures took between 2 min, 7 s (M1) and 2 min, 44 s (M6) to impute 300 times, fit the analysis model in each imputed dataset and use Rubin's rules to combine estimates. Figure 1 shows estimates resulting from different imputation models. There is very little difference in the point estimates or width of confidence intervals; all returned essentially the same result. The number of imputations meant Monte Carlo error was negligible, at a maximum reaching 1/50th of the estimated standard error. The relative efficiency versus infinite M was > 0:999 for all models. For both haemoglobin and BMI, the MI estimates gave a slight change in the point estimate and a small reduction in the width of confidence intervals as compared to complete cases.

Imputing cholesterol ratio in the EPIC-Norfolk cohort
For MI of the EPIC-Norfolk data, we used M D 100. We used a smaller number of imputations than in Aurum because only 9% of individuals were missing cholesterol ratio. MI took between 19 min, 2 s (M1) and 21 min, 0 s (M5) to impute 100 times, analyse each imputed dataset and combine estimates using Rubin's rules. The relative efficiency versus infinite M was > 0:999 for all models except M5, where relative efficiency was 0.991.
There was consistency between estimates from models that impute cholesterol ratio directly ( Figure 2). Monte Carlo error for point estimates was negligible (around 0.0005, less than 1/50th of the standard error) for all models except M5 where it was 0.003. MI models are less consistent than in the Aurum MI analyses but would in five of six cases give similar substantive conclusions. These  estimates are also very similar to complete-cases analysis and, interestingly, the imputation model that passively imputes cholesterol ratio through log-total cholesterol and log-HDL. However, the estimate after the standard passive imputation approach (M5) is much closer to the null, with wider confidence intervals. Figure 3 demonstrates the problem with model M5 in the EPIC-Norfolk data, plotting imputed values of cholesterol ratio from a single, typical, imputed dataset under models M1-M6 alongside 2155 randomly selected observed values. The largest observed value of cholesterol ratio was 15.7. Note that for model M5, some imputed values were very large or very small; plotting these extreme values distorted the y-axis, and so we have censored the y-axis below 3 and above +20, ranking and listing the values of imputed HDL and cholesterol ratio values outside of this range.
The problem with M5 arises because the mean and SD of HDL are 1.42 and 0.42, respectively, meaning its coefficient of variation (CV) is 0.30, resulting in a danger of a 2 being imputed close to zero or even negatively. This CV is far larger than in the Aurum data, where CV(height 2 ) D 0.11 and imputed values are never close to zero (data not shown). Figure 3 also highlights the difference between the other imputation models. Imputation on the log scale (M6) is the only model to guarantee that a 1 ; a 2 and x p are positive. Further, the imputed values closely resemble the observed. M1-M4 can and did impute some x p < 0; these models all assume x p N, and so the distribution of imputed values is symmetrical about its mean. By looking at Figure 3, model M6 appears to be appealing, while from a statistical inference perspective (Figure 2), there appears to be little to choose between M6 and M1-M4. From all perspectives, M5 is a poor choice.

Predictive mean matching.
A natural question about model M5 that arises from Figure 3 is whether removing the high-leverage points could reduce the bias. For example, a truncated normal imputation model could be used to invoke the constraint x p > 0, which would remove the negative outliers of model M5. A better alternative, which can also remove the positive outliers, is predictive mean matching (PMM) [10,27,28]. Briefly, the imputation model is fitted, and for each individual with a missing value, the k individuals ('donors') with observed values with the closest predicted mean are identified. One of these is selected at random and their value 'donated' as the imputed value. This ensures that imputed values are within the range of observed values.
To improve model M5, PMM is most easily implemented in a chained equations procedure [10]. Imputation of a 1 and a 2 uses PMM, and x p is passively imputed. The largest possible imputed value of x p is then the ratio of the largest observed value of a 1 to the smallest observed value of a 2 (and vice versa for the smallest imputed value of x p ).
We used this imputation model on the EPIC-Norfolk data, using k D 10 and storing imputed values after 10 cycles of chained equations. This reduced the bias of model M5, giving an estimated log-hazard ratio of 0.119 (95% CI 0.079-0.159). See Appendix C for the full results.

Design
We performed a simulation study designed to investigate models M1-M6 in a simpler setting than the two case studies. With x p as the only covariate and a continuous outcome y, we investigated the performance of the imputation models and how this varied with the strength of x p -y association and the CV of the ratio's denominator, CV.a 2 /. This affects the distribution of x p , and we hypothesise that when CV.a 2 / is large, model M5 will be biased. An imputed value of a 2i may be very small, meaning the corresponding value of x pi will be large, and possibly outside the range of observed x p . The x pi will thus have high leverage. For such values, there are unlikely to be appropriately large or small y to preserve the true x p -y relationship, which leads us to expect bias towards no association.
Scenarios investigated include two values of CV.a 2 /: 0.1, taken from height 2 in the Aurum data, and 0.3, taken from HDL in the EPIC-Norfolk data; we vary these factorially with R 2 values of 0.1 and 0.3. We performed all simulations using Stata 12 [25]. Our simulation procedures were as follows: (1) Simulate n D 500 complete values of ln.a 1 /; ln.a 2 / to follow a bivariate normal distribution.
In our first scenario, the mean, standard deviation and correlation are taken from ln(weight) and ln(height 2 ) in the Aurum data: ln.a 1 / has mean 4 and SD 0.21, ln.a 2 / has mean 0.97 and SD 0.11, and Corr.ln.a 1 /; ln.a 2 // D 0:22. This gives CV(a 2 ) D 0.1. (2) Generate complete x p D exp.ln.a 1 / ln.a 2 //, meaning that x p follows a lognormal distribution.
For the ratios and components in our two example datasets, the lognormal distribution seems to be a suitable choice. (3) Simulate y N.ˇ0 Cˇ1x p ; 2 /. We used the same value ofˇ1 (arbitrarily 2) throughout to make bias comparable across all simulation settings. To vary the strength of association, we altered 2 to achieve the desired R 2 . (4) Simulate binary indicators of response, R 1 and R 2 , for a 1 and a 2 , respectively. Each R is generated independently from the model logitfP.R D 1/g D 0 C 1 y. Under MCAR, 1 D 0. Under MAR, 1 is chosen so that ROC analysis of y versus an indicator of response R produces a mean area under the curve of 0.65. This is to achieve the same degree of MAR across scenarios. We then alter 0 so that P.R 1 D 1/ D P.R 2 D 1/ D 0:75. Because 1 has the same sign for both R 1 and R 2 and both depend on y, the probability of a 1 ; a 2 being missing simultaneously is slightly larger under MAR than MCAR. This means that the overall proportion of observations missing x p is slightly smaller under MAR (42% missing x p ) than MCAR (44% missing x p ). (5) Set a 1i to missing if R 1i D 0, a 2i to missing if R 2i D 0 and x pi to missing if R 1i D 0 or R 2i D 0. (6) Impute x p five times using each of the models M1-M6 (Table III). (7) Fit the correct analysis model to each imputed dataset, and combine the results using Rubin's rules.
We used 5000 replicates of this process under each combination of simulation settings. Interest is inˇ1. We calculated bias, coverage of 95% confidence intervals and efficiency of Ǒ 1 (expressed by the empirical standard error, SD. Ǒ 1 / over all replications [29]) under models M1-M6, with analysis of complete data (i.e. before any data are set to missing) and complete cases (dropping observations with missing x p ) also provided for reference. Table IV summarises the results of our simulation study. Results of the complete data and complete cases analyses are both as expected. Complete data are always unbiased with 95% coverage and the smallest empirical standard error of all methods. Complete cases are unbiased under MCAR but biased under MAR. Coverage is correspondingly low, and efficiency is lower than complete data. M1 is mainly unbiased, but there is a small upward bias under MAR and R 2 D 0:3, and coverage is slightly low when data are MAR. This is perhaps because it assumes normality for x p when it is actually lognormal. M1 also tends to be inefficient compared to other imputation models, as would be expected, regardless of the missingness mechanism.

Results
With this general pattern of missingness, M3 is usually more biased than M2, although coverage tends to be similar (except where CV(a 2 ) D 0.3 and R 2 D 0:3). Efficiency of M2 and M3 seems to depend on CV(a 2 ) and R 2 . Model M4 has similar bias to M2 and M3; at worst, this reaches about 4% with both large CV(a 2 ) and R 2 . Empirical standard errors for M4 are at least as small as M2 and M3, while coverage tends to be good except when both CV(a 2 ) and R 2 are 0.3.
Model M5 performs well in the two scenarios when CV.a 2 / D 0:1. There is a small downward bias, but efficiency and coverage are both good compared with other methods. However, when CV(a 2 ) D 0.3, we observe unacceptable bias towards the null and lower efficiency than other methods, although coverage is still over 90%. When considered alongside bias, this coverage implies that while the empirical standard error is large, the estimated standard errors are even larger, reducing the effect of the large bias on coverage and implying low power. M6 is more biased than M5 when CV(a 2 ) D 0.1 but much less so when CV(a 2 ) D 0.3. Across all of our settings, it is more efficient than M1-M5 and with coverage close to 95%. If the small bias seems acceptable, then this is the best imputation model.

Discussion
We have presented the results of two case studies involving commonly used ratios and a simulation study based in part on these datasets. A key message is the caution against passive imputation of a 1 and a 2 without prior transformation. Superficially, the approach appears to make more use of the available data; however, it is often inefficient and can suffer from large bias. Our analysis of the EPIC-Norfolk data demonstrated this problem in practice. However, in our Aurum case study, the use of passive imputation appeared to make little difference to the substantive results compared to active imputation. Our simulation study confirmed that problems arise when CV(a 2 ) is large. Note that a ratio with very small CV(a 2 ) is unlikely to be used in applied work (unless CV(a 1 ) is also very small) because as CV.a 2 / ! 0, x p becomes a function of a 1 divided by a constant. We therefore recommend that incomplete ratios be imputed actively or passively after log transformation as in model M6.
In considering models for missing data, joint models for the covariates and outcome are attractive because they use the full data likelihood in a coherent way. In our two case studies, we attempted to fit fully Bayesian joint models and summarise posterior distributions for parameters of interest. Computational problems prevented this approach from being useful. In one dataset, some of the models did not appear to converge to any true posterior distribution (or if they did, results were extraordinarily sensitive to the choice of model for the ratio). In the other dataset, it was not possible to load the observed data into WinBUGS, and so the attempt was abandoned.
Compatibility is a useful concept for considering whether various imputation models are sensible. We hypothesised that models M1 and M2-M4 would perform well because of being compatible and semicompatible respectively, while models M5 and M6 would perform poorly because of being incompatible. In our simulations, M1-M4 did tend to perform well despite being misspecified, and model M5 did often perform poorly. In our EPIC-Norfolk example, where model M5 gave nonsense results, problems could be identified by inspecting the imputed values of x p .
Model M6 was surprisingly as good as any other model considered throughout. Despite being more robust than M5, we know it is not completely 'safe'. In our simulation study, the imputation model assumed .log.a 1 /; log.a 2 / j y/ N, and because log.x p / D log.a 1 / log.a 2 /, this implies .log.x p / j y/ N. The imputation model therefore has mean function log.x p / D˛0 C˛1y, while the analysis model has mean function y Dˇ0 Cˇ1x p . In further simulations, we noted that M6 was still robust when R 2 D 0:5 and CV(a 2 ) D 0.3 (results not shown). We can provide no guarantee for greater values other than that this model will eventually fall apart. However, it is our experience that associations stronger than R 2 D 0:5 are rare in medical applications.
Some of the issues with model M5 could have been alleviated by using partly parametric imputation techniques such as PMM [30] or local residual draws [28]. In practice, this requires a switch to the chained equations approach rather than a multivariate imputation model. Because a parametric model is used only to identify suitable donors, this makes it impossible to think about compatibility. We investigated PMM in the problematic EPIC-Norfolk dataset and found model M5 much improved. PMM may therefore be a useful adjunct to a suitably chosen imputation model.
In evaluating methods, we have focused on bias, coverage and efficiency. For those interested in accurate prediction, efficiency may be more important and coverage less so or even unimportant [31]. It is worth noting that precision is also lower for model M5. Therefore, if passive imputation is to be used for a ratio in prediction settings, it should be performed on the log scale.
We have considered the imputation of ratio covariates. Some similar issues arise when the analysis model contains any nonlinear function, for example, interactions and squares. The difference is that in both cases, the main effects and their interaction, or the variable and its square, are included in the analysis model. In the case of squares, a measurement and its square will also be observed or missing simultaneously. Imputation is then complicated by the fact that the analysis model contains both the untransformed variable and a nonlinear function as covariates, rather than just the nonlinear function, as in the case of ratios. This makes issues around compatibility somewhat more complicated. See von Hippel [22], Seaman et al. [23] and Bartlett et al. [11] for recent work on imputation of squares and interactions. Bartlett et al. proposed the use of rejection sampling when producing imputations and showed it to be useful for imputing squares and interactions; this may therefore be a good approach for imputing ratios. By explicitly involving the analysis model in the specification of the imputation model, each imputation model used in the chained equations is compatible with the imputation model [11]. However, the method is more time intensive than any imputation models investigated here, and it is yet to become available in standard software packages. It also sacrifices one of the advantages of MI: separation of missing data issues from substantive analyses. However, this may be necessary and has already been partly conceded when we tailor imputation models to be compatible with the analysis model.

Theorem 1
Given two conditional densities f .x j y/ and g.y j x/, a joint density exists if and only if f.x; y/ W f .x j y/ > 0g D f.x; y/ W g.y j x/ > 0g, and there exist functions u.x/ and v.y/ such that, first, and, second, u.x/ is integrable.
Here, u.x/ is a marginal density for x, and v.y/ is a marginal density for y. Later, we posit an analysis model and check compatibility against two different imputation models using (2).
We distinguish between two kinds of non-compatibility: Semi-compatibility: There is a special case of the imputation model that is compatible with the analysis model. Incompatibility: There is no case of the imputation model that is compatible with the analysis model. That is, if setting certain parameters of the imputation model to 0 yields a compatible model, the imputation model is drawing on more information than the analysis model and is richer rather than the same, hence semi-compatible. If parameters of the imputation model cannot be set to 0 to identify a compatible model, the imputation model is using different information to or less information than the analysis model. Previous work has shown that incompatibility can be harmless or beneficial [18][19][20]. When the analysis model is correctly specified, these are examples of using semi-compatible imputation models, while incompatible imputation models are always harmful when the analysis model is correctly specified.
Appendices A.1 and A.2 work through two simple examples. For both, the analysis model involves only the ratio as a covariate. Appendix A.1 uses model M5 and is shown to be incompatible; A.2 uses model M1 and is shown to be compatible.
Instead of dividing the densities, we subtract the log-densities. For clarity, we omit the intercept terms 0 andˇ0 from the imputation model and the analysis model, respectively, assuming both equal zero. Note that because neither parameter involves a 1 ; a 2 nor y, this does not impact on compatibility.

A.1. Imputation model incompatible with the analysis model
Suppose the proposed analysis model is a linear regression of y on the ratio a 1 =a 2 . The log-density for this is The proposed imputation model is a bivariate normal model, which has log-density The imputation model (4) is of the form b.a 1 ; a 2 /Cc.y/Cd 1 .a 1 y/Cd 2 .a 2 y/ and the analysis model . Subtracting one from the other, we cannot express the result as u.a 1 ; a 2 / v.y/, indicating that they are incompatible.

A.2. Imputation model compatible with the analysis model
The proposed analysis model is as in (A.1), and so the log-density is given by (3). However, the imputation model involves a linear regression of a 1 a 2 on y. The log-density is Subtracting (5) from (3), we obtain ln a By setting˛= 2 a Dˇ= 2 y , we can express (6) without any terms involving both .a 1 ; a 2 / and y, indicating that for any choice of (˛; 2 a ), there are values of (ˇ; 2 y ) for which the proposed imputation model is compatible with the analysis model.

Appendix B. Bayesian models for an incomplete ratio
It is conceptually natural to model missing covariates using Bayesian methods. The problem discussed in Section 3.3, that the imputation model and the analysis model may not correspond to any joint model, does not exist for Bayesian models, where the model for missing data and the analysis model are joint. The compatibility between the missing data model and the analysis model is thus assured.
The practical disadvantage of fully Bayesian models for an incomplete ratio and/or its components is computation. Bayesian models are also in general more computationally demanding than MI. Further, the imputation models described previously could be implemented fairly automatically using a choice of software, while the Bayesian models require knowledge of WinBUGS [33] and/or the ability to code the models manually in another package.
Here, we explore whether Bayesian models, by working with the full joint likelihood, will provide more coherent results than MI. In our example datasets, we aim to obtain posterior means and credible intervals under various models.

B.1. Models, software and priors
A Bayesian model combines model (1) with a model for the incomplete covariates given the complete covariates. We list candidate Bayesian models for the covariates in Table B.1 (again, note the Label column, where the number corresponds to the imputation model with equivalent motivation). Section 3.5 and Appendix B.2 give details of how the Cox model is fit. In contrast to MI, no explicit conditioning on the outcome is required for Bayesian models.
Note that, except for the lack of issues around compatibility, the critique of the imputation models given in section 3.4 with equivalent labels applies equally to the Bayesian models given in Table B.1. That is, models B1-B3 may ignore some of the observed data, while B2-B4 are likely to be misspecified to some degree.
To fit Bayesian joint models in our case studies, we used WinBUGS 1.4.3 [33]. Because we are dealing with the Cox model, we used the method outlined in the WinBUGS manuals (Leuk: survival analysis using Cox regression in Examples Volume I) to specify the models [34].
We used vague prior distributions for all parameters (see B.2 for details).

B.2. Details on Bayesian analyses
Below, we give WinBUGS code used to demonstrate the setup of the fully Bayesian Cox model where x p is modelled and a 1 ; a 2 are ignored (this is the model denoted B1 in Table III). Models B2-B6 differ only in that they simply specify the models for BMI, weight and height 2 differently. The data file is made up of the covariates age sex hb logvl sqcd4 bmi, a vector of length N indicating death fail, a vector of length N of survival times for all individuals obst, and a vector of length T of distinct failure times t. Note that the data must be sorted in ascending order of obst before being passed to WinBUGS. All covariates are centred at their mean.
The priors for regression coefficients are N.0; 100/. The prior for dL 0 , the baseline intensity, requires slightly more explanation. This is modelled as dL 0 .crft .j C1/ t .j / g; c/, that is, a gamma distribution with mean rft .j C1/ t .j / g and variance rft .j C1/ t .j / g=c. The expression ft .j C1/ t .j / g is the time increment between the j th and j C 1th failure times; in the Aurum data, the mean time increment was 8 days. Note that r is not invariant to the scale of t , although c is. We used c D 0:1 and r D 0:1. A change of time scale would require r to be altered to obtain an equivalent prior distribution.

B.3. Results
Fitting the Bayesian models in WinBUGS was troublesome.
For the Aurum data, all MCMC chains ran slowly, and some stalled persistently. The simplest models (for example B1) took 5-10 h to produce 5000 iterations of the MCMC sampler. Model B5 took 10 days to produce 1000 iterations and would only update under a very specific set of initial values. WinBUGS stalled repeatedly, and the need to set the model updating again inflated the run time. We present results for model B5 but do not claim the MCMC sampling converged to the true posterior distribution. Results for model B6 are absent because WinBUGS was unable to sample at all; the reason for this was unclear. WinBUGS ran a lot faster when fitting models that imputed missing values of x p actively, that is, B1-B4. Figure B.1 presents results for the Aurum data (contrasting with the results obtained via MI in Figure 1). Posterior distributions obtained from different fully Bayesian analyses give diverse results. For haemoglobin, posterior means for all models except B5 are slightly closer to 0 than any of the MI models, and the 95% credible intervals tend to be slightly shorter than the MI confidence intervals. This may in part be the effect of the prior for the hazard, as seen in the comparison of Bayesian and frequentist analysis of complete cases. Under model B5, the posterior distribution for the log hazard ratio had mean much closer to zero with smaller posterior variance than under other models.
For BMI, posterior means from B1-B5 are very variable. B1 and B2 largely agree with the MI and (Bayesian) complete cases estimates, although the intervals are longer than those obtained after MI. Posterior means from B3 and B4 are closer to 0 and have shorter credible intervals than MI models or the other Bayesian models. For B4, this perhaps reflects the incorrect assumption made about the joint distribution of x p ; a 1 ; a 2 (this is surprising because the issue does not appear to affect model M4). Model B5 shows an effect in the opposite direction to all other estimates. This was the model that was very difficult to run in WinBUGS. As noted previously, we do not claim B5 ever converged to the true posterior density.
For the EPIC-Norfolk data, it was not possible to compile any of the fully Bayesian models in WinBUGS, even for complete cases. We tried compiling the complete cases model for subsets of the data of gradually increasing size (starting with n D 1000); model compilation failed beyond n > 4000. The EPIC-Norfolk dataset is too large for WinBUGS, and so attempts to fit the fully Bayesian models were abandoned. This is a setting where a fully Bayesian analysis is impractical to any but the most dedicated.

Appendix C. Results for EPIC-Norfolk after imputation using predictive mean matching
As described in Section 4.2.1, we re-ran the imputation models for EPIC-Norfolk using PMM. Figure C.1 gives the full results analogous to those given in Figure 2. Note that, with the exception of model M5, there is less consistency between models than between the models that did not use PMM. Note also that the fraction of missing information is uniformly greater for the models that use PMM.