Abstract
 Abstract
 Introduction
 Datasets: Aurum and EPIC-Norfolk
 Methods and models
 Case studies
 Simulation study
 Discussion
 Appendix: Compatibility
 Appendix: Bayesian models for an incomplete ratio
 Appendix: Results for EPIC-Norfolk after imputation using predictive mean matching
 Acknowledgements
 References
We are concerned with multiple imputation of the ratio of two variables, which is to be used as a covariate in a regression analysis. If the numerator and denominator are not missing simultaneously, it seems sensible to make use of the observed variable in the imputation model. One such strategy is to impute missing values for the numerator and denominator, or the log-transformed numerator and denominator, and then calculate the ratio of interest; we call this ‘passive’ imputation. Alternatively, missing ratio values might be imputed directly, with or without the numerator and/or the denominator in the imputation model; we call this ‘active’ imputation. In two motivating datasets, one involving body mass index as a covariate and the other involving the ratio of total to high-density lipoprotein cholesterol, we assess the sensitivity of results to the choice of imputation model and, as an alternative, explore fully Bayesian joint models for the outcome and incomplete ratio. Fully Bayesian approaches using WinBUGS were unusable in both datasets because of computational problems. In our first dataset, multiple imputation results are similar regardless of the imputation model; in the second, results are sensitive to the choice of imputation model. Sensitivity depends strongly on the coefficient of variation of the ratio's denominator. A simulation study demonstrates that passive imputation without transformation is risky because it can lead to downward bias when the coefficient of variation of the ratio's denominator is larger than about 0.1. Active imputation or passive imputation after log transformation is preferable. © 2013 The Authors. Statistics in Medicine published by John Wiley & Sons, Ltd.
Introduction
Missing values of covariates are a common problem in regression analyses. Missing data are classified as being missing completely at random (MCAR) if missingness does not depend on observed or unobserved data, missing at random (MAR) if missingness does not depend on unobserved data given observed data, or missing not at random if missingness depends on missing data even given the observed data [1]. Amongst methods that attempt to deal with missing data, rather than discarding them, multiple imputation (MI) can provide valid inference under MAR and has become popular in practice since its inception over 30 years ago [2].
Briefly, MI works as follows. Missing values are replaced with imputed values, drawn from their posterior predictive distribution under a model given the observed data. We term this model the imputation model. The process is repeated M > 1 times, giving M imputed datasets with no missing values. Each imputed dataset is analysed using the model that would have been used had the missing values been observed. We call this model the analysis model. The M estimates of each parameter of interest are then combined using ‘Rubin's rules’ [3]. When the imputation model is correctly specified, Rubin's rules can provide standard errors and confidence intervals that fully incorporate uncertainty due to missing data.
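The pooling step can be sketched in a few lines of Python. This is a minimal illustration of Rubin's rules for a single scalar parameter, not the software used in this paper; the function name `pool_rubin` and the example estimates are hypothetical, and the degrees-of-freedom calculation needed for confidence intervals is omitted.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool M estimates and their squared standard errors with Rubin's rules."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need M > 1 estimates, each with a variance")
    q_bar = mean(estimates)        # pooled point estimate
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # between-imputation variance (divisor M - 1)
    total = within + (1 + 1 / m) * between
    return q_bar, math.sqrt(total)


# Hypothetical log hazard ratios from M = 5 imputed datasets (illustrative only).
estimates = [0.10, 0.12, 0.08, 0.11, 0.09]
variances = [0.0004] * 5
pooled_estimate, pooled_se = pool_rubin(estimates, variances)
print(f"pooled estimate {pooled_estimate:.3f}, standard error {pooled_se:.4f}")
```

The between-imputation term is what carries the extra uncertainty due to missing data into the pooled standard error; with identical estimates across imputations it vanishes and the pooled variance reduces to the within-imputation average.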
MI is an attractive tool for analyses with missing data: The nuisance issue of modelling missing data is neatly separated from the analyses of substantive interest; the imputation model can make use of auxiliary variables that it would be undesirable to include as covariates in the analysis model (such as post-baseline measurements in a randomised controlled trial); the same M imputed datasets can be used for a variety of substantive analyses; and the imputation model can be tailored to reflect possible departures from MAR, which is helpful for sensitivity analysis.
Ratios are commonly used as covariates in regression analyses; examples are body mass index (BMI = Weight in kg ÷ (Height in m)^{2}) [4], waist–hip ratio [5], urinary albumin-to-creatinine ratio (Albumin concentration in mg/g ÷ Creatinine concentration in mg/g) [6], and what we refer to as ‘cholesterol ratio’ (Total cholesterol in mg/dL ÷ HDL in mg/dL) [7].
An individual's ratio measurement may be missing for one of three reasons:
1. The denominator is missing.
2. The numerator is missing.
3. Both components are missing.
For reasons 1 and 2, the ratio is semi-missing rather than fully missing; that is, one of the two components is observed. When the ratio is missing for more than one of these reasons across observations in the same dataset, it is not obvious how best to impute it. A mixture of reasons 1 and 2 is particularly awkward.
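Before choosing an imputation strategy, it can help to tabulate how often each of these reasons occurs. The sketch below is illustrative only: the function name `ratio_missing_reason`, the use of `None` to mark a missing component and the example records are all assumptions, not part of the datasets analysed here.

```python
from collections import Counter


def ratio_missing_reason(numerator, denominator):
    """Classify why a ratio is missing; None marks a missing component."""
    if numerator is None and denominator is None:
        return "both components missing"   # fully missing
    if denominator is None:
        return "denominator missing"       # semi-missing
    if numerator is None:
        return "numerator missing"         # semi-missing
    return "ratio observed"


# Hypothetical (numerator, denominator) pairs, e.g. weight in kg and height^2 in m^2.
records = [(58.0, 2.7), (61.5, None), (None, 2.9), (None, None), (70.2, 2.6)]
pattern_counts = Counter(ratio_missing_reason(a1, a2) for a1, a2 in records)
for reason, count in pattern_counts.items():
    print(f"{reason}: {count}")
```

A table of this kind shows at a glance whether the awkward mixture of semi-missing patterns is present in a given dataset.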
One reasonable question at this stage is, ‘Why use a ratio covariate?’ There are mathematical arguments against their use [8]. Senn and Julious claim that ratios are always poor candidates for parametric analysis unless the components, and therefore the ratio, follow a lognormal distribution or the ratio's coefficient of variation (CV) is small [9]. We make three points. First, applied researchers do use ratios, and we are unlikely to persuade them to stop, especially because the use of certain ratios is well established; we should be pragmatic and try to guide practitioners on how to analyse datasets involving incomplete ratio covariates. Second, arguments against ratios assume that a ratio is not the correct functional form for a covariate, but it may be. Third, ratios are not used by accident: A ratio may be of genuine substantive interest when its separate components are not. For example, BMI is widely used because it measures weight-for-height and as such is regarded as a proxy measure of body fat. Substantive interest is in the influence of body fat on outcome, not weight or height. Weight alone may be considered a measure of body fat, but BMI is measured with less error because it aims to remove the effect of height (although it may not do so completely or accurately). It is our opinion that when researchers propose a relationship they believe, such as the influence of a ratio on outcome, this should not be cast aside lightly. The substantive question should not be altered for statistical convenience unless we have little choice.
We assume that the aims of analysis are unbiased estimation of a parameter describing the association between a ratio and some outcome, confidence intervals with the ascribed coverage and fully efficient parameter estimation. There may be other covariates in the analysis model, and primary interest may be in one of these, but the properties of the ratio parameter estimator are important nonetheless. There has been no previous methodological work on MI for a ratio covariate, although White et al. [10] and Bartlett et al. [11] allude to the issue, but practitioners are imputing ratio covariates nonetheless [12]. We aim to highlight issues with imputing an incomplete ratio covariate and to identify imputation strategies that are practicable for applied statisticians.
Despite the positive features listed previously, MI is neither the only approach to dealing with missing covariates, nor necessarily the best approach for any given analysis. Joint models for the outcome and covariates may be superior because they make use of the full likelihood in a coherent way. In this paper, we also investigate results for fully Bayesian joint models.
The remainder of this paper is as follows. In Section 2, we introduce and describe our two motivating datasets; in Section 3, we consider candidate models for imputing incomplete ratios. Section 4 presents two case studies, contrasting the different imputation models (for the datasets introduced in Section 2). Section 5 presents a simulation study in a simpler setting than our case studies; and Section 6 is a discussion.
Datasets: Aurum and EPIC-Norfolk
For both of our datasets, regression analyses involving a ratio as a covariate have previously been published [4, 7]. The analysis models used in our example analyses are not the same as those in the original articles, for two reasons: (i) we want to keep the analysis models and imputation models relatively simple, and (ii) we do not wish to make any substantive claims about these data. Therefore, we have chosen to use analysis models resembling but not matching those used in the earlier publications [4, 7].
For both datasets, the analysis model is the Cox model,
h_{i}(t | x_{i}) = H_{0}(t) exp(β_{1}x_{1i} + … + β_{p}x_{pi}),  (1)
where H_{0}(t) is the nonparametric baseline hazard function at time t, h_{i}(t | x_{i}) is the hazard for the ith individual, x_{ci} is the value of the cth covariate in the ith individual and β_{c} is the corresponding log hazard ratio. Survival (or censoring) times are assumed to be fully observed.
The Aurum cohort
The Aurum dataset comes from a South African cohort study of 1350 HIV-infected participants starting antiretroviral therapy. Participants were recruited from 27 centres in five provinces between February 2005 and June 2006 and followed to March 2007. Information was recorded on a range of baseline characteristics, and participants were followed up for death. The aim of the work by Russell et al. [4] was to estimate the influence of hæmoglobin on mortality using a Cox model. Of the participants, 1348 had a recorded time of death/censoring, with 185 deaths occurring within the follow-up time. We restrict our analysis to these 1348 individuals.
The analysis model is (1) with p = 6, where x_{1}, …, x_{6} are age in years, sex, hæmoglobin in g/mL, viral load in copies per mL, CD4 count in cells per μL and BMI. Table 1 provides a summary of these covariates and of weight and height. We give transformations of the covariates used in the analysis model, and summarise the transformed measure in the final column. Note that 381 (28%) patients are missing a weight and/or height measurement, but only five of these have height missing when weight is observed. Five of the covariates are continuous, and one (sex, which is complete) is categorical. Hæmoglobin, weight, height^{2} and BMI appear to be approximately normal on the transformed scale, while (log) viral load and (square root of) CD4 count do not. We focus on the estimation of β_{3} and β_{6}, the log hazard ratios for hæmoglobin and BMI, respectively (hæmoglobin was the focus of the original publication [4]).
Table 1. Aurum: summary of covariates of the analysis model and of components of body mass index (BMI); n = 1348.

| Symbol | Covariate | Frequency missing (%) | Mean (SD) or frequency (%) |
|---|---|---|---|
| x_{1} | Age (years) | 0 (0%) | 37 (9) |
| x_{2} | Sex: male | 0 (0%) | 542 (40%) |
| x_{3} | Hæmoglobin (g/mL) | 143 (11%) | 11.4 (2.3) |
| x_{4} | *Viral load (copies per mL) | 162 (12%) | 4.8 (0.8)† |
| x_{5} | *CD4 count (cells per μL) | 94 (7%) | 8.9 (4.5)† |
| x_{6} = a_{1} ∕ a_{2} | BMI (kg/m^{2}) | 381 (28%) | 21.9 (4.9) |
| a_{1} | ‡Weight (kg) | 376 (28%) | 58 (12) |
| a_{2} | ‡Height (m^{2}) | 275 (20%) | 2.7 (0.3)† |
The EPIC-Norfolk cohort
The European Prospective Investigation Into Cancer and Nutrition (EPIC)-Norfolk study is a large cohort study designed to investigate the link between dietary factors and cancer. Dietary and non-dietary factors were collected at baseline, and participants were followed up for cancer and non-cancer outcomes. We use some of the non-dietary characteristics as covariates and time to death as the outcome.
The analysis model is (1) with p = 6, where x_{1}, …, x_{6} are age, sex, smoking status, systolic blood pressure, diastolic blood pressure and cholesterol ratio. We summarise these six covariates and total cholesterol and HDL in Table 2; none are transformed. In total, 2155 (9%) participants are missing a total cholesterol and/or HDL measurement. HDL is always missing when total cholesterol is missing. Incomplete covariates are all continuous and appear approximately normal, except for HDL, which is positively skewed. We focus on the estimation of β_{6}, the log hazard ratio for cholesterol ratio.
Table 2. EPIC-Norfolk: summary of covariates of the analysis model and of components of cholesterol ratio; n = 22 754.

| Symbol | Covariate | Frequency missing (%) | Mean (SD) or frequency (%) |
|---|---|---|---|
| x_{1} | Age (years) | 0 (0%) | 59 (9) |
| x_{2} | Sex: male | 0 (0%) | 10 145 (45%) |
| x_{3} | Smoking status: ever smoked | 0 (0%) | 11 971 (53%) |
| x_{4} | Systolic blood pressure (mm Hg) | 52 (<1%) | 135 (18) |
| x_{5} | Diastolic blood pressure (mm Hg) | 52 (<1%) | 82 (11) |
| x_{6} = a_{1} ∕ a_{2} | Cholesterol ratio | 2155 (9%) | 4.7 (1.6) |
| a_{1} | †Total cholesterol (mg/dL) | 1514 (7%) | 6.2 (1.2) |
| a_{2} | †HDL (mg/dL) | 2155 (9%) | 1.4 (0.4) |
Discussion
We have presented the results of two case studies involving commonly used ratios and a simulation study based in part on these datasets. A key message is the caution against passive imputation of a_{1} and a_{2} without prior transformation. Superficially, the approach appears to make more use of the available data; however, it is often inefficient and can suffer from large bias. Our analysis of the EPIC-Norfolk data demonstrated this problem in practice. However, in our Aurum case study, the use of passive imputation appeared to make little difference to the substantive results compared to active imputation. Our simulation study confirmed that problems arise when CV(a_{2}) is large. Note that a ratio with very small CV(a_{2}) is unlikely to be used in applied work (unless CV(a_{1}) is also very small) because as CV(a_{2}) → 0, x_{p} becomes a function of a_{1} divided by a constant. We therefore recommend that incomplete ratios be imputed actively or passively after log transformation as in model M6.
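As a rough check of where a dataset sits, CV(a_{2}) can be approximated from summary statistics as SD divided by mean. The sketch below uses the rounded denominator summaries printed in Tables 1 and 2; because those values are rounded, and the Aurum height summary carries a table footnote marker, the results are indicative only. The function and variable names are hypothetical.

```python
def coefficient_of_variation(mean, sd):
    """Return SD / mean for a strictly positive variable."""
    if mean <= 0:
        raise ValueError("mean must be positive")
    return sd / mean


# Rounded (mean, SD) of the ratio's denominator a2, as printed in Tables 1 and 2.
denominators = {
    "Aurum: height^2 (Table 1)": (2.7, 0.3),
    "EPIC-Norfolk: HDL (Table 2)": (1.4, 0.4),
}

cv_by_dataset = {
    name: coefficient_of_variation(mean, sd)
    for name, (mean, sd) in denominators.items()
}
for name, cv in cv_by_dataset.items():
    print(f"{name}: CV(a2) is roughly {cv:.2f}")
```

On these rounded figures the denominator's CV is about 0.11 for Aurum and about 0.29 for EPIC-Norfolk, which is in line with the greater sensitivity to the imputation model seen in the EPIC-Norfolk analysis.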
In considering models for missing data, joint models for the covariates and outcome are attractive because they use the full data likelihood in a coherent way. In our two case studies, we attempted to fit fully Bayesian joint models and summarise posterior distributions for parameters of interest. Computational problems prevented this approach from being useful. In one dataset, some of the models did not appear to converge to any true posterior distribution (or if they did, results were extraordinarily sensitive to the choice of model for the ratio). In the other dataset, it was not possible to load the observed data into WinBUGS, and so the attempt was abandoned.
Compatibility is a useful concept for considering whether various imputation models are sensible. We hypothesised that models M1 and M2–M4 would perform well because of being compatible and semi-compatible respectively, while models M5 and M6 would perform poorly because of being incompatible. In our simulations, M1–M4 did tend to perform well despite being misspecified, and model M5 did often perform poorly. In our EPIC-Norfolk example, where model M5 gave nonsense results, problems could be identified by inspecting the imputed values of x_{p}.
Model M6 was, surprisingly, as good as any other model considered throughout. Although it is more robust than M5, we know it is not completely ‘safe’. In our simulation study, the imputation model assumed (log(a_{1}), log(a_{2}) | y) ∼ N, and because log(x_{p}) = log(a_{1}) − log(a_{2}), this implies (log(x_{p}) | y) ∼ N. The imputation model therefore has mean function log(x_{p}) = α_{0} + α_{1}y, while the analysis model has mean function y = β_{0} + β_{1}x_{p}. In further simulations, we noted that M6 was still robust when R^{2} = 0.5 and CV(a_{2}) = 0.3 (results not shown). We can provide no guarantee for greater values other than that this model will eventually fall apart. However, it is our experience that associations stronger than R^{2} = 0.5 are rare in medical applications.
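The step from joint normality of the log components to normality of the log ratio can be written out as follows. This is a sketch under an added assumption that each conditional mean is linear in y; the symbols γ_{j0}, γ_{j1}, σ_{1}^{2}, σ_{2}^{2} and σ_{12} are introduced here for illustration and do not appear in the original model specification.

```latex
\begin{aligned}
\begin{pmatrix} \log a_1 \\ \log a_2 \end{pmatrix} \,\Big|\, y
  &\sim N\!\left(
    \begin{pmatrix} \gamma_{10} + \gamma_{11} y \\ \gamma_{20} + \gamma_{21} y \end{pmatrix},
    \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{pmatrix}
  \right), \\
\log x_p \mid y = (\log a_1 - \log a_2) \mid y
  &\sim N\!\left( \alpha_0 + \alpha_1 y,\; \sigma_1^2 + \sigma_2^2 - 2\sigma_{12} \right), \\
\alpha_0 = \gamma_{10} - \gamma_{20},
  &\qquad \alpha_1 = \gamma_{11} - \gamma_{21}.
\end{aligned}
```

The second line follows because a linear combination of jointly normal variables is itself normal. The mismatch noted in the text is then visible directly: the imputation model is linear in y on the scale of log(x_{p}), whereas the analysis model is linear in x_{p} itself.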
Some of the issues with model M5 could have been alleviated by using partly parametric imputation techniques such as predictive mean matching (PMM) [30] or local residual draws [28]. In practice, this requires a switch to the chained equations approach rather than a multivariate imputation model. Because a parametric model is used only to identify suitable donors, this makes it impossible to think about compatibility. We investigated PMM in the problematic EPIC-Norfolk dataset and found model M5 much improved. PMM may therefore be a useful adjunct to a suitably chosen imputation model.
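The donor idea behind PMM can be sketched as follows. This is a deliberately simplified, single-predictor illustration and not the procedure used in our analyses: it fits one least-squares line to the observed cases and skips the draw of regression parameters that a proper implementation would make before matching. The function names, the choice of k = 3 donors and the example data are all assumptions.

```python
import random


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def pmm_impute(x, y, k=3, rng=None):
    """Fill each missing y (None) with the observed y of a randomly chosen near donor."""
    rng = rng or random.Random(0)
    observed = [i for i, yi in enumerate(y) if yi is not None]
    intercept, slope = fit_line([x[i] for i in observed], [y[i] for i in observed])
    predicted = [intercept + slope * xi for xi in x]
    completed = list(y)
    for i, yi in enumerate(y):
        if yi is None:
            # The k observed cases whose predicted means are closest to case i.
            donors = sorted(observed, key=lambda j: abs(predicted[j] - predicted[i]))[:k]
            completed[i] = y[rng.choice(donors)]
    return completed


# Hypothetical data: x fully observed, y partly missing.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.1, None, 2.9, 4.2, None, 6.1]
completed = pmm_impute(x, y)
print(completed)
```

Because every imputed value is copied from an observed case, imputations cannot fall outside the range of the observed data, which is one way such methods can avoid implausible imputed values.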
In evaluating methods, we have focused on bias, coverage and efficiency. For those interested in accurate prediction, efficiency may be more important and coverage less so or even unimportant [31]. It is worth noting that precision is also lower for model M5. Therefore, if passive imputation is to be used for a ratio in prediction settings, it should be performed on the log scale.
We have considered the imputation of ratio covariates. Some similar issues arise when the analysis model contains any nonlinear function, for example, interactions and squares. The difference is that in both cases, the main effects and their interaction, or the variable and its square, are included in the analysis model. In the case of squares, a measurement and its square will also be observed or missing simultaneously. Imputation is then complicated by the fact that the analysis model contains both the untransformed variable and a nonlinear function as covariates, rather than just the nonlinear function, as in the case of ratios. This makes issues around compatibility somewhat more complicated. See von Hippel [22], Seaman et al. [23] and Bartlett et al. [11] for recent work on imputation of squares and interactions.
Bartlett et al. proposed the use of rejection sampling when producing imputations and showed it to be useful for imputing squares and interactions; this may therefore be a good approach for imputing ratios. By explicitly involving the analysis model in the specification of the imputation model, each imputation model used in the chained equations is compatible with the analysis model [11]. However, the method is more time-intensive than any imputation models investigated here, and it is yet to become available in standard software packages. It also sacrifices one of the advantages of MI: separation of missing data issues from substantive analyses. However, this may be necessary and has already been partly conceded when we tailor imputation models to be compatible with the analysis model.