## 1. Introduction

Much of the theory and practice of statistics involves fitting parametric models to data. Most text-books, and much of the literature, assume that the model is known and that we obtain observations on the variables that are described by that model: we have model *certainty* and *complete* data. In reality this is almost never the case. Apart from simple sampling experiments like choosing balls from urns, there is no magical process telling us which model is correct. And, especially in large data sets, there are invariably complications in measurement, with observations missing, censored or corrupted in some way. In practice we have model *uncertainty* and *incomplete* data.

There is a long history of discussion of these problems at meetings of the Royal Statistical Society. Two of the most often quoted papers in the statistical literature were read to the Society in the 1970s and are about the analysis of incomplete data: Cox (1972) on censored data and Dempster *et al.* (1977) on the EM algorithm for computing maximum likelihood estimates (MLEs) in incomplete-data problems. Discussion papers on robustness, influence and model choice have addressed various aspects of model uncertainty. Draper (1995) addressed model uncertainty through Bayesian model averaging. Most recently, Greenland (2005) discussed problems of unidentifiable biases in epidemiological studies, raising many of the same issues that we take up here.

An example which illustrates some of these problems is the somewhat controversial assessment of the link between passive smoking and lung cancer. The report of the UK Government's Scientific Committee on Tobacco and Health (Department of Health, 1998) concluded that environmental tobacco smoke was a cause of lung cancer and recommended that smoking in public places should be restricted on the grounds of public health. This advice has been widely heeded, although there remain many dissenting voices. The committee followed Hackshaw *et al.* (1997) in estimating the relative risk as 1.24 (95% confidence limits 1.13–1.36)—prolonged exposure of non-smokers to other people's tobacco smoke increases the risk of lung cancer by 24%. This figure is based on a meta-analysis of 37 published epidemiological studies, each of which compares the risk of lung cancer in non-smokers according to whether the spouse of the subject does or does not smoke. But, as discussed by these and later researchers, this analysis is likely to be influenced by several sources of potential bias, suggesting that the figure of 24%, and the strength of the evidence that is claimed for a causal link, need to be interpreted with considerable caution.
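As a consistency check on these reported figures, the confidence limits can be reconstructed from the point estimate alone, assuming (as is conventional in meta-analysis) that the interval is symmetric on the log relative risk scale with a 1.96 normal quantile; this is a sketch of standard practice, not the authors' actual computation.

```python
import math

# Reported summary: relative risk 1.24 with 95% limits 1.13-1.36.
rr, lo, hi = 1.24, 1.13, 1.36

# Back out the standard error of log(RR) from the interval width,
# assuming symmetry on the log scale.
se = (math.log(hi) - math.log(lo)) / (2 * 1.96)

# Reconstruct the limits from the point estimate and this standard error.
lo_hat = math.exp(math.log(rr) - 1.96 * se)
hi_hat = math.exp(math.log(rr) + 1.96 * se)
print(round(se, 4), round(lo_hat, 2), round(hi_hat, 2))  # 0.0473 1.13 1.36
```

The reconstructed limits match the published ones to two decimal places, confirming that the interval is symmetric on the log scale.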

The problems with this analysis, as with many published meta-analyses, include the following.

- (a) *Publication bias*: the studies in the meta-analysis are those found in a literature review, not necessarily an unbiased sample of all the studies which have been done in the area. If, for example, studies showing a significant and positive effect are more likely to be published than those giving an inconclusive result, then an analysis that is based on the published studies alone will be biased upwards.
- (b) *Confounding*: the risk of lung cancer among non-smokers may also depend on other factors which are themselves correlated with exposure to tobacco smoke. If, for example, a healthy (high fruit and vegetable) diet tends to protect against lung cancer, and diet in non-smoking households tends to be better than in households in which one or both partners smoke, then the apparent relative risk using smoking status alone will again be biased upwards.
- (c) *Measurement error*: the response variable in a case–control study is the level of exposure, measured here by the reported smoking status of the subject's spouse. But this is a very imperfect measure of actual exposure. If, for example, some of the spouses who claim to be non-smokers are in fact current or former smokers, then the relative risk based on observed exposure will be biased downwards.

Hackshaw *et al.* (1997) claimed that publication bias is not a problem in their analysis but gave extended discussions of the other two sources of potential bias. They concluded that, as far as it is possible to estimate from other sources, these biases will roughly cancel, leaving the original crude estimate of 1.24 as their best estimate of relative risk. Their arguments for ignoring these biases have since been vigorously challenged (Copas and Shi, 2000a, b; Nilsson, 2001).

Although these causes of bias are very different, they all come under our general theme of incomplete data and model uncertainty. We could, at least in principle, correct for these biases if we had further data available. For publication bias, we would need data on unpublished as well as published studies; for confounding bias we would need values of all possible confounders for each subject; for measurement error on exposure we would need an unbiased measure of actual exposure for each subject. The problem arises when these complete data are not available. To make progress with the incomplete data we must rely on a model which asserts that these biases are, in some sense, ignorable. For publication bias, standard methods of meta-analysis assume that the probability that a study is published may depend on ancillary quantities such as sample size, but not on the observed outcome of the trial. For confounders, we allow these to be correlated with the observed outcome but not with the main exposure or treatment variable (after conditioning on available covariates). For measurement error, standard models condition on observed covariates and assume that the measurement errors are subsumed within the residual variation of the response.

Model uncertainty is important because these models make assumptions which cannot be tested with the available data. Ignorability assumptions are typically made as a matter of expediency rather than through any conviction that they are true. In general, it is difficult to make any useful inferences at all if models are grossly misspecified, but to study sensitivity to *small* departures from the model is a useful first step. Our aim is to suggest a rather general asymptotic setting for exploring the link between local model uncertainty, defined in an appropriate way, and the bias in likelihood inference. Our set-up includes the above, and other, special cases as discussed in detail in later sections.

Our general notation for complete data *z* and incomplete data *y* is set out in Section 2. This is similar to Heitjan and Rubin's (1991) idea of ‘coarsened’ data, although we use a rather different and simpler formulation. A parametric model *f*_{Z}=*f*_{Z}(*z*;*θ*) specifies the distribution of *z*, but the observable likelihood is based on the derived distribution *f*_{Y}=*f*_{Y}(*y*;*θ*) of *y*. Some examples are discussed in Section 3.

Model misspecification is discussed in Section 4. We assume that inference is based on model *f*_{Z} (with corresponding model *f*_{Y}), but that *z* is in fact generated by a ‘nearby’ distribution *g*_{Z} (with corresponding distribution *g*_{Y}). We follow Copas and Eguchi (2001) by expressing *g*_{Z} in terms of *f*_{Z} and their log-likelihood ratio, although our formulation here is both simpler and more general than in Copas and Eguchi (2001). Taking a geometric view, we see *incomplete-data bias* as the difference between the projection of true model *g*_{Y} onto assumed model *f*_{Y}, and the corresponding projection which we would be able to make with complete data, in which *g*_{Z} is projected onto assumed model *f*_{Z}. Some special cases are discussed in Section 5, generalizing the results of Copas and Eguchi (2001) for univariate missing data problems.

In ordinary statistical problems where we can observe data on *z*, we can make the useful distinction between *f*_{Z} as a ‘true’ model, one that we are willing to assume is correct, and *f*_{Z} as a ‘working’ model, one that we use because it gives a good description of the data. In our formulation, a true model means that *f*_{Z}=*g*_{Z}; a working model means that, roughly, the ‘distance’ between *f*_{Z} and *g*_{Z} is of the same order of magnitude as the sampling error in estimates of parameter *θ*. Incomplete-data problems are much more difficult, because there may be assumptions within model *f*_{Z} which cannot be assessed from data on *y* alone. Thus *f*_{Z} may be grossly misspecified (leading to a large bias) and yet appear to give a good fit to the available data. We argue that, in practice, the model uncertainty when we use *f*_{Y} as a model for incomplete data *y* should be at least as much as the uncertainty that we would have about *f*_{Z} if we were to use it as a working model given the luxury of being able to observe the complete data *z*. This argument gives a lower bound to the size of the incomplete-data bias, as explained in Section 6. As the direction of this bias is unidentified, we treat it as a source of extra uncertainty in inference about *θ*. We do this by widening the conventional confidence interval for *θ* by a factor which turns out to be a rather simple function of the amount of information that is lost in the transformation of *z* into *y*. We show that this factor is bounded above by √2, i.e. ‘double the variance’.
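The sense in which the bound √2 means 'double the variance' can be made concrete: inflating a normal-theory interval's half-width by √2 is the same as doubling the variance of the estimate, since (√2 σ)² = 2σ². The sketch below illustrates this with the standard error implied by the interval in Section 1 as an illustrative input; the exact widening factor for a given problem is derived in Section 6 and depends on the information lost.

```python
import math

se = 0.0473                           # illustrative standard error of log(RR)
half_width = 1.96 * se                # conventional 95% half-width
widened = math.sqrt(2) * half_width   # maximal widening under the sqrt(2) bound

# Widening by sqrt(2) doubles the variance: (sqrt(2)*se)^2 == 2*se^2.
assert abs((math.sqrt(2) * se) ** 2 - 2 * se ** 2) < 1e-12
print(round(widened / half_width, 3))  # 1.414
```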

In Section 7 we return to the problem of confounding in the passive smoking and lung cancer study, again using the data in Hackshaw *et al.* (1997) as an example. Estimates of relative risk are seen to be highly sensitive to assumptions about ignorability. We use this example to illustrate some of the underlying ideas of the paper, presenting this section in such a way that it can be read virtually independently of the more technical material in the earlier sections.

The paper concludes with some comments in Section 8, and a technical appendix giving the proof of the ‘double-the-variance’ result of Section 6.