Local model uncertainty and incomplete-data bias (with discussion)


John Copas, Department of Statistics, University of Warwick, Coventry, CV4 7AL, UK.
E-mail: jbc@stats.warwick.ac.uk


Summary.  Problems of the analysis of data with incomplete observations are all too familiar in statistics. They are doubly difficult if we are also uncertain about the choice of model. We propose a general formulation for the discussion of such problems and develop approximations to the resulting bias of maximum likelihood estimates on the assumption that model departures are small. Loss of efficiency in parameter estimation due to incompleteness in the data has a dual interpretation: the increase in variance when an assumed model is correct; the bias in estimation when the model is incorrect. Examples include non-ignorable missing data, hidden confounders in observational studies and publication bias in meta-analysis. Doubling variances before calculating confidence intervals or test statistics is suggested as a crude way of addressing the possibility of undetectably small departures from the model. The problem of assessing the risk of lung cancer from passive smoking is used as a motivating example.

1. Introduction

Much of the theory and practice of statistics involves fitting parametric models to data. Most text-books, and much of the literature, assume that the model is known and that we obtain observations on the variables that are described by that model: we have model certainty and complete data. In reality this is almost never the case. Apart from simple sampling experiments like choosing balls from urns, there is no magical process telling us which model is correct. And, especially in large data sets, there are invariably complications in measurement, with observations missing, censored or corrupted in some way. In practice we have model uncertainty and incomplete data.

There is a long history of discussion of these problems at meetings of the Royal Statistical Society. Two of the most often quoted papers in the statistical literature were read to the Society in the 1970s and are about the analysis of incomplete data: Cox (1972) on censored data and Dempster et al. (1977) on the EM algorithm for computing maximum likelihood estimates (MLEs) in incomplete-data problems. Discussion papers on robustness, influence and model choice have addressed various aspects of model uncertainty. Draper (1995) addressed model uncertainty through Bayesian model averaging. Most recently, Greenland (2005) discusses problems of unidentifiable biases in epidemiological studies and raises many of the same issues that we go on to discuss here.

An example which illustrates some of these problems is the somewhat controversial assessment of the link between passive smoking and lung cancer. The report of the UK Government's Scientific Committee on Tobacco and Health (Department of Health, 1998) concluded that environmental tobacco smoke was a cause of lung cancer and recommended that smoking in public places should be restricted on the grounds of public health. This advice has been widely heeded, although there remain many dissenting voices. The committee followed Hackshaw et al. (1997) in estimating the relative risk as 1.24 (95% confidence limits 1.13–1.36)—prolonged exposure of non-smokers to other people's tobacco smoke increases the risk of lung cancer by 24%. This figure is based on a meta-analysis of 37 published epidemiological studies, each of which compares the risk of lung cancer in non-smokers according to whether the spouse of the subject does or does not smoke. But, as discussed by these and later researchers, this analysis is likely to be influenced by several sources of potential bias, suggesting that the figure of 24%, and the strength of the evidence that is claimed for a causal link, needs to be interpreted with considerable caution.

The problems with this analysis, as with many published meta-analyses, include the following.

  • (a) Publication bias: the studies in the meta-analysis are those found in a literature review, not necessarily an unbiased sample of all the studies which have been done in the area. If, for example, studies showing a significant and positive effect are more likely to be published than those giving an inconclusive result, then an analysis that is based on the published studies alone will be biased upwards.
  • (b) Confounding: the risk of lung cancer among non-smokers may also depend on other factors which are themselves correlated with exposure to tobacco smoke. If, for example, a healthy (high fruit and vegetable) diet tends to protect against lung cancer, and diet in non-smoking households tends to be better than in households in which one or both partners smokes, then the apparent relative risk using smoking status alone will again be biased upwards.
  • (c) Measurement error: the response variable in a case–control study is the level of exposure, measured here by the reported smoking status of the subject's spouse. But this is a very imperfect measure of actual exposure. If, for example, some of the spouses who claim to be non-smokers are in fact current or former smokers, then the relative risk based on observed exposure will be biased downwards.

Hackshaw et al. (1997) claimed that publication bias is not a problem in their analysis but gave extended discussions of the other two sources of potential bias. They concluded that, as far as it is possible to estimate from other sources, these biases will roughly cancel, leaving the original crude estimate of 1.24 as their best estimate of relative risk. Their arguments for ignoring these biases have since been vigorously challenged (Copas and Shi, 2000a, b; Nilsson, 2001).

Although these causes of bias are very different, they all come under our general theme of incomplete data and model uncertainty. We could, at least in principle, correct for these biases if we had further data available. For publication bias, we would need data on unpublished as well as published studies; for confounding bias we would need values of all possible confounders for each subject; for measurement error on exposure we would need an unbiased measure of actual exposure for each subject. The problem arises when these complete data are not available. To make progress with the incomplete data we must rely on a model which asserts that these biases are, in some sense, ignorable. For publication bias, standard methods of meta-analysis assume that the probability that a study is published may depend on ancillary quantities such as sample size, but not on the observed outcome of the trial. For confounders, we allow these to be correlated with the observed outcome but not with the main exposure or treatment variable (after conditioning on available covariates). For measurement error, standard models condition on observed covariates and assume that the measurement errors are subsumed within the residual variation of the response.

Model uncertainty is important because these models make assumptions which cannot be tested with the available data. Ignorability assumptions are typically made as a matter of expediency rather than through any conviction that they are true. In general, it is difficult to make any useful inferences at all if models are grossly misspecified, but to study sensitivity to small departures from the model is a useful first step. Our aim is to suggest a rather general asymptotic setting for exploring the link between local model uncertainty, defined in an appropriate way, and the bias in likelihood inference. Our set-up includes the above, and other, special cases as discussed in detail in later sections.

Our general notation for complete data z and incomplete data y is set out in Section 2. This is similar to Heitjan and Rubin's (1991) idea of ‘coarsened’ data, although we use a rather different and simpler formulation. A parametric model fZ=fZ(z;θ) specifies the distribution of z, but the observable likelihood is based on the derived distribution fY=fY(y;θ) of y. Some examples are discussed in Section 3.

Model misspecification is discussed in Section 4. We assume that inference is based on model fZ (with corresponding model fY), but that z is in fact generated by a ‘nearby’ distribution gZ (with corresponding distribution gY). We follow Copas and Eguchi (2001) by expressing gZ in terms of fZ and their log-likelihood ratio, although our formulation here is both simpler and more general than in Copas and Eguchi (2001). Taking a geometric view, we see incomplete-data bias as the difference between the projection of true model gY onto assumed model fY, and the corresponding projection which we would be able to make with complete data, in which gZ is projected onto assumed model fZ. Some special cases are discussed in Section 5, generalizing the results of Copas and Eguchi (2001) for univariate missing data problems.

In ordinary statistical problems where we can observe data on z, we can make the useful distinction between fZ as a ‘true’ model, one that we are willing to assume is correct, and fZ as a ‘working’ model, one that we use because it gives a good description of the data. In our formulation, a true model means that fZ=gZ; a working model means that, roughly, the ‘distance’ between fZ and gZ is of the same order of magnitude as the sampling error in estimates of parameter θ. Incomplete-data problems are much more difficult, because there may be assumptions within model fZ which cannot be assessed from data on y alone. Thus fZ may be grossly misspecified (leading to a large bias) and yet appear to give a good fit to the available data. We argue that, in practice, the model uncertainty when we use fY as a model for incomplete data y should be at least as much as the uncertainty that we would have about fZ if we were to use it as a working model given the luxury of being able to observe the complete data z. This argument gives a lower bound to the size of the incomplete-data bias, as explained in Section 6. As the direction of this bias is unidentified, we treat it as a source of extra uncertainty in inference about θ. We do this by widening the conventional confidence interval for θ by a factor which turns out to be a rather simple function of the amount of information that is lost in the transformation of z into y. We show that this factor is bounded above by √2, i.e. ‘double the variance’.

In Section 7 we return to the problem of confounding in the passive smoking and lung cancer study, again using the data in Hackshaw et al. (1997) as an example. Estimates of relative risk are seen to be highly sensitive to assumptions about ignorability. We use this example to illustrate some of the underlying ideas of the paper, presenting this section in such a way that it can be read virtually independently of the more technical material in the earlier sections.

The paper concludes with some comments in Section 8, and a technical appendix giving the proof of the ‘double-the-variance’ result of Section 6.

2. Complete and incomplete data

Suppose that random variable z has distribution z ∼ fZ(z;θ), indexed by unknown parameter θ. Usually both z and θ will be vectors. The presentation will assume that z is continuous so fZ is a probability density function, but we take for granted the obvious changes in notation if some or all of the components of z are discrete. We shall assume throughout that fZ satisfies the regularity conditions that are necessary for the asymptotic properties of maximum likelihood estimates to apply in the usual way.

Define sZ(z;θ)=∂{log(fZ)}/∂θ and IZ=E[−∂²{log(fZ)}/∂θ ∂θT] to be the score vector and information matrix of fZ respectively. Then, given a large sample z1,z2,…,zn, the MLE θ̂Z is given by Σi sZ(zi;θ̂Z)=0 and standard asymptotic theory gives

\[ \sqrt{n}\,(\hat\theta_Z - \theta) \sim N(0,\, I_Z^{-1}). \qquad (1) \]

We shall refer to observations on z as the complete data, the observations that we would ideally like to have for inference about θ.

Sometimes we cannot observe z directly but can observe only the derived random variable y=h(z), for some given function h. Or, there may be other reasons for wishing to base inference on the marginal likelihood using y rather than the full likelihood using z. If h is a one-to-one smooth function then this is just a data transformation, but if h is many to one then there will in general be a loss of information. We shall then refer to observations on y as the incomplete data. Most forms of incomplete data can be written in this way by a suitable choice of z and h. Note that y is not necessarily a numerical scalar or vector in the usual sense; it may contain ranges or subsets in some or all of its components.

The nature of the function h is given by its level sets (y)={z:h(z)=y}. Fig. 1 illustrates these level sets for six simple examples where z has two components, z=(z1,z2). Fig. 1(a) shows single points, so h is 1:1 with (y)=h−1(y). Fig. 1(b) is for missing data, where z1 is the response of interest and z2 is the missing data indicator. If z2=1 we observe z1, but if z2=0 then z1 is missing so all that we can say about z1 is that −∞<z1<∞. In Fig. 1(c), z1 is the response of interest, but it can only be observed with measurement error. Here z2 is the measurement error so what we observe is the sum z1+z2.

Fig. 1. Examples of level sets of h(z)

Next, in Fig. 1(d), we have data which may be rounded. If z2=1 we observe z1 exactly, but if z2=0 then z1 is rounded to the nearest integer. Fig. 1(e) represents two competing risks, where we observe the minimum of two (potential) cause-specific lifetimes. If z is in the upper octant, we observe z1 but z2 is censored so all we know is that z2>z1, and the other way round if z is in the lower octant. Finally, Fig. 1(f) illustrates the case of a hidden variable: we observe z1 but not z2.

These examples suggest that the components of vector z will typically consist of the main response variables of interest plus subsidiary variables that are involved in the process of observation. For Figs 1(b)–1(f), conventional models for fZ may assume that z1 and z2 are independent. Misspecifications of these models would then allow (z1,z2) to be correlated, or the measurement process non-ignorable. Figs 1(b), 1(f) and 1(c) are simplified versions of the problems of publication bias, hidden confounders and measurement error which were mentioned in Section 1 in connection with the passive smoking study.

This set-up is quite similar to that of the ‘coarse data’ model of Heitjan and Rubin (1991), who envisaged a function that is analogous to h above, but with two arguments, z plus a stochastic ‘coarsening variable’. Equivalently we could think of h as a stochastic observation equation converting z into an observable y. Here we subsume the coarsening variable of the observational process into the vector z, so that h can be thought of as deterministic. Although this simplifies the relationship between y and z, a disadvantage of our notation is that it can obscure the distinction between components of θ of interest, and components of θ which are simply nuisance parameters of the coarsening process.

The model fZ(z;θ) for z induces the corresponding model fY(y;θ) for y. In a somewhat informal notation,

\[ f_Y(y;\theta) = \int_{(y)} f_Z(z;\theta)\,dz, \qquad (2) \]

where (y) on the integration sign means integration with respect to z over the level set that is given by y=h(z). If, for example, (y) fixes one component of z but leaves the others as ranges, then the integral returns the value that is fixed, with the appropriate Jacobian for any transformation of that component, and integrates over the ranges of the other components. See Jacobsen and Keiding (1995) for a rigorous formulation of integrals of this kind.

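The score function for fY is now

\[ s_Y(y;\theta) = \frac{\partial \log f_Y(y;\theta)}{\partial\theta} = E_f\{s_Z(z;\theta) \mid y\}, \qquad (3) \]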

the conditional expectation of sZ(z;θ) over the level set (y). If θ̂Y is the MLE based on incomplete data y1,y2,…,yn,

\[ \sqrt{n}\,(\hat\theta_Y - \theta) \sim N(0,\, I_Y^{-1}), \]

where IY is the corresponding information matrix for fY.

A consequence of equation (3) is that IZ − IY is non-negative definite, so the eigenvalues of the matrix

\[ \Lambda = I_Z^{-1/2}\, I_Y\, I_Z^{-1/2} \]

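are between 0 and 1. Matrix Λ reflects the proportion of information that is retained when we observe y instead of z, and its eigenvalues λi can be thought of as relative efficiencies in estimating contrasts of θ using θ̂Y instead of θ̂Z. In particular, for estimating φ=dTθ,

\[ \lambda_{\min} \le \frac{\operatorname{var}(\hat\phi_Z)}{\operatorname{var}(\hat\phi_Y)} = \frac{d^{T} I_Z^{-1} d}{d^{T} I_Y^{-1} d} \le \lambda_{\max}, \qquad (5) \]

where φ̂Z=dTθ̂Z, φ̂Y=dTθ̂Y, the variances are asymptotic, and λmin and λmax are the smallest and largest eigenvalues of Λ.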

For simplicity of notation we have defined θ as a single vector parameter indexing both fZ and fY. In practice, only some components of θ will be of interest, and so only some of the eigenvalues of Λ will be relevant. In some applications it will be useful to label nuisance parameters explicitly by writing θ as (θ,ψ). In other cases, we need to extend the notation by allowing fZ to depend on nuisance parameters which do not enter fY. In the general formulation this means that some of the eigenvalues of Λ may be 0. However, we assume throughout that, under model fZ, the components of θ which are of interest are fully identifiable from observations on y. This means that model fZ must include sufficiently strong assumptions to make this so, e.g. the ignorability assumptions that are implied by equations (7) and (13) in the next section.
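The relative-efficiency bounds of this section can be checked numerically. The following is a minimal sketch, assuming two hypothetical information matrices `I_Z` and `I_Y` (not taken from the paper) chosen so that IZ − IY is non-negative definite; it forms Λ in its symmetric version and compares the efficiency ratio for a few contrasts d with the extreme eigenvalues.

```python
import numpy as np

def sym_inv_sqrt(m):
    """Inverse symmetric square root of a positive definite matrix."""
    vals, vecs = np.linalg.eigh(m)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

# Hypothetical information matrices (illustration only).
# I_Z - I_Y = [[2, 0.5], [0.5, 0.5]] is positive definite.
I_Z = np.array([[4.0, 1.0], [1.0, 3.0]])
I_Y = np.array([[2.0, 0.5], [0.5, 2.5]])

root = sym_inv_sqrt(I_Z)
Lam = root @ I_Y @ root          # symmetric form of Lambda
lam = np.linalg.eigvalsh(Lam)    # eigenvalues in ascending order

def relative_efficiency(d):
    """Ratio d' I_Z^{-1} d / d' I_Y^{-1} d for the contrast d' theta."""
    return (d @ np.linalg.solve(I_Z, d)) / (d @ np.linalg.solve(I_Y, d))

for d in (np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, -2.0])):
    print(d, round(relative_efficiency(d), 4), "within", np.round(lam, 4))
```

Every contrast gives a ratio between the smallest and largest eigenvalue of Λ, which is the sense in which the eigenvalues act as relative efficiencies.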

3. Examples

3.1. Missing data

To start with the simplest case, suppose that we are trying to sample a scalar random variable t, but that some observations are missing. Let r be the response indicator, which is equal to 1 if t is observed and equal to 0 if t is missing. Then z=(t,r), and

\[ y = h(t,r) = \begin{cases} (t,\,1) & \text{if } r=1, \\ (\mathbb{R},\,0) & \text{if } r=0. \end{cases} \qquad (6) \]

We use the symbol ℝ here to mean that when r=0 all we know is that t takes some value in (−∞,∞). This is the set-up that was noted earlier in Fig. 1(b).

The simplest model here is the data missing completely at random (MCAR) model, which asserts that

\[ f_Z(t,r;\theta,\psi) = f_T(t;\theta)\,\psi^{r}(1-\psi)^{1-r}, \qquad (7) \]

where fT is the marginal density of t and 1−ψ is the probability that an observation is missing. We assume that parameters θ and ψ are functionally independent. This model asserts that t and r are independent, so the likelihood for θ is the usual likelihood for the observed cases only, without the need to know the value of ψ.

If sT(t;θ) is the score function for the main model fT(t;θ), the score functions for (θ,ψ) are

\[ s_Z = \begin{pmatrix} s_T(t;\theta) \\ s_R(r;\psi) \end{pmatrix}, \qquad s_Y = \begin{pmatrix} r\,s_T(t;\theta) \\ s_R(r;\psi) \end{pmatrix}, \qquad (8) \]

where sR(r;ψ)=(r−ψ)/{ψ(1−ψ)}. The matrix Λ is diagonal, with diagonal entries equal to ψ for the θ-components and 1 for the ψ-component. As expected, the relative efficiency of estimating θ just reflects the reduction in the sample size that is caused by the missing data, and there is no loss of efficiency in estimating ψ because r is always observed.
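The claim that the θ-entry of Λ equals ψ under data MCAR can be illustrated by simulation. This is a rough sketch, assuming for illustration that fT is N(θ,1) (so that sT(t;θ)=t−θ) and using arbitrary values of θ and ψ; none of these choices comes from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, theta, psi = 200_000, 1.5, 0.7      # illustrative values only

t = rng.normal(theta, 1.0, size=n)     # complete responses, assumed f_T = N(theta, 1)
r = rng.random(n) < psi                # response indicator, independent of t (MCAR)

s_Z = t - theta                        # theta-score from the complete data
s_Y = np.where(r, t - theta, 0.0)      # theta-score from the incomplete data: r * s_T

I_Z_hat = np.mean(s_Z ** 2)            # Monte Carlo estimate of the information in z
I_Y_hat = np.mean(s_Y ** 2)            # Monte Carlo estimate of the information in y
ratio = I_Y_hat / I_Z_hat              # estimates the theta-entry of Lambda

print(f"estimated I_Y / I_Z = {ratio:.3f}, psi = {psi}")
```

The estimated ratio should be close to ψ, the expected proportion of observed cases.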

More generally, let z=(t,r) where t=(t1,t2,…,tm) is a vector of m measurements and r is the corresponding vector of response indicators r=(r1,r2,…,rm), with ri=1 if ti is observed and ri=0 if ti is missing. Now y=h(t,r)=(t(r),r), where the ith component of t(r) is ti if ri=1 and ℝ if ri=0, each component defined as in expression (6). Then the data missing at random (MAR) model is

\[ f_Z(t,r;\theta,\psi) = f_T(t;\theta)\, f_{R|T}(r,t;\psi), \qquad (9) \]

where fR|T(r,t;ψ) is the conditional probability distribution of r given t, which is assumed to depend on t only through the value of t(r). This is the crucial data MAR assumption, that for any given r the value of fR|T(r,t;ψ) depends on t only through those components ti for which ri=1. Under data MAR, the missing data mechanism is not allowed to depend on the values of any unobserved components of t. Lu and Copas (2004) showed that if fT is a complete distribution family then data MAR is a necessary and sufficient condition for the missing data process to be ignorable, in the sense that the likelihood function for θ can be constructed directly from the marginal distributions of the subsets of the tis that are observed, without the need to know fR|T explicitly.

Several special cases of fR|T(r,t;ψ) are of interest. Data MAR are MCAR if fR|T(r,t;ψ) depends on r but does not depend at all on t. If t1 are strata variables in a sample design, r1=1 (always observed). Case non-response is when the ris are all the same (1 for a responder; 0 for a refusal). The simplest set-up in meta-analysis has m=2, with t1 the result that is reported in a typical study (e.g. estimated relative risk), t2 the within-study estimated standard error of t1, and r1=r2=1 if the study is published and r1=r2=0 otherwise. Standard methods of meta-analysis ignore publication bias by assuming that fR|T depends on t2 but not t1.

There is a very large literature on missing data problems, and extensive discussions of the data MAR assumption. Excellent texts include Little and Rubin (2002) and Schafer (1997), with many references therein. A good general introduction to statistical methods for meta-analysis is Sutton et al. (2000a).

3.2. Potential confounders

Here we envisage an observational study in which we wish to assess the dependence of a response t on a treatment or exposure variable x. Suppose that t is also influenced by a hidden variable c. Then, if c is independent of x, c is ignorable in the sense that it just contributes to the residual variation of t given x. But, if c is associated with x as well as t, then it is a potential confounder. If we could observe z=(t,x,c) then, at least in principle, we could disentangle the influences of x and c on t, but in practice we can observe only y=h(t,x,c)=(t,x).

Using Pearl's notation of the ‘do’ operator (Pearl, 2000), the causal effect of x on t can be described by

\[ f\{t \mid \mathrm{do}(x)\} = \int f(t \mid x, c)\, f(c)\, dc. \qquad (10) \]

This is the distribution which we would obtain if the treatments x had been chosen at random, so the distribution of c is the same within each level of x. However, the incomplete data y=(t,x) only give us information on

\[ f(t \mid x) = \int f(t \mid x, c)\, f(c \mid x)\, dc. \qquad (11) \]

Equations (10) and (11) are the same if and only if c and x are independent. This is our basic model, that we analyse observations on y as if they had arisen from a randomized experiment. The influence of c as a potential confounder will then be discussed as a misspecification from this model in Section 5.2.

We write our parametric model for y as


Our modelling assumption is that fY is the marginal distribution of (t,x) from the distribution of (t,x,c) given by a complete-data model fZ in which x and c are independent, namely


Here, fZ may also involve other parameters, (β,γ), but parameterized in such a way that β and γ disappear from equation (13) when we integrate out c. The θ-components of the score functions sY and sZ come directly from the conditional distributions in equations (12) and (13).

In the special case of a linear model, suppose that


Then the observable response distribution is just the ordinary regression model


Assume that the components of x are centred so that, under fX, x has mean 0 and variance Ψ. Under this set-up, the θ-components of the score functions sZ and sY are


respectively, and the θ-submatrix of Λ is


Under the ignorable data model, the proportion of information that is lost through not observing the hidden variable c is just the square of the partial correlation between t and c given x.
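The link between lost information and the partial correlation can be illustrated with a small simulation. This is only a sketch under an assumed Gaussian linear model, t = 0.5x + 0.8c + error with x and c independent; the coefficients and sample size are hypothetical. In such a model the information about the regression coefficient is inversely proportional to the residual variance, so the proportion retained is estimated by the ratio of residual sums of squares with and without c.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5_000
x = rng.normal(size=n)                       # observed exposure
c = rng.normal(size=n)                       # hidden variable, independent of x
t = 0.5 * x + 0.8 * c + rng.normal(size=n)   # assumed linear model (illustration only)

def residuals(target, *covariates):
    """Least-squares residuals of target on an intercept and the covariates."""
    X = np.column_stack([np.ones(len(target)), *covariates])
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return target - X @ coef

e_t = residuals(t, x)                        # t adjusted for x
e_c = residuals(c, x)                        # c adjusted for x
partial_r2 = np.corrcoef(e_t, e_c)[0, 1] ** 2

# Proportion of information retained: var(t | x, c) / var(t | x), estimated by RSS ratio.
retained = np.sum(residuals(t, x, c) ** 2) / np.sum(e_t ** 2)

print(f"retained = {retained:.3f}, lost = {1 - retained:.3f}, partial r^2 = {partial_r2:.3f}")
```

In the sample the identity retained = 1 − (partial correlation)² holds exactly, and both quantities approach their population values, here 0.64/1.64 for the squared partial correlation, as n grows.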

3.3. Other examples

Fig. 1 illustrated some other examples of the general set-up, which we mention here but do not develop in detail.

Following Fig. 1(e), let t and v represent two competing risks, t a lifetime of interest and v the time of (potential) censoring. What we observe in censored survival analysis is the minimum of t and v, and whether the event that is observed at this time was an actual failure or a censored observation. Thus z=(t,v), and y=h(t,v) is {t,[t,∞)} if t ≤ v and {[v,∞),v} if t>v. Standard models (ignorable censoring) assume that t and v are independent, so that fZ=fT(t;θ) fV(v;ψ). If we define sT to be the score vector for fT, the θ-component of sY is sT(t;θ) if t ≤ v and ∂[log{ST(v;θ)}]/∂θ if t>v, where ST(t;θ) is the survival function for fT(t;θ).

The simplest example is when t and v are exponential survival distributions with rate parameters θ and ψ respectively. We find that the θ-submatrix of Λ is the scalar θ/(θ+ψ). In this special case (but not generally) the proportion of information that is lost through the censoring is just equal to the proportion of observations which are censored.
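The exponential censoring result can be checked by Monte Carlo. The sketch below uses arbitrary illustrative rates θ=2 and ψ=1; for an exponential lifetime with rate θ the score is 1/θ − t when the failure is observed, and the derivative of log ST(v;θ) = −θv, namely −v, when the observation is censored at v.

```python
import numpy as np

rng = np.random.default_rng(1)
n, theta, psi = 200_000, 2.0, 1.0            # illustrative rates only

t = rng.exponential(1.0 / theta, size=n)     # lifetime of interest, rate theta
v = rng.exponential(1.0 / psi, size=n)       # potential censoring time, rate psi
uncensored = t <= v

s_Z = 1.0 / theta - t                                # complete-data score for theta
s_Y = np.where(uncensored, 1.0 / theta - t, -v)      # incomplete-data score for theta

info_ratio = np.mean(s_Y ** 2) / np.mean(s_Z ** 2)   # estimates theta-entry of Lambda
expected = theta / (theta + psi)
censored_fraction = 1.0 - uncensored.mean()

print(f"I_Y / I_Z = {info_ratio:.3f}, theta/(theta+psi) = {expected:.3f}")
print(f"information lost = {1 - info_ratio:.3f}, censored fraction = {censored_fraction:.3f}")
```

Up to simulation error, the proportion of information lost matches the proportion of censored observations, as stated for this special case.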

For an example with measurement error, following Fig. 1(c), suppose that we are interested in the linear regression of response t on covariates x. But we can only observe t indirectly through the sum t+v, where v is the measurement error. Here z=(t,v,x) and y=(t+v,x). Model fZ is made up of the factors t|x ∼ N(θTx, ψ²−β²), v ∼ N(0,β²) and x ∼ fX. Model fY replaces the first two factors of fZ by (t+v)|x ∼ N(θTx, ψ²). In this model, the measurement error is pure random error and is ignorable in the sense that regression coefficient θ can be estimated from the observable distribution fY. The proportion of information about θ that is retained in the reduction of z to y is just the squared correlation between t and t+v given x, or 1−β²/ψ². The non-ignorable case is when fZ is misspecified so that there may be a dependence between v and t or x or both. In that case, the estimate of θ from data on y may be biased.

4. Misspecified models

In practice we can never be sure that these or any other parametric statistical models are correct. Even if we could observe z there will be uncertainty about fZ, and hence even more uncertainty about fY when we can only observe y. Of particular interest will be uncertainty about the ignorability assumptions that are implied in models such as those of Section 3. In this section we develop a rather general asymptotic theory for local misspecifications of fZ and fY. By ‘local’, we mean model departures of a magnitude which could not easily be detected empirically from samples of the complete data z.

To formulate distributions in a local neighbourhood of fZ, let uZ(z;θ) be any scalar function of z and θ, standardized to have mean 0 and variance 1 under the model fZ. Then for small values of ɛ

\[ g_Z(z;\theta) = f_Z(z;\theta)\,\exp\{\epsilon\, u_Z(z;\theta)\} \qquad (16) \]

is non-negative and integrates to 1 up to and including first-order terms in ɛ, and so identifies a distribution in the neighbourhood of fZ. Essentially, any distribution in this close neighbourhood can be represented by equation (16) with an appropriate choice of ɛ and uZ. Our assumption is that the actual distribution generating z is a member of this family for some small value of ɛ. The family of distributions (16) is a much more general version of the local misspecification family of Copas and Eguchi (2001).

If ɛ=0 then gZ=fZ. Intuitively, ɛ can be thought of as the ‘magnitude’ of misspecification and uZ can be thought of as the ‘direction’ of misspecification. In fact the squared misspecification magnitude ɛ² is, in its leading term, just twice the Kullback–Leibler divergence between fZ and gZ. Geometrically, we can think of the model fZ(z;θ) as belonging to a curve in distribution space, different points on the curve corresponding to different values of θ. Then, if we fix ɛ and imagine θ and uZ ranging over all possibilities, gZ will cover all distributions within a ‘tubular neighbourhood’ of ‘radius’ ɛ about this curve.
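The relation between ɛ and the Kullback–Leibler divergence can be checked numerically in a simple case. The sketch below assumes, purely for illustration, that fZ is the standard normal density and that the misspecification direction is uZ(z)=z, which has mean 0 and variance 1 under fZ; the tilted density is renormalized on a grid so that it integrates to 1. The grid settings are arbitrary.

```python
import numpy as np

def kl_tilted_normal(eps, half_width=12.0, n_grid=200_001):
    """KL divergence between f = N(0,1) and g proportional to f * exp(eps * z)."""
    z = np.linspace(-half_width, half_width, n_grid)
    dz = z[1] - z[0]
    f = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
    g = f * np.exp(eps * z)          # tilt by eps * u_Z with u_Z(z) = z
    g /= np.sum(g) * dz              # renormalize so that g integrates to 1
    return np.sum(f * np.log(f / g)) * dz

for eps in (0.05, 0.1, 0.2):
    kl = kl_tilted_normal(eps)
    print(f"eps = {eps}: KL = {kl:.6f}, eps^2 / KL = {eps ** 2 / kl:.4f}")
```

In this example the tilted density is exactly N(ɛ,1), so the divergence is ɛ²/2 and the printed ratio ɛ²/KL is 2.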

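Using the same (informal) notation as in equation (2), the distribution of y=h(z) that is induced by gZ is

\[ g_Y(y;\theta) = \int_{(y)} g_Z(z;\theta)\,dz \approx f_Y(y;\theta)\,\exp\{\epsilon\, u_Y(y;\theta)\}, \qquad u_Y(y;\theta) = E_f\{u_Z(z;\theta) \mid y\}. \qquad (17) \]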

These and later approximations are correct to first-order terms in ɛ.

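If we fit the model fZ(z;θ) to a random sample of n observations from gZ, the limiting value of the MLE θ̂Z as n→∞ is

\[ \theta_{gZ} = \theta + \epsilon\, I_Z^{-1}\, E_f\{s_Z(z;\theta)\, u_Z(z;\theta)\} \qquad (18) \]

in the sense of almost sure convergence. This follows from expression (16), noting that Ef{sZ(z;θ)}=0. Similarly, if we are sampling from gY, the limiting value of θ̂Y is

\[ \theta_{gY} = \theta + \epsilon\, I_Y^{-1}\, E_f\{s_Y(y;\theta)\, u_Y(y;\theta)\}. \qquad (19) \]

The important point to note is that when ɛ≠0 these are not the same. We define the first-order approximation to the difference θgY − θgZ to be the incomplete-data bias b, which is given by

\[ b = \epsilon\,\{ I_Y^{-1} E_f(s_Y u_Y) - I_Z^{-1} E_f(s_Z u_Z) \} = \epsilon\, E_f\{ (I_Y^{-1} s_Y - I_Z^{-1} s_Z)\, u_Z \}, \qquad (20) \]

where the second form uses Ef(sY uY)=Ef(sY uZ), since uY is the conditional expectation of uZ given y.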

In Fig. 2 we illustrate geometrically what we are assuming about these distributions, and how the incomplete-data bias arises. For any given θ we can think of the misspecification quantity ɛ uZ(z;θ) as the vector joining gZ(z;θ) to fZ(z;θ). This vector has ‘length’ ɛ and ‘direction’ given by the unit vector uZ(z;θ). Fig. 2(a) is the orthogonal case when Ef(sZuZ)=0. This means that θ in expression (16), and θgZ in expression (18), the value given by the projection from gZ onto the line of the model, are the same. Fig. 2(b) is the corresponding diagram for the distributions of y rather than z. Now the vector ɛ uY(y;θ) that is defined in expression (17) is not orthogonal to the line of the model, so the projection of gY onto fY defines some other value of θ. The incomplete-data bias corresponds to the side of the right-angled triangle that is shown. We have only attempted here a very superficial description of the geometrical aspects; concepts of information geometry would be needed for deeper insights (Amari, 1985).

Fig. 2. Distributions of z and y: (a) complete-data model; (b) incomplete-data model

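A natural standardized measure of the size of b is the quadratic form that is defined with respect to the incomplete-data information matrix IY. This gives

\[
\begin{aligned}
b^{T} I_Y b &= \epsilon^2\, E_f\{u_Z (I_Y^{-1}s_Y - I_Z^{-1}s_Z)\}^{T}\, I_Y\, E_f\{u_Z (I_Y^{-1}s_Y - I_Z^{-1}s_Z)\} &\quad (21)\\
&\le \epsilon^2\, \big\| I_Y^{1/2}\, E_f\{(I_Y^{-1}s_Y - I_Z^{-1}s_Z)(I_Y^{-1}s_Y - I_Z^{-1}s_Z)^{T}\}\, I_Y^{1/2} \big\| &\quad (22)\\
&= \epsilon^2\,(1-\lambda_{\min}), &\quad (23)
\end{aligned}
\]

where ‖·‖ denotes the largest eigenvalue. To derive this bound, inequality (22) follows from equation (21) by using an elementary generalization of the Schwarz inequality in multivariate analysis, that, for any scalar random variable a, any vector random variable b with V=E(bbT), and any conformable positive definite matrix C,

\[ \{E(ab)\}^{T}\, C\, \{E(ab)\} \le E(a^2)\, \| C^{1/2} V C^{1/2} \|. \]

The bound is attained when a=αTb, where α is the principal eigenvector of the matrix C1/2VC1/2. Expression (23) follows from inequality (22) since

\[ E_f\{(I_Y^{-1}s_Y - I_Z^{-1}s_Z)(I_Y^{-1}s_Y - I_Z^{-1}s_Z)^{T}\} = I_Y^{-1} - I_Z^{-1}, \]

so the matrix inside the norm in inequality (22) is I − IY1/2 IZ−1 IY1/2, which has the same eigenvalues as I − Λ.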

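The incomplete-data bias for estimating the scalar parameter φ=dTθ is bφ=dTb. The corresponding inequality in the squared standardized bias is

\[
\begin{aligned}
\frac{b_\phi^{2}}{d^{T} I_Y^{-1} d} &\le \epsilon^2 \left( 1 - \frac{d^{T} I_Z^{-1} d}{d^{T} I_Y^{-1} d} \right) &\quad (25)\\
&\le \epsilon^2\,(1-\lambda_{\min}). &\quad (26)
\end{aligned}
\]

Equality in expression (25) is attained when

\[ u_Z(z,\theta) = \frac{d^{T}(I_Y^{-1}s_Y - I_Z^{-1}s_Z)}{\{ d^{T}(I_Y^{-1} - I_Z^{-1})\, d \}^{1/2}}. \qquad (27) \]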

This misspecification direction uZ(z,θ) is the ‘worst case’ as far as the incomplete-data bias in φ̂Y is concerned. Note that this uZ is orthogonal to the model, so θgZ=θ to first order, which is the case that we have illustrated in Fig. 2(a). This uZ is also the global worst case in the sense of attaining the upper bounds in both inequality (22) and inequality (26) when d is IZ1/2 times the eigenvector of Λ with the smallest eigenvalue.

Inequalities (5) and (26) show how λmin plays a dual role in describing what happens when we estimate a contrast from y rather than from z. In inequality (5), λmin is the lower bound to relative efficiency when the model is correct. In inequality (26), 1−λmin is the multiple of ɛ2 which gives the upper bound to the standardized incomplete-data bias when the model is locally misspecified.

By letting uZ be any standardized function of z, we are allowing for a very general class of local model misspecification. In most cases fZ will be made up of a number of submodels, with θ partitioned accordingly. It is then useful to represent uZ as a corresponding sum of components. Through equation (20), this allows us to decompose the overall incomplete-data bias b into a sum of bias terms that are specific to misspecification of different parts of the model. Equivalently, additional constraints can be imposed on uZ to focus attention on specific problems of interest, e.g. on bias caused by the failure of an ignorability assumption.

Using likelihood methods when sampling from a distribution which does not belong to the assumed model raises important questions which we have not discussed. We have assumed that θgZ and θgY in expressions (18) and (19) are uniquely defined, and later we shall assume that the MLEs inline image and inline image are asymptotically normal. Using White's (1982) more general discussion, Gustafson (2001) showed that these assumptions hold for local model misspecifications (in our notation, for sufficiently small ɛ) under quite weak regularity conditions (see Appendix A of Gustafson (2001)).

A more fundamental question is the interpretation of θ, since once we move outside a parametric model the concept of θ as the ‘true value’ no longer has its usual meaning. Royall and Tsou (2003) distinguished between the ‘object of inference’, θINF say, and the ‘object of interest’, θINT say. The object of inference is the value of θ for which the model is closest to the true distribution in the sense of Kullback–Leibler divergence. This corresponds to θgZ and θgY defined above, for the complete- and incomplete-data models respectively. The object of interest, however, is a matter of the scientific objective of the study, and this may or may not be the same thing as the object of inference. For example, if θINT is the mean of the population from which we are sampling, then θINT=θINF for the models N(θ,σ2) and Poisson(θ), but not in general. Royall and Tsou (2003) argued that parametric inference about θ is only meaningful when θINF=θINT, and this is the assumption behind their idea of the robust adjusted likelihood function. With this assumption, b can be interpreted as the bias of inline image in the usual asymptotic sense, the difference between its expected value in large samples and the object of interest of the true distribution from which the complete data are sampled. In this setting, the difference between θgZ and θ in expression (18) is just an artefact of the notation and is not a bias in any meaningful sense. We could, without loss of generality, reparameterize the model so that θ=θgZ or, equivalently, assume that uZ satisfies the orthogonality constraint Ef(sZuZ)=0 from the outset.

The misspecification function uZ(z;θ) needs to depend on θ because of the constraint Ef(uZ)=0, which is the necessary condition for gZ to integrate to 1 up to linear terms in ɛ. However, the exact nature of this dependence is unimportant for the accuracy of the approximations that are studied here, as uZ only enters our calculations through the first-order term ɛuZ. To simplify the notation we can therefore write uZ(z) instead of uZ(z;θ).

5. Examples continued

5.1. Missing data

To start with, return to the simple data MCAR model (7). The true model is now


If ɛ=0 model (28) is for data MCAR as before. If ɛ≠0, the function uZ(t,r) allows the missing data process to be non-ignorable, in that Pg(r=0|t) can now depend on the (unobserved) value of t.

If IT is the information matrix of the main model fT(t;θ), we find from expressions (8) and (20) that the θ-component of the incomplete-data bias is


where uD(t)=uZ(t,1)−uZ(t,0). The ψ-component of the incomplete-data bias is 0, as we have full information about r under both y and z.

The size of the quantity ɛ uD(t) indicates how much the actual missing data process differs from MCAR. It is easy to show that, for small ɛ,


This variance measures how much the probability of non-response varies across the population of different values of t. Under the assumption of data MCAR, the log-odds ratio in this expression is the same for all t and so its variance is 0.

The standardized bias in equation (21) can now be written


If θ is a scalar parameter, and we let nY=nψ be the expected actual sample size, and σY be the sample standard deviation of the incomplete-data MLE of θ, then expression (29) reduces to


This is equation (5) of Copas and Eguchi (2001). The maximum bias in inequality (30) is the product of four terms: σY (the first-order bias depends on the distribution of t only through the standard error of the MLE), 1−ψ (the proportion of observations that are missing), √nY (bias becomes more important relative to the standard error the bigger is the actual sample size) and sMCAR (the standard deviation of the log-odds ratio measuring the contrast between data MCAR and the actual pattern of missing data). See Copas and Eguchi (2001) for further discussion of this formula and its generalizations.
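The four-term product can be checked numerically in the simplest setting. The sketch below is illustrative only: it assumes that inequality (30) reads |b| ≤ σY(1−ψ)√nY sMCAR, as the verbal description suggests, and it uses a hypothetical logistic response model P(r=1|t)=expit(a+ct) with t∼N(θ,1), which is not taken from the paper. In that model the log-odds of response is a+ct, so its standard deviation is c. The function names and the value nY=400 are assumptions made for the example.

```python
import math
from statistics import NormalDist


def expit(v):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-v))


def observed_mean_bias(theta=0.0, a=0.0, c=0.1, half_width=10.0, steps=20001):
    """Return (E(t | r=1) - theta, psi) for t ~ N(theta, 1) and
    P(r=1 | t) = expit(a + c*t), using the trapezoidal rule."""
    nd = NormalDist(theta, 1.0)
    h = 2.0 * half_width / (steps - 1)
    num = den = 0.0
    for i in range(steps):
        t = theta - half_width + i * h
        wgt = h * (0.5 if i in (0, steps - 1) else 1.0)
        p = expit(a + c * t)
        dens = nd.pdf(t)
        num += wgt * t * p * dens   # contributes to E{t * P(r=1 | t)}
        den += wgt * p * dens       # contributes to psi = P(r=1)
    return num / den - theta, den


def four_term_bound(sigma_y, psi, n_y, s_mcar):
    """Product of the four terms described in the text (assumed form of (30))."""
    return sigma_y * (1.0 - psi) * math.sqrt(n_y) * s_mcar


c = 0.1                               # slope of the log-odds in t (hypothetical)
bias, psi = observed_mean_bias(c=c)
n_y = 400                             # hypothetical expected number of observed cases
sigma_y = 1.0 / math.sqrt(n_y)        # standard error of the observed-case mean
bound = four_term_bound(sigma_y, psi, n_y, s_mcar=c)
print(f"bias = {bias:.4f}, bound = {bound:.4f}")
```

In this example the bias of the observed-case mean is very close to the bound, which agrees with the later remark that, for the mean of a normal distribution, dependence of selection on t that is linear on the relevant scale is the worst case.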

For the more general data MAR model in equation (9), uZ is written as the sum of the two parts


The misspecified model is then


Note that uT(t), satisfying Ef{uT(t)}=0, perturbs the complete-data distribution fT(t;θ). The second part uR|T(r,t), satisfying Ef{uR|T(r,t)|t}=0 for all t, is the perturbation on the conditional distribution of r given t. The important point is that fR|T(r,t;ψ) satisfies the assumption of data MAR (for any given r,fR|T(r,t;ψ) depends on t only through those tis with ri=1), whereas uR|T(r,t) can break the assumption of data MAR by allowing some dependence on the values of ti with ri=0.

Writing uZ=uT+uR|T in equation (20) shows that the incomplete-data bias also splits into two parts. If the main model fT is misspecified, then the fitted model involves a balance of discrepancies across the sample space, and this clearly changes if some values of t are more likely to be observed than others. This is the first component of bias. The second component of the bias is of particular interest in missing data problems since it describes the consequence of the missing data mechanism being non-ignorable. The ψ-elements of both components of bias are 0 (we observe (r,t(r)) under both complete and incomplete data) and so it is only the θ-elements which are of interest. We focus on bMAR, the θ-elements of this second bias component.

Under the data MAR model, both IZ and IY are block diagonal, with the ψ-submatrix the same in each case. Let


Then, from equation (20),


Thus, by analogy with inequality (22),




and λmin is the smallest eigenvalue of inline image.

As before, sMAR can be given a statistical interpretation, contrasting the actual non-ignorable missing data mechanism under gZ with the closest matching data MAR model. These are given respectively by




In this and similar expressions it is unimportant, as far as first-order accuracy in ɛ is concerned, whether the E and var operators in expression (32) are with respect to f or g.

To interpret expression (32), we imagine that we can calculate, for each possible r and t, the logarithm of the ratio of the actual probability of r given t to the probability of r given only the values of the tis for which ri=1. Let this log-ratio be LR say. If the data MAR assumption is true then LR=0 for all possible values of the unobserved tis for which ri=0. The conditional variance in expression (32) is the variance of LR over these potential unobserved values; the larger this is, the more at fault is the assumption of data MAR. Taking the expected value of this variance over all possible tis with ri=1, and then over all possible missing data patterns r, gives s²MAR as an overall measure of non-ignorability.

In the special case of m=1 and the data MCAR model fR|T(r,t)=ψ^r(1−ψ)^(1−r), we have inline image if r=1 and inline image if r=0. Also uR|T(1,t)=(1−ψ)uD(t). Hence equation (31) reduces to expression (29). Another special case of interest is m=2 with fR|T{(0,1),t}=fR|T{(0,0),t}=0 for all t=(t1,t2). This is a regression model with covariate t1, always observed, and response t2, subject to missing observations. Let p(t1)=fR|T{(1,1),t}. Then in this case


There are many other special cases of the general model which are of interest in particular applications. Horton and Fitzmaurice (2002), for example, discussed the analysis of a childhood psychopathology study where one of the variables (t1 say, always measured) identifies the reasons why particular outcomes were not observed. Some causes may be assumed ignorable; others non-ignorable. In the notation above, this could be modelled by assuming that uR|T(r,t)=0 for some values of t1 (the ignorable cases) but allowing uR|T(r,t)≠0 for other values of t1 (the non-ignorable cases). Horton and Fitzmaurice (2002) presented a sensitivity analysis for assessing their assumptions on the non-zero part of this function.

5.2. Potential confounders

Our second example, continuing Section 3.2, concerns the association between response t and treatment x, in the presence of a hidden variable c. Under the ‘randomization’ model (13), c is an intermediate variable which is assumed to be independent of x, giving the observable marginal model (12). But, under the perturbed model gZ, x and c are allowed to be dependent, in which case c is a potential confounder.

Firstly, note that the information matrix IY from equation (12) is block diagonal with respect to the parameter partition (θ,ψ). Similarly, IZ from equation (13) is block diagonal for the parameter partition {(θ,β),γ,ψ}. Let sT|X(t,x;θ) be the score function for θ from the regression factor in equation (12), and sT|XC(t,x,c;θ,β) be the score for (θ,β) from the regression factor in equation (13). Then, for any overall misspecification function uZ, the θ-component of the incomplete-data bias in equation (20) is


where [·] denotes the θ-components of the relevant vector.

It is more informative, however, to decompose uZ into additive terms affecting the different factors in equation (13). Let


and define


Then the true distribution gZ is


Only the last term here brings in an association between x and c; in fact evaluating equations (10) and (11) from equation (34) we find


This confirms that only the component uXC of uZ affects the role of c as a confounder. The corresponding component of the bias in equation (33) is


since Ef(sT|XC|x,c)=0 for all x and c. We use the notation bRAN (which is analogous to bMAR in the previous section) to emphasize that this is the consequence in terms of bias of the fact that the study is not randomized. Then we find




The variance inline image, which is analogous to inline image in expression (32), measures the strength of the association between c and x.

In the special case of the linear model (14), we find from expression (15) that the θ-component of the incomplete-data bias is approximately


The incomplete-data bias in θ is only affected by the non-ignorability component, uXC, of uZ; the contributions to the bias from the other three components all reduce to 0 in this case. The size of the standardized bias is now


which is maximized over uXC when uXC=cdTx for some constant vector d. This means that for small ɛ the conditional distribution of c given x is approximately


This is the worst case as far as bias in the estimation of the treatment effect is concerned. This corresponds to our intuition, that the most troublesome confounder is one which is linearly correlated with treatment.
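A small simulation can illustrate this intuition. The sketch below does not implement model (14) or expression (35), which are not reproduced here. Instead it assumes a hypothetical linear model t=θx+γc+noise with scalar x, and compares a hidden variable c that depends linearly on x with one that depends on x only through x². All names and parameter values are assumptions made for the example.

```python
import math
import random


def ols_slope(x, y):
    """Least squares slope of y on x (with intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def treatment_bias(kind, n=50000, theta=1.0, gamma=1.0, rho=0.3, seed=1):
    """Bias of the slope of t on x when the hidden variable c is omitted.

    kind = "linear":    c depends on x through x itself
    kind = "quadratic": c depends on x through (x^2 - 1)/sqrt(2)
    In both cases the x-dependent part of c has unit variance.
    """
    rng = random.Random(seed)
    xs, ts = [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        signal = x if kind == "linear" else (x * x - 1.0) / math.sqrt(2.0)
        c = rho * signal + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        t = theta * x + gamma * c + rng.gauss(0.0, 1.0)
        xs.append(x)
        ts.append(t)
    return ols_slope(xs, ts) - theta


bias_linear = treatment_bias("linear")
bias_quadratic = treatment_bias("quadratic")
print(f"linear dependence:    bias = {bias_linear:.3f}")
print(f"quadratic dependence: bias = {bias_quadratic:.3f}")
```

With these settings the linearly related confounder shifts the estimated treatment effect by roughly γρ, whereas the confounder that is equally strongly but non-linearly related to x leaves the slope almost unbiased.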

In many applications x will just be a scalar, in which case the correlation between c and x that is implied by distribution (36) is ɛγdΨ1/2. The size of the squared standardized bias is bounded by


The right-hand side of inequality (37) is the value of equation (35) in the scalar case when uXC=cdx and var(uXC)=1. The first term on the right-hand side of inequality (37) is the proportion of the variance of t that is accounted for by c over the influence of x, and so measures how much we lose by not measuring c. The second term is the dependence between the treatment and confounder that is caused by the lack of randomization, and so is a measure of non-ignorability of the design.

5.3. The Heckman model

We now compare our approach with one of the earliest and probably best-known systematic approaches to modelling ignorability problems: the Heckman model for selection bias. Heckman's original formulation (Heckman, 1979) has led to a very large literature, mostly in econometrics. The main idea is that we have two linear models, one for the response variable(s) of interest, and the other for the mechanism by which these responses are observed or selected. The residuals of the models are correlated, with correlation ρ say. Then if ρ=0 the selection mechanism is ignorable but, if ρ≠0, inference which ignores the selection mechanism will be biased. See Copas and Li (1997) for an extended discussion of the Heckman model and some statistical applications.

When ρ is small, Heckman-type models are special cases of our more general formulation. For example, suppose that


where δ1 and δ2 are standard normal residuals, jointly normal with correlation ρ. We observe x and we observe the sign of v (but not its actual value), but we only observe t if v≥0. We are back in the set-up of Section 5.1 for missing data, with r=1 if v≥0 and r=0 if v<0. The model is data MAR if and only if ρ=0. Note that, since r and x are always observed, ψ can be consistently estimated by probit analysis as


where Φ is the standard normal distribution function.

Given a sample (ti,xi,ri), the least squares estimate of θ based on the observed cases (the cases with ri=1) is


But, using elementary properties of the bivariate normal distribution,


where λ is Mills's ratio λ=φ/Φ and φ is the standard normal density function. Also, for any function a(x) of x,


Hence the asymptotic bias of θ̂ is


Let fX(x) be the distribution of x. Then, in the notation of Section 5.1, the distribution gZ of z=(t,x,r) can be written, for small values of ρ, as




The form of this function follows from the first-order approximation


The variance of u* is


and so, if we set


then expression (43) is of the form (16). From equation (45) we have an interpretation of the misspecification quantity ɛ in terms of the correlation coefficient ρ.

The θ-components of the score and information matrices for the z- and y-versions of this model are given by




and so the general asymptotic formula for the bias in equation (20) gives


which, to first order in ρ, is the same as equation (42).

The standardized size of this bias, defined as in Section 4, is


This can be compared with the maximum bias that is given by equation (23), which in this case is


The simplest example of the Heckman model is when x is the scalar x=1. Here, θ is just the mean of t∼N(θ,σ²) and ψTx is a constant. Then it is easy to check that expressions (46) and (47) are the same. Thus, in the problem of estimating the mean of a normal distribution with missing observations, the Heckman model is the worst case as far as bias is concerned. In general, however, the size of the bias that is given by the Heckman model is strictly less than the upper bound (47).
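The scalar case x=1 is easy to simulate. The sketch below assumes the model takes the form t=θ+σδ1 and v=ψ+δ2 with corr(δ1,δ2)=ρ, with t observed only when v≥0. This is a reading of the description above, since equations (38) and (39) are not reproduced here. Under that assumption the standard bivariate normal result gives E(t|v≥0)=θ+ρσλ(ψ), where λ=φ/Φ is the Mills ratio. Parameter values and function names are hypothetical.

```python
import math
import random
from statistics import NormalDist

_STD = NormalDist()


def mills_ratio(u):
    """lambda(u) = phi(u) / Phi(u)."""
    return _STD.pdf(u) / _STD.cdf(u)


def selected_mean_bias(theta=0.0, sigma=1.0, psi=0.5, rho=0.3,
                       n=200000, seed=0):
    """Monte Carlo estimate of E(t | v >= 0) - theta in the scalar Heckman set-up."""
    rng = random.Random(seed)
    total, count = 0.0, 0
    for _ in range(n):
        d2 = rng.gauss(0.0, 1.0)
        # d1 is built to have unit variance and correlation rho with d2
        d1 = rho * d2 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        if psi + d2 >= 0.0:              # selection indicator r = 1
            total += theta + sigma * d1
            count += 1
    return total / count - theta


mc_bias = selected_mean_bias()
formula_bias = 0.3 * 1.0 * mills_ratio(0.5)   # rho * sigma * lambda(psi)
print(f"simulated bias = {mc_bias:.4f}, rho*sigma*lambda(psi) = {formula_bias:.4f}")
```

The simulated bias of the observed-case mean should agree with ρσλ(ψ) to within Monte Carlo error.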

5.4. Publication bias

We mentioned publication bias in meta-analysis as an important example of incomplete-data analysis, which is particularly contentious in the case of the passive smoking example that was discussed in Section 1. The risk of bias arises because the studies that are available for analysis (the published studies) are not necessarily a random selection from the imagined population of all studies that have been done in the particular area of interest.

The simplest, and most common, setting is the meta-analysis of clinical trials in which a binary outcome is compared for patients given treatment and control. The standard approach is to assume a random-effects model based on the normal approximation of log-odds ratios from the resulting 2×2 tables (Sutton et al., 2000a). Each study gives an estimated log-odds ratio t, reported with (within-study) variance inline image, and inline image is the between-study variance. It is usual to take these variances as known and to ignore the fact that in practice we use sample estimates. The meta-analysis assumes that, independently for each study,


where inline image is the total variance and θ is the true value of the treatment effect that is to be estimated. Given observed pairs (t,x) in the review, θ can be estimated by the sample weighted average of t with weights inversely proportional to x2.
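The weighted average can be written out in a few lines. This is a minimal sketch: the function name and the toy numbers are hypothetical, x is taken to be the square root of the known total variance of each study, and the standard error 1/√Σ(1/x²) is the usual result when those variances are treated as known.

```python
import math


def pooled_estimate(t, x):
    """Inverse-variance weighted average of study estimates t.

    t : list of estimated log-odds ratios
    x : list of total standard deviations (x**2 is the total variance)
    Returns (estimate, standard error), treating the variances as known.
    """
    weights = [1.0 / (xi * xi) for xi in x]
    total = sum(weights)
    estimate = sum(w * ti for w, ti in zip(weights, t)) / total
    std_error = 1.0 / math.sqrt(total)
    return estimate, std_error


# Toy data, purely illustrative
t_obs = [0.2, 0.5, -0.1]
x_obs = [0.1, 0.2, 0.4]
theta_hat, se = pooled_estimate(t_obs, x_obs)
print(f"theta_hat = {theta_hat:.4f}, se = {se:.4f}")
```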

Sutton et al. (2000b) reviewed the large literature on publication bias. One approach, going back to Lane and Dunlap (1978) and Hedges (1984), is to assume a weight function which we interpret as a selection probability


The estimate of θ is then found by maximizing the likelihood based on the conditional distribution of (t,x) given that a study has been selected. The estimate clearly depends on the choice of weight function. For example, w might be modelled to be an increasing function of the standardized size of effect |t|/x. Recognizing the arbitrary nature of such a choice, Greenhouse and Iyengar (1994) introduced an adjustable parameter into w and reported a sensitivity analysis in which the bias is estimated for various values of this parameter.

If fX(x) is the distribution of x across the population of all studies, the conditional joint distribution of (t,x) given that the study is selected into the meta-analysis is, from expressions (48) and (49),


where p=E{w(t,x)} is the marginal proportion of studies selected. The score function for θ from equation (48) is (t−θ)/x², and so the bias b in the estimation of θ from expression (50) is


where w(x)=E{w(t,x)|x} is the conditional probability of selection given x. Note that the size of b depends on how strongly w(t,x) depends on t. The bias is 0 if w is a function of x only, meaning that selection can depend on the ‘size’ of a study but not on its outcome.
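The contrast between outcome-dependent and size-dependent selection can be seen in a short simulation. The sketch below does not evaluate equation (51). It assumes t∼N(θ,x²) for each study and two hypothetical weight functions: w(t,x)=Φ(t/x), which favours large standardized effects, and a weight that depends on x alone. The study sizes and all names are assumptions made for the example.

```python
import random
from statistics import NormalDist

_PHI = NormalDist().cdf


def weighted_average(studies):
    """Estimate of theta with weights inversely proportional to x**2."""
    num = sum(t / (x * x) for t, x in studies)
    den = sum(1.0 / (x * x) for _, x in studies)
    return num / den


def review_bias(weight, n_studies=20000, theta=0.0, seed=2):
    """Bias of the weighted average when study (t, x) is selected
    (published) with probability weight(t, x)."""
    rng = random.Random(seed)
    selected = []
    for _ in range(n_studies):
        x = rng.choice((0.2, 0.4))       # two hypothetical study 'sizes'
        t = rng.gauss(theta, x)          # t ~ N(theta, x**2)
        if rng.random() < weight(t, x):
            selected.append((t, x))
    return weighted_average(selected) - theta


# Selection depends on the outcome through the standardized effect t/x
bias_outcome = review_bias(lambda t, x: _PHI(t / x))
# Selection depends only on the size of the study
bias_size = review_bias(lambda t, x: 0.8 if x < 0.3 else 0.3)
print(f"outcome-dependent selection: bias = {bias_outcome:.3f}")
print(f"size-dependent selection:    bias = {bias_size:.3f}")
```

Only the first weight function produces an appreciable bias, even though the second selects studies far from uniformly.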

If w(t,x) depends only slightly on t, so that the publication bias b is small, then this is just another special case of our general discussion. Define, for each study, the selection indicator r to be 1 if the study is selected and to be 0 otherwise. Then the joint distribution of (t,x,r) is




so that E{δ(t,x)|x}=0 for all x. Then, if δ(t,x) is small (weak selection bias), expression (52) is approximately




We have now written expression (52) in the general form (16), with z=(t,x,r). Note that the fZ-model corresponding to expression (53) is not the same as data MAR, since P(r=0|x)=1−w(x) and x is not observed for unpublished studies.

In the notation analogous to that of Section 5.1, the incomplete-data score for θ is inline image and so the general formula for the bias in equation (20) gives


But inline image and so expression (54) is exactly the same as equation (51). Since the distribution of x cannot be identified from the observed studies alone, a more useful form of expression (54) is


where the observable distribution p(x|r=1)=fOBS(x)=fX(x)w(x)/p can now be identified with the empirical distribution of x over the selected studies.

Model uncertainty is a major issue here. Some researchers argue that so little is known about w(t,x) that a sensitivity analysis is the only sensible way forward. In a series of papers (Copas, 1999; Copas and Shi, 2000b, 2001; Shi and Copas, 2002) we used a Heckman model on the lines of Section 5.3, exploiting the analogy between w(t,x) and expression (44). Copas and Jackson (2004) avoided the approximation that δ(t,x) is small by evaluating the upper bound for b over a wide class of possible weight functions. Others noted that, according to distribution (48), the conditional mean of t given x should not depend on x, so any observed dependence and asymmetry in the scatterplot of t against x must be due to the multiplying factor w(t,x) in expression (50). This leads to the test for publication bias that was proposed by Egger et al. (1997) and the imputation method of Duval and Tweedie (2000).

6. Undetectable misspecification

6.1. Identifiability and tests of fit

Our local approximations have assumed that ɛ is small, but we have not discussed the size of misspecification that is needed for the bias approximations to be useful in the practical setting of inference from a sample of n observations. Standard asymptotic inference allows us to estimate θ to within an accuracy of the order of magnitude Op(n^(−1/2)), and the local bias approximations are of the order of magnitude O(ɛ). Thus our approximations allow us to combine these two sources of error in meaningful ways when ɛ=O(n^(−1/2)). This size of ɛ means that the misspecification is ‘undetectable’ in the sense that empirical evidence for discriminating between f and g remains uncertain even when sample sizes are indefinitely large.

To see this, first consider the ideal situation in which we can observe a sample of the complete data z1,z2,…,zn. If we knew θ and the misspecification function uZ, then we could test gZ in equation (16) against fZ, i.e. test the null hypothesis H0:ɛ=0, with the uniformly most powerful standardized test statistic


Under the null hypothesis, TZ is asymptotically standard normal. If, for significance level α, we reject hypothesis H0 when |TZ|≥d=Φ^(−1)(1−α/2), the asymptotic power function is


since Eg{uZ(z;θ)}=ɛ+O(ɛ²) from equation (16). With ɛ=O(n^(−1/2)), the term n^(1/2)ɛ remains finite for large n, and so the misspecification is undetectable in the sense that the power of the optimum test does not tend to 1 as n→∞. Note that this argument is unaffected if we use θ̂Z in place of θ in equation (55), since


The last term vanishes, which is a consequence of the identity Ef(uZ)=0 for all θ which, on differentiating with respect to θ, yields Ef(∂uZ/∂θ)=−Ef(sZuZ)=0, as discussed at the end of Section 4.
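The sense in which ɛ=O(n^(−1/2)) is undetectable can be illustrated numerically. The sketch below assumes that TZ is asymptotically normal with unit variance and mean n^(1/2)ɛ under gZ, so that the two-sided test has power Φ(−d+n^(1/2)ɛ)+Φ(−d−n^(1/2)ɛ). This is the usual power of a two-sided normal test with a mean shift and is offered as a reading of the power function referred to above, not as a quotation of it. The function name is hypothetical.

```python
import math
from statistics import NormalDist

_STD = NormalDist()


def asymptotic_power(n, eps, alpha=0.05):
    """Power of the two-sided test of H0: eps = 0 based on T_Z,
    assuming T_Z ~ N(sqrt(n) * eps, 1) asymptotically."""
    d = _STD.inv_cdf(1.0 - alpha / 2.0)
    shift = math.sqrt(n) * eps
    return _STD.cdf(-d + shift) + _STD.cdf(-d - shift)


# Local misspecification: eps = 1/sqrt(n), so sqrt(n)*eps stays at 1
for n in (100, 10_000, 1_000_000):
    print(n, round(asymptotic_power(n, 1.0 / math.sqrt(n)), 4))

# Fixed misspecification: power tends to 1 as n grows
for n in (100, 10_000, 1_000_000):
    print(n, round(asymptotic_power(n, 0.1), 4))
```

With ɛ shrinking at rate n^(−1/2) the power stays at the same value below 1 however large n becomes, whereas for fixed ɛ it rapidly approaches 1.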

If we could sample z, then we could use TZ to test for misspecification in any given direction uZ(z;θ). Of particular interest would be the directions (27) which give maximum bias for estimating scalar contrasts of θ. However, the situation is quite different when we can only sample the incomplete data y1,y2,…,yn, since there may be misspecifications in gZ which cannot be detected from data on gY. If uZ is such that uY=dTsY for some vector of constants d, then


and so the misspecification is completely confounded with the unknown value of θ. An example of this happening is the simple pattern mixture model for missing data with z=(t,r) and t|r∼N(θ+rɛ,1). It is obvious that we have no information about the size of ɛ since we can only observe data on the conditional distribution of t given r=1. In this case we find d=1 since uY=sY=r(t−θ).
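The confounding in the pattern mixture example can be seen directly from the observed-data likelihood. The sketch below assumes that r is Bernoulli(ψ), that t|r∼N(θ+rɛ,1), and that t is recorded only when r=1. The data values and the function name are hypothetical.

```python
import math


def observed_loglik(data, theta, eps, psi):
    """Log-likelihood of incomplete data y = (r, t if r == 1).

    data : list of (r, t) pairs, with t = None when r == 0
    Only theta + eps enters, through the observed cases with r == 1.
    """
    ll = 0.0
    for r, t in data:
        if r == 1:
            mean = theta + eps
            ll += math.log(psi) - 0.5 * math.log(2.0 * math.pi) - 0.5 * (t - mean) ** 2
        else:
            ll += math.log(1.0 - psi)
    return ll


# Toy incomplete sample: three observed values and two missing ones
sample = [(1, 0.7), (1, 0.2), (0, None), (1, 1.1), (0, None)]

ll_a = observed_loglik(sample, theta=0.0, eps=0.5, psi=0.6)
ll_b = observed_loglik(sample, theta=0.5, eps=0.0, psi=0.6)
ll_c = observed_loglik(sample, theta=0.0, eps=0.0, psi=0.6)
print(ll_a, ll_b, ll_c)
```

The pairs (θ,ɛ)=(0,0.5) and (0.5,0) give identical likelihoods, so no amount of incomplete data can separate a shift in θ from a misspecification of size ɛ.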

To see that this can always happen, consider the analogue of TZ for observations on y, namely


When we make this test operational by replacing θ by the incomplete-data estimate θ̂Y, the analogue of expression (57) is


The term in braces is the sample residual when uY is projected onto the linear space that is spanned by the components of the score function sY and so is identically zero if uY=dTsY. But this is exactly what happens in the worst case misspecification in equation (27), for then


This is well defined provided that λmax<1. Referring to Fig. 2, the worst case for bias is when the triangle that is defined by the projection of gY onto the model collapses onto a line along the model. The first-order approximation to gY is then a member of the family of distributions fY, just with a shift in the value of θ.

The test that is based on TY fails because we are allowing misspecification to be in any direction, including the worst case function uZ in equation (27). If, however, uZ is known to take some other functional form, the test is possible and the model is identifiable. An interesting case of this is the Heckman model (38) and (39). Here, in the notation of that section,


This is closely related to the standard method of fitting the Heckman model (Heckman, 1979), which is to estimate ψ from equation (40), to add the corresponding estimate of λ(ψTx) as an additional covariate to the linear regression of t on x and to refit by ordinary least squares. This is because, from equation (41), values of t and x among those cases with r=1 can be written


where δ* is a random residual with mean 0. The (unweighted) least squares estimate of σρ, the coefficient on the Mills ratio term in equation (59), gives


where inline image is the (unweighted) least squares coefficient in the observed regression of λ(ψTx) on x. Thus equation (58) is proportional to the Heckman estimate inline image, the constant of proportionality depending only on the values of x in the observed case. Hence, if we test the hypothesis H0:ρ=0 conditional on the observed values of x, the test based on equation (58) is equivalent to the regression test based on inline image.

Little (1985) pointed out, as have many others, that the estimate inline image is unsatisfactory in practice because of its strong dependence on the correct specification of equations (38) and (39), and on the need for the range of variation of the propensity score ψTx to be sufficiently large for the non-linearity of λ(ψTx) to be evident in the observed cases. If the range of values of ψTx is small, the new regressor λ(ψTx) is highly collinear with the existing regressors x, and so equation (60) is unstable. If the test of fit that is based on inline image is unstable, then so is the test that is based on TY. Copas and Li (1997) gave an example where two transformations of t, apparently fitting the observed data equally well, lead to sharply different estimates of ρ, and hence different estimates of θ.
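The collinearity problem is easy to quantify. The sketch below computes the correlation between u=ψTx and the Mills ratio λ(u)=φ(u)/Φ(u) over an equally spaced grid of u-values. The two ranges used are hypothetical and are chosen only to contrast a narrow with a wide spread of propensity scores.

```python
import math
from statistics import NormalDist

_STD = NormalDist()


def mills_ratio(u):
    """lambda(u) = phi(u) / Phi(u)."""
    return _STD.pdf(u) / _STD.cdf(u)


def corr_with_mills(lo, hi, steps=201):
    """Pearson correlation between u and lambda(u) on a grid over [lo, hi]."""
    us = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    ls = [mills_ratio(u) for u in us]
    mu, ml = sum(us) / steps, sum(ls) / steps
    suu = sum((u - mu) ** 2 for u in us)
    sll = sum((l - ml) ** 2 for l in ls)
    sul = sum((u - mu) * (l - ml) for u, l in zip(us, ls))
    return sul / math.sqrt(suu * sll)


corr_narrow = corr_with_mills(0.4, 0.6)    # little variation in the propensity score
corr_wide = corr_with_mills(-3.0, 3.0)     # wide variation
print(f"narrow range: corr = {corr_narrow:.5f}")
print(f"wide range:   corr = {corr_wide:.5f}")
```

Over the narrow range λ(u) is almost exactly linear in u, so the added regressor carries next to no independent information. Only over the wide range does its curvature become visible.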

In Section 5.3 we commented that, in the simpler problem of estimating the mean of a normal sample with missing observations, the Heckman model does attain maximum bias. Here ψTx is a constant, and so the added term λ(ψTx) in equation (59) is completely confounded with the main term in the regression. In this case, equation (58) is identically zero, as are both the numerator and the denominator of inline image in equation (60). Again, this is a case where uY is a linear function of sY.

A rather similar situation arises in the literature on identifiability of competing risks. In a classic paper, Tsiatis (1975) showed that there is no available information on the dependence between the potential lifetimes in the competing risks problem. However, Heckman and Honore (1989) showed that the problem is fully identified if we impose parametric models on the marginal life distributions. Crowder (1994) and others have since pointed out that the resulting estimates are highly dependent on the modelling assumptions that are made. See Crowder (2001) for a good review of this whole area.

There are also many other examples in the literature of identifiable models for the kind of problems that we are considering. For missing data, identifiable parametric models for the joint distribution of t and r of Section 5.1 have been proposed by Baker and Laird (1988), Chambers and Welsh (1993) and Park and Brown (1994), among many others. Identifiability may come through strong assumptions on other aspects of the model; for example Tang et al. (2003) assumed that the marginal distribution of covariates is known. Again, such models involve untestable assumptions or, in a Bayesian context, influential prior distributions.

6.2. Extra uncertainty

This discussion illustrates the central problem of incomplete-data analysis, that unless we make strong and unverifiable modelling assumptions we have little or no information about ɛ, and hence little or no information about bias. Investing in a good model is always important, but particularly so here because of the lack of identifiability of important aspects of the model such as the ignorability assumptions that are implied in all our examples.

The strongest assumption is that ɛ=0. This is modelling in the usual sense: we assume that fZ (and hence fY) is the ‘true model’ in the sense that we are willing to rely on the inferences that are derived from it. In particular, if φ=dTθ, inline image and inline image, the asymptotic coverage probability of the confidence interval


is 1−α. Equivalently, the pivotal quantity inline image is asymptotically standard normal.

In complete-data problems, where we can observe a sample of values of z, a weaker interpretation of fZ is as a ‘working model’: we do not assume that fZ is necessarily true, but we use it for inference on the grounds that it gives an acceptable fit. We interpret this to mean that the actual distribution generating z is gZ, but that fZ is accepted because |TZ|≤d where TZ is constructed for the worst case misspecification (27). Uncertainty of inference is now evaluated with respect to gZ with ɛ≠0, but conditioning on the event |TZ|≤d. Since we are now allowing for misspecification, we expect this to increase uncertainty relative to the standard true model inference, but ɛ is unlikely to be too large because the null hypothesis that ɛ=0 has been accepted by a goodness-of-fit test.

For this discussion to make sense, we need to ensure that the parameter θ retains its meaning under both fZ and gZ. In the terminology of Section 4, this means that we adopt Royall and Tsou's (2003) assumption that θINT=θINF, so that θ=θgZ and Ef(uZsZ)=0. We now consider calculating the confidence interval (61) after accepting that fZ is a ‘working model’. Our conjecture is that, to attain the same confidence coefficient (coverage), this interval needs to be widened to allow for the extra uncertainty through relaxing the status of fZ from a true model to a working model. We shall find a factor k≥1 such that the coverage of inline image in this broader sense remains at least 1−α.

Similarly, when we can only observe incomplete data, we could describe fY as a working model if fY gives an adequate fit to the observed sample of values of y. The difference now is that a good fit of fY no longer implies that ɛ is necessarily small. In the simple pattern mixture model that was mentioned in Section 6.1, for example, the observed values of t may give an excellent fit to the normal distribution that is required by fY, and yet ɛ may be large. If ɛ is large, then inference from inline image may be severely biased.

Since a good fit of fZ to data on z necessarily implies a good fit of fY to the corresponding values of y, we argue that the extra uncertainty that is implied by interpreting fY as a working model is as great as or greater than the extra uncertainty that is implied by interpreting fZ as a working model. We therefore evaluate the factor k that was defined above and use this as a lower bound when basing inference on a working model on y. If we are not willing to make the strong assumption that fY is the true model, and merely rely on its credentials as a working model, the actual error when we estimate φ by inline image may be larger, and possibly substantially larger, than this calculation implies.

This is the key idea of this section. We can never know that our model is ‘correct’; the best that we can hope for is that it gives a good description of the data. The problem is that with data only on y we can never test fZ fully, because of the identifiability problems that were discussed above. Instead we formulate our uncertainty on the assumption that fZ gives a good fit to the (unobserved) data on z and use this as a lower bound to the actual uncertainty that we suffer when fY is used as a model for the data on y. This argument leads to a confidence interval that is wider than expression (61), and hence less misleading than the naïve procedure which makes no allowance at all for model uncertainty.

To study k, we need the joint distribution, under gZ, of the pivot S and the test statistic TZ with uZ in equation (27), which is


where λ is the relative efficiency for estimating φ=dTθ as defined in expression (5). As expected, if we can observe z, we would test for the presence of incomplete-data bias by calculating inline image and inline image from the same set of data and testing the significance of the difference.

When ɛ=0, both S and TZ are asymptotically standard normal. Using equations (24), (1) and (4), the correlation between S and TZ is (1−λ)^(1/2). When ɛ≠0, inline image suffers the first-order approximate bias b from equation (20), and hence the corresponding approximate mean of S is inline image. Note that if ɛ=O(n^(−1/2)) then bS is O(1). Similarly, the expected value of TZ is (1−λ)^(−1/2)bS. Hence, if terms of size O(n^(−1/2)) and smaller are ignored, S and TZ are jointly asymptotically normal with


Thus the conditional distribution of S given TZ is approximately


To this order of approximation, the conditional distribution of S given TZ does not involve the misspecification bias bS. This argument applies for any misspecification function uZ in gZ, not just the worst case function (27) that is used in TZ.

If we had actually observed TZ, we could use this conditional sampling distribution of the pivot S to construct a conditional confidence interval for φ. With the same significance level α this would give


If TZ<d, the lower limit of interval (63) cannot be less than




Similarly, if TZ>−d, the upper limit of expression (63) is at most expression (64) with the sign changed. Thus, if we assume that |TZ|≤d, then a conservative confidence interval for φ is


Since 0<λ≤1 and so inline image, the second equality in equation (65) shows that


Comparing expression (66) with expression (61) we see that relaxing the status of model fZ from a true to a working model has led to a wider interval, by a factor which depends on the value of λ, i.e. on the proportion of information that is retained in the incomplete data. The width of the interval, however, never increases by more than a factor of √2, which we can think of as ‘doubling the variance’, recalculating the usual confidence interval with the variance inline image doubled to inline image.

To see this in another way, if we could observe z, then we could estimate φ with variance (under fZ) of inline image. With incomplete data, the variance increases to inline image. But, if fZ is weakened to a working model, the expanded confidence interval (66) is the same as the conventional interval (61) but with the variance inline image increased further to inline image, which we could call the pseudovariance. From interval (65), the pseudovariance is


The right-hand side of equation (67) splits the pseudovariance into the ordinary variance (assuming that fZ is true), plus the effect of bias resulting from model uncertainty. Fig. 3 shows the total variance inflation factor in equation (67) in terms of λ. The first term is the dotted line and the second the broken line, giving the total as the full line. All three curves are decreasing functions of λ, as expected.

Figure 3.

Pseudovariance versus λ: ·······, mean; - - - - - -, variance; ——, total
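The decomposition plotted in Fig. 3 can be sketched numerically. This is an illustration under stated assumptions rather than a reproduction of equation (67): variances are expressed relative to the complete-data variance, the ordinary incomplete-data variance is taken to be 1/λ in those units, and the pseudovariance is taken to be k²/λ with k²=1+2(λ(1−λ))^{1/2}. The function name `pseudovariance_terms` is hypothetical.

```python
import math


def pseudovariance_terms(lam):
    """Return (variance term, bias term, total) in complete-data units.

    Assumed decomposition: k**2 / lam = 1 / lam + 2 * sqrt((1 - lam) / lam),
    where 1 / lam is the usual incomplete-data variance inflation and the
    second term reflects the allowance for model uncertainty.
    """
    if not 0.0 < lam <= 1.0:
        raise ValueError("lam must satisfy 0 < lam <= 1")
    variance_term = 1.0 / lam
    bias_term = 2.0 * math.sqrt((1.0 - lam) / lam)
    return variance_term, bias_term, variance_term + bias_term


if __name__ == "__main__":
    for lam in (0.1, 0.25, 0.5, 0.75, 1.0):
        v, b, total = pseudovariance_terms(lam)
        print(f"lambda={lam:.2f}  variance={v:.3f}  bias={b:.3f}  total={total:.3f}")
```

Under these assumptions both terms, and hence the total, decrease as λ increases, and the two terms are equal at λ=1/2, where the total is twice the ordinary variance.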

The rather informal argument leading to equation (65) is based on considering the upper and lower confidence limits separately. For a tighter bound, consider the (conditional) coverage probability of the two-sided confidence interval inline image under the working model fZ. This is


Now define k*=k*(λ,α) as the unique solution of


Then the coverage probability (68) is at least 1−α if k≥k*. If k<k* then the coverage falls below 1−α for at least some values of bS.

Fig. 4 illustrates the values of k* for α=0.05 and α=0.01. For each λ, k* increases as α becomes more extreme but is always less than the curve for k in equation (65), which is also shown. In fact

1 ≤ k* ≤ k ≤ √2.   (69)

Of the three inequalities in expression (69), the first is attained when λ=1 (no loss of information), the second is attained in the limit as α→0 and the third is attained when λ=1/2.
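A numerical sketch of k* follows. The defining equation is an assumption here: it supposes that the least favourable case arises when TZ sits on the boundary of the acceptance region, TZ=d, so that S is N{(1−λ)^{1/2}d, λ} and k* solves P(|S|≤kd)=1−α, with d the upper α/2 normal quantile. This form is consistent with the three attainment statements above, but it is a reconstruction. The names `boundary_coverage` and `k_star` are hypothetical.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def boundary_coverage(k, lam, alpha):
    """P(|S| <= k*d) when S ~ N(sqrt(1 - lam) * d, lam) (assumed worst case)."""
    d = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    shift = math.sqrt(1.0 - lam)
    scale = math.sqrt(lam)
    upper = (k - shift) * d / scale
    lower = (-k - shift) * d / scale
    return _STD_NORMAL.cdf(upper) - _STD_NORMAL.cdf(lower)


def k_star(lam, alpha, tol=1e-10):
    """Smallest k with boundary coverage at least 1 - alpha, by bisection."""
    if not 0.0 < lam <= 1.0:
        raise ValueError("lam must satisfy 0 < lam <= 1")
    lo = 0.0
    hi = math.sqrt(lam) + math.sqrt(1.0 - lam)   # coverage is >= 1 - alpha here
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if boundary_coverage(mid, lam, alpha) < 1.0 - alpha:
            lo = mid
        else:
            hi = mid
    return hi


if __name__ == "__main__":
    for lam in (0.25, 0.5, 0.75, 1.0):
        k = math.sqrt(lam) + math.sqrt(1.0 - lam)
        print(f"lambda={lam:.2f}  k*(0.05)={k_star(lam, 0.05):.4f}  "
              f"k*(0.01)={k_star(lam, 0.01):.4f}  k={k:.4f}")
```

The bisection relies on the coverage being increasing in k; the upper bracket is the factor k of equation (65), at which the assumed coverage is already at least 1−α.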

Figure 4.

Values of uncertainty factor k: ——, k=k*(0.05); ·······, k=k*(0.01); ·–·–·, k=k (equation (65))

In summary, we have three asymptotic coverage statements. Firstly,


the conventional asymptotic confidence interval when fZ is the true model. But if fZ has the weaker status of a working model, defined by conditioning on the event |TZ|≤d (so that ɛ=O(n^{−1/2}) for this event to happen with non-vanishing probability), then the conventional interval is no longer a confidence interval with this coverage, as


for at least some possible misspecified distributions gZ. But


for all possible distributions gZ within the asymptotic set-up that is being discussed. Of course this is a hypothetical calculation since TZ is unobserved, but the expanded confidence limits in inequality (71) involve only the ys. Our argument is that in practice, when we only accept fY as a working model, our uncertainty limits for φ should be at least as wide as those in inequality (71).

The fact that, when λ=1, k*=k=1 emphasizes the importance of the assumption that θINT=θINF. For complete data there is then no asymptotic penalty if we treat a working model as if it was a true model. But when λ<1 the distinction is important, as seen in inequality (70).

The same argument also applies in multiparameter problems. Suppose that we want to find a confidence region for θ itself, containing m components say. Let inline image, and inline image as before. Then the multivariate analogues of S and TZ are


If ɛ=0 then var(S)=var(TZ)=I. The distribution of S then allows us to write


where U is a random vector from N(0,I). The usual asymptotic confidence ellipsoid for θ is the set of all values of equation (72) that are consistent with the inequality U^T U≤d, where d is now the (1−α)-quantile of the χ²-distribution on m degrees of freedom.

When ɛ=O(n^{−1/2}) we allow for first-order bias in the same way as before to give the generalization of distribution (62) as S | TZ ~ N{(I−Λ)^{1/2}TZ, Λ}. Now we can write, conditional on TZ,


The expanded confidence region is now the collection of all values of equation (73) that are consistent with the two inequalities TZ^T TZ≤d and U^T U≤d.

To extend the univariate discussion, we now define kmin to be the smallest value of k such that the region that is generated from equation (73) by TZ^T TZ≤d and U^T U≤d lies everywhere within the region that is generated from equation (72) by U^T U≤k²d. The generalization of the ‘double-the-variance’ result is that kmin≤√2. This is proved in Appendix A. The condition for the bound to be attained is that the terms inline image are not all of the same sign, where the λi are the eigenvalues of Λ. If inline image for all i, or inline image for all i, then kmin=k in equation (65).

Fig. 5 shows two examples with m=2, α=0.05 and




In each of the two diagrams, the ellipses illustrate the two distinct components of the right-hand side of equation (73). The inner ellipse that is centred on the origin (drawn in bold) is the locus of the first term of equation (73) as TZ varies over the circle TZ^T TZ=d. An ellipse corresponding to the values of the second term in equation (73) as U varies over the circle U^T U=d is then centred on each point of the first ellipse, to give the collection of ellipses that are drawn with light lines. The outer envelope of these ellipses is the conservative confidence region for θ.

Figure 5.

Examples of the confidence ellipsoid: (a) unattainable case; (b) double variance

Of the two larger concentric ellipses that are drawn with bold lines in each graph of Fig. 5, the inner is the conventional confidence region for θ given by S^T S=d. The outer is the region for θ that is defined by S^T S=2d. Note that the envelope allowing for all possible values of the bias that are consistent with the acceptance region TZ^T TZ≤d is contained within the conventional confidence ellipse but with the variances doubled.

The eigenvalues for these two examples show that kmin=√2 in the second case but not the first. This is confirmed in Fig. 5. In Fig. 5(a) we see that the envelope is everywhere inside the outer ellipse, but in Fig. 5(b) we see that the envelope touches the outer ellipse at exactly four points.

7. Example: passive smoking and lung cancer

We return to the study of passive smoking and lung cancer that was discussed in Section 1 and examine in more detail how sensitively the estimation of relative risk can depend on the influence of potential confounders. As mentioned, Hackshaw et al. (1997) was based on a meta-analysis of 37 separate epidemiological studies which compared the risk of lung cancer among non-smokers according to whether the spouse of the subject did or did not smoke. See Hackshaw et al. (1997) for full details.

Using the notation of Section 3.2, let x be the binary exposure variable taking values 1 (exposed; spouse smokes) and 0 (unexposed; spouse a non-smoker), and let c be another (un-measured) variable which may also affect the risk of cancer. A measure of quality of diet is just one possibility for c. Suppose that an individual's risk of cancer is log-linear in x and c with coefficients θ and α respectively. Then, approximately, the estimate of log-relative-risk θ that is calculated from the jth study, inline image say, will be biased by the amount αdj, where dj is the difference in the average values of c between the n1j exposed cases and the n0j unexposed cases. If inline image is the variance of c within each level of x, and inline image, then


where inline image is the within-study variance and τ² is the between-study (heterogeneity) variance, defined as in the usual random-effects model for meta-analysis (Sutton et al., 2000a). We have formulated the variance in distribution (74) so that inline image marginalizes over dj to the standard random-effects model, as explained below.



and define inline image to be the pooled estimate of log-relative-risk by using the method of DerSimonian and Laird (1986). Then inline image, and its conditional distribution given TZ, are


where inline image. For the pivot inline image, we therefore have


By ignoring the problem of confounding, the published analysis tacitly assumes that x and c are conditionally independent within each study. If this is so then E(dj)=0 and so distributions (74) and (76) marginalize to the usual distributions inline image and S ~ N(0,1). But x and c may be correlated: suppose that, for each individual in the jth study,


This means that


The size of ɛ can be calibrated in terms of ρ=corr(x,c), by


If ρ=0 then ɛ=0 and vice versa.

If we let, for any given inline image and inline image, then the above is just another special case of the general formulation of Section 2. Here we have the trivial extension of allowing for the different sample sizes at each value of j. The model fZ (ignorable confounding) is the product of the submodels (74) and inline image. The model fY is inline image. Comparing inline image with inline image, the usual meta-analysis variance, gives the loss of efficiency through ignoring c as


In the notation of Section 4, the true model gZ is the product of distribution (74) and the distribution of dj in expression (78), and so for small ɛ


With the convention inline image must be rescaled to inline image for the general notation to apply to the jth study. Note that when ɛ=0 the confounding is ignorable and gZ=fZ.

To see how this example illustrates Section 6, if we had observed the values of c we could use the test statistic TZ in expression (75) to check the ignorability assumption in fZ by testing the null hypothesis that ɛ=0. From expression (78), TZ is standard normal if ɛ=0. Further, from equation (80) we see that distribution (76), the conditional distribution of the pivot S given TZ, agrees exactly with the general formula (62).

Following Section 6.2, we can now consider assessing the uncertainty in inline image under three different scenarios, making decreasingly strong assumptions about ɛ:

  • (a) we assume that ɛ=0;
  • (b) we are uncertain about ɛ but assume that, if we had been able to measure c for all the subjects in these studies, we would confirm that there is no significant correlation between c and x;
  • (c) we are uncertain about ɛ and cannot measure c.

Now let qL1(ɛ) and qU1(ɛ) be the lower and upper 2.5% percentiles of the marginal distribution of S, and qL2(ɛ) and qU2(ɛ) be the same percentiles but for the conditional distribution of S given |TZ|≤2. Then the corresponding confidence intervals for θ are, for k=1,2,


Note that qU1(0)=−qL1(0)≃2.

Under the first scenario, C1(0) is the ordinary 95% confidence interval for θ as calculated by Hackshaw et al. (1997). But, if ɛ≠0, C1(ɛ) reflects the confounding bias which we risk under scenario (c). Interval C2(ɛ) also reflects this bias, but conditional on a check that ɛ=0 would seem sensible in the light of measurements on c. According to Section 6.2, C2(ɛ) is everywhere within the interval C*, which is defined to be the interval C1(0) widened by the factor √2. As ɛ is unknown, C* seems a safe inference under scenario (b). Our state of ignorance about ɛ is greater under scenario (c) than under scenario (b), and so our argument is that in practice, when we know little or nothing about ɛ, it is less misleading to report our inference as C* than the usual confidence interval C1(0).

Fig. 6 shows these confidence intervals for the meta-analysis of Hackshaw et al. (1997), plotted against ρ. All values are shown on the original relative risk scale. We have calculated inline image and κj, and estimated the parameters σj and τ directly from Table 1 of Hackshaw et al. (1997). We assume that α and σc are such that inline image: this means that, under scenario (a), if we had been able to design these studies properly by controlling on levels of c, then only half the sample size would have been needed to give the same accuracy in assessing the effect of exposure. To calibrate ɛ in terms of ρ we have used equation (79) and taken σx to be the standard deviation of the binary variable given by the relative numbers of exposed and non-exposed subjects in these studies. The percentiles that are needed for C1(ɛ) and C2(ɛ) are estimated by simulation using 10000 replications of the joint distribution of S and TZ in distributions (76) and (78).
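For readers who are unfamiliar with the pooling step, the following is a minimal sketch of the standard DerSimonian and Laird (1986) moment estimator. It is a generic illustration, not the authors' code: the function name `dersimonian_laird` is hypothetical and the inputs in the usage lines are made-up numbers, not the data of Hackshaw et al. (1997).

```python
def dersimonian_laird(estimates, variances):
    """Random-effects pooling by the DerSimonian-Laird moment method.

    estimates: per-study log-relative-risk estimates.
    variances: per-study within-study variances (all positive).
    Returns (pooled estimate, variance of pooled estimate, tau squared).
    """
    k = len(estimates)
    if k < 2 or k != len(variances):
        raise ValueError("need at least two studies and matching lengths")
    w = [1.0 / v for v in variances]                      # fixed-effect weights
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, estimates)) / sum_w
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, estimates))
    c = sum_w - sum(wi * wi for wi in w) / sum_w
    tau2 = max(0.0, (q - (k - 1)) / c)                    # truncated at zero
    w_star = [1.0 / (v + tau2) for v in variances]        # random-effects weights
    sum_w_star = sum(w_star)
    pooled = sum(wi * yi for wi, yi in zip(w_star, estimates)) / sum_w_star
    return pooled, 1.0 / sum_w_star, tau2


if __name__ == "__main__":
    # Hypothetical inputs, for illustration only.
    theta_hat, var_hat, tau2_hat = dersimonian_laird([0.0, 0.4], [0.01, 0.01])
    print(theta_hat, var_hat, tau2_hat)
```

The heterogeneity variance is estimated from Cochran's Q statistic and truncated at zero, and the pooled estimate is the weighted mean with weights equal to the reciprocals of the within-study variance plus the estimated τ².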

Figure 6.

Relative risk from passive smoking: ——, unconditional; ·–·–·–, conditional; ·······, naïve

The graph illustrates how sensitively inference can depend on assumptions about ignorability. The confidence interval of Hackshaw et al. (1997) for a relative risk of 1.13–1.36 is shown as the inner horizontal dotted lines. This agrees with C1 (full lines) when ρ=0. However, the correlation ρ has to rise to only 0.03 before the conclusion is compromised: if ρ>0.03, C1 includes values for the relative risk of less than 1, so the assertion of a causal link between passive smoking and lung cancer is no longer significant. This accords with the view that is taken by some epidemiologists that any observed relative risk of less than about 2 should be regarded with considerable caution. Interval C2 (the chain curves) controls the risk by staying within the expanded interval C*, which is shown as the outer horizontal dotted lines. Unless we can be confident that the correlation is less than about 0.01, C1 contains values that are outside both C1(0) and C*, but C* is closer. In this sense it seems safer to give the inference as C* rather than C1(0). For these data, C* gives the relative risk from 1.08 to 1.41, still suggesting a causal effect but with a considerably wider margin of error.
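The widening from C1(0) to C* can be approximated from the published limits alone. The sketch below assumes that the interval is symmetric on the log-relative-risk scale and works only from the rounded limits 1.13 and 1.36 quoted above, so it should agree with the reported C* only up to rounding; the function name `widen_on_log_scale` is hypothetical.

```python
import math


def widen_on_log_scale(lower, upper, factor=math.sqrt(2.0)):
    """Widen a relative-risk interval by `factor` about its log-scale midpoint.

    Assumes the interval is symmetric on the log scale, so that the midpoint
    of the log limits is the log-relative-risk estimate.
    """
    if not 0.0 < lower < upper:
        raise ValueError("need 0 < lower < upper")
    log_lower, log_upper = math.log(lower), math.log(upper)
    centre = 0.5 * (log_lower + log_upper)
    half_width = 0.5 * (log_upper - log_lower)
    return (math.exp(centre - factor * half_width),
            math.exp(centre + factor * half_width))


if __name__ == "__main__":
    c_star = widen_on_log_scale(1.13, 1.36)
    print(f"approximate C*: {c_star[0]:.3f} to {c_star[1]:.3f}")
```

Doubling the variance corresponds to the default factor √2; other factors, such as a λ-dependent k, could be passed in the same way.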


Our formulation of y as a deterministic function of z means that the components of z must be a mixture of responses of interest, and any subsidiary stochastic variables which determine the incompleteness of the data. Heitjan and Rubin (1991) avoided this by modelling the measurement process separately. Their formulation is more complicated algebraically, but assumptions such as ‘coarsening at random’ are more transparent. Here, details of the measurement process, including any nuisance parameters that are involved, are buried within the overall model fZ. Covariates are also included in z, so the single model fZ implies a fully random model rather than conditioning on observed covariates as would be usual in regression models. If fZ makes the covariates ancillary for θ, as in equation (13) for example, then this distinction is unimportant as far as asymptotic maximum likelihood estimation is concerned.

The approximations in this paper have been relatively simple because we have only retained linear terms in ɛ, and we have assumed standard asymptotics of maximum likelihood. The major simplifications include that the maximum bias in equation (23) depends on the model only through the information matrices IZ and IY, and the effect of misspecification on the variances can be ignored. Higher order approximations are much more complicated, although some progress is possible in particular cases. Copas and Li (1997), for example, showed that, for the Heckman model (Section 5.3), linear approximations to bias are in fact accurate up to and including second-order terms in ɛ.

By taking ɛ=O(n^{−1/2}) we are working with local asymptotics and not asymptotics in the more usual statistical sense of keeping the model fixed but letting the sample size n→∞. If ɛ=O(1) then the bias will dominate the variance if n is sufficiently large, and the power function in expression (56) tends to 1. Our approximations only provide a sensitivity analysis for undetectably small misspecifications of fZ. Our interest in this is closely analogous to the theory of locally optimal statistical tests, which is concerned with evaluating power functions not against global alternative hypotheses but against alternatives that are increasingly close to the null. Sen (1985), page 96, wrote

‘for a meaningful study of the asymptotic power of tests, one would naturally confine oneself to the locality of the null hypothesis for which the power functions may not converge to 1 … we remark that a locally optimal test may not perform that well for nonlocal alternatives, particularly when the sample size is not large’.

Here we have the added difficulty that tests of fZ against alternatives gZ need the data on z and not just y, and hence our discussion in Section 6.

Our discussion of the consequences of misspecification is strongly parametric in the sense that we are interested only in the bias of the parameters that are defined by fZ. Even if θ=θgZ, and ɛ is sufficiently small for the misspecification to be undetectable, there may be other aspects of the fitted distribution which differ sharply from the corresponding true values for gZ. For example, in fitting a normal distribution we always have consistent estimates for the mean and variance, but this does not mean that the tails of the distribution are correctly estimated. Lawless (2003), pages 250–252, gave some striking examples of this in the context of estimating life distributions. Gustafson (2001) suggested a more general setting for this discussion, in which, for data z generated by true distribution g(z), the object of interest is the functional γg=T{g(·)}. For parametric model f(z;θ), with score function s(z;θ), let γ(θ)=T{f(·;θ)}. Then by concentrating on bias in estimates of θ we are tacitly making the strong assumption that the functional satisfies the identity Eg[s{z;γ^{−1}(γg)}]=0 for all g. For other functionals T, γg may be grossly misspecified, as the examples in Lawless (2003) show.

Our paper raises the fundamental question of how to combine two kinds of uncertainty: uncertainty in the model and uncertainty arising from sampling variability in the usual sense. The Bayesian paradigm provides a complete solution, at least in principle, since the model, parameters and data are then all thought of as randomly sampled from some appropriately defined superpopulation. In our notation, the typical approach is to choose a parametric model for ɛuZ and to assume a joint prior distribution for the parameters of this model along with the parameter θ of the main model fZ. Many Bayesian papers take this approach for particular incomplete-data problems, e.g. Forster and Smith (1998) on missing data and Givens et al. (1997) on publication bias in meta-analysis. Problems of identifiability of model misspecification which we have discussed re-emerge in the form of strong dependence on the prior distribution or strong dependence on the particular form that is assumed for ɛuZ, or both. Sometimes there may be substantive prior information; for example Rubin (1977) discussed how the effects of missing data in a survey can be assessed by using subjective notions about the similarity between respondents and non-respondents, and Scharfstein et al. (2003) argued that the doctors who are involved in a longitudinal trial of human immunodeficiency virus drugs are likely to have clear opinions about which patients are likely to drop out before the end of the trial. However, it is difficult to see that this would be the case in most applications, and so the difficulty of combining the model and sampling uncertainty remains. Greenland (2005), section 4.5, discusses the same issue, emphasizing the inadequacy of methods which claim to ‘let the data speak for themselves’, pointing out that ‘without external inputs, observational data say nothing at all about causal effects’. We have proposed a rather contrived approach to one aspect of the problem in Section 6.2. 
Whether there is a fully satisfactory solution in the frequency domain remains an open question.

Another fundamental question is raised by our use of the term ‘working model’. This term is rather misleading, since it refers to both the model and the data and is not a property of the model as such. The aim is to capture normal practice in exploratory data analysis: we search possible models and use only one for inference after verifying that it gives an acceptable fit to the data. The model chosen then depends on the data and so it is misleading to use (unconditional) sampling distributions as if the model was fixed. A general formulation of this seems difficult. Our approach in Section 6.2 is to condition on the value of a goodness-of-fit test statistic, and then to take bounds over the range of this statistic which would lead us to accept this particular model. In this way, by conditioning out the first-order dependence on ɛ, we can work within both the single model fZ and its close neighbourhood gZ. If the model search had been sufficiently structured, we could work within a nested sequence of models fZ and define model choice in terms of some appropriate penalty function. Whether there is a satisfactory formulation of the problem which avoids making strong assumptions about such a nested sequence again seems an open question.

We have had space to discuss only a few special cases of incomplete data where model uncertainty is problematic. There are many other examples, which are equally important (and difficult), each with its own large literature. We could mention non-compliance in clinical trials (White and Pocock, 1996; Goetghebeur and Lapp, 1997), drop-out in longitudinal trials (Diggle and Kenward, 1994; Little, 1995; Scharfstein et al., 1999), non-ignorable censoring in survival analysis (Moeschberger and Klein, 1995; Scharfstein and Robins, 2002; Siannis et al., 2005) and errors in variables (Gustafson, 2002). Some of these problems, particularly those of Section 5.2, are closely related to the discussion of estimating causal effects (Rubin, 1974; Holland, 1986; Pearl, 2000). The design and analysis of observational studies raises many of these same issues: see Rosenbaum (2002) for a route into this large literature. Rosenbaum (2004) is related to Section 5.2 above and gives a useful review of the author's pioneering papers on sensitivity analysis, which are also reviewed in Copas and Li (1997). Greenland (2003) has given an accessible account of confounding and measurement problems in the context of a topical case-study (power lines and childhood leukaemia), with a broader discussion of the same application in Greenland (2005).


Appendix A

We now verify the statement that is made in Section 6.2, that the double-the-variance property extends to confidence regions for the vector parameter θ.

First we simplify the notation by transforming the parameter space of θ by multiplying by inline image after the shift to inline image. From equation (73), the confidence region now consists of the values of the new parameter ω satisfying


Our problem reduces to whether the region of ω satisfying equation (81) with constraints TZ^T TZ≤d and U^T U≤d is included in {ω | ω^T ω≤2d}.

Let Br(c) be an m-dimensional ball with centre c and radius r. Then the assertion to be proved reduces to


Thus it suffices to show the following lemma.

Lemma 1. 


Proof.  Consider the Lagrangian function


where ν and η are Lagrangian multipliers. If f has an equilibrium point at (a,b,λ,ν,η), then


We can, without loss of generality, assume that 0≤ai≤bi for i=1,2,…,m, these inequalities being strict for at least some values of i. Then η must be positive from equation (83), and so is ν from equation (82). From equation (84) we obtain


which, comparing with equation (82), gives η=ν. Hence


For this to be compatible with equation (83) we must also have η=1.

Finally, we substitute equation (85) into the two constraint equations to give


Adding these two equations gives inline image. Thus all equilibrium points of f satisfy f=2, and so this must be the global maximum (since inline image is clearly bounded).
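The bound can also be checked by brute force. The sketch below assumes that, after rescaling by d, the lemma amounts to the statement that Σ{(1−λi)^{1/2}ai+λi^{1/2}bi}² ≤ 2 whenever a and b are unit vectors and 0≤λi≤1; this form is inferred from the surrounding argument and is not quoted from the lemma itself. All function names are hypothetical.

```python
import math
import random


def random_unit_vector(rng, m):
    """Draw a point uniformly on the unit sphere in m dimensions."""
    v = [rng.gauss(0.0, 1.0) for _ in range(m)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def objective(a, b, lam):
    """Squared length of (1 - lam)^{1/2} * a + lam^{1/2} * b, componentwise."""
    return sum((math.sqrt(1.0 - l) * ai + math.sqrt(l) * bi) ** 2
               for ai, bi, l in zip(a, b, lam))


def largest_sampled_value(m=3, n_draws=20_000, seed=3):
    """Largest objective value over random unit vectors a, b and random lam."""
    rng = random.Random(seed)
    best = 0.0
    for _ in range(n_draws):
        a = random_unit_vector(rng, m)
        b = random_unit_vector(rng, m)
        lam = [rng.random() for _ in range(m)]
        best = max(best, objective(a, b, lam))
    return best


if __name__ == "__main__":
    print("largest sampled value:", largest_sampled_value())
    print("value at a=b=(1), lam=(1/2):", objective([1.0], [1.0], [0.5]))
```

A random search cannot prove the inequality, but no sampled value should exceed 2, and the one-dimensional case a=b=1, λ=1/2 shows that the value 2 is attained under the assumed form.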


Discussion on the paper by Copas and Eguchi

P. J. Diggle (Lancaster University)

The paper addresses themes of near universal relevance to modern statistics: incorrectness of an assumed model and incompleteness of the available data. It has often been said that ‘all models are wrong’; after tonight's presentation we might add that ‘all data are incomplete’. Perhaps the only exception to both of these rules is when data are gleaned from a designed, and properly randomized, experiment for which the data to be collected are precisely specified in advance, and an inferential model is induced by the randomization. So how do we make progress with the plethora of challenging scientific problems which are not amenable to investigation through randomized experiments? We build models. And, in so doing, ‘we buy information with assumptions’ (Coombs, 1964). I very much like the use of the word ‘buy’ in this quotation for two reasons: it reminds us that what is bought may be good or bad value for money, and it invites us to consider the strength of the currency. In general, assumptions which are based on judgments by subject-matter scientists have a high rate of exchange against assumptions which statisticians adopt as a matter of convenience. At the risk of pushing the metaphor too far, I would suggest that the Bayesian response to differing rates of exchange is to adopt a more or less informative prior, whereas the non-Bayesian response is, if only implicitly, to be more or less sceptical of marginally significant results. But, in either case, scientific context is a vital ingredient—an answer couched purely in terms of mathematics is not enough.

And so to the specific content of the paper: equation (16),


raises several questions in my mind, on which I would welcome the authors’ comments. Might this equation not be used as the basis of a sensitivity analysis, with no implication that ɛ is small in any sense? What does it mean in practice to assume that ɛ=O(n^{−1/2})? And do the authors really believe that ‘ɛ is unlikely to be large because the null hypothesis that ɛ=0 has been accepted by a goodness-of-fit test’?

One of the most immediately striking results in the paper is the ‘double-the-variance’ rule. This is beguiling, because it provides a temptingly simple strategy for the hard-pressed consulting statistician. Can the authors give us any guidance on when, in practice, this rule can safely be used? Consider the following model for selection bias in sampling from a population which, unbeknown to the investigator, consists of two subpopulations which differ in respect of both their response distribution and their willingness to respond. Let U be a 50–50 mixture of N(−1,ν²) and N(1,ν²). We never observe U, but we observe Y if and only if U>0. Consider two models for the conditional distribution of Y given U. Model 1 asserts that Y|U ~ N(μ+U,σ²); model 2 that Y|U ~ N(μ,σ²). The target parameter μ=E(Y) is the same under both models. The difference between the two models is formally undetectable from the observed data in the limit ν²→0, since the distribution of the observed response, i.e. of Y conditional on U>0, is then exactly normal under either model. I suspect that the difference is undetectable in practice for larger values of ν² unless the sample size is very large, yet the expectation of the observed response is at least μ+1 under model 1 and exactly μ under model 2, irrespective of the sample size, i.e. the bias due to choosing the wrong model is O(1).
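The selection model in this example is easy to simulate. The sketch below is an illustration only: the parameter values (μ=0, σ=1, ν=0.2), the sample size and the function name `observed_means` are all arbitrary choices made here, not values taken from the discussion.

```python
import random
import statistics


def observed_means(mu=0.0, sigma=1.0, nu=0.2, n=200_000, seed=2):
    """Mean observed response under model 1 and under model 2.

    U is a 50-50 mixture of N(-1, nu^2) and N(1, nu^2); Y is observed only
    when U > 0.  Model 1: Y | U ~ N(mu + U, sigma^2).
    Model 2: Y | U ~ N(mu, sigma^2).
    """
    rng = random.Random(seed)
    y_model1, y_model2 = [], []
    for _ in range(n):
        centre = 1.0 if rng.random() < 0.5 else -1.0
        u = rng.gauss(centre, nu)
        if u <= 0.0:
            continue                      # response not observed
        y_model1.append(rng.gauss(mu + u, sigma))
        y_model2.append(rng.gauss(mu, sigma))
    return statistics.fmean(y_model1), statistics.fmean(y_model2)


if __name__ == "__main__":
    mean1, mean2 = observed_means()
    print(f"model 1 observed mean: {mean1:.3f}")
    print(f"model 2 observed mean: {mean2:.3f}")
```

With these illustrative settings the observed mean under model 1 should sit roughly one unit above μ, whereas under model 2 it should be close to μ, and increasing `n` does not shrink the gap.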

A final comment, which underlines the pervasive nature of the paper, is to note the wide range of topics which the authors list in their closing remarks and to add yet another, namely the question of informative non-missingness. This arises naturally in longitudinal or spatial settings, when we wish to make inferences about a real-valued stochastic process Y(x) on a continuous space x, but the points of x at which we observe Y(x) are determined by a point process which may be stochastically dependent on Y(x). A longitudinal example is a study in which patients report to clinics at times of their own choosing; a spatial example is the siting of environmental monitoring stations near suspected sources of pollution. In both cases, the target for inference is the marginal distribution of Y, which we write as [Y]. The natural way to formulate a model for the resulting data is as a joint distribution [X,Y]=[Y][X|Y] but the data, by construction, are a sample from [Y|X]. Conventional methods of longitudinal and spatial analysis assume that [Y|X]=[Y], i.e. that X and Y are stochastically independent. The longitudinal version of this problem is considered in Lipsitz et al. (2002) and in Lin et al. (2004). At Lancaster, Raquel Menezes and I are investigating the consequences of ignoring a stochastic dependence between X and Y in the spatial setting. Our preliminary conclusions are that, unsurprisingly, inferences which ignore the stochastic dependence can be arbitrarily bad but, admittedly under strong parametric assumptions, correct inferences can be recovered by treating the data as a realization of a marked point process.

I have found this to be an extremely stimulating paper, and I have great pleasure in proposing the vote of thanks.

Roderick Little (University of Michigan, Ann Arbor)

It is a great pleasure to return to the city where I was born and took my first steps in statistics to second the vote of thanks for this fine paper. The last time that I attended a Royal Statistical Society invited paper session as a discussant was in 1976, at the presentation of the famous EM paper by Dempster et al. (1977). I was armed with the confidence of youth but very little prior preparation, thinking that I could extemporize some clever comments, but when my moment came I was overawed, attempted some incoherent phrases and sat down with a very red face. It was good early training in academic discourse, and I hope that I am better prepared this time. I am told that the seconder of the vote should make remarks ‘of a more critical nature’, and hence I shall try to suppress my indoctrination in US midwestern friendliness and give the authors a hard time, while acknowledging their impressive efforts to elucidate a difficult subject.

The paper has many fine features. I liked the lucid formulation and discussion of the incomplete-data problem in the first part. Although the Heitjan and Rubin formulation has some advantages in singling out the coarsening mechanism, I find the proposed framework simpler and quite powerful. The authors are to be commended for the attempt to provide a very general formulation, and this reach is illustrated by the wide array of examples. The range of approximations that they glean from simple and general assumptions, and formulae such as equation (30) are elegant generalizations of ideas that have been previously presented as special cases.

A key feature of many of the authors’ results is the assumption ‘ɛ is small’, i.e. only local deviations from the complete-data model are allowed. Additional references on local model departures for missing data include Troxel et al. (2004) and Verbeke et al. (2001). The ɛ is small assumption is defensible in complete-data modelling, since judicious model checks should be able to detect larger-than-local departures, but in the incomplete-data setting it is often not reasonable, except perhaps when data are close to missing completely at random. The reason (as the authors discuss) is that we can only detect departures from the observed data, and the main bias problem concerns deviations from assumptions for the unobserved data, which of course we do not get to see. Thus non-ignorable data models assume that respondents and non-respondents are different, even after conditioning on variables observed for both. There is no reason to expect these deviations to be O(n^{−1/2}). On the contrary, in a practical sense the deviations are O(1) and may in fact increase with the sample size, since given finite budgets there is a trade-off between devoting resources to increase sample size or to reduce and adjust for non-response. For example, large simple clinical trials may collect less covariate information for non-respondents than smaller trials with more follow-up and covariate information. The ‘small ɛ’ assumption is also doubtful in models where the data are missing at random (MAR). As an extreme example if y is observed when x<c and y is missing if x>c, and x is fully observed, the data are MAR, but missing data models require extrapolating from the distribution of (y|x<c) to the distribution of (y|x>c), which is completely unobserved. Model errors in this extrapolation cannot be detected, however large.

These considerations leave me uneasy about missing data tenets based on a theory of local model departures. In particular, the summary of the paper suggests ‘doubling the variance’ as a ‘crude way of addressing the possibility of undetectably small departures from the model’. An unnecessary feature of the doubling the variance rule is that it ignores dependence on the fraction of missing information λ, which to my mind is an essential feature of the problem. This arises from replacing the λ(1−λ)-term in equation (65) by its upper bound of 1/4, which is an excessive oversimplification: whatever the appropriate adjustment for model uncertainty due to missing data, it should be greater for 50% missing information than for 5% missing information. More importantly, the double-the-variance rule fails to address the possibility of large but undetectable departures from the model, and this is often the main concern when data are missing. The authors argue in Section 6.2 that the rule should be treated as a lower bound for the actual uncertainty in such cases. But lower bounds on uncertainty are of little practical use. I worry that casual readers will be enticed by the simplicity of the doubling the variance rule without reading or fully appreciating the material on page 484 of the paper.
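To see how much is lost by the bound, the term λ(1−λ) can simply be tabulated against its maximum of 1/4, which is attained at λ=1/2. The snippet below is a minimal sketch: it evaluates only this one term, not the full expression in equation (65), and the function name is hypothetical.

```python
def missing_info_term(lam):
    """The lambda * (1 - lambda) term for a fraction of missing information lam."""
    return lam * (1.0 - lam)

BOUND = 0.25  # maximum of lambda * (1 - lambda), attained at lambda = 1/2

for lam in (0.05, 0.10, 0.25, 0.50):
    term = missing_info_term(lam)
    print(f"lambda={lam:.2f}  term={term:.4f}  share of bound={term / BOUND:.2f}")
```

With 5% missing information the term is 0.0475, under a fifth of the bound that the rule uses for every λ.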

Concerning models with data MAR, an alternative to basing missing data adjustments on parametric models and then attempting to adjust for model uncertainty is to attempt to build robustness into the missing data model by relaxing key parametric assumptions. In recent work (Little and An, 2004), we considered multiple imputation of an incomplete variable Y based on a model that

  • (a) regresses Y on a penalized spline of the estimated response propensity, the function of covariates that is vulnerable to misspecification under data MAR, and
  • (b) includes covariates that are orthogonal to the propensity score parametrically, since model mis-specification is less critical in these directions.

This work is generalized to multivariate data in An (2004), which we plan to develop into publications in the near future.

A good way of assessing model uncertainty for non-ignorable missing data models is to assess sensitivity to substantively plausible differences between respondents and non-respondents. One might reason ‘suppose that the missing values for an incomplete variable Y deviate from the observed values by 0.5 or 1 standard deviation, after adjusting for observed covariates—how much impact does that have on the inference?’. Or, in a comparison of treatments, ‘how large does the non-response bias have to be to render the treatment effect insignificant?’. If a single answer is required, we might express differences between non-respondents and respondents by a prior distribution and do a Bayesian analysis (e.g. Rubin (1977)), as the authors mention in the discussion. Thus, for the Heckman model that is discussed in Section 6.1, I favour a sensitivity analysis for different values of ρ, or a Bayesian analysis based on a proper prior for ρ, to attempting to estimate ρ from the data (Little and Wang, 1996). These analyses are subjective, but subjectivity is an unavoidable feature of an explicit treatment of the problem. The Bayesian approach is natural (at least to this Bayesian) and makes assumptions about the non-ignorable features of the model explicit and subject to criticism and debate.
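The two sensitivity questions quoted above can be sketched as a simple 'tipping point' calculation. Everything numerical here is hypothetical, and the sketch makes strong simplifying assumptions that are not in the text: only non-respondents in the treatment arm are shifted, the shift is δ standard deviations relative to respondents, the standard error is held fixed, and significance is judged by a normal 1.96 cut-off.

```python
def shifted_effect(effect_obs, sd, frac_missing, delta):
    """Treatment effect if missing treatment-arm values differ by delta SDs.

    The arm mean is a mixture of respondents and non-respondents, so a shift
    of delta * sd in the missing fraction moves the effect by
    frac_missing * delta * sd.
    """
    return effect_obs + frac_missing * delta * sd

def tipping_delta(effect_obs, se, sd, frac_missing, z=1.96):
    """Shift (in SD units) at which the effect just loses significance."""
    return (z * se - effect_obs) / (frac_missing * sd)

# Hypothetical inputs: observed effect, its standard error, outcome SD,
# and the proportion of missing outcomes in the treatment arm.
effect_obs, se, sd, frac_missing = 2.0, 0.8, 5.0, 0.2

for delta in (0.0, -0.5, -1.0):
    est = shifted_effect(effect_obs, sd, frac_missing, delta)
    print(f"delta={delta:+.1f} SD  effect={est:.2f}  z={est / se:.2f}")

tip = tipping_delta(effect_obs, se, sd, frac_missing)
print(f"effect becomes non-significant once delta < {tip:.3f} SD")
```

Whether a shift of that size is substantively plausible is then a subject-matter judgement, which is where the subjectivity mentioned above enters.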

A final comment about terminology: the authors state in Section 3.1 that ‘data MAR is a necessary and sufficient condition for the missing data process to be ignorable’. This statement conflicts with Rubin's definitions, where ignorability is defined as MAR with the additional condition that the parameters θ and φ have distinct parameter spaces (Rubin, 1976; Little and Rubin, 2002). Without this distinctness condition, the term in the likelihood from the missing data mechanism contains information about θ, so maximum likelihood inference about θ from the full likelihood differs from inference under the ignorable likelihood.

The vote of thanks was passed by acclamation.

Tim F. Liao (University of Illinois, Urbana–Champaign)

I congratulate Copas and Eguchi for their contribution to the literature on two extremely common yet deeply difficult issues in statistical methodology: model uncertainty and missing data. They break new ground by treating missing data as a source of uncertainty and establishing elegantly the bounds of the biases in observational data. Below I focus on the two issues of bounds estimation and model uncertainty.

There are popular methods for dealing with hidden confounders and endogeneity by estimating bounds for treatment effects in non-experimental data such as Manski's (1990) bounds, Heckman and Vytlacil's (1999) bounds or Rosenbaum's (2002) bounds; the first two bounds methods explicitly specify the counterfactual probabilities whereas the last begins with a matching method. Manski (2003) also considered missing data as a partial identification problem. The current proposal uses a general geometric formation for estimation with incomplete data or uncertainty. It would be more informative to evaluate the proposal vis-à-vis the available approaches.

The authors use in their paper a specific version of model uncertainty, namely the misspecification of omitted variables. Chatfield (1995) summarized uncertainty about model structure as one of three types of statistical uncertainty, and he attributed such uncertainty to model misspecification (i.e. omitted variables), to the specification of a set of models of which the true model is a special yet unknown member and to the selection between two or more models of quite different structure.

The concept of ‘model uncertainty’ can be inclusively defined as the property of the presence of many alternative models fY in the model space fY, which includes the ‘working’ model fZ but may not include the ‘true’ model gZ. In addition to omitted variables as misspecification, model uncertainty can come in the form of model causal structure, functional form (of explanatory variables), link function, error distribution and even variable coding scheme. Raftery (1996) dealt with model selection in the model space, which may mean graphical models in Occam's window (Madigan and Raftery, 1994); Draper (1995) examined the choice of link function and of error distribution for generalized linear models as sources of model structural uncertainty; the combined choices between coding scheme and non-additivity can be illustrated with a simple example of merely two five-category ordered variables that could produce as few as three and as many as 25 parameters in a model. All these situations constitute model uncertainty.

A natural next step is to compare with other bounds methods and to consider the other aspects of model uncertainty. I hope that these can be developed at a later stage in sequel papers.

N. T. Longford (SNTL, Leicester)

I have comments on the public image of statistics and on some technical matters.

The public are bombarded by results of scientific research tainted by problems that are similar to those in Hackshaw et al. (1997). Much of the public regard this information assault with cynicism. Downgrading the credibility of the messages appears to be a well-calibrated response to ‘certaintitis’, the lamentable condition of pretending certainty or confidence without a good foundation.

There is much sympathy for the attitude of ‘Can't solve it, so it's not a problem’, which is the target of the authors’ criticism. The alternative that is presented is commendable—let us just speculate intelligently what might happen if we had a little more information, or had almost all the information that would enable a credible solution by the established methods.

I liked very much the illustration of how powerful the concept of missing information is. I would like to take it one step further, though, by considering the unknown model as the missing information when all the variables are observed, in a finite sample setting. The E-step of the EM algorithm would conclude with the conditional probabilities of which model is valid, and the M-step would linearly combine the maximum likelihood estimators based on the competing models. Suppose that one of the models is such that all the others are its submodels generated by constraining the continuum of values of one or several parameters to 0. Then the EM algorithm ends up assigning all the weight to the most general model, ruling out any model reduction that we often regard as desirable.

The source of this contradiction is our reliance on asymptotics. Maximum likelihood with a valid model is efficient only asymptotically. For finite samples, maximum likelihood based on some submodels of a valid model may be more efficient, because the bias squared that is incurred is smaller than the variance reduction (Longford, 2003). I am not trying to defend the analysis in Hackshaw et al. (1997) but want to point out that the focus on bias is appropriate only asymptotically.
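A worked instance of this trade-off, under assumptions chosen here for illustration (simple linear regression with fixed design points, known error variance, and the mean response at a point x0 as the target): the full model estimates intercept and slope, while the submodel sets the slope to 0 and uses the sample mean. The submodel is biased but has smaller variance, and it wins whenever the squared slope is below σ²/Sxx.

```python
def mse_full(sigma2, xs, x0):
    """MSE of the fitted value at x0 under the (valid) two-parameter model."""
    n = len(xs)
    xbar = sum(xs) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    return sigma2 * (1.0 / n + (x0 - xbar) ** 2 / sxx)

def mse_sub(sigma2, beta1, xs, x0):
    """MSE of the sample mean as an estimate of the mean response at x0.

    Variance sigma2 / n plus squared bias (beta1 * (x0 - xbar))**2.
    """
    n = len(xs)
    xbar = sum(xs) / n
    return sigma2 / n + (beta1 * (x0 - xbar)) ** 2

xs = [float(i) for i in range(10)]  # hypothetical design points
x0, sigma2 = 9.0, 1.0

small_slope = (mse_sub(sigma2, 0.05, xs, x0), mse_full(sigma2, xs, x0))
large_slope = (mse_sub(sigma2, 0.30, xs, x0), mse_full(sigma2, xs, x0))
print("slope 0.05: submodel %.3f vs full %.3f" % small_slope)
print("slope 0.30: submodel %.3f vs full %.3f" % large_slope)
```

As n grows, Sxx grows and the range of slopes for which the submodel wins shrinks towards zero, which is the sense in which a focus on bias alone is an asymptotic one.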

Publication bias is an application of missing information. The complete set of studies is not likely to be representative of the contexts (countries, subpopulations, etc.) for which the inferences from the meta-analysis are meant to apply. The consequent study conduct bias should be regarded as another nuisance (observational) feature and associated with another source of bias, unless there is no between-study variation.

Ben Torsney and Helen Parker (University of Glasgow)

We raise two issues.

Optimal design

Assume a regression model with control over the explanatory variables and that we would opt to perform a design, optimal with respect to some criterion, if all observations are guaranteed to be realized.

One alternative, when observations are likely to be missing, is to choose a design which optimizes an expected value of the criterion. The resultant design is likely to differ from the ‘non-missing’ one.

When the model is linear, one exception appears to arise in the case of exact D-optimality (when we wish to maximize the determinant of the information matrix) under the model of data missing completely at random.

We have in mind a scenario with a discrete design space consisting of treatment indicators or differences, as in paired comparisons experiments. We must choose an N-point design in a k-parameter context, i.e. choose v1,v2,…,vN (v=f(x)) from a design space V⊆ℝ^k to maximize det(M), where M=Σ_{i=1}^{N} v_i v_i^T is the information matrix.

Now det(M) is the sum, across all subsets of size k, of the determinants of the information matrices based on just k of the above design points (see theorem 2.3.1 of Fedorov (1972)). Under the model of data missing completely at random, with each observation missing independently with probability ψ, every such subset is fully observed with probability (1−ψ)^k, so the expected determinant is (1−ψ)^k det(M). Hence the same subset of size N from V is optimal. This will not be true for other criteria such as log{det(M)} or det(M^{1/2}) (see Herzberg and Andrews (1976)).
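The expected-determinant identity can be checked numerically by enumerating every missingness pattern for a small design. The sketch below assumes k=2, an information matrix of the form Σ v v^T, independent missingness with probability ψ for each observation, and a hypothetical five-point design; none of these specifics come from the contribution itself.

```python
from itertools import combinations, product

def info_det(points):
    """det of the 2x2 information matrix sum(v v^T) over the given points."""
    a = sum(v[0] * v[0] for v in points)
    b = sum(v[0] * v[1] for v in points)
    d = sum(v[1] * v[1] for v in points)
    return a * d - b * b

# Hypothetical design: intercept plus one regressor, with replication.
design = [(1.0, -1.0), (1.0, 0.0), (1.0, 1.0), (1.0, 1.0), (1.0, -1.0)]
psi = 0.3  # assumed probability that any one observation is missing

full_det = info_det(design)

# Subset expansion: det(M) equals the sum of squared 2x2 determinants
# over all pairs of design points.
pair_sum = sum((u[0] * w[1] - u[1] * w[0]) ** 2
               for u, w in combinations(design, 2))

# Exact expectation over all 2^N patterns (1 = observed, 0 = missing).
expected_det = 0.0
for pattern in product([0, 1], repeat=len(design)):
    kept = [v for v, r in zip(design, pattern) if r]
    prob = 1.0
    for r in pattern:
        prob *= (1.0 - psi) if r else psi
    expected_det += prob * info_det(kept)

print(full_det, pair_sum, expected_det, (1.0 - psi) ** 2 * full_det)
```

Since the expectation is a constant multiple of det(M), ranking designs by the expected determinant gives the same ordering as ranking them by det(M) with no missing data.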

Frontier models

In these models errors ɛ are composed of two components, i.e. ɛ=v+u where v is the usual symmetric zero-mean error term corresponding to measurement error, and u is an asymmetric term corresponding to efficiency. This is an extension of the example of Fig. 1(c) (see Section 3.3) in which z=(t,v,u,x) and y=(t+v+u,x), where the additional term u is an asymmetric error corresponding to some type of (economic) efficiency. Model fZ would additionally include u with a distribution such as a half-normal, exponential or gamma distribution.

We observe t+v+u which, under fY, will not be normal but whose likely moments are


where α=E(u) and γ^2=V(u). This suggests that the error u is not ignorable in general if θTx includes a constant term which becomes confounded with α.
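The confounding of the constant term with α can be illustrated by simulation. The following is a sketch under assumed values: a single normal regressor, normal v, half-normal u (one of the distributions named above, for which E(u)=σu√(2/π)), and ordinary least squares as the fitting method. The parameter values and helper names are hypothetical.

```python
import math
import random

def ols(xs, ys):
    """Least-squares intercept and slope for a single regressor."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return ybar - slope * xbar, slope

rng = random.Random(0)
beta0, beta1 = 1.0, 2.0      # assumed intercept and slope
sigma_v, sigma_u = 1.0, 1.0  # assumed scales of v and u
n = 50000

xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
ys = [beta0 + beta1 * x
      + rng.gauss(0.0, sigma_v)        # symmetric error v
      + abs(rng.gauss(0.0, sigma_u))   # half-normal asymmetric error u
      for x in xs]

alpha = sigma_u * math.sqrt(2.0 / math.pi)  # E(u) for a half-normal u
intercept_hat, slope_hat = ols(xs, ys)
print(f"intercept: fitted {intercept_hat:.3f}, beta0 + alpha = {beta0 + alpha:.3f}")
print(f"slope:     fitted {slope_hat:.3f}, beta1 = {beta1:.3f}")
```

In this set-up the slope is recovered, while the fitted intercept estimates beta0 + alpha rather than beta0, so the two cannot be separated from the first moment alone.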

We can also add some comments on optimal design that arise from Parker's on-going doctoral research. If we assume that

  • (a) the frontier model is linear with a non-zero intercept,
  • (b) inline image),
  • (c) inline image or u∼exp(1/σu) and
  • (d) u and v are distributed independently of each other, and of the regressors,

then

  • (i) D-optimum designs for frontier models with ɛ=v+u are equivalent to D-optimum designs for linear models with ɛ=v (i.e. no asymmetric error term),
  • (ii) the result from (i) is also true for the trace criterion and
  • (iii) for fixed design points, optimum design weights under the c-criterion for frontier models with ɛ=v+u are the same as the optimum design weights for linear models with ɛ=v (i.e. no asymmetric error term).

Peter W. F. Smith (University of Southampton) and Paul S. Clarke (Imperial College London)

We found this to be an interesting and stimulating paper, and have started to investigate its implications for our recent work concerning the family of incomplete categorical data models developed by Baker and Laird (1988). We wish to make three points relating to our work, which has focused on the non-ignorable log-linear model for partially observed two-way contingency tables M=TX+TR. Here T is an outcome variable which suffers from non-response, R is its response indicator and X is a fully observed covariate.

First we note that, although the examples that were presented by Copas and Eguchi consider misspecification of ignorable models while allowing for locally non-ignorable non-response, the proposed framework also permits setting non-ignorable M as the working model fY. Results from Clarke and Smith (2003, 2004) should be useful when calculating the key quantities that are proposed by Copas and Eguchi to investigate local misspecification for M, such as the eigenvalues of Λ and the incomplete-data bias b.

Second, the ignorable model TX+XR for two-way tables is always saturated and will therefore fit the data at least as well as the non-ignorable model M. However, as was noted by Forster and Smith (1998), we do not have the information to assess empirically the conditional independence assumptions that are required to identify the two models. Therefore, we recommend that both models should be considered simultaneously to allow for model uncertainty. Forster and Smith (1998) proposed a Bayesian approach for this. The framework of Copas and Eguchi could be used to develop an alternative non-Bayesian approach. The direction of misspecification uz would be the sum of two components for each model. The uz for the non-ignorable model would have a component for the ‘data model’ (i.e. for the TX parameters) and another for the ‘non-response model’ (i.e. for the TR parameters). The uz for the ignorable model would have the same first component for the data model, but a separate component for the non-response model (i.e. for the XR parameters).

Third, in our study of the finite sample properties of confidence intervals for non-ignorable log-linear models, we found that intervals that are based on the usual asymptotically normal pivotal quantity can have very poor coverage, whereas profile-likelihood-based confidence intervals perform better (Clarke and Smith, 2004). Hence, it might be worthwhile to develop the ideas of Copas and Eguchi for profile-likelihood-based inference.

A. P. Dawid (University College London)

Although it is certainly of importance to investigate biases due to model inadequacy, as the authors have valuably done, it is by no means obvious what we should do with the results of the exercise. In particular, a simple bias correction to a face value analysis of the data may not be appropriate. Thus suppose that the data we observe are subject to a non-ignorable missing data process. Before seeing the data we can calculate a bias correction to take account of that process; but if, by chance, it turns out that the data set actually observed has no missing values, it would surely be inappropriate to apply that correction. More generally, the more missing data we have, the greater the correction we might wish to apply. Do the authors have anything to say about how we might construct such ‘post-data’ misspecification corrections?

Chris Skinner (University of Southampton)

My comments build on the measurement error examples in Sections 1–3.

The paper's framework seems natural and potentially valuable for understanding possible impacts of measurement error and for considering ways of allowing for these impacts. Suppose, for example, that, under the ‘working’ model fZ, inline image is the variable of interest, inline image is measurement error, y=z1+z2 is observed, θ is the parameter of interest and inline image and inline image are assumed known. The proportion of information that is retained in y is inline image, the ‘reliability ratio’ (Fuller (1987), page 3). Under the gZ-model for the ‘worst case’ misspecification direction in equation (27), we find inline image (with inline image as before), i.e. the worst bias when estimating θ will occur if the mean of additive measurement error is erroneously specified, which is a natural result. There seems scope for extending such analysis to more complex measurement error models. There is an established literature using small measurement error asymptotics (e.g. Chesher (1991) and Skinner and Humphreys (1999)), to assess bias impacts of measurement error and to suggest bias-corrected estimators under assumptions about the extent of measurement error. This literature shares the use of local model perturbation with the paper but differs by assuming small inline image. It may be interesting to explore a combination of these two approaches.

The broad framework that is provided for sensitivity analysis seems helpful in the case of measurement error, but I had more difficulty with the practical interpretation of the specific approach of Section 6.2 in this case. The hypothetical scenario where the complete data are known corresponds, in the measurement error setting, to having internal validation data (Carroll et al. (1995), page 12) on the whole sample. This is not a scenario that seems likely to have a natural interpretation for the practitioner, since if z1 is observed for all units then there is no need to use y to estimate θ. Sample sizes for validation data will not in general be the same as for the main data, implying a certain arbitrariness in the sample size assumptions in the scenario.

Michael Sørensen (University of Copenhagen)

I would like to discuss briefly how the results in this interesting paper can be applied to study the effects of misspecification when a continuous time stochastic process is sampled at discrete time points. This problem can often usefully be viewed as an incomplete-data problem; see for example Bladt and Sørensen (2005). The authors have developed their theory in the framework of asymptotic likelihood theory for independent data, but essentially the same theory can be derived for stochastic process models under conditions ensuring the usual asymptotic results; see for example Barndorff-Nielsen and Sørensen (1994).

Here I shall limit myself to considering a simple example: the Ornstein–Uhlenbeck process given by


where W is a Wiener process. Suppose that the value of X has been observed at the time points 0,Δ,…,nΔ. In the (unrealistic) situation, where the process is observed continuously in the time interval [0,T],T=nΔ, there is a well-known explicit expression for the likelihood function; see for example Küchler and Sørensen (1997). For models in the ɛ-neighbourhood of the Ornstein–Uhlenbeck model, the quantity uZ will, by results in Jacod and Ménin (1976), necessarily be of the form


where a_s may depend on the behaviour of X between time 0 and s and is normalized such that


Under the alternative model X solves


Thus all models in the ɛ-neighbourhood are models where the drift has been perturbed. The models in the neighbourhood can be highly non-Markovian, since a might depend on the entire past of X. It is certainly interesting to study the effect of a misspecified drift, but it is important to note that the fact that the models in the ɛ-neighbourhood have a density with respect to the original model excludes misspecification of the diffusion coefficient.

In the simple example that is considered here, the Fisher informations IZ and IY can be calculated explicitly, but for general continuous time models the discrete time likelihood is only rarely explicitly available. For diffusion models the information matrix IY can be approximated for instance by means of the approximation to the likelihood function by Aït-Sahalia (2002). When the EM-algorithm can be applied, as for example in Bladt and Sørensen (2005), the observed information matrix corresponding to IY can be obtained.

Anthony C. Atkinson (London School of Economics and Political Science)

In the discussion of this interesting paper, Torsney and Parker introduce ideas of experimental design when some data are potentially missing. Optimum designs, including D-optimum designs, concentrate the experimental effort at a few design points in the experimental region 𝒳; several replicate readings may be taken at each of these points.

Herzberg and Andrews (1976) considered the effect of individual observations being missing completely at random (MCAR). Using simple examples they showed that designs that have more than the minimum number of design points in 𝒳, and so less replication, may be less likely to be completely non-informative. Of course, there will be some loss of efficiency if no data are missing.

Another mechanism may sometimes be important when all observations in some small regions Δxk ∈ 𝒳 are missing. We could call it MCARX. Under MCARX the effect of replication is much more severe than that calculated by Herzberg and Andrews. To overcome this effect, in extreme cases designs may spread observations as uniformly as possible throughout the experimental region. Bates et al. (1996) described the use of such designs in error-free simulation experiments. One example in standard experimental design is when the conditions in some part of 𝒳 are so severe as to produce a different response; tar instead of a clear liquid in a chemical experiment. In both survey design and agricultural experiments blocked geographically, a whole block may be at risk from flood, locusts or war. In phase I clinical trials certainly suboptimum designs are used that may employ an appreciable number of dose levels to avoid regions of toxicity. This is something to be pleased about if you volunteer for a clinical trial although, as a patient treated in accordance with the results of the trial, I may be more interested in good information.

Of course, the literature on missing observations that was cited by Copas and Eguchi in Section 3.1 discusses missing values of xij. My point is that spatial or other proximity may cause all observations in some Δxk to be missing. The problem is, I think, more one of survey and experimental design than of analysis. ‘Once it's gone, it's gone.’

The following contributions were received in writing after the meeting.

C. Chatfield (University of Bath)

Model uncertainty is perhaps the most neglected outstanding problem in statistical science. My impression is that the vast majority of inference is still made conditional on a model that is assumed to be both true and known. As useful general references, I mention two of my own (Chatfield (1995) and Chatfield (2001), chapter 8) and the book on modelling by Burnham and Anderson (2002) that is unusual in not assuming the existence of a true model.

As well as model uncertainty, this paper looks at another important outstanding problem, namely how to handle incomplete data. The key equation in the paper is equation (16) and the authors show how this can facilitate handling incomplete data when model uncertainty is present. Thus, from my reading of the paper, I wonder whether a clearer title for the paper would be ‘Bias arising from incomplete data in the presence of local model uncertainty’.

I am less sure whether, and if so how, equation (16) might be used to tackle the typical problem that is faced by a modeller with complete data but incomplete information about the model, as for example the time series analyst trying to decide whether an appropriate model is an AR(1), or AR(2), or MA(1) or …. If equation (16) can help, could we have an example?

Section 8 (re)raises the fundamental question about how to combine model uncertainty with sampling variability. The authors go on to say that the ‘Bayesian paradigm provides a complete solution, at least in principle’. For this to be true, the modeller would need to know the relevant equation (16), assume that it is true and have priors for all necessary quantities. This is unlikely in practice, to say the least, and I doubt that this sort of statement is helpful. The authors also comment that a general formulation for handling model uncertainty when the model is chosen to depend on the data (as it often is) is ‘difficult’. I am tempted to change this to ‘impossible’.

Finding some of the mathematics rather difficult, I looked with particular interest at the main example in Section 7. Although I do not understand all the details, it seems to me that the example is mainly concerned with model sensitivity rather than model uncertainty. It is, of course, valuable to investigate how sensitive any conclusions are to the model assumptions, but this does not solve the general model uncertainty problem. The example assumes knowledge of a log-linear model and of likely departures from it, which is more information than the modeller often has.

Robert Curnow (University of Reading)

I congratulate the authors on an interesting paper on an important topic.

The example that they use to illustrate their approach, a meta-analysis of the risk of lung cancer from passive smoking, is taken from Hackshaw et al. (1997) and, as the authors state, was used by the Scientific Committee on Tobacco and Health in their report to the Chief Medical Officer at the Department of Health (in 1998). I was a member of this Scientific Committee at the time of the report. The authors do not make clear that the advice in the report took account of much wider evidence on the likely risks from environmental tobacco smoke. This included a parallel study of the literature on passive smoking and ischaemic heart disease (Law et al., 1997) and evidence concerning sudden infant death syndrome and serious respiratory illness, asthmatic attacks and middle ear disease in children.

We did consider the possible biases in the epidemiological studies that were mentioned by the authors. We may have underestimated the evidence of a publication bias. Certainly possible important biases should be identified and dealt with as well as possible. However, the major problem is always the translation of the results of analyses into relevant information about the likely consequences of alternative policies. We estimated that the relative risk of lung cancer from passive smoking of 1.24 translated into several hundred extra lung cancer deaths a year. We concluded that the similar estimated relative risk for ischaemic heart disease translates into much larger numbers and ‘represents a substantial public health hazard’. We hoped that these estimates together with measures of their uncertainties would be used by policy advisors and decision makers in comparing the costs and benefits of alternative policies. Confidence intervals as in Fig. 6 of the paper are interesting but the cost–benefit comparisons will surely influence our attitude to uncertainties in the estimates and in the relation of these uncertainties to the unknown level of correlation of exposure and the confounder. In this context, the importance of an interval at an arbitrary confidence level not including 1 is far from clear. Any rule that relative risks of less than 2 (quoted in Section 7) should be ‘regarded with considerable caution’ takes no account of the costs and benefits of alternative policies.

David Draper (University of California, Santa Cruz)

I can only add two small notes to this interesting and important paper.

  • (a) In Section 8 the authors
  • (i) mention data-driven model search as normal practice,
  • (ii) note that after such a search ‘it is misleading to use (unconditional) sampling distributions as if the model were fixed’ and
  • (iii) conclude that ‘A general formulation of this seems difficult’, by which they presumably mean ‘difficult in the frequentist paradigm employed in the paper’ (in the Bayesian approach, by contrast, it is natural to cope with (much of) this problem by integrating over the model uncertainty that is uncovered by the search).

A noteworthy reference in the econometrics literature that makes a good start on the general frequentist formulation that is desired by the authors is the work of Pötscher (1991), updated recently in Leeb and Pötscher (2003).

  • (b) The key idea of Section 6.2, which is natural from the Bayesian viewpoint and was not always as easy to find when reasoning in a frequentist way before the advent of random-effects models, is to treat bias of unknown magnitude and direction as variance.

The authors use this idea to motivate what in physics used to be called a ‘fudge factor’: if you are not sure whether your measuring instrument is biased, construct the usual 95% confidence interval based on your possibly biased measurements and widen it by a multiple of k (it was common in physics some decades ago to use values of k in the range (1.5, 2.5), which are larger than the √2 that is recommended by the authors). Fig. 7, based on a plot in Henrion and Fischhoff (1986) and discussed further in Draper et al. (1993), illustrates the bias dilemma that was faced by 20th-century physicists in estimating the speed of light c. Measurements by using the best available technology were taken seven times between 1929 and 1973, and the graph plots 68% Gaussian interval estimates assuming no bias (some of these intervals are so narrow as to appear just as points); the currently accepted value of c from 1999 is indicated by the dotted line. The differences from one decade to another in the current estimate of c are not explainable by models which reflect only sampling variability and no bias, which motivated the fudge factor solution mentioned above in the early part of last century. Random-effects meta-analytic models, for instance of the form


where yij is measurement j on occasion i and bi is the unknown bias on that occasion, offer a more principled way to arrive at an appropriate fudge factor. Of course such models can only be fitted when an effort is made to replicate the measurement process by using several different possibly biased methods, something which is attempted far more often in fields such as physics than in (say) the social sciences.
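The link between treating bias as variance and a fudge factor can be sketched as follows. This is a simplified illustration, not the authors' derivation: it assumes that the unknown bias has mean zero and standard deviation τ, independent of the sampling error with standard error se, so that the total variance is se²+τ² and the interval is widened by k=√(1+τ²/se²). Function names are hypothetical.

```python
import math

def fudge_factor(se, tau):
    """Widening factor k when a zero-mean bias with SD tau is added as variance."""
    return math.sqrt(se ** 2 + tau ** 2) / se

def widened_interval(estimate, se, tau, z=1.96):
    """Nominal 95% interval widened by the fudge factor."""
    half_width = z * se * fudge_factor(se, tau)
    return estimate - half_width, estimate + half_width

# Doubling the variance corresponds to tau = se, i.e. k = sqrt(2).
k_equal = fudge_factor(1.0, 1.0)

# Bias SD (in units of se) implied by the larger factors quoted from physics.
tau_for_k = {k: math.sqrt(k ** 2 - 1.0) for k in (1.5, 2.0, 2.5)}

print(f"k for tau = se: {k_equal:.4f}")
for k, tau in tau_for_k.items():
    print(f"k = {k}: implied tau / se = {tau:.3f}")
```

Read this way, factors in the range (1.5, 2.5) amount to assuming a bias standard deviation of roughly 1.1 to 2.3 times the sampling standard error.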

Figure 7.

Estimates of the speed of light from 1929 to 1973 (- - - - - - -, currently accepted value since 1999)

Claire Ferguson, Neil Henderson, Mathias Onabid, Helen Parker, Gareth Pritchard, Maarya Sharif, Ximin Zhu and Ernst Wit (University of Glasgow)

The main result of the paper is the so-called ‘doubling the variance’ rule. Although elegant in its simplicity, the authors remark in passing (Section 6.2) that the bounds are not optimal. As the authors do not work out these optimal bounds explicitly, we make this the aim of our contribution here. We believe that such sharper bounds are of potential use to applied statisticians.

Our comments are directed towards Fig. 4 in the paper, where the authors provide a plot of k*(α,λ), varying over λ for specific values of α. As λ is determined by the amount of information that is lost by observing only y instead of z, this quantity is typically unknown in practice. Therefore, we suggest that it would be more useful to plot k*(α,λ) for the worst case scenario of λ, i.e. k*(α)=max{k*(α,λ)}, against different significance levels α. This would then provide a more useful confidence interval inflation factor for applied statisticians.

We derive k*(α,λ) and subsequently k*(α) from the equation below equation (68),


Solutions for equation (86) were obtained for fixed α and λ via a grid search. Fig. 8 suggests that the upper bound of √2 that is given in equation (69) only applies to a 100% confidence interval and may be reduced to a value of 1.35 for a more typical case, where the significance level is not below 0.01. Fig. 8 also summarizes some of the most commonly used significance levels and their corresponding values of k*(α). We therefore propose Fig. 8 as a more useful alternative than Fig. 4 in the paper.

Figure 8.

Doubling the variance is a suboptimal result for fixed values of the level of significance: the inflation factor for the confidence intervals can be substantially smaller, depending on the level of significance that is required

Sander Greenland (University of California, Los Angeles)

Copas and Eguchi tackle elegantly the problem of bias dodged by conventional statistics. I prefer subjective Bayesian approaches, in the belief that constructing informative priors helps to ground models in the science, clarifies underlying assumptions and allows thorough accounting for sources of uncertainty. None-the-less, Copas and Eguchi provide an improvement over conventional methods and a welcome step towards realistic analysis of observational studies. I regard most of the claimed simplicity and interpretability advantages of conventional methods as illusions produced by packaged software and by glossing over their misinterpretation and unrealism in observational settings (Berk, 2004; Greenland, 2005). More realistic methods make it possible to do a better job; although their incorporation into software will no doubt lead to abuse, abuses can be studied to improve teaching of the methods.

Still, certain pragmatic concerns about my recent bias modelling paper (Greenland, 2005) may have parallels for the Copas–Eguchi approach. Although valuable for someone who is immersed in both theory and application, typical scientists may find their approach a magic box yielding prescriptions (‘double the variance’) whose justification, meaning and limitations (to ‘local model uncertainty’) are unclear. To minimize misinterpretation, the limitations need to be underscored in the simplest possible terms. Especially, if an approach produces only an O(n^{-1/2}) absolute expansion of interval estimates, then that approach is insufficient when O(1) bias is the largest concern, as in meta-analyses. Also, variance inflation alone cannot capture the asymmetric uncertainty effect of measurement errors (even when those errors are symmetrically distributed). Both points are seen in Greenland (2005), Table 2, where the final variance is over 20 times the conventional variance and the final interval is highly log-asymmetric.

In Section 1 Copas and Eguchi imply that correction for confounding would require ‘values of all possible confounders for each subject’. Fortunately the requirement is not that stringent: at least since Moses (1969) it has been recognized that sufficient confounding adjustment can be attained through a greatly reduced confounder summary or subset; Greenland et al. (1999) gave graphical and probabilistic sufficiency criteria. In Section 5.2, c could (perhaps should) be taken as a minimal sufficient confounder summary. In this regard, the propensity score is a sufficient but not minimal sufficient summary: although it is the coarsest balancing score for the included covariates (Rosenbaum, 1995), minimal sufficient summaries can be much smaller. For example, suppose that all causes of treatment are independent of disease given treatment; then there is no confounding and so the minimal sufficient confounder subset is empty, but the propensity score may be arbitrarily complex.

Paul Gustafson (University of British Columbia, Vancouver)

There is much to think about in this fascinating paper on the interplay between data which are incomplete or ‘degraded’ in some sense and models which are not quite right. Of course both incomplete data and misspecified models feature heavily in many real applications of statistical methods. This reader expects to return to the paper again and again in the future, given its rich methodological developments and its interesting set of examples.

On casual glance, the authors’ findings concerning the ‘incomplete-data’ bias, defined as θgY − θgZ, seem to portray an undesirable interaction between the absence of complete data and model misspecification, i.e. the effect of misspecification is always worse in the incomplete-data case. However, as the authors are careful to point out, such an interpretation is tied up with the Royall and Tsou (2003) assumption about the object of inference matching the object of interest. More generally, the effects of model misspecification and data imperfections can be offsetting. For instance, Gustafson (2002) looked at misspecified linear models applied to mismeasured predictors. Simple forms of misspecification are considered, such as ignoring interaction or curvature in the regression function. The inferential interest is in the relationship between the response and the true but unobserved predictor, but no adjustment is made for the predictor measurement error. Trade-offs arise in that the squared bias that is induced by model misspecification varies inversely with the squared bias that is induced by measurement error. Although the framework does not match that of the present paper exactly, both works point towards a rather nuanced interplay between incomplete data and model misspecification.

Another finding in Gustafson (2002) which jibes with the present paper concerns the detectability of model misspecification. The elegant discussion of this by Copas and Eguchi matches with a graphical illustration in Gustafson (2002) that model misspecification and incomplete data can induce a ‘double blow’. Not only does the model misspecification induce a bias, but also this misspecification is more difficult to detect from incomplete data than from complete data. Specifically, Gustafson (2002) illustrated that, as the predictor measurement error increases, regression diagnostics from fitting the response to the mismeasured predictor are less able to detect curvature in the underlying regression function for the response and the true predictor.

Manabu Iwasaki (Seikei University, Tokyo)

The authors are to be congratulated for their thought-provoking and mathematically lucid paper. They broaden the concept of incompleteness and have reached a remarkably simple result, namely their double-the-variance rule. A simple result derived from deep insights should be very useful in various research fields: a good example is the Akaike information criterion.

Among the many incomplete-data problems that are discussed by the authors I would like to ask them about the relationship between sensitivity analysis and the double-the-variance rule. In observational studies with possible hidden bias, for several conceivable settings of the bias, sensitivity analysis provides the magnitude of the discrepancy of estimates from the one under the ideal randomization model: see Rosenbaum (2002). In some cases even a small bias in a certain unfortunate direction may cause a seriously large discrepancy, whereas big bias may not largely affect the result that is actually obtained. Such information is quite important for practical applications to judge the extent to which the inference is influenced by hidden bias. The authors’ double-the-variance rule is simple but seems too simple to provide us such useful information. How do we interpret the result that is given by the rule? Is it a worst case in the setting of local modelling?

I have one additional comment on terminology. The authors use ‘true model’ as a reference model in their argument. I think that the term true model is used by the authors to mean the true mechanism of the data generation process. It might be misleading because ‘Models, of course, are never true, but fortunately it is only necessary that they be useful’ (Box (1971), page 2).

Charles F. Manski (Northwestern University, Evanston)

The local likelihood-based analysis that is undertaken by Copas and Eguchi requires that we specify a full parametric model for a sampling process and take the model very seriously. Their analysis does not demand that the model be precisely correct but it does require that deviations be small, indeed vanishingly small in the specific sense that they define. I admire the technical dexterity that the authors exhibit, but I am not sanguine that their work will help many empirical researchers to confront real problems of incomplete data.

The central problem is the rarity of the circumstances in which it is credible to assume that an actual sampling process lies within a small neighbourhood of a specified parametric model. To make the point, I shall consider an example with which I am familiar. Copas and Eguchi discuss the so-called ‘Heckman model’, a parametric latent variable model for missing outcome data that was developed by various econometricians in the 1970s and that was particularly advocated at the time by James Heckman. Empirical researchers, especially in labour economics, initially embraced this model. However, methodological studies and empirical experience soon made it clear that the model is extremely fragile, the identification of its parameters resting on linearity and normality assumptions that rarely if ever are credible in economic research. As a result, the standing of this model diminished sharply by the mid-1980s. The slight weakening that is proposed by Copas and Eguchi will not enhance its empirical relevance.

My own nonparametric research on incomplete-data problems is a pole away from the approach of Copas and Eguchi. Beginning in Manski (1989) and continuing through many journal papers and two books (Manski, 1995, 2003), I first ask what observation of the many-to-one function y=h(z) by itself reveals about P(z), the probability distribution of z. I then go on to ask what observation of y combined with relatively weak but credible assumptions reveals about P(z). The generic answer, whose specifics depend on the nature of h(·) and the maintained assumptions, is that observation of y restricts P(z) to some set of feasible distributions, called its identification region. The analytical challenge is to characterize this region constructively and to show how it may be estimated from finite sample data. Copas and Eguchi may, perhaps, be unaware of this research given that most of it has appeared in the econometrics literature. However, contributions in the statistics literature include Manski et al. (1992), Horowitz and Manski (2000) and Manski (2003), as well as the related work of Balke and Pearl (1997).

John W. McDonald (University of Southampton)

I would like to ask how the doubling variances idea would work in practice in the context of the meta-analysis example of passive smoking and lung cancer. Should the variances be doubled for each of the individual studies, but then not doubled for the random-effects model? Or should the variances not be doubled for each of the individual studies, but only doubled for the random-effects model? Or should the variances be doubled at both the individual study level and for the meta-analysis? If this idea catches on, one future potential practical problem in performing a meta-analysis will be if some, but not all, individual studies double variances, but do not report this information (an extra level of uncertainty).

Clare McGrory, Sarah Barry, Alastair Fearnside, The Mahn Nguyen, Rossella Lo Conte, James Weir, James Miller, Angela Recchia and Ernst Wit (University of Glasgow)

When stripped down to its bare essentials, the argument in the paper goes something like this:

  • (a) we observe incomplete data y to perform inference about an identifiable θ;
  • (b) if we had observed the complete data z, then we could have tested whether or not z comes from fZ or gZ;
  • (c) under the hypothetical assumption that we have done such a test, we hypothetically did not reject fZ as the true model;
  • (d) conditional on not rejecting the goodness-of-fit test, the actual confidence interval for θ is at most √2 times wider than the marginal fY would suggest.

The epistemological status of steps (b) and (c) is not made particularly explicit in the paper. Do the authors believe that any data set from a non-designed experiment can always be regarded as potentially incomplete? To put it in simple terms, are the authors suggesting that unfortunately we always live in the fY-world without necessarily knowing what the complete data might look like? In that case, the √2-rule would not be merely a rule of thumb, but a universal correction factor that should be applied out of epistemological prudence—without ignoring the real possibility that y makes inference about the parameter of interest impossible.

If the paper has such universal implications, are the examples not slightly misleading? In the passive smoking example, the authors apply the double-the-variance rule because they believe that there may be a confounding variable, such as poor diet, which may be correlated with the exposure variable. However, if there is any reason to believe that such a possible confounder exists, then assumption (c) seems at least dubious, if not unreasonable. In particular, in the passive smoking example it would seem more sensible to consider inference based on a sensitivity analysis using the full curves C1 in Fig. 6 and not C*. Conclusions would be quantitatively and qualitatively different even with a correlation as low as 0.03 between the exposure and the confounder. Only in the case where all reasonable confounding variables have been ‘disproved’, in the sense that they are not found to be significant, would it then seem relevant to use the double-the-variance rule to account for some other confounder which is unknown?

Vilda Purutçuoḡlu and Ernst Wit (University of Glasgow)

In this comment we consider a hypothetical extension of the passive smoking example. We assume that the complete data consist of the total number of lung cancer cases in a passive smoking household environment, in which there are m passive smokers as the result of one actual smoker. We aim to study the effects of model misspecification, in the form of overdispersion, on the width of the confidence interval within the framework that is presented in the paper.

Let the complete data z stand for the number of lung cancer cases in a family. Under the working model fZ, the data z are binomially distributed, Bi(m,θ). The true overdispersed model gZ claims that the data actually came from gZ|Θp, where gZ|Θ=fZ and p is a beta(α,β) distribution.

It is easy to show that the model misspecification is given as


The misspecification size ɛ depends intrinsically on α and β. In fact, Fig. 9 shows that by letting α and β be O(n^{1/2}) we achieve the critical misspecification rate ɛ∼O(n^{-1/2}). A special case occurs for m=1 when the misspecified model fZ and the correctly specified model gZ are in fact one and the same.

Figure 9.

n^{1/2}ɛuZ has constant variance for α and β of order O(n^{1/2}) for m=3, which implies that this achieves the required ɛ∼O(n^{-1/2})
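The two qualitative claims here, that the overdispersed and working models coincide when m=1 and that the departure shrinks as α and β grow, can be checked numerically. The sketch below is only an illustration under stated assumptions: it matches the binomial θ to the beta mean α/(α+β), and it measures the departure by total variation distance, which is a convenient stand-in chosen here and not the authors' ɛ. All function names are hypothetical.

```python
import math

def binomial_pmf(z, m, theta):
    # Working model fZ: Bi(m, theta).
    return math.comb(m, z) * theta ** z * (1.0 - theta) ** (m - z)

def log_beta(a, b):
    # Logarithm of the beta function B(a, b).
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def beta_binomial_pmf(z, m, alpha, beta):
    # Overdispersed model gZ: binomial mixed over a beta(alpha, beta) prior.
    return math.comb(m, z) * math.exp(
        log_beta(z + alpha, m - z + beta) - log_beta(alpha, beta)
    )

def total_variation(m, alpha, beta):
    # Distance between gZ and the binomial with the same mean (an assumed
    # calibration of theta; the paper's own calibration may differ).
    theta = alpha / (alpha + beta)
    return 0.5 * sum(
        abs(beta_binomial_pmf(z, m, alpha, beta) - binomial_pmf(z, m, theta))
        for z in range(m + 1)
    )

if __name__ == "__main__":
    # m = 1: the two models coincide up to rounding error.
    print("m=1:", total_variation(1, 2.0, 5.0))
    # m = 3: the departure shrinks as alpha + beta grows with the mean fixed.
    for scale in (10.0, 100.0, 1000.0):
        print("m=3, alpha+beta =", scale, total_variation(3, 0.1 * scale, 0.9 * scale))
```

How the scale of α and β should be tied to n in order to reach the local rate is the discussants' result and is not derived by this sketch.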

Paul R. Rosenbaum (University of Pennsylvania, Philadelphia) and Alan J. Salzberg (Quantitative Analysis, Inc., New York)

In one interesting theme of their interesting paper, Copas and Eguchi observe that model uncertainty or sensitivity to bias has both magnitude and direction, with different directions affected differently. Absence of identification (λmin=0) is compatible with immunity to problems in certain directions (λmax=1); an information matrix may be singular but not zero. Consider a 2^3 complete factorial Z with constant, three main effects (coded ±1), three two-factor interactions and one three-factor interaction, with independent normal errors having variance σ^2, so the 8×8 information matrix IZ is a diagonal matrix with all diagonal elements equal to 8/σ^2. If a half-fraction Y is selected to alias the constant and the three-factor interaction, the 8×8 information matrix IY has 4/σ^2 on both diagonals and 0s elsewhere; then the eigenvalues in Section 2 are (1, 1, 1, 1, 0, 0, 0, 0), so there is information but effects are aliased. An unwise form of replication does the same half-fraction twice, yielding eigenvalues (2, 2, 2, 2, 0, 0, 0, 0), so variability is reduced but aliasing is not. A wise form of replication uses the complementary fraction or foldover (Box and Wilson, 1951), yielding eigenvalues (1, 1, 1, 1, 1, 1, 1, 1), so aliasing is reduced. We shall briefly discuss how the same issues arise in observational studies.
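The eigenvalue patterns quoted above can be reproduced with a short calculation. The sketch below assumes that the relevant matrix is the relative information IZ^{-1}IY for the saturated linear model (in which σ^2 cancels) and orders the columns as constant, A, B, C, AB, AC, BC, ABC. The function names and the small Jacobi eigenvalue routine are illustrative choices, not part of the paper.

```python
import math
from itertools import product

def model_row(a, b, c):
    # Columns: constant, A, B, C, AB, AC, BC, ABC (factors coded -1/+1).
    return [1, a, b, c, a * b, a * c, b * c, a * b * c]

FULL = [model_row(a, b, c) for a, b, c in product((-1, 1), repeat=3)]
HALF = [row for row in FULL if row[7] == 1]  # ABC = +1: constant aliased with ABC

def relative_information(runs, n_full=8):
    # IZ^{-1} IY with IZ = (8 / sigma^2) * identity and IY = X'X / sigma^2.
    p = len(runs[0])
    return [[sum(r[i] * r[j] for r in runs) / n_full for j in range(p)]
            for i in range(p)]

def eigenvalues(sym, sweeps=50, tol=1e-20):
    # Cyclic Jacobi rotations for a small symmetric matrix.
    n = len(sym)
    a = [row[:] for row in sym]
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-15:
                    continue
                # Angle that zeroes the (p, q) entry.
                theta = 0.5 * math.atan2(2.0 * a[p][q], a[q][q] - a[p][p])
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p] = c * akp - s * akq
                    a[k][q] = s * akp + c * akq
                for k in range(n):  # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k] = c * apk - s * aqk
                    a[q][k] = s * apk + c * aqk
    return sorted((a[i][i] for i in range(n)), reverse=True)

if __name__ == "__main__":
    designs = [
        ("half-fraction", HALF),
        ("same half-fraction twice", HALF + HALF),
        ("half-fraction plus foldover", FULL),
    ]
    for name, runs in designs:
        lam = eigenvalues(relative_information(runs))
        print(name, [round(v, 6) + 0.0 for v in lam])
```

With this column ordering each effect in position i is aliased with the effect in position 9−i, which is why the half-fraction information matrix has non-zero entries on both diagonals.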

Shadish et al. (2002) added structures to immunize observational studies to specific biases, without eliminating all biases. This is formalized in Salzberg (1999) where certain non-ignorable departures from random treatment assignment do not perturb certain statistics, and illustrated by New York's litigation against Bandolene Fuels alleging overbilling for heating fuel. Fuel consumption was contrasted in 1979, 1980 and 1981, in two groups of buildings: those serviced by Bandolene only in 1980 and those never serviced by Bandolene. The 6=3×2 distributions of fuel consumption exhibit a remarkable pattern: a dramatic increase in fuel consumption only in 1980 only in buildings that had been serviced by Bandolene. Although not an experiment, there is immunity to specific unobserved selection biases: a cold winter in 1980, or poor insulation in the Bandolene buildings or deteriorating insulation in Bandolene buildings could not produce the observed fuel consumption. Still, the ‘Bandolene effect’ is aliased with an unobserved bias that assigned to Bandolene those buildings that would consume a large amount of fuel only in 1980, however implausible such a bias may seem. The treatment effect is immune to bias in certain plausible directions, but aliased in less plausible directions.

If an observational study is replicated, biases may replicate along with effects. Replication reduces uncertainty about unobserved biases if replicates vary the direction of potential biases, and ways to do this are discussed in Rosenbaum (2001).

The authors replied later, in writing, as follows.

We thank the discussants for their comments. An unusually large number of discussants could mean either that the paper is unusually interesting, or that it is unusually contentious. We hope for the former but have some fears for the latter. The further examples and insights into our general theme that both model uncertainty and incompleteness have direction as well as magnitude (measured by Λ, ɛ and uZ) are very instructive. However, one or two discussants take us to task for promoting the ‘double-the-variance rule’ as if it were a kind of panacea for all problems of bias. To suggest that there is a simple solution to these deeply challenging problems was certainly not our intention.

Several contributors raise the question of whether ɛ is O(n^{-1/2}) or O(1). Of course by the former we do not mean in any literal sense that as you acquire more data the biases will eventually vanish, just as when we use the simple asymptotic recipe inline image for a confidence interval (based on n→∞) we do not really believe that our data are part of an ever expanding experiment. In fact, as Dr Little points out, the reverse is more likely to be the case; big data sets are probably less likely to control effectively for sources of bias. These limits are just mathematical devices for deriving useful approximations for finite n. In our case ɛ=O(n^{-1/2}) yields remarkably simple, and almost universal, approximations for bias as we have shown. No such transparent approximations will exist for the limit n→∞ with fixed ɛ, i.e. bias may depend in non-trivial ways on higher order modelling assumptions.

Our crucial distinction is between complete data z and incomplete data y. For z, Dr Little expresses it more elegantly than we have done when he says ‘small ɛ… is defensible in complete-data modelling, since judicious model checks should be able to detect larger-than-local departures’. We formalize this by arguing that ɛ=O(n^{-1/2}) gives a sensible asymptotic theory for such model checks. But, for y, no amount of data checking can insulate us from the possibility that ɛ is large. In that case, we argue, the bias is likely to be worse than the corresponding bias for z-inference: hence the variance inflation factor of Section 6 as a lower bound to uncertainty.

Some discussants question whether a lower bound to uncertainty is of any practical use. We see this as a small but sensible improvement over naïve methods which every one recognizes are misleading—we agree with Professor Greenland's comment ‘conventional methods [are] illusions produced by packaged software … glossing over … unrealism in observational settings’. Consider medical papers which report P-values from observational studies using naïve null distributions inline image, so inline image. Then conventionally (assuming ɛ=0) we might

  • (a) dismiss effects as noise if P>0.05 and
  • (b) flag up effects as potentially interesting if P ≤ 0.05.

When ɛ is unknown then (a) still seems reasonable, but inferences in case (b) are now inconclusive in the sense that effects with P ≤ 0.05 could be explained by inadequacies in the design. We have replaced P by the more cautious inline image. Case (b) is still inconclusive (ɛ might be large) but we now include into the acceptance region cases with P ≤ 0.05 < P*, on the grounds that they could reasonably be explained away by a combination of sampling error plus the level of model uncertainties which we customarily ignore (even if we could check the model carefully by recovering all missing observations). If inline image (the worst case, as in our example in Section 7) this is equivalent to tightening the conventional significance threshold from P=5% to P=0.3%. In practice this would cast into doubt a sizable proportion of naïve significance claims.

In Professor Greenland's example, however, this adjustment would cut little ice as it seems that in his study ɛ is probably quite large. Similarly, in Dr Draper's example, the biases in the early studies appear to be so great that a much higher ‘fudge factor’ (to use his term) than √2 would be needed to bring the experiments into line. However, in our experience, researchers often ‘trawl for significance’ while conveniently forgetting the potential biases that they have induced by ignoring missing observations. Even 5% missing data (k=1.2), which would be a wonderfully high response rate even for well-designed surveys, tightens the conventional significance threshold from P=5% to P=1%.

Dr Little says that ‘double the variance’ is unnecessarily crude in that it ignores the amount of missing data. We agree; when λ is known (as for missing data) we should use k, although our Fig. 4 shows that λ must be quite close to 1 (little loss of information) for k to be close to k=1. The simplicity of k=√2 is that often we shall not know the value of λ, or even be able to identify z or fZ (save that y=h(z) for some function h). For example in practice we shall not know the strength of potential confounders. Dr Ferguson and her colleagues have such cases in mind when they replace √2 by the more careful calculations of the maxima over λ in their Fig. 8. However, we remain to be convinced that great accuracy is justified, given the rather arbitrary nature of these arguments. Included in these is the convenient, but not entirely convincing, assumption in Section 6.2 that the same significance level is used in the model diagnostics as in the confidence intervals.

Several points that were raised by discussants concern finite sampling aspects and so are beyond the scope of our strictly asymptotic analysis. Our asymptotics with ɛ=O(n^{-1/2}) mean that the first-order effect of model misspecification is in the bias and not the variance. This means that there is no trade-off between bias and variance, as Dr Longford points out. In finite samples the variance is also affected, sometimes markedly so. In this case it is not surprising that finite sample coverage of confidence regions based on the naïve asymptotic pivot can be poor (Dr Smith's last comment). Our only immediate comment to Professor Dawid's challenging question is to point out that in practice nuisance parameters governing the missing data process will be estimated from the data, and so if there happen to be no missing observations then parameters like ψ in Section 3.1 will have inline image and so estimates of bias will be 0. Equivalently, asymptotic inference will be essentially the same as conditional inference given the observed incidence of missing data. It will be interesting to develop proper conditional inference in our setting, conditioning on the actual outcomes of the data coarsening process that is implicit in our function h(z).

As noted by Dr Chatfield, our discussion of model uncertainty does not cover more general problems of model choice, such as the problem which he mentions of order determination in time series. As discussed by Dr Manski, we fix on a single model fZ and discuss local and essentially nonparametric departures belonging to a tubular neighbourhood around that model. Several problems that were mentioned by discussants do not fall within this formulation, such as the alternative approaches to bias problems that were listed by Dr Liao. Similarly, Professor Diggle's intriguing example concerns two specific alternative models, again not within our framework. A similar problem but closer to our set-up would be to take model 1 as g, model 2 as f and ν as the analogue of ɛ. Then we require f=g if ν=0, so if we retain parameter μ in f we need to redefine μ as μ−1 in g. Then for small ν we find that the local bias, which is defined to be the limiting value of inline image, when model f is fitted to data from model g, is approximately b=2ν^3 φ(1/ν). As required by our approach, this tends to 0 as ν→0.

Our examples discuss only single sources of bias, but the formulation naturally extends to multiple biases acting simultaneously, e.g. publication bias and hidden confounding in the passive smoking studies. As we have shown in Section 4, a global uZ can be split into meaningful directional components. Professor Skinner suggests combining local bias with small σ asymptotics in measurement error problems. Let


and gZ=fZ exp (ɛuZ), where θ=(μ,σ^2). As he points out, worst case local measurement bias perturbs μ, whereas the variance of measurement error perturbs σ^2. Consider the particular class of uZs that is given by equation (27) and hence the direction uY reduced from y=z1+z2 is


for (−1 ≤ r ≤ 1). Then, if we take the limit σ^2→0, the corresponding gY is


The misspecification is in the mean (r=±1), the variance (r=0) or a combination of the two (−1<r<1).

We are grateful to discussants for raising questions of design. Fractional replication y of a complete factorial z is a rather natural example of incomplete data, although an extreme one in that some eigenvalues are 0, as pointed out by Dr Rosenbaum. Design is then about the choice of h, and robustness of design is about sensitivity to model uncertainty about fZ (Professor Atkinson's comments). Dr Rosenbaum's discussion of directions in the fractional factorial is essentially about contrast vectors d in our equation (5). In the general set-up, the direction d which suffers least or most loss of information is inline image where e is the eigenvector corresponding respectively to the largest or smallest eigenvalue of Λ. The bias is then inline image. The worst case misspecification occurs when inline image. These simple formulae give insight into the nuances between incompleteness and misspecification that were discussed by Dr Gustafson and Dr Iwasaki, including Dr Gustafson's ‘double blow’ when both worst cases happen together.

Dr Iwasaki asks about the relationship between our idea in Section 6 here and the sensitivity analysis that was discussed in earlier sections and in Copas and Eguchi (2001). Essentially, we are replacing dependence on ɛ by conditioning on inline image (also see our further comment in Section 8). However, the conditioning argument naturally suggests a sensitivity analysis via the calibration


and so, asymptotically, inline image has a non-central χ^2-distribution χ^2_m(ν) with m degrees of freedom and non-centrality parameter


Consider a maximizer inline image of ɛ that satisfies


Then this calibration inline image corresponds to the ‘double-variance’ idea.

Modelling a continuous time process with non-ignorable selection of the times at which it is measured is perhaps the most challenging problem brought up in the discussion (Professor Diggle and Professor Sørensen). We have no quick answers to offer, except to reiterate that such problems fall naturally into our framework, e.g. with uZ as defined as in Professor Sørensen's contribution.

Dr Manski describes his approach to incomplete-data problems as ‘a pole apart’ from the approach that we discuss here. We are not convinced that our approaches are quite as disparate as he suggests. We both avoid making rigid parametric assumptions about uZ, such as in the Heckman model. Our purpose in Section 5.3, of course, was not to promote the Heckman model but to point out how it fits in as a special case of our general formulation, and to cite its fragility as motivation for the later discussion in Section 6. In several expressions such as expressions (23), (30) and (37) we have fixed ɛ, or its calibration in the context of the particular problem, and found worst case bounds for bias. We have cited some similar work in Copas and Jackson (2004) where we extended our small ɛ bounds by finding worst case bounds with ɛ=O(1), but with an added monotonicity assumption about uZ.

We thank Professor Curnow for reminding us of the broader scientific issues that are raised by the passive smoking example. This is a meta-analysis on which one of us has worked previously, and it is included here merely to illustrate the motivation behind our paper. We accept that the task of the advisory committee was to advise on matters of policy in the light of the costs and benefits that are involved, and that this is more than to adjudicate on the ‘significance’ or otherwise of one particular relative risk. However, if ‘estimates together with measures of their uncertainties would be used by policy advisors and decision makers’ then we have to be able to estimate how large these uncertainties are: hence our discussion. Whatever policies are eventually decided by Governments, it seems to us that people whose lives are affected by them should be able to ask about the strength of the scientific evidence on which they were based.

We do not have space to comment on all of the many points that have been made in the discussion. The fact that these problems now feature prominently in the literature of three separate disciplines (statistics, epidemiology and econometrics) confirms their ‘near universal relevance’ (to quote Professor Diggle) and indicates the increasing scope for cross-fertilization of ideas and approaches. In conclusion, we thank the discussants again for their interest in our paper, and we look forward to taking their comments and further references on board in future work.

References in the discussion