#### 6.1. Identifiability and tests of fit

Our local approximations have assumed that *ɛ* is small, but we have not discussed the size of misspecification that is needed for the bias approximations to be useful in the practical setting of inference from a sample of *n* observations. Standard asymptotic inference allows us to estimate *θ* to within an accuracy of the order of magnitude *O*_{p}(*n*^{−1/2}), and the local bias approximations are of the order of magnitude *O*(*ɛ*). Thus our approximations allow us to combine these two sources of error in meaningful ways when *ɛ*=*O*(*n*^{−1/2}). This size of *ɛ* means that the misspecification is ‘undetectable’ in the sense that empirical evidence for discriminating between *f* and *g* remains uncertain even when sample sizes are indefinitely large.

To see this, first consider the ideal situation in which we can observe a sample of the complete data *z*_{1},*z*_{2},…,*z*_{n}. If we knew *θ* and the misspecification function *u*_{Z}, then we could test *g*_{Z} in equation (16) against *f*_{Z}, i.e. test the null hypothesis *H*_{0}:*ɛ*=0, with the uniformly most powerful standardized test statistic

- (55)

Under the null hypothesis, *T*_{Z} is asymptotically standard normal. If, for significance level *α*, we reject hypothesis *H*_{0} when |*T*_{Z}| > *d*, where *d*=Φ^{−1}(1−*α*/2), the asymptotic power function is

- (56)

since *E*_{g}{*u*_{Z}(*z*;*θ*)}=*ɛ*+*O*(*ɛ*^{2}) from equation (16). With *ɛ*=*O*(*n*^{−1/2}), the term *n*^{1/2}*ɛ* remains finite for large *n*, and so the misspecification is undetectable in the sense that the power of the optimum test does not tend to 1 as *n*→∞. Note that this argument is unaffected if we use the maximum likelihood estimate in place of *θ* in equation (55), since
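As a check on this argument, the power function (56) can be evaluated numerically. The sketch below (standard library only; the critical value *d* is hard-coded for *α*=0.05) confirms that under local alternatives *ɛ*=*c*/√*n* the power depends only on *c* and stays bounded away from 1, whereas a fixed *ɛ* is eventually detected with certainty:

```python
from math import erf, sqrt

def Phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def power(n, eps):
    """Asymptotic power (56) of the two-sided test that rejects
    when |T_Z| exceeds the 5% critical value d."""
    d = 1.959963984540054  # Phi^{-1}(0.975)
    return Phi(-d + sqrt(n) * eps) + Phi(-d - sqrt(n) * eps)

# Under local alternatives eps = c / sqrt(n) the power depends only on c,
# so the misspecification stays 'undetectable' however large n becomes.
c = 2.0
baseline = power(100, c / sqrt(100))
for n in (10_000, 1_000_000):
    assert abs(power(n, c / sqrt(n)) - baseline) < 1e-12
assert baseline < 0.6  # bounded away from 1

# A fixed eps, by contrast, is detected with probability tending to 1.
assert power(1_000_000, 0.01) > 0.999
```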

- (57)

The last term vanishes, which is a consequence of the identity *E*_{f}(*u*_{Z})=0 for all *θ*; on differentiating with respect to *θ*, this yields *E*_{f}(∂*u*_{Z}/∂*θ*)=−*E*_{f}(*s*_{Z}*u*_{Z})=0, as discussed at the end of Section 4.

If we could sample *z*, then we could use *T*_{Z} to test for misspecification in any given direction *u*_{Z}(*z*;*θ*). Of particular interest would be the directions (27) which give maximum bias for estimating scalar contrasts of *θ*. However, the situation is quite different when we can only sample the incomplete data *y*_{1},*y*_{2},…,*y*_{n}, since there may be misspecifications in *g*_{Z} which cannot be detected from data on *g*_{Y}. If *u*_{Z} is such that *u*_{Y}=*d*^{T}*s*_{Y} for some vector of constants *d*, then

and so the misspecification is completely confounded with the unknown value of *θ*. An example of this happening is the simple pattern mixture model for missing data with *z*=(*t*,*r*) and *t*|*r*∼*N*(*θ*+*rɛ*,1). It is obvious that we have no information about the size of *ɛ*, since we can only observe data on the conditional distribution of *t* given *r*=1. In this case we find *d*=1 since *u*_{Y}=*s*_{Y}=*r*(*t*−*θ*).
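The confounding is easy to verify numerically: since only cases with *r*=1 are observed, the observed-data likelihood depends on *θ* and *ɛ* only through their sum. A minimal sketch, with illustrative parameter values:

```python
import math
import random

def obs_loglik(data, theta, eps):
    """Observed-data log-likelihood for the pattern mixture model
    t | r ~ N(theta + r*eps, 1) when only cases with r = 1 are observed."""
    mu = theta + eps  # every observed case has r = 1
    return sum(-0.5 * math.log(2.0 * math.pi) - 0.5 * (t - mu) ** 2
               for t in data)

random.seed(1)
data = [random.gauss(0.7, 1.0) for _ in range(500)]

# (theta, eps) = (0.2, 0.5) and (0.7, 0.0) are observationally identical:
# only theta + eps is identified, so eps is confounded with theta.
assert abs(obs_loglik(data, 0.2, 0.5) - obs_loglik(data, 0.7, 0.0)) < 1e-9
```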

To see that this can always happen, consider the analogue of *T*_{Z} for observations on *y*, namely

When we make this test operational by replacing *θ* by the incomplete-data estimate, the analogue of expression (57) is

The term in braces is the sample residual when *u*_{Y} is projected onto the linear space that is spanned by the components of the score function *s*_{Y} and so is identically zero if *u*_{Y}=*d*^{T}*s*_{Y}. But this is exactly what happens in the worst case misspecification in equation (27), for then

This is well defined provided that *λ*_{max}<1. Referring to Fig. 2, the worst case for bias is when the triangle that is defined by the projection of *g*_{Y} onto the model collapses onto a line along the model. The first-order approximation to *g*_{Y} is then a member of the family of distributions *f*_{Y}, just with a shift in the value of *θ*.
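This collapse can also be seen computationally: when *u*_{Y} is a linear combination of the score components, its least squares projection onto the span of the score leaves a residual that is identically zero, so the test statistic has nothing left to detect. A sketch using the pattern mixture example above, with illustrative values:

```python
import numpy as np

rng = np.random.default_rng(0)
n, theta = 1000, 0.3

# Pattern mixture example: score s_Y = r (t - theta); worst-case u_Y = s_Y.
r = (rng.random(n) < 0.6).astype(float)
t = np.where(r == 1, rng.normal(theta, 1.0, n), 0.0)  # t unseen when r = 0
s_Y = r * (t - theta)            # score contribution of each case
u_Y = 1.0 * s_Y                  # u_Y = d^T s_Y with d = 1

# Project u_Y onto the span of the score components; the residual is what
# a test based on T_Y effectively examines, and here it vanishes.
S = s_Y.reshape(-1, 1)
coef, *_ = np.linalg.lstsq(S, u_Y, rcond=None)
residual = u_Y - S @ coef
assert np.max(np.abs(residual)) < 1e-10
```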

The test that is based on *T*_{Y} fails because we are allowing misspecification to be in *any* direction, including the worst case function *u*_{Z} in equation (27). If, however, *u*_{Z} is *known* to take some other functional form, the test is possible and the model is identifiable. An interesting case of this is the Heckman model (38) and (39). Here, in the notation of that section,

- (58)

This is closely related to the standard method of fitting the Heckman model (Heckman, 1979), which is to estimate *ψ* from equation (40), to add the corresponding estimate of *λ*(*ψ*^{T}*x*) as an additional covariate to the linear regression of *t* on *x* and to refit by ordinary least squares. This is because, from equation (41), values of *t* and *x* among those cases with *r*=1 can be written

- (59)

where *δ*^{*} is a random residual with mean 0. The (unweighted) least squares estimate of *σρ*, the coefficient on the Mills ratio term in equation (59), gives

- (60)

where the other estimate appearing in equation (60) is the (unweighted) least squares coefficient in the observed regression of *λ*(*ψ*^{T}*x*) on *x*. Thus equation (58) is proportional to the Heckman estimate of *σρ*, the constant of proportionality depending only on the values of *x* in the observed cases. Hence, if we test the hypothesis *H*_{0}:*ρ*=0 conditional on the observed values of *x*, the test based on equation (58) is equivalent to the regression test based on this estimate.
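The two-step construction can be sketched in simulation. The code below takes *ψ* as known, skipping the probit first step, and uses assumed illustrative parameter values; it checks that the coefficient on the inverse Mills ratio recovers *σρ* and that the correction reduces the selection bias in the slope:

```python
import math
import numpy as np

rng = np.random.default_rng(42)
n = 200_000
theta0, theta1 = 1.0, 2.0   # outcome coefficients (illustrative values)
psi0, psi1 = 0.2, 1.0       # selection probit coefficients (illustrative)
sigma, rho = 1.5, 0.6       # coefficient on the Mills ratio is sigma*rho

x = rng.normal(0.0, 1.0, n)
u, v = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], n).T
t = theta0 + theta1 * x + sigma * u    # outcome equation
r = psi0 + psi1 * x + v > 0            # selection: t observed only when r

def mills(a):
    """Inverse Mills ratio phi(a) / Phi(a)."""
    return (math.exp(-0.5 * a * a) / math.sqrt(2.0 * math.pi)
            / (0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))))

idx = psi0 + psi1 * x[r]                       # psi^T x, psi taken as known
lam = np.array([mills(a) for a in idx])
X = np.column_stack([np.ones(idx.size), x[r], lam])
beta, *_ = np.linalg.lstsq(X, t[r], rcond=None)          # two-step fit
naive, *_ = np.linalg.lstsq(X[:, :2], t[r], rcond=None)  # ignores selection

assert abs(beta[2] - sigma * rho) < 0.3    # Mills coefficient near sigma*rho
assert abs(beta[1] - theta1) < abs(naive[1] - theta1)  # bias is reduced
```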

Little (1985) pointed out, as have many others, that the Heckman estimate is unsatisfactory in practice because of its strong dependence on the correct specification of equations (38) and (39), and on the need for the range of variation of the propensity score *ψ*^{T}*x* to be sufficiently large for the non-linearity of *λ*(*ψ*^{T}*x*) to be evident in the observed cases. If the range of values of *ψ*^{T}*x* is small, the new regressor *λ*(*ψ*^{T}*x*) is highly collinear with the existing regressors *x*, and so equation (60) is unstable. If the test of fit that is based on this estimate is unstable, then so is the test that is based on *T*_{Y}. Copas and Li (1997) gave an example where two transformations of *t*, apparently fitting the observed data equally well, lead to sharply different estimates of *ρ*, and hence different estimates of *θ*.

In Section 5.3 we commented that, in the simpler problem of estimating the mean of a normal sample with missing observations, the Heckman model does attain maximum bias. Here *ψ*^{T}*x* is a constant, and so the added term *λ*(*ψ*^{T}*x*) in equation (59) is completely confounded with the main term in the regression. In this case, equation (58) is identically zero, as are both the numerator and the denominator of the estimate in equation (60). Again, this is a case where *u*_{Y} is a linear function of *s*_{Y}.

A rather similar situation arises in the literature on identifiability of competing risks. In a classic paper, Tsiatis (1972) showed that there is no available information on the dependence between the potential lifetimes in the competing risks problem. However, Heckman and Honore (1989) showed that the problem is fully identified if we impose parametric models on the marginal life distributions. Crowder (1994) and others have since pointed out that the resulting estimates are highly dependent on the modelling assumptions that are made. See Crowder (2001) for a good review of this whole area.

There are also many other examples in the literature of identifiable models for the kind of problems that we are considering. For missing data, identifiable parametric models for the joint distribution of *t* and *r* of Section 5.1 have been proposed by Baker and Laird (1988), Chambers and Welsh (1993) and Park and Brown (1994), among many others. Identifiability may come through strong assumptions on other aspects of the model; for example Tang *et al.* (2003) assumed that the marginal distribution of covariates is known. Again, such models involve untestable assumptions or, in a Bayesian context, influential prior distributions.

#### 6.2. Extra uncertainty

This discussion illustrates the central problem of incomplete-data analysis, that unless we make strong and unverifiable modelling assumptions we have little or no information about *ɛ*, and hence little or no information about bias. Investing in a good model is always important, but particularly so here because of the lack of identifiability of important aspects of the model such as the ignorability assumptions that are implied in all our examples.

In complete-data problems, where we can observe a sample of values of *z*, a weaker interpretation of *f*_{Z} is as a ‘working model’: we do not assume that *f*_{Z} is necessarily true, but we use it for inference on the grounds that it gives an acceptable fit. We interpret this to mean that the actual distribution generating *z* is *g*_{Z}, but that *f*_{Z} is accepted because |*T*_{Z}| ≤ *d*, where *T*_{Z} is constructed for the worst case misspecification (27). Uncertainty of inference is now evaluated with respect to *g*_{Z} with *ɛ*≠0, but conditioning on the event |*T*_{Z}| ≤ *d*. Since we are now allowing for misspecification, we expect this to increase uncertainty relative to standard true-model inference, but *ɛ* is unlikely to be too large because the null hypothesis that *ɛ*=0 has been accepted by a goodness-of-fit test.

For this discussion to make sense, we need to ensure that the parameter *θ* retains its meaning under both *f*_{Z} and *g*_{Z}. In the terminology of Section 4, this means that we adopt Royall and Tsou's (2003) assumption that *θ*^{INT}=*θ*^{INF}, so that *θ*=*θ*_{gZ} and *E*_{f}(*u*_{Z}*s*_{Z})=0. We now consider calculating the confidence interval (61) after accepting that *f*_{Z} is a ‘working model’. Our conjecture is that, to attain the same confidence coefficient (coverage), this interval needs to be widened to allow for the extra uncertainty through relaxing the status of *f*_{Z} from a true model to a working model. We shall find a factor *k* ≥ 1 such that the coverage of the widened interval in this broader sense remains at least 1−*α*.

Similarly, when we can only observe incomplete data, we could describe *f*_{Y} as a working model if *f*_{Y} gives an adequate fit to the observed sample of values of *y*. The difference now is that a good fit of *f*_{Y} no longer implies that *ɛ* is necessarily small. In the simple pattern mixture model that was mentioned in Section 6.1, for example, the observed values of *t* may give an excellent fit to the normal distribution that is required by *f*_{Y}, and yet *ɛ* may be large. If *ɛ* is large, then inference from the incomplete data may be severely biased.

Since a good fit of *f*_{Z} to data on *z* necessarily implies a good fit of *f*_{Y} to the corresponding values of *y*, we argue that the extra uncertainty that is implied by interpreting *f*_{Y} as a working model is as great as or greater than the extra uncertainty that is implied by interpreting *f*_{Z} as a working model. We therefore evaluate the factor *k* that was defined above and use this as a lower bound when basing inference on a working model on *y*. If we are not willing to make the strong assumption that *f*_{Y} is the true model, and merely rely on its credentials as a working model, the actual error when we estimate *φ* may be larger, and possibly substantially larger, than this calculation implies.

This is the key idea of this section. We can never know that our model is ‘correct’; the best that we can hope for is that it gives a good description of the data. The problem is that with data only on *y* we can never test *f*_{Z} fully, because of the identifiability problems that were discussed above. Instead we formulate our uncertainty on the assumption that *f*_{Z} gives a good fit to the (unobserved) data on *z* and use this as a lower bound to the actual uncertainty that we suffer when *f*_{Y} is used as a model for the data on *y*. This argument leads to a confidence interval that is wider than expression (61), and hence less misleading than the naïve procedure which makes no allowance at all for model uncertainty.

To study *k*, we need the joint distribution, under *g*_{Z}, of the pivot *S* and the test statistic *T*_{Z} with *u*_{Z} in equation (27), which is

where *λ* is the relative efficiency for estimating *φ*=*d*^{T}*θ* as defined in expression (5). As expected, if we could observe *z*, we would test for the presence of incomplete-data bias by calculating the complete-data and incomplete-data estimates of *φ* from the same set of data and testing the significance of their difference.

When *ɛ*=0, both *S* and *T*_{Z} are asymptotically standard normal. Using equations (24), (1) and (4), the correlation between *S* and *T*_{Z} is (1−*λ*)^{1/2}. When *ɛ*≠0, the incomplete-data estimate of *φ* suffers the first-order approximate bias *b* from equation (20), and hence the corresponding approximate mean of *S* is *b*_{S}. Note that if *ɛ*=*O*(*n*^{−1/2}) then *b*_{S} is *O*(1). Similarly, the expected value of *T*_{Z} is (1−*λ*)^{−1/2}*b*_{S}. Hence, if terms of size *O*(*n*^{−1/2}) and smaller are ignored, *S* and *T*_{Z} are jointly asymptotically normal with

Thus the conditional distribution of *S* given *T*_{Z} is approximately

- (62)

To this order of approximation, the conditional distribution of *S* given *T*_{Z} does not involve the misspecification bias *b*_{S}. This argument applies for *any* misspecification function *u*_{Z} in *g*_{Z}, not just the worst case function (27) that is used in *T*_{Z}.

If we had actually observed *T*_{Z}, we could use this conditional sampling distribution of the pivot *S* to construct a conditional confidence interval for *φ*. With the same significance level *α* this would give

- (63)

If *T*_{Z}<*d*, the lower limit of interval (63) cannot be less than

- (64)

where

- (65)

Similarly, if *T*_{Z}>−*d*, the upper limit of expression (63) is at most expression (64) with the sign changed. Thus, if we assume that |*T*_{Z}| ≤ *d*, then a conservative confidence interval for *φ* is

- (66)

Since 0<*λ*≤1, the second equality in equation (65) shows that 1 ≤ *k* ≤ √2.

Comparing expression (66) with expression (61) we see that relaxing the status of model *f*_{Z} from a true to a working model has led to a wider interval, by a factor which depends on the value of *λ*, i.e. on the proportion of information that is retained in the incomplete data. The width of the interval, however, never increases by more than a factor of √2, which we can think of as ‘doubling the variance’, recalculating the usual confidence interval with the variance doubled.
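Assuming the factor takes the form *k*=(1−*λ*)^{1/2}+*λ*^{1/2} suggested by the conditional limits above (equation (65) itself is not reproduced here), the bounds are easy to verify numerically:

```python
from math import sqrt

def k(lam):
    """Interval inflation factor, under the assumed form
    k = (1 - lam)^{1/2} + lam^{1/2} from the conditional limits
    leading to equation (65)."""
    return sqrt(1.0 - lam) + sqrt(lam)

# k lies between 1 and sqrt(2); the corresponding variance factor k^2
# therefore never exceeds 2, which is the 'doubling the variance' bound.
for lam in [0.01, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0]:
    assert 1.0 <= k(lam) <= sqrt(2.0) + 1e-12
assert abs(k(0.5) - sqrt(2.0)) < 1e-12   # maximum attained at lam = 1/2
assert k(1.0) == 1.0                      # lam = 1: no loss of information
```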

To see this in another way, if we could observe *z*, then we could estimate *φ* with some variance under *f*_{Z}; with incomplete data, this variance increases by the factor 1/*λ*. But, if *f*_{Z} is weakened to a working model, the expanded confidence interval (66) is the same as the conventional interval (61) but with the variance increased further to *k*^{2} times the incomplete-data variance, which we could call the *pseudovariance*. From interval (65), the pseudovariance is

- (67)

The right-hand side of equation (67) splits the pseudovariance into the ordinary variance (assuming that *f*_{Z} is true), plus the effect of bias resulting from model uncertainty. Fig. 3 shows the total variance inflation factor in equation (67) in terms of *λ*. The first term is the dotted line and the second the broken line, giving the total as the full line. All three curves are decreasing functions of *λ*, as expected.

The rather informal argument leading to equation (65) is based on considering the upper and lower confidence limits separately. For a tighter bound, consider the (conditional) coverage probability of the two-sided confidence interval under the working model *f*_{Z}. This is

- (68)

Now define *k*^{*}=*k*^{*}(*λ*,*α*) as the unique solution of

Then the coverage probability (68) is at least 1−*α* if *k* ≥ *k*^{*}. If *k*<*k*^{*} then the coverage falls below 1−*α* for at least some values of *b*_{S}.
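The role of *k*^{*} can be illustrated by Monte Carlo simulation of the joint distribution of *S* and *T*_{Z} stated above. The sketch below, with illustrative values of *λ* and *b*_{S}, shows the conventional interval undercovering conditionally on acceptance, while widening by the factor √2 restores the nominal level:

```python
import numpy as np

def conditional_coverage(lam, b_S, k, d=1.96, n=2_000_000, seed=0):
    """Coverage of the interval +/- k*d for the pivot S, conditional on
    the goodness-of-fit test accepting, |T_Z| <= d.  Joint model as in
    the text: S and T_Z have unit variances, means b_S and
    (1 - lam)^{-1/2} b_S, and correlation (1 - lam)^{1/2}."""
    rng = np.random.default_rng(seed)
    T = rng.normal((1.0 - lam) ** -0.5 * b_S, 1.0, n)
    S = (1.0 - lam) ** 0.5 * T + lam ** 0.5 * rng.normal(0.0, 1.0, n)
    accepted = np.abs(T) <= d
    return np.mean(np.abs(S[accepted]) <= k * d)

lam, b_S = 0.5, 3.0
# The conventional interval (k = 1) undercovers for this misspecification...
assert conditional_coverage(lam, b_S, k=1.0) < 0.90
# ...while widening by k = sqrt(2) restores at least 95% coverage.
assert conditional_coverage(lam, b_S, k=2 ** 0.5) > 0.95
```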

Fig. 4 illustrates the values of *k*^{*} for *α*=0.05 and *α*=0.01. For each *λ*, *k*^{*} increases as *α* becomes more extreme but is always less than the curve for *k* in equation (65), which is also shown. In fact

- (69)

Of the three inequalities in expression (69), the first is attained when *λ*=1 (no loss of information), the second is attained in the limit as *α*→0 and the third is attained when *λ*=1/2.

In summary, we have three asymptotic coverage statements. Firstly,

the conventional asymptotic confidence interval when *f*_{Z} is the true model. But if *f*_{Z} has the weaker status of a working model, defined by conditioning on the event |*T*_{Z}| ≤ *d* (so that *ɛ*=*O*(*n*^{−1/2}) for this event to happen with non-vanishing probability), then the conventional interval is no longer a confidence interval with this coverage, as

- (70)

for at least some possible misspecified distributions *g*_{Z}. But

- (71)

for *all* possible distributions *g*_{Z} within the asymptotic set-up that is being discussed. Of course this is a hypothetical calculation since *T*_{Z} is unobserved, but the expanded confidence limits in inequality (71) involve only the *y*s. Our argument is that in practice, when we only accept *f*_{Y} as a working model, our uncertainty limits for *φ* should be *at least as wide* as those in inequality (71).

The fact that, when *λ*=1, *k*^{*}=*k*=1 emphasizes the importance of the assumption that *θ*^{INT}=*θ*^{INF}. For complete data there is then no asymptotic penalty if we treat a working model as if it were a true model. But when *λ*<1 the distinction is important, as seen in inequality (70).

The same argument also applies in multiparameter problems. Suppose that we want to find a confidence region for *θ* itself, containing *m* components say. Let the matrix Λ, the multivariate analogue of the relative efficiency *λ*, be defined as before. Then the multivariate analogues of *S* and *T*_{Z} are

If *ɛ*=0 then var(*S*)=var(*T*_{Z})=*I*. The distribution of *S* then allows us to write

- (72)

where *U* is a random vector from *N*(0,*I*). The usual asymptotic confidence ellipsoid for *θ* is the set of all values of equation (72) that are consistent with the inequality *U*^{T}*U* ≤ *d*, where *d* is now the (1−*α*)-quantile of the *χ*^{2}-distribution on *m* degrees of freedom.

When *ɛ*=*O*(*n*^{−1/2}) we allow for first-order bias in the same way as before to give the generalization of distribution (62) as *S*|*T*_{Z}∼*N*{(*I*−Λ)^{1/2}*T*_{Z},Λ}. Now we can write, conditional on *T*_{Z},

- (73)

The expanded confidence region is now the collection of all values of equation (73) that are consistent with the two inequalities *T*_{Z}^{T}*T*_{Z} ≤ *d* and *U*^{T}*U* ≤ *d*.

Fig. 5 shows two examples with *m*=2, *α*=0.05 and

or

In each of the two diagrams, the ellipses illustrate the two distinct components of the right-hand side of equation (73). The inner ellipse that is centred on the origin (drawn in bold) is the locus of the first term of equation (73) as *T*_{Z} varies over the circle *T*_{Z}^{T}*T*_{Z}=*d*. An ellipse corresponding to the values of the second term in equation (73) as *U* varies over the circle *U*^{T}*U*=*d* is then centred on each point of the first ellipse, to give the collection of ellipses that are drawn with light lines. The outer envelope of these ellipses is the conservative confidence region for *θ*.

Of the two larger concentric ellipses that are drawn with bold lines in each graph of Fig. 5, the inner is the conventional confidence region for *θ* given by *S*^{T}*S*=*d*. The outer is the region for *θ* that is defined by *S*^{T}*S*=2*d*. Note that the envelope allowing for all possible values of the bias that are consistent with the acceptance region is contained within the conventional confidence ellipse but with the variances doubled.
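This containment can be checked numerically. Since the example matrices for Λ are not reproduced here, the sketch below assumes illustrative eigenvalues, one of them equal to 1/2 so that the doubled-variance bound is attained:

```python
import numpy as np

# Assumed illustrative eigenvalues for Lambda; an eigenvalue of 1/2
# makes the sqrt(2) (doubled-variance) bound sharp.
lam = np.array([0.5, 0.8])
d = 5.991  # 0.95-quantile of chi-squared on m = 2 degrees of freedom

angles = np.linspace(0.0, 2.0 * np.pi, 721)          # includes 0 exactly
circle = np.sqrt(d) * np.stack([np.cos(angles), np.sin(angles)])

# S = (I - Lambda)^{1/2} T_Z + Lambda^{1/2} U over the two circles
# T_Z^T T_Z = d and U^T U = d, as in equation (73).
A = np.sqrt(1.0 - lam)[:, None, None]
B = np.sqrt(lam)[:, None, None]
S = A * circle[:, :, None] + B * circle[:, None, :]
SS = (S ** 2).sum(axis=0)

# The envelope lies inside the doubled-variance ellipse S^T S = 2d and
# touches it where the lam = 1/2 component is fully aligned.
assert SS.max() <= 2.0 * d + 1e-9
assert abs(SS.max() - 2.0 * d) < 1e-9
```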

The eigenvalues for these two examples show that *k*_{min}=√2 in the second case but not the first. This is confirmed in Fig. 5. In Fig. 5(a) we see that the envelope is everywhere inside the outer ellipse, but in Fig. 5(b) we see that the envelope touches the outer ellipse at exactly four points.