### Discussion on the paper by Spiegelhalter, Best, Carlin and van der Linde

**S. P. Brooks** ( *University of Cambridge* )

This is a wonderful paper containing a wide array of interesting ideas. It seems to me very much like a first step (and in the right direction) and I am sure that it will be seen as both a focus and a source of inspiration for future developments in this area.

As the authors point out, their *p*_{D} and the deviance information criterion (DIC) statistics have already been widely used within the Bayesian literature. Given this history and in the previous absence of a published source for these ideas, it is easy to misunderstand what *p*_{D} actually does. Certainly, before reading this paper, but having read several others which use the DIC, I thought that the *p*_{D}-statistic was a clever way of avoiding the problem that Bayesians have when it comes to calculating the number of parameters in any hierarchical model. Essentially the problem is one of deciding which variables in the posterior are model parameters and which are hyperparameters arising from the prior. However, *p*_{D} does not help us here and that is why we have Section 2.1 explaining that this choice is up to the reader. The authors refer to this as choosing the ‘focus’ for the analysis. Sadly, in many cases the calculation of *p*_{D} will be impossible for the focus of primary interest since the deviance will not be available in closed form (this includes random effects and state space models, for example), so this remains an open problem.

What *p*_{D} *does* do is to tell you, once you have chosen your focus, how many parameters you lose (or even gain?) by being Bayesian. The number of degrees of freedom (or parameters) in a model is clear from the (focused) likelihood. However, by combining the likelihood with the prior we almost always impose additional restrictions on the parameter space, effectively reducing the degrees of freedom of our model. Take the authors’ saturated model of Section 8.1, in which parameters *α*_{1},…,*α*_{56} are given a prior with some unknown mean *μ* and fixed variance *σ*^{2}. Clearly, in the limit as *σ*^{2} goes to 0, we essentially remove the 56 individual parameters *α*_{i} and effectively replace them with a single parameter *μ*. I guess that this is fairly obvious with hindsight, as is the case with many great ideas. Nonetheless it is a credit to the authors firstly for seeing it and, more importantly, for actually deriving a procedure for dealing with it.

This prior-induced parameter reduction can be clearly observed in Fig. 5, in which we plot the value of *p*_{D} against log(*σ*^{2}) both for a hyperprior *μ* ∼ *N*(0,1000) and for *μ*=0 (the authors are unclear about which, if either, they actually use in Section 8.1). We can see that, as *σ*^{2} decreases, the effective number of parameters decreases to either 1 or 0 depending on whether or not *μ* itself is a parameter, i.e. which prior is chosen. It is interesting to note the rapid decline in *p*_{D} for variances between 1 and 0.01, but what is particularly interesting about this plot is that, as *σ*^{2} increases, *p*_{D} converges to a fixed maximum well below 56, the number of parameters in the likelihood. As an experiment, if we take *σ*^{2}=10^{30} or even the Jeffreys prior for the *α*_{i}, a value for *p*_{D} exceeding 53.1 is never obtained (modulo Monte Carlo error). This suggests that we automatically lose three parameters just by being Bayesian, even if we are as vague as we could possibly be with our prior. Quoting Bernardo and Smith (1994), page 298, ‘every prior specification has *some* informative posterior or predictive implications …. There is no “objective” prior that represents ignorance.’ Of course, the authors’ Table 1 suggests that if we took the median as the basis for the calculation of *p*_{D} then we might obtain different results; indeed we seem to regain several parameters this way! Unfortunately, analytic investigation of the *p*_{D}-statistic is essentially limited to the case where we take the plug-in estimate to be the posterior mean, so we have little idea of the extent and nature of the variability across parameterizations. This choice is likely to have a significant effect on any inference based on the corresponding *p*_{D}-statistic and further (no doubt simulation-based) investigation along these lines would certainly be very helpful.
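Brooks's prior-induced parameter reduction can be illustrated with a small simulation. The sketch below is my own construction, not the authors' code: it assumes unit observation variance, a fixed prior mean of 0 and exact conjugate posterior sampling for *y*_{i} ∼ *N*(*α*_{i},1), *α*_{i} ∼ *N*(0,*σ*^{2}), and computes *p*_{D} as the mean deviance minus the deviance at the posterior mean.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 56
y = rng.normal(0.0, 2.0, n)          # hypothetical data standing in for the 56 units

pDs = {}
for s2 in [0.01, 1.0, 100.0]:
    w = s2 / (1.0 + s2)              # shrinkage factor of the conjugate posterior
    m = w * y                        # posterior means of the alpha_i
    v = w                            # posterior variance of each alpha_i
    a = rng.normal(m, np.sqrt(v), size=(4000, n))   # exact posterior draws
    D = ((y - a) ** 2).sum(axis=1)   # deviance, up to an additive constant
    pDs[s2] = D.mean() - ((y - m) ** 2).sum()       # pD = mean deviance - deviance at mean
    # analytically pD = n * s2 / (1 + s2), so roughly 0.55, 28 and 55.4 here
```

The effective number of parameters shrinks from 56 towards 0 as the prior variance decreases, exactly the behaviour Brooks describes for Fig. 5.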

As well as the construction of the *p*_{D}-statistic, the paper also derives a new criterion for model comparison labelled the DIC. The authors provide a heuristic justification for the DIC, but there are clearly several alternatives. One obvious extension of the usual Akaike information criterion (AIC) statistic to the Bayesian context is to calculate its posterior expectation, EAIC = E_{θ|y}{*D*(*θ*)}+2*p* (rather than evaluating it at the posterior mode under a flat prior), or to take the deviance calculated at the posterior mean, i.e. taking *D*{E(*θ*|*y*)}+2*p*. Of course, as with the DIC, posterior medians, modes etc. could also be taken and similar extensions could be applied to the corrected AIC statistic and the Bayesian information criterion, for example. Further, the number of parameters in each of these expressions might be replaced by *p*_{D} to gain even more potential criteria. Table 4 gives the posterior model probabilities and posterior-averaged information criteria (based on *p*, rather than *p*_{D}), including DIC, for autoregressive models of various orders fitted to the well-known lynx data (Priestley (1981), section 5.5). We note the broad agreement between the DIC, EAIC and EAIC_{c} (as is common in my own experience and, I think, expected by the authors), but that EBIC locates an entirely different model. We note also that the posterior model probabilities correctly identify the fact that two models appear to describe the data well, and they are the only criterion to identify correctly the existence of two distinct modes in the posterior.

Given the number of approximations and assumptions that are required to obtain the DIC it can only really be used as a broad brush technique for discriminating between obviously disparate models, in much the same way as any of the alternative information criteria suggested above might be used. However, in many realistic applications there may be two or more models with sufficiently similar DIC that it is impossible to choose between them. The only sensible choice in this circumstance is to model-average (see Section 9.1.3). Burnham and Anderson (1998), section 4.2, suggested the use of AIC weights and these are also given in Table 4 together with the corresponding weights for the other criteria. Essentially, these are obtained by subtracting from each AIC the value associated with the ‘best’ model and then setting

*w*(*k*) = exp {−ΔAIC(*k*)/2},

where ΔAIC(*k*) denotes the transformed AIC-value for model *k*. These weights are then normalized to sum to 1 over the models under consideration.
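The weight construction that Brooks describes takes only a few lines to implement. A minimal sketch, following Burnham and Anderson's recipe (the function name and example AIC values are illustrative only):

```python
import numpy as np

def akaike_weights(aic):
    """Burnham-Anderson weights: w_k proportional to exp(-Delta_k / 2)."""
    delta = np.asarray(aic, dtype=float) - np.min(aic)  # Delta relative to best model
    w = np.exp(-delta / 2.0)
    return w / w.sum()                                  # normalize to sum to 1

w = akaike_weights([100.0, 102.0, 110.0])
# the best model receives the largest weight; a gap of 2 AIC units
# corresponds to a weight ratio of exp(1)
```

The same transformation applies unchanged to DIC, EAIC or EBIC values, which is how the alternative columns of weights in Table 4 would be produced.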

Note the distinct differences between the weights and the posterior model probabilities given in Table 4, suggesting that only one or the other can really make any sense. We note here that similar comparisons have been made in the context of other examples. In the context of a log-linear contingency table analysis, King (2001), Table 2.5, found that two models have posterior probabilities 0.557 and 0.057 but corresponding DIC weights of 0.062 and 0.682 respectively. Similar examples in which the DIC and posterior model probabilities give wildly different results are provided by King and Brooks (2001). Do the authors have any feel for why these two approaches might give such different results? Which would they recommend be used, and do they have any suggestions for alternative DIC-based weights for model averaging which might lead to more sensible results? Surely, the only sensible approach is to calculate posterior model probabilities via transdimensional Markov chain Monte Carlo methods. When, then, do the authors suggest that the DIC might be used? What, in practical terms, is the question that the DIC is answering, as opposed to the posterior model probabilities?

The incorporation of the DIC-statistic into WinBUGS 1.4 ensures its ultimate success, but I have grave misgivings concerning the blind application of a ‘default’ DIC-statistic for model determination problems particularly given its heuristic derivation and the series of essentially arbitrary assumptions and approximations on which it is based. The authors ‘recommend calculation of DIC on the basis of several different estimators’. The option to choose different parameterizations is not available in the beta version of WinBUGS 1.4; will it be added to later versions? What about options for the all-important choice of focus? What do the authors suggest we do when the same parameterization is not calculable for all models being compared? Could not the choice of parameterization for each model adversely influence the results, particularly for models with large numbers of parameters (where a small percentage change in *p*_{D} might mean a large absolute change in the corresponding DIC)?

The paper, like any good discussion paper, leaves various other open questions. For example: why take E_{θ|y}[*d*_{Θ}] in equation (9) and not the mode or median; how should we decide when to take the plug-in estimate to be the mean, median, mode etc., as this will surely lead to different comparative results for the DIC; when is *p*_{D} negative and why; in an entirely practical sense, how does model comparison with the DIC compare with that via posterior model probabilities and why do they differ—can both be ‘correct’ in any meaningful way? On page 613, the authors write ‘*p*_{D} and DIC deserve *further investigation* as tools for model assessment and comparison’ and I would certainly agree that they do. I have very much enjoyed thinking about some of these ideas over the past few weeks and I am very grateful to the authors for the opportunity and motivation to do so. It therefore gives me great pleasure to propose the vote of thanks.

**Jim Smith** ( *University of Warwick, Coventry* )

I shall not address technical inaccuracies but just present four foundational problems that I have with the model selection in this paper.

- (a)
Bayesian models are designed to make plausible predictive statements about future observables. The predictive implications of all the prior settings on variances in the worked examples in Section 8 are unbelievable. They do not represent carefully elicited expert judgments but the views of a vacuous software user. Early in Section 1 the authors state that they want to identify succinct models ‘which appear to describe the information [about wrong “true” parameter values (see Section 2.2)?] in the *data* accurately’. But in a Bayesian analysis a separation between information in the data and in the prior is artificial and inappropriate. For example, where do I input extraneous data used as the basis of my prior? When do I stop calling this data (and so include it in *D*(·)) and instead call it prior information? This forces the authors to use default priors. A Bayesian analysis on behalf of a remote auditing expert (Smith, 1996) might require the selection of a prior that is robust within a *class* of beliefs of different experts (e.g. Pericchi and Walley (1991)). Default priors can sometimes be justified for simple models. Even then, models within a selection class need to have compatible parameterizations: see Moreno *et al.* (1998). However, in examples where ‘the number of parameters outnumbers observations’, which they claim their approach addresses, default priors are unlikely to exhibit any robustness. In particular, outside the domain of vague location estimation or separating variance estimation (discussed in Section 4), apparently default priors can have a strong influence on model implications and hence selection.

- (b)
Suppose that we need to select models whose predictive implications we do not believe. Surely we should try to ensure that prior information in each model corresponds to predictive statements that are comparable. Such issues, not addressed here, are considered by Madigan and Raftery (1991) for simple discrete Bayesian models. But outside linear models with known variances this is a difficult problem. Furthermore it is well known that calibration is a fast function (Cooke, 1991). In particular, apparently inconsequential deviations from the features of a model ‘not in focus’ tend to dominate *D*(*θ*) and *D*(*θ̄*). A trivial example of this occurs when we plan to forecast *X*_{2} having observed an independent identically distributed *X*_{1}=0.01, which under models M1 and M2 has respective Gaussian distributions *N*(100, 10000) and *N*(0, 0.001). Then, for most priors, model M1 is strongly preferred although its predictions about *X*_{2} are less ‘useful’ (Section 2.2). The authors’ premise that all the models they entertain are ‘wrong’ allows these calibration issues to bite theoretically even in the limit, unlike their asymptotically consistent rivals. The authors, however, do no more than acknowledge the existence of this core difficulty after the example in Section 8.3.

- (c)
Suppose that problems (a) and (b) do not bite. Then the ‘vector of parameters of focus’ (POF) will have a critical influence on any ensuing inference. How in practice do we specify this? The authors state without elaboration that this ‘should depend on the purpose of the investigation’ (Section 9.2.2). But it appears that in practice the POF is calculated on ‘computational grounds’, their software capability driving their inference. The high influence of the choice of the POF is illustrated in the example in Section 8.2. Here models 4 and 5 are predictively identical but model 5 has a significantly smaller deviance information criterion DIC than model 4. The authors conclude that ‘the extra mixing parameters are worthwhile’: why? In what practical sense is this helpful? This example illustrates that the unguided choice of the POF will often be inferentially critical. Incidentally in this example the order of DIC is not (as stated) consistent with the thickness of tails of the sample distribution, the thickest-tailed distribution being model 4.

- (d)
But ignoring all these difficulties there still remains the acknowledged choice of (re)parameterization governing the choice of the plug-in estimate, which initially we shall assume to be the mean. Consider the case when the POF *θ* is one dimensional with strictly increasing posterior distribution function *F*(*θ*|*y*), and *G* is a distribution function of a random variable with mean *μ*. Then the reparameterization of *θ* to *φ* = *G*^{−1}{*F*(*θ*|*y*)} has *E*(*φ*) = *μ*. Thus the plug-in deviance evaluated at this posterior mean is arbitrary within the range of *D*(·). Thus, contrary to Section 5.1.4, the choice of parameterization of *θ* with non-degenerate posterior will always be critical. But no *general* selection guidance is given here. In observation (c) of Section 2.6 the authors suggest the use of the posterior median instead of the mean if this can be calculated easily from their output: not a solution when the POF is more than one dimensional. Even familiar transforms of marginal medians to contrasts and means, or of means and variances to means and coefficients of variation, will not exhibit the required sorts of invariance.
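Smith's probability-integral-transform construction is easy to verify numerically: pushing posterior draws through their own (empirical) distribution function and then through any inverse distribution function *G*^{−1} yields a parameter whose posterior mean is whatever the mean of *G* happens to be. A toy sketch (all numbers are mine; *G* is taken to be exponential with mean 42):

```python
import numpy as np

rng = np.random.default_rng(4)
theta = rng.normal(1.0, 0.5, 100_000)    # posterior draws of theta; E(theta) = 1

# empirical posterior CDF evaluated at each draw: approximately Uniform(0, 1)
u = (np.argsort(np.argsort(theta)) + 0.5) / len(theta)

# phi = G^{-1}(F(theta | y)) with G exponential with mean 42
phi = -42.0 * np.log(1.0 - u)

# E(phi) is close to 42: the posterior mean of the reparameterized quantity is
# arbitrary, so a plug-in deviance evaluated there can sit anywhere in D's range
```

Replacing 42 by any other positive number moves *E*(*φ*) accordingly, which is precisely why the plug-in deviance is not invariant to the choice of parameterization.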

There may be theoretical reasons to use DIC but I do not believe that this paper gives them. So my suggestion to a practitioner would be: if you must use a formal selection criterion do not use DIC. I second the vote of thanks.

The vote of thanks was passed by acclamation.

**Aki Vehtari** ( *Helsinki University of Technology* )

The authors mention that the deviance information criterion DIC estimates the expected loss, with deviance as the loss function. This connection should be emphasized more. It should be remembered that the estimation of the expected deviance was Akaike's motivation for deriving the very first information criterion AIC (Akaike, 1973). In prediction and decision problems, it is natural to assess the predictive ability of the model by estimating the expected utilities, as the principle of rational decisions is based on maximizing the expected utility (Good, 1952) and the maximization of expected likelihood maximizes the information gained (Bernardo, 1979). It is often useful to use utilities other than likelihood-based ones. For example, in classification problems it is much more meaningful for the application expert to know the expected classification accuracy than just the expected deviance value (Vehtari, 2001). Given an arbitrary utility function *u*, it is possible to use Monte Carlo samples to estimate the required posterior expectations and then to compute an expected utility estimate which is a generalization of DIC (Vehtari, 2001).

The authors also mention the known asymptotic relationship of AIC to cross-validation (CV). Equally important is to note that the same asymptotic relationship also holds for NIC (Stone (1977), equation (4.5)). The asymptotic relationship is not surprising, as it is known that CV can also be used to estimate expected utilities with Bayesian justification (Bernardo and Smith (1994), chapter 6; Vehtari (2001); Vehtari and Lampinen (2002a)). Some main differences between CV and DIC are listed below; see Vehtari (2001) and Vehtari and Lampinen (2002b) for a full discussion and empirical comparisons. CV can use full predictive distributions. In the CV approach, there are no parameterization problems, as it deals directly with predictive distributions. CV estimates the expected utility directly, but it can also be used to estimate the effective number of parameters if desired. In the CV approach, it is easy to estimate the distributions of the expected utility estimates, which can for example be used to determine automatically whether the difference between two models is ‘important’. Importance sampling leave-one-out CV (Gelfand *et al.*, 1992; Gelfand, 1996) is computationally as light as DIC, but it seems to be numerically more unstable. *k*-fold CV is very stable and reliable, but it requires *k* times more computation. *k*-fold CV can also handle finite-range dependences in the data. For example, in the six-cities study, the wheezing statuses of a single child at different ages are not independent. DIC, which assumes independence, underestimates the expected deviance. In *k*-fold CV it is possible to group the dependent data and to handle independent groups, and thus to obtain better estimates (Vehtari, 2001; Vehtari and Lampinen, 2002b).
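Vehtari's remark that importance sampling leave-one-out CV is computationally as light as DIC can be sketched in a few lines. This is my own toy construction (a conjugate normal model with exact posterior draws), not his code: each p(*y*_{i}|*y*_{−i}) is estimated by a harmonic mean of the likelihood terms under the full posterior, following Gelfand et al.'s importance-sampling identity.

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(0.5, 1.0, 40)                 # data; likelihood y_i ~ N(mu, 1)

# conjugate posterior for mu under a N(0, 100) prior
n = len(y)
v = 1.0 / (n + 1.0 / 100.0)
m = v * y.sum()
mu = rng.normal(m, np.sqrt(v), 5000)         # posterior draws

# p(y_i | mu_s) for every observation / draw pair
lik = np.exp(-0.5 * (y[:, None] - mu[None, :]) ** 2) / np.sqrt(2 * np.pi)

# importance-sampling LOO: p(y_i | y_{-i}) ~ harmonic mean of p(y_i | mu_s)
loo = 1.0 / (1.0 / lik).mean(axis=1)
elpd_loo = np.log(loo).sum()                 # summed log predictive density
```

In this conjugate setting the exact leave-one-out predictive densities are available in closed form, so the importance-sampling estimate can be checked directly; in less well-behaved models the importance ratios can have heavy tails, which is the instability Vehtari notes.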

**Martyn Plummer** ( *International Agency for Research on Cancer, Lyon* )

I congratulate the authors on their thought-provoking paper. I would like to offer one constructive suggestion and one criticism.

Firstly, I have a proposal for a modified definition of the effective number of parameters *p*_{D}. Starting from the Kullback–Leibler information divergence between the predictive distributions at two different values of *θ*,

*I*(*θ*^{0}, *θ*^{1}) = ∫ log {*p*(*Y*|*θ*^{0})/*p*(*Y*|*θ*^{1})} *p*(*Y*|*θ*^{0}) d*Y*,

I suggest that *p*_{D} be defined as the expected value of *I*(*θ*^{0},*θ*^{1}) when *θ*^{0} and *θ*^{1} are independent samples from the posterior distribution of *θ*. This modified definition yields exactly the same expression for *p*_{D} in the normal linear model with known variance. In general, it should give a similar estimate of *p*_{D} when *θ* has an asymptotic normal distribution. This version of *p*_{D} can also be decomposed into influence diagnostics when the likelihood factorizes as in Section 6.3. It has the theoretical advantages of being non-negative and co-ordinate free. A practical advantage is that *p*_{D} can be estimated via Markov chain Monte Carlo sampling using two parallel chains by taking the sample average of

where the superscript denotes the chain to which each quantity belongs. The Monte Carlo error of this estimate is easily calculated and the difficulties discussed by Zhu and Carlin (2000) can thus be avoided.

For exponential family models, *I*(*θ*^{0},*θ*^{1}) can be expressed in closed form and there is no need to simulate replicate observations *Y*_{rep}. When the scale parameter *φ* is known, the expression for *p*_{Di} simplifies to

This gives a surprising resolution to the problem of whether to use the canonical or mean parameterization to estimate *p*_{D}.
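Plummer's definition can be checked in the simplest case. For a one-parameter normal model with *n* observations *y*_{i} ∼ *N*(*μ*,1), the Kullback–Leibler divergence between the sampling distributions at *μ*^{0} and *μ*^{1} is *n*(*μ*^{0}−*μ*^{1})²/2, and averaging it over independent posterior draws recovers *n* times the posterior variance, the classical *p*_{D} for this model. A toy two-chain check (all numbers are mine):

```python
import numpy as np

rng = np.random.default_rng(2)
n, v = 40, 0.025                          # n observations; posterior variance of mu

mu0 = rng.normal(1.0, np.sqrt(v), 5000)   # draws from "chain 1"
mu1 = rng.normal(1.0, np.sqrt(v), 5000)   # independent draws from "chain 2"

# KL divergence between N(mu0, 1) and N(mu1, 1) sampling models, n observations
kl = n * (mu0 - mu1) ** 2 / 2.0
pD_star = kl.mean()                       # E(mu0 - mu1)^2 = 2v, so pD_star ~ n * v
```

Because the estimator is a plain sample average over paired draws, its Monte Carlo standard error follows immediately, which is the practical advantage Plummer points to.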

On a more negative note, I am not convinced by the heuristic derivation of the deviance information criterion DIC in Section 7.3. I followed this derivation for the linear model of Section 4.1, for which it is not necessary to make any approximations. The term with expectation 0, neglected in the final expression, is . Adding this to DIC gives an expected loss of *p*+*p*_{D} which is not useful as a model choice criterion. I am not suggesting that the use of DIC is wrong, but a formal derivation is lacking.

**Mervyn Stone** ( *University College London* )

The paper is rather economical with the ‘truth’. The *truth* of *p*^{t}(*Y*) corresponds fixedly to the *conditions* of the experimental or observational set-up that ensures independent future replication *Y*_{rep} or internal independence of *y*=**y**=(*y*_{1},…,*y*_{n}) (not excluding an implicit concomitant *x*). For *p*^{t}(*Y*)≈*p*(*Y*|*θ*^{t}),*θ* must parameterize a scientifically plausible family of alternative distributions of *Y* under those conditions and is therefore a *necessary*‘focus’ if the ‘good [true] model’ idea is to be invoked: think of tossing a bent coin. Changing focus is not an option.

Any connection of *p*_{D} with cross-validatory assessment would need truth as *p*^{t}(**y**)=*p*^{t}(*y*_{1})…*p*^{t}(*y*_{n}). If *l*= log (*p*) is an acceptable measure of predictive success, is a one-out estimate of . Multiplied by −2, this connects with equation (33) only when the *θ*-model is true with *Y*_{1},…,*Y*_{n} independent.

Extending Stone (1977) to the posterior mode for prior *p*(*θ*), with *n* large, where

and *l*(*θ*)= log {*p*(*θ*)}. If is negative definite, the typically non-negative penalty Π(**y**) is smaller for the posterior mode than for the maximum likelihood estimate. For the maximum likelihood estimate, gives Π(**y**) estimating *p*^{*}, but the general form probably gives Ripley's *p*^{*}.

If Section 7.3 could be rigorously developed (the use of *E*_{Y} does look suspicious!), another connection (via equation (33)) might be that DIC ≈−2*A*. But, since Section 7.3 invokes the ‘good model’ assumption and small for the Taylor series expansion (i.e. large *n*), such a connection would be as contrived as that of *A* with the Akaike information criterion: why not stick with the pristine (nowadays calculable) form of *A*—which does not need large *n* or truth, and which accommodates estimation of *θ* at the independence level of a hierarchical Bayesian model? If sensitivity of the logarithm to negligible probabilities is objectionable, Bayesians should be happy to substitute a subjectively preferable measure of predictive success.

**Christian P. Robert** ( *Université Paris Dauphine* ) and **D. M. Titterington** ( *University of Glasgow* )

A question that arises regarding this thought-challenging paper was actually raised in the discussion of Aitkin (1991), namely that the data seem to be used *twice* in the construction of *p*_{D}. Indeed, *y* is used the first time to produce the posterior distribution *π*(*θ*|*y*) and the associated estimate . The (Bayesian) deviance criterion then computes the posterior expectation of the *observed* likelihood *p*(*y*|*θ*),

and thus uses *y* again, similarly to Aitkin's posterior Bayes factor

This repeated use of *y* would appear to be a potential factor for overfitting.

It thus seems more pertinent (within the Bayesian paradigm) to follow an integrated approach along the lines of the posterior *expected* deviance of Section 6.2,

because this quantity would be strongly related to the posterior *expected* loss defined by the logarithmic deviance,

advocated in Robert (1996) and Dupuis and Robert (2002) as an intrinsic loss adequate for model fitting. In fact, the connection between *p*_{D}, the deviance information criterion and the logarithmic deviance would suggest the use of this loss to compute the estimate plugged into *p*_{D} as the intrinsic Bayes estimator

where the last expectation is computed under the predictive distribution. Not only does this make sense because of the aforementioned connection, but it also provides an estimator that is completely invariant to reparameterization and thus avoids the possibly difficult choice of the parameterization of the problem. (See Celeux *et al.* (2000) for an illustration in the set-up of mixtures.)

**J. A. Nelder** ( *Imperial College of Science, Technology and Medicine, London* )

My colleague Professor Lee has made some general points connecting the subject of this paper to our work on likelihood-based hierarchical generalized linear models. I want to make one specific point and two general ones.

- (a)
Professor Dodge has shown that, of the 21 observations in the stack loss data set, only five have not been declared to be outliers by someone! Yet there is a simple model in which no observation appears as an outlier. It is a generalized linear model with gamma distribution, log-link and linear predictor *x*_{2} + log(*x*_{1})*log(*x*_{3}). This gives the following entries for Table 2 in the paper (I am indebted to Dr Best for calculating these). It is clearly better than the existing models used in Table 2.

- (b)
This example illustrates my first general point. I believe that the time has passed when it was enough to assume an identity link for models while allowing the distribution only to change. We should take as our base-line set of models at least the generalized linear model class defined by distribution, link and linear predictor, with choice of scales for the covariates in the last named.

- (c)
My second general point is that there is, for me, not nearly enough model checking in the paper (I am assuming that the use of such techniques is not against the Bayesian rules). For example, if a set of random effects is sufficiently large in number and the model postulates that they are normally distributed, their estimates should be graphed to see whether they look like a sample from such a distribution. If they look, for example, strongly bimodal, then the model must be revised.
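Nelder's gamma/log-link suggestion is easy to fit without special software: with a log link and gamma variance function V(*μ*) = *μ*², the IRLS working weights are constant, so each step reduces to ordinary least squares on a working response. A generic sketch on simulated data (the stack loss data themselves are not reproduced here; the function and variable names are mine):

```python
import numpy as np

def gamma_glm_log(X, y, iters=25):
    """IRLS for a gamma GLM with log link.

    With log link and V(mu) = mu^2, the IRLS weight
    (dmu/deta)^2 / V(mu) = mu^2 / mu^2 = 1, so each step is plain
    least squares on the working response z = eta + (y - mu) / mu.
    """
    beta, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)  # log-scale start
    for _ in range(iters):
        eta = X @ beta
        mu = np.exp(eta)
        z = eta + (y - mu) / mu
        beta, *_ = np.linalg.lstsq(X, z, rcond=None)
    return beta

# simulated stand-in data with a log-linear mean
rng = np.random.default_rng(5)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
true_beta = np.array([0.5, 1.2])
shape = 200.0                                  # large shape => small relative noise
y = rng.gamma(shape, np.exp(X @ true_beta) / shape)
beta = gamma_glm_log(X, y)                     # close to true_beta
```

The same loop fits Nelder's stack loss model once the design matrix is built from *x*_{2} and log(*x*_{1})*log(*x*_{3}); only the data would change.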

**Anthony Atkinson** ( *London School of Economics and Political Science* )

This is an interesting paper which tackles important problems. In my comments I concentrate on regression models: the points extend to the more complicated models at the centre of the authors’ presentation.

It is stressed in Section 7.1 that information criteria assume a replication of the observations; in regression this would be with the same *X*-matrix. But, the simulations of Atkinson (1980) showed that, to predict over a different region, higher values of the penalty coefficient than two in equation (36) are needed. Do the authors know of any analytical results in this area?

Information criteria for model selection are based on aggregate statistics. Fig. 4 shows an alternative and more informative breakdown of one criterion into the contributions of individual observations than that given by Weisberg (1981). However, it does not show the effect of the deletion of observations on model choice. Atkinson and Riani (2000) used the forward search to analyse the stack loss data, for which symmetrical error distributions were considered in Section 8.2. Their Fig. 4.28 shows that the square-root transformation is the only one supported by all the data. The forward plot of residuals, Fig. 3.27, is stable, with observations 4 and 21 outlying. This diagnostic technique complements the choice of a model using information criteria calculated over a set of models that is too narrow.

An example of model choice potentially confounded by the presence of several outliers is provided by 108 observations on the survival of patients following liver surgery from Neter *et al.* (1996), pages 334 and 438. There are four explanatory variables. Fig. 6 shows the evolution of the added variable *t*-tests for the variables during the forward search with log(survival time) as the response: the evidence for the importance of all variables except *x*_{4} increases steadily during the search. Atkinson and Riani (2002) modify the data to produce two different effects. The forward plots of the *t*-tests in Fig. 7(a) show that now *x*_{1} is non-significant at the end of the search. The plot identifies the group of modified observations which have this effect on the *t*-test for *x*_{1}. Fig. 7(b) shows the effect of a different contamination, which makes *x*_{4} significant at the end of the search.

The use of information criteria in the selection of models is a first step, which needs to be complemented by diagnostic tests and plots. These examples show that the forward search is an extremely powerful tool for this purpose. It also requires many fits of the model to subsets of the data. Can it be combined with the appreciable computations of the authors’ Markov chain Monte Carlo methods?
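The forward search Atkinson describes can be sketched generically: fit to a small, hopefully outlier-free subset, then repeatedly enlarge the subset with the best-fitting observations while monitoring a statistic along the way. A minimal least squares version (my own simplification of Atkinson and Riani's procedure; the initialization is deliberately crude):

```python
import numpy as np

def forward_search(X, y, m0=None):
    """Track, for each subset size, the max absolute residual of excluded points."""
    n, p = X.shape
    m0 = m0 or (p + 1)
    # crude initial subset: the m0 observations best fitted by a full-data fit
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    subset = np.argsort(np.abs(y - X @ beta))[:m0]
    monitor = []
    for m in range(m0, n + 1):
        beta, *_ = np.linalg.lstsq(X[subset], y[subset], rcond=None)
        r = np.abs(y - X @ beta)
        outside = np.setdiff1d(np.arange(n), subset)
        if outside.size:
            monitor.append(r[outside].max())   # statistic followed along the search
        subset = np.argsort(r)[: m + 1]        # grow: the m+1 best-fitting points
    return monitor

# contaminated data: a clean line plus 5 gross outliers that should enter last
rng = np.random.default_rng(6)
X = np.column_stack([np.ones(60), np.linspace(0.0, 1.0, 60)])
y = X @ np.array([1.0, 2.0]) + rng.normal(0.0, 0.05, 60)
y[:5] += 10.0                                  # the outliers
monitor = forward_search(X, y)
```

In place of the maximum excluded residual one would monitor *t*-statistics or transformation scores, as in Atkinson's Figs 6 and 7; the many repeated fits are exactly the computational burden his closing question raises for MCMC-based models.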

**A. P. Dawid** ( *University College London* )

This paper should have been titled ‘Measures of Bayesian model complexity and fit’, for it is the models, not the measures, that are Bayesian. Once the ingredients of a problem have been specified, any relevant question has a unique Bayesian answer. Bayesian methodology should focus on specification issues or on ways of calculating or approximating the answer. Nothing else is required.

Classical criteria overfit complex models, necessitating some form of penalization, and this paper lies firmly in that tradition. But with Bayesian techniques (Kass and Raftery, 1995) overfitting is not a problem: the marginal likelihood automatically penalizes model complexity without any need for further adjustment. In particular, Bayesian model choice is consistent in the ‘good model’ case (Dawid, 1992a). In Section 9.2.5 the authors brush aside the failure of their deviance information criterion procedure to share this consistency property; but should we not seek reassurance that a procedure performs well in those simple cases for which its performance can be readily assessed, before trusting it on more complex problems?

I contest the view (Section 9.1.3) that likelihood is relevant only under the good model assumption: from a decision theoretic perspective, we can always regard the ‘log-loss’ scoring rule *S*(*p*,*y*):=− log {*p*(*y*)} as a measure of the inadequacy of an assessed density *p*(·) in the light of empirical data *y* (Dawid, 1986). Moreover, when *y* is a sequence *y*^{n}=(*y*_{1},…,*y*_{n}) of not necessarily independent or identically distributed variables, we have

− log {*p*(*y*^{n})} = −∑_{i=1}^{n} log {*p*(*y*_{i}|*y*^{i−1})}, (41)

the *i*th term measuring the performance of the Bayesian probability forecast for *y*_{i} on the basis of analysis of earlier data only (Cowell *et al.* (1999), chapters 10 and 11). This representation clearly demonstrates why unadjusted marginal likelihood offers a valid measure of model fit: each ‘test’ observation *y*_{i} is always entirely disjoint from the associated ‘training’ data *y*^{i−1}. If desired, we can generalize this prequential formulation of marginal likelihood by inserting other loss functions (Dawid, 1992b) or using other model fitting methods (Skouras and Dawid, 1999). Such procedures exhibit a natural consistency property even under model misspecification (Dawid, 1991; Skouras and Dawid, 2000).
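Dawid's prequential identity can be verified numerically in a conjugate example: the log marginal likelihood of a normal location model equals the sum of one-step-ahead log predictive densities. A sketch (the model and numbers are mine, chosen so that both sides are available in closed form):

```python
import numpy as np

rng = np.random.default_rng(7)
n, tau2 = 30, 4.0                    # y_i ~ N(mu, 1), prior mu ~ N(0, tau2)
y = rng.normal(1.0, 1.0, n)

# right-hand side: sum of log one-step-ahead predictive densities
m, v = 0.0, tau2                     # current posterior mean and variance of mu
preq = 0.0
for yi in y:
    s2 = v + 1.0                     # predictive variance of the next observation
    preq += -0.5 * (np.log(2 * np.pi * s2) + (yi - m) ** 2 / s2)
    v_new = 1.0 / (1.0 / v + 1.0)    # conjugate posterior update
    m = v_new * (m / v + yi)
    v = v_new

# left-hand side: log marginal likelihood, y ~ N(0, I + tau2 * 11')
S = np.eye(n) + tau2 * np.ones((n, n))
sign, logdet = np.linalg.slogdet(S)
marg = -0.5 * (n * np.log(2 * np.pi) + logdet + y @ np.linalg.solve(S, y))
# preq and marg agree to numerical precision
```

Each term in the loop scores *y*_{i} using only earlier data, which is exactly why the marginal likelihood needs no complexity adjustment: the ‘test’ point is never part of its own ‘training’ set.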

One place where a Bayesian might want a measure of model complexity is as a substitute for *p* in the Bayes information criterion approximation to marginal likelihood, e.g. for hierarchical models. But in such cases the definition of the sample size *n* can be just as problematic as that of the model dimension *p*. What we need is a better substitute for the whole term *p* log (*n*).

**Andrew Lawson and Allan Clark** ( *University of Aberdeen* )

We would like to make several comments on this excellent paper.

Our prime concern here is the fact that the deviance information criterion DIC is not designed to provide a sensible measure of model complexity when the parameters in the model take the form of locations in some ℛ-dimensional space. In the spatial context, this could mean the locations of cluster centres or, more generally, the components of a mixture. Clearly the averaging of parameters in these contexts is nonsensical, but it is a fundamental ingredient of DIC's penalty term *p*_{D}. Even if an alternative measure of central tendency is used, it remains inappropriate to average over configurations where locations in the chosen space are parameters (e.g. cluster detection modelling in spatial epidemiology (McKeague and Loiseaux, 2002; Gangnon and Clayton, 2002)). In the case of the Bayes information criterion, however, it might be possible to replace the penalty *p* ln (*n*), where *p* is the number of parameters and *n* the sample size, by *p̄* ln (*n*), where *p̄* is an average number of parameters (in a reversible jump context). This would at least approximately accommodate the varying dimension but would not require the averaging of parameters (as DIC does). This was suggested in Lawson (2000).

The second point of concern is the relationship of the goodness of fit to convergence of the Markov chain Monte Carlo samplers for which DIC is designed. If posterior marginal distributions are multimodal then conventional convergence diagnostics will fail (as they will usually find too much variability in individual chains), and DIC will also average over the modes.

We are also somewhat concerned and puzzled by the results for the Scottish lip cancer data set. In Table 1, excepting the saturated model, the largest penalty terms are for the exchangeable model and not those with either spatial or spatial and exchangeable components. We also note that it is not strictly appropriate to fit a spatial-only model without the exchangeable component.

Finally we note that alternative approaches have recently been proposed (Plummer, 2002).

**José M. Bernardo** ( *Universitat de València* )

This interesting paper discusses rather polemic issues and offers some reasonable suggestions. I shall limit my comments to some points which could benefit from further analysis.

- (a)
The authors point out that their proposal is not invariant under reparameterization and show that differences may be large. The use of the median would make the result invariant in one dimension, but it is not trivial to extend this to many dimensions. An attractive, general invariant estimator is the *intrinsic* estimator obtained by minimizing the *reference* posterior expectation of the intrinsic loss (Bernardo and Suarez, 2002), defined as the *minimum* logarithmic divergence between *p*(*x*|*θ̃*) and *p*(*x*|*θ*). Under regularity conditions and moderate or large samples, this is well approximated by (*E*[*θ*|**x**]+*M*[*θ*|**x**])/2, the average of the reference posterior mean and mode. Other invariant estimators may be obtained by minimizing the posterior expectation of the intrinsic loss under either a proper subjective prior or an improper prior which, like the reference prior, is obtained from an algorithm that is invariant under reparameterization.

- (b)
The authors use ‘essentially flat’ or ‘weakly informative’ priors, i.e. conjugate-like priors with very small parameter values. This is dangerous and is *not* recommended. There is no reason to believe that those priors are weakly informative on the parameters of interest. Indeed, these limiting proper priors can have hidden undesirable features such as strong biases (cf. the Stein paradox). Moreover, they may approximate a prior function which would result in an improper posterior and using a ‘vague’ proper prior in that case does not solve the problem; the answer will then typically be extremely sensitive to the hyperparameters chosen for the vague proper prior and, since the Markov chain Monte Carlo algorithm will converge because the posteriors are guaranteed to be proper, one might not notice anything wrong. If full, credible, subjective elicitation is not possible then one should use formal methods to derive an appropriate reference prior.

- (c)
The authors’ brief comment (in Section 9.2.4) on the calibration of the deviance information criterion DIC is too short to offer guidance. With Bayes factors, we have a direct interpretation of the numbers obtained. The Bayesian reference criterion (Bernardo, 1999) is defined in terms of natural information units (and may also be described in terms of log-odds). Is there a natural interpretation for DIC?

- (d)
The important particular case of nested models is not discussed in the paper. Would the authors comment on the behaviour of DIC in that case (and hence on its implications for precise hypothesis testing)? For instance, what is DIC's recommendation for the simple canonical problem of testing a value for a normal mean? It seems to me that, like Akaike's information criterion or the Bayesian reference criterion (but not the Bayes information criterion or Bayes factors), DIC would avoid Lindley's paradox. Is this so?

**Sujit K. Sahu** ( *University of Southampton* )

This impressive paper shows how the very complicated business of model complexity can be assessed easily by using Markov chain Monte Carlo methods. My comments mostly concern the foundational aspects of the methods proposed and the interrelationship of the deviance information criterion DIC and other Bayesian model selection criteria.

The paper provides a long list of models and the associated *p*_{D}, the effective number of parameters. In each of these cases *p*_{D} is interpreted nicely in terms of model quantities. However, there is an unappealing feature of *p*_{D} that I would like to point out in the discussion below.

Consider the set-up leading to equation (23). Assume further that *A*_{1}=**1**, *C*_{1}=1 and *C*_{2}=*τ*^{2}. Thus the likelihood is *N*(*θ*,1) and the prior is *N*(0,*τ*^{2}). Then equation (23) yields *p*_{D}=*nτ*^{2}/(*nτ*^{2}+1).

Assuming *τ*^{2} to be finite, it is seen that *p*_{D} increases to 1 as *n*→∞. The unappealing point is that the effective number of parameters is larger for larger sample sizes; conventional intuition suggests otherwise. The number of unknowns (i.e. the effective number of parameters) should decrease as more data are obtained under this very simple static model. In spite of the authors’ views on asymptotics or consistency, this point deserves further explanation, as it is valid even when small sample sizes are considered.
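This limiting behaviour is easy to reproduce by simulation. A sketch (mine, not Sahu's): for *y*_{1},…,*y*_{n}∼*N*(*θ*,1) with *θ*∼*N*(0,*τ*^{2}), computing *p*_{D} as the posterior mean deviance minus the deviance at the posterior mean gives *n* var(*θ*|*y*)=*nτ*^{2}/(*nτ*^{2}+1), which indeed climbs towards 1 as *n* grows.

```python
import numpy as np

def p_d_monte_carlo(y, tau2, n_draws=200_000, seed=0):
    """p_D = E[D(theta) | y] - D(posterior mean), with D(theta) = sum_i (y_i - theta)^2."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, float)
    n = len(y)
    post_var = 1.0 / (n + 1.0 / tau2)          # posterior is N(post_mean, post_var)
    post_mean = post_var * y.sum()
    theta = rng.normal(post_mean, np.sqrt(post_var), n_draws)
    mean_dev = np.mean(np.sum((y[:, None] - theta) ** 2, axis=0))
    dev_at_mean = np.sum((y - post_mean) ** 2)
    return mean_dev - dev_at_mean

def p_d_exact(n, tau2):
    """Closed form: n * posterior variance."""
    return n * tau2 / (n * tau2 + 1.0)
```

For example, `p_d_exact(5, 1.0)` is 5/6 while `p_d_exact(500, 1.0)` is close to 1: more data, more effective parameters, which is exactly the unappealing feature noted above.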

In Section 9.1 the relationship between DIC and other well-known Bayesian model selection criteria including the Bayes factor is discussed. Although DIC is not to be viewed as a formal model choice criterion (according to the authors), it is often (and it will be) used to perform model selection; see for example the references cited by the authors. In this regard a more precise statement about the relationship between the Bayes factor and DIC can be made. I illustrate this with the above simple example taken from the paper.

Assume that the observation model is *N*(*θ*,1) and the prior for *θ* is *N*(0,*τ*^{2}). Suppose that model 0 specifies *H*_{0}:*θ*=0 and model 1 says *H*_{1}:*θ*≠0. I assume that both *n* and *τ*^{2} are finite and thus avoid the problems with interpretation of the Bayes factor and Lindley's paradox. Writing *ȳ* for the sample mean, model 0 will be selected by the Bayes factor if

*n*ȳ^{2}<{(1+*nτ*^{2})/*nτ*^{2}} log (1+*nτ*^{2}).

In contrast, DIC selects model 0 if

*n*ȳ^{2}<2(1+*nτ*^{2})/(2+*nτ*^{2}).

Clearly, if DIC selects model 0 then the Bayes factor will also select model 0. It is also observed that the Bayes factor allows higher values of *n*ȳ^{2} without rejecting the simpler model. In effect DIC is seen to share the much discussed poor behaviour of a conventional significance test, which criticizes the simpler null hypothesis too much and often rejects it when it should not.
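This comparison can be made concrete numerically. A sketch of the set-up above (my derivation, taking the deviance to be Σ_{i}(*x*_{i}−*θ*)^{2} with the plug-in at the posterior mean, and writing *w*=*nτ*^{2}/(1+*nτ*^{2}) for the shrinkage factor): scanning over the statistic *n*ȳ^{2}, whenever DIC prefers *H*_{0} the Bayes factor does too, but there is a band of values where DIC already rejects *H*_{0} while the Bayes factor still retains it.

```python
import numpy as np

def prefers_h0(n, tau2, ybar):
    """x_i ~ N(theta, 1), i = 1, ..., n; H0: theta = 0 vs H1: theta ~ N(0, tau2).
    Returns (Bayes factor prefers H0, DIC prefers H0)."""
    w = n * tau2 / (1.0 + n * tau2)               # shrinkage factor; equals p_D under H1
    z2 = n * ybar ** 2
    log_bf01 = 0.5 * np.log(1.0 + n * tau2) - 0.5 * w * z2
    dic1_minus_dic0 = -w * (2.0 - w) * z2 + 2.0 * w
    return log_bf01 > 0.0, dic1_minus_dic0 > 0.0

# Scan the sample mean for n = 50, tau2 = 1:
results = [prefers_h0(50, 1.0, yb) for yb in np.linspace(0.0, 1.0, 401)]
```

In this scan the DIC rejection threshold for *n*ȳ^{2} sits well below the Bayes factor threshold, illustrating Sahu's point that DIC behaves like a conventional significance test.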

**Sylvia Richardson** ( *Imperial College School of Medicine, London* )

I restrict my comments on this far-reaching paper to the use of the deviance information criterion DIC for choosing within a family of models and the behaviour of *p*_{D} as a penalization.

My first remark concerns the spatial example of Section 8. The DIC-values for the ‘spatial’ and the ‘spatial plus exchangeable’ models are nearly identical. Thus, the authors resort to external pragmatic considerations for preferring the simpler model, while the more complex one is not penalized.

Two cases of Gaussian mixtures were simulated (one replication): a well-separated bimodal mixture (bimod), 0.5 *N*(−1.5,0.5)+0.5 *N*(1.5,0.5), and an overlapping skewed bimodal mixture (skew): 0.75 *N*(0, 1) + 0.25 *N*(1.5, 0.33), each with 200 data points.

In the clear-cut bimod case, DIC(*k*) is lower for *k*=2, with a small incremental increase in both *E*(*D*|*y*,*k*) and *p*_{D} as extra components are being fitted (Table 5). In the more challenging skew case, the pattern of DIC-values shows that this data set requires more than two components to be adequately fitted, but the values of DIC and *p*_{D} stay surprisingly flat between three and six components. Note that the predictive density plots conditional on *k*=3,4,5 are completely superimposed (Fig. 8), indicating that more than three components can be considered as overfitting the data, in the sense that they give alternative explanations that are no better but involve increasing numbers of parameters.

Table 5. Performance of DIC for mixture models with different numbers of components

| | *k*=2 | *k*=3 | *k*=4 | *k*=5 | *k*=6 |
|---|---|---|---|---|---|
| *Bimod (n=200)* | | | | | |
| DIC(*k*) | 566.7 | 567.7 | 568.5 | 569.2 | 570.0 |
| *E*(*D*\|*y*,*k*) | 563.4 | 563.7 | 564.1 | 564.5 | 565.0 |
| *p*_{D} | 3.3 | 4.0 | 4.4 | 4.7 | 5.0 |
| *Skew (n=200)* | | | | | |
| DIC(*k*) | 545.5 | 535.9 | 535.5 | 535.7 | 535.8 |
| *E*(*D*\|*y*,*k*) | 540.3 | 530.1 | 530.0 | 530.2 | 530.4 |
| *p*_{D} | 5.2 | 5.8 | 5.5 | 5.5 | 5.4 |
| *North–south (n=94)* | | | | | |
| DIC(*k*) | 110.5 | 110.9 | 110.9 | 110.5 | 110.8 |
| *E*(*D*\|*y*,*k*) | 94.2 | 91.9 | 89.6 | 87.7 | 86.2 |
| *p*_{D} | 16.3 | 19.0 | 21.3 | 22.8 | 24.6 |

The second situation is that of spatial mixture models proposed in Green and Richardson (2002) in the context of disease mapping. DIC was calculated by focusing on area-specific risk. Referring, for example, to the simple north–south (two-component) contrast defined in that paper, we find that DIC stays stable as *k* increases, decreasing *E*(*D*|*y*,*k*) values being compensated by increasing *p*_{D}. On the basis of a mean-square error criterion between the estimated and the underlying risk surface, a deterioration of the fit would be seen with values of 0.14, 0.15 and 0.16 for *k*=2,3,4 respectively.

Thus *p*_{D} acts as a sufficient penalization only in the simplest case. In other cases, DIC does not distinguish between alternative fits with increasing number of parameters.

**Peter Green** ( *University of Bristol* )

I have two rather simple comments on this interesting, important and long-awaited paper.

The first concerns using basic distribution theory to give a surprising new perspective on *p*_{D} in the normal case, perhaps identifying a missed opportunity in exposition.

Consider first a decomposition of data as focus plus noise,

*Y*=*X*+*Z*,

where *X* and *Z* are independent *n*-vectors, normally distributed with fixed means and variances, and var(*Z*) is non-singular. The deviance is, up to an additive constant,

*D*(*X*)=(*Y*−*X*)^{T} var(*Z*)^{−1}(*Y*−*X*),

and so

- *p*_{D}=tr{var(*Z*)^{−1} var(*X*|*Y*)}, (42)

using the standard expression for the expectation of a quadratic form. Several results in the paper have this form, possibly in disguise. However,

var(*X*|*Y*)=var(*X*)−var(*X*) var(*Y*)^{−1} var(*X*),

yielding the much more easily interpretable

- *p*_{D}=tr{var(*Y*)^{−1} var(*X*)}. (43)

This allows a very clean derivation of the examples in Sections 2.5 and 4.1–4.3; for example, in the Lindley and Smith model it leads directly to equation (21) of the paper.

Turning now to hierarchical models, consider a decomposition into *k* independent terms,

*Y*=*Z*_{1}+*Z*_{2}+⋯+*Z*_{k},

where all the *Z*_{i} are normal and var(*Z*_{k}) is non-singular. These represent all the various terms of the model: fixed effects with priors, random effects with different structures, errors at various levels; again all means and variances are fixed. Then for any level *l*=1,2,…,*k*−1 we may take the sum of the first *l* terms as the focus and the rest as noise.

Version (42) of *p*_{D} above is then not very promising,

*p*_{D}=tr[var(Σ_{i>l} *Z*_{i})^{−1} var(Σ_{i≤l} *Z*_{i}|*Y*)],

but expression (43) gives the more compelling

- *p*_{D}=Σ_{i=1}^{l} tr{var(*Y*)^{−1} var(*Z*_{i})}. (44)

Thus *p*_{D} has generated a decomposition of the overall degrees of freedom *n*=Σ_{l} tr{var(*Y*)^{−1}var(*Z*_{l})} into non-negative terms attributable to the levels *l*=1,2,…,*k*, just as in frequentist nested model analysis of variance. (We must take care with improper priors in using expression (44), and terms should be treated as limits as precisions go to 0.) Of course, expressions (43) and (44) fail to hold with unknown variances or with non-normal models, but the observations above do provide further motivation for accepting *p*_{D} as a measure of complexity, and suggest exploring more thoroughly its role in hierarchical models.
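The equivalence of the conditional-variance and marginal-variance trace forms, and the fact that the focus and noise terms together account for all *n* degrees of freedom, can be checked with a few lines of linear algebra. A sketch (mine) on randomly generated covariance matrices, using the identity tr{var(*Z*)^{−1} var(*X*|*Y*)}=tr{var(*Y*)^{−1} var(*X*)} for independent normal *X* and *Z* with *Y*=*X*+*Z*:

```python
import numpy as np

def random_spd(rng, n):
    a = rng.normal(size=(n, n))
    return a @ a.T + n * np.eye(n)   # well-conditioned symmetric positive definite matrix

rng = np.random.default_rng(7)
n = 6
var_x = random_spd(rng, n)           # focus X
var_z = random_spd(rng, n)           # noise Z, independent of X
var_y = var_x + var_z                # Y = X + Z

# Conditional covariance of the focus given the data (joint normality):
var_x_given_y = var_x - var_x @ np.linalg.solve(var_y, var_x)

p_d_conditional = np.trace(np.linalg.solve(var_z, var_x_given_y))  # form via var(X | Y)
p_d_marginal = np.trace(np.linalg.solve(var_y, var_x))             # form via var(X)

# Degrees of freedom attributed to focus and noise sum to n:
total = p_d_marginal + np.trace(np.linalg.solve(var_y, var_z))
```

The two forms agree numerically, and `total` equals *n* exactly, since tr{var(*Y*)^{−1}(var(*X*)+var(*Z*))}=tr(*I*_{n}).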

My second point notes that the paper has no examples with discrete ‘parameters’. Conditional distributions in hierarchical models with purely categorical variables can be computed by using probability propagation methods (Lauritzen and Spiegelhalter, 1988), avoiding Markov chain Monte Carlo methods, so that *p*_{D} is again a cheap local computation. Presumably marginal posterior modes would be used for the plug-in estimate. Certainly this is a context where *p*_{D} can be negative. Can connections be drawn with existing model criticism criteria in probabilistic expert systems?

The following contributions were received in writing after the meeting.

**Kenneth P. Burnham** ( *US Geological Survey and Colorado State University, Fort Collins* )

This paper is an impressive contribution to the literature and I congratulate the authors on their achievements therein. My comments focus on the model selection aspect of the deviance information criterion DIC. My perspectives on model selection are given in Burnham and Anderson (2002), which has a focus on the Akaike information criterion AIC as derived from Kullback–Leibler information theory. A lesson that we learned was that, if the sample size *n* is small or the number of estimated parameters *p* is large relative to *n*, a modified AIC should be used, such as AIC_{c}=AIC+2*p*(*p*+1)/(*n*−*p*−1). I wonder whether DIC needs such a modification or if it really automatically adjusts for a small sample size or large *p*, relative to *n*. This would be a useful issue for the authors to explore in detail.
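For reference, the small-sample correction quoted above is trivial to compute; a sketch of the AIC_{c} formula from Burnham and Anderson (2002):

```python
def aic_c(aic, p, n):
    """Small-sample corrected AIC; requires n > p + 1. The correction
    2p(p + 1)/(n - p - 1) vanishes as n grows large relative to p."""
    if n <= p + 1:
        raise ValueError("AICc requires n > p + 1")
    return aic + 2.0 * p * (p + 1) / (n - p - 1)
```

With *p*=10 and *n*=30 the correction adds 220/19 ≈ 11.6 to AIC, easily enough to reorder candidate models; with *n*=3000 it adds less than 0.1, and AIC_{c} and AIC agree. The open question is whether DIC needs an analogous adjustment.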

At a deeper level I maintain that model selection should be multimodel inference rather than just inference based on a single best model. Thus, model selection to me has become the computation of a set of model weights (probabilities in a Bayesian approach), based on the data and the set of models, that sum to 1. Given these weights and the fitted models (or posterior distributions), model selection uncertainty can be assessed and model-averaged inferences made. The authors clearly have this issue in mind as demonstrated by the last sentence of Section 9.1.3. I urge them to pursue this much more general implementation of model selection and to seek a theoretical or empirical basis for it with DIC.

There is a matter that I am confused about. The authors say ‘… we essentially reduce all models to non-hierarchical structures’ (third page), and ‘Strictly speaking, nuisance parameters should first be integrated out …’ (Section 9.2.3). Does this mean that we cannot make full inferences about models with random effects? Can DIC be applied to random-effects models? It seems so on the basis of their lip cancer example (Section 8.1). Can I have a model with fixed effects *τ*, random effects *φ*_{1},…,*φ*_{k} with postulated distribution *g*(*φ*|*θ*), and *θ* as fixed effects (plus priors on all fixed effects), and have my focus be all of *τ*, *φ* and *θ*? Thus I obtain shrinkage-type inferences about the *φ*_{i}; I do not integrate out the *φ* (AIC has been adapted to this usage).

The authors make a point (page 612) that I wish to make more strongly. It will usually not be appropriate to ‘choose’ a single model. Unfortunately, standard statistical model selection has been to select a single model and to ignore any selection uncertainty in the subsequent inferences.

**Maria DeIorio** ( *University of Oxford* ) **and Christian P. Robert** ( *Université Paris Dauphine* )

Amidst the wide scope of possible extensions of their paper, the authors mention the case of mixtures,

*p*(*x*|*θ*)=Σ_{j=1}^{k} *p*_{j} *f*(*x*|*θ*_{j}),

which is quite interesting, as it illustrates the versatility of the deviance information criterion DIC under different representations of the same model.

In this set-up, if the *p*_{j}s are known, the associated *completed* likelihood is

- *L*(*θ*|**x**,**z**)=∏_{i=1}^{n} *p*_{z_{i}} *f*(*x*_{i}|*θ*_{z_{i}}). (45)

Therefore, conditional on the latent variables **z**=(*z*_{1},…,*z*_{n}), and setting the saturated deviance *f*(*x*) to 1, we can define a conditional criterion DIC(**z**) from likelihood (45), with the plug-in estimates computed conditional on **z** (under proper identifiability constraints; see Celeux *et al.* (2000)). The *integrated* DIC is then

DIC_{1}=Σ_{**z**} DIC(**z**) Pr(**z**|**x**),

where Pr(**z**|**x**) can be approximated (Casella *et al.*, 1999).

A second possibility is the *observed* DIC, DIC_{2}, based on the observed likelihood, which does not use the latent variables **z**. (We note the strong dependence of DIC on the choice of the saturated function *f* and the corresponding lack of clear guidance outside exponential families. For instance, if *f*(*x*_{i}) goes from the marginal density to the extreme alternative where both *θ*_{1} and *θ*_{2} are set equal to *x*_{i}, DIC_{2} goes from −31.71 to 166.6 in the following example.)

A third possibility is the *full* DIC, DIC_{3}, based on the completed likelihood (45) when it incorporates **z** as an additional parameter, in which case the saturated deviance could be the normal standardized deviance, although we still use *f*(*x*)=1 for comparison.

The three possibilities above lead to rather different figures, as shown by Table 6 for the simulated data set in Fig. 9; Table 6 exhibits in addition a lack of clear domination of the mixture (*k*=2) *versus* the normal distribution (*k*=1) (second column), except when **z** is set to its true value (third column) or estimated (last column). Note that, for the full DIC, *p*_{D} is far from 102; this may be because, for some combinations of **z**, the likelihood is the same. (This also relates to the fact that **z** is not a parameter in the classical sense.)

Table 6. Comparison of the three different criteria DIC_{1}, DIC_{2} and DIC_{3} for a simulated sample of 100 observations from 0.5 *N*(5,1.5)+0.5 *N*(7.5,8) with a conjugate prior *θ*_{1}∼*N*(4,5) and *θ*_{2}∼*N*(8,5), of DIC based on the true complete sample (**x**,**z**), and of DIC for the single-component normal model (with an *N*(6,5) prior and the variance set to 6.07)

| | *Normal* (*k*=1) | *Complete,* [*DIC*\|**z**] | *Integrated,* DIC_{1} | *Observed,* DIC_{2} | *Full,* DIC_{3} |
|---|---|---|---|---|---|
| DIC | 465.1 | 413.5 | 462.6 | 457.6 | 447.4 |
| ΔDIC | — | −51.6 | −2.5 | −7.5 | −17.6 |
| *p*_{D} | 0.99 | 1.96 | 2.27 | 1.98 | 28.06 |

**David Draper** ( *University of California, Santa Cruz* )

The authors of this interesting paper talk about Bayesian model assessment, comparison and fit, but—if their work is to be put seriously to practical use—the real point of the paper is Bayesian model choice: we are encouraged to pick the model with the smallest deviance information criterion DIC among the class of ‘good’ models (those which are ‘adequate candidates for explaining the observations’). (It is implicit that somehow this class has been previously specified by means that are not addressed here—would the authors comment on how this set of models is to be identified in general?) However, in the case of model selection it would seem self-evident that *to choose a model you have to say to what purpose the model will be put*, for how else will you know whether your model is sufficiently good? We can, perhaps, use DIC to say that model 2 is better than model 1, and we can, perhaps, compare *p*_{D} with ‘the number of free parameters in *θ*’ to ‘check the overall goodness of fit’ of model 2, but we cannot use the authors’ methods to say whether model 2 is sufficiently good, because the real world definition of this concept has not been incorporated into their methods. It seems hard to escape the fact that specifying the purpose to which a model will be put demands a decision theoretic basis for model choice; thus (Draper, 1999) I am firmly in the camp of Key *et al.* (1999).

See Draper and Fouskakis (2000) and Fouskakis and Draper (2002) for an example from health policy that puts this approach into practice, as follows. Most attempts at variable selection in generalized linear models conduct what might be termed a benefit-only analysis, in which a subset of the available predictors is chosen solely on the basis of predictive accuracy. However, if the purpose of the modelling is to create a scale that will be used—in an environment of constrained costs, which is frequently the case—to make predictions of outcome values for future observations, then the model selection process must seek a subset of predictors which trades off predictive accuracy against data collection cost. We use stochastic optimization methods to maximize the expected utility in a decision theoretic framework in the space of all 2^{p} possible subsets (for *p* of the order of 100), and because our predictors vary widely in how much they cost to collect (which will also often be true in practice) we obtain subsets which are sharply different from (and much better than) those identified by benefit-only methods for performing ‘optimal’ variable selection in regression, including DIC.
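The benefit-versus-cost idea can be sketched in a few lines. The toy version below is mine, not Fouskakis and Draper's actual procedure: it does an exhaustive search over a handful of predictors with hypothetical per-predictor costs and a made-up trade-off rate `lam`, whereas their application requires stochastic optimization over 2^{p} subsets for *p* of the order of 100.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 4
X = rng.normal(size=(n, p))
beta = np.array([2.0, 1.0, 0.0, 0.0])      # only the first two predictors matter
y = X @ beta + rng.normal(size=n)
cost = np.array([1.0, 5.0, 0.5, 0.5])      # hypothetical data-collection cost per predictor
lam = 0.02                                 # assumed price of one unit of cost

train, test = slice(0, 100), slice(100, None)

def utility(subset):
    """Held-out predictive accuracy minus data-collection cost."""
    cols = list(subset)
    if cols:
        coef, *_ = np.linalg.lstsq(X[train][:, cols], y[train], rcond=None)
        pred = X[test][:, cols] @ coef
    else:
        pred = np.full(100, y[train].mean())   # intercept-only baseline
    mse = np.mean((y[test] - pred) ** 2)
    return -mse - lam * cost[cols].sum()

subsets = [s for k in range(p + 1) for s in itertools.combinations(range(p), k)]
best = max(subsets, key=utility)               # benefit-cost optimum over all 2^p subsets
```

A benefit-only criterion would rank subsets by `mse` alone; raising `lam` shifts the optimum towards cheaper subsets, which is precisely how the chosen predictors come to differ from those of 'optimal' benefit-only selection.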

**Alan E. Gelfand** ( *Duke University, Durham* ) **and Matilde Trevisani** ( *University of Trieste* )

The authors’ generally informal approach motivates several remarks which we can only briefly develop here. First, in Section 2.1, we think that better terminology would be ‘focused on *p*(*y*|*θ*)’ with ‘interest in the models for *θ*’, as, for example, in Section 8.1, where there is no *θ* in the likelihood for any of the given models. Even the example in Section 8.2, where *θ* does not change across models, emphasizes the focus on *p*(*y*|*θ*), since *f*(*y*) depends on the choice of *p*. So, here, a relative comparison of the models depends on the choices made for the *f*s. Without a clear prescription for *f* (once we leave the exponential family), the opportunity exists to fiddle the support for a model.

Though the functional form of the Bayesian deviance does not depend on *p*(*θ*), DIC and *p*_{D} will. With the authors’ hierarchical specification *p*(*y*|*θ*) *p*(*θ*|*ψ*) *p*(*ψ*), the effective degrees of freedom will depend on *p*(*ψ*). But, also, under this specification, rather than *p*(*y*|*θ*), we can put a different distribution, *p*(*y*|*ψ*), in focus. Again, it seems preferable not to speak in terms of ‘parameters in focus’.

Moreover, since *p*(*y*|*θ*) and *p*(*y*|*ψ*) have the same marginal distribution *p*(*y*), a coherent model choice criterion must provide the same value under either focus. Otherwise, a particular hierarchical specification could be given more or less support according to which distribution we focus on. But let DIC_{1},*p*_{D1} and *f*_{1}(*y*) be associated with *p*(*y*|*θ*) and DIC_{2},*p*_{D2} and *f*_{2}(*y*) with *p*(*y*|*ψ*). To have DIC_{1}=DIC_{2} requires, after some algebra, that

Just as the functional form of *f*_{1}(*y*) depends only on the form of *p*(*y*|*θ*), the form for *f*_{2}(*y*) should depend only on *p*(*y*|*ψ*). Evidently this is not so. For instance, under the authors’ example in expression (2), *f*_{1}(*y*)=0. The above expression yields the non-intuitive choice

where *w*_{i}=*τ*_{i}/(*τ*_{i}+*λ*). This issue is discussed further in Gelfand and Trevisani (2002).

**Jim Hodges** ( *University of Minnesota, Minneapolis* )

This is a most interesting paper, presenting a method of tremendous generality and, as a bonus, a fine survey of related methods. I can think of a dozen models for which I would like to see *p*_{D}, but I shall ask for just one: a balanced one-way random-effects model with unknown between-group precision, in which each group has its own unknown error precision, these latter precisions being modelled as draws from, say, a common gamma distribution with unknown parameters. Thus the precisions will be shrunk as well as the means, and presumably the two kinds of shrinkage will affect each other. The focus could be either the means or the precisions, or preferably both at once.

One thing is troubling: the possibility of a negative measure of complexity (Section 2.6, comment (d)). Hodges and Sargent (2001) is linked (shackled?) to linear model theory, in which complexity is defined as the dimension of the subspace of ℜ^{n} in which the fitted values lie. In our generalization, the fitted values may be restricted to ‘using’ only part of a basis vector's dimension, because they are stochastically constrained by higher levels of the model's hierarchy. (Basing complexity on fitted values may remove the need to specify a focus, although, if true, this is not obvious.) In this context, zero complexity makes sense: the fitted values lie in a space of dimension 0 specified entirely by a degenerate prior. Negative complexity, however, is uninterpretable in these terms. The authors attribute negative complexity to a poor model fit, which suggests that *p*_{D} describes something more than the fitted values’ complexity *per se*. Perhaps the authors could comment further on this.

**Youngjo Lee** ( *Seoul National University* )

It is very interesting to see the Bayesian view of Section 4.2 of Lee and Nelder (1996), which used extended or *h*-likelihood and in which we introduced various test statistics. For a lack of fit of the model we proposed using the scaled deviance

with degrees of freedom *E*(*D*_{r}), estimated as in Sections 4.3 and 5.4 of this paper. We considered a wider class of models, which we called hierarchical generalized linear models (HGLMs) (see also Lee and Nelder (2001a, b)), but some of our proofs hold more widely than this, so that, for example, Section 3.1 of this paper is summarized in our Appendix D, etc. For model complexity the authors define in equation (9) the scaled deviance

*D*_{r} and *D*_{m} are the scaled deviances for the residual and model respectively, whose degrees of freedom add up to the sample size *n*. We are very glad that the authors have pointed out the importance of the parameterization of *θ* in forming deviances. We extended the canonical parameters of Section 5 to arbitrary links by defining the *h*-likelihood on a particular scale of the random parameters, namely one in which they occur linearly in the linear predictor. In HGLMs the degrees of freedom for fixed effects are integers whereas those for random effects are fractions. Thus, a GLM has integer degrees of freedom *p*_{m}=rank(*X*) because is 0 in Section 5, whereas the estimated degrees of freedom of *D*_{m} in HGLMs are fractions. Lee and Nelder (1996) introduced the adjusted profile *h*-likelihood eliminating *θ*, and this can be used to test various structures of the dispersion parameters *λ* discussed in the examples of Section 8: see the model checking plots for the lip cancer data in Lee and Nelder (2001b). Lee and Nelder (2001a) justified the simultaneous elimination of fixed and random nuisance parameters. It will be interesting to have the Bayesian view of the adjusted profile *h*-likelihood.

**Xavier de Luna** ( *Umeå University* )

This interesting paper presents Bayesian measures of model complexity and fit which are useful at different stages of a data analysis. My comments will focus on their use for model selection. In this respect, one of the noticeable contributions of the paper is to propose a Bayesian analogue, the deviance information criterion DIC, of the Akaike information criterion AIC and of Takeuchi's information criterion TIC. Both DIC and TIC are generalizations of AIC. The former may be useful in a Bayesian data analysis, whereas the frequentist criterion TIC has the advantage of not requiring the ‘good model’ assumption discussed by the authors.

Such ‘information-based’ criteria use measures of model complexity (denoted *p*^{*} or *p*_{D} in the paper). It should, however, be emphasized that models can be compared without having to define and compute their complexity. Instead, out-of-sample validation methods, such as cross-validation (Stone, 1974) or prequential tests (Dawid, 1984) can be used in wide generality. Moreover, to use an estimate of *p*^{*} in a model selection criterion, some characteristics of the data-generating mechanism (DGM)—‘true model’ in the paper—must be known. For instance, depending on the DGM either AIC-type or Bayes information type criteria are asymptotically optimal (see Shao (1997) for a formal treatment of linear models). Thus, when little is known about the DGM, out-of-sample validation provides a *formal* and general framework to perform model selection as was presented in de Luna and Skouras (2003), in which accumulated prediction errors (defined with a loss function chosen in accordance with the purpose of the data analysis) were advocated to compare and choose between different model selection strategies. When many models are under scrutiny, out-of-sample validation may be computationally prohibitive and generally yields high variability in the selection of a model. In such cases, different model selection strategies based on *p*^{*} (making—implicitly or explicitly—diverse DGM assumptions) can be applied to reduce the dimension of the selection problem. Accumulated prediction errors can then be used to identify the best strategy while making very few assumptions on the DGM.
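Out-of-sample accumulated prediction error is simple to implement. A minimal sketch (my illustration, with squared-error loss and two rival forecasters standing in for the model selection strategies that de Luna and Skouras compare):

```python
import numpy as np

def accumulated_prediction_error(y, forecast):
    """Sum of one-step-ahead squared errors; forecast(history) -> prediction for the next point."""
    return sum((y[i] - forecast(y[:i])) ** 2 for i in range(1, len(y)))

rng = np.random.default_rng(42)
y = rng.normal(3.0, 1.0, size=300)      # i.i.d. data, so the running mean should win

running_mean = lambda h: h.mean()       # forecaster 1: sample mean of the history
last_value = lambda h: h[-1]            # forecaster 2: last-observation carry-forward

ape_mean = accumulated_prediction_error(y, running_mean)
ape_naive = accumulated_prediction_error(y, last_value)
```

No measure of model complexity is defined or computed anywhere: the forecaster with the smaller accumulated error is preferred, and the loss function can be swapped to match the purpose of the analysis.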

**Xiao-Li Meng** ( *Harvard University, Cambridge, and University of Chicago* )

The summary made me smile, for the ‘mean of the deviance − deviance of the mean’ theme once injected a small dose of excitement into my student life. I was rather intrigued by the ‘cuteness’ of expressions (3.4) and (3.8) of Meng and Rubin (1992), and seeing a Bayesian analogue of our likelihood ratio version certainly brought back fond memories. My excitement back then was short lived, as I quickly realized that all I had found was a consequence of a well-known variance formula. Let *D*(*x*,*μ*)=(*x*−*μ*)^{2} be the deviance, a case of the *realized discrepancy* of Gelman *et al.* (1996); then

- *E*{*D*(*x*,*μ*)}−*D*{*E*(*x*),*μ*}=var(*x*). (46)

Although equation (46) is typically mentioned (with *μ* set to 0) for computational convenience, it is the back-bone of the theme under quadratic or normal approximations, or more generally with log-concave likelihoods, beyond which assumptions become much harder to justify or derive. (Obviously, equation (46) is applicable for posterior or likelihood averaging by switching *x* and *μ*.)
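Equation (46) is indeed elementary to verify; a sketch (mine) with an arbitrary discrete distribution:

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 7.0])                # support of a discrete distribution
p = np.array([0.1, 0.2, 0.3, 0.4])                # probabilities (sum to 1)
mu = 1.5                                          # an arbitrary reference point

mean_of_deviance = np.sum(p * (x - mu) ** 2)      # E[D(x, mu)]
deviance_of_mean = (np.sum(p * x) - mu) ** 2      # D(E[x], mu)
var_x = np.sum(p * x ** 2) - np.sum(p * x) ** 2   # var(x)
# 'mean of the deviance minus deviance of the mean' recovers var(x) exactly
```

Setting `mu = 0.0` reduces this to the textbook identity *E*(*x*^{2})={*E*(*x*)}^{2}+var(*x*).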

Section 1 contained a small puzzle. I wondered why Ye (1998) was omitted from the list of ‘the most ambitious attempts’, because Ye's ‘data derivative’ perspective goes far beyond the independent normal model cited in Section 4.2 (for example, it addresses data mining). It also provides a more original and insightful justification than normal approximations, especially considering that Markov chain Monte Carlo sampling is most needed in cases where such approximations are deemed unacceptable.

Section 2.1 presented a bigger puzzle. The authors undoubtedly would agree that a statement like ‘In hierarchical modelling we cannot uniquely define a ‘‘posterior’’ or ‘‘model complexity’’ without specifying the level of the hierarchy that is the focus of the modelling exercise’ is tautological. Surely the ‘posterior’ and thus the corresponding ‘model complexity’ depend on the level or parameter(s) of interest. So why does the statement become a meaningful motivation when the word posterior is replaced by ‘likelihood’? There is even some irony here, because hierarchical models are models where there are unambiguous and uncontroversial *marginal* likelihoods—both *L*(*θ*|*y*)=*p*(*y*|*θ*) and *L*(*φ*|*y*)=*p*(*y*|*φ*) in Section 2.1 are *likelihoods* in the original sense.

Although limitations on space prevent me from describing my reactions when reading the rest, I do wish that DIC would stick out in the dazzling AIC–TIC alphabet contest, so we would all be less compelled to look for UIC (*unified or useful information criterion*?) ….

The **authors** replied later, in writing, as follows.

We thank all the contributors for their wide-ranging and provocative discussion. Our reply is organized according to a number of recurring themes, but constraints on space mean that it is impossible to address all the points raised. Echoing Brooks's opening remarks, our hope is that discussants and readers will be sufficiently inspired to pursue the ideas proposed in this paper and to address some of the unresolved issues highlighted in the discussion.