As an applied statistician, one of the complaints I make about my more theoretically minded colleagues is that many of my problems that are solved by them in principle are not solved in practice. This complaint is sometimes addressed to Bayesians (Senn, 2011), but frequentists are not exempt (Senn, 1998). (In fact, my subjective impression is that they are frequently worse.) One of the challenges I make to anybody telling me how to do better is ‘don't tell, do’. Unfortunately, I cannot do better than Gelman and Shalizi (2013, henceforth GS) with the examples they provide. Thus, if I am to avoid being hypocritical I have to concede that the solution is beyond (my) criticism.

What I am reduced to doing is raising the hackneyed quibble, ‘it may work in practice but does it work in theory?’. A common claim for Bayesian inference is that it is a theory of everything. Clearly, however, much of what GS are doing is not covered by the standard theory of coherent Bayesian updating of prior to posterior probability statements using data. Model checking (at least) has to be added to the mix. Of course, one has to be careful here; to be Bayesian means many different things to different people, and Jack Good famously determined that Bayesians came in 46,656 varieties (Good, 1983, pp. 20–21). Perhaps there is no standard theory of what it is to be Bayesian.

If, however, model checking is an essential part of the (or a) Bayesian mix, it raises the question as to what the status is of the ‘final’ analysis that GS produce. (For simplicity, I will only consider the analysis of the 2008 voting data but the argument carries over to the more complicated cases.) Consider another Bayesian, one who so firmly believed in the model that GS eventually chose that he or she had no doubts whatsoever as to its veracity. Suppose that his or her prior distribution under the model had been the same as that of GS. The posterior ‘statement’ is now the same: can they both be valid? Well, perhaps in this case, there would be very little to choose between them in terms of validity. This is because in the end GS accepted a rather richer model. The varying intercepts model is a special case of the varying slopes model: in Bayesian terms, one might say that it corresponds to taking the random slopes model but having a completely informative prior that the variance of the slopes is zero. By moving from the intercepts model to the slopes model GS have clearly added some uncertainty (the reverse of what one expects examining data to do according to the Bayesian account!). So perhaps everything is (as it turns out) approximately all right. They were in danger of using a prior distribution that was too informative. Model checking has saved them from the error and because they have not fallen into the trap their inferences are now reasonable.
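
The nesting described above can be written out schematically. The notation here is an illustrative assumption (a generic linear multilevel model with group index j[i] and a single predictor x_i), not the specification that GS actually fitted:

\[
y_i = \alpha_{j[i]} + \beta_{j[i]} x_i + \varepsilon_i, \qquad \alpha_j \sim N(\mu_\alpha, \sigma_\alpha^2), \qquad \beta_j \sim N(\mu_\beta, \sigma_\beta^2).
\]

Under this sketch the varying intercepts model is the case \(\sigma_\beta = 0\), so that \(\beta_j = \mu_\beta\) for every group. In Bayesian terms that amounts to a prior placing all its mass on \(\sigma_\beta = 0\), whereas the varying slopes model allows \(\sigma_\beta > 0\) and so admits additional uncertainty.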

However, things might have turned out differently. GS must believe this is possible, otherwise it is hard to see why they started where they did. Suppose that instead the model checking had revealed little or no problem. In that case they might have been tempted to use the random intercepts model. However, although the data might be compatible with the simpler (intercept) model, they would also be compatible with some parameter values of the richer (slopes) model. Proceeding to use the intercept model as if one always knew it were true must underestimate the uncertainties.

Then I worry about the predictive checks they are undertaking – analogous to what Good (1983) calls the ‘device of imaginary results’. What exactly is going on here? They seem to assume that we agree that there is a strong distinction between prior distribution and data. Is it not supposed to be a strength of the Bayesian approach that prior and data are exchangeable? Consider that notorious example of frequentist irrationality, analysis of sequential trials. Suppose we have a trial with 200 patients and decide to take an interim look after 100 patients. Denote by D1 the data of the first 100, and by D2 the data of the second 100. Let P0 be the prior distribution, P1 the posterior distribution after updating based on D1 alone and P2 the posterior distribution after seeing D1 and D2. Then the following schematic algebra of Bayesian inference applies:

  \[ P_0 + D_1 \rightarrow P_1, \qquad P_1 + D_2 \rightarrow P_2 \tag{1} \]

or

  \[ P_0 + (D_1 + D_2) \rightarrow P_2 \tag{2} \]

In (1) the inference is performed in two steps and P1 carries out the dual role of being the posterior distribution after seeing D1 and the prior distribution before seeing D2. In (2) we see D1 and D2 together and reach P2 without need of P1. What I worry about is how this pans out when model checking is added to the mix. GS take the point of view that the way in which we judge the adequacy of a prior distribution is by constructing predictive distributions for data sets of the same size as the data we have added to the prior. So this seems to imply that in case (1) we compare D1 to P1 and compare D2 to P2 (if the model survives this far!), and in case (2) we compare D1 + D2 to P2. To take the argument further, suppose that we have as many steps as there are data points, in the spirit of Philip Dawid's prequential inference (Dawid, 1997); does GS model checking get us where we want to be?
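
A minimal sketch may make the two routes concrete. It assumes a conjugate Beta-Binomial model with a uniform Beta(1, 1) prior and entirely hypothetical response counts for D1 and D2; none of these numbers come from GS or from any real trial. The check used (an upper-tail posterior predictive probability for a replicate of the same size as the data just added) is likewise only one possible reading of the comparisons described above, not GS's own procedure.

```python
from fractions import Fraction
from math import comb

def update(dist, successes, failures):
    """Conjugate update: Beta(a, b) plus data -> Beta(a + s, b + f)."""
    a, b = dist
    return (a + successes, b + failures)

def rising(x, n):
    """Rising factorial x (x + 1) ... (x + n - 1), computed exactly."""
    out = Fraction(1)
    for i in range(n):
        out *= x + i
    return out

def predictive_pmf(k, n, dist):
    """Beta-binomial probability of k successes in n replicate trials."""
    a, b = dist
    return comb(n, k) * rising(a, k) * rising(b, n - k) / rising(a + b, n)

def upper_tail(k_obs, n, dist):
    """P(replicated successes >= k_obs) for a replicate of size n."""
    return sum(predictive_pmf(k, n, dist) for k in range(k_obs, n + 1))

# Hypothetical inputs (assumptions for illustration only).
P0 = (1, 1)    # uniform prior, Beta(1, 1)
D1 = (60, 40)  # first 100 patients: (responders, non-responders)
D2 = (45, 55)  # second 100 patients

# Route (1): two steps, with P1 serving as posterior and then as prior.
P1 = update(P0, *D1)
P2_two_step = update(P1, *D2)

# Route (2): all 200 patients at once.
P2_one_step = update(P0, D1[0] + D2[0], D1[1] + D2[1])

# Checks as read in the text: D1 against P1 and D2 against P2 in route (1),
# D1 + D2 against P2 in route (2).
checks_route_1 = [upper_tail(D1[0], 100, P1),
                  upper_tail(D2[0], 100, P2_two_step)]
check_route_2 = upper_tail(D1[0] + D2[0], 200, P2_one_step)

print("same posterior by both routes:", P2_two_step == P2_one_step)
print("route (1) checks:", [float(c) for c in checks_route_1])
print("route (2) check:", float(check_route_2))
```

Coherent updating guarantees that the two routes end at the same P2. Nothing in the sketch guarantees that the checks agree, since route (1) carries out two comparisons against two different distributions and route (2) only one.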

Perhaps it does. Perhaps coherent model checking can be added to coherent updating. Perhaps, however, inference is a much messier business than the builders of grand systems suppose. Of course, GS might argue that I am being unfair here, that I am mixing up two kinds of prior probability: the P0 sort, which is not based on local data but on vague notions, and the P1 sort, which is based (partially) on data. However, I suspect that some members of many of the 46,656 tribes would find this a very slippery slope to tread.

So, to sum up, I am convinced that what GS achieve is excellent applied statistics. I am not convinced by their explanation as to why it works. It works in practice, but does it work in theory?

References
