“The Bayesian approach can have a clarifying effect on one's thinking about evidence.” (Koehler and Saks, 1991, p 364)
Traditional training in statistical methods for those who go on to become practicing biological anthropologists has focused primarily on classical hypothesis testing. This is apparent in both textbooks geared toward anthropologists in general (Thomas, 1986; Madrigal, 1998; Bernard, 2011) and specialized texts for biological anthropologists (Slice, 2005; D'Août and Vereecke, 2011). While Bayes' Theorem may be mentioned in passing in introductory statistics courses, this is typically restricted to examples of such limited interest that the student has little motivation to recall the theorem, and even less motivation to assume that there may be future value in having learned about Bayes' Theorem. In Bayesian terms, the prior probability that the student will retain Bayes' Theorem is quite low. In contrast, the student and eventual practitioner is likely to learn about confidence intervals, Type I and Type II errors in hypothesis testing, and P-values, and to blithely assume that what they have learned represents the near totality of what is available and useful within modern statistical practice. This represents an unfortunate omission of Bayesian methods and inference.
Bayesian methods and inference are particularly helpful for creating estimates and uncertainties about those estimates without asymptotic approximation, and for incorporating prior information with data to generate problem-specific distributions in a systematic and logical way. Such methods obey the likelihood principle (unlike classical inference), generate interpretable answers in terms of a probability distribution, readily accommodate missing data and complex parametric models, and allow comparison between models. This is not to say that Bayesian methods and inference are appropriate in all contexts: there is no single best practice for selecting prior distributions, and Bayesian methods often have high computational costs. However, these drawbacks do not explain why biological anthropologists in the Americas have largely chosen to ignore Bayesian methods, while these tools have become popular and useful elsewhere. Courgeau (2012) gives a very complete account of the use of both frequentist and Bayesian methods within the broader social sciences, and McGrayne's (2011) popular history of Bayes' Rule (another name for the theorem) gives insight into why Bayesian methods have only fairly recently come to the fore.
In large measure, the recent increase in Bayesian applications across diverse fields and around the world has occurred because of the development of computer simulation methods and related software (Geyer, 1992; Gilks et al., 1996; Gamerman, 1997; Lunn et al., 2000, 2009; Brooks et al., 2011) that remove the computational burden from the user. Our goals here are to explain Bayesian principles in a way that makes their applicability understandable and straightforward, to provide concrete examples of computer simulations and statistics grounded in Bayes' Theorem that address questions relevant to biological anthropology, and to simultaneously review how Bayesian methods and inference have been used in biological anthropology to date. We first review maximum likelihood estimation to establish some terminology, and then use a simple example of Bayes' postulate (Bayes' Theorem with a uniform prior) to examine: 1) the likelihood, prior and posterior, and 2) differences between highest posterior density (HPD) regions and confidence intervals. We move on to Bayes' Theorem and how it can be used to 1) create new priors (sequential use), 2) generate predictive densities for new samples, 3) evaluate competing models (Bayes factor), and 4) estimate normally distributed parameters. We then delve into computer simulation, Bayesian statistics, and freeware applications, reserving the “nuts and bolts” of different methods for simulating values out of various distributions for the Appendix.
The final sections of the article illustrate various Bayesian methods using published and practical “toy” examples from bioarchaeology and from forensic anthropology. The bioarchaeology examples involve modeling mortality and accounting for uncertainty in age estimates in paleodemography, and using full posterior density distributions to address disease prevalence, specificity, and sensitivity in paleopathology. The forensic anthropology examples use Bayesian methods to address the analysis of commingled remains and issues of identification in closed population mass disasters. The forensics section also includes a discussion of the potential problems that arise when conditional probabilities are transposed in evidentiary settings and when prior probabilities are misinterpreted. We then conclude with a brief review of the frequentist–Bayesian debate and texts that focus on Bayesian inference.
MAXIMUM LIKELIHOOD ESTIMATION
We briefly review maximum likelihood estimation in this section as a prelude to examining Bayesian inference. Throughout this article we use a simple bioanthropological example based on Mays and Faerman's (2001) data on sex identification for 13 infants from two Romano-British cemeteries. The sample size is small because sex identifications were made using ancient DNA (aDNA), but ultimately Mays and Faerman wanted to estimate the proportion of males among all infants from these two sites. The authors suspected that these infant burials were the result of infanticide, and thus that the proportion of males (which can be written as θ) for all infants buried at the two sites would differ from the expected proportion among living neonates. Their aDNA identifications indicated that 9 of the 13 individuals were males and only four were females.
The likelihood of obtaining certain parameter values given observed outcomes lies at the core of statistical inference. If we actually knew the value of θ, then we could find the binomial probability of getting 9 males out of 13 individuals. For example, if θ = 0.5, then that probability is $\binom{13}{9}0.5^{9}(1-0.5)^{4} \approx 0.0873$. However, since we in fact do not know the value of θ, we must estimate it. In maximum likelihood estimation, we can refer to the likelihood of a specific value for a parameter (θ) conditional on the observed data (the fact that 9 of 13 individuals were observed to be males). The likelihood is defined as proportional to the probability of obtaining the data conditional on the specific value of θ, or in the particular case from Mays and Faerman: $L(\theta \mid x = 9, n = 13) \propto P(x = 9 \mid \theta, n = 13)$. This definition of a likelihood follows Fisher's (1922, p 310) succinct definition:
The likelihood that any parameter (or set of parameters) should have any assigned value (or set of values) is proportional to the probability that if this were so, the totality of observations should be that observed.
Figure 1 shows the likelihood of the possible parameter values (proportion of males) given 9 males observed out of 13 individuals as equal to $P(x = 9 \mid \theta, n = 13)$ from the binomial distribution, where:
$$P(x \mid \theta, n) = \binom{n}{x}\theta^{x}(1-\theta)^{n-x} \qquad (1)$$
Because the likelihood is only defined up to a multiplicative constant of proportionality (or an additive constant for the log-likelihood), we can re-scale Eq. (1) by dropping the binomial coefficient and writing the log-likelihood in place of the likelihood:
$$\ln L(\theta \mid x, n) = x\ln\theta + (n-x)\ln(1-\theta) \qquad (2)$$
The first derivative of Eq. (2) with respect to θ is:
$$\frac{d\ln L(\theta \mid x, n)}{d\theta} = \frac{x}{\theta} - \frac{n-x}{1-\theta} \qquad (3)$$
Setting Eq. (3) equal to zero and solving for θ gives x/n, the well-known maximum likelihood estimate. Taking the inverse of the negative of the second derivative of Eq. (3) and evaluating that at θ = x/n gives:
$$\frac{(x/n)(1 - x/n)}{n} \qquad (4)$$
the variance of the estimate of θ. If one is willing to assume that the normal distribution forms a reasonable approximation to the binomial, then x/n ± 1.96 times the square root of Eq. (4) gives a 95% confidence interval. Later we will refer to this as the “asymptotic confidence interval.” We can also consider an “exact” confidence interval (Clopper and Pearson, 1934) that is “based on inverting equal-tailed binomial tests” (Agresti and Coull, 1998, p 119) and an approximate confidence interval that adds 1.92 males and 1.92 females to the counts and then applies the asymptotic normal equation (see Agresti and Coull, 1998 for the justification of this approach). The 95% asymptotic, exact, and Agresti–Coull confidence intervals for Mays and Faerman's data are 0.4414–0.9432, 0.3857–0.9091, and 0.4204–0.8765, where the intervals were obtained using the package “binom” (Dorai-Raj, 2009) within the program “R” (R Development Core Team, 2013).
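The three intervals can be approximated with a short standard-library Python sketch. This is not the “binom” package used for the article's figures: the function names are illustrative, and the exact interval is found here by bisection on the binomial tail probabilities rather than by whatever routine the R package uses.

```python
from math import comb, sqrt

Z = 1.959964  # two-sided 95% quantile of the standard normal


def wald_interval(x, n, z=Z):
    """Asymptotic (Wald) interval: estimate +/- z * standard error."""
    p = x / n
    half = z * sqrt(p * (1 - p) / n)
    return p - half, p + half


def agresti_coull_interval(x, n, z=Z):
    """Add z^2/2 (about 1.92) successes and failures, then apply the Wald formula."""
    n_adj = n + z * z
    p_adj = (x + z * z / 2) / n_adj
    half = z * sqrt(p_adj * (1 - p_adj) / n_adj)
    return p_adj - half, p_adj + half


def binom_cdf(k, n, theta):
    """P(X <= k) for X ~ Binomial(n, theta)."""
    return sum(comb(n, j) * theta**j * (1 - theta)**(n - j) for j in range(k + 1))


def bisect(f, lo, hi, tol=1e-10):
    """Root of f on [lo, hi], assuming f changes sign exactly once."""
    f_lo_positive = f(lo) > 0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if (f(mid) > 0) == f_lo_positive:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def exact_interval(x, n, alpha=0.05):
    """Clopper-Pearson interval from inverting equal-tailed binomial tests."""
    lower = 0.0 if x == 0 else bisect(
        lambda t: 1 - binom_cdf(x - 1, n, t) - alpha / 2, 0.0, 1.0)
    upper = 1.0 if x == n else bisect(
        lambda t: binom_cdf(x, n, t) - alpha / 2, 0.0, 1.0)
    return lower, upper


for name, fn in [("asymptotic", wald_interval),
                 ("exact", exact_interval),
                 ("Agresti-Coull", agresti_coull_interval)]:
    lo, hi = fn(9, 13)
    print(f"{name:>14}: {lo:.4f} to {hi:.4f}")
```

Small differences in the last decimal place relative to the published intervals are possible, because the sketch uses the unrounded normal quantile rather than 1.96 and 1.92.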
Bayes' postulate allows one to begin with an initial belief that events are equiprobable and then modify this belief after observing data. In his original example, Bayes (1763) attempted to determine the position of a tossed ball on a table by counting the number of times subsequently tossed balls fell to the left (as versus the right) of the initially tossed ball. Each tossed ball could fall anywhere from the extreme left side of the table, “0,” to the extreme right side, “1,” and was equally likely to be at any given point along the left to right continuum. This is a special case of Bayes' Theorem in which the prior is a uniform distribution (whereas Bayes' Theorem allows for many different prior distributions). As Stigler (1986, p 361) points out, it was really Laplace in his 1774 publication who most fully developed “‘Bayesian' ideas” using the uniform prior, but the terms Bayes' Theorem and “Bayesian” have stuck because of historical priority.
While Mays and Faerman took a frequentist approach to analyzing their data, the same data also can be used to infer the proportion of males for all infant deaths from these two Romano-British sites following a Bayesian approach. This problem is essentially no different than Bayes' (1763) original example of tossed balls. In the Mays and Faerman example, θ, the proportion of males among dead infants, takes the place of the position of the initially tossed ball. The initial proportion, like the initial ball, can be anywhere from “0” to “1,” and it is equally likely to be at any given point along the continuum. The counts of infants sexed as male and as female among the 13 sexed individuals take the place of Bayes' subsequent balls falling to the left (as versus right) of the initial ball. Four important components in using Bayes' postulate (and Bayes' Theorem) are the likelihood, the prior, the posterior, and the HPD. These are defined below with examples from the Mays and Faerman data.
The likelihood, prior, and posterior
Because the likelihood is only defined up to a multiplicative constant of proportionality (or an additive constant for the log-likelihood), we can consider the following re-scaling for Eq. (1):
$$L^{*}(\theta \mid x, n) = (n+1)\binom{n}{x}\theta^{x}(1-\theta)^{n-x} \qquad (5)$$
This rescaling, often called the normalized likelihood, follows from the fact that there are n + 1 possible states for the x variable (from 0 to n).
The prior is simply the initial probability we assign to any particular value for θ prior to observing or at least analyzing our data (summarized in the likelihood) that 9 of 13 individuals were sexed as male. In his postulate, Bayes adopted a uniform prior such that $f(\theta_s) = 1$ for $0 \le \theta_s \le 1$, where we use the subscript “s” to mean a specific value of θ and $f(\cdot)$ to mean a probability density function. This is a “proper prior” in the sense that the prior integrates to 1.0 across the defined range of θ (from zero to one, inclusive). We will work with other probability density functions where particular values of the function rise above 1.0. This often causes confusion for readers used to dealing with probabilities (which are constrained between 0 and 1 inclusive) and not with probability density functions. Probability density functions are constrained only in that they cannot take negative values and that they must integrate to 1.0. The posterior is the probability we arrive at after evidence or data is taken into account. Bayes' postulate allows us to reverse the conditioning on the probability shown in Eq. (1) so that:
$$f(\theta_s \mid x, n) = \frac{P(x \mid \theta_s, n)}{\int_0^1 P(x \mid \theta, n)\, d\theta} \qquad (6)$$
Equation (6) is equivalent to the scaled likelihood already given in Eq. (5). Equation (6) is referred to as a “posterior probability density function.” Because of the integration in the denominator, the posterior integrates to 1.0. For any given value of θ, the function gives the probability density after (posterior to) modifying the prior by the likelihood. Figure 2 shows a plot of the entire posterior density for the proportion of males in our current example. As noted above, probability density functions can have values that exceed 1.0. This is certainly true for the case shown in Figure 2 where the highest density is equal to about 3.3 at the mode and where for values between about 0.48 and 0.86 the density is greater than 1.0.
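A minimal Python sketch can confirm the two properties just described: the posterior under the uniform prior is the binomial likelihood multiplied by n + 1, it integrates to 1.0, and its height at the mode exceeds 1.0. The function names and the midpoint-rule integrator are illustrative choices, not part of the original analysis.

```python
from math import comb


def binom_pmf(x, n, theta):
    """Binomial probability of x successes in n trials, Eq. (1)."""
    return comb(n, x) * theta**x * (1 - theta)**(n - x)


def posterior_density(theta, x, n):
    """Posterior under a uniform prior: the likelihood rescaled by (n + 1)."""
    return (n + 1) * binom_pmf(x, n, theta)


def integrate(f, a, b, steps=10_000):
    """Midpoint-rule numerical integration of f over [a, b]."""
    width = (b - a) / steps
    return width * sum(f(a + (i + 0.5) * width) for i in range(steps))


x, n = 9, 13
area = integrate(lambda t: posterior_density(t, x, n), 0.0, 1.0)
print(f"area under the posterior: {area:.6f}")
print(f"density at the mode x/n:  {posterior_density(x / n, x, n):.3f}")
```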
Of confidence and HPDs
At this point in the analysis, a frequentist would probably calculate a 95% confidence interval to characterize or evaluate the posterior density for θ (such as the three confidence intervals given in the previous section on maximum likelihood estimation) so we need to consider how the comparable problem is addressed in a Bayesian analysis. One issue both frequentists and Bayesians must deal with is the asymmetry of the posterior density function in Figure 2 such that the posterior mean (equal to 0.6667) is less than the posterior mode (equal to 9/13, or 0.6923). We can find left and right tail areas of 0.025 in order to try to establish a 95% confidence interval for θ, but because of the asymmetry this will lead to including a region which has lower posterior density values. In Figure 2, the 95% confidence interval is shown collectively by the light gray and dark gray areas, where the light gray area represents the region with the lower density. We can address the issue of including a region with lower probability densities within the 95% confidence interval by instead forming the 95% HPD region. The 95% Bayesian HPD shown in Figure 3 for θ is from 0.436 to 0.885, with the excluded lower tail containing about 3.38% of the total posterior density and the excluded upper tail about 1.62%. This HPD has probability densities that range from a low of 0.577 on both sides to a high of 3.278 at the mode. In contrast, the equal-tailed interval shown in Figure 2 has probability densities that fall to as low as 0.454 on the left as versus a low of 0.777 on the right. Ultimately, the HPD is narrower than the equal-tailed interval, which must be the case when there is asymmetry in the distribution.
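The HPD and equal-tailed intervals can be compared with the following sketch, which assumes the uniform-prior posterior for 9 males out of 13. It finds the HPD by lowering a horizontal cut through the density until the region above the cut holds 95% of the probability. The bisection helper and the use of a binomial-tail identity for the beta cumulative distribution (valid only for integer shape parameters) are implementation choices made here, not the authors' method.

```python
from math import comb

X, N = 9, 13  # observed males and sample size


def pdf(theta):
    """Posterior density under the uniform prior, a Beta(X + 1, N - X + 1)."""
    return (N + 1) * comb(N, X) * theta**X * (1 - theta)**(N - X)


def cdf(theta):
    """Beta(a, b) CDF for integer a, b: P(Binomial(a + b - 1, theta) >= a)."""
    m = N + 1
    return sum(comb(m, j) * theta**j * (1 - theta)**(m - j) for j in range(X + 1, m + 1))


def bisect(f, lo, hi, tol=1e-10):
    """Root of f on [lo, hi], assuming f changes sign exactly once."""
    f_lo_positive = f(lo) > 0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if (f(mid) > 0) == f_lo_positive:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def hpd(level=0.95):
    """Return ((lower, upper), height) of the highest posterior density region."""
    mode = X / N

    def bounds(k):
        # The two points where the density crosses height k, one on each side of the mode.
        return (bisect(lambda t: pdf(t) - k, 0.0, mode),
                bisect(lambda t: pdf(t) - k, mode, 1.0))

    def excess_mass(k):
        lo, hi = bounds(k)
        return cdf(hi) - cdf(lo) - level

    k = bisect(excess_mass, 1e-9, pdf(mode) - 1e-9)
    return bounds(k), k


def equal_tailed(level=0.95):
    tail = (1 - level) / 2
    return (bisect(lambda t: cdf(t) - tail, 0.0, 1.0),
            bisect(lambda t: cdf(t) - (1 - tail), 0.0, 1.0))


(hpd_lo, hpd_hi), height = hpd()
et_lo, et_hi = equal_tailed()
print(f"HPD:          {hpd_lo:.3f} to {hpd_hi:.3f} (density {height:.3f} at both ends)")
print(f"equal-tailed: {et_lo:.3f} to {et_hi:.3f} "
      f"(densities {pdf(et_lo):.3f} and {pdf(et_hi):.3f})")
```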
The Bayesian HPD differs from the frequentist confidence interval in how it is interpreted, and the two are consequently not directly equivalent or comparable. The HPD is a probabilistic statement about the parameter, whereas a confidence interval is a probabilistic statement about potential replications of the “experiment.” As a consequence, confidence intervals do not mean what we commonly think them to mean. Kruschke (2010b, p 661) gives a succinct definition for the term “confidence interval” as used in frequentist statistical methods: “A confidence interval is merely the range of hypothetical parameter values we would not reject if we replicated the intended experiment many times.” Lee (2012, p xxi) in the preface to his book points out how the term “confidence interval” confused him when he “first learned a little statistics”:
…the statement that a 95% confidence interval for an unknown parameter ran from −2 to +2 sounded as if the parameter lay in that interval with 95% probability and yet I was warned that all I could say was that if I carried out similar procedures time after time then the unknown parameters would lie in the confidence intervals I constructed 95% of the time. It appeared that the books I looked at were not answering the questions that would naturally occur to a beginner, and that instead they answered rather recondite questions which no one was likely to want to ask.
Smith (1986, p 303) has made similar comments with regard to the difference between confidence intervals and Bayesian HPDs for the recombination fraction in genetic linkage analyses:
Careful books also warn that, though one would like to think that there is a 95% probability that the true value of the parameter will fall in this interval, this is not logically justified by the definition of “confidence interval.” The only substantial difference using Bayes is to show that this interval is in fact a “probability interval”; there is (approximately) a 95% probability that the true value lies in the interval.
To fully understand how a confidence interval for the binomial operates it is best to look at the “coverage probability” (Vollset, 1993; Agresti and Coull, 1998; Brown et al., 2001) for the interval. This probability is simply the proportion of times that θ is expected to fall within its nominal confidence interval. There are several different ways that frequentist confidence intervals could be assigned, including the three we mentioned in the section on maximum likelihood estimation. We follow Agresti and Coull (1998) in focusing on the Wald method (this is the asymptotic normal method), the exact method (Clopper and Pearson, 1934), and what they refer to as the modified Wald method, but that we refer to as Agresti–Coull following later publications (Brown et al., 2001; Miao and Gastwirth, 2004; Tobi et al., 2005).
Figure 4 shows plots of coverage probabilities against θ for the asymptotic, Agresti–Coull, and exact method 95% confidence intervals using a sample size of 13 following our example by Mays and Faerman (2001). The asymptotic method gives confidence intervals that are too narrow, having coverage probabilities at or slightly above 0.95 only in a tight band around values of 0.45 and 0.55 and at points 0.21 and 0.79 (coverage plots are symmetric around the center point of θ = 0.5). At θ = 9/13, the coverage probability for the 95% confidence interval (which runs from 0.441 to 0.943, as mentioned in the section on maximum likelihood estimation) is 0.922. Conversely, the exact method confidence intervals are conservative in that they have higher coverage probabilities than they should at a given confidence level. At θ = 9/13, the coverage probability is 0.970 for the exact 95% confidence interval (from 0.386 to 0.909, again see the section on maximum likelihood estimation). The Agresti–Coull interval also gives a coverage probability of 0.970 at θ = 9/13 (from 0.420 to 0.876), but the exact intervals are less conservative for other values of θ.
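Coverage probability at a given θ is the total binomial probability of those counts x whose interval contains θ. The sketch below evaluates it at θ = 9/13 for the three methods, taking that value from the interval endpoints quoted in the text. Function names are illustrative, and the exact interval is again obtained by bisection rather than by a library routine.

```python
from math import comb, sqrt

Z = 1.959964  # two-sided 95% normal quantile


def binom_pmf(x, n, theta):
    return comb(n, x) * theta**x * (1 - theta)**(n - x)


def wald(x, n):
    p = x / n
    half = Z * sqrt(p * (1 - p) / n)
    return p - half, p + half


def agresti_coull(x, n):
    n_adj = n + Z * Z
    p_adj = (x + Z * Z / 2) / n_adj
    half = Z * sqrt(p_adj * (1 - p_adj) / n_adj)
    return p_adj - half, p_adj + half


def solve(f, tol=1e-10):
    """Bisection on [0, 1] for a function that changes sign once."""
    lo, hi, lo_positive = 0.0, 1.0, f(0.0) > 0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if (f(mid) > 0) == lo_positive:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def exact(x, n, alpha=0.05):
    """Clopper-Pearson bounds from the two binomial tail probabilities."""
    upper_tail = lambda t: sum(binom_pmf(j, n, t) for j in range(x, n + 1))
    lower_tail = lambda t: sum(binom_pmf(j, n, t) for j in range(0, x + 1))
    lo = 0.0 if x == 0 else solve(lambda t: upper_tail(t) - alpha / 2)
    hi = 1.0 if x == n else solve(lambda t: lower_tail(t) - alpha / 2)
    return lo, hi


def coverage(method, n, theta):
    """Probability, under the true theta, that the method's interval contains theta."""
    total = 0.0
    for x in range(n + 1):
        lo, hi = method(x, n)
        if lo <= theta <= hi:
            total += binom_pmf(x, n, theta)
    return total


theta = 9 / 13
for name, method in [("asymptotic", wald), ("exact", exact),
                     ("Agresti-Coull", agresti_coull)]:
    print(f"{name:>14}: {coverage(method, 13, theta):.3f}")
```

Evaluating `coverage` over a fine grid of θ values would trace out curves of the kind described for Figure 4.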
Unlike the “coverage probabilities” for the three frequentist-style confidence intervals, which are too narrow or too broad, a Bayesian HPD has the proper coverage probability even at this rather small sample size of 13 individuals. Figure 5 shows a plot of the realized coverage against nominal HPD regions ranging from 0.50 to 0.99 in increments of 0.01 at a sample size of 13. This figure was drawn based on 100,000 simulations which we describe in some detail as it makes the Bayesian model more explicit. The first step in the simulation was to generate 100,000 values between 0 and 1 from the uniform density, which simulates sampling from the prior distribution. The second step was to generate a binomial variate at a sample size of 13 using the 100,000 uniform deviates. These 100,000 pairs of simulated θ and x values (counts of from 0 to 13 males) then formed the “data” to be assessed. The third step was to loop through HPD values of from 0.50 to 0.99 in increments of 0.01, filling in a 14 row (from x = 0 to 13) by two column (lower and upper bounds) table of calculated intervals within that loop. Finally, the simulated θ values were compared to the tabled values at their paired x value and the proportion of θ values within their HPDs was assessed. For example, a simulated θ of 0.1137 and x value of two “males,” is counted as being within the 66% to 99% HPDs but not within the 50–65% HPDs because the lower HPD boundary is at 0.1139 for the 65% HPD and at 0.1124 for the 66% HPD when there are 2 males out of 13 individuals.
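The simulation steps described above can be sketched in Python as follows. To keep the run short, this version uses 20,000 draws and three nominal levels rather than the 100,000 draws and 50 levels reported for Figure 5. The seed, the function names, and the HPD solver are all illustrative assumptions.

```python
import random
from math import comb

N = 13  # sample size


def pdf(theta, x):
    """Posterior density under the uniform prior, a Beta(x + 1, N - x + 1)."""
    return (N + 1) * comb(N, x) * theta**x * (1 - theta)**(N - x)


def cdf(theta, x):
    """Beta CDF for integer shapes via a binomial upper tail."""
    m = N + 1
    return sum(comb(m, j) * theta**j * (1 - theta)**(m - j) for j in range(x + 1, m + 1))


def bisect(f, lo, hi, tol=1e-10):
    f_lo_positive = f(lo) > 0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if (f(mid) > 0) == f_lo_positive:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def hpd(x, level):
    """Lower and upper HPD bounds; one-sided when x is 0 or N."""
    mode = x / N

    def bounds(k):
        lo = 0.0 if x == 0 else bisect(lambda t: pdf(t, x) - k, 0.0, mode)
        hi = 1.0 if x == N else bisect(lambda t: pdf(t, x) - k, mode, 1.0)
        return lo, hi

    def excess_mass(k):
        lo, hi = bounds(k)
        return cdf(hi, x) - cdf(lo, x) - level

    return bounds(bisect(excess_mass, 1e-9, pdf(mode, x) - 1e-9))


def simulate_coverage(levels, n_sims=20_000, seed=1):
    rng = random.Random(seed)
    # Step 3 of the text: a table of HPD bounds for every x at every level.
    tables = {lv: [hpd(x, lv) for x in range(N + 1)] for lv in levels}
    hits = {lv: 0 for lv in levels}
    for _ in range(n_sims):
        theta = rng.random()                              # step 1: draw from the prior
        x = sum(rng.random() < theta for _ in range(N))   # step 2: binomial draw
        for lv in levels:
            lo, hi = tables[lv][x]
            hits[lv] += lo <= theta <= hi                 # final step: inside its HPD?
    return {lv: hits[lv] / n_sims for lv in levels}


for level, realized in simulate_coverage([0.50, 0.80, 0.95]).items():
    print(f"nominal {level:.2f} -> realized {realized:.3f}")
```

Realized coverage should track the nominal level up to Monte Carlo error, which shrinks as `n_sims` grows.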
Thus far, we have considered Bayes' postulate and the roles of the likelihood, the prior, the posterior, and HPDs in Bayesian analysis. But how can we move from Bayes' postulate to Bayes' Theorem proper? This involves generalizing the prior distribution so that we can consider other prior distributions beyond the uniform. As pointed out above, Bayes' postulate uses a particular prior density: the uniform distribution between zero and one. This prior can be written as the beta probability density function Beta(1, 1), where the two “shape parameters” are equal to 1.0. This is a conjugate prior density, meaning that when “updated” by the likelihood, it produces a posterior density that has the same distributional form as the prior. The convenience of a conjugate prior is that the onus of integrating the denominator in Eq. (6) is removed, although in Bayes' postulate the denominator is simply 1/(n + 1) (and see Stigler's, 1982, p 252 comments about Bayes' “scholium”). Raiffa and Schlaifer (1961, p xii), who first defined the term “conjugate prior,” noted:
…we can obtain a very tractable family of “conjugate” prior distributions by simply interchanging the roles of variables and parameters in the algebraic expression for the sample likelihood, and the posterior distribution will be a member of the same family as the prior. This procedure leads, for example, to the beta family of distributions in situations where the state is described by the parameter p of a Bernoulli process…
Following this procedure, we can consider prior densities other than the uniform, which moves us from Bayes' postulate to Bayes' Theorem or Rule (both monikers are used in the literature).
In our example, Bayes' Theorem is:
$$f(\theta_s \mid x, n) = \frac{P(x \mid \theta_s, n)\, f(\theta_s)}{\int_0^1 P(x \mid \theta, n)\, f(\theta)\, d\theta} \qquad (7)$$
where $f(\theta)$ represents the prior density. If the prior density in Eq. (7) is a beta density function with “shape” parameters α and β, then the posterior density is Beta(x + α, n − x + β). In maximum likelihood estimation we know that the mean and mode both equal x/n, so we can consider what values of α and β in the beta prior would reproduce the maximum likelihood mean or mode. If, as in Bayes' postulate, the beta prior is Beta(1, 1), then the posterior mean will be (x + 1)/(n + 2) and the posterior mode will be x/n. If the beta prior is Beta(0, 0), then the posterior mean will be x/n but the posterior mode will be (x − 1)/(n − 2). The Beta(0, 0) prior, known as Haldane's prior (after Haldane, 1932) is an example of an improper prior because the density goes to infinity at the borders of 0 and 1. One could compromise between Beta(0, 0) and Beta(1, 1), and instead use Beta(1/2, 1/2) as the prior, which gives a posterior mean of (x + 1/2)/(n + 1) and a posterior mode of (x − 1/2)/(n − 1). This latter prior is known as a Jeffreys prior (after Jeffreys, 1946).
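As a worked illustration of the conjugate update, the following sketch applies the three priors to the Mays and Faerman counts (9 males out of 13) and reports the posterior mean and mode in exact fractions. The helper names are illustrative. The mode formula assumes both posterior shape parameters exceed 1, which holds for these data.

```python
from fractions import Fraction


def beta_posterior(x, n, alpha, beta):
    """Conjugate update: a Beta(alpha, beta) prior becomes Beta(alpha + x, beta + n - x)."""
    return alpha + x, beta + (n - x)


def beta_mean(a, b):
    return a / (a + b)


def beta_mode(a, b):
    # Valid only when both shape parameters are greater than 1.
    return (a - 1) / (a + b - 2)


priors = {
    "uniform  Beta(1, 1)":     (Fraction(1), Fraction(1)),
    "Haldane  Beta(0, 0)":     (Fraction(0), Fraction(0)),
    "Jeffreys Beta(1/2, 1/2)": (Fraction(1, 2), Fraction(1, 2)),
}

x, n = 9, 13
for name, (alpha, beta) in priors.items():
    a, b = beta_posterior(x, n, alpha, beta)
    print(f"{name}: posterior Beta({a}, {b}), "
          f"mean {beta_mean(a, b)}, mode {beta_mode(a, b)}")
```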
All three priors (Bayes' original uniform, Haldane's, and Jeffreys') can be referred to as uninformative priors because they are rapidly “dominated” by the likelihood function after examining even a little bit of data. The existence of multiple uninformative priors raises a chief criticism made against Bayesian inference: if one expresses lack of knowledge about a parameter by using a uniform prior on one scale (for example a proportion, θ), then this lack of knowledge should apply on a transformed scale (e.g., the arcsin transformed proportion on [0, π/2]). However, as we demonstrate below, there is scarcely any difference between using one uninformative prior and another once there is a modicum of data. Consider Jeffreys' prior. We decide to transform θ in order to reduce the asymmetry in Figure 2. The arcsin transformation (Bartlett, 1937), $\phi = \arcsin\sqrt{\theta}$, is one such possible transformation. Figure 6 shows the posterior density for the same data as used in Figure 2, except that here the proportion of males is expressed in radians (from the arcsin transformation) rather than as a straight-scale proportion of males. Note that the asymmetry for the scaled likelihood in Figure 6 is much less than for Figure 2 (the light gray region is much reduced). Figure 6 also shows the uniform prior as a dashed line, which in the arcsin transformation has a density value of 2/π for any value between 0 and π/2. This is again a proper prior density in that $\int_0^{\pi/2} (2/\pi)\, d\phi = 1$.
Now suppose we want to move in the other direction from Figure 6 back to Figure 2 in order to see our analysis on the original proportion of males scale. If we undertake this task, then the uniform prior from the radian scale shown in Figure 6 becomes the Beta(1/2, 1/2) distribution on the proportion of males scale, or Jeffreys' prior again. Similarly, Haldane's prior is a uniform distribution in the logit scale: $\ln[\theta/(1-\theta)]$. Thus, Bayes' original uniform prior is the uninformative prior for the binomial likelihood in the proportion scale, Jeffreys' prior is uninformative in the arcsin transformed scale, and Haldane's prior is uninformative in the logit scale.
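The change of variables behind these statements can be checked numerically. A flat density on a transformed scale induces a density on the proportion scale equal to the flat height times the absolute derivative of the transformation. The sketch below assumes the arcsine square-root and logit transformations and uses a central-difference derivative. The function names are illustrative.

```python
from math import asin, log, pi, sqrt


def jeffreys_density(theta):
    """Beta(1/2, 1/2) density; its normalizing constant B(1/2, 1/2) equals pi."""
    return 1 / (pi * sqrt(theta * (1 - theta)))


def haldane_kernel(theta):
    """Unnormalized Beta(0, 0) kernel (improper, so it has no normalizing constant)."""
    return 1 / (theta * (1 - theta))


def induced_density(theta, transform, flat_height, h=1e-6):
    """Density on the theta scale induced by a flat density on transform(theta)."""
    slope = (transform(theta + h) - transform(theta - h)) / (2 * h)
    return flat_height * abs(slope)


arcsin_sqrt = lambda t: asin(sqrt(t))
logit = lambda t: log(t / (1 - t))

for theta in (0.1, 0.5, 0.9):
    from_arcsin = induced_density(theta, arcsin_sqrt, 2 / pi)
    from_logit = induced_density(theta, logit, 1.0)
    print(f"theta={theta}: arcsin-flat {from_arcsin:.6f} vs Jeffreys {jeffreys_density(theta):.6f}; "
          f"logit-flat {from_logit:.6f} vs Haldane kernel {haldane_kernel(theta):.6f}")
```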
If we have no prior knowledge about the proportion of males then it stands to reason that we also have no prior knowledge about the arcsin or logit transformed proportion of males, but in point of fact, there is scarcely any difference between using Jeffreys' prior and the uniform prior once data is added. We show this in the “triplot” (O'Hagan, 2004) of the standardized likelihood, Jeffreys' prior, and the posterior density in Figure 7. Note that under a uniform prior (on the proportion of males scale) the standardized likelihood and the posterior density coincide, as in Figures 1 and 2. In Figure 7 the Jeffreys' prior has barely nudged the posterior density away from the standardized likelihood. The point of this exercise is to demonstrate that a uniform prior becomes non-uniform with transformation of a parameter, a valid critique of a Bayesian approach that has no practical effect when the likelihood dominates the prior.
Sequential use of Bayes' Theorem
Thus far we have generalized the uniform prior to consider other types of uninformative priors, but we can also use this generalization to create an informative prior. Mays and Faerman's (2001) study is not the only one to have used aDNA to assess the proportion of males among infants from Roman era sites. After examining their data, Mays and Faerman also included a count of three male infants and one female infant from the Beddingham Roman Villa (Waldron et al., 1999). This brought the total to 12 males and 5 females. Although they did not include aDNA data from the Late Roman Era site of Ashkelon (Faerman et al., 1998), this data (14 males and 5 females) could potentially be combined with the 12 males and 5 females to yield 26 total males and 10 females. Using Jeffreys' prior, the posterior density would then be a Beta(26.5, 10.5) distribution. The 95% HPD for the proportion of males from this distribution is from 0.558 to 0.845. Courgeau (2010, 2012, p 114–116) describes how Laplace used data on live births from Paris and London and “inverse probability” to assess the probability that the sex ratio at birth for Paris was higher than for London. As this probability was nearly zero, Laplace opined that London, at 0.513, had the higher proportion of male births. The 95% HPD for the 26 males and 10 females from Roman sites excludes this value of 0.513, so similarly we might suspect that the Roman era sites provide more male infant deaths than expected from modern data.
In the above paragraph, we approached the analysis as if we started with Jeffreys' prior and then used the data on 36 infants (Faerman et al., 1998; Waldron et al., 1999; Mays and Faerman, 2001) to find the posterior density of θ. But we can also use yesterday's posterior as tomorrow's prior, so we could treat the data in the order that it arrived in the literature. Starting with Jeffreys' prior and the Faerman et al. (1998) data gives a posterior density of Beta(14.5, 5.5), which we could use as a prior for Waldron et al.'s (1999) data to obtain a posterior of Beta(17.5, 6.5). This in turn could be used as a prior for Mays and Faerman's (2001) data to arrive at the posterior of Beta(26.5, 10.5). Alternatively, we could take the data in reverse order, again beginning with Jeffreys' prior but adding Mays and Faerman's (2001) data to obtain a posterior of Beta(9.5, 4.5), then adding Waldron et al.'s (1999) data to arrive at a posterior of Beta(12.5, 5.5), and finally adding Faerman et al.'s (1998) data to again arrive at a posterior density of Beta(26.5, 10.5). Figure 8 shows how these two paths both arrive at the same answer. Panel A shows data incorporated in ascending order of date of publication, whereas Panel B shows descending order; both panels should be read from the top down.
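The two update paths can be traced with a few lines of Python. The counts are those given in the text. The dictionary layout and function names are illustrative.

```python
def update(prior, males, females):
    """Conjugate beta update: add the male and female counts to the shape parameters."""
    alpha, beta = prior
    return alpha + males, beta + females


JEFFREYS = (0.5, 0.5)

# (males, females) reported in each study, keyed by citation.
STUDIES = {
    "Faerman et al. (1998)": (14, 5),
    "Waldron et al. (1999)": (3, 1),
    "Mays and Faerman (2001)": (9, 4),
}


def path(order, prior=JEFFREYS):
    """Posterior after each study, taking the studies in the given order."""
    steps = []
    for name in order:
        prior = update(prior, *STUDIES[name])  # yesterday's posterior is today's prior
        steps.append(prior)
    return steps


ascending = path(list(STUDIES))
descending = path(list(reversed(list(STUDIES))))
print("ascending: ", ascending)
print("descending:", descending)
```

Both lists end at the same pair of shape parameters because the update only adds counts, and addition does not depend on order.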
The examples in the preceding section show how we can use a posterior density to create a new informative prior for analyzing additional data. We can also use a posterior density to generate probabilities of observing certain data values in a new sample. Forming the product of Eq. (1) with a posterior density and then integrating across θ from zero to one gives the predicted values if we were to obtain a new sample. For example, if we start with Beta(26.5, 10.5) as our posterior density for θ (the proportion of males across the three studies from the previous section), and we then obtain a new dataset of 15 infants sexed by aDNA from a Roman site, the predictive density gives us the probabilities of observing 0, 1, 2, …, 14, or 15 males in the new sample. Using α and β for the “shape parameters” in the beta posterior density (26.5 and 10.5 in this example) we can write the joint probability for θ (the proportion of males) and for obtaining y number of males in a future sample of m individuals. This is the product of the binomial probability for obtaining y males out of m individuals given θ and the beta posterior density for θ. Integrating across θ from 0 to 1 gives the predictive probability distribution for y, which is:
$$P(y \mid m, \alpha, \beta) = \binom{m}{y}\frac{B(y+\alpha,\; m-y+\beta)}{B(\alpha, \beta)} \qquad (8)$$
where $B(\cdot,\cdot)$ is the beta function, the integral of the beta density kernel $\theta^{\alpha-1}(1-\theta)^{\beta-1}$ between 0 and 1. Equation (8) is the beta binomial distribution BetaBin(m, α, β). Figure 9 shows our example using BetaBin(15, 26.5, 10.5). While we will not have much further need to use the predictive distribution analytically in this article, it is a useful point of departure for demonstrating the Gibbs sampler (Casella and George, 1992). It also can be used to check on the reasonableness of a model. If the (posterior) predictive density does not do a good job of “predicting” the data originally used in the analysis, then the analysis itself is suspect.
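A minimal sketch of the beta binomial predictive distribution follows, assuming the Beta(26.5, 10.5) posterior and a hypothetical new sample of 15 infants as in the example. The beta function is evaluated on the log scale through `math.lgamma` for numerical stability. The function names are illustrative.

```python
from math import comb, exp, lgamma


def log_beta(a, b):
    """Natural log of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def beta_binomial_pmf(y, m, alpha, beta):
    """Predictive probability of y males among m new individuals, Eq. (8)."""
    return comb(m, y) * exp(log_beta(y + alpha, m - y + beta) - log_beta(alpha, beta))


m, alpha, beta = 15, 26.5, 10.5
probs = [beta_binomial_pmf(y, m, alpha, beta) for y in range(m + 1)]

for y, p in enumerate(probs):
    print(f"{y:2d} males: {p:.4f}")
print(f"total probability: {sum(probs):.6f}")
print(f"predictive mean:   {sum(y * p for y, p in enumerate(probs)):.3f}")
```

The predictive mean should agree with m·α/(α + β), which is a convenient check on the implementation.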
While the posterior predictive density is useful for checking the reasonableness of a single model, the measure for choosing between competing models is the Bayes factor. Kass and Raftery (1995) give a useful exposition of Bayes factors, and Kruschke (2010a) provides a detailed example for the binomial distribution. If we have two competing models that potentially could have generated an observed data set, then we can form a ratio of the probabilities of observing the data under each of the two models. This ratio of probabilities is referred to as a Bayes factor. It is identical to a likelihood ratio only when the two models are specified such that the parameter(s) of interest is (are) point values. If instead the model gives the parameter value(s) across an interval, then the parameter value must be “integrated out,” which for a beta prior leads to the beta binomial, as we saw above. If the quotient of the probabilities of Model 1/Model 2 is greater than one, then the first model is more strongly supported by the data being considered. Various scales have been offered for translating Bayes factors into statements about relative plausibility of models. For example, Jeffreys (1939, p 357) classified Bayes factors (which he called “K”) into six grades, running from grade 0 for “null hypothesis supported” to grade 5 for evidence against the null is “decisive.”
In the example we have been following, let's consider the simplest Bayes factor first: comparison of a uniform prior to a prior with a point mass of 1.0 on θ = 0.513. For a uniform prior, the probability of getting any count from 0 up to 13 males out of 13 individuals is 1/14, so the probability of the observed data of 9 males out of 13 individuals is 0.0714. The probability of getting 9 males out of 13 individuals if θ = 0.513 follows directly from Eq. (1), and is about 0.099. The Bayes factor comparing a model with θ = 0.513 to that with a uniform prior is therefore 0.099/0.0714 or about 1.4, which demonstrates that a model with a point mass at 0.513 does a better job of predicting the observed data than does a uniform prior.
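This simplest Bayes factor takes only a few lines to reproduce. The variable names are illustrative.

```python
from math import comb


def binom_pmf(x, n, theta):
    """Binomial probability of x successes in n trials, Eq. (1)."""
    return comb(n, x) * theta**x * (1 - theta)**(n - x)


x, n = 9, 13

# Under a uniform prior every count from 0 to n is equally probable.
p_uniform = 1 / (n + 1)

# Under a point mass at theta = 0.513 the probability is the plain binomial.
p_point = binom_pmf(x, n, 0.513)

bayes_factor = p_point / p_uniform
print(f"P(data | uniform prior)  = {p_uniform:.4f}")
print(f"P(data | theta = 0.513)  = {p_point:.4f}")
print(f"Bayes factor             = {bayes_factor:.2f}")
```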
Now let's consider Bayes factors for models we created by sequentially using Bayes' Theorem. The data of nine males and four females from Mays and Faerman's (2001) study were the last to arrive in the literature, so we could take as one model a prior based on Jeffreys' prior and the fact that 17 male and 6 female infants had been reported in the literature for Roman era infants, which is a $\mathrm{Beta}(17.5, 6.5)$ prior. The beta distribution $\mathrm{Beta}(\alpha, \beta)$ has a mean of $\alpha/(\alpha+\beta)$ and variance of $\alpha\beta/[(\alpha+\beta)^2(\alpha+\beta+1)]$, so the $\mathrm{Beta}(17.5, 6.5)$ distribution has a mean of about 0.729 (proportion of males) and standard deviation of about 0.0889. For an alternative model, we might presume that the mean proportion was 0.513 and the standard deviation was 0.1. This would correspond to approximately a $\mathrm{Beta}(12.30, 11.68)$ prior. The ratio of the beta-binomial probabilities of observing 9 males out of 13 individuals under the $\mathrm{Beta}(17.5, 6.5)$ and $\mathrm{Beta}(12.30, 11.68)$ priors, which is the Bayes factor for comparing these two models, is about 1.671. A Bayes factor of 1.671 classifies as a grade 1, or “not worth more than a bare mention,” on Jeffreys' scale. If, on the other hand, we consider all of the data from Faerman et al. (1998), Waldron et al. (1999), and Mays and Faerman (2001) under a Jeffreys prior and compare that to the probability of getting the same data if $\theta$ was exactly equal to 0.513, then we have the beta-binomial probability of 26 males out of 36 individuals under a $\mathrm{Beta}(0.5, 0.5)$ prior in the numerator and the binomial probability for getting 26 males out of 36 individuals (with $\theta = 0.513$) in the denominator. This gives a Bayes factor of about 3.5, which classifies on Jeffreys' scale as a grade 2, or “evidence against the null is substantial.”
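Both Bayes factors in this paragraph are ratios of beta-binomial (or binomial) probabilities, so they can be reproduced numerically. The sketch below is in Python and is not the authors' R code. It assumes that the first model is Jeffreys' prior updated with 17 males and 6 females, that the alternative prior is obtained by matching the stated mean and standard deviation, and the function names (`beta_binomial_pmf`, `beta_from_moments`) are ours.

```python
from math import comb, exp, lgamma

def log_beta(a, b):
    """Natural log of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def beta_binomial_pmf(k, n, a, b):
    """Probability of k successes in n trials after integrating theta over Beta(a, b)."""
    return comb(n, k) * exp(log_beta(a + k, b + n - k) - log_beta(a, b))

def beta_from_moments(mean, sd):
    """Beta(a, b) parameters that match a given mean and standard deviation."""
    total = mean * (1 - mean) / sd**2 - 1
    return mean * total, (1 - mean) * total

# Model 1: Jeffreys' Beta(0.5, 0.5) prior updated with 17 males and 6 females.
a1, b1 = 0.5 + 17, 0.5 + 6
# Model 2: prior with mean 0.513 and standard deviation 0.1.
a2, b2 = beta_from_moments(0.513, 0.1)

# Bayes factor for the 9 males out of 13 reported last.
bf_sequential = beta_binomial_pmf(9, 13, a1, b1) / beta_binomial_pmf(9, 13, a2, b2)

# Pooled data (26 males out of 36): Jeffreys' prior against a point mass at 0.513.
theta0 = 0.513
binomial_pooled = comb(36, 26) * theta0**26 * (1 - theta0)**10
bf_pooled = beta_binomial_pmf(26, 36, 0.5, 0.5) / binomial_pooled

print(round(a2, 2), round(b2, 2), round(bf_sequential, 2), round(bf_pooled, 1))
```

Small differences from the published 1.671 can arise from how the parameters of the alternative prior are rounded.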
Interpretively, it is useful to look at the Bayes factor in a bit more detail. As Kass and Raftery (1995) show, Bayes' Theorem can be used to manipulate the Bayes factor so that we have:
$$\frac{P(M_1 \mid D)}{P(M_2 \mid D)} = \frac{P(D \mid M_1)}{P(D \mid M_2)} \times \frac{P(M_1)}{P(M_2)} \qquad (9)$$
where the term on the left is the posterior odds for Model 1 conditioned on the observed data against Model 2 conditioned on the same data. The first term on the right side of the “equals” sign is the Bayes factor, or the odds of getting the data under Model 1 versus under Model 2. The final term is the prior odds placed on the two models. Typically, one assumes that $P(M_1) = P(M_2)$, canceling out the final term, so that the posterior odds are equal to the Bayes factor. Equation (9) looks very similar to one that is frequently used in forensic work, where the posterior odds of two hypotheses are equal to the likelihood ratio times the prior odds for the hypotheses. In the forensic setting the two competing hypotheses are typically “guilty” versus “not guilty” (see, for example, Lucy, 2005, p 112–114). As we pointed out above, whenever models are such that parameters are exact points, then the Bayes factor is the same thing as a likelihood ratio.
A normally distributed parameter
Application of Bayes' Theorem is certainly not limited to data that have a discrete distribution. In this section, we turn to the problem of estimating the stature for A.L. 288-1 (“Lucy”), a problem that can be addressed using either a “likelihoodist” approach (Sober, 2002) or a Bayesian approach. Konigsberg et al. (1998) gave the classical calibration estimator for stature from femur length, which with A.L. 288-1's femur length of 281 mm gives an estimated stature of 1,081 mm. Classical calibration in this context finds the stature at which the observed femur length is most likely to have occurred. They also gave an “integrated mean squared error” from this method equivalent to 2,369. In the Bayesian setting, we would refer to the integrated mean squared error as the “data variance,” as it reflects our uncertainty in the estimated stature due to the imperfect correlation between stature and femur length. The “data variance” is equal to $\sigma^2(1 - r^2)/r^2$, where $\sigma^2$ is the variance of stature, and $r$ is the correlation of stature with femur length within a reference sample (for which Konigsberg et al. used 2,053 modern humans). Note that when the correlation between femur length and stature is 1.0, the data variance for the estimated stature is 0. Putting the estimated stature and its data variance together we can write $\text{stature} \sim N(1081,\ 2369)$, meaning that the data from A.L. 288-1's femur length and the reference sample imply that A.L. 288-1's estimated stature is normally distributed around a mean of 1,081 mm with a variance of 2,369.
Using a “likelihoodist” approach, we would stop at this point and use 1,081 mm (with its data variance of 2,369) as our best estimate. In contrast, a Bayesian approach would combine this information from data with a prior density for stature. Lee (2012, p 40–42) gives a good presentation of how the normal distribution is the conjugate prior for a normal likelihood. For the purposes of illustration, we could use the reference sample stature distribution as the prior, in which case we have the normal distribution $N(1725,\ 7270)$, where 1,725 mm is the mean stature and 7,270 is the variance of stature. This is not a reasonable prior in the “real world,” as we know that overall Lucy is quite small as compared to modern humans. Continuing with the Bayesian approach, we use the inverse of the data and prior variances, each of which is referred to as a “precision.” The posterior precision is simply the sum of the data precision and the prior precision, which gives $1/2369 + 1/7270 \approx 0.00056$, or a posterior standard deviation of 42.3 mm once we invert the posterior precision and take the square root. The posterior mean is equal to the weighted average of the data mean (1,081 mm) and the prior mean (1,725 mm), where the weights are the respective relative precisions (i.e., the data precision and the prior precision each divided by the posterior precision). This gives $0.754 \times 1081 + 0.246 \times 1725 \approx 1239$ mm, which agrees with the estimate obtained using “inverse calibration” summarized in Konigsberg et al.'s (1998) Table 2. Because the reference sample size is large (n = 2,053), we ignore the fact that the variances, correlation, and means are estimated rather than known. Konigsberg et al. (2006) show how Gibbs sampling, a type of computer simulation, can be used to form the full posterior density of stature for an individual when only a small reference sample is available.
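The precision-weighted update described in this paragraph is short enough to write out in full. The following is a minimal Python sketch using only the four numbers quoted in the text (the data mean and variance, and the prior mean and variance); the variable names are ours.

```python
from math import sqrt

data_mean, data_var = 1081.0, 2369.0     # classical calibration estimate and its data variance
prior_mean, prior_var = 1725.0, 7270.0   # reference-sample stature distribution used as the prior

# A precision is the inverse of a variance; precisions add under a normal-normal model.
data_prec = 1 / data_var
prior_prec = 1 / prior_var
post_prec = data_prec + prior_prec

# Posterior standard deviation: invert the posterior precision and take the square root.
post_sd = sqrt(1 / post_prec)

# Posterior mean: average of the data and prior means, weighted by relative precision.
weight_data = data_prec / post_prec
weight_prior = prior_prec / post_prec
post_mean = weight_data * data_mean + weight_prior * prior_mean

print(round(post_sd, 1), round(weight_data, 3), round(post_mean))
```

Because the prior here is centered far above the data mean, roughly a quarter of the weight going to the prior is enough to pull the estimate up by about 160 mm, which illustrates why the choice of prior matters for a specimen as small as A.L. 288-1.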
COMPUTER SIMULATION AND BAYESIAN STATISTICS
In the simple examples given above, we were able to analytically compute and use full posterior probability distributions to perform Bayesian inference. While using a conjugate prior on relatively simple models like the binomial and a univariate normal removes the onus of numerical integration, there are many more complicated problems that cannot be so easily handled. In cases where the prior and posterior do not take the same form, and/or when the probability distributions of interest are multivariate or otherwise complex, we may give up solving analytical equations and replace symbolic or numerical integration with computer simulation. This shift is responsible for much of the florescence of Bayesian analysis in the last two decades (Gilks et al., 1996; Lunn et al., 2000, 2009; Gelman et al., 2004; Sturtz et al., 2005; Kéry, 2010; Kruschke, 2010a,b; Ntzoufras, 2011; Kéry and Schaub, 2012), as we noted in the introduction. Complex Bayesian analyses use sampling techniques based on Monte Carlo methods to estimate, rather than calculate, the posterior distribution. In the Appendix, we illustrate five Bayesian simulation methods, the first of which, approximate Bayesian computation (ABC), uses acceptance sampling to approximate or build a posterior distribution. The four subsequent simulation methods (the Metropolis sampler, slice sampling, adaptive rejection sampling, and Gibbs sampling) are used within Markov chain Monte Carlo (MCMC) methods to sample from conditional distributions.
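To give a sense of how acceptance sampling builds a posterior, the following is a minimal Python sketch of ABC for the 9 males out of 13 infants under a uniform prior. It is not the Appendix code; the function name `abc_posterior`, the number of draws, and the seed are our own choices. Because the data are a single count, a draw is accepted only when the simulated count matches the observed count exactly.

```python
import random

def abc_posterior(n=13, observed=9, draws=100_000, seed=1):
    """Acceptance sampling: keep prior draws whose simulated data match the observed count."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(draws):
        theta = rng.random()                                  # draw from the uniform prior
        males = sum(rng.random() < theta for _ in range(n))   # simulate n births at this theta
        if males == observed:                                 # accept only exact matches
            accepted.append(theta)
    return accepted

sample = abc_posterior()
abc_mean = sum(sample) / len(sample)

# With a uniform prior the analytical posterior is Beta(10, 5), whose mean is 10/15.
print(len(sample), round(abc_mean, 3), round(10 / 15, 3))
```

Only about one draw in 14 is accepted here, which hints at why acceptance sampling becomes expensive for richer data sets and why the MCMC methods are usually preferred.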
Each method described in the Appendix has different strengths and weaknesses, and is useful and appropriate for different kinds of problems or questions. Conveniently, a number of freely available specialized software packages for applying MCMC methods behave as expert systems that attempt to utilize the most appropriate sampling scheme, thus removing the decision of which sampling simulation method is most appropriate from the user. Less conveniently, the expert systems do not typically reveal how they selected the most appropriate simulation method. Lunn et al. (2000, p 328) explain the hierarchical logic used by one of the expert systems, WinBUGS. While the software automatically chooses among simulation methods, we believe it is nevertheless important to understand how these different methods work. Our illustrations in the Appendix continue using the aDNA sexing for Roman Era infants example, even though we were able to apply direct analytical methods to it. The advantage of staying with this example is that computer simulation results in the Appendix can be compared to the analytical results.
Using OpenBUGS for paleoanthropological analysis
For all of the very modest examples of Bayes' Theorem up to this point, we have been using the statistical computing and graphics environment, R (R Development Core Team, 2013). While it is certainly possible to build MCMC models in R, as reflected in recent years in the AJPA (Fürtbauer et al., 2013; Gillespie et al., 2013; Séguy et al., 2013), it is more typical for researchers to use one of the freely available packages to apply MCMC (Millard and Gowland, 2002; Barik et al., 2008; Matsuda et al., 2010; Babb et al., 2011; Matauschek et al., 2011; Yang et al., 2012; Gilmore, 2013; Muchlinski et al., 2013; Raaum et al., 2013; Zinner et al., 2013). Among the freely available packages are: BATWING, BayesTraits, BEAST, JAGS, MrBayes, OpenBUGS, Stan, STRUCTURE, and WinBUGS. Many of these packages are focused around phylogenetics or population structure, but others (JAGS, OpenBUGS, Stan, WinBUGS) are quite general. We give a brief example from one of these more general packages (OpenBUGS) using published summary statistics by Uhl et al. (2013). Uhl et al. give the vector of means and the variance–covariance matrix for log scale measurements of body mass, three humeral measurements, and three femoral measurements from 600 modern humans. They then proceed to use profile likelihood and Bayesian methods to estimate body mass for KNM WT-15000 (“Nariokotome boy”). The “profile likelihood” is simply the likelihood function “after removal of nuisance parameters” (Brown, 1993, p 2), which in this case are the mean measurements and variance–covariance terms within the reference sample. We revisit Uhl et al.'s (2013) analysis here starting from their summary statistics and using OpenBUGS in place of the direct calculations to reevaluate body mass estimates.
Figure 10 shows the OpenBUGS code for running this example, which is identical to the code for WinBUGS. Both OpenBUGS and WinBUGS are descendants of BUGS, but we use OpenBUGS (version 3.2.2 rev 1063, July 15, 2012) here because it is open source and under continuous development, whereas WinBUGS is no longer under development and is consequently “frozen” at version 1.4.3 (August 6, 2007). Figure 10 contains a “model” and “data.” The model first states two different priors, one of which can be commented out in order to reproduce either the profile likelihood results or the Bayesian results from Uhl et al. (2013). The remaining code gives the multivariate regression of log measurements (humerus minimum midshaft diameter, humerus epicondylar breadth, femur anterior-posterior midshaft diameter, and femur medial-lateral midshaft diameter) on log body mass for the 600 modern humans. The “data” statement contains the log measurements for KNM WT-15000 and “tau.” “Tau” is the inverse of the residual variance–covariance matrix among bone measurements after “regressing out” log body mass. This matrix as well as the regressions was found directly from the summary statistics from Uhl et al. (2013). With access to the raw data one could also model this matrix as having a Wishart prior with low prior precision (see Appendix 2 in the work by Konigsberg et al., 2006) in order to include the uncertainty in calculating the multivariate regression. With 600 cases this would have little effect.
Figure 11 shows the output from 10,000 iterations in OpenBUGS, under both an uninformative prior and an informative prior using data from 600 modern humans for four bone measurements for KNM WT-15000. Because the prior is conjugate to the likelihood (the multivariate normal), the sampling is direct and the program does not actually use MCMC sampling. As a consequence, the issues of “burn-in” and autocorrelations within the “chain” (which we take up for later analyses in OpenBUGS) can be ignored. In Figure 11 the full posterior densities are shown using kernel density estimation, and the 0.025, 0.5, and 0.975 quantiles from the 10,000 iterations are shown with vertical dashed lines. For comparison, the reported quantiles by Uhl et al. (2013) are shown using filled points.
The quantiles from OpenBUGS are virtually identical to the quantiles published by Uhl et al. (2013), differing by at most 0.8 kg. After examining an allometry diagnostic and z-scores from Darroch and Mosimann (1985) for six shape variables (the four here as well as the humerus maximum midshaft diameter and the vertical head diameter of the femur), Uhl et al. (2013) commented on the fact that the humerus maximum midshaft diameter for KNM WT-15000 appeared to be “too large.” We can examine this by looking at the (posterior) predictive distribution for this measurement in a model where all six variables depend on body mass and we take an uninformative prior for body mass. On 10,000 iterations, the predicted value for the humerus maximum midshaft diameter was greater than or equal to the observed value of 29.9 mm only 1.67% of the time, showing that the observed value of 29.9 mm is quite extreme if KNM WT-15000 followed the same allometry for this variable (relative to the remaining variables) as seen in modern humans.
BAYES IN BIOARCHAEOLOGY
As has hopefully become apparent from the studies cited thus far, biological anthropological applications of Bayesian methods to human genetic, epidemiological, and demographic questions are not restricted to living populations but can address archaeological and paleontological populations as well. More or less Bayesian approaches have become common in paleodemography over the past decade or so, and are also making an appearance in the paleopathology literature. Here, we examine the use of Bayesian inference in estimating the age-at-death structure for a bioarchaeological sample, using a toy example to illustrate how to estimate hazard parameters while accounting for uncertainty in skeletal age estimation at the same time. We also examine the use of Bayesian analysis to form full posterior densities to estimate disease prevalence for a bioarchaeological sample, using another toy example to illustrate exactly how Bayes' Theorem can fit within paleopathology studies.
Bayesian modeling of mortality and uncertain age estimates
Konigsberg and Frankenberg (1992) noted that much of prior paleodemographic analysis had an ersatz Bayesian flavor that followed from conditioning “age on stage.” The problem with these previous approaches is that they essentially hid the prior distribution for age-at-death. This problem of a hidden prior was implicit in Bocquet-Appel and Masset's (1982) insightful critique of paleodemography a decade earlier, where they noted that paleodemographic age-at-death distributions tended to mimic those of the reference sample on which age determination methods were based. Konigsberg and Herrmann (2002) used MCMC to fit hazard models to skeletal data, but they did not include a prior distribution for the hazard parameters. As a result, their approach was decidedly non-Bayesian and was instead a stochastic expectation-maximization approach (Diebolt and Ip, 1996) that led to maximum likelihood estimates of the hazard parameters. Possibly as an over-reaction to the somewhat Bayesian past of age estimation in paleodemography, the “Rostock volume” (Hoppa and Vaupel, 2002) took a decidedly frequentist approach to the field.
More recently, a number of applications in paleodemography have taken a more explicitly Bayesian approach (Chamberlain, 2000; Gowland and Chamberlain, 2002; Millard and Gowland, 2002; Bocquet-Appel and Bacro, 2008; Caussinus and Courgeau, 2010; Séguy et al., 2013). Of these, only Caussinus and Courgeau (2010) and Séguy et al. (2013) take a fully Bayesian approach by applying MCMC to produce the full posterior distributions for the proportions of individuals in each age class. The MCMC can also provide the posterior densities for age-at-death for each individual as a by-product. This is not trivial, as one would hope that bioarchaeological analyses would take into account the uncertainty in skeletal age estimation, as Konigsberg and Holman (1999) have argued is appropriate.
For the remainder of this section, we consider a “toy” example of obtaining the posterior density for the baseline and senescent components of mortality in a Gompertz model using a six-stage osteological age “indicator.” For the purposes of illustration, we make several simplifying assumptions. To model progression through an age “indicator” system with six ordered states we assume that the progression to the next highest state follows a log-normal distribution with means at 2.8, 3.2, 3.4, 4.0, and 4.6 and a common standard deviation of 0.45. This translates into modal transition ages of 13.4, 20.0, 24.5, 44.6, and 81.2 years on the straight scale. These transitions are based on Todd phase scores for 422 males from the Terry Anatomical Collection where the six stages are Todd I–III, IV–V, VI, VII–VIII, IX, and X (Katz and Suchey's [1986] “T2” scoring). Next we assume a Gompertz mortality model starting at age 15 years with and . Assuming that we have 200 skeletons, each of whom aged according to the log-normal model and then died off following the specified Gompertz hazard, rounding to integers we should have obtained 5, 17, 18, 88, 61, and 11 skeletons in each of the six stages. Starting with these counts and the transition analysis parameters, the maximum likelihood estimates of the two hazard parameter values are and .
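The modal transition ages quoted above follow from the mode of a log-normal distribution, which is exp(mean − sd²) on the straight scale. The Python sketch below checks those modes and also shows one way the probability of being in each of the six stages at a given age could be computed. The stage calculation assumes a cumulative probit form on log age, which is our reading of the transition model rather than a transcription of the published code, and the function names are ours.

```python
from math import erf, exp, log, sqrt

log_means = [2.8, 3.2, 3.4, 4.0, 4.6]   # mean log ages of the five transitions
log_sd = 0.45                           # common standard deviation on the log scale

# Mode of a log-normal distribution on the straight (years) scale.
modal_ages = [exp(m - log_sd**2) for m in log_means]

def std_normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def stage_probabilities(age):
    """Probability of being in each of the six ordered stages at a given age (assumed probit form)."""
    # Probability of having already passed each transition by this age.
    passed = [std_normal_cdf((log(age) - m) / log_sd) for m in log_means]
    bounds = [1.0] + passed + [0.0]
    # Stage j is occupied when transition j-1 has been passed but transition j has not.
    return [bounds[j] - bounds[j + 1] for j in range(6)]

print([round(a, 1) for a in modal_ages])
print([round(p, 3) for p in stage_probabilities(40.0)])
```

With a common standard deviation the cumulative curves cannot cross, so the six stage probabilities are non-negative and sum to one at every age.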
Figure 12 shows the OpenBUGS code for this “toy” example. Soliman et al. (2012) show that the gamma distribution is an appropriate prior for the senescent component hazard in a Gompertz model, but they only do so after assuming a discrete prior for the baseline mortality. In their example they assume that values of the baseline parameter from 0.05 up to 0.50 in steps of 0.05 are a priori equally likely. A diffuse gamma distribution can also be used as a prior for the baseline mortality. Indeed, the example from “ReliaBUGS,” a package within OpenBUGS, which uses the function dgpz() for the Gompertz distribution, takes diffuse gamma distributions as priors for both parameters. While this works well when ages are known (as is the case in the example from “ReliaBUGS”), we found that it did not work when ages are estimated with considerable uncertainty. We consequently used the dloglike() function in OpenBUGS, which implements the “zeros trick” used in WinBUGS. This simply requires writing the log-likelihood from the Gompertz model on each case, which is shown in the code as logLike[i].
The model shown in Figure 12 starts by making random draws from a gamma distribution with “shape” and “rate” parameters both equal to 1.0 to obtain a value for the baseline and senescent mortality hazard parameters. Then for each of the 200 individuals, an age-at-death is simulated out of the current Gompertz model. In order for these simulated ages to represent actual ages-at-death, 15 years (the starting age of the Gompertz model) must be added to each. This age is then converted to the log-scale as that is the scale for the “transition analysis.” Next, for each of the individuals the probability that they are in each of the six phases is found from the transition analysis based on their current log age (the “phi” in Fig. 12 is the standard normal cumulative distribution function). Finally, the observed phase data for each individual are modeled as distributed according to their individual current multinomial distributions.
Figure 12 also shows the “data” that underlie the model. These are the observed phases, the mean (log) transition ages, and the common (log) standard deviation of transition ages. OpenBUGS uses the slice sampler (see the Appendix) for sampling from the two hazard parameters as well as the 200 ages-at-death. Because the slice sampler is being used within Gibbs sampling there is considerable autocorrelation in the output and we consequently must discard the initial iterations and retain only some of the subsequent iterations. We “post-processed” the chain from OpenBUGS in the R package “coda” (for “convergence diagnosis and output analysis,” see Plummer et al., 2006). In the interest of time and space we do not report this here. Ultimately we ran 10,000 iterations in one chain as a “burn-in” and then retained 10,000 values of both of the hazard parameters (as well as some age distribution information described below). The 10,000 retained values were simulated using a “thinning” interval of 500 so that in actuality 5 million iterates were formed after the burn-in.
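The bookkeeping for burn-in and thinning is easy to get wrong, so we sketch it here in Python. This is an illustration of the arithmetic only, not of how OpenBUGS or “coda” implement thinning; the function names are ours, and the convention of keeping the last iterate in each block of the thinning interval is an assumption.

```python
def iterations_needed(burn_in, retained, thin):
    """Total iterations to run so that `retained` thinned values remain after burn-in."""
    return burn_in + retained * thin

def burn_and_thin(chain, burn_in, thin):
    """Drop the first `burn_in` values, then keep the last value in each block of `thin`."""
    return chain[burn_in + thin - 1::thin]

# The run described in the text: 10,000 burn-in, 10,000 retained, thinning interval 500.
total = iterations_needed(10_000, 10_000, 500)

# A small stand-in "chain" (iteration indices) to show which iterates survive.
toy_chain = list(range(iterations_needed(10, 5, 3)))
kept = burn_and_thin(toy_chain, burn_in=10, thin=3)

print(total, kept)
```

The 5 million post-burn-in iterates in the text correspond to the `retained * thin` term; the burn-in adds a further 10,000.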
Figure 13 shows kernel density plots of the 10,000 retained values from the posterior distribution for the baseline and senescent components of mortality along with the maximum likelihood estimates drawn as log-normal distributions. Figure 13 also shows the prior distributions used in the MCMC procedure. Of particular interest in Figure 13 is the fact that maximum likelihood estimation leads to a far more peaked distribution for the baseline hazard than in MCMC. This is a result of the baseline hazard being near its lower boundary of 0.0. In contrast, the distributions for the senescent component of mortality are more comparable.
One great advantage of using this method to model the posterior densities of mortality is that it also allows us to form posterior densities of age-at-death for individuals. This can be done by “monitoring” the age iterates for a specific individual within each of the six phases. Figure 14 plots the kernel density estimates of these posterior age distributions along with the posterior age distributions obtained from the product of the maximum likelihood Gompertz model with the “transition analysis” model (see for example Fig. 1 in DiGangi et al., 2009). These are converted to true densities by dividing within each phase by the integral of the function across age. The posterior densities obtained by the MCMC method are similar to the ones from integration, but the MCMC posterior densities for age-at-death are “heavier” in the tails because the MCMC method accounts for the fact that the hazard parameters are being estimated (the integration method assumes that both the baseline and the senescent hazard parameter values are known). While the assumption of known hazard parameter values may be justified in the forensic setting, it is rarely if ever justifiable in the bioarchaeological setting. Another feature not explored in this example is the ability to include uncertainty in the transition analysis parameters themselves. As the logit and probit models (the basis for transition analysis) are standard models used within MCMC programs, this could be added in real analyses.
Bayesian analysis of paleopathology
Just as Bayesian approaches in paleodemography can incorporate the uncertainty in age estimates, similar approaches in paleopathology can incorporate the uncertainty due to small samples and/or to incomplete bony expression of particular disease states. Byers and Roberts (2003) show how to use “Bayes' Theorem in Paleopathological Diagnosis” but they do not sketch how to conduct a Bayesian analysis. Specifically, they show how to calculate a posterior probability but do not show how to form the full posterior density for a parameter, such as prevalence. We illustrate how to carry out a Bayesian analysis that includes forming the full posterior density for prevalence using another “toy” example. We presume that there is some disease that produces a specific bone lesion with a frequency of 80% in affected individuals and that individuals who do not have the disease will never form these lesions. Out of a bioarchaeological sample of 43 individuals we observe four individuals who have the specific bone lesion.
Byers and Roberts show how to combine an informative prior (disease prevalence from previous studies) with likelihoods (in our case, the information that 80% of the diseased individuals will form the specific bone lesion) to form the posterior probability that an individual had a particular disease. But we can proceed more directly to estimating the prevalence in our small sample of 43 individuals. Given that the lesion can only be formed in those with the disease (with a probability of 0.8), and given that we observed 4 out of 43 individuals with the lesion, we can solve for the prevalence at about 0.1163, or that 5 out of the 43 individuals had the disease. But this algebraic estimate ignores the fact we have uncertainty in the prevalence due both to the small sample size and the fact that we surmised that 80% of the sick would develop the lesion. All that we did was divide one proportion (4/43) that serves as the probability of observing the lesions by another proportion (0.8) that serves as the probability that a sick individual would express the lesion (the “sensitivity” of the lesion to the presence of the disease).
The picture changes if we now note that our statement that 80% of the sick will form the bone lesion was based on a small study of 50 skeletons from individuals known to have had the disease, and if we also account for the fact that our estimate of the lesion frequency in the bioarchaeological sample is based on only 43 skeletons. If we want to form the full posterior density for the disease prevalence in our bioarchaeological sample then we need to recognize that this is the ratio of a beta density for the lesion frequency in the bioarchaeological sample against a beta density for the lesion frequency in (sick) individuals in the reference sample. Figure 15 shows the posterior densities for disease prevalence assuming that a large reference sample is available for calculating the likelihood (i.e., the frequency of lesions in known sick individuals), that a sample of only 50 individuals is available, and that a sample of only five individuals is available. In all three cases, we assume Jeffreys' prior for the probability that individuals in the bioarchaeological sample will have bone lesions (so the posterior density is $\mathrm{Beta}(4.5, 39.5)$) as well as for the probability that sick individuals in the reference sample will have skeletal lesions.
This latter assumption leads to a posterior density of $\mathrm{Beta}(40.5, 10.5)$ when there are 50 individuals and $\mathrm{Beta}(4.5, 1.5)$ when there are five individuals. When the reference sample size is large (even 100 skeletons is quite close to the asymptotic result) then the posterior density for prevalence in the bioarchaeological sample comes directly from the $\mathrm{Beta}(4.5, 39.5)$ posterior density. Specifically, if $x$ is a variable representing the disease prevalence in the bioarchaeological sample while $y = 0.8x$ is the transformation that leads to observed lesions, then the posterior density for prevalence is $0.8\, f_{\mathrm{Beta}}(0.8x;\ 4.5, 39.5)$, where $f_{\mathrm{Beta}}$ denotes the beta density. When the reference sample size is smaller, the posterior density for the prevalence in the bioarchaeological sample is a ratio of beta distributions (Pham-Gia, 2000) with $\mathrm{Beta}(4.5, 39.5)$ in the numerator. Figure 15 shows that, at least for this example, the posterior density for prevalence in the bioarchaeological sample is not much affected by the reference sample size. This is encouraging, but the example is quite artificial, and it certainly does not account for the large biases that can occur if the disease is expressed differently in the bioarchaeological sample relative to the reference sample.
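The ratio of two beta distributions has a closed form (Pham-Gia, 2000), but it can also be approximated by simple Monte Carlo simulation. The Python sketch below is such an approximation and is not the calculation used to draw Figure 15. It assumes Jeffreys' prior for both proportions, so that 4 lesions among 43 skeletons give a Beta(4.5, 39.5) posterior, and reference samples of 50 and 5 sick individuals (80% with lesions) give Beta(40.5, 10.5) and Beta(4.5, 1.5) posteriors. The function name, number of draws, and seed are ours.

```python
import random

def prevalence_quantiles(ref_with, ref_without, draws=50_000, seed=1):
    """Monte Carlo 0.025, 0.5, and 0.975 quantiles of lesion frequency / sensitivity."""
    rng = random.Random(seed)
    ratios = []
    for _ in range(draws):
        # Lesion frequency in the bioarchaeological sample (4 of 43, Jeffreys' prior).
        lesion_freq = rng.betavariate(4.5, 39.5)
        # Sensitivity estimated from known sick individuals in the reference sample.
        sensitivity = rng.betavariate(ref_with + 0.5, ref_without + 0.5)
        # The ratio is not truncated, so rare draws above 1.0 are possible.
        ratios.append(lesion_freq / sensitivity)
    ratios.sort()
    return [ratios[int(q * draws)] for q in (0.025, 0.5, 0.975)]

point_estimate = (4 / 43) / 0.8               # algebraic estimate, about 0.1163
q_ref50 = prevalence_quantiles(40, 10)        # reference sample of 50 sick individuals
q_ref5 = prevalence_quantiles(4, 1)           # reference sample of 5 sick individuals

print(round(point_estimate, 4))
print([round(q, 3) for q in q_ref50])
print([round(q, 3) for q in q_ref5])
```

Comparing the two sets of quantiles gives a numerical counterpart to the visual comparison of densities described for Figure 15.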
The above example assumed a specificity of one such that individuals who do not have the disease will never form the bone lesion, but specificity often can be less than one (meaning that at least some disease-free individuals show the bone lesion). Boldsen (2001) considers such a case, presenting a novel application that uses constrained non-linear regression to estimate the prevalence of leprosy in three bioarchaeological samples and to assess the sensitivities and specificities of seven osteological “markers” that may be related to the presence of leprosy. His model is a special case of what is known as the Hui–Walter paradigm or model (Hui and Walter, 1980; Johnson et al., 2001; Toft et al., 2005; Berkvens et al., 2006). In one form of this model, two or more “diagnostic” tests with unknown sensitivities and specificities are applied across two or more samples with differing prevalence for the disease to which the tests relate. The data within this model form a three dimensional array with one dimension having a length of two (for presence versus absence of the osteological “marker”), one dimension having a length equal to the number of osteological “markers,” and one dimension having a length equal to the number of samples.
In the Hui–Walter model there are 2T+P parameters to estimate (Pouillot et al., 2002), where T is the number of traits and P is the number of populations. The “two times” is for the sensitivity and specificity of each “test” (marker) while the addition of P is for the number of prevalences to be estimated. In Boldsen's setting with three archaeological samples and seven osteological “markers,” there are 17 parameters to be estimated. The original Hui–Walter model assumes that the “markers” are independent conditional on the unknown disease statuses, and thus there are $(2^T - 1)P$ degrees of freedom available from the data (Pouillot et al., 2002). Boldsen (2001) makes the stronger assumption that the “markers” are independent without the requirement of conditioning on unknown disease status. The degrees of freedom from the data in Boldsen's analysis are $T \times P = 21$, which leaves four degrees of freedom after estimating the 17 parameters. Positive degrees of freedom are a necessary condition for the model to be “likelihood identifiable,” or in other words this is a necessary condition for there to be a unique local maximum to the likelihood function. But having positive degrees of freedom alone is not a sufficient condition. Using a method detailed by Jones et al. (2010), it is possible to show that Boldsen's (2001) model is not likelihood identified, but if cross-classifications of pathologies are included in the data then there are identifiable models. Such models can be fit using the program “TAGS” (http://www.epi.ucdavis.edu/diagnostictests/QUERY.HTM) which uses the method of maximum likelihood. Joseph et al. (1995) described a Gibbs Sampler that can also be used to fit the Hui–Walter model. This model, as well as more complicated models that include dependence between pathologies, has been fit using MCMC software (Branscum et al., 2004; Engel et al., 2006; de Clare Bronsvoort et al., 2010). Such models could be fit in paleopathological analyses provided the cross-classifications of pathologies are available.
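The parameter and degrees-of-freedom counts for the Hui–Walter model are simple enough to tabulate. The Python sketch below does so for Boldsen's setting of seven markers and three samples. The count for fully cross-classified data, (2^T − 1)P, is the standard one for T binary tests in P populations and is stated here as our assumption; the count T × P applies when only the marginal marker frequencies are used. The function name is ours.

```python
def hui_walter_counts(T, P):
    """Parameter and data degrees-of-freedom counts for T markers in P populations."""
    return {
        # A sensitivity and a specificity per marker, plus one prevalence per population.
        "parameters": 2 * T + P,
        # Full cross-classification of T binary markers within each population (assumed form).
        "df_cross_classified": (2**T - 1) * P,
        # Marginal marker frequencies only: one proportion per marker per population.
        "df_marginal_only": T * P,
    }

counts = hui_walter_counts(T=7, P=3)
spare_df = counts["df_marginal_only"] - counts["parameters"]

print(counts, spare_df)
```

The contrast between the two data counts shows how much information is set aside when cross-classifications of pathologies are not recorded, which is the practical point of the paragraph above.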
WHY FORENSIC ANTHROPOLOGISTS SHOULD BE BAYESIANS, BUT SELDOM ARE
In our experience, forensic anthropologists are among the strongest adherents of classical hypothesis testing approaches. This seems puzzling on the surface, as the questions that arise in forensic anthropology rarely relate to questions of whether or not to reject a null hypothesis (see for example Taroni et al., 2010). Instead, such questions pertain to the relative strengths of competing hypotheses or to the characterization of posterior densities of one or more parameters of interest. The fascination with P-values in forensic anthropology is no doubt a result of past training as well as stasis in the field. The use of “subjective probability” (see in particular Chapter 2 by Courgeau, 2012) may make non-frequentist approaches appear un-scientific and generally unappealing, and the use of prior information (even when it is completely justifiable, as in Brenner and Weir, 2003) may appear problematic to forensic scientists. The Scientific Working Group for Forensic Anthropology's (SWGANTH) “product” for statistical methods is relatively mute on the subject of Bayesian statistics, although it does note under “unacceptable practices” the “ad hoc formulation of priors when using Bayesian statistics.” Finally, MCMC seems to be off the table as far as forensic anthropologists are concerned. Computer simulation and the law seem to be an uncomfortable mix, despite the fact that MCMC can make complicated Bayesian networks easily interpretable using directed graphs.
To the above-mentioned problems we can add the current court position in the UK regarding the use of likelihood ratios and Bayesian inference for non-DNA expert testimony. The appeal in the case of “R v T” (names were redacted in the court proceedings) brought into question expert testimony on a match between a suspect's shoe and footprints at a crime scene. The entire December 2012 issue of Law, Probability, and Risk was devoted to a discussion of this case. While opinions were divided among authors, Thompson (2012, p 347) rather emphatically stated that he considered the court's position a “judicial blunder”:
I will say at the outset that I think R v T is an inept judicial opinion that creates bad law. The opinion went awry because the justices who wrote it misunderstood a key aspect of the evidence they were evaluating. The justices sought to achieve laudable goals, but their misunderstanding of basic principles of inductive logic, and particularly Bayes' theorem, led them to exclude a type of expert evidence that, in general, is helpful and appropriate in favour of an alternative type… that is fundamentally inconsistent with the goals the court sought to achieve. The case has already received severe criticism and will inevitably come to be seen for what it is—a judicial blunder.
In light of these various problems and objections, we continue to be surprised by the extent to which at least some forensic anthropologists are “closet Bayesians.” Konigsberg et al. (2008) referred to the use of the percentile method for age estimation in forensic anthropology, where the percentiles of ages within stages in a reference sample are used to estimate ages in a target sample, as a “hidden Bayesian” approach. Konigsberg et al. (2009, p 84) in an article on “estimation and evidence in forensic anthropology: sex and race” stated:
We consider forensic anthropologists as being implicit Bayesians because they often do bring prior information to their cases, though this information is typically implicit, unstated, and not quantified.
One of our goals within forensic anthropology has been to make the use of Bayesian inference explicit (see for example Ross and Konigsberg, 2002). In this section, we give examples of how to make Bayesian inference explicit in the analysis of commingled remains and in victim identification in mass disasters involving closed populations. We also spend some time clarifying the role of prior information in forensic anthropology and the ways in which conditional probabilities may be incorrectly transposed and misinterpreted in trial settings.
Analysis of commingled remains
The analysis of commingled skeletal remains involves two related problems, one being the ability to pair left and right antimeres from the same individual and the other being the ability to re-associate different elements from the same individual. We begin with the problem of attempting to re-associate antimeres (Lyman, 2006; Byrd, 2008; Nikita and Lahr, 2011; O'Brien and Storlie, 2011) and then turn to the problem of re-associating different skeletal elements. We show in particular for the analysis of bilateral elements the danger of only using a Bayesian approach.
Re-associating bilateral elements
Previous studies that have dealt with the problem of re-associating paired elements have generally used one or more measurements on antimeres from a proposed pair and compared the asymmetry from those measurements to the distribution of asymmetry measurements obtained from a reference sample of paired bones. O'Brien and Storlie (2011) have recently taken an approach to this problem that is implicitly Bayesian, but they do not indicate that their “Refit Probability” is actually a posterior probability based on assuming an equal prior of pairing a given left bone (or alternatively a given right bone) to each of the right bones (or alternatively each of the left bones). Further, they refer to this posterior probability as being a likelihood, which further muddies the waters.
O'Brien and Storlie (2011) form vectors of differences between left and right homologous measurements on a given element within a reference sample of known pairs. They then use a multivariate normal density from this reference data to find the point densities of getting vectors of differences between all pairs of bones within their target sample and place these in a matrix of left bones against right bones. O'Brien and Storlie then convert this matrix into both a row-stochastic matrix (by dividing row elements by their associated row totals) and into a column stochastic matrix (by dividing column elements by their associated column totals).
What O'Brien and Storlie have literally done is to find, in their two stochastic matrices, the posterior probability, conditioning on the left side bones, that the various right side bones came from the same individual and, conversely, conditioning on the right side bones, that the various left side bones came from the same individual. O'Brien and Storlie's Bayesian approach is problematic both because the prior probabilities are not well justified and because their approach does not allow for the possibility that some recovered bones may not have their antimeres present in the sample. In the following section, we will assume equal priors not by conditioning on particular bones as O'Brien and Storlie have implicitly done, but rather by enumerating all the possible “configurations” (ways to group bones into individuals).
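The row and column normalizations can be sketched as follows. This is an illustration under our own assumptions, not O'Brien and Storlie's code: the density matrix is hypothetical, and the function names are ours. Dividing each density by its row total is the same as applying Bayes' Theorem with an equal prior on every right bone for a given left bone.

```python
def row_stochastic(matrix):
    """Divide each entry by its row total (conditioning on the left bone)."""
    return [[value / sum(row) for value in row] for row in matrix]


def column_stochastic(matrix):
    """Divide each entry by its column total (conditioning on the right bone)."""
    totals = [sum(column) for column in zip(*matrix)]
    return [[value / total for value, total in zip(row, totals)]
            for row in matrix]


# Hypothetical point densities of the left/right difference vectors.
# Rows are left bones, columns are right bones.
densities = [
    [4.0, 1.0, 0.5],
    [0.8, 3.0, 0.2],
    [1e-6, 1e-8, 1e-9],  # this left bone fits none of the right bones well
]

rows = row_stochastic(densities)
cols = column_stochastic(densities)
for row in rows:
    print([round(p, 3) for p in row])
```

In the third row the largest posterior is close to one even though every density in that row is tiny, because normalization forces each row to sum to one whether or not the true antimere was recovered.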
In O'Brien and Storlie's approach some of the posterior probabilities for matches may be quite high, even though the likelihood (based on asymmetry measured for a possible pair) is quite low. This is a common problem from discriminant function analysis that is addressed by calculating a “typicality probability” (Konigsberg et al., 2009; Ousley et al., 2009; Ousley and Jantz, 2012). Here, the typicality probability is the probability of observing an amount of left/right asymmetry between two potential pairs that is equal to or greater than some value given the distribution of asymmetry between known pairs in the reference sample. Calculating typicality probabilities is a frequentist approach that is preferred here over O'Brien and Storlie's ill-advised Bayesian approach. In the following section, we skirt the problem of having to pair antimeres and calculate typicality probabilities by assuming that there is absolutely no left/right asymmetry (or that only one side is under consideration), that there are no missing bones, and that the number of individuals is known.
Re-associating different elements
Adams and Byrd (2006) give an example of the analysis of “small-scale commingling” using data on the recovered remains of two US soldiers from a military helicopter that crashed in Vietnam during 1969. In an analysis of the humeri and femora they wish to determine whether a left humerus should be associated with the femora already associated with “Individual 1” or whether it should be associated with “Individual 2.” Adams and Byrd use inclusion versus exclusion within a 90% prediction interval for the regression of a composite humerus variable on a femoral measurement, where the regression is formed using data on 139 individuals. This approach does not allow us to make probabilistic statements, and it is not clear what percentage should be used in making statements of inclusion or of exclusion. Ideally one would want exclusion based on a very wide or high percentage interval and inclusion based on a very narrow or low percentage interval.
We can give the analysis a more probabilistic bent, though it requires access to summary statistics from the reference data not given in the original publication. The correlation coefficient and regression parameters provided in the original publication are insufficient to characterize the bivariate normal assumed in fitting the regression. We can, however, extract the raw data using Data Thief (Tummers, 2006) and Adams and Byrd's plot in their Figure 1, and recalculate the regression in order to characterize the bivariate normal. While our regression is similar to the published version (our correlation was 0.906 versus Adams and Byrd's 0.903, and our regression parameters were 0.853 and 1.213 versus Adams and Byrd's 0.844 and 1.251) it is not identical. This is primarily because of overlapping points that were difficult to visualize. Still, the basic reference data are similar enough to prove instructive.
From the captured reference data we have means at 3.864 and 4.511, variances of and , and a covariance of for the femur and humerus variables in the reference data. Adams and Byrd give an observation for a humerus from the helicopter of 4.541, which is the log of the sum of two humeral head measurements, rather than using the average of the two log-scale measurements, which would be a Darroch and Mosimann size variable (Darroch and Mosimann, 1985; Jungers et al., 1995). They also give the log maximum diameter for the femoral head from Individual 1 and from Individual 2 of 3.826 and 3.867. Using the femoral head variable from Individual 1 to calculate the 90% prediction interval for the humerus variable (which we calculated for the recovered statistics as from 4.417 to 4.541), they (Adams and Byrd, 2006, p 67) then show that the actual humerus value of 4.541 is “located on the upper boundary of a 90% prediction interval.” Conversely, the observed humerus variable is within the 90% prediction interval (which we calculated as 4.451–4.576) based on Individual 2's femur variable. The authors (Adams and Byrd, 2006, p 67) conclude that “Based on this objective technique, the humerus was included with the remains designated as Individual 2 via exclusionary sorting.” In other words, because they find that the humerus measurement is on the 90% prediction interval border when using Individual 1's femur as the predictor, they feel that they can exclude Individual 1 as the source of the humerus. The humerus is then included with Individual 2 because there is no other individual remaining and because its value does fall within the 90% prediction interval based on Individual 2's femur.
The above technique is certainly replicable, but its objectivity is debatable. One can fairly arbitrarily form many “inclusions” by making the prediction interval quite wide. A 95% interval would have included both Individual 1 and Individual 2 as possible sources, which is presumably the reason that Adams and Byrd instead used a 90% prediction interval. One can also arbitrarily form many “exclusions” by making the prediction interval narrower. A 53.3% prediction interval would have placed Individual 1 on the prediction interval border and Individual 2 well above the border. The chief problem with the current approach is that it violates what is known as the “likelihood principle.”
Lee (2012, p 221) gives a very succinct and understandable definition of this principle:
The nub of the argument here is that in drawing any conclusion from an experiment only the actual observation x made (and not the other possible outcomes that might have occurred) is relevant. This is in contrast to methods, by which, for example, a null hypothesis is rejected because the probability of a value as large or larger than that actually observed is small…
In the current context, what is wanted is the probability of obtaining the observed data (a humerus value of 4.541) if the femur value was 3.867 (Individual 2's value) versus the probability of getting the humerus value if the femur value was 3.826 (Individual 1's value). It does us no good to concern ourselves with the concept of humerus measurements that are “more extreme” than the one actually observed.
In our analysis of the data from Adams and Byrd (2006), we consequently want to find the ratio of the probability of getting a humerus value of exactly 4.541 if the femur value was 3.867 to the probability of obtaining that humerus value if the femur value was 3.826. These probabilities can be found using a t distribution with N − 2 degrees of freedom. This ratio of probabilities (with Individual 2 as the predictor in the numerator and Individual 1 in the denominator) is 2.965. This ratio can be referred to as a likelihood ratio, as it is the ratio of the likelihood of obtaining the data (the humerus measurement) if Individual 2 was the source of the humerus divided by the likelihood of the data if Individual 1 was the source. As before, the likelihood is proportional to the probability of obtaining the observed data given a certain hypothesis or set of parameter values. The likelihood ratio in this setting is the same thing as the Bayes factor we saw earlier. The likelihood ratio literally means that the datum (the humerus value) was 2.965 times more likely to have arisen from an individual with a femur measurement of 3.867 than from an individual with a femur measurement of 3.826.
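The form of this calculation can be sketched as a ratio of two predictive densities from a simple linear regression, each a scaled t density with N − 2 degrees of freedom. The intercept, slope, sample size, femur mean, and observed values below are taken from the text, with 3.867 used for Individual 2 as given earlier. The residual standard error `RESID_SE` and the femur sum of squares `SXX` are placeholders that we have assumed for illustration, so the printed ratio will not reproduce the reported 2.965.

```python
import math


def t_pdf(z: float, df: int) -> float:
    """Density of a standard Student t variate, computed on the log scale."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(z * z / df))


def predictive_density(y, x, intercept, slope, resid_se, n, x_mean, sxx):
    """Density of a new response y at predictor x under the fitted line."""
    scale = resid_se * math.sqrt(1 + 1 / n + (x - x_mean) ** 2 / sxx)
    z = (y - (intercept + slope * x)) / scale
    return t_pdf(z, n - 2) / scale


# Values reported in the text for the recaptured reference data.
INTERCEPT, SLOPE, N, FEMUR_MEAN = 1.213, 0.853, 139, 3.864
# Placeholders (assumed, not from the source).
RESID_SE, SXX = 0.037, 1.0

humerus = 4.541
fit = dict(intercept=INTERCEPT, slope=SLOPE, resid_se=RESID_SE,
           n=N, x_mean=FEMUR_MEAN, sxx=SXX)
density_ind2 = predictive_density(humerus, 3.867, **fit)
density_ind1 = predictive_density(humerus, 3.826, **fit)
lr = density_ind2 / density_ind1
print(f"likelihood ratio (Individual 2 : Individual 1) = {lr:.3f}")
```

Only the density at the observed humerus value enters the ratio; no tail area beyond that value is used, which is what keeps the calculation consistent with the likelihood principle.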
Another way of stating Bayes' Theorem, and one that is used quite frequently in forensic sciences, is that the posterior odds are equal to the likelihood ratio times the prior odds. The prior odds are straightforward in this example. Without recourse to any data or observations, the humerus in question has a probability of one-half of being from Individual 2 and one-half of being from Individual 1, so the prior odds are 1:1. As a consequence, the posterior odds are equal to the likelihood ratio, so the posterior odds are 2.965 “in favor” of the humerus being from Individual 2. The posterior odds can be converted to a posterior probability by dividing the odds by the quantity one plus the odds. This gives a posterior probability of about 0.75 that Individual 2 was the source of the humerus.
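The conversion from likelihood ratio to posterior probability is short enough to write out. The function name is ours, and the default prior odds of 1:1 reflect the two-individual setting described above.

```python
def posterior_probability(likelihood_ratio: float, prior_odds: float = 1.0) -> float:
    """Posterior odds = likelihood ratio x prior odds; then odds / (1 + odds)."""
    posterior_odds = likelihood_ratio * prior_odds
    return posterior_odds / (1.0 + posterior_odds)


# Likelihood ratio of 2.965 with even prior odds.
print(round(posterior_probability(2.965), 4))  # 0.7478
```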
It is useful to extend this example to a situation where the prior odds are not 1:1, which we can do by imagining that the helicopter contained three individuals. We also introduce an additional complication that stature is known for the three individuals. This segues the example into the problem of forming identifications. For this example, we use summary statistics on stature, femur length, and humerus length from Konigsberg et al. (1998) to simulate data from three individuals and then use this simulated data together with the summary statistics to perform the analysis. We constructed two different simulations, an “easy” one in which the three resulting individuals had very different statures, and a “difficult” one in which Individuals 2 and 3 differ in stature by only about 3 cm.
Table 1 shows the results of two simulation runs, where the first is the easy case of very different statures, and the second is the difficult case where two individuals are of similar stature. The table also gives the predicted long bone lengths given the statures. In both cases we need to permute bones against individuals in order to generate posterior probabilities for various bone configurations (bone pairs against individuals). There are six ways to permute the humeri against the three individuals and likewise there are six ways to permute the femora, which leads to 36 (6 × 6) possible ways to configure the bones against individuals. Within each configuration we find the bivariate normal density value within individuals by finding the probability of getting the “assigned” bone lengths given the predicted (from stature) lengths and the residual variance–covariance matrix after “regressing out” stature. If we assume that the individuals are unrelated, then we can multiply these three probability values to get the likelihood of a particular configuration. We assume that all 36 configurations are a priori equally probable, so each receives a prior probability of 1/36.
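The enumeration described above can be sketched as follows. All bone lengths, predicted values, and residual variances in this block are hypothetical values that we made up for illustration; they are not the entries of Table 1 or the summary statistics from Konigsberg et al. (1998).

```python
import math
from itertools import permutations


def bvn_pdf(dx, dy, var_x, var_y, cov_xy):
    """Bivariate normal density at residuals (dx, dy) with zero mean."""
    det = var_x * var_y - cov_xy ** 2
    quad = (var_y * dx * dx - 2 * cov_xy * dx * dy + var_x * dy * dy) / det
    return math.exp(-0.5 * quad) / (2 * math.pi * math.sqrt(det))


# Hypothetical predicted (humerus, femur) lengths in mm for Individuals 1-3.
predicted = [(310.0, 430.0), (330.0, 460.0), (350.0, 490.0)]
# Hypothetical recovered bones.
humeri = {"A": 312.0, "B": 327.0, "C": 352.0}
femora = {"D": 433.0, "E": 457.0, "F": 488.0}
# Hypothetical residual variances and covariance after regressing out stature.
VAR_H, VAR_F, COV_HF = 100.0, 150.0, 60.0


def configuration_posteriors():
    """Posterior for each of the 6 x 6 assignments of bones to individuals."""
    likelihoods = {}
    for h_perm in permutations(humeri):
        for f_perm in permutations(femora):
            like = 1.0
            # Individuals are treated as unrelated, so densities multiply.
            for (pred_h, pred_f), h, f in zip(predicted, h_perm, f_perm):
                like *= bvn_pdf(humeri[h] - pred_h, femora[f] - pred_f,
                                VAR_H, VAR_F, COV_HF)
            likelihoods[(h_perm, f_perm)] = like
    # Equal priors of 1/36 cancel, so normalizing the likelihoods suffices.
    total = sum(likelihoods.values())
    return {key: value / total for key, value in likelihoods.items()}


post = configuration_posteriors()
best = max(post, key=post.get)
print(len(post), best, round(post[best], 4))
```

Each key pairs a humerus ordering with a femur ordering, where position in the tuple corresponds to Individual 1, 2, and 3.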
Table 1. Two Simulations of Three Individuals Each
“Humerus length” and “Femur length” were simulated using the conditional distributions on stature (from the regressions on stature) while the predicted values are the point predictions from the regression of the bones on stature.
Figure 16 shows a plot of the posterior probabilities for each of the 36 configurations in ascending order. The top three posterior probabilities in ascending order are 0.0053, 0.0431, and 0.9517. We ignore all but the two configurations with the highest posterior probabilities. These are labeled in Figure 16 where the column to the left lists the humeri as A, B, and C, and the column to the right lists the femora as D, E, and F. Each row of these small text matrices represents an individual, with Individual 1 as the top row, Individual 2 as the middle, and Individual 3 as the bottom row. The configurations with the two highest posterior probabilities both have the humeri assigned in the same way but differ in whether femora D and E are assigned to Individuals 1 and 2, respectively (with a posterior probability of 0.9517), or to Individuals 2 and 1, respectively (with a posterior probability of 0.0431). On the basis of these results, one might decide to have humerus A and femur E typed for DNA, and find that these bones have different genotypes. This information now enters into the likelihood calculation so that any configuration that pairs A with E would have an overall likelihood of 0.0. This then increases the highest posterior probability from 0.9517 to 0.9945.
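The effect of a DNA exclusion on the posterior probabilities can be sketched by setting the excluded configurations to zero and renormalizing. Only the three largest posterior probabilities are reported in the text, so this sketch assumes that the remaining 33 configurations carry negligible mass and that, among the three reported, only the second-ranked configuration pairs humerus A with femur E. The labels and function name are ours.

```python
def renormalize_after_exclusion(posteriors: dict, excluded: set) -> dict:
    """Zero out excluded configurations and rescale the rest to sum to one."""
    kept = {key: (0.0 if key in excluded else p)
            for key, p in posteriors.items()}
    total = sum(kept.values())
    return {key: p / total for key, p in kept.items()}


# The three largest posterior probabilities reported for the "easy" simulation.
reported = {"top": 0.9517, "second": 0.0431, "third": 0.0053}
updated = renormalize_after_exclusion(reported, excluded={"second"})
print(round(updated["top"], 4))  # 0.9945
```

Under these assumptions the top configuration rises to the same 0.9945 reported in the text.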
Figure 17 shows a comparable plot of the posterior probabilities but this time from the “difficult” simulation where two of the three individuals have similar statures. Now there are four configurations that have posterior probabilities greater than 0.026, which in ascending order are 0.1525, 0.1974, 0.2171, and 0.3546. Note from Figure 17 that the configuration with the second highest posterior probability is indeed the correct configuration. Upon examining these four configurations one can see that humerus A and femur D are always assigned to Individual 1, but the pairing of humeri B and C with femora E and F, and their identification with either Individual 2 or Individual 3, cannot be resolved. To resolve this commingling one would minimally need DNA from one of the remaining humeri (B or C) and one of the remaining femora (E or F), and to form the identifications one would also need an ante-mortem DNA sample from Individual 2 or Individual 3.
DNA exclusions (eliminations) are simpler to work with than DNA inclusions (matches), so we presume here that the analyst obtained an ante-mortem DNA sample from Individual 3 and post-mortem samples from humerus B and femur F (which are paired together only in the fourth highest posterior probability). The lab results come back with a match between Individual 3 and femur F but an exclusion for humerus B against Individual 3. The match will typically be given as a likelihood ratio which is the inverse of the genotype frequency in the “population at large.” This likelihood ratio is based on the idea that the DNA match has a probability of 1.0 if the ante-mortem and post-mortem samples are from the same individual and a probability equal to the population frequency of the genotype if the samples are from different individuals.
For the moment we use this likelihood ratio for inclusion very conservatively, treating it only as the lack of an exclusion. In other words, we cannot necessarily use the DNA match between Individual 3 and femur F to assign femur F to this individual because we do not know whether the remaining individuals might have the same genotype. Using only exclusions, the configurations that previously had the 4th and 3rd highest posterior probabilities (shown at indices 33 and 34 in Fig. 17) are now eliminated because both pair humerus B with femur F, and the configuration with the highest posterior probability (index 36 in Fig. 17) must be eliminated because it places humerus B with Individual 3. This leaves only the (previously second highest posterior probability) configuration at index 35, which now has a posterior probability of 0.9162. Using the inclusion of femur F with Individual 3 (in other words, jettisoning any configurations that do not have this femur with this individual), the posterior probability rises from 0.9162 to 0.9562.
Identification in a “closed population” mass disaster
The previous section touched on the “victim identification” problem when dealing with commingled remains. Here we expand on victim identification, making the simplifying assumption that there is no commingling of remains so that we are solely attempting to make identifications. Rather than working directly with posterior probabilities as in the previous section, we will work with likelihood ratios and prior odds as this is much more common in the disaster victim and personal identification literature (Goodwin et al., 1999; Adams, 2003b; Brenner and Weir, 2003; Alonso et al., 2005; Christensen, 2005; Lin et al., 2006; Steadman et al., 2006; Prinz et al., 2007; Kaye, 2009; Budowle et al., 2011; Butler, 2011; Hartman et al., 2011; Abraham et al., 2012; Montelius and Lindblom, 2012; Jackson and Black, 2013). As mentioned above, likelihood ratios from DNA are typically reported as the inverse of the population frequency for the matched (between ante-mortem and post-mortem) genotype, although this is only possible when a “direct reference” ante-mortem sample is available (so that the numerator is equal to 1.0). Brenner and Weir (2003, p 174) define a direct reference as “a known biological relic of a victim” from which an ante-mortem DNA sample can be obtained. When the post-mortem sample is correctly matched to the direct reference sample then the likelihood is 1.0. This is simply a statement that the probability of getting the observed post-mortem DNA data is 1.0 if the source for that sample was the same individual that provided the ante-mortem sample. The transpose is not necessarily true. It does not necessarily follow that the post-mortem sample must be from the same individual that provided the ante-mortem reference sample because the two samples match. If another individual or individuals is/are a potential source for the post-mortem sample (because they also match the DNA profile) then the transpose is less than 1.0.
We frame this problem of victim identification in terms of calculating the prior probability of a correct identification and defining likelihood thresholds for identifications using an example from Adams (2003a). Adams studied the diversity of dental pathology data using methods very similar to those applied to mtDNA sequence data. He found based on dental records from 29,152 individuals that the most common pattern, present in about 13% of his cases, was for all 28 teeth (he excluded third molars) to be “virgin” (i.e., non-carious and unrestored). If we were dealing with a “closed population” disaster with 100 individuals then we might expect 13 of the individuals to have no dental pathology. Mundorff (2008, p 127) defines a “closed population” as one “where the number and names of the victims involved are known,” while Kontanis and Sledzik (2008, p 318) define it specifically within the transportation industry as one “where accurate passenger manifests are available soon after the accident.” If we further assume that we have complete and accurate ante-mortem and post-mortem data for all 100 individuals, then whenever we correctly match ante-mortem to post-mortem data for individuals with “virgin” teeth the likelihood will be 1.0, but the likelihood ratio will be $1/(12/99) = 99/12 = 8.25$. This is the inverse of the frequency of “virgin” teeth which is 12/99 rather than 13/100 because the denominator in the likelihood ratio excludes the current individual who already appears in the numerator with a likelihood of 1.0.
Prior probability of a correct identification
Following Budowle et al. (2011), we use $v$ to represent the number of, as yet, unidentified victims. Note that there is also precedent for using a different symbol for this quantity (Brenner and Weir, 2003). Using $v$ as the number of victims, the prior probability of a correct identification is $1/v$, while the prior probability of an incorrect identification is $(v-1)/v$. The prior odds of an identification (i.e., the ratio of the prior probability of a correct identification to the prior probability of an incorrect identification) are:
$$\frac{1/v}{(v-1)/v} = \frac{1}{v-1}$$
With each identification made, the number of unidentified victims can be decreased by one, a point also alluded to by Brenner and Weir (2003) when they indicate that the prior odds “continually increase as new victim identifications are made.”
Continuing with our example of a closed population disaster with 100 individuals, the prior odds of a correct identification are 1/99. When multiplied by the likelihood ratio for an ante-mortem to post-mortem match for “virgin” teeth this gives the posterior odds as 1/12. The posterior odds can in turn be converted to the easily interpretable posterior probability of $(1/12)/(1 + 1/12) = 1/13$. As there are thirteen individuals within the 100 individuals known to have had “virgin” teeth, a post-mortem match has a 1/13 chance of leading to a correct identification. This calculation is not in keeping with the more conservative way in which DNA evidence is ordinarily handled. There the denominator in the likelihood ratio is the “match probability” based on a database, such as using the product rule with the Combined DNA Index System (Budowle and Moretti, 1999; Budowle et al., 2001).
As an example of this more conservative approach, Steadman et al. (2006) found that a particular pattern of dental pathology observed in a forensic case was never observed in a database sample of 29,152 individuals. They used a match probability of 1/29,153 in analyzing their case, on the presumption that had one additional case been sampled it would have yielded a match. Now if we presume that this very rare dental pathology pattern is known to belong to one of the 100 victims but we have incomplete ante-mortem information from the remaining victims, under this more conservative approach we must allow for the possibility that there was another occurrence of the pattern. The likelihood ratio given an ante-mortem to post-mortem match is now 29,153, which when divided by 99 gives posterior odds of about 294.5 and this converts to a posterior probability of about 0.9966. In contrast, if we knew with surety from the ante-mortem data that there was only one individual among the 100 with this pattern of dental pathology, then the likelihood ratio would be infinite and the posterior probability on the identification would be 1.0.
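The two closed-population calculations above can be reproduced with exact fractions. This is a minimal sketch with function names of our own choosing; it assumes prior odds of 1/(v − 1) for v unidentified victims, as in the 1/99 used for 100 victims.

```python
from fractions import Fraction


def prior_odds(v: int) -> Fraction:
    """Prior odds of a correct identification among v unidentified victims."""
    return Fraction(1, v - 1)


def posterior_probability(likelihood_ratio, odds_prior):
    """Convert likelihood ratio and prior odds to a posterior probability."""
    posterior_odds = likelihood_ratio * odds_prior
    return posterior_odds / (1 + posterior_odds)


v = 100

# "Virgin" dentition: 12 of the other 99 victims share the pattern.
lr_virgin = 1 / Fraction(12, 99)
p_virgin = posterior_probability(lr_virgin, prior_odds(v))
print(lr_virgin, p_virgin)  # 33/4 1/13

# Rare pattern with a database match probability of 1/29,153.
lr_rare = Fraction(29153)
p_rare = posterior_probability(lr_rare, prior_odds(v))
print(round(float(lr_rare * prior_odds(v)), 1), round(float(p_rare), 4))  # 294.5 0.9966
```

Because the prior odds depend only on v, rerunning the calculation with a smaller v shows how each completed identification strengthens the remaining ones.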
More on priors and the problem of transposed conditionals in forensic science
We have not yet commented on the role of prior information in forensic anthropology, although this is clearly an area for potential confusion. Our use of the prior above, which is very common in the mass fatality identification literature for “closed” populations, is uncontroversial. But in expert testimony the introduction of prior odds for versus against an identification should be avoided at all costs. For the expert to “wander” into dealing with prior odds brings the expert into the position of commenting on the particulars of the identification external to the forensic anthropological evidence. As Koehler and Saks (1991, p 371) note: “Where presumptions exist, they should be provided by the law, not by expert witnesses.” Using only the likelihood ratio from the forensic anthropological evidence should protect the expert from this error, but unfortunately it is all too easy for the likelihood ratio to be misinterpreted by “transposing” the conditioning (Thompson and Schumann, 1987; Evett, 1995; National Research Council Committee on DNA Technology in Forensic Science, 1996; Foreman et al., 2003).
Evett (1995, p 129) gives a clear illustration of a transposed conditional probability:
The probability that an animal has four legs if it is a cow is one
does not mean the same thing as:
The probability that an animal is a cow if it has four legs is one.
A much more convoluted example can be found in a discussion in the 2003 movie “Pirates of the Caribbean: The Curse of the Black Pearl,” in which Murtogg demonstrates very circuitously to Mullroy that the probability that the ship the Black Pearl would have black sails is 1.0, while the probability that a ship with black sails would be the Black Pearl is much lower. Both misreading of a conditional probability and transposition of conditioning in a likelihood ratio can occur when a forensic anthropologist is acting as an expert witness. We give an extended example of likely scenarios under which such errors can occur when a forensic anthropologist is acting in this role.
Let's presume that the skeleton of an unidentified individual is found in a clandestine grave. A partial DNA profile from the Combined DNA Index System (CODIS) is obtained from the individual and submitted to the FBI to be checked against the National Missing Person DNA Database. This produces a “hit” for a missing person sample obtained from a hairbrush (the DNA on which was confirmed using a biological sample from a sibling). The missing person's disappearance was circumstantially linked to another individual who stands to be charged with murder if the identification can be “made.” As a consequence, the partial profile from the skeleton is checked against a larger database and ultimately a likelihood ratio of 10,000 is obtained.
The prosecution is concerned that this likelihood ratio may not be high enough, and on finding that dental records are available from the missing person the prosecutor obtains an expert witness to examine the dentition from the skeleton and the dental records of the missing person. The expert witness finds a perfect match, with all teeth being free of restorative work except the mandibular right first and second molars, which have occlusal fillings. On submitting this dental pattern to the Joint POW/MIA Accounting Command's program Odontosearch II (http://www.jpac.pacom.mil/index.php?page=odontosearch), the expert witness finds that 29 out of 37,955 individuals have this particular pattern. The expert witness consequently reports a likelihood ratio of 1,308.8. At trial, the expert witness testifies that “the match is nearly 1,309 times more likely if the skeleton and the missing person represent the same individual than if they were two different individuals.” Wishing to clarify this statement, the prosecutor asks “so based on the dental evidence, it is nearly 1,309 times more likely that the skeleton is that of the missing person than that it is from someone else?” But one could only make such a leap from the likelihood of the data under two different hypotheses to the posterior odds of those hypotheses given the data by assuming that the prior odds on the identification are “evens” (1/1). The prosecutor would never have been asked to prosecute a case based on an identification which was as likely to be correct as incorrect. This form of transposition is referred to as the “prosecutor's fallacy” (first so named by Thompson and Schumann, 1987) because that is the usual source for the error.
The defense attorney who has had access to the expert witness's report has seen that the likelihood ratio was obtained by inverting the match probability. The defense attorney realizes that 29 out of 37,955 individuals in the Odontosearch database matched on dental pathology to the skeleton from the clandestine grave. The grave was located in a rural area just on the outskirts of a city with a population of one million individuals. The defense attorney consequently multiplies the frequency of the dental pathology pattern in Odontosearch, which is 29/37,955 (about 0.000764), by one million to arrive at a predicted 764 individuals from the city that would have had a matching pattern. From this standpoint, the defense attorney argues that the posterior odds that the skeleton from the clandestine grave represents the same individual as the missing person (as opposed to one of the 764 individuals presumed to have the same dental pattern) are 1:764. This form of logic was referred to by Thompson and Schumann as the “defense attorney's fallacy” because that also was the usual source. This fallacy is not so much an outright transposition as it is a misreading of the evidence within the framework of the trial. The fact that there are potentially many other individuals who match the dental pattern ignores the fact that there is other evidence that moved the case forward. In making this argument the defense attorney has chosen to ignore the DNA evidence that was central to the initial putative identification.
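The arithmetic behind both fallacies can be laid out in a few lines. The database counts and city size come from the scenario above; the list of prior odds is hypothetical and is included only to show that the prosecutor's reading of the likelihood ratio as posterior odds holds solely when the prior odds are 1:1.

```python
matches, database_size = 29, 37_955
city_population = 1_000_000

match_probability = matches / database_size
likelihood_ratio = database_size / matches                    # about 1,308.8
expected_city_matches = match_probability * city_population   # about 764


def posterior_odds(lr: float, prior_odds: float) -> float:
    """Posterior odds are the likelihood ratio times the prior odds."""
    return lr * prior_odds


print(round(likelihood_ratio, 1), round(expected_city_matches))

# Hypothetical prior odds; only the first reproduces the prosecutor's statement.
for prior in (1.0, 1 / 100, 1 / 10_000):
    print(f"prior odds {prior:g} -> posterior odds "
          f"{posterior_odds(likelihood_ratio, prior):.3f}")
```

The defense attorney's 1:764 corresponds to treating every resident of the city as an equally likely source before the dental evidence, which sets aside whatever evidence led to the putative identification in the first place.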
The role of prior information can become particularly confusing in the forensic setting because there is almost always prior information that enters into the denominator of likelihood ratios. This can be seen in Schneider's (2007, p 240) statement that “Match probabilities or likelihood ratios are based on assumptions about the ethnic origin of the suspect or an unknown perpetrator.” In an example of calculating a likelihood ratio, Evett and Weir (1998) write “because the victim described her assailant as Caucasian, it is appropriate to use estimates of allele proportions from a Caucasian database.” In contrast, the National Research Council report (1996, p 29) contains the following proscription, “Usually, the subgroup to which the suspect belongs is irrelevant, since we want to calculate the probability of a match on the assumption that the suspect is innocent and the evidence DNA was left by someone else.” In any event, it is important to understand that the assumption of a prior for the denominator of a likelihood establishes the “population at large” that contains all individuals who could potentially match the evidentiary information. This is not the same thing as assuming a prior probability or prior odds for an identification. In a legal proceeding, this prior is in the domain of the court, while in disaster victim identification the prior may be well-defined for “closed population events” and much less clear for “open” events. In either case, the identification (at which point prior odds of the identification come into play) will be above the level of any one “identification modality.”
In a critique of Steadman et al. (2006), Ferreira and Andrade (2009) confuse the issue of a prior on an identification with a prior for the denominator calculation in a likelihood ratio. Steadman et al. (2006) gave the likelihood ratio as:
$$\mathrm{LR} = \frac{p_{M|M}}{p_{M|M}\,\pi_M + p_{M|F}\,\pi_F} \tag{11}$$
for a case with a putative identification to an individual who was a male, where $p_{M|M}$ and $p_{M|F}$ are, respectively, the probabilities that an actual male and that an actual female would be sexed as male. Steadman et al., on the basis of a study of the Terry Anatomical Collection, gave these probabilities as 0.9884 and 0.0194. They then gave the prior probability in the “population at large” of being male, $\pi_M$, as being equal to the prior probability of being female, $\pi_F$ (both equal to 0.5). They took this prior after an examination of the sex ratio in the National Crime Information Center's missing person database, ultimately arriving at a likelihood ratio of 1.9615. Ferreira and Andrade (2009) argued that the likelihood ratio in Eq. (11) should be:
$$\mathrm{LR} = \frac{p_{M|M}}{p_{M|F}}$$
This is only correct if the prior sex ratio for the “population at large” is such that the population is entirely composed of females. In that case, and if osteologists were “perfect” at assessing sex from skeletal material, then the likelihood ratio would reach its maximum of infinity as opposed to the maximum of 2.0 under a prior sex ratio of 1:1.
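The two competing likelihood ratios can be checked numerically. The sketch below uses the probabilities quoted from Steadman et al. (2006); the function and variable names are our own shorthand and do not come from either publication.

```python
def likelihood_ratio(p_m_given_m, p_m_given_f, prior_male):
    """LR for a 'sexed as male' finding when the putative identification is male.

    Numerator: probability that an actual male is sexed as male.
    Denominator: probability that a member of the "population at large" is
    sexed as male, which mixes over the assumed sex ratio of that population.
    """
    denominator = prior_male * p_m_given_m + (1 - prior_male) * p_m_given_f
    return p_m_given_m / denominator


steadman = likelihood_ratio(0.9884, 0.0194, 0.5)    # 1:1 sex ratio
all_female = likelihood_ratio(0.9884, 0.0194, 0.0)  # population entirely female
perfect = likelihood_ratio(1.0, 0.0, 0.5)           # error-free sexing, 1:1 ratio

print(round(steadman, 4), round(all_female, 2), perfect)
```

Setting the prior proportion of males to zero reproduces the simple ratio of the two sexing probabilities, which is the only setting in which the alternative form holds, and error-free sexing under a 1:1 sex ratio gives the ceiling of 2.0.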
It is our hope that the examples we have given in this article will spur biological anthropologists to move beyond any preliminary training they may have received concerning Bayes' Theorem and consider using Bayesian approaches where appropriate in their future research. While a number of the examples we have examined, because they began with an uninformative prior, provide results that do not differ substantively from traditional maximum likelihood approaches, they do illustrate that a Bayesian approach provides a richer depiction of our current knowledge about parameters and models. In the example taken from Uhl et al. (2013) we were able to present the full posterior densities for body mass under an uninformative prior and under an informative prior (see Fig. 11) whereas the published analysis gave only the 0.025, 0.5, and 0.975 quantiles under these two priors. When a Bayesian approach produces rather different results from a traditional frequentist approach, it often does so because of problems with the assumptions underpinning the latter approach. In our paleodemographic example, maximum likelihood estimation produced an asymptotic (log normal) density for the baseline hazard parameter that was too peaked because the baseline hazard parameter lay near its boundary of 0.0. In this case, rejecting the Bayesian approach because it requires prior information (we used a gamma distribution with “shape” and “scale” values both equal to 1.0) and instead using maximum likelihood estimation leads to a baseline hazard confidence interval that is too narrow. This is not because of a failure on the part of maximum likelihood estimation, but is instead a result of pressing the method into a situation where an assumption (that a parameter is not on or near a boundary) is violated.
It has been a quarter of a century since Berger and Berry (1988, p 166) wrote:
…common usage of statistics seems to have become fossilized, mainly because of the view that standard statistics is the objective way to analyze data. Discarding this notion, and indeed embracing the need for subjectivity through Bayesian analysis, can lead to more flexible, powerful, and understandable analysis of data.
We are not advocating dispensing with initial training in “standard statistics,” as knowledge of frequentist concepts often makes Bayesian concepts clearer, and frequentist analyses are sometimes useful and appropriate. The Nobel Laureate economist Sims (2007, p 2), in a summer seminar paper “Bayesian methods in applied econometrics, or, why econometrics should always and everywhere be Bayesian,” wrote in regard to training that: “…I think that full understanding of what confidence regions and hypothesis tests actually are will lead to declining interest in constructing them.” On a much less pessimistic note concerning the future of frequentist approaches, Samaniego (2010) presents frequentist and Bayesian approaches as a threshold problem where the benefits of one approach over the other may reach a tipping point. Samaniego provides examples where the frequentist approach can win out over Bayesian approaches, although in the end he characterizes himself as “unabashedly a ‘Bayesian sympathizer'.” He (Samaniego, 2010, p 62) also takes the occasional jab at frequentist methods, writing: “…let's take a look at the logical underpinnings of frequentist inference. Surprise, there are none!” But as he is quick to point out, a logically consistent method that produces poor answers is no boon.
There has been a long-standing debate over whether Bayesian inference can be taught effectively at an introductory level. The oft-cited exchange in The American Statistician's “Teacher's Corner” could be considered a bit of a draw, with Moore (1997) clearly supporting omitting Bayesian methods from such courses and Albert (1997) and Berry (1997) arguing strongly for their inclusion. The comments on these three articles were essentially split, with two authors (Freedman and Scheaffer) not supporting an extensive introduction of Bayesian methods, two (Lindley and Short) supporting introduction of such methods, and one author (Witmer) a bit on the fence. At that time, one of the chief impediments to teaching about Bayesian inference was the general lack of suitable textbooks and readily available software, with Albert's (1996) and Berry's (1996) introductory texts being about the only options. Since then Kruschke (2010a) has written an excellent introductory Bayesian text that incorporates the use of both R and BUGS, and there are now more advanced texts that focus on Bayesian inference and MCMC (Albert, 2009; Link and Barker, 2010; Christensen et al., 2011; Lesaffre and Lawson, 2012; Cowles, 2013). Examples in these more advanced texts do sometimes run to the anthropological, such as in Christensen et al.'s (2011, p 4) use of data on Ache armadillo hunting practices.
While training is an important issue, the ability of students to learn Bayesian methods ultimately should not be the criterion upon which we decide whether or not to incorporate such methods into our own research. Furthermore, the leap into Bayesian thinking may not be quite as great as anticipated. Stangl (1998, p 256–257) writes about her collaborations with “researchers trained in medicine and the social sciences,” which in theory includes the bulk of training that most biological anthropologists receive. She says of them:
While the statistical analyses they present in publications is nearly 100% classical, the statistical interpretations made in their day-to-day work is not. In daily conversations, debates, and statistical analyses, they rarely follow classical prescriptions for “legitimate” data analyses or give classical interpretations to their inference. In their day-to-day activities their thinking and the decisions they make based on this thinking are nearly 100% Bayesian. What appears on paper is not indicative of what goes on in their heads.
This brings us full circle to our original example using Mays and Faerman's (2001) data of nine male and four female infant deaths. In the section on sequential use of Bayes' Theorem, we mentioned that Mays and Faerman essentially stopped their (frequentist) analysis at twelve males and five females, which included the four individuals from Waldron et al. (1999). Though the data from Faerman et al. (1998) were certainly available, Mays and Faerman mentioned the male bias in those data, and did not include these additional data, which would have brought the number of total male infant deaths to 26 and female deaths to 10. The apparent reason for their exclusion of these earlier data from any further chi-square calculations was the possibility that the sample used in the Faerman et al. study was derived from a brothel, based on archaeological context (the remains were recovered around a bath house) and a male bias among the remains (which they interpreted as female babies having economic value in the presumed context). Mays and Faerman made the somewhat subjective decision that the Faerman et al. data came from a different context and consequently should not be included. What appear to be frequentist studies often have subjective elements within, or subjective decisions attached to, them. These subjective elements or decisions are better served by seeking fuller explanations through Bayesian modeling, since choosing a Bayesian approach forces the researcher to make the subjective information explicit.
The authors thank the anonymous reviewers for their helpful comments on a previous draft. They also thank Trudy Turner both for her comments and for her patience when they pushed various deadlines to their utter limits. The reference sample data used for the paleodemographic example was obtained under National Science Foundation grant BCS97-27386 awarded to the first author.
Approximate Bayesian computation
Approximate Bayesian computation (ABC) is not the oldest Bayesian Monte Carlo method, but it is a good place to start our illustrations because in its simplest form it is straightforward. Pritchard et al. (1999), following Tavaré et al. (1997), were among the first to use ABC, although the method was not referred to as such until a few years later (Beaumont et al., 2002). The method is a form of acceptance sampling that can be used to “build” the posterior distribution of one or more parameters. ABC involves four steps that are repeated (iterated) a large number of times: 1) possible parameter values are simulated from a prior distribution, 2) data is simulated based on the parameter values from the first step, 3) summary statistics from the simulated data are compared to the summary statistics from the actual data, and 4) if the summary statistics from the simulated data are “close enough” to the summary statistics from the actual data, the parameters simulated in the first step are accepted. An advantage of the ABC method is that it does not require calculating likelihoods. Turner and Van Zandt (2012) give an example of ABC using the binomial distribution which we present here using the Mays and Faerman (2001) data.
To apply ABC, we simulated 10,000 values from the posterior density by: 1) drawing a uniform random deviate between 0 and 1 (i.e., a value of $\theta$ from the prior), 2) simulating a random binomial deviate at a sample size of thirteen (the sample size in Mays and Faerman, 2001) using the value from the first step, and 3) accepting the proposed value from the first step if it produced nine “males.” The top graph in Figure A1 shows a density histogram of the 10,000 values as well as the posterior density. The bottom graph in Figure A1 shows the empirical cumulative distribution function (Wilk and Gnanadesikan, 1968) from the 10,000 simulated values along with the cumulative distribution function.
The simulation we just described is rather inefficient. With 14 possible values for the binomial variable (number of “males” from 0 to 13) and sampling out of the uniform distribution, the probability of simulating nine “males” is just 1/14, or 0.0714 (as we saw when discussing Bayes' factor). Consequently, about 93% of the proposed values will be rejected. This means that to get 10,000 accepted values we would have to simulate about 140,000 initial values (10,000 × 14). One could relax the requirement that we must get nine “males” in order to accept a proposed value, and instead say that we would accept a value if it leads to the simulation of 8, 9, or 10 “males.” This has an acceptance probability of 3/14 so that about 21% of the simulated values would be accepted, but the posterior will be slightly heavier in the tails.
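The acceptance scheme just described can be sketched in Python using only the standard library. This is a minimal illustration rather than the code used to produce Figure A1; the function name, the `tolerance` argument, the seed, and the choice of 2,000 accepted values (fewer than the 10,000 in the text, to keep the run short) are our own assumptions.

```python
import random


def abc_binomial(n_accept, n=13, observed=9, tolerance=0, rng=None):
    """ABC rejection sampler for a binomial proportion under a uniform prior."""
    rng = rng or random.Random()
    accepted, proposals = [], 0
    while len(accepted) < n_accept:
        theta = rng.random()                                 # 1) draw theta from the prior
        males = sum(rng.random() < theta for _ in range(n))  # 2) simulate the data
        proposals += 1
        if abs(males - observed) <= tolerance:               # 3) compare, 4) accept
            accepted.append(theta)
    return accepted, proposals


rng = random.Random(2001)
exact, n_exact = abc_binomial(2_000, tolerance=0, rng=rng)  # exactly nine "males"
loose, n_loose = abc_binomial(2_000, tolerance=1, rng=rng)  # 8, 9, or 10 "males"

print("acceptance rate, exact match:", round(len(exact) / n_exact, 3))
print("acceptance rate, 8 to 10:    ", round(len(loose) / n_loose, 3))
print("posterior mean, exact match: ", round(sum(exact) / len(exact), 3))
```

The two acceptance rates should fall near 1/14 and 3/14, and the mean of the exactly matched values should be close to 10/15, the mean of the Beta(10, 5) posterior that the uniform prior implies for these data.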
For continuous data problems and multiple parameter problems, ABC potentially could reject virtually all simulated values, making it necessary to accept simulated values that are within a pre-established tolerance. For example, Pritchard et al. (1999) wanted to estimate multiple demographic parameters for past populations using modern Y chromosome microsatellite data. They used three summary statistics to compare the actual dataset to the dataset simulated using ABC: the average variance across eight loci for repeat numbers, the average heterozygosity, and the number of unique haplotypes. To test whether a simulated dataset should be accepted they formed the absolute difference between each pair of these three summary statistics (one calculated on the simulated dataset and the other on the actual dataset), divided each difference by its respective actual dataset value, and then accepted a simulation if all three scaled differences were less than 0.1. They found that if they used a value lower than 0.1 then the posterior densities for the demographic and genetic parameters did not change appreciably.
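A tolerance rule of this kind is short to write down. The sketch below is our reading of the rule as described here, not code from Pritchard et al. (1999), and the summary statistic values are hypothetical numbers invented purely to exercise the function.

```python
def within_tolerance(simulated, observed, delta=0.1):
    """Accept when every scaled absolute difference is below delta."""
    return all(abs(s - o) / abs(o) < delta for s, o in zip(simulated, observed))


# HYPOTHETICAL summary statistics (variance in repeat number, heterozygosity,
# number of unique haplotypes); these values are invented for illustration.
observed = (1.15, 0.64, 74)

print(within_tolerance((1.20, 0.61, 70), observed))  # all three within 10%
print(within_tolerance((1.20, 0.61, 60), observed))  # haplotype count too far off
```

Scaling each difference by the observed value puts the three statistics on a common relative footing, so a single tolerance can be applied to all of them.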
ABC has been applied to a number of human genetic, epidemiological, and demographic questions, a few of which have been reported in the AJPA (see for example Boattini et al., 2013). Ray and Excoffier (2009) give a useful review of some such applications. A number of ABC packages for addressing genomic diversity and phylogeny are available in the “R” statistical/graphics program and as stand-alone programs. For example, coalescent simulations are widely available both as “stand-alone” programs (Laval and Excoffier, 2004; Ray et al., 2010) and within “R” as “rtree” in the contributed package “ape” (Paradis, 2012). Additionally, new ABC packages for addressing past population structure are proliferating. Tsutaya and Yoneda (2013) have recently written an ABC package in R called “WARN” (for “Weaning Age Reconstruction with Nitrogen isotope analysis”) that brings the ABC method into the realm of bioarchaeology. In general, ABC packages and programs tend to be quite specific in application. Available packages are useful for asking the same questions of different data, but if you are asking a new question, you must be prepared to write your own code to simulate data after sampling from prior distributions for the parameters. While ABC is useful for sampling from the posterior density for population genetic and demographic parameters, its use in model comparison has been called into question (Robert et al., 2011).
Unlike the relatively recent ABC method, the Metropolis sampler dates back to the middle of the last century (Metropolis et al., 1953), and belongs to the class of Markov Chain Monte Carlo (MCMC) methods (Geyer, 1992; Gilks et al., 1996; Brooks et al., 2011) that have revolutionized Bayesian analysis. As one of the simplest of the MCMC methods, the Metropolis sampler is useful for explaining the simulation and sampling processes used by these methods. The first MC in MCMC references the fact that because these methods use the current value in order to simulate the next value, the sequence of simulated values follows a Markov chain. In a Markov chain, a future value is related probabilistically to the current value. The second MC in MCMC references the fact that these methods use Monte Carlo simulation, or repeated random sampling to obtain numerical results. In the Metropolis sampler the next value in the sequence is offered as a “proposal” value that may or may not be accepted. We start with the simplest of the Metropolis samplers: the independence Metropolis sampler. In our example we will use Jeffreys' prior so that the effect of the prior can be seen in the Monte Carlo procedure.
The independence variant of the Metropolis sampler is so called because potential future values are drawn independently of the current value. The fact that future values are drawn independently of current values suggests that the independence sampler will not form a Markov chain, but because the independence sampler can “reject” a proposed value and stay at its current value it does form a Markov chain. To use the Metropolis sampler we need to be able to write the product of the likelihood and the prior up to a multiplicative constant. This gives:
$$f(\theta) \propto \left[\theta^{9}(1-\theta)^{4}\right]\left[\theta^{-1/2}(1-\theta)^{-1/2}\right] \tag{A1}$$
where the first bracketed term is from the likelihood and the second is from the prior. To start the Metropolis sampler we randomly draw a value from the uniform density, which we can write as $\theta_c$ for “theta current,” and find $f(\theta_c)$ from Eq. (A1). Now we pick another value from the uniform density, which we can write as $\theta_p$ for “theta proposal,” and find $f(\theta_p)$, again from Eq. (A1). We then form the ratio $f(\theta_p)/f(\theta_c)$, and if the ratio is greater than 1.0 then we accept the proposed value as our new $\theta_c$. What we have just described is simply a “maximizer,” and a very inefficient one at that. It will climb to the highest density and stay there without forming a sample of the posterior. In order to sample from the posterior density, we must allow the possibility for the sampler to make stochastic moves to lower densities. In order to do this, we draw a random deviate u from the uniform density whenever the ratio is less than one, and if u is less than the ratio, then we accept the proposed value; otherwise the sampler stays at its current value.
Figure A2 shows the results of using the independence Metropolis sampler on the Mays and Faerman (2001) data. Panel A shows 100 iterations through this algorithm, with the horizontal dashed line representing the posterior mode from the posterior distribution based on nine males and four females and Jeffreys' prior of Beta(0.5, 0.5). Panel B shows the ECDF from 10,000 simulations along with the cumulative distribution function. The independence sampler thus appears to be effective in sampling from the posterior density in our “toy” example, but it can be very ineffective in real-world practical applications because it may reject proposal values so frequently that the sampler gets stuck on various values. Konigsberg and Herrman (2002, p 236) reported that they “had some success” in obtaining samples from the posterior distribution of age-at-death using the independence Metropolis sampler after observing an “age marker” and sampling from a uniform distribution of age. This method had success only because the amount of information from the “age marker” was paltry, so that a uniform distribution of age was an adequate proposal density.
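The independence sampler can be sketched as follows. This is a minimal illustration and not the code behind Figure A2: it assumes the unnormalized posterior is the binomial likelihood for nine males and four females multiplied by a Jeffreys Beta(0.5, 0.5) prior, it works on the log scale to avoid underflow, and the function names, seed, and chain length are our own choices.

```python
import math
import random


def log_target(theta):
    """Log unnormalized posterior: theta^9 (1-theta)^4 times theta^-0.5 (1-theta)^-0.5."""
    return 8.5 * math.log(theta) + 3.5 * math.log(1.0 - theta)


def draw_uniform(rng):
    """Uniform draw on the open interval (0, 1), so that the logs are finite."""
    u = rng.random()
    while u == 0.0:
        u = rng.random()
    return u


def independence_metropolis(n_iter, seed=1):
    rng = random.Random(seed)
    current = draw_uniform(rng)
    chain, accepted = [], 0
    for _ in range(n_iter):
        proposal = draw_uniform(rng)  # drawn independently of the current value
        log_ratio = log_target(proposal) - log_target(current)
        # Accept outright when the ratio exceeds 1; otherwise accept with
        # probability equal to the ratio, and stay put if the draw fails.
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            current = proposal
            accepted += 1
        chain.append(current)
    return chain, accepted / n_iter


chain, rate = independence_metropolis(20_000)
print("acceptance rate:", round(rate, 3))
print("posterior mean: ", round(sum(chain) / len(chain), 3))
```

Because the proposal density is uniform, it cancels from the acceptance ratio, which is why only the two target densities appear. The chain mean should sit near 9.5/14, the mean of the Beta(9.5, 4.5) posterior.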
A more flexible and useful form of the Metropolis algorithm is the random walk Metropolis sampler. Kruschke (2010a) devotes an entire chapter of his book to presenting a random walk Metropolis sampler using the uniform prior. In this method, a random “jump” from a symmetric distribution centered on zero is added to the current value rather than independently sampling a proposal value. Figure A3 shows an example of the random walk sampler that is much like the previous example except that a random “jump” from a normal distribution with a mean of zero is added to the current value in order to form the proposal value. The acceptance ratio is calculated as described above, a random uniform number is drawn, and if that number is less than the acceptance ratio then the proposal value is accepted. If the current value is near zero or one then the “jump” may be to a proposed value that is less than zero or greater than one, in which case the proposed value is automatically rejected.
Figure A3 shows 250 iterations of a Metropolis random walk with a standard deviation of 0.2 for the jump distribution in the top panel and the same procedure but with standard deviations of 0.01 and of 2.0 in the middle and bottom panels, respectively. The selection of an appropriate “spread” for the jump distribution is an important consideration, as the 0.01 standard deviation is too small and leads to very frequent acceptance of “baby steps” through the posterior distribution. Thus, the random walk has not even entered the area around the posterior mode in the middle panel of Figure A3 (where the standard deviation is 0.01). Conversely, the standard deviation of 2.0 is far too large so that the proposed “jump” is almost always rejected, and as a consequence the Metropolis sampler gets stuck for numerous consecutive iterations, as shown in the bottom panel of Figure A3. One of the many advantages of using specialized software (such as JAGS, OpenBUGS, and WinBUGS) to run the Metropolis sampler is that they automatically determine “jump widths” and can “tune” the widths so that the sampler works more effectively, removing this house-keeping chore from the user's responsibility. The advantage of Metropolis samplers overall is that they can be used to sample from very general distributions, including ones that have no boundaries, or ones that have only a single boundary such as at zero but that are “open” to the left (down to negative infinity) or to the right (up to infinity).
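The effect of the jump standard deviation on acceptance can be reproduced with a short random walk sampler. The sketch below is an illustration under stated assumptions and not the code behind Figure A3: it targets the same unnormalized posterior (nine males, four females, Jeffreys' prior), starts at an arbitrary value of 0.5, and runs 5,000 iterations rather than 250 so that the acceptance rates are stable.

```python
import math
import random


def log_target(theta):
    """Log unnormalized posterior for nine males, four females, Jeffreys' prior."""
    return 8.5 * math.log(theta) + 3.5 * math.log(1.0 - theta)


def random_walk_metropolis(n_iter, jump_sd, start=0.5, seed=1):
    rng = random.Random(seed)
    current, accepted, chain = start, 0, []
    for _ in range(n_iter):
        # Add a normal "jump" with mean zero to the current value.
        proposal = current + rng.gauss(0.0, jump_sd)
        # Proposals outside (0, 1) are rejected automatically.
        if 0.0 < proposal < 1.0:
            log_ratio = log_target(proposal) - log_target(current)
            if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
                current = proposal
                accepted += 1
        chain.append(current)
    return chain, accepted / n_iter


rates = {sd: random_walk_metropolis(5_000, sd)[1] for sd in (0.01, 0.2, 2.0)}
for sd, rate in rates.items():
    print(f"jump sd {sd}: acceptance rate {rate:.2f}")
```

A very high acceptance rate at a standard deviation of 0.01 is the signature of “baby steps,” while a very low rate at 2.0 corresponds to the sampler sitting on the same value for many consecutive iterations.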
While the Metropolis sampler can adequately sample from conditional distributions, it is comparatively inefficient, and often too simple to deal effectively with large datasets and multiple parameter situations. To counter the inefficiency of the Metropolis sampler and to make Gibbs sampling (discussed in a later section) easier, Neal (2003) introduced a different method of sampling from densities (including posterior densities) that he refers to as “slice sampling.” Although we only consider univariate slice sampling here, Neal described the multivariate form. Slice sampling has the advantage that it always produces a new step through the space. Neal described two possible ways of performing slice sampling, the first being a “stepping-out” process followed by “shrinkage,” and the second being a “doubling” process. We describe the (first) stepping-out and shrinkage method here. Slice sampling is most easily explained, and hopefully best understood, using a graphical representation like the one we provide in Figure A4. All eight panels in this figure show the posterior density from which we are trying to sample, and proceeding in order through panels A through H represents the sequential steps in “stepping-out” and then “shrinkage.”
In panel A of Figure A4 we have a current value of $\theta$ equal to 0.6. This value of $\theta$ has a particular density value which we find from Eq. (A1), and then represent on the graph as a filled point. Next we pick a new point, shown as an unfilled point, from a uniform distribution between 0 and the density value at our current $\theta$ (the filled point). The “slice” is represented in panel A by the dotted line across the density at the randomly chosen unfilled point “slice height.” The “slice” is not actually seen or known, nor do we have a full picture of the density from which we are sampling. All we can do is evaluate points using Eq. (A1). Throughout the slice sampling shown in Figure A4 we will be using a “window” with a width of 0.1 along the abscissa to represent the size, or area, of individual steps that we will take in the “stepping-out” process.
As shown in panel B, this window is randomly placed such that it covers $\theta$. In this particular example, the left boundary for the window was randomly placed at 0.576, and the right boundary is consequently at 0.676. Now we evaluate the densities (shown as filled points in panel B) at these two values on the abscissa (0.576 and 0.676) to determine whether the density values fall above or below the slice height. Slice height is again shown with a dotted line, and we have hatched the rectangular window up to the slice height. Panel B and all subsequent panels also show $\theta$ with a heavy vertical line. Because both the left and right densities fall above the slice height, we need to “step-out” in both directions.
Panel C shows “stepping-out” to the left. The original rectangle (from 0.576 to 0.676) is shown without hatching while the step to the left, down the window-width of 0.1 to a value of 0.476, is shown with a hatched rectangle. The density at a value of 0.476 is less than the slice height, so we are done stepping out to the left. Panel D shows that two steps to the right must be taken in order to obtain a “right-hand” density that is less than the slice height. Panel E shows the final sampling window after “stepping-out” is complete; this is the region from which we will sample a new potential value of $\theta$. We “stepped-out” one window of 0.1 to the left from 0.576 (the initial left boundary) and two windows to the right from 0.676 (the original right boundary), so the region for sampling runs from 0.476 to 0.876.
Panels F through H show the process of “shrinkage” that leads to finally sampling a new value for $\theta$. In panel F we randomly sample a value from the uniform between 0.476 and 0.876, obtaining a value of 0.54. The density at 0.54 (shown with a filled point) is below the slice height and consequently cannot be accepted. As the value of 0.54 is less than the current value of 0.6, the shaded region will need to shrink to the right (toward the current value). Panel G shows the new boundary and new sampling window that runs from 0.54 to 0.876. We now randomly sample another value from the uniform, this time between 0.54 and 0.876, obtaining a value of 0.747, as shown in panel H. This value gives a density above the slice height and is consequently accepted as the next value of $\theta$. If, on the other hand, we had randomly sampled a value between 0.829 and 0.876, then we would have rejected that value and shrunk the shaded region to the left, shifting the right boundary to whatever value was drawn. As can be seen in panel H, the density from which we are sampling falls below the slice height (the top of the hatched sampling window) in the region between 0.829 and 0.876. Now the whole process can begin anew using 0.747 as the new current value.
Slice sampling has the advantage that over the initial sampling events, the window width can be “tuned” so that it is appropriate for the target density. This is automatically handled within the software that implements “slice” sampling for MCMC analyses. The method is fairly easy to implement for univariate distributions, can be applied to multivariate distributions, and can adapt to some extent to dependencies between variables.
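The stepping-out and shrinkage procedure can be sketched for the same posterior density. This is a simplified illustration of univariate slice sampling and not the code behind Figure A4: it omits window-width tuning and the limit on the number of steps that Neal (2003) allows, and the function names, seed, and chain length are our own assumptions.

```python
import random


def density(theta):
    """Unnormalized posterior (nine males, four females, Jeffreys' prior)."""
    if not 0.0 < theta < 1.0:
        return 0.0  # zero outside the unit interval, which halts stepping-out
    return theta ** 8.5 * (1.0 - theta) ** 3.5


def slice_step(current, width, rng):
    # Slice height: uniform between 0 and the density at the current value.
    height = rng.uniform(0.0, density(current))
    # Place a window of the given width at random so that it covers current.
    left = current - rng.random() * width
    right = left + width
    # Stepping-out: widen each side until its density drops below the slice.
    while density(left) > height:
        left -= width
    while density(right) > height:
        right += width
    # Shrinkage: sample within the window, pulling a boundary in toward the
    # current value each time a draw falls below the slice height.
    while True:
        proposal = rng.uniform(left, right)
        if density(proposal) > height:
            return proposal
        if proposal < current:
            left = proposal
        else:
            right = proposal


rng = random.Random(4)
theta, chain = 0.6, []  # start at the current value used in the illustration
for _ in range(20_000):
    theta = slice_step(theta, 0.1, rng)
    chain.append(theta)

print("posterior mean:", round(sum(chain) / len(chain), 3))
```

Every call returns a new value, which is the sense in which slice sampling always takes a step. The cost is the extra density evaluations spent on stepping out and shrinking.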
Adaptive rejection sampling
Adaptive rejection sampling is another sampling method that is more efficient than the Metropolis sampler and that is useful in applications of Gibbs sampling (described in the next section). It can be used to sample from any density which is log concave (down), though the restriction to log concave densities can be removed by adding a Metropolis step (Gilks et al., 1995). Log concavity means that the logarithm of the density is a concave function, which implies that the density is unimodal. The method is extensively described by Gilks (1992), Gilks and Wild (1992), and Gilks (1996). Figure A5 shows graphically how adaptive rejection sampling works, using the same density as in Figure A4 but now calculated as a log density. We use the simpler derivative free version (Gilks, 1992) as opposed to the tangent version (Gilks and Wild, 1992) of adaptive rejection sampling in this illustration. This figure also makes extensive use of Wayne Zhang's “ARS” R code (available from http://actuaryzhang.com/seminar/seminar.html) which he modeled after Gilks' C code. As in the preceding illustration, separate panels, here A/B and C–F, are used to show sequential steps of the sampling process.
Panel A shows the same density we were dealing with in slice sampling, but the density is drawn on a log scale. We start with four initial arbitrary values of 0.1, 0.25, 0.5, and 0.9 for which we can evaluate the log densities, which are then plotted as four filled points. A “lower hull,” shown with a dashed line, can be formed by “connecting the dots” and placing vertical drop lines at the first and last points. An “upper hull,” shown with heavy solid lines, is then drawn by continuing the lower hull lines to either side of an intervening segment and finding the points of intersection. For example, the upper hull above the log densities at 0.5 and 0.9 is drawn by continuing the dashed line between the points at 0.25 and 0.5 and continuing the dashed vertical line from the point at 0.9 until they intersect. Similarly, the upper hull above the log density between the points at 0.25 and 0.5 is drawn by continuing the dashed lines between the points at 0.1 and 0.25 and between the points at 0.5 and 0.9. Panel A also shows a vertical line at a value of 0.687, which is a proposed value from the density. It is relatively easy to sample proposal values because the upper hull is piecewise exponential.
Panel B shows the same graph but in the original density scale; here the hulls are exponentials, instead of the line segments shown in the log scale Panel A. The upper hull above the density between 0.5 and 0.9 now extends beyond the cut-off we picked for scaling purposes. Had we included all of the upper hull, the density function would be compressed to a nearly flat line. The narrow rectangle at the proposed value of 0.687 will be used to test whether the proposed value can be accepted. This is potentially a two-step acceptance test whereby evaluation continues to the second step only if the proposed value fails the first step, a measure intended to cut down on the computations required. The short horizontal line crossing the rectangle represents a random uniform deviate between 0 and 1 that we have scaled so that it is appropriate for the graph. The very small filled portion of the rectangle is from a density value of 0.0 up to the lower hull, whereas the open part of the rectangle continues from the lower hull up to the upper hull.
If the random uniform deviate (the short horizontal line crossing the rectangle) falls within the filled portion of the rectangle, then the proposed value is accepted. This does not occur in our example in Panel B, and we must proceed to the second part of the acceptance test, which involves shifting the rectangle up to the lower hull so that the filled portion now represents the density region between the two hulls, as shown in Panel C. Once again our random uniform deviate does not fall within the filled portion of the rectangle, and the proposed value cannot be accepted. We then use the log density already evaluated for the rejected proposal to “adapt” the upper and lower hulls by inserting it as a new point at the value of 0.687 and redrawing the hulls, as shown in Panel D. Panel D represents the “adaptive” step of the sampler in that the upper and lower hulls now more narrowly bracket the log density function (compare panels A and D). We then start the process of sampling another proposed value over again. Panel E shows a new proposed value of 0.883 that also was not accepted, and as a consequence the hulls once again “adapt.” Finally, panel F shows a value of 0.778 that was accepted.
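The hull construction and the two-step acceptance test can be checked for the single proposed value of 0.687. The sketch below is deliberately partial: it assumes the same log density as in the earlier illustrations, uses the four starting points named in the text, and only evaluates the hulls at one proposal. It does not sample proposals from the piecewise exponential upper hull or update the hulls after a rejection, and the function names are our own.

```python
import math


def log_density(theta):
    """Log unnormalized posterior for nine males, four females, Jeffreys' prior."""
    return 8.5 * math.log(theta) + 3.5 * math.log(1.0 - theta)


def chord(x, x0, x1):
    """Value at x of the straight line through the log density at x0 and x1."""
    h0, h1 = log_density(x0), log_density(x1)
    return h0 + (h1 - h0) * (x - x0) / (x1 - x0)


proposal = 0.687
# Lower hull: the chord joining the two evaluated points that bracket the proposal.
lower = chord(proposal, 0.5, 0.9)
# Upper hull over (0.5, 0.9): the chord between 0.25 and 0.5 extended to the right.
upper = chord(proposal, 0.25, 0.5)


def acceptance_test(u):
    """Two-step test on the log scale for a uniform deviate u in (0, 1)."""
    threshold = math.log(u) + upper
    if threshold <= lower:
        return "accepted by squeeze test"  # no density evaluation needed
    if threshold <= log_density(proposal):
        return "accepted after evaluating the density"
    return "rejected; add the proposal to the evaluated points"


print(round(lower, 3), round(log_density(proposal), 3), round(upper, 3))
for u in (0.001, 0.05, 0.5):
    print(u, acceptance_test(u))
```

For a log concave density the chord between two evaluated points lies below the log density and the extended neighboring chord lies above it, which is what allows the first step to accept a value without evaluating the density at all.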
Gibbs sampling is the process of sampling sequentially from full conditional densities so that it is possible to evaluate the posterior density for any quantity of interest. It is an MCMC method appropriate to multivariate contexts and multiple-parameter problems and, as mentioned in the sections above, it incorporates additional types of sampling (such as slice sampling or sampling from a known distribution) within its iterations. Thus far in our illustrations using the Mays and Faerman (2001) data, we have been trying to estimate a single parameter, $\theta$, the proportion of males. The Gibbs sampler is not appropriate for estimating $\theta$ here, because this is a univariate problem, but we can use Gibbs sampling to find the posterior predictive distribution (which we also arrived at analytically). Casella and George (1992) give as one of their examples of Gibbs sampling the evaluation of the predictive (posterior) density from a binomial model. As they point out, Gibbs sampling is not at all necessary here because the predictive density can be found directly from the beta binomial distribution. That said, it is instructive to look at the Gibbs sampler in this context so that it can be readily compared to the known answer. The full conditional distributions for x (the number of males out of n individuals) and $\theta$ (the proportion of males) are:

$$x \mid \theta \sim \mathrm{Binomial}(n, \theta)$$

$$\theta \mid x \sim \mathrm{Beta}(x + \alpha,\; n - x + \beta)$$
where the “~” symbol (referred to variously as a tilde or a “twiddle”) means “is distributed as.” For our example we use $\alpha$ and $\beta$ values of 9.5 and 4.5, respectively, which correspond to the observed data of nine males and four females combined with a Jeffreys prior of Beta(0.5, 0.5), and then simulate 50,000 values starting at $\theta = 0.5$. The x values were simulated using the rbinom function in R, while the $\theta$ values were simulated using the rbeta function, also in R. Figure A6 shows the predictive distribution for the number of males out of 13 individuals, where the vertical lines give the proportion of simulated x values that fall into each count of males (from 0 to 13) and the unfilled points are from the beta binomial distribution. The figure shows that the simulated results from the Gibbs sampler match the analytical results from the beta binomial distribution.
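The comparison described above can be reproduced approximately with a short script. The sketch below is not the original R code: it uses Python's standard library in place of rbinom and rbeta, and it assumes the standard full conditionals from the Casella and George (1992) binomial example, namely x given the proportion of males is Binomial(13, proportion), and the proportion given x is Beta(x + 9.5, 13 - x + 4.5). The seed, the absence of burn-in, and the function names are assumptions made for illustration.

```python
import math
import random

N, ALPHA, BETA = 13, 9.5, 4.5   # 13 individuals; Beta parameters from the text
N_ITER = 50_000                 # number of Gibbs iterations, as in the text

def gibbs_counts(n_iter, theta=0.5, seed=1):
    """Alternate draws from the two full conditionals; tally the x values."""
    rng = random.Random(seed)   # assumed seed, for reproducibility only
    counts = [0] * (N + 1)
    for _ in range(n_iter):
        # x | theta ~ Binomial(N, theta), drawn as a sum of N Bernoulli trials
        x = sum(rng.random() < theta for _ in range(N))
        # theta | x ~ Beta(x + ALPHA, N - x + BETA)
        theta = rng.betavariate(x + ALPHA, N - x + BETA)
        counts[x] += 1
    return counts

def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def beta_binomial_pmf(x):
    """Analytical predictive probability of x males out of N."""
    return math.comb(N, x) * math.exp(
        log_beta(x + ALPHA, N - x + BETA) - log_beta(ALPHA, BETA))

simulated = [c / N_ITER for c in gibbs_counts(N_ITER)]
exact = [beta_binomial_pmf(x) for x in range(N + 1)]

for x in range(N + 1):
    print(f"{x:2d}  simulated={simulated[x]:.4f}  beta-binomial={exact[x]:.4f}")
```

The two printed columns play the roles of the vertical lines and the unfilled points in Figure A6. Small discrepancies between them are expected, since the simulated column is a Monte Carlo estimate.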