Inferences from DNA data: population histories, evolutionary processes and forensic match probabilities


Address for correspondence: Ian J. Wilson, Department of Mathematical Sciences, Meston Building, University of Aberdeen, Aberdeen, AB24 3UE, UK.


Summary. We develop a flexible class of Metropolis–Hastings algorithms for drawing inferences about population histories and mutation rates from deoxyribonucleic acid (DNA) sequence data. Match probabilities for use in forensic identification are also obtained, which is particularly useful for mitochondrial DNA profiles. Our data augmentation approach, in which the ancestral DNA data are inferred at each node of the genealogical tree, simplifies likelihood calculations and permits a wide class of mutation models to be employed, so that many different types of DNA sequence data can be analysed within our framework. Moreover, simpler likelihood calculations imply greater freedom for generating tree proposals, so that algorithms with good mixing properties can be implemented. We incorporate the effects of demography by means of simple mechanisms for changes in population size and structure, and we estimate the corresponding demographic parameters, but we do not here allow for the effects of either recombination or selection. We illustrate our methods by application to four human DNA data sets, consisting of DNA sequences, short tandem repeat loci, single-nucleotide polymorphism sites and insertion sites. Two of the data sets are drawn from the male-specific Y-chromosome, one from maternally inherited mitochondrial DNA and one from the β-globin locus on chromosome 11.

1. Introduction

Underlying a sample of deoxyribonucleic acid (DNA) sequence data is a complex pattern of dependences that reflects the ancestral relationships between the sequences. In the absence of recombination, these relationships can be represented by a genealogical tree for which each tip, or leaf, corresponds to a sequence at the present time. Moving towards the root of the tree corresponds to going backwards in time, and branches merge, or ‘coalesce’, when the corresponding DNA sequences last had a common ancestor. The root of the tree represents the most recent common ancestor (MRCA) of all the sequences in the sample.

Although the underlying genealogical tree is crucial to modelling the dependence structure of a DNA sample, it is effectively ignored by traditional methods of analysing DNA sequence data which, for example, are based on averaging pairwise statistics over all pairs in the sample. In recent years, however, important advances have been made towards the goal of fully likelihood-based statistical inference from population genetics data (Griffiths and Tavaré, 1994; Kuhner et al., 1995, 1998; Beerli and Felsenstein, 1999, 2001; Beaumont, 1999; Anderson et al., 2000; Bahlo and Griffiths, 2000; Stephens and Donnelly, 2000; Chikhi et al., 2001; Donnelly et al., 2001; Nielsen and Wakeley, 2001; Markovtsova et al., 2000a, b). The key developments underpinning these advances involve

  • (a) an increasingly flexible class of stochastic models for genealogical trees, based on the coalescent model of Kingman (1982), and
  • (b) computational techniques such as Markov chain Monte Carlo (MCMC) methods and methods based on importance sampling.

However, implementing these models and algorithms remains extremely challenging because of the complexity of the processes underlying the data, which include historical patterns of migration, mating behaviour and population growth, as well as mutation and selection. Moreover, if recombination cannot be ignored then there may be different genealogical trees for different segments of the sequences (Griffiths and Marjoram, 1997; Kuhner et al., 2000; Nielsen, 2000; Fearnhead and Donnelly, 2002).

Wilson and Balding (1998) developed a Metropolis–Hastings algorithm for completely linked ‘microsatellite’, or short tandem repeat (STR), loci and reanalysed the human Y-chromosome STR data set of Cooper et al. (1996), described later in Section 2.1. They found evidence for a relatively small effective population size for human males (point estimates around 3000) and for a short time since the most recent common ancestor (TMRCA), with point estimates around 30 000–40 000 years. However, the modelling assumptions employed by Wilson and Balding (1998) were limited to the standard coalescent (Section 3.1.1) and the stepwise mutation model (Section 3.2.1), and the sensitivity of the final inferences to these modelling assumptions was not investigated.

Here, we extend the analyses of Wilson and Balding (1998) by permitting changes in population size, or structure or both. Inevitably, fully realistic models for the historical patterns of human mating and migration remain outside our grasp. However, we can implement models that generalize those of Wilson and Balding (1998), and hence explore sensitivity to their modelling assumptions, and which capture at least some, and possibly most, of the major underlying demographic effects.

Further, we extend the mutation model of Wilson and Balding (1998) to permit the analysis of a wide range of DNA sequence data for which recombination and selection can be neglected. For the parts of the human genome which are subject to recombination, its effects can often be ignored for sequences of up to a few thousand base pairs (bp). However, recombination rates appear to be highly variable so some much larger chromosome segments may be unaffected by recombination, and some short segments may be heavily affected (Goldstein, 2001).

Inferences about population histories and evolutionary processes are not only of intrinsic interest but are also crucial to the interpretation of genetic data in a wide range of applications, from conservation genetics (Beaumont, 2001) to mapping disease genes (Clayton, 2000). We illustrate this point by elaborating an application of our inferential framework to the assessment of DNA profile evidence. Specifically, we extend our MCMC algorithm by introducing an additional, no-data node, to obtain match probabilities for forensic identification by using mitochondrial DNA miniprofiles.

The present paper is addressed in part to statisticians who may be unfamiliar with some of the genetics terminology. To assist such readers we have included a very brief glossary of terms in Table 1, and each term included is highlighted in italics at first use. Table 1 also includes a list of genetics abbreviations. For further background reading, Hartl and Clark (1997) provide a popular introduction to population genetics, and more advanced material may be found in Balding et al. (2001).

Table 1.  Terminology and abbreviations used in the text†

Term | Definition
Allele | Possible state of the DNA sequence (or feature derived from it) at a locus
Base pair (bp) | Unit of DNA sequence length, equal to the number of nucleotides
Chromosome | Can here be regarded as a long DNA sequence
Genome | Total genetic inheritance of an organism; the human genome consists of 23 chromosome pairs (one maternal and one paternal) plus mitochondrial DNA
Haplotype | Alleles at two or more loci on the same chromosome (cf. genotype: unordered allele pairs {maternal, paternal} at one or more loci)
Locus (plural loci) | Specified site or short region on a chromosome
Mitochondrial DNA | Circular chromosome inherited maternally
MRCA | Most recent common ancestor
Mutation | Process that changes the allele at a locus
Nucleotide | DNA sequence unit which takes one of four types, denoted A, C, G and T
Polymorphism | Locus at which more than one allele arises in a given population (cf. monomorphism: an invariant locus)
Recombination | Exchange of DNA between maternal and paternal chromosomes (not mitochondrial DNA or Y) during the formation of sperm and egg cells
Selection | Process whereby advantageous alleles tend to become more common and disadvantageous alleles less common (cf. neutral: not subject to selection)
SMM | Stepwise mutation model
SNP | Single-nucleotide polymorphism
STR | Short tandem repeat
TMRCA | Time since most recent common ancestor
Transition | Mutation involving a substitution either of a purine nucleotide (A or G) by the other or of a pyrimidine (C or T) by the other
Transversion | Substitution of a purine nucleotide by a pyrimidine, or vice versa
UEP | Unique event polymorphism
YAP | Y-chromosome Alu polymorphism
Y-chromosome | Sex-determining chromosome, borne only by males

†The definitions are not fully general but are intended as a guide.

2. Data

2.1. Y-chromosome haplotypes

STR alleles consist of a short DNA sequence motif, e.g. GATA, multiply repeated (Goldstein and Schlötterer, 1999). Cooper et al. (1996) reported the numbers of STRs at each of five loci on the non-recombining part of the human Y-chromosome, for 174 apparently unrelated men from East Anglia (UK). A further 23 men from northern Nigeria and 15 from Sardinia were also typed. Although the three groups sampled cannot be regarded as representative of all human populations, they include both African and non-African populations, which is important in view of the prominence of the theory of a recent African origin of modern humans (Relethford, 1998). A further advantage of this data set is that the STR unit is the same at each locus, which makes more plausible an assumption of a common mutation mechanism for all loci.

In addition to the STR data, Cooper et al. (1996) reported for each Y-chromosome the presence or absence of the so-called Alu insertion sequence at a particular locus, known as the Y-chromosome Alu polymorphism (YAP) locus described in Hammer (1994). In formulating likelihoods for the YAP data, we face an ascertainment problem which is common to many DNA sequence data sets. Cooper et al. (1996) chose to type this locus because it was known to be polymorphic in many human populations (and many other potential loci that were not known to be polymorphic were not typed). Inferences which can be drawn under these circumstances will differ in general from inferences which would be justified if the locus had been chosen to be typed ‘at random’.

The human Y-chromosome data recorded by Ruiz Linares et al. (1996) consist of five di-nucleotide STR loci (including one monomorphic locus), a tetranucleotide STR (one of those included in the study of Cooper et al. (1996)), the YAP locus and a single-nucleotide polymorphism (SNP) site. Data were obtained from 13 worldwide populations, although the sample sizes are often small (see Table 5 later). Ruiz Linares et al. (1996) noted substantial geographic clustering of the observed haplotypes and inferred that the MRCA of human Y-chromosomes cannot have been very recent, but they did not give an estimate of this time.

Table 5.  Sample sizes for the R96 human Y-chromosome data set (Ruiz Linares et al., 1996)†

Population or region | Code | Sample size (a) | Sample size (b) | Mean population proportion (%), splitting only | Mean population proportion (%), splitting and growth
CAR‡ pygmy | PC | 12 | 12 | 8.2 | 8.2
Zaire pygmy | PZ | 9 | 10 | 8.5 | 8.8
Africa total | AFR | 25 | 26 | 24.7 | 25.1
Americas total | AME | 28 | 28 | 22.2 | 21.8
East Asia total | ASI | 35 | 39 | 22.9 | 22.6
New Guinea | NG | 6 | 6 | 7.6 | 7.6
Oceania total | OCE | 12 | 12 | 22.4 | 22.7
Grand total | | 115 | 121 | 100.0 | 100.0

†Numbers of chromosomes (a) used in our analyses, having no missing data at the two UEP sites, and (b) in the original data set. Also shown are the posterior mean effective population sizes, as percentages of the total, under the two models which allow population splitting (the prior means are 7.7% for each population).
‡CAR, Central African Republic.

2.2. β-globin sequences

Harding et al. (1997) analysed a subset of the data of Fullerton et al. (1994), consisting of 61 DNA sequences from the Melanesian population of Vanuatu. The 3000 bp region sequenced encompasses the β-globin gene. One end of this region was later identified as a recombination hot spot and 330 bp at that end were ignored to permit analyses based on an assumption of no recombination.

Harding et al. (1997) adopted the ‘infinite sites’ mutation model, which implies that at any one DNA site there has been no more than one mutation since the MRCA of the sample. The full data set was not consistent with the assumption, but Harding et al. (1997) discarded four sequences which seemed to have been affected by recombination, resulting in a 57-sequence data set that was consistent with infinite sites. In contrast with the geographic clustering of Y-chromosomes reported by Ruiz Linares et al. (1996), Harding et al. (1997) reported substantial haplotype diversity in the Vanuatu sample, with all known haplogroups (groups of similar haplotypes) represented in this sample from a single isolated location. Similarly, the estimate of Harding et al. (1997) of the effective size of the ancestral Vanuatu population is approximately the same as many estimates of the effective size of the entire human population, suggesting that the current geographic isolation is unimportant in explaining the observed genetic diversity. They obtained a point estimate of 895 000 years for the TMRCA of the Vanuatu β-globin sequences, which is longer than corresponding estimates based on worldwide Y-chromosome data, even when the fourfold larger population size of β-globin sequences is taken into account. (Each child receives two β-globin sequences from its parents, but on average only half a Y-chromosome. Under simple models, the expected TMRCA is proportional to the population size and so would be fourfold higher for β-globin sequences than for Y-chromosomes).

The infinite sites assumption adopted by Harding et al. (1997) is crucial to some methods of analysing DNA sequence data, since it implies that all mutations which have occurred are directly visible in the data; none have been ‘overwritten’ by a subsequent mutation. The assumption is not valid for many data sets. It is neither required nor advantageous under the framework that is developed here, but it can readily be implemented and we do so later (Section 5.3) to permit a comparison with the results of Harding et al. (1997).

2.3. Mitochondrial minisequences

The maternally inherited mitochondrial DNA has been widely used to infer aspects of human female population histories (Cann et al., 1987; Sykes, 1999). Because it exists in multiple copies throughout each cell, compared with just one copy of nuclear DNA from each parent, mitochondrial DNA is easier to type from small and/or degraded samples. In recent years Neanderthal and ancient Australian mitochondrial DNA have been successfully typed (Krings et al., 1997; Ovchinnikov et al., 2000; Adcock et al., 2001).

Mitochondrial DNA typing is also useful in forensic identification (Tully et al., 1996, 2001; Bataille et al., 1999), mostly for samples of shed hair, which are often recovered at crime scenes and which contain little or no nuclear DNA. In addition, mitochondrial DNA has been successfully typed from DNA-poor sources such as saliva on stamps (Allen et al., 1998) and tooth root dentine (Pfeiffer et al., 1998). However, mitochondrial DNA suffers from a serious drawback in forensic settings: because of the absence of recombination, the mitochondrial DNA sequences of two apparently unrelated individuals can be identical, or very similar, owing to shared inheritance from a common maternal ancestor, possibly many generations in the past. This makes it difficult to assess the evidential weight of matching mitochondrial DNA profiles in a way which is fair but makes efficient use of the data.

The UK Forensic Science Service (FSS) employs mitochondrial DNA ‘minisequences’ consisting of 12 loci in the mitochondrial DNA control region that display a high level of genetic variation, while being relatively quick to type. The 12 loci comprise 10 SNP sites, a dinucleotide STR locus and a locus made up of multiple copies of the C-base (a poly-C region) (Tully et al., 1996). The poly-C locus was excluded from our analyses, both because there is little available information from which to formulate an appropriate mutation model and because it is not always utilized in forensic case-work owing to its high rate of within-individual variation. We analyse an FSS database of 297 minisequences, obtained from apparently unrelated UK residents, 152 with primarily European ancestry, 103 of Afro-Caribbean origin and 42 with Asian ancestry.

The ascertainment bias problem discussed in Section 2.1 arises again for mitochondrial DNA minisequences: the 12 loci were selected in part because preliminary studies indicated high variability at these loci.

3. Models

3.1. Demography

3.1.1. Standard coalescent

The coalescent is a stochastic model for the genealogical tree representing the ancestral relationships between a sample of $n$ DNA sequences. The sequences are regarded as labelled, to avoid combinatorial complications, but they are not yet observed. The model has two attractive features: it is mathematically tractable, and it approximates the distribution of genealogical trees under an important class of neutral population genetics models, including the Wright–Fisher model of a random-mating population of constant size $N$, and the general exchangeable models of Cannings (1974). To recover these approximations, 1 unit of ‘coalescent’ time must be interpreted as $N/\sigma^2$ generations, where $\sigma^2$ denotes the variance in the number of ‘offspring’ of a sequence in the next generation. We assign $\sigma^2 = 1$ here but note that the mating behaviour of men differs from that of women and $\sigma^2$ for Y-chromosome sequences may be larger than the value that is appropriate for mitochondrial DNA sequences, and may possibly be much larger than 1.

Note that we use N for the number of chromosomes, and not the number of individuals: for the β-globin data, N sequences correspond to N/2 individuals; for Y-chromosome and mitochondrial DNA data, N sequences correspond to 2N individuals, N males and N females.

Coalescent time runs backwards, with time $t_0 \equiv 0$ denoting the present, corresponding to the leaves of the tree, whereas $t_j$, $j \in \{1,2,\ldots,n-1\}$, denotes the time of the $j$th most recent coalescence event. In particular, $t_{n-1}$ denotes the time of the root, or MRCA. Under the standard coalescent model, the between-coalescence intervals $t_j - t_{j-1}$ have independent exponential distributions:

$$P(t_j > t \mid t_{j-1}) = \exp\left\{ -\binom{n-j+1}{2} (t - t_{j-1}) \right\} \tag{1}$$

for $t > t_{j-1}$. At each coalescence event, all pairs of extant lineages are equally likely to be the pair that coalesces.
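The sampling scheme just described can be sketched in a few lines of Python. This is only an illustration under the standard coalescent; the function names are ours and are not part of any published software. With $n-j+1$ lineages present, the waiting time to the $j$th coalescence is exponential with rate equal to the number of pairs of lineages, and the coalescing pair is drawn uniformly.

```python
import random
from math import comb


def sample_coalescence_times(n, rng):
    """Return [t_1, ..., t_{n-1}] in coalescent time units for n sequences."""
    times = []
    t = 0.0
    for j in range(1, n):
        lineages = n - j + 1              # lineages present before the j-th coalescence
        rate = comb(lineages, 2)          # one unit of rate per pair of lineages
        t += rng.expovariate(rate)        # exponential interval t_j - t_{j-1}
        times.append(t)
    return times


def sample_coalescing_pair(lineages, rng):
    """Choose the pair that coalesces; all pairs are equally likely."""
    return tuple(rng.sample(lineages, 2))


rng = random.Random(1)                    # fixed seed so the sketch is reproducible
times = sample_coalescence_times(5, rng)  # the last entry is the time of the MRCA
pair = sample_coalescing_pair(["a", "b", "c", "d", "e"], rng)
```

The last element of `times` is the simulated time of the root, in units of $N/\sigma^2$ generations.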

Mutations in the standard coalescent model occur along the branches of the tree at the points of a homogeneous Poisson process with rate $\theta/2$. Because of the coalescent time rescaling, this corresponds to a mutation rate of $\mu \equiv \theta/2N$ per locus per generation in a population of $N$ sequences.

Two notable features of the standard coalescent model are

  • (a) the long period of time (on average, more than half TMRCA) in which the tree has just two lineages and
  • (b) the high variance in total tree height (the standard deviation (SD) is typically about 60% of the mean).
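Both features follow from the independent exponential intervals, and can be checked with a short calculation. The sketch below (function names ours) sums the means and variances of the intervals to obtain the mean and SD of the tree height, and compares the expected time spent with two lineages with the expected TMRCA. Note that the second quantity is a ratio of expectations, used here as a simple summary.

```python
from math import comb, sqrt


def tmrca_mean_sd(n):
    """Mean and SD of the tree height under the standard coalescent."""
    # The interval with k lineages is exponential with rate C(k, 2).
    mean = sum(1.0 / comb(k, 2) for k in range(2, n + 1))
    var = sum(1.0 / comb(k, 2) ** 2 for k in range(2, n + 1))
    return mean, sqrt(var)


def two_lineage_share(n):
    """Expected time with two lineages divided by the expected TMRCA."""
    mean, _ = tmrca_mean_sd(n)
    return 1.0 / mean                     # the two-lineage interval has mean 1


mean, sd = tmrca_mean_sd(10)
share = two_lineage_share(10)
cv = sd / mean                            # roughly 0.6 for a sample of this size
```

The mean height is $2(1 - 1/n)$, so the share of time with two lineages exceeds one half for every sample size.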

See Nordborg (2001) for a more detailed introduction to coalescent models.

3.1.2. Coalescent with population growth

The standard coalescent model is the special case $\lambda(s) \equiv 1$ of the model in which equation (1) is replaced with

$$P(t_j > t \mid t_{j-1}) = \exp\left[ -\binom{n-j+1}{2} \{\Lambda(t) - \Lambda(t_{j-1})\} \right] \tag{2}$$

where $\Lambda(t)$ is an increasing differentiable function with $\Lambda(t) = \int_0^t \lambda(s)^{-1}\,ds$. The model thus defined approximates the genealogy of a sample drawn from a random-mating population of size $N\lambda(t)$ at time $Nt$ generations ago (Hudson, 1991; Donnelly and Tavaré, 1995). Intuitively, an increment of coalescent time corresponds to more generations when the population size is large than when it is small.

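One simple model for a change in population size is the ‘$k$-size coalescent’, which in the case $k=2$ is specified by

$$\lambda(t) = \begin{cases} \alpha & \text{if } t < t_g, \\ 1 & \text{if } t \geq t_g, \end{cases} \tag{3}$$

corresponding to a population of constant size $N$ until time $t_g$, after which it instantly attains its present size, $N\alpha$.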

Pure exponential growth at rate $r$ per generation corresponds to

$$\lambda(t) = c \exp(-Rt),$$

where $R \equiv Nr$ and $c$ is an arbitrary constant. When $R>0$, there are fewer recent coalescences than under the standard coalescent model (with the same expected total branch length, which can be achieved by an appropriate choice of $c$) and hence more mutations represented only once in the data. When $R<0$ each coalescence event has positive probability that it does not occur in finite time, leading to clusters of sequences separated by an infinity of mutations.
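The behaviour for negative growth rates can be made concrete with a short calculation. The sketch below assumes that the cumulative intensity is the integral of the reciprocal relative population size, $\Lambda(t) = \int_0^t ds/\lambda(s)$, with $\lambda(t) = c\exp(-Rt)$; under that assumption $\Lambda$ stays bounded as $t \to \infty$ when $R<0$, so a set of lineages has positive probability of never coalescing. The function names are ours.

```python
from math import comb, exp


def Lambda_exponential(t, R, c=1.0):
    """Cumulative coalescent intensity, assuming lambda(t) = c * exp(-R * t)."""
    if R == 0.0:
        return t / c
    return (exp(R * t) - 1.0) / (c * R)


def prob_never_coalesce(k, R, c=1.0, s=0.0):
    """Probability that k lineages present at time s never coalesce.

    This is zero unless R < 0, in which case Lambda(t) tends to -1 / (c * R).
    """
    if R >= 0.0:
        return 0.0
    remaining = -1.0 / (c * R) - Lambda_exponential(s, R, c)
    return exp(-comb(k, 2) * remaining)


p_two = prob_never_coalesce(2, R=-1.0)    # equals exp(-1) when c = 1
```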

Pure exponential growth is unlikely to provide a good model for global human population size, since recent high growth rates would imply a vanishingly small population size a few thousand years ago. Marjoram and Donnelly (1994) considered a two-parameter model for which

$$\lambda(t) = \begin{cases} \exp\{R(t_g - t)\} & \text{if } t < t_g, \\ 1 & \text{if } t \geq t_g, \end{cases}$$

corresponding to a population of constant size $N$ until $Nt_g$ generations ago, after which it grew at rate $r$ per generation to reach its current size $Nc$, where

$$c = \exp(Rt_g).$$

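We adopt this model below, and for convenience we refer to it as the ‘coalescent with growth’, even though other formulations of population growth are possible. Under this model,

$$\Lambda(t) = \begin{cases} \exp(-Rt_g)\{\exp(Rt) - 1\}/R & \text{if } t \leq t_g, \\ \{1 - \exp(-Rt_g)\}/R + t - t_g & \text{if } t > t_g, \end{cases}$$

and the coalescence time distributions (2) become

$$P(t_j > t \mid t_{j-1}) = \begin{cases} \exp\left[ -\binom{n-j+1}{2} \exp(-Rt_g)\{\exp(Rt) - \exp(Rt_{j-1})\}/R \right] & \text{if } t_{j-1} < t \leq t_g, \\ \exp\left[ -\binom{n-j+1}{2} \left( [1 - \exp\{-R(t_g - t_{j-1})\}]/R + t - t_g \right) \right] & \text{if } t_{j-1} < t_g < t, \\ \exp\left\{ -\binom{n-j+1}{2} (t - t_{j-1}) \right\} & \text{if } t_g \leq t_{j-1} < t. \end{cases}$$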

The coalescent with growth model reduces to the standard coalescent model both when $t_g = 0$ and in the limit as $R \to 0$. In the examples below we rely on background information to justify an a priori assumption of $R \geq 0$ but note that $R<0$ in our model does not lead to infinite coalescence times, as is the case for pure exponential growth.
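Coalescence times under the coalescent with growth can be simulated by a time change of the standard coalescent. The sketch below is an illustration only, under two assumptions that we state explicitly: the relative population size is $\exp\{R(t_g - t)\}$ for $t < t_g$ and 1 thereafter, and the cumulative intensity is $\Lambda(t) = \int_0^t ds/\lambda(s)$. Since the increments $\Lambda(t_j) - \Lambda(t_{j-1})$ are then exponential with rate equal to the number of pairs of lineages, each time is obtained by inverting $\Lambda$. The function names are ours.

```python
import random
from math import comb, exp, expm1, log1p


def Lambda_growth(t, R, tg):
    """Cumulative intensity: growth at scaled rate R for t < tg, constant size before."""
    if R == 0.0 or tg == 0.0:
        return t                                   # standard coalescent
    if t <= tg:
        return exp(-R * tg) * expm1(R * t) / R
    return -expm1(-R * tg) / R + (t - tg)          # Lambda(tg) plus a linear part


def Lambda_growth_inverse(y, R, tg):
    """Inverse of Lambda_growth in its first argument."""
    if R == 0.0 or tg == 0.0:
        return y
    y_g = Lambda_growth(tg, R, tg)
    if y <= y_g:
        return log1p(R * y * exp(R * tg)) / R
    return tg + (y - y_g)


def sample_times_with_growth(n, R, tg, rng):
    """Coalescence times t_1 < ... < t_{n-1} under the growth model."""
    times, t = [], 0.0
    for j in range(1, n):
        k = n - j + 1                              # lineages currently present
        y = Lambda_growth(t, R, tg) + rng.expovariate(comb(k, 2))
        t = Lambda_growth_inverse(y, R, tg)
        times.append(t)
    return times


times = sample_times_with_growth(10, R=5.0, tg=1.0, rng=random.Random(1))
```

Because $\Lambda$ is linear beyond $t_g$, the inverse is finite for every argument, which is the reason why $R<0$ does not produce infinite coalescence times in this model.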

3.1.3. Coalescent with population splitting

Human populations are often subdivided, in some cases by cultural barriers, but most obviously by geographical barriers or distance. The two Y-chromosome samples that were described in Section 2.1 are both divided into subsamples obtained from geographically distinct human subpopulations such that individuals are more likely to mate within their subpopulation than outside it. Modelling population subdivision may be crucial to the interpretation of the two data sets, but this is not necessarily the case: a relatively low level of migration can suffice largely to eliminate the effects of subdivision (Hartl and Clark, 1997), and there is evidence for large scale migrations throughout human history and prehistory (Cavalli-Sforza and Cavalli-Sforza, 1995).

One popular approach to modelling subdivision is based on the island model of Wright (1931). This is an equilibrium model in which each pair of subpopulations exchanges migrants at points of a homogeneous Poisson process of given rate; see Wilkinson-Herbots (1998) for further details and Bahlo and Griffiths (2000) and Beerli and Felsenstein (2001) for coalescent-based inference under this model. However, the equilibrium assumption underlying the island model is questionable for human populations. In addition, the number of migration parameters grows with the square of the number of populations, becoming unmanageably large for more than a handful of populations, and an assumption of a common migration rate is usually highly unrealistic.

A simple non-equilibrium model which allows changes in population structure is that employed by Weir and Cockerham (1984), which posits a random-mating ancestral population of size $N$ until $Nt_a$ generations ago, when it split into isolated subpopulations of equal sizes. We adopt this model with two extensions, to allow

  • (a) the $i$th subpopulation to have size $N\alpha_i$, with $\sum_i \alpha_i = 1$ and
  • (b) population bifurcations occurring at different times.

The process of subpopulation splits creates a population ‘supertree’, with one leaf for each subpopulation, which we model separately from the underlying genealogical tree. Fig. 1 illustrates a realization of a genealogy in four subpopulations under this ‘splitting’ model.

Figure 1.

Genealogy under the ‘splitting’ model of population subdivision: the single ancestral subpopulation split $t_3$ ($\equiv t_a$) coalescent time units ago into two subpopulations, each of which subsequently split to result in four current subpopulations, whose sizes form proportions $\alpha_i$, $i=1,\ldots,4$, of the total population size $N$; the subpopulation sample sizes are 3, 3, 6 and 4, corresponding to the number of leaves (terminal nodes) (the broken arrow in the left-hand margin indicates the direction of true time; coalescent time runs in the reverse direction); the figure has been contrived to avoid lineages crossing each other, but the subpopulations are not spatially ordered under our model and such crossings would normally be needed

For the coalescent approximation to the splitting model, the genealogical tree underlying the sample from each subpopulation is given by the k-size coalescent, introduced above at expression (3). This ‘coalescent with splitting’ rules out certain coalescence events (between sequences in different subpopulations) which would be permitted under the standard coalescent model. However, those coalescences which are permitted occur at a higher rate because of the smaller subpopulation sizes. Overall, the rate of coalescences can be either greater or less than under the standard coalescent model. However, if the subpopulation sizes are approximately equal, and the subsample sizes are also approximately equal, then the overall rate of coalescences is reduced by population splitting, and the expected TMRCA is increased.
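The competing effects described above can be seen by comparing total coalescence rates at the present time. In the sketch below (function names ours), a subpopulation of relative size $\alpha_i$ containing $n_i$ sampled lineages contributes a rate equal to its number of pairs of lineages divided by $\alpha_i$, whereas the unstructured population has one unit of rate per pair among all $n$ lineages. The sample sizes 3, 3, 6 and 4 are those of Fig. 1; the proportions are hypothetical.

```python
from math import comb


def standard_rate(sample_sizes):
    """Total coalescence rate at the present time if the population were unstructured."""
    return comb(sum(sample_sizes), 2)


def split_rate(sample_sizes, alphas):
    """Total coalescence rate at the present time under splitting.

    Only pairs within the same subpopulation can coalesce, each at rate 1 / alpha_i.
    """
    return sum(comb(n, 2) / a for n, a in zip(sample_sizes, alphas))


sizes = (3, 3, 6, 4)
equal = split_rate(sizes, (0.25, 0.25, 0.25, 0.25))   # below the standard rate
unequal = split_rate(sizes, (0.1, 0.1, 0.1, 0.7))     # above the standard rate
baseline = standard_rate(sizes)
```

With equal proportions the rate falls from 120 to 108, whereas placing the largest subsample in a small subpopulation raises it well above 120.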

The coalescent-with-splitting model of population structure remains unrealistic for many human populations, because it disallows the exchange of migrants between subpopulations after they split. Nielsen and Wakeley (2001) formulated a model which incorporates migration after a split, but their implementation is limited to two subpopulations, without population growth. Our simpler model captures much of the effect of population subdivision with relatively few additional parameters.

3.1.4. Coalescent with splitting and growth

In the analyses below, we implement a model which incorporates both the coalescent with growth (Section 3.1.2) and the coalescent with splitting (Section 3.1.3). A genealogy under this model is illustrated in Fig. 2. We restrict attention to the case of a common growth rate in all subpopulations and at all times after the start-of-growth time $t_g$. The extension to growth rates which change after each split is straightforward in principle, but the data are only weakly informative about growth rates and so estimation may be poor.

Figure 2.

Genealogy under the ‘splitting-with-growth’ model: the first subpopulation split occurs at time $t_a$ (the horizontal axis represents population size; after time $t_g$ the total population size grows exponentially; other details are the same as for Fig. 1)

3.2. Mutation

3.2.1. Short tandem repeat loci

Since different STR loci are usually widely separated, it seems reasonable to assume independence of the mutation processes at distinct loci. The most widely adopted mutation model for an STR locus is the stepwise mutation model (SMM) in which the mutant allele differs from its parent by one repeat unit. Steps in each direction are equally likely, irrespective of the current allele length. Although there is evidence (Brinkmann et al., 1998; Kayser et al., 2000) for the SMM being close to the actual mutation process at STR loci, there is also evidence of deviations from the model, which we now briefly discuss.

Cooper et al. (1999) analysed Y-chromosome STR data by using coalescent models and reported evidence for a mutation bias, with mutations leading to increases in allele length more often than decreases. However, the signal in the data for such a bias lies in the skewness of the allele frequency distribution, and this may have other causes. The direct observations of Kayser et al. (2000) do show a non-significant excess of increases (10) over decreases (4), but Brinkmann et al. (1998) reported a slight excess in the other direction.

There is direct evidence for the strict one-step model being false: Kayser et al. (2000) observed one mutation (out of 14) which altered the allele length by two repeat units and Brinkmann et al. (1998) reported one two-step mutation and 22 one-step mutations. Pritchard et al. (1999) fitted a model to Y-chromosome STRs in which the change in the number of repeat units forms a geometric(p) random variable. They estimated p to be close to 1, in which case their model is difficult to distinguish from the SMM with a slightly higher mutation rate.

Both Kayser et al. (2000) and Brinkmann et al. (1998) reported an apparent correlation between allele length and mutation rate. In the former study the correlation was weak, and in the latter case an important contribution to the correlation seems to have arisen from an absence of mutations observed at loci with mean allele length under 10 repeats. No such loci are included in the two Y-chromosome data sets that we analyse below.

Extensions to the SMM to allow for mutation rate increasing with length, mutation bias or multistep mutation events can readily be implemented within our framework. For the reasons indicated above, and to keep the presentation as simple as reasonably possible, we choose not to implement any such extensions here and retain the SMM adopted by Wilson and Balding (1998).

The SMM has no equilibrium distribution, and so there is no natural prior distribution for the STR repeat number at the root of the genealogical tree. A prior which is uniform on the positive integers, although improper, leads to a proper posterior distribution and is adopted below. Although the SMM does not constrain allele lengths to be positive, we find that, conditional on the current allele sizes, the probability that an ancestral allele is assigned a non-positive length is negligible in practice.
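For a single branch, the SMM gives a simple distribution for the net change in repeat number. The sketch below is an illustration with names of our own choosing: it assumes mutations at rate $\theta/2$ along a branch of length $t$ coalescent units, with steps of $+1$ and $-1$ equally likely, so that the numbers of upward and downward steps are independent Poisson variables, each with mean $\theta t/4$. The series is truncated and is intended for small net changes only.

```python
from math import exp, factorial


def smm_step_probability(d, theta, t, terms=60):
    """P(net change of d repeat units along a branch of length t) under the SMM."""
    m = 0.5 * theta * t                   # expected number of mutations on the branch
    half = 0.5 * m                        # expected number of steps in each direction
    a = abs(d)
    # Sum over k, the number of steps taken in the direction opposite to d.
    total = sum(half ** (2 * k + a) / (factorial(k) * factorial(k + a))
                for k in range(terms))
    return exp(-m) * total


probs = {d: smm_step_probability(d, theta=2.0, t=1.5) for d in range(-30, 31)}
variance = sum(d * d * p for d, p in probs.items())   # close to theta * t / 2
```

The distribution is symmetric about zero, with variance equal to the expected number of mutations on the branch.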

3.2.2. Single-nucleotide polymorphism sites: recurrent mutation

We assume that insertions, deletions and translocations of nucleotides are sufficiently rare that they can be ignored for the timescales that are relevant here. Mutations thus consist only of the substitution of one nucleotide by another. The mitochondrial DNA minisequence sites are separated by many nucleotides, and so it is natural to assume that the substitutions at distinct sites are mutually independent. An SNP mutation model can then be specified by a continuous time Markov chain on the states A, C, G and T.

The simplest such model is the Jukes–Cantor model (Jukes and Cantor, 1969), in which all possible substitutions are equally likely, so that the chain has a uniform stationary distribution: $\pi_A = \pi_C = \pi_G = \pi_T = 1/4$. Felsenstein (1981) introduced the F81 model, with an arbitrary stationary distribution: substitutions occur at rate $\beta$ and the mutant nucleotide is A, C, G or T with probabilities $\pi_A$, $\pi_C$, $\pi_G$ and $\pi_T$. Note that a nucleotide may be substituted for one of the same type, so that the effective mutation rate is less than $\beta$.

We adopt here the F84 model, an extension of the F81 model that is similar to the model of Hasegawa et al. (1985). It has been implemented since 1984 in the program DNAML in the PHYLIP suite of programs (described in Felsenstein and Churchill (1996)). Under the F84 model, a parameter $\alpha$ specifies the nominal rate of additional mutations restricted to be transitions. If the current nucleotide is either of the purines (A or G), then the mutant nucleotide is A or G with probabilities $\pi_A/(\pi_A+\pi_G)$ and $\pi_G/(\pi_A+\pi_G)$, and similarly for the pyrimidines. The stationary distribution is unaffected by the additional transitions, and the overall effective mutation rate is

$$\beta \left( 1 - \sum_{i \in \{A,C,G,T\}} \pi_i^2 \right) + 2\alpha \left( \frac{\pi_A \pi_G}{\pi_A + \pi_G} + \frac{\pi_C \pi_T}{\pi_C + \pi_T} \right)$$

per nucleotide per generation.
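The F84 description translates directly into a rate matrix for the Markov chain on A, C, G and T, from which the stationarity claim and the effective mutation rate can be checked numerically. The sketch below is ours, not the DNAML implementation: off-diagonal rates combine a general substitution at nominal rate $\beta$ with an additional within-group substitution at nominal rate $\alpha$, and substitutions of a nucleotide by itself are excluded. The base frequencies in the usage lines are hypothetical.

```python
NUCS = "ACGT"
PURINES = {"A", "G"}


def f84_rate_matrix(pi, alpha, beta):
    """Rate matrix Q (dict of dicts) for the F84 model with stationary distribution pi."""
    Q = {}
    for i in NUCS:
        # Total frequency of the group (purines or pyrimidines) containing i.
        group = sum(pi[x] for x in NUCS if (x in PURINES) == (i in PURINES))
        row = {}
        for j in NUCS:
            if i == j:
                continue
            rate = beta * pi[j]                    # general substitution
            if (i in PURINES) == (j in PURINES):
                rate += alpha * pi[j] / group      # additional transition
            row[j] = rate
        row[i] = -sum(row.values())                # rows sum to zero
        Q[i] = row
    return Q


def effective_rate(pi, Q):
    """Rate of substitutions that actually change the nucleotide, at stationarity."""
    return sum(pi[i] * -Q[i][i] for i in NUCS)


def stationary_residual(pi, Q):
    """Largest absolute entry of pi * Q; zero when pi is stationary."""
    return max(abs(sum(pi[i] * Q[i][j] for i in NUCS)) for j in NUCS)


pi = {"A": 0.3, "C": 0.2, "G": 0.25, "T": 0.25}
Q = f84_rate_matrix(pi, alpha=2.0, beta=1.0)
rate = effective_rate(pi, Q)
```

Setting $\alpha = 0$ recovers the F81 model, and uniform base frequencies with $\alpha = 0$ recover the Jukes–Cantor model, for which the effective rate is $3\beta/4$.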

3.2.3. Unique event polymorphism loci

If θ (≡2Nμ)≪1 at a biallelic locus, it may be reasonable to assume that there was only one mutation event underlying the observed polymorphism. Although this unique event polymorphism (UEP) assumption is not required in our framework, if valid it can greatly reduce the size of the tree space that must be explored, allowing both computational efficiency and more precise inferences. These advantages are enhanced if it is known which of the two alleles is ancestral, in which case any two haplotypes with the non-ancestral state must be more closely related than two haplotypes with differing states.

The YAP (Section 2.1) was assumed by Cooper et al. (1996) to be a UEP. The Alu element is widely interspersed in the human genome and seems capable of being inserted at any location (Sherry et al., 1997), so two inserts at precisely the same location seem very unlikely.

For SNPs, many researchers assume the ‘infinite sites’ mutation model (Section 2.2) under which all SNPs are also UEPs. Since per site θ for humans is on average of the order of 10⁻³, this assumption seems reasonable provided that mutation is reasonably homogeneous over sites. In contrast with the YAP, for which the UEP assumption implies that the MRCA sequence must lack the Alu insert, the ancestral SNP state is usually unknown, though it is sometimes assumed to be the most common observed state.

4. Methods

4.1. Coalescent models as Bayesian priors

Coalescent models have revolutionized population genetics in the past decade by moving the emphasis away from prospective modelling of populations, which allows at best only crude comparisons with observed data, and towards a retrospective description of the genealogy of a sample. Although coalescent-based methods permit the specification of a likelihood function, many of the new inferential methods are not fully likelihood based, primarily because of the computational complexity (see Fu and Li (1999) for a review). Likelihood-based methods have been formulated in recent years, either via MCMC or via importance sampling approaches (Stephens, 2001), thereby bringing to population genetics problems the benefits of statistical efficiency and quantitative model comparison.

Because of the complex models and substantial background information, the Bayesian paradigm provides an appropriate framework for statistical inference in the present setting. The demographic models that were introduced in Section 3.1 specify prior distributions for the genealogical tree underlying a sample of DNA sequences, and inference about aspects of this tree proceeds via its posterior distribution given the observed data (and the priors for the mutation rate and other parameters).

The development of MCMC algorithms to approximate the required posterior distribution is challenging for several reasons, including the complexity of the likelihood computations. Wilson and Balding (1998) developed a Metropolis–Hastings algorithm, based on an augmented data approach in which the allelic states at all internal nodes of the tree (i.e. at each coalescence) were regarded as auxiliary variables. The likelihood calculations were thereby greatly simplified, at the cost of a larger parameter space. Thanks to the simpler likelihood calculations, a wide class of proposal distributions for exploring the space of genealogical trees was available, allowing an algorithm with good mixing properties to be developed. We retain these features in the algorithms that are presented here, together with additional features to incorporate the demographic and mutation models introduced earlier.

4.2. Algorithm

4.2.1. The BATWING software

A Metropolis–Hastings algorithm to investigate the models and data sets described above has been implemented in a C program, for which we have adopted the acronym BATWING (Bayesian analysis of trees with internal node generation). UNIX, Windows and Macintosh versions are freely available at

Also available is the BATWING User Guide which explains how to use the program.

4.2.2. Proposal distributions

BATWING's proposal distribution for updating the tree topology is similar to that of MICSAT, described in Wilson and Balding (1998) and also available at the above Web site. Briefly, candidate trees are obtained by selecting an internal node at random on the current tree and choosing a new location at random, but locations near nodes of similar allelic type are more likely to be chosen than locations near dissimilar nodes. For demographic models involving splitting, and for mutation models involving UEPs, the obvious modifications are implemented: the initial state and any proposed state are disallowed if they contravene the model. Thus, for example, if the descendants of the selected node are from two different subpopulations, then the node can be relocated only before the time at which those two subpopulations split.

In addition to updates which change the topology of the tree, BATWING also proposes updates to coalescence times (keeping internal haplotypes constant) and updates to internal haplotypes (keeping coalescence times constant). For a UEP site with unknown ancestral state, proposals are made to change the root UEP haplotype at random among those consistent with the tree.

The method for updating the population supertree topology and splitting times is based on a representation of trees given in Mau et al. (1999). A planar representation of the supertree is made by randomly allocating ‘left’ and ‘right’ subpopulations at each splitting event. The tree can then be written as a list of populations, with a unique splitting time associated with each pair of neighbouring populations. Updates of these splitting times are made by choosing a split at random, and then choosing a new time for this split, uniformly between 0 and the most recent coalescence between two sequences, one from each of the populations. The updated planar representation then implies a new supertree, which may differ in topology as well as splitting times from the current supertree.

Updates to the proportions into which the ancestral population splits are made independently of the supertree updates. Two populations i and j are chosen at random, and αi′, the new proportion in population i, is drawn uniformly in (0,αi+αj). Finally, αj′=αi+αj−αi′.
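A minimal Python sketch of the proportion update just described is given below. The function name `propose_proportions` and the example proportions are hypothetical and are not taken from the BATWING source. Because the new value is uniform on the same interval whatever the current value, the move is symmetric and contributes no proposal ratio to the acceptance probability.

```python
import random


def propose_proportions(alpha, rng):
    """Redistribute the combined proportion of two randomly chosen populations.

    Illustrative sketch only: alpha is a list of subpopulation proportions
    summing to 1, and rng is a random.Random instance.
    """
    i, j = rng.sample(range(len(alpha)), 2)
    total = alpha[i] + alpha[j]
    new = list(alpha)
    new[i] = rng.uniform(0.0, total)  # new proportion for population i
    new[j] = total - new[i]           # population j takes the remainder
    return new


rng = random.Random(1)
current = [0.5, 0.3, 0.2]
proposal = propose_proportions(current, rng)
print(proposal)
```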

The demographic parameters (N, Nc, r) and μ are each updated on a logarithmic scale by independent uniform perturbations centred at the current value.
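As an illustration of a uniform perturbation on the logarithmic scale, the Python sketch below proposes a new value for a positive parameter and returns the corresponding proposal ratio, which equals new/old when the target density is written on the natural scale. The toy target is the gamma(18,8170) density used later as the prior for μ; the half-width 0.5, the chain length and all function names are assumptions made for the example and are not BATWING settings.

```python
import math
import random


def propose_log_uniform(value, half_width, rng):
    """Perturb a positive parameter uniformly on the log scale.

    Returns the proposal and the ratio q(old | new) / q(new | old),
    which is new / old for densities on the natural scale.
    """
    new = value * math.exp(rng.uniform(-half_width, half_width))
    return new, new / value


def log_target(mu, shape=18.0, rate=8170.0):
    """Unnormalized log-density of a gamma(shape, rate) distribution."""
    return (shape - 1.0) * math.log(mu) - rate * mu


def run_chain(n_steps, start, half_width, seed):
    """Metropolis-Hastings chain for the toy target, one parameter only."""
    rng = random.Random(seed)
    mu, draws = start, []
    for _ in range(n_steps):
        new, ratio = propose_log_uniform(mu, half_width, rng)
        log_accept = log_target(new) - log_target(mu) + math.log(ratio)
        if rng.random() < math.exp(min(0.0, log_accept)):
            mu = new
        draws.append(mu)
    return draws


draws = run_chain(20000, start=0.002, half_width=0.5, seed=42)
kept = draws[2000:]                  # discard an initial burn-in
estimate = sum(kept) / len(kept)
print(estimate, 18.0 / 8170.0)       # sample mean against the exact mean
```

Omitting the factor new/old would make the chain target a density proportional to the gamma density divided by μ, so the sample mean would be biased downwards.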

4.2.3. Convergence and mixing

Our principal diagnostic tool has been to compare the results from repeat runs of BATWING with widely spaced initial trees, while all other inputs remain unchanged. For this, BATWING includes a ‘badness’ parameter bs which can be set in the interval (0,1), with 0 corresponding to the tree obtained by applying a parsimony heuristic to the data, which is expected to produce a plausible, or ‘good’, tree, whereas 1 corresponds to a random tree under the prior model, which is almost always ‘bad’ for any given data set. With the badness parameter between 0 and 1, successive coalescences in the starting tree are drawn with probability proportional to 1/(10bs+d), where d is the minimum number of mutations required under our modelling assumptions to obtain one haplotype from the other. The allelic types at the new node are drawn from a discretized Gaussian distribution with SD equal to d+10bs/4.

In addition, we routinely inspected plots of the log-likelihood and parameter values over each run of BATWING, as well as autocorrelation plots, for signs of non-convergence. Fig. 3 shows trace and autocorrelation function plots for two parameters and the log-posterior density from one of the analyses in the simulation study (Section 5.1), thinned by retaining every fourth output. The plots suggest approximate stationarity, the trace plots showing no obvious trend or jump, and each autocorrelation function decreasing to 0. Mixing seems good for μ but slow for L, which is unsurprising because L is a function of all the branch lengths, and it depends on the population splitting times and the growth parameters. In most studies L is not a parameter of direct interest.

Figure 3.

Diagnostic plots for the BATWING output from one run of the simulation study (a burn-in of 1000 outputs was discarded, and the subsequent outputs were thinned by retaining every fourth output): (a) trace plots and (b) autocorrelations up to lag 50, of the mutation rate μ, the total tree length L and the (unnormalized) log-posterior density

An ‘effective’ size me of a sample of m observations of a stationary time series can be defined as

$$m_e = \frac{m}{1 + 2\sum_{i=1}^{\infty}\rho_i} \qquad (6)$$
where ρi is the lag i autocorrelation (Liu, 2002). For the data of Fig. 3, m=5000 and truncating at lag 50 we estimate inline image for μ and inline image for L, which is the smallest inline image among the 13 parameters monitored. The only other parameter with inline image is the start-of-growth time tg, which is strongly correlated with the growth rate r and the population size.
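The effective sample size can be estimated from MCMC output as in the following Python sketch, which uses the standard definition m/(1+2Σρi) with the sum truncated at a chosen lag, as in the text. The function name, the AR(1) test series and its coefficient 0.8 are assumptions for illustration; for such a series the theoretical value is m(1−0.8)/(1+0.8)=m/9.

```python
import random


def effective_sample_size(x, max_lag=50):
    """Estimate m / (1 + 2 * sum of autocorrelations up to max_lag)."""
    m = len(x)
    mean = sum(x) / m
    centred = [v - mean for v in x]
    var = sum(c * c for c in centred)
    rho_sum = 0.0
    for lag in range(1, max_lag + 1):
        cov = sum(centred[t] * centred[t + lag] for t in range(m - lag))
        rho_sum += cov / var          # lag-i sample autocorrelation
    return m / (1.0 + 2.0 * rho_sum)


# Illustrative AR(1) series standing in for a slowly mixing MCMC trace.
rng = random.Random(7)
series, value = [], 0.0
for _ in range(20000):
    value = 0.8 * value + rng.gauss(0.0, 1.0)
    series.append(value)

ess = effective_sample_size(series, max_lag=50)
print(ess, len(series) / 9.0)
```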

For the data analyses described below, some parameters such as L, r and tg did sometimes mix slowly, particularly for C96 (Cooper et al., 1996), our largest data set with more than 200 haplotypes. However, over the several years that the present paper has been in gestation, the authors have not detected any problem with convergence or mixing that was not easily diagnosed with the simple tools illustrated above and addressed by lengthening the run. In particular, we have not encountered evidence of multimodality in our models. Although our models are complex and our parameter spaces are large, we may benefit in one sense from the poor information content of the data for the parameters of interest: posterior distributions tend to be diffuse and lacking the ‘peaks’ and ‘troughs’ that can lead to pseudoequilibria.

4.2.4. Validation

Irreducibility of the Markov chain implied by the proposal distributions described above is easy to establish when there are no UEP sites with unknown root state. Successive ‘cut-and-join’ moves can move from any tree to any other. Movement between any two population supertrees can similarly be shown by first changing the underlying genealogical tree, and then successively rearranging splits to obtain the required supertree. With UEPs the cut-and-join moves allow communication between any states that have the same root haplotype. Proposals to change the root UEP haplotype then ensure movement between any two states that are consistent with the data.

In the context of the complex models and data described here, a rigorous verification that the algorithm described above has been correctly implemented in the BATWING software, and usually converges, is not achievable. Some theoretical support is given by Aldous (2000) who showed that, in a no-data setting, a branch swapping algorithm similar to that implemented in BATWING gives convergence in a total variation sense to the appropriate uniform distribution on tree space. The number of steps required is of the order of between n² and n³ moves (Aldous conjectured that n² is correct), which is slow compared with the n log(n) moves that are required for a pack of n cards, randomizing one card at a time.

The authors are also encouraged that results from MICSAT published in Wilson and Balding (1998) have since been verified by using an independent algorithm (Stephens and Donnelly, 2000). Further support comes from a simulation study described later in Section 5.1.

4.3. Mitochondrial DNA match probabilities

Suppose that a mitochondrial DNA sequence is recovered from a sample from the crime scene and found to match the mitochondrial DNA sequence obtained from s, a suspect. This observation supports the hypothesis that s is the source of the crime sample. To assess the strength of this support, we wish to evaluate the probability that x would also match the mitochondrial DNA sequence from the crime scene, where x is an alternative possible culprit whose mitochondrial DNA sequence is unknown. For a discussion of the role of match probabilities in the assessment of forensic identification evidence, see Balding and Donnelly (1995).

If we knew the mitochondrial DNA sequence of any of the maternal ancestors of x, as well as the number of generations separating that ancestor from x, then the probability distribution for the mitochondrial DNA sequence of x could readily be calculated under a given mutation model. Of course, this information is not usually available. However, in forensic case-work a reference sample is usually available consisting of the mitochondrial DNA sequences of n apparently unrelated individuals drawn from the same racial group as x. The n+1 observed mitochondrial DNA sequences, those of s and the reference sample, provide information about ancestral mitochondrial DNA sequences, and hence about the unobserved mitochondrial DNA sequence of x.

Here, we exploit this information via a modification of the algorithm described in Section 4.2. In addition to the genealogical tree underlying the n+1 observed sequences, we introduce a branch connecting the unobserved DNA sequence of x with the tree, writing z for the new node thus introduced (Fig. 4). The additional branch, and the mitochondrial DNA state at z, are updated in the same way as for the other branches, except that, since no data are available for the mitochondrial DNA sequence of x, there is no contribution to the likelihood from the branch connecting z with x.

Figure 4.

Representation of the maternal genealogy of n=6 individuals in an anonymous reference database, together with the suspect s and a further individual x regarded as an alternative suspect of unknown mitochondrial DNA type: the node labelled z corresponds to the most recent woman ancestral to both x and at least one of the other n+1 individuals

At each iteration of the modified algorithm, the probability that the mitochondrial DNA sequence of x matches that of s, conditional on the location and state of z, is readily calculated. The average of these conditional probabilities approximates the match probability, given the observations and the standard coalescent model.
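The averaging step can be sketched in Python as follows. For brevity the sketch uses the simpler F81 model with independent sites and a fixed rate β, rather than the F84 model adopted for the mitochondrial data, and the list `draws` of (state at z, branch length from z to x) pairs is a hypothetical stand-in for MCMC output. All names and numbers are assumptions for illustration.

```python
import math


def f81_transition(a, b, t, beta, pi):
    """P(state b after time t | state a) under F81 with event rate beta."""
    stay = math.exp(-beta * t)        # probability of no mutation event
    return (stay if a == b else 0.0) + (1.0 - stay) * pi[b]


def conditional_match(z_seq, s_seq, t, beta, pi):
    """Probability that x, separated from z by time t, matches s at every site."""
    prob = 1.0
    for a, b in zip(z_seq, s_seq):
        prob *= f81_transition(a, b, t, beta, pi)
    return prob


def match_probability(draws, s_seq, beta, pi):
    """Average the conditional match probabilities over the sampled states."""
    total = sum(conditional_match(z, s_seq, t, beta, pi) for z, t in draws)
    return total / len(draws)


pi = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
beta = 1e-4                           # illustrative rate per site per generation
s_seq = "ACGT"                        # suspect's (toy) sequence
draws = [("ACGT", 200.0), ("ACGA", 1500.0), ("ACGT", 40.0)]
estimate = match_probability(draws, s_seq, beta, pi)
print(estimate)
```

In a full analysis the mutation parameters would also vary from one iteration to the next, so they would be stored with each draw rather than held fixed as here.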

5. Analyses

5.1. Simulation studies

A large simulation study was undertaken to check the accuracy of the posterior approximations obtained using BATWING. 100 data sets were simulated from the most complex of our models, the coalescent with splitting and growth (Section 3.1.4). We used three populations, with sample sizes 23, 22 and 15, and the data types were the same as for the C96 data set (five STR loci and one UEP). The SMM (Section 3.2) was employed to generate the STR data and the location of the UEP mutation was chosen at random in the tree. The parameters underlying each simulation (subpopulation sizes, splitting times, start-of-growth time, mutation and growth rates) were obtained via independent draws from the prior distributions that were used to analyse the C96 data set. The model assumptions and prior distributions supplied to BATWING were the same as those used to generate the data.

The average over simulations of the estimated effective sample size, defined at equation (6), is shown in the first row of Table 2.

Table 2.  Results for six parameters from a simulation study consisting of 100 data sets generated from the coalescent with splitting and growth model†

| p (%) | Results for the following parameters: | | | | | |
|---|---|---|---|---|---|---|
| Average estimated effective sample size | | | | | | |
| Hit rate (%) | | | | | | |
| 10 | 6 (3.0) | 12 (3.0) | 3 (3.0) | 8 (3.0) | 12 (3.0) | 9 (3.0) |
| 30 | 24 (4.6) | 25 (4.6) | 25 (4.6) | 30 (4.6) | 30 (4.6) | 30 (4.6) |
| 50 | 46 (5.0) | 49 (5.0) | 41 (5.0) | 42 (5.0) | 39 (5.0) | 47 (5.0) |
| 70 | 64 (4.6) | 70 (4.6) | 66 (4.6) | 66 (4.6) | 55 (4.6) | 63 (4.6) |
| 90 | 86 (3.0) | 87 (3.0) | 88 (3.0) | 89 (3.0) | 74 (3.0) | 82 (3.0) |
| Coverage rate (%) | | | | | | |
| 10 | 10 (0.3) | 9 (0.2) | 9 (0.3) | 8 (0.5) | 9 (0.6) | 10 (0.6) |
| 30 | 31 (0.8) | 28 (0.6) | 28 (0.8) | 24 (1.4) | 27 (1.7) | 31 (1.7) |
| 50 | 51 (1.2) | 47 (1.0) | 47 (1.2) | 41 (2.0) | 46 (2.5) | 52 (2.4) |
| 70 | 71 (1.2) | 66 (1.2) | 66 (1.4) | 60 (2.3) | 65 (2.8) | 72 (2.4) |
| 90 | 91 (0.7) | 86 (1.3) | 87 (1.0) | 83 (1.7) | 89 (1.8) | 92 (1.2) |

†See the text for details of the simulations. 20 000 BATWING outputs were generated for each data set, corresponding to 1.6 × 10⁸ accept–reject steps, after 8 × 10⁶ were discarded as burn-in. The effective sample size is defined at equation (6) and estimated with a cut-off at lag 200. The hit rate is the number of data sets for which the 100p% equal-tailed posterior interval includes the true parameter value for that simulation. The coverage rate is the mean over data sets of the proportion of outputs which lie in the 100p% equal-tailed prior interval. The exact (binomial) SD is given in parentheses for the hit rate; the SD for the coverage rate in parentheses is estimated from the 100 data points.

For a specified parameter, let Hx indicate whether the 100p% posterior interval, given data x, includes the correct value. If the BATWING program is valid and the runs are sufficiently long then the ‘hit rate’, the observed average of Hx over the data sets, forms a binomial(100,p) proportion. Hit rates (expressed as percentages) for six parameters are shown in Table 2. In most cases the hit rate understates the nominal coverage p, but the difference between the hit rate and p exceeds three SDs only for the growth rate r when p ≥ 70%.
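The binomial SDs quoted in Table 2 and the three-SD comparison can be reproduced with a few lines of Python. The hit rates used below are those of the fifth parameter column of Table 2, taken here to be the growth rate r because it is the only column whose shortfall exceeds three SDs at p ≥ 70%; the function names are illustrative.

```python
import math


def binomial_sd_percent(p, n_data_sets=100):
    """SD, in percentage points, of a binomial(n_data_sets, p) proportion."""
    return 100.0 * math.sqrt(p * (1.0 - p) / n_data_sets)


def shortfall_in_sds(nominal_percent, observed_percent, n_data_sets=100):
    """Nominal minus observed hit rate, in units of the binomial SD."""
    p = nominal_percent / 100.0
    sd = binomial_sd_percent(p, n_data_sets)
    return (nominal_percent - observed_percent) / sd


nominal = [10, 30, 50, 70, 90]
hit_rate_r = [12, 30, 39, 55, 74]     # fifth column of Table 2 (assumed to be r)

flagged = [p for p, h in zip(nominal, hit_rate_r)
           if shortfall_in_sds(p, h) > 3.0]
print([round(binomial_sd_percent(p / 100.0), 1) for p in nominal])
print(flagged)
```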

Although the hit rate is valuable for ease of interpretation, Rubin and Schenker (1986) pointed to more precise methods of assessing interval coverage. We apply their ‘average probability coverage’ method to 100p% equal-tailed prior intervals. Given a parameter, say φ, write Cx for the posterior coverage, given data x, of an interval (l,u) with prior probability p:

$$C_x = P(l < \varphi < u \mid x).$$

Averaging over data sets generated via random draws from the prior we have

$$E(C_x) = E\{P(l < \varphi < u \mid x)\} = P(l < \varphi < u) = p.$$

Since E(Cx)=E(Hx)=p, but Cx ∈ [0,1], it follows that var(Cx) ≤ var(Hx) = p(1−p). This is verified by the lower part of Table 2, which shows the coverage rate, the average of Cx over the simulated data sets, together with its sample SD. Agreement between the achieved and nominal coverage as measured by the coverage rate is reasonably good, but the discrepancy exceeds three SDs for μ and L. The largest discrepancy is just under 4.5 SDs.
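The logic of the two checks can be illustrated on a toy problem in which the posterior is available exactly. The Python sketch below assumes a standard normal prior for φ and a single observation x∼N(φ,1), so that the posterior is N(x/2,1/2); it draws parameters from the prior, simulates data, and records both the indicator Hx and the posterior coverage Cx of the equal-tailed prior interval. This conjugate model and all names are assumptions for illustration and are unrelated to the coalescent models analysed here.

```python
import random
from statistics import NormalDist, fmean, pvariance


def calibration_check(p, n_data_sets, seed):
    """Return (mean H, mean C, var H, var C) over simulated data sets."""
    rng = random.Random(seed)
    prior = NormalDist(0.0, 1.0)
    tail = (1.0 - p) / 2.0
    lo, hi = prior.inv_cdf(tail), prior.inv_cdf(1.0 - tail)  # prior interval
    hits, coverages = [], []
    for _ in range(n_data_sets):
        phi = rng.gauss(0.0, 1.0)               # true value drawn from the prior
        x = rng.gauss(phi, 1.0)                 # simulated observation
        post = NormalDist(x / 2.0, 0.5 ** 0.5)  # exact posterior for this toy model
        in_interval = post.inv_cdf(tail) <= phi <= post.inv_cdf(1.0 - tail)
        hits.append(1.0 if in_interval else 0.0)        # H_x
        coverages.append(post.cdf(hi) - post.cdf(lo))   # C_x
    return fmean(hits), fmean(coverages), pvariance(hits), pvariance(coverages)


mean_h, mean_c, var_h, var_c = calibration_check(p=0.5, n_data_sets=4000, seed=3)
print(mean_h, mean_c, var_h, var_c)
```

Both averages should be close to p, and the variance of Cx is visibly smaller than that of Hx, which is why the coverage rate gives the more sensitive check.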

Overall, our conclusion is that, even after making an informal allowance for multiple comparisons, there are some indications of imperfect mixing or convergence or validity, but these seem unlikely to have a large effect on inferences. For most parameters of interest there is good agreement between the achieved and nominal coverage. The coverage is likely to be better for actual data analyses, because of the possibility of longer runs and more careful monitoring of convergence than is feasible in a simulation study. However, because of its large size (the study occupied 10 computer processors for about 1 week) our simulation study was limited to a relatively small total sample size of 60 chromosomes.

Each panel of Fig. 5 shows posterior density curves estimated from the BATWING algorithm output for a random selection of 20 of the 100 simulated data sets. Perhaps the most noticeable feature of these curves is that, even in this ideal setting in which the modelling assumptions hold exactly, only weak inferences are possible for some demographic parameters. The reasons will be discussed further in Section 6. Fig. 5 also shows (bold curves) an average posterior density obtained by choosing a random selection of 200 outputs from each of the 100 data sets; in the limit of many data sets, this curve should coincide with the prior curve (broken curves). The agreement is generally good, but a discrepancy is apparent for μ and r.

Figure 5.

Posterior density curves from the simulation study: in each panel, each of the 20 grey curves corresponds to a different simulated data set, chosen at random from the 100 data sets that were used in the simulation study; the broken curves indicate the prior density; the bold curves indicate the density obtained by combining 200 randomly chosen outputs for each data set (details of the BATWING runs are as for Table 2)

5.2. Y-chromosome data

5.2.1. Modelling assumptions

Mutation at short tandem repeat loci.

To formulate a prior for the STR mutation rate μ, Wilson and Balding (1998) drew on the Y-chromosome STR mutation study of Heyer et al. (1997), which recorded three mutations at tetranucleotide Y-STR loci from 1491 observed meioses. Adopting a gamma(1,1) (i.e. standard exponential) preprior, these data led Wilson and Balding (1998) to a gamma(4,1492) prior for μ, with mean 4/1492≈2.7×10⁻³. More data have since become available, and combining the results of Heyer et al. (1997) with those of Bianchi et al. (1998) and Kayser et al. (2000) we obtain a total of 17 mutations from 8169 meioses, suggesting a gamma(18,8170) prior for μ with mean ≈2.2×10⁻³ (key quantiles of this and other prior distributions are shown in Table 3).
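The arithmetic behind these priors is the usual conjugate update: treating the mutation count as Poisson with mean μ times the number of meioses (an approximation to the binomial that is accurate when mutations are rare), a gamma(a,b) preprior becomes gamma(a+mutations, b+meioses). The short Python sketch below reproduces the figures quoted above; the function names are illustrative.

```python
import math


def update_gamma(shape, rate, mutations, meioses):
    """Conjugate gamma update for a Poisson count with exposure `meioses`."""
    return shape + mutations, rate + meioses


def gamma_mean_sd(shape, rate):
    """Mean and SD of a gamma distribution in the shape-rate parametrization."""
    return shape / rate, math.sqrt(shape) / rate


preprior = (1, 1)                               # gamma(1,1), standard exponential
earlier = update_gamma(*preprior, 3, 1491)      # Heyer et al. (1997) data alone
pooled = update_gamma(*preprior, 17, 8169)      # pooled mutation studies

for label, (shape, rate) in [("earlier", earlier), ("pooled", pooled)]:
    mean, sd = gamma_mean_sd(shape, rate)
    print(label, (shape, rate), round(mean * 1e3, 1), round(sd * 1e3, 2))
```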

Table 3.  Results from BATWING analyses of the C96 human Y-chromosome data set (Cooper et al., 1996)†

| Parameter | Prior 5% | Prior 50% | Prior 95% | Posterior 5% | Posterior 50% | Posterior 95% |
|---|---|---|---|---|---|---|
| (a) Standard coalescent | | | | | | |
| Effective population size N | 2.0 × 10³ | 4.7 × 10³ | 9.2 × 10³ | 2.5 × 10³ | 4.0 × 10³ | 6.4 × 10³ |
| Mutation rate μ per 10³ generations | 1. | | | | | |
| θ (=2Nμ) | 7.8 | 20 | 44 | 13 | 16 | 20 |
| TMRCA (years) | 63 × 10³ | 200 × 10³ | 600 × 10³ | 21 × 10³ | 43 × 10³ | 120 × 10³ |
| (b) Coalescent with growth | | | | | | |
| Ancestral population size N | 820 | 2.7 × 10³ | 6.3 × 10³ | 170 | 770 | 2.0 × 10³ |
| Current effective population size Nc | 15 × 10³ | 280 × 10³ | 28 × 10⁶ | 12 × 10³ | 27 × 10³ | 84 × 10³ |
| Growth rate r (% per generation) | 0.089 | 0.42 | 1.2 | 0.59 | 1.2 | 2.3 |
| Mutation rate μ per 10³ generations | 1. | | | | | |
| Time since start growth, GNtg (years) | 7.2 × 10³ | 28 × 10³ | 150 × 10³ | 4.3 × 10³ | 7.4 × 10³ | 14 × 10³ |
| TMRCA (years) | 48 × 10³ | 150 × 10³ | 460 × 10³ | 14 × 10³ | 25 × 10³ | 59 × 10³ |
| (c) Coalescent with splitting | | | | | | |
| Effective population size N | 2.0 × 10³ | 4.7 × 10³ | 9.2 × 10³ | 2.8 × 10³ | 4.1 × 10³ | 7.1 × 10³ |
| Mutation rate μ per 10³ generations | 1. | | | | | |
| Time since most recent split, GNtb (years) | 5.2 × 10³ | 76 × 10³ | 400 × 10³ | 400 | 1.2 × 10³ | 3.1 × 10³ |
| Time since first split, GNta (years) | 33 × 10³ | 180 × 10³ | 670 × 10³ | 5.1 × 10³ | 10 × 10³ | 21 × 10³ |
| TMRCA (years) | 100 × 10³ | 330 × 10³ | 1.0 × 10⁶ | 24 × 10³ | 50 × 10³ | 140 × 10³ |
| (d) Coalescent with splitting and growth | | | | | | |
| Ancestral population size N | 820 | 2.7 × 10³ | 6.3 × 10³ | 260 | 740 | 1.9 × 10³ |
| Current effective population size Nc | 15 × 10³ | 280 × 10³ | 28 × 10⁶ | 19 × 10³ | 51 × 10³ | 210 × 10³ |
| Growth rate r (% per generation) | 0.089 | 0.42 | 1.2 | 0.58 | 1.2 | 2.1 |
| Mutation rate μ per 10³ generations | 1. | | | | | |
| Time since start growth, GNtg (years) | 7.2 × 10³ | 28 × 10³ | 150 × 10³ | 5.5 × 10³ | 9.2 × 10³ | 16 × 10³ |
| Time since most recent split, GNtb (years) | 2.5 × 10³ | 42 × 10³ | 260 × 10³ | 1.8 × 10³ | 4.1 × 10³ | 7.9 × 10³ |
| Time since first split, GNta (years) | 16 × 10³ | 100 × 10³ | 440 × 10³ | 6.0 × 10³ | 9.6 × 10³ | 17 × 10³ |
| TMRCA (years) | 60 × 10³ | 200 × 10³ | 660 × 10³ | 16 × 10³ | 29 × 10³ | 64 × 10³ |

†The prior distributions are as follows: μ∼gamma(18,8170); ta∼gamma(2,1); tb|ta∼uniform(0,ta); for models (a) and (c) N∼gamma(5,10⁻³); for models (b) and (d) N∼gamma(3,10⁻³); log(Nc/N)∼gamma(5,1); r∼gamma(2,400). The generation time G=25 years. All estimates are based on 20 000 BATWING outputs, corresponding to 1.6 × 10⁸ accept–reject steps, after 8 × 10⁶ were discarded as burn-in.

We have chosen to rely on the recent, directly observed data to formulate our prior distribution. Some researchers (e.g. Forster et al. (2000)) have questioned the use of contemporary pedigree data for historic mutation rates and point to indirect evidence for a lower mutation rate than is supported by our prior; see Sigurðardóttir et al. (2000) for a discussion of the issues. Pritchard et al. (1999) adopted a gamma(10,12500) prior, with mean ≈0.8×10⁻³. Although this prior seems inconsistent with our choice, there is some overlap of the two prior density curves and the discrepancy is less than it first appears because Pritchard et al. (1999) allowed mutation steps of size greater than 1.

The STR loci of the C96 data set are all tetranucleotide repeats. Those of Ruiz Linares et al. (1996) are mostly dinucleotides, but include one of the tetranucleotides which was studied by Cooper et al. (1996) and all three of the mutation studies cited above. Some researchers have found dinucleotide mutation rates to be lower than those of tetranucleotides (Weber and Wong, 1993) whereas others have found the reverse relationship. Kayser et al. (2000) found no significant difference, although they had few dinucleotide data. Here, we adopt the simplest plausible model and assume a common mutation rate for dinucleotide and tetranucleotide STRs.

Y-chromosome Alu polymorphism locus.

To allow at least partly for ascertainment bias, we condition on the existence of the YAP insert and assume a priori that it is equally likely to have occurred at any point on the genealogical tree. This assumption reflects the fact that the YAP locus was chosen to be typed because it was known in advance to be polymorphic, but it does not reflect the additional prior information that YAP is polymorphic in many human populations, and so cannot have arisen from a very recent mutation event. Although our prior could be modified to give more support to older times for the YAP mutation event, there is no obvious candidate for a specific assumption, and simulations indicate that more detailed modelling of the ascertainment scenario has no perceptible effect on posterior inferences (the results are not shown). This is because the YAP data are informative about tree topology, but their weak effect on branch lengths is overwhelmed by information from the STR data.

Population size and growth parameters.

Wilson and Balding (1998) investigated two prior distributions for N: lognormal(9,1) and gamma(5,10⁻³). They found that the resulting posteriors of interest were very similar. For the models without population growth, we retain the gamma(5,10⁻³) prior.

Estimates of the growth rates of human (census) population sizes over the past few thousand years can be obtained from the data reported by Cavalli-Sforza et al. (1994). From their data we estimate an average growth rate over the past few thousand years of about 1.2% per generation for the worldwide population, and 1.6% for Europe. However, effective population sizes can be very different from census sizes, because of, for example, geographical stratification at various levels, age structure and mating behaviour. Plausibly, lower growth rates are appropriate for effective population sizes, and we adopt a gamma(2,400) prior for r, centred at 0.5% but supporting values above 1%.

Under constant population size models, the effective male population size over recent human evolution is often estimated to be of the order of 5000. It thus seems likely that N, the effective ancestral population size (before growth), was somewhat smaller and we choose a gamma(3,10⁻³) prior. To specify a prior for the current effective population size, Nc, we assign the gamma(5,1) distribution to log(Nc/N). The prior for the time tg at which growth started is implied by the priors described above, together with the relationship

$$t_g = \frac{\log(N_c/N)}{rN}$$

and the assumption that log(Nc/N), N and r are mutually independent.

Population subdivision parameters.

We employ a gamma(2,1) prior for the time ta at which the earliest split occurs. Subsequent splits occur at times which are jointly uniform, given ta. The prior distribution of the subpopulation sizes, expressed as proportions of the total size, is Dirichlet(2,2,…,2). At each coalescence event in the population supertree, all possible coalescences are equally likely.

Generation time.

Wilson and Balding (1998) assumed a value of 20 years for G, the male generation time. This may be plausible for females but is likely to be too low for males (see for example Tremblay and Vézina (2000)). Following Thomson et al. (2000), we adopt here G=25 years. To facilitate a comparison with other studies, we do not model our uncertainty about the value of G. The data convey no information about G, so the posterior uncertainty is the same as the prior uncertainty.

Time since most recent ancestor.

The prior distribution for TMRCA cannot be independently assigned. It depends (weakly) on the sample sizes but more importantly is a function of the demographic model and prior distributions for population sizes, growth rates and splitting times. Since the splitting time parameters arise in some demographic models and not others, it seems impossible in practice to specify essentially the same prior for TMRCA over our various models. In particular, allowing population growth leads to shorter values for TMRCA, whereas population splitting tends to increase these values, and only unrealistic priors for other parameters could fully compensate for these effects. However, our choices imply priors for TMRCA which overlap substantially, so that prior medians differ by a factor of less than 2 over the four demographic models, and the interval from 120 000 years to 450 000 years lies within the equal-tailed 90% interval for all four prior distributions for both sets of sample sizes (Tables 3 and 4).
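Implied priors of this kind are easily examined by simulation. As an example, the Python sketch below draws from the priors for log(Nc/N) and r given above and computes the implied prior for the time since the start of growth in years, GNtg. It assumes that tg is measured in units of N generations, so that Ntg=log(Nc/N)/r, an assumption consistent with the ‘GNtg (years)’ labelling of Table 3; under it the result does not depend on N. The function names are illustrative.

```python
import random


def simulate_growth_start_years(n_draws, seed, generation_years=25.0):
    """Draw G*N*tg = G*log(Nc/N)/r under the stated priors (sorted output)."""
    rng = random.Random(seed)
    years = []
    for _ in range(n_draws):
        log_ratio = rng.gammavariate(5.0, 1.0)    # log(Nc/N) ~ gamma(5,1)
        # random.gammavariate takes a scale, so rate 400 becomes scale 1/400.
        r = rng.gammavariate(2.0, 1.0 / 400.0)    # r ~ gamma(2,400), per generation
        years.append(generation_years * log_ratio / r)
    years.sort()
    return years


def quantile(sorted_values, q):
    """Simple empirical quantile of an already sorted list."""
    return sorted_values[int(q * (len(sorted_values) - 1))]


draws = simulate_growth_start_years(20000, seed=11)
q05, q50, q95 = (quantile(draws, q) for q in (0.05, 0.50, 0.95))
print(round(q05), round(q50), round(q95))
```

The simulated quantiles can be compared with the prior quantiles reported for GNtg in Table 3 (7.2 × 10³, 28 × 10³ and 150 × 10³ years).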

Table 4.  Results from BATWING analyses of the R96 data set of Ruiz Linares et al. (1996)†

| Parameter | Prior 5% | Prior 50% | Prior 95% | Posterior 5% | Posterior 50% | Posterior 95% |
|---|---|---|---|---|---|---|
| (a) Standard coalescent | | | | | | |
| Effective population size N | 2.0 × 10³ | 4.7 × 10³ | 9.2 × 10³ | 2.4 × 10³ | 3.8 × 10³ | 6.3 × 10³ |
| Mutation rate μ per 10³ generations | 1. | | | | | |
| θ (=2Nμ) | 7.8 | 20 | 44 | 12 | 15 | 19 |
| TMRCA (years) | 64 × 10³ | 200 × 10³ | 600 × 10³ | 27 × 10³ | 50 × 10³ | 110 × 10³ |
| (b) Coalescent with growth | | | | | | |
| Ancestral population size N | 820 | 2.7 × 10³ | 6.3 × 10³ | 190 | 700 | 1.9 × 10³ |
| Current effective population size Nc | 15 × 10³ | 280 × 10³ | 28 × 10⁶ | 4.6 × 10³ | 8.2 × 10³ | 16 × 10³ |
| Growth rate r (% per generation) | 0.089 | 0.42 | 1.2 | 0.19 | 0.44 | 0.94 |
| Mutation rate μ per 10³ generations | 1. | | | | | |
| Time since start growth, GNtg (years) | 7.2 × 10³ | 28 × 10³ | 150 × 10³ | 6.8 × 10³ | 14 × 10³ | 28 × 10³ |
| TMRCA (years) | 48 × 10³ | 150 × 10³ | 450 × 10³ | 17 × 10³ | 31 × 10³ | 63 × 10³ |
| (c) Coalescent with splitting | | | | | | |
| Effective population size N | 2.0 × 10³ | 4.7 × 10³ | 9.2 × 10³ | 2.5 × 10³ | 4.2 × 10³ | 6.7 × 10³ |
| Mutation rate μ per 10³ generations | 1. | | | | | |
| Time since most recent split, GNtb (years) | 560 | 11 × 10³ | 86 × 10³ | 0 | 60 | 320 |
| Time since first split, GNta (years) | 33 × 10³ | 190 × 10³ | 670 × 10³ | 3.4 × 10³ | 5.8 × 10³ | 10 × 10³ |
| TMRCA (years) | 120 × 10³ | 360 × 10³ | 1.0 × 10⁶ | 27 × 10³ | 51 × 10³ | 110 × 10³ |
| (d) Coalescent with splitting and growth | | | | | | |
| Ancestral population size N | 820 | 2.7 × 10³ | 6.3 × 10³ | 140 | 540 | 1.7 × 10³ |
| Current effective population size Nc | 15 × 10³ | 280 × 10³ | 28 × 10⁶ | 7.9 × 10³ | 15 × 10³ | 32 × 10³ |
| Growth rate r (% per generation) | 0.089 | 0.42 | 1.2 | 0.029 | 0.60 | 1.2 |
| Mutation rate μ per 10³ generations | 1. | | | | | |
| Time since start growth, GNtg (years) | 7.2 × 10³ | 28 × 10³ | 150 × 10³ | 7.4 × 10³ | 14 × 10³ | 26 × 10³ |
| Time since most recent split, GNtb (years) | 290 | 6.1 × 10³ | 53 × 10³ | 10 | 170 | 810 |
| Time since first split, GNta (years) | 16 × 10³ | 100 × 10³ | 440 × 10³ | 5.0 × 10³ | 8.1 × 10³ | 13 × 10³ |
| TMRCA (years) | 62 × 10³ | 210 × 10³ | 670 × 10³ | 17 × 10³ | 29 × 10³ | 59 × 10³ |

†The notation, prior distributions and details of the BATWING runs are as for Table 3.

5.2.2. Results: C96 data set

Wilson and Balding (1998) analysed two subsets of the C96 STR data. Here we analyse the full data set of 212 Y-chromosomes, described in Section 2.1, under all four demographic models introduced in Section 3.1. Key quantiles of the marginal posterior distributions under each of the four models are given in Table 3. In discussing these results we shall regard the posterior median of each parameter of interest as a point estimate.

The striking feature of almost all human Y-chromosome data, including the present data set, is its lack of variability, and the consequences of this are evident in the results reported in Table 3. Under the standard coalescent model, the estimated value of N is about 4000, which may seem implausibly low but is consistent with the results of other studies (Jorde et al., 2001). The estimated TMRCA is 43 000 years, which is well below the 5% quantile of the prior distribution. This result is similar to that obtained by Wilson and Balding (1998), Pritchard et al. (1999) and Thomson et al. (2000), but until recently it would have been regarded as too low since, for example, humans arrived in Australia before that time. One goal of the present analyses is to investigate to what extent this unexpected result might be due to inappropriate modelling assumptions.

Weakening the assumptions of the standard coalescent model to allow for population growth leads to even less plausible results: the estimated TMRCA falls further to only 25 000 years. The growth rate estimate is large (1.2%), but the growth started only recently (estimate 7400 years). Pritchard et al. (1999) estimated a lower growth rate (0.8%) and a longer time since growth (18 000 years); our values lie within their 95% intervals. The current effective population size is estimated at only 27 000, which is several orders of magnitude less than the census population size, but in accord with the estimate of 28 000 obtained by Thomson et al. (2000) by using sequence data analysed via GENETREE (see Section 6).

However, extending the standard coalescent model to allow for geographical structuring via the splitting model (without growth) increases the estimates of both N and TMRCA by about 10%. The first population split occurred recently (median about 10 000 years ago), and this was almost certainly the Nigerian population splitting from the common ancestral population of East Anglians and Sardinians (99.9% of MCMC outputs; the prior assigns probability 1/3 to each possible split). The more recent split is estimated to have occurred approximately 1000 years ago. Care must be taken in interpreting these times since the splitting model does not allow migration after a split. Estimates would therefore reflect the end of a gradual splitting process and could be misleading if there had been substantial migration following a splitting event.

Adding growth to the splitting model, the most recent split moves further into the past, but the TMRCA estimate is reduced substantially (29 000 years). Since the initial split, the growth rate (which is assumed common to all populations) is very high (estimate 1.6% per generation), but the estimate of the current population size remains low (51 000) because of the small size of the ancestral population (estimate 740) and the short period of growth (estimate 9200 years). The probability that it was the Nigerian population which split first from the common ancestral population is now 99.1%, with the remaining 0.9% probability being roughly equally divided between the Sardinian and East Anglian populations. The posterior mean effective size of the Nigerian population is 37% of the total effective population size, with the East Anglian and Sardinian population proportions being 31% and 32% respectively.
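The way in which a small ancestral size and a short growth period limit the current size, even at a high growth rate, can be illustrated by a minimal sketch. It assumes exponential growth at a constant rate per generation and a generation time of 25 years; the function names and the input values are hypothetical and are not estimates from our analyses. Posterior medians of the separate parameters need not satisfy this relationship jointly, because the median of a non-linear function is not the function of the medians.

```python
import math


def current_size(ancestral_size, rate_per_generation, growth_years,
                 generation_years=25.0):
    """Current size after exponential growth at a constant per-generation rate."""
    generations = growth_years / generation_years
    return ancestral_size * math.exp(rate_per_generation * generations)


def growth_duration_years(ancestral_size, current, rate_per_generation,
                          generation_years=25.0):
    """Years of growth needed to go from the ancestral size to the current size."""
    generations = math.log(current / ancestral_size) / rate_per_generation
    return generations * generation_years


# Illustrative inputs only (hypothetical): 1% growth for 5000 years,
# i.e. 200 generations, starting from an ancestral size of 1000.
n_now = current_size(1000, 0.01, 5000)
years_back = growth_duration_years(1000, n_now, 0.01)
```

Under these assumptions the two functions are inverses of each other, so either the current size or the time since growth began can be treated as the derived quantity.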

5.2.3. Results: R96 data set

From the 121 Y-chromosomes of the R96 data set (Ruiz Linares et al., 1996), we analysed the 115 chromosomes with no missing data at the YAP and SNP sites. Although missing STR data are relatively easy to handle, data missing from UEP sites are more problematic, and removing just six chromosomes seemed preferable to an arbitrary assumption. Sample sizes for this subset (and the full data set) are shown in Table 5.

Although the data set is very different from that of Cooper et al. (1996), some aspects of the posterior distributions summarized in Table 4 are strikingly similar. For example, the estimates of N under each model are similar for the two data sets, even though the number and distribution of source populations differ greatly. A possible interpretation is that levels of migration among human populations have been such that the current geographic location is relatively unimportant in explaining genetic diversity. More striking still is the similarity across the two data sets of the four TMRCA estimates.

One difference between the two sets of results is that the growth rate estimates are lower for the R96, worldwide, data set than for the C96 data set which is dominated by Europe.

The estimates of relative (effective) sizes of the 13 populations of the R96 data set (given in Table 5) are surprisingly uniform, with posterior means ranging from about 7% to 9% of the total, and bearing no apparent relationship with current census population sizes. The three largest values correspond to the three African populations, in accord with a greater genetic diversity in Africa than in other continents, which has been reported by many previous researchers. The two Amazonian populations had the smallest values.

Table 6 gives some probabilities for clusterings of different populations under the splitting model, with and without growth. The most recent population split was, with probability 38%, between Cambodia and China. The deepest split is estimated to have been between the African and non-African populations, a finding which has also been frequently reported from previous data sets. Although this split is assigned a posterior probability of only about 16%, note that the 13 populations are not grouped into continents a priori, so the 213−1 possible groupings at the root split are initially equally likely. Thus the data provide very strong support for the deepest split being between the African and non-African populations, as well as for the two other continental groupings shown in Table 6.

Table 6.  Posterior probabilities for various population groupings from the R96 data set†

Population grouping                      Posterior probability (%)
                                         Splitting only    Splitting and growth

(a) At the most recent split
CA, CH                                   38                38
LI, PZ                                   20                18

(b) At the root split
AFR versus non-AFR                       17                15
AFR + ASI + OCE versus AME + EU          14                15
AFR + AME + EU versus ASI + OCE          7                 10

(c) At any split
KA + SU                                  69                79
CA + CH                                  63                65
LI + PZ                                  57                51
ASI + NG                                 33                47
AME + EU                                 31                45
CA + CH + NG                             46                45

†See Table 5 for the population codes. For (c), each entry gives the proportion of BATWING outputs such that the tree has a node which is ancestral to the stated populations only. The table shows all groupings with support under at least one model of 10% or more for (a) and (b), and 40% or more for (c).

Looking at groupings which arise at any node in the tree, the three African populations form the strongest cluster, occurring in over 90% of trees. The two Amazonian populations also cluster together frequently. The strongest cross-continent clusterings involve New Guinea with the three Asian populations and Europe with the three American populations. The latter clustering seems surprising but may be due either to a substantial gene flow into both America and Europe from north or central Asia during and/or since the last ice age or perhaps very recent gene flow direct from Europe into native American populations during the period of European colonization. (The data sets consist of aboriginal peoples as far as this can be verified, but there may nevertheless be some recent admixture.)

5.3. β-globin sequences

5.3.1. Modelling assumptions

Harding et al. (1997) inferred an ancestral haplotype for their sample by comparison with a chimpanzee sequence. Using the standard coalescent model together with the infinite sites assumption, they obtained a point estimate of 2.55 for the mutation parameter θ. Conditional on this estimate, and on an estimate of the mutation rate obtained from human–chimpanzee comparisons, they inferred a point estimate of 895 000 years for TMRCA (95% confidence interval 380 000–1 410 000 years).

Our reanalyses regard the tree as unrooted a priori and draw inferences about the root without using the chimpanzee sequence. Further, we incorporate uncertainty about μ, the per sequence and per generation mutation rate. We started with a gamma(3, 10⁵) preprior for μ, based on a genome-wide substitution rate estimate of 10⁻⁸ per site. Harding et al. (1997) observed 31 sites monomorphic in humans but varying between humans and chimpanzees. Assuming that there have been about 2.5×10⁵ generations since the human–chimpanzee split, this leads to a gamma(34, 6×10⁵) distribution for μ, which we adopted as a prior distribution for our analyses. For the effective population size N, we assume respectively a gamma(5, 2.5×10⁻⁴) and a gamma(3, 2.5×10⁻⁴) prior distribution under the standard coalescent and the coalescent-with-growth models (these are the priors used for the Y-chromosome analyses, scaled up by a factor of 4).
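The step from the preprior to the prior for μ is consistent with a conjugate gamma–Poisson update, sketched below. The sketch assumes that the 31 fixed differences are Poisson counts accumulated along two lineages (human and chimpanzee) of 2.5×10⁵ generations each; the factor of 2 and the function name are assumptions made for illustration, chosen because they reproduce the stated gamma(34, 6×10⁵) distribution.

```python
def gamma_poisson_update(shape, rate, events, exposure):
    """Conjugate update of a gamma(shape, rate) distribution for a Poisson
    rate after observing `events` over a total `exposure`."""
    return shape + events, rate + exposure


# Preprior gamma(3, 1e5); 31 fixed human-chimpanzee differences.
# Exposure: two lineages of 2.5e5 generations each (assumed, see lead-in).
shape, rate = gamma_poisson_update(3, 1e5, events=31, exposure=2 * 2.5e5)
prior_mean = shape / rate  # mean mutation rate per sequence per generation
```

The resulting mean is about 5.7×10⁻⁵ per sequence per generation.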

5.3.2. Results

Prior and posterior quantiles, under both the standard coalescent and the coalescent-with-growth models, are shown in Table 7. Our modelling assumptions are similar to those of Harding et al. (1997), including the adoption of the infinite sites mutation model, and in many respects our results are similar. We use a generation time G=25 years, for comparison with our other analyses, whereas Harding et al. (1997) used G=20 years. After allowing for this difference, our posterior median estimates of TMRCA (1 100 000 and 1 200 000 years ago) are similar to the point estimate of Harding et al. (1997), but our 90% interval is wider than even their 95% interval, because we model uncertainty about N, μ and the root of the tree. In addition, the confidence interval of Harding et al. (1997) is symmetric about the estimate (±2 SDs), whereas our posterior distribution is skew and our interval is shifted towards higher values.
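The allowance for the different generation times can be sketched as a simple rescaling, assuming that an estimate in years is a number of generations multiplied by G (the function name is hypothetical). The inputs are the point estimate and 95% confidence limits of Harding et al. (1997) quoted in Section 5.3.1.

```python
def rescale_years(years, from_generation_years, to_generation_years):
    """Convert an age in years between two assumed generation times,
    holding the number of generations fixed."""
    return years * to_generation_years / from_generation_years


# Harding et al. (1997) used G = 20 years; G = 25 years is used here.
h97_point = rescale_years(895_000, 20, 25)    # 1 118 750 years
h97_lower = rescale_years(380_000, 20, 25)    # 475 000 years
h97_upper = rescale_years(1_410_000, 20, 25)  # 1 762 500 years
```

On this scale the H97 point estimate is close to 1 100 000 years.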

Table 7.  Results from BATWING analyses of the H97 β-globin data set (Harding et al., 1997)†

Parameter                  Prior quantiles                        Posterior quantiles
                           5%          50%         95%            5%          50%         95%

(a) Standard coalescent
N                          7.9 × 10³   19 × 10³    37 × 10³       12 × 10³    21 × 10³    33 × 10³
μ (per 10⁵ generations)
θ (≡ 2Nμ)                  0.842.
TMRCA (years)              240 × 10³   780 × 10³   2.4 × 10⁶      690 × 10³   1.2 × 10⁶   2.1 × 10⁶

(b) Coalescent with growth
Ancestral N                3.2 × 10³   11 × 10³    25 × 10³       9.8 × 10³   17 × 10³    28 × 10³
Nc                         59 × 10³    1.1 × 10⁶   100 × 10⁶      70 × 10³    560 × 10³   18 × 10⁶
r (%)                      0.088       0.42        1.2            0.23        0.66        1.5
tg (years)                 7.3 × 10³   28 × 10³    150 × 10³      4.8 × 10³   14 × 10³    36 × 10³
TMRCA (years)              140 × 10³   480 × 10³   1.6 × 10⁶      620 × 10³   1.1 × 10⁶   2.0 × 10⁶

†The notation and details of the BATWING runs are as for Table 3. The prior distributions are as follows: μ ∼ gamma(34, 6 × 10⁵); for model (a), N ∼ gamma(5, 2.5 × 10⁻⁴); for model (b), N ∼ gamma(3, 2.5 × 10⁻⁴); log(Nc/N) ∼ gamma(5, 1); r ∼ gamma(2, 400). The generation time G=25 years.

Table 8 shows part of the posterior distribution for the root node sequence, based on the human data alone. The most likely root sequence is the one assumed by Harding et al. (1997), based on the chimpanzee comparisons, indicating that such outgroup comparisons are not required to estimate ancestral sequences, although these are of course helpful if they are available. Table 9 shows posterior quantiles for the age of each mutation (which is assumed unique at each segregating site under the infinite sites model). Our results are in broad agreement with those of Harding et al. (1997) except that we report (identical) marginal posterior distributions for mutations which cannot be distinguished from the data, whereas they reported, in effect, order statistics.

Table 8.  Posterior probabilities of the MRCA state for the H97 data set

Sequence       Posterior probability of being root (%) for the following models:
               Standard coalescent    Coalescent with growth
All others     50.2                   51.1

Table 9.  Posterior quantiles of the age of each mutation for the H97 data set

Site           Age of mutation (× 10³ years)
               H97                    Posterior quantiles

We also performed our analyses under a finite sites mutation model, which does not require the assumption of at most one mutation at each site. For these data, weakening the infinite sites model did not lead to any substantial changes in inferences (the results are not shown).

5.4. Mitochondrial minisequences

5.4.1. Modelling assumptions

Each of the three racial groups in the FSS mitochondrial DNA data set (Section 2.3) was analysed using the F84 mutation model for the 10 SNP sites, the SMM for the STR locus and the standard coalescent model, with the same prior distributions in each case.

The prior distributions that were adopted for the analysis are shown in Table 10. Prior distributions which are symmetric on the logarithmic scale seem appropriate for the ratios πA/πT, πC/πT and πG/πT. We chose prior medians that were close to the sample proportions, averaged over the 10 SNP sites studied here, in the data compiled by Handt et al. (1998), consisting of several thousand human mitochondrial DNA sequences (an updated version is available online). Because of shared ancestry, the prior variances should be much higher than would be justified by sampling error in the background data. We chose variances such that the prior density was at least half the modal value within a factor of 2 either side of the median.

Table 10.  Prior distributions for the mutation and demographic parameters in the mitochondrial DNA analysis†

Parameter    Prior               Quantiles
α/2.5        lognormal(−11,2)    0.033 × 10⁻⁵    1.7 × 10⁻⁵    84 × 10⁻⁵
μSNP                             0.031 × 10⁻⁵    1.6 × 10⁻⁵    88 × 10⁻⁵
μSTR         lognormal(−7,2)     0.18 × 10⁻⁴     9.1 × 10⁻⁴    460 × 10⁻⁴

†The prior for μSNP is determined by equation (5) and the priors above it in the table; for this row only, the quantiles stated are simulation-based approximations.
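The quantiles of the two lognormal rows of Table 10 can be checked with a short sketch. It assumes that the second argument of lognormal(·,·) is the standard deviation on the logarithmic scale and that the three tabulated quantiles are the 2.5%, 50% and 97.5% points; both are assumptions made here because they reproduce the tabulated values, and the function name is hypothetical.

```python
import math
from statistics import NormalDist


def lognormal_quantiles(mu, sigma, probs=(0.025, 0.5, 0.975)):
    """Quantiles of exp(X), where X is normal with mean mu and s.d. sigma."""
    z = NormalDist()
    return [math.exp(mu + sigma * z.inv_cdf(p)) for p in probs]


alpha_q = lognormal_quantiles(-11, 2)  # row alpha/2.5, lognormal(-11, 2)
str_q = lognormal_quantiles(-7, 2)     # row mu_STR, lognormal(-7, 2)
```

Under these assumptions the medians are exp(−11) ≈ 1.7 × 10⁻⁵ and exp(−7) ≈ 9.1 × 10⁻⁴, and each outer quantile lies a factor of about 50 from the median.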

The database of Handt et al. (1998) was also used to formulate a prior for the ratio of transversions to transitions at these 10 sites. A rough point estimate of this ratio is obtained by taking the most frequently observed nucleotide at each site as the ancestral state and comparing the number of haplotypes that represent a transversional change to the number that represent a transitional change. For the FSS data, this ratio is 1:127, which is considerably lower than estimates for the control region as a whole (Meyer et al., 1999), which may reflect ascertainment bias in the selection of these sites. The prior chosen for β/α has a median that is close to this point estimate but also supports the higher estimates.

The prior for μSNP is implied by the definition (5) and the priors for the five parameters α, β, πA/πT, πC/πT and πG/πT. The priors defined so far suggest the approximation μSNP ≈ α/2.5. The prior for α/2.5 was chosen to be sufficiently diffuse to cover both the (low) estimates for μSNP derived from phylogenetic studies and the (high) estimates derived from pedigree and coalescent studies (Sigurðardóttir et al., 2000).

The prior for μSTR was chosen to cover the range of estimates for mammalian dinucleotide STRs (Weber and Wong, 1993; Schug et al., 1998). Similarly, the prior for N covers published point estimates for the effective, ancestral population size of human mitochondrial DNA (Sherry et al., 1997; Bonneuil, 1998).

For each of the racial groups, an MCMC-based approximation to the match probability under our model was obtained treating as the crime scene haplotype

  • (a) the haplotype that is most common in that racial group,
  • (b) a haplotype that is unobserved in that group but is very similar to the commonest haplotype and
  • (c) a haplotype that is unobserved and is also very dissimilar to any observed haplotype

(the crime scene haplotype differs from each observed haplotype by a transversion substitution at each SNP site).

Recently evidence has been presented for recombination in mitochondrial DNA, but the consensus seems to remain that it is not subject to recombination (Eyre-Walker and Awadalla, 2001). If this proves to be wrong, the implications for our results may not be great since recombination would presumably be rare and the sites that are studied here are all from the mitochondrial DNA control region.

5.4.2. Results

The MCMC-based match probabilities are shown under ‘MCMC’ in Table 11, along with some average properties of the genealogical tree. As expected, when the crime scene haplotype is dissimilar to any common haplotype, the genealogical tree is much higher than for a ‘similar’ haplotype. The effect on the match probability is, however, less predictable: for Caucasians, the dissimilar haplotype has a lower match probability than the similar haplotype, whereas the ordering is reversed for the other two populations. Two factors may contribute to this: the larger sample size for Caucasians and the fact that the most common Caucasian haplotype has a higher relative frequency than do the most common Asian and Afro-Caribbean haplotypes.

Table 11.  Match probabilities for three mitochondrial DNA haplotypes in each of three FSS databases†

Population          Haplotype                  P(match) (%)       Average        Average        Branch
                                               Naïve    MCMC      tree length    tree height    length
Caucasian, n=153    ATTTG5CGTTT    Common      25       18        9.6            1.4            0.018
Asian, n=43         GTCTG5CGTTT    Common      21       16        8.5            1.9            0.059

†The haplotypes chosen were the most common in the database (‘common’), a haplotype which is unobserved in the database but which differs only at the STR locus from the most common haplotype (‘similar’) and an unobserved haplotype which is dissimilar to any observed haplotype (‘dissimilar’). The match probabilities were calculated using the following: the relative frequency among the n+1 observed haplotypes (‘naïve’); the coalescent model with the F84 model and SMM, approximated from 10⁵ MCMC outputs. The average height and total length of the genealogical trees are shown in coalescent time units, as is the length of the branch from x, the unobserved terminal node, to z, the most recent ancestor shared with an observed sequence.

The relative frequency of the crime scene profile among the n+1 observed sequences provides a natural approximation to the match probability. It would be the maximum likelihood estimate of the population proportion if observed sequences were treated as independent, ignoring the genealogical structure and ascertainment (the profile of interest is the last one sampled). The number of possible mitochondrial DNA sequences is vast, and many sequences which occur in the population will not be represented in the reference sample; hence the relative frequency of the mitochondrial DNA profile of s may tend to overstate the match probability. In typical forensic applications, such an overstatement would favour defendants and may be regarded as less serious than an error which tends to disadvantage defendants. It is therefore of interest to investigate situations in which the relative frequency understates the match probability; if these are sufficiently rare then the relative frequency approximation may suffice in place of a more carefully calculated match probability. Table 11 suggests that the naïve match probability will usually, but not always, be conservative.
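The naïve approximation can be written down in a few lines. The sketch below uses a hypothetical database of haplotype labels (not the FSS data) and a hypothetical function name; it counts the crime scene sequence as the (n+1)th observation, so that an unobserved haplotype receives probability 1/(n+1) rather than 0.

```python
from collections import Counter


def naive_match_probability(database, crime_scene_haplotype):
    """Relative frequency of the crime scene haplotype among the n+1
    sequences formed by the database plus the crime scene sequence."""
    count = Counter(database)[crime_scene_haplotype]
    return (count + 1) / (len(database) + 1)


# Hypothetical database of n = 9 haplotype labels.
db = ["H1"] * 5 + ["H2"] * 3 + ["H3"]
common = naive_match_probability(db, "H1")  # (5 + 1) / (9 + 1)
unseen = naive_match_probability(db, "H9")  # (0 + 1) / (9 + 1)
```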

The results presented here are based on the standard coalescent model. We have repeated the analyses using the coalescent-with-growth model and found no appreciable differences in the match probabilities (the results are not shown).

6. Discussion

Changes in population size and structure correspond to rescaling time in the coalescent tree. For example, relative to the standard coalescent model, the times between coalescence events are greater when the population size is large, and vice versa. Therefore the pattern of coalescence events provides indirect evidence about past demographic parameters. However, coalescences are not observed, but are inferred from the data. Moreover, these inferences are usually imprecise: if the mutation rate is high then all trace of the earliest coalescences may be eliminated by subsequent mutations, whereas if the mutation rate is low there will be insufficient mutations to ‘document’ many of the coalescences.
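The dependence of coalescence times on population size can be made concrete with the standard expectations for the constant-size coalescent. The sketch below assumes that each pair of lineages coalesces at rate 1/N per generation, so that one coalescent time unit is N generations; other scaling conventions change the constants but not the proportionality to N. The function names are hypothetical.

```python
def expected_intercoalescence_times(n, pop_size):
    """Expected time (in generations) during which exactly k lineages exist,
    for k = n, ..., 2, under the constant-size coalescent.
    Assumes each pair of lineages coalesces at rate 1/pop_size per generation."""
    return {k: 2.0 * pop_size / (k * (k - 1)) for k in range(n, 1, -1)}


def expected_tmrca(n, pop_size):
    """Expected tree height in generations; equals 2 * pop_size * (1 - 1/n)."""
    return sum(expected_intercoalescence_times(n, pop_size).values())


times = expected_intercoalescence_times(10, 1_000)
small = expected_tmrca(10, 1_000)  # about 1800 generations
large = expected_tmrca(10, 2_000)  # about 3600 generations
```

Doubling the population size doubles every expected interval. The final interval, with two lineages remaining, accounts for more than half of the expected tree height.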

More generally, genealogical models such as coalescent models describe a complex stochastic process evolving through time, whereas most data sets, including those examined in the present paper, provide information at only a single time point. For some demographic parameters, more precise inferences can be made by using multiple unlinked loci, in which case every locus gives independent information (Beaumont, 1999). For genealogical parameters such as TMRCA, and for all inferences based on Y-chromosome and mitochondrial DNA data (which have their own unique male- and female-mediated demographies), this option is not available and only imprecise inferences can reasonably be expected on the basis of genetic information alone, even by using the most powerful methodology. However, genetic data can profitably be combined with information from other sources, such as palaeontology, archaeology and historical records.

The imprecision of inferences is reflected in the posterior distributions shown in Fig. 5, some of which are very diffuse even though the modelling assumptions hold exactly. Despite the poor prospects for precise answers, questions of human population history are of such intrinsic interest, and potential usefulness in understanding other aspects of human genetics, that they seem worth pursuing.

Although we cannot accurately estimate growth rates or splitting times from the data that were considered here, some inferences can be made with reasonable confidence. In particular, the low estimates of the TMRCA for Y-chromosome data seem reasonably robust to modelling assumptions. Our results are further supported by Thomson et al. (2000), who analysed different Y-chromosome data and employed different modelling assumptions and a different method of analysis. They are also supported by Pritchard et al. (1999), who analysed a larger data set of 445 chromosomes, typed at eight STRs, under similar modelling assumptions, but who approximated the posterior distribution by a rejection method based on a vector of non-sufficient summary statistics.

In contrast with the results of the Y-chromosome analyses, the TMRCA estimates for the β-globin data are high relative to the prior distribution. In fact, the β-globin TMRCA is estimated to be very roughly 30-fold higher than the Y-TMRCA, whereas the simplest models would predict a fourfold difference. This may be due to homogenizing selection acting on the Y-chromosome (an advantageous Y-haplotype may have ‘swept’ through the population relatively recently), whereas balancing selection may be plausible for the β-globin locus, which would tend to increase TMRCA compared with expectations under a neutral model. However, there is substantial between-locus variability in TMRCA even under a neutral model, and so selection may not be needed to explain the observed disparity between Y-chromosome and β-globin results.
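The size of the disparity can be checked from the posterior medians quoted in Sections 5.2.2 and 5.3.2. The sketch below takes the fourfold expectation to reflect the ratio of autosomal to Y-chromosome copy numbers (two autosomal copies in each sex against one Y-chromosome in males only) under equal numbers of males and females; that reading, and the variable names, are assumptions made for illustration.

```python
# Posterior median TMRCA estimates quoted in the text (years).
beta_globin_tmrca = (1_100_000, 1_200_000)  # the two beta-globin analyses
y_tmrca = (43_000, 25_000)                  # C96: standard coalescent, with growth

ratios = [b / y for b in beta_globin_tmrca for y in y_tmrca]
lowest, highest = min(ratios), max(ratios)

expected_ratio = 4  # assumed neutral expectation, see lead-in
excess = [r / expected_ratio for r in ratios]
```

The ratios range from about 26 to 48, so that the observed disparity exceeds the fourfold expectation by a factor of roughly 6 to 12, depending on which pair of estimates is compared.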

Alternative software is available, providing analyses that are similar in some respects to those performed by BATWING. Kuhner et al. (1995, 1998) and Beerli and Felsenstein (1999, 2001) have developed LAMARC, a suite of MCMC algorithms for the estimation of parameters such as the exponential growth rate and migration rates, which is available online.

These algorithms do not exploit auxiliary variables at the internal nodes of the tree. They apply the MCMC algorithm only to the genealogical tree: the demographic and mutation parameters are fixed in each run, and a likelihood surface is inferred by using importance sampling reweighting of a number of runs with different driving values. Stephens and Donnelly (2000) have pointed out problems with this approach: the variance of the importance weights can become large for points on the likelihood surface that are not close to the driving values.
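The difficulty with distant driving values can be illustrated by a toy example that is unrelated to the LAMARC implementation: draws from an exponential distribution with a driving rate are reweighted to other rates, and the effective sample size of the importance weights is used as a rough measure of their variability. The distributions, the function names and the seed are all assumptions made for illustration.

```python
import math
import random


def effective_sample_size(weights):
    """Kish effective sample size: (sum of w)^2 / (sum of w^2)."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)


def reweight(samples, driving_rate, target_rate):
    """Importance weights f_target(x) / f_driving(x) for exponential densities."""
    return [
        (target_rate / driving_rate) * math.exp(-(target_rate - driving_rate) * x)
        for x in samples
    ]


random.seed(1)
driving = 1.0
draws = [random.expovariate(driving) for _ in range(5000)]
ess = {
    target: effective_sample_size(reweight(draws, driving, target))
    for target in (1.0, 0.9, 0.6)
}
```

At the driving value all weights equal 1 and the effective sample size is the number of draws; it falls as the target rate moves away from the driving value. In this toy example the variance of the weights is infinite for target rates at or below half the driving rate.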

GENETREE is described in Bahlo and Griffiths (2000) and is available online.

It approximates likelihood surfaces for the mutation rate under the infinite sites mutation model, an exponential growth rate and a matrix of migration parameters under the island model. The methodology underlying GENETREE can be viewed as a version of importance sampling; the run times are often very long and Stephens and Donnelly (2000) have discussed methods for choosing more efficient importance sampling weights.

For forensic mitochondrial DNA match probabilities, we have found that the current practice of reporting sample relative frequencies seems to be conservative in most but not all cases, provided that the crime scene profile is included in the calculation (because the crime scene sequence is the last sampled, this implies that a zero frequency can never occur). Tully et al. (2001) discussed the more conservative approach of using the relative frequency after the crime scene profile has been added twice to the database, as well as the upper 95% confidence limit, which is even more conservative. For Y-chromosome match probabilities, Roewer et al. (2000) employed a posterior mean with respect to a prior beta distribution, with parameters determined by the observed haplotype diversity, which leads to less conservative values. None of the methods, including our own, includes modelling regional variation on a scale that is finer than that for which databases are available, and this may be important for both mitochondrial DNA and Y-chromosome match probabilities.
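The effect of adding the crime scene profile once or twice to the database can be compared directly. The sketch below uses a hypothetical database size and hypothetical match counts, and the function name is an assumption; it does not cover the upper confidence limit or the beta prior approach.

```python
def relative_frequency(count, n, copies_added=1):
    """Relative frequency of the crime scene profile after adding it
    `copies_added` times to a database of n profiles with `count` matches."""
    return (count + copies_added) / (n + copies_added)


n = 100  # hypothetical database size
comparison = {
    count: (relative_frequency(count, n, 1), relative_frequency(count, n, 2))
    for count in (0, 5, 25)
}
```

For any count below n the value with two copies added exceeds the value with one copy added, and the relative difference is largest for profiles that are unobserved in the database.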

We believe that the methodology presented here represents an important advance towards the goal of fully likelihood-based methods for analysing DNA sequence data: we can obtain simultaneous inferences about a wider range of demographic, evolutionary and genealogical parameters, together with more realistic assessments of uncertainty, and for a wider range of DNA data types, than has hitherto been feasible. The present work, together with other recent developments, brings closer the goal of quantitative model criticism, and model comparisons, for detailed statistical models of the genetic history of humans and other species. Under our most complex model, we can currently analyse in reasonable time (say, a few days on a modest desk top workstation) sample sizes of over 200 chromosomes with haplotypes of up to about 10 STR markers (additional UEPs typically reduce the computation time and so are effectively unlimited). Further developments of algorithms and faster computers will continue to increase the feasible sample sizes.


We thank Andres Ruiz Linares, Rosalind Harding and Gillian Tully and Ian Evett of the UK FSS, for providing data and useful advice. Many thanks are due to the various users of BATWING for feed-back on the program, in particular Oliver Pybus, for helping with the Macintosh version, and Noah Rosenberg. This work was supported by the UK Engineering and Physical Sciences Research Council under grant GR/K72599.