Abstract
Summary. We develop a flexible class of Metropolis–Hastings algorithms for drawing inferences about population histories and mutation rates from deoxyribonucleic acid (DNA) sequence data. Match probabilities for use in forensic identification are also obtained, which is particularly useful for mitochondrial DNA profiles. Our data augmentation approach, in which the ancestral DNA data are inferred at each node of the genealogical tree, simplifies likelihood calculations and permits a wide class of mutation models to be employed, so that many different types of DNA sequence data can be analysed within our framework. Moreover, simpler likelihood calculations imply greater freedom for generating tree proposals, so that algorithms with good mixing properties can be implemented. We incorporate the effects of demography by means of simple mechanisms for changes in population size and structure, and we estimate the corresponding demographic parameters, but we do not here allow for the effects of either recombination or selection. We illustrate our methods by application to four human DNA data sets, consisting of DNA sequences, short tandem repeat loci, single-nucleotide polymorphism sites and insertion sites. Two of the data sets are drawn from the male-specific Y-chromosome, one from maternally inherited mitochondrial DNA and one from the β-globin locus on chromosome 11.
1. Introduction
Underlying a sample of deoxyribonucleic acid (DNA) sequence data is a complex pattern of dependences that reflects the ancestral relationships between the sequences. In the absence of recombination, these relationships can be represented by a genealogical tree for which each tip, or leaf, corresponds to a sequence at the present time. Moving towards the root of the tree corresponds to going backwards in time, and branches merge, or ‘coalesce’, when the corresponding DNA sequences last had a common ancestor. The root of the tree represents the most recent common ancestor (MRCA) of all the sequences in the sample.
Although the underlying genealogical tree is crucial to modelling the dependence structure of a DNA sample, it is effectively ignored by traditional methods of analysing DNA sequence data which, for example, are based on averaging pairwise statistics over all pairs in the sample. In recent years, however, important advances have been made towards the goal of fully likelihood-based statistical inference from population genetics data (Griffiths and Tavaré, 1994; Kuhner et al., 1995, 1998; Beerli and Felsenstein, 1999, 2001; Beaumont, 1999; Anderson et al., 2000; Bahlo and Griffiths, 2000; Stephens and Donnelly, 2000; Chikhi et al., 2001; Donnelly et al., 2001; Nielsen and Wakeley, 2001; Markovtsova et al., 2000a, b). The key developments underpinning these advances involve
However, implementing these models and algorithms remains extremely challenging because of the complexity of the processes underlying the data, which include historical patterns of migration, mating behaviour and population growth, as well as mutation and selection. Moreover, if recombination cannot be ignored then there may be different genealogical trees for different segments of the sequences (Griffiths and Marjoram, 1997; Kuhner et al., 2000; Nielsen, 2000; Fearnhead and Donnelly, 2002).
Wilson and Balding (1998) developed a Metropolis–Hastings algorithm for completely linked ‘microsatellite’, or short tandem repeat (STR), loci and reanalysed the human Y-chromosome STR data set of Cooper et al. (1996), described later in Section 2.1. They found evidence for a relatively small effective population size for human males (point estimates around 3000) and for a short time since the most recent common ancestor (TMRCA; point estimates around 30 000−40 000 years). However, the modelling assumptions employed by Wilson and Balding (1998) were limited to the standard coalescent (Section 3.1.1) and the stepwise mutation model (Section 3.2.1), and the sensitivity of the final inferences to these modelling assumptions was not investigated.
Here, we extend the analyses of Wilson and Balding (1998) by permitting changes in population size, or structure or both. Inevitably, fully realistic models for the historical patterns of human mating and migration remain outside our grasp. However, we can implement models that generalize those of Wilson and Balding (1998), and hence explore sensitivity to their modelling assumptions, and which capture at least some, and possibly most, of the major underlying demographic effects.
Further, we extend the mutation model of Wilson and Balding (1998) to permit the analysis of a wide range of DNA sequence data for which recombination and selection can be neglected. For the parts of the human genome which are subject to recombination, its effects can often be ignored for sequences of up to a few thousand base pairs (bp). However, recombination rates appear to be highly variable so some much larger chromosome segments may be unaffected by recombination, and some short segments may be heavily affected (Goldstein, 2001).
Inferences about population histories and evolutionary processes are not only of intrinsic interest but are also crucial to the interpretation of genetic data in a wide range of applications, from conservation genetics (Beaumont, 2001) to mapping disease genes (Clayton, 2000). We illustrate this point by elaborating an application of our inferential framework to the assessment of DNA profile evidence. Specifically, we extend our Markov chain Monte Carlo (MCMC) algorithm by introducing an additional, no-data node, to obtain match probabilities for forensic identification by using mitochondrial DNA miniprofiles.
The present paper is addressed in part to statisticians who may be unfamiliar with some of the genetics terminology. To assist such readers we have included a very brief glossary of terms in Table 1, and each term included is highlighted in italics at first use. Table 1 also includes a list of genetics abbreviations. For further background reading, Hartl and Clark (1997) provide a popular introduction to population genetics, and more advanced material may be found in Balding et al. (2001).
Table 1. Terminology and abbreviations used in the text†
Term  Definition
Allele  Possible state of the DNA sequence (or feature derived from it) at a locus
Base pair (bp)  Unit of DNA sequence length, equal to the number of nucleotides
Chromosome  Can here be regarded as a long DNA sequence
Genome  Total genetic inheritance of an organism; the human genome consists of 23 chromosome pairs (one maternal and one paternal) plus mitochondrial DNA
Haplotype  Alleles at two or more loci on the same chromosome (cf. genotype: unordered allele pairs {maternal, paternal} at one or more loci)
Locus (plural loci)  Specified site or short region on a chromosome
Mitochondrial DNA  Circular chromosome inherited maternally
MRCA  Most recent common ancestor
Mutation  Process that changes the allele at a locus
Nucleotide  DNA sequence unit which takes one of four types, denoted A, C, G and T
Polymorphism  Locus at which more than one allele arises in a given population (cf. monomorphism: an invariant locus)
Recombination  Exchange of DNA between maternal and paternal chromosomes (not mitochondrial DNA or Y) during the formation of sperm and egg cells
Selection  Process whereby advantageous alleles tend to become more common and disadvantageous alleles less common (cf. neutral: not subject to selection)
SMM  Stepwise mutation model
SNP  Single-nucleotide polymorphism
STR  Short tandem repeat
TMRCA  Time since most recent common ancestor
Transition  Mutation involving a substitution either of a purine nucleotide (A or G) by the other or of a pyrimidine (C or T) by the other
Transversion  Substitution of a purine nucleotide by a pyrimidine, or vice versa
UEP  Unique event polymorphism
YAP  Y-chromosome Alu polymorphism
Y-chromosome  Sex-determining chromosome, borne only by males
2.1. Y-chromosome haplotypes
STR alleles consist of a short DNA sequence motif, e.g. GATA, multiply repeated (Goldstein and Schlötterer, 1999). Cooper et al. (1996) reported the numbers of STRs at each of five loci on the non-recombining part of the human Y-chromosome, for 174 apparently unrelated men from East Anglia (UK). A further 23 men from northern Nigeria and 15 from Sardinia were also typed. Although the three groups sampled cannot be regarded as representative of all human populations, they include both African and non-African populations, which is important in view of the prominence of the theory of a recent African origin of modern humans (Relethford, 1998). A further advantage of this data set is that the STR unit is the same at each locus, which makes more plausible an assumption of a common mutation mechanism for all loci.
In addition to the STR data, Cooper et al. (1996) reported for each Y-chromosome the presence or absence of the so-called Alu insertion sequence at a particular locus, known as the Y-chromosome Alu polymorphism (YAP) locus described in Hammer (1994). In formulating likelihoods for the YAP data, we face an ascertainment problem which is common to many DNA sequence data sets. Cooper et al. (1996) chose to type this locus because it was known to be polymorphic in many human populations (and many other potential loci that were not known to be polymorphic were not typed). Inferences which can be drawn under these circumstances will differ in general from inferences which would be justified if the locus had been chosen to be typed ‘at random’.
The human Y-chromosome data recorded by Ruiz Linares et al. (1996) consist of five dinucleotide STR loci (including one monomorphic locus), a tetranucleotide STR (one of those included in the study of Cooper et al. (1996)), the YAP locus and a single-nucleotide polymorphism (SNP) site. Data were obtained from 13 worldwide populations, although the sample sizes are often small (see Table 5 later). Ruiz Linares et al. (1996) noted substantial geographic clustering of the observed haplotypes and inferred that the MRCA of human Y-chromosomes cannot have been very recent, but they did not give an estimate of this time.
Table 5. Sample sizes for the R96 human Y-chromosome data set (Ruiz Linares et al., 1996)†
Population or region  Code  Sample size (a)  Sample size (b)  Mean population proportion (%), splitting only  Mean population proportion (%), splitting and growth
Lisongo  LI  4  4  8.0  8.2 
CAR‡ pygmy  PC  12  12  8.2  8.2 
Zaire pygmy  PZ  9  10  8.5  8.8 
Africa total  AFR  25  26  24.7  25.1 
Karitania  KA  11  11  7.3  7.0 
Maya  MA  9  9  8.0  7.9 
Surui  SU  8  8  7.0  6.9 
Americas total  AME  28  28  22.2  21.8 
Cambodia  CA  15  15  7.5  7.4 
China  CH  9  13  7.5  7.5 
Japan  JA  11  11  7.9  7.8 
East Asia total  ASI  35  39  22.9  22.6 
Australia  AU  2  2  7.4  7.5 
Melanesia  ME  4  4  7.4  7.6 
New Guinea  NG  6  6  7.6  7.6 
Oceania total  OCE  12  12  22.4  22.7 
Europe  EU  15  16  7.8  7.7 
Grand total   115  121  100.0  100.0 
2.2. β-globin sequences
Harding et al. (1997) analysed a subset of the data of Fullerton et al. (1994), consisting of 61 DNA sequences from the Melanesian population of Vanuatu. The 3000 bp region sequenced encompasses the β-globin gene. One end of this region was later identified as a recombination hot spot and 330 bp at that end were ignored to permit analyses based on an assumption of no recombination.
Harding et al. (1997) adopted the ‘infinite sites’ mutation model, which implies that at any one DNA site there has been no more than one mutation since the MRCA of the sample. The full data set was not consistent with the assumption, but Harding et al. (1997) discarded four sequences which seemed to have been affected by recombination, resulting in a 57-sequence data set that was consistent with infinite sites. In contrast with the geographic clustering of Y-chromosomes reported by Ruiz Linares et al. (1996), Harding et al. (1997) reported substantial haplotype diversity in the Vanuatu sample, with all known haplogroups (groups of similar haplotypes) represented in this sample from a single isolated location. Similarly, the estimate of Harding et al. (1997) of the effective size of the ancestral Vanuatu population is approximately the same as many estimates of the effective size of the entire human population, suggesting that the current geographic isolation is unimportant in explaining the observed genetic diversity. They obtained a point estimate of 895 000 years for the TMRCA of the Vanuatu β-globin sequences, which is longer than corresponding estimates based on worldwide Y-chromosome data, even when the fourfold larger population size of β-globin sequences is taken into account. (Each child receives two β-globin sequences from its parents, but on average only half a Y-chromosome. Under simple models, the expected TMRCA is proportional to the population size and so would be fourfold higher for β-globin sequences than for Y-chromosomes.)
The infinite sites assumption adopted by Harding et al. (1997) is crucial to some methods of analysing DNA sequence data, since it implies that all mutations which have occurred are directly visible in the data; none have been ‘overwritten’ by a subsequent mutation. The assumption is not valid for many data sets. It is neither required nor advantageous under the framework that is developed here, but it can readily be implemented and we do so later (Section 5.3) to permit a comparison with the results of Harding et al. (1997).
2.3. Mitochondrial minisequences
The maternally inherited mitochondrial DNA has been widely used to infer aspects of human female population histories (Cann et al., 1987; Sykes, 1999). Because it exists in multiple copies throughout each cell, compared with just one copy of nuclear DNA from each parent, mitochondrial DNA is easier to type from small and/or degraded samples. In recent years Neanderthal and ancient Australian mitochondrial DNA have been successfully typed (Krings et al., 1997; Ovchinnikov et al., 2000; Adcock et al., 2001).
Mitochondrial DNA typing is also useful in forensic identification (Tully et al., 1996, 2001; Bataille et al., 1999), mostly for samples of shed hair, which are often recovered at crime scenes and which contain little or no nuclear DNA. In addition, mitochondrial DNA has been successfully typed from DNA-poor sources such as saliva on stamps (Allen et al., 1998) and tooth root dentine (Pfeiffer et al., 1998). However, mitochondrial DNA suffers from a serious drawback in forensic settings: because of the absence of recombination, the mitochondrial DNA sequences of two apparently unrelated individuals can be identical, or very similar, owing to shared inheritance from a common maternal ancestor, possibly many generations in the past. This makes it difficult to assess the evidential weight of matching mitochondrial DNA profiles in a way which is fair but makes efficient use of the data.
The UK Forensic Science Service (FSS) employs mitochondrial DNA ‘minisequences’ consisting of 12 loci in the mitochondrial DNA control region that display a high level of genetic variation, while being relatively quick to type. The 12 loci comprise 10 SNP sites, a dinucleotide STR locus and a locus made up of multiple copies of the C base (a poly-C region) (Tully et al., 1996). The poly-C locus was excluded from our analyses, both because there is little available information from which to formulate an appropriate mutation model and because it is not always utilized in forensic casework owing to its high rate of within-individual variation. We analyse an FSS database of 297 minisequences, obtained from apparently unrelated UK residents, 152 with primarily European ancestry, 103 of Afro-Caribbean origin and 42 with Asian ancestry.
The ascertainment bias problem discussed in Section 2.1 arises again for mitochondrial DNA minisequences: the 12 loci were selected in part because preliminary studies indicated high variability at these loci.
3.1.1. Standard coalescent
The coalescent is a stochastic model for the genealogical tree representing the ancestral relationships between a sample of n DNA sequences. The sequences are regarded as labelled, to avoid combinatorial complications, but they are not yet observed. The model has two attractive features: it is mathematically tractable, and it approximates the distribution of genealogical trees under an important class of neutral population genetics models, including the Wright–Fisher model of a random-mating population of constant size N, and the general exchangeable models of Cannings (1974). To recover these approximations, 1 unit of ‘coalescent’ time must be interpreted as N/σ^{2} generations, where σ^{2} denotes the variance in the number of ‘offspring’ of a sequence in the next generation. We assign σ^{2}=1 here but note that the mating behaviour of men differs from that of women and σ^{2} for Y-chromosome sequences may be larger than the value that is appropriate for mitochondrial DNA sequences, and may possibly be much larger than 1.
Note that we use N for the number of chromosomes, and not the number of individuals: for the β-globin data, N sequences correspond to N/2 individuals; for Y-chromosome and mitochondrial DNA data, N sequences correspond to 2N individuals, N males and N females.
Coalescent time runs backwards, with time t_{0}≡0 denoting the present, corresponding to the leaves of the tree, whereas t_{j}, j ∈ {1,2,…,n−1}, denotes the time of the jth most recent coalescence event. In particular, t_{n−1} denotes the time of the root, or MRCA. Under the standard coalescent model, the between-coalescence intervals t_{j} − t_{j−1} have independent exponential distributions:
P(t_{j} > t | t_{j−1} = t^{′}) = exp{−(n−j+1)(n−j)(t − t^{′})/2}  (1)
for t>t^{′}. At each coalescence event, all pairs of extant lineages are equally likely to be the pair that coalesces.
Mutations in the standard coalescent model occur along the branches of the tree at the points of a homogeneous Poisson process with rate θ/2. Because of the coalescent time rescaling, this corresponds to a mutation rate of μ ≡ θ/(2N) per locus per generation in a population of N sequences.
Two notable features of the standard coalescent model are
 (a) the long period of time (on average, more than half TMRCA) in which the tree has just two lineages and
 (b) the high variance in total tree height (the standard deviation (SD) is typically about 60% of the mean).
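The two features listed above can be checked by simulation. The following Python sketch is illustrative only and is not taken from the software used in this paper; the function names, the sample size n=20, the seed and the number of replicates are all assumptions. It draws each between-coalescence interval as an independent exponential with rate k(k−1)/2 while k lineages remain, and then summarizes the tree height.

```python
import random

def simulate_coalescence_times(n, rng):
    """Return t_1 < ... < t_{n-1}, in coalescent units, for a sample of n sequences."""
    times, t = [], 0.0
    for k in range(n, 1, -1):        # k lineages remain before this event
        rate = k * (k - 1) / 2.0     # each of the k(k-1)/2 pairs coalesces at rate 1
        t += rng.expovariate(rate)
        times.append(t)
    return times

def summarize(n, replicates=20000, seed=1):
    """Monte Carlo mean and SD of the TMRCA, and mean length of the two-lineage period."""
    rng = random.Random(seed)
    heights, two_lineage = [], []
    for _ in range(replicates):
        times = simulate_coalescence_times(n, rng)
        heights.append(times[-1])
        two_lineage.append(times[-1] - (times[-2] if n > 2 else 0.0))
    mean_h = sum(heights) / replicates
    sd_h = (sum((h - mean_h) ** 2 for h in heights) / (replicates - 1)) ** 0.5
    mean_two = sum(two_lineage) / replicates
    return mean_h, sd_h, mean_two

mean_h, sd_h, mean_two = summarize(n=20)
print("mean TMRCA:", mean_h)
print("SD / mean:", sd_h / mean_h)
print("share of TMRCA with two lineages:", mean_two / mean_h)
```

For n=20 the expected TMRCA is 2(1−1/n)=1.9 coalescent units, of which the final two-lineage interval has mean 1, so the simulated share should be a little above one half.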
3.1.2. Coalescent with population growth
The standard coalescent model is the special case λ(s)≡1 of the model in which equation (1) is replaced with
P(t_{j} > t | t_{j−1} = t^{′}) = exp[−(n−j+1)(n−j){Λ(t) − Λ(t^{′})}/2]  (2)
where Λ(t) is an increasing differentiable function with Λ(t) = ∫_{0}^{t} λ(s)^{−1} ds. The model thus defined approximates the genealogy of a sample drawn from a random-mating population of size N λ(t) at time Nt generations ago (Hudson, 1991; Donnelly and Tavaré, 1995). Intuitively, an increment of coalescent time corresponds to more generations when the population size is large than when it is small.
One simple model for a change in population size is the ‘k-size coalescent’, which in the case k=2 is specified by
λ(t) = α for t < t_{g}, and λ(t) = 1 for t ≥ t_{g},  (3)
corresponding to a population of constant size N until time t_{g}, after which it instantly attains its present size, Nα.
Pure exponential growth at rate r per generation corresponds to
λ(t) = c exp(−Rt),
where R≡Nr and c is an arbitrary constant. When R>0, there are fewer recent coalescences than under the standard coalescent model (with the same expected total branch length, which can be achieved by an appropriate choice of c) and hence more mutations represented only once in the data. When R < 0 each coalescence event has positive probability that it does not occur in finite time, leading to clusters of sequences separated by an infinity of mutations.
Pure exponential growth is unlikely to provide a good model for global human population size, since recent high growth rates would imply a vanishingly small population size a few thousand years ago. Marjoram and Donnelly (1994) considered a two-parameter model for which
λ(t) = exp{R(t_{g} − t)} for t < t_{g}, and λ(t) = 1 for t ≥ t_{g},
corresponding to a population of constant size N until Nt_{g} generations ago, after which it grew at rate r per generation to reach its current size N_{c}, where
N_{c} = N exp(Rt_{g}).
We adopt this model below, and for convenience we refer to it as the ‘coalescent with growth’, even though other formulations of population growth are possible. Under this model,
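Λ(t) = exp(−Rt_{g}){exp(Rt) − 1}/R for t ≤ t_{g}, and Λ(t) = t − t_{g} + {1 − exp(−Rt_{g})}/R for t > t_{g},
and the coalescence time distributions (2) become
P(t_{j} > t | t_{j−1} = t^{′}) = exp[−(n−j+1)(n−j) exp(−Rt_{g}){exp(Rt) − exp(Rt^{′})}/(2R)] for t^{′} < t ≤ t_{g},
P(t_{j} > t | t_{j−1} = t^{′}) = exp(−(n−j+1)(n−j)[t − t_{g} + {1 − exp(−R(t_{g} − t^{′}))}/R]/2) for t^{′} ≤ t_{g} < t,
P(t_{j} > t | t_{j−1} = t^{′}) = exp{−(n−j+1)(n−j)(t − t^{′})/2} for t_{g} < t^{′} < t.  (4)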
The coalescent with growth model reduces to the standard coalescent model both when t_{g}=0 and in the limit as R→0. In the examples below we rely on background information to justify an a priori assumption of R≥0 but note that R<0 in our model does not lead to infinite coalescence times, as is the case for pure exponential growth.
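Coalescence times under the coalescent with growth can be simulated by a time change of the standard coalescent. The Python sketch below is an illustration under stated assumptions rather than the implementation used in this paper: it takes λ(t) = exp{R(t_{g} − t)} for t < t_{g} and λ(t) = 1 otherwise, defines Λ as the integral of 1/λ, and uses the fact that Λ(t_{j}) − Λ(t_{j−1}) is exponential with rate k(k−1)/2 while k lineages remain. The function names and the parameter values R=50, t_{g}=0.1 are hypothetical.

```python
import math
import random

def Lambda(t, R, t_g):
    """Integral of 1/lambda(s) over [0, t]; lambda(s) = exp(R (t_g - s)) for s < t_g, else 1."""
    s = min(t, t_g)                      # part of [0, t] lying in the growth phase
    grown = s if R == 0 else math.exp(-R * t_g) * math.expm1(R * s) / R
    return grown + max(t - t_g, 0.0)     # constant-size phase contributes t - t_g

def Lambda_inv(u, R, t_g):
    """Inverse of Lambda: the time t at which the cumulative intensity equals u."""
    u_g = Lambda(t_g, R, t_g)
    if u > u_g:
        return t_g + (u - u_g)
    return u if R == 0 else math.log1p(R * u * math.exp(R * t_g)) / R

def simulate_times(n, R, t_g, rng):
    """Coalescence times t_1 < ... < t_{n-1} under the assumed growth model."""
    times, u = [], 0.0
    for k in range(n, 1, -1):
        u += rng.expovariate(k * (k - 1) / 2.0)   # standard coalescent on the Lambda scale
        times.append(Lambda_inv(u, R, t_g))
    return times

rng = random.Random(0)
print(simulate_times(10, R=50.0, t_g=0.1, rng=rng))
```

Setting t_g=0 or R=0 makes both functions the identity, which recovers the standard coalescent. The exponential in `Lambda_inv` can overflow when R·t_g is very large, so a production version would need to work on the log scale.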
3.1.3. Coalescent with population splitting
Human populations are often subdivided, in some cases by cultural barriers, but most obviously by geographical barriers or distance. The two Y-chromosome samples that were described in Section 2.1 are both divided into subsamples obtained from geographically distinct human subpopulations such that individuals are more likely to mate within their subpopulation than outside it. Modelling population subdivision may be crucial to the interpretation of the two data sets, but this is not necessarily the case: a relatively low level of migration can suffice largely to eliminate the effects of subdivision (Hartl and Clark, 1997), and there is evidence for large-scale migrations throughout human history and prehistory (Cavalli-Sforza and Cavalli-Sforza, 1995).
One popular approach to modelling subdivision is based on the island model of Wright (1931). This is an equilibrium model in which each pair of subpopulations exchanges migrants at points of a homogeneous Poisson process of given rate; see Wilkinson-Herbots (1998) for further details and Bahlo and Griffiths (2000) and Beerli and Felsenstein (2001) for coalescent-based inference under this model. However, the equilibrium assumption underlying the island model is questionable for human populations. In addition, the number of migration parameters grows with the square of the number of populations, becoming unmanageably large for more than a handful of populations, and an assumption of a common migration rate is usually highly unrealistic.
A simple non-equilibrium model which allows changes in population structure is that employed by Weir and Cockerham (1984), which posits a random-mating ancestral population of size N until Nt_{a} generations ago, when it split into isolated subpopulations of equal sizes. We adopt this model with two extensions, to allow
 (a) the ith subpopulation to have size Nα_{i}, with Σ α_{i}=1 and
 (b) population bifurcations occurring at different times.
The process of subpopulation splits creates a population ‘supertree’, with one leaf for each subpopulation, which we model separately from the underlying genealogical tree. Fig. 1 illustrates a realization of a genealogy in four subpopulations under this ‘splitting’ model.
For the coalescent approximation to the splitting model, the genealogical tree underlying the sample from each subpopulation is given by the k-size coalescent, introduced above at expression (3). This ‘coalescent with splitting’ rules out certain coalescence events (between sequences in different subpopulations) which would be permitted under the standard coalescent model. However, those coalescences which are permitted occur at a higher rate because of the smaller subpopulation sizes. Overall, the rate of coalescences can be either greater or less than under the standard coalescent model. However, if the subpopulation sizes are approximately equal, and the subsample sizes are also approximately equal, then the overall rate of coalescences is reduced by population splitting, and the expected TMRCA is increased.
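The comparison of coalescence rates can be made concrete with a small calculation. The Python sketch below is illustrative only; the function names and the sample configurations are assumptions. It compares the total coalescence rate at the present time, in units where each pair of lineages in a population of size N coalesces at rate 1, so that each pair within a subpopulation of size Nα_{i} coalesces at rate 1/α_{i}.

```python
def pairs(k):
    """Number of distinct pairs among k lineages."""
    return k * (k - 1) / 2.0

def total_rate_standard(sample_sizes):
    """Total coalescence rate if all sampled lineages share one population of size N."""
    return pairs(sum(sample_sizes))

def total_rate_split(sample_sizes, alphas):
    """Total rate when subsample i sits in an isolated subpopulation of size N * alphas[i]."""
    return sum(pairs(n_i) / a_i for n_i, a_i in zip(sample_sizes, alphas))

# Equal subpopulation and subsample sizes: splitting lowers the total rate.
print(total_rate_split([5, 5, 5, 5], [0.25] * 4))      # 160.0
print(total_rate_standard([5, 5, 5, 5]))               # 190.0

# A large subsample from a small subpopulation: splitting raises the total rate.
print(total_rate_split([18, 1, 1], [0.125, 0.4375, 0.4375]))   # 1224.0
print(total_rate_standard([18, 1, 1]))                         # 190.0
```

These are rates at time zero only; as lineages coalesce and subpopulations merge back in time the comparison changes, so the numbers indicate the direction of the effect rather than the expected TMRCA.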
The coalescent-with-splitting model of population structure remains unrealistic for many human populations, because it disallows the exchange of migrants between subpopulations after they split. Nielsen and Wakeley (2001) formulated a model which incorporates migration after a split, but their implementation is limited to two subpopulations, without population growth. Our simpler model captures much of the effect of population subdivision with relatively few additional parameters.
3.2.1. Short tandem repeat loci
Since different STR loci are usually widely separated, it seems reasonable to assume independence of the mutation processes at distinct loci. The most widely adopted mutation model for an STR locus is the stepwise mutation model (SMM) in which the mutant allele differs from its parent by one repeat unit. Steps in each direction are equally likely, irrespective of the current allele length. Although there is evidence (Brinkmann et al., 1998; Kayser et al., 2000) for the SMM being close to the actual mutation process at STR loci, there is also evidence of deviations from the model, which we now briefly discuss.
Cooper et al. (1999) analysed Y-chromosome STR data by using coalescent models and reported evidence for a mutation bias, with mutations leading to increases in allele length more often than decreases. However, the signal in the data for such a bias lies in the skewness of the allele frequency distribution, and this may have other causes. The direct observations of Kayser et al. (2000) do show a non-significant excess of increases (10) over decreases (4), but Brinkmann et al. (1998) reported a slight excess in the other direction.
There is direct evidence for the strict one-step model being false: Kayser et al. (2000) observed one mutation (out of 14) which altered the allele length by two repeat units and Brinkmann et al. (1998) reported one two-step mutation and 22 one-step mutations. Pritchard et al. (1999) fitted a model to Y-chromosome STRs in which the change in the number of repeat units forms a geometric(p) random variable. They estimated p to be close to 1, in which case their model is difficult to distinguish from the SMM with a slightly higher mutation rate.
Both Kayser et al. (2000) and Brinkmann et al. (1998) reported an apparent correlation between allele length and mutation rate. In the former study the correlation was weak, and in the latter case an important contribution to the correlation seems to have arisen from an absence of mutations observed at loci with mean allele length under 10 repeats. No such loci are included in the two Y-chromosome data sets that we analyse below.
Extensions to the SMM to allow for mutation rate increasing with length, mutation bias or multistep mutation events can readily be implemented within our framework. For the reasons indicated above, and to keep the presentation as simple as reasonably possible, we choose not to implement any such extensions here and retain the SMM adopted by Wilson and Balding (1998).
The SMM has no equilibrium distribution, and so there is no natural prior distribution for the STR repeat number at the root of the genealogical tree. A prior which is uniform on the positive integers, although improper, leads to a proper posterior distribution and is adopted below. Although the SMM does not constrain allele lengths to be positive, we find that, conditional on the current allele sizes, the probability that an ancestral allele is assigned a non-positive length is negligible in practice.
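The behaviour of the SMM along a single branch can be illustrated by simulation. The Python sketch below is a minimal example under stated assumptions, not code from this paper: mutations arrive as a Poisson process with rate θ/2 per unit of coalescent time, each changing the repeat number by +1 or −1 with equal probability. The function name, the starting allele of 15 repeats and the values θ=4 and t=1 are hypothetical.

```python
import random

def simulate_smm_branch(start, theta, t, rng):
    """Repeat number at the end of a branch of length t, starting from `start`."""
    allele = start
    clock = rng.expovariate(theta / 2.0)      # waiting time to the first mutation
    while clock < t:
        allele += rng.choice((-1, 1))         # one-step change, both directions equally likely
        clock += rng.expovariate(theta / 2.0)
    return allele

rng = random.Random(42)
ends = [simulate_smm_branch(15, theta=4.0, t=1.0, rng=rng) for _ in range(20000)]
shifts = [e - 15 for e in ends]
mean_shift = sum(shifts) / len(shifts)
var_shift = sum((s - mean_shift) ** 2 for s in shifts) / (len(shifts) - 1)
print("mean change:", mean_shift)
print("variance of change:", var_shift)
```

The net change has mean 0 and variance θt/2, the expected number of mutations on the branch, so the variance of allele length grows without bound along a lineage. This is one way to see why the SMM has no equilibrium distribution.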
3.2.2. Single-nucleotide polymorphism sites: recurrent mutation
We assume that insertions, deletions and translocations of nucleotides are sufficiently rare that they can be ignored for the timescales that are relevant here. Mutations thus consist only of the substitution of one nucleotide by another. The mitochondrial DNA minisequence sites are separated by many nucleotides, and so it is natural to assume that the substitutions at distinct sites are mutually independent. An SNP mutation model can then be specified by a continuous time Markov chain on the states A, C, G and T.
The simplest such model is the Jukes–Cantor model (Jukes and Cantor, 1969), in which all possible substitutions are equally likely, so that the chain has a uniform stationary distribution: π_{A} = π_{C} = π_{G} = π_{T} = 1/4. Felsenstein (1981) introduced the F81 model, with an arbitrary stationary distribution: substitutions occur at rate β and the mutant nucleotide is A, C, G or T with probabilities π_{A}, π_{C}, π_{G} and π_{T}. Note that a nucleotide may be substituted for one of the same type, so that the effective mutation rate is less than β.
We adopt here the F84 model, an extension of the F81 model that is similar to the model of Hasegawa et al. (1985). It has been implemented since 1984 in the program DNAML in the PHYLIP suite of programs (described in Felsenstein and Churchill (1996)). Under the F84 model, a parameter α specifies the nominal rate of additional mutations restricted to be transitions. If the current nucleotide is either of the purines (A or G), then the mutant nucleotide is A or G with probabilities π_{A}/(π_{A}+π_{G}) and π_{G}/(π_{A}+π_{G}), and similarly for the pyrimidines. The stationary distribution is unaffected by the additional transitions, and the overall effective mutation rate is
β(1 − π_{A}^{2} − π_{C}^{2} − π_{G}^{2} − π_{T}^{2}) + 2α{π_{A}π_{G}/(π_{A}+π_{G}) + π_{C}π_{T}/(π_{C}+π_{T})}  (5)
per nucleotide per generation.
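The F84 description can be turned into an explicit rate matrix. The Python sketch below is illustrative and is not taken from DNAML or from the software used in this paper; the function names and the numerical values of π, β and α are assumptions. It builds the off-diagonal substitution rates, checks that π is stationary, and computes the effective mutation rate as the stationary-weighted rate of events that actually change the nucleotide.

```python
GROUP = {"A": "AG", "G": "AG", "C": "CT", "T": "CT"}   # purines and pyrimidines

def f84_rates(pi, beta, alpha):
    """Off-diagonal rates q[i][j] of the F84 chain for stationary probabilities pi."""
    q = {}
    for i in "ACGT":
        group_mass = sum(pi[x] for x in GROUP[i])
        q[i] = {}
        for j in "ACGT":
            if j == i:
                continue
            rate = beta * pi[j]                        # F81 part: any nucleotide
            if j in GROUP[i]:
                rate += alpha * pi[j] / group_mass     # extra events restricted to transitions
            q[i][j] = rate
    return q

def effective_rate(pi, q):
    """Stationary-weighted rate of state-changing substitutions."""
    return sum(pi[i] * sum(q[i].values()) for i in "ACGT")

def is_stationary(pi, q, tol=1e-12):
    """Global balance: probability flow into each state equals flow out of it."""
    for j in "ACGT":
        inflow = sum(pi[i] * q[i][j] for i in "ACGT" if i != j)
        outflow = pi[j] * sum(q[j].values())
        if abs(inflow - outflow) > tol:
            return False
    return True

pi = {"A": 0.3, "C": 0.2, "G": 0.25, "T": 0.25}
q = f84_rates(pi, beta=1.0, alpha=2.0)
print(is_stationary(pi, q), effective_rate(pi, q))
```

With α=0 and uniform π the same functions give the Jukes–Cantor case, for which the effective rate is 3β/4.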
3.2.3. Unique event polymorphism loci
If θ (≡2Nμ)≪1 at a biallelic locus, it may be reasonable to assume that there was only one mutation event underlying the observed polymorphism. Although this unique event polymorphism (UEP) assumption is not required in our framework, if valid it can greatly reduce the size of the tree space that must be explored, allowing both computational efficiency and more precise inferences. These advantages are enhanced if it is known which of the two alleles is ancestral, in which case any two haplotypes with the non-ancestral state must be more closely related than two haplotypes with differing states.
The YAP (Section 2.1) was assumed by Cooper et al. (1996) to be a UEP. The Alu element is widely interspersed in the human genome and seems capable of being inserted at any location (Sherry et al., 1997), so two inserts at precisely the same location seem very unlikely.
For SNPs, many researchers assume the ‘infinite sites’ mutation model (Section 2.2) under which all SNPs are also UEPs. Since per site θ for humans is on average of the order of 10^{−3}, this assumption seems reasonable provided that mutation is reasonably homogeneous over sites. In contrast with the YAP, for which the UEP assumption implies that the MRCA sequence must lack the Alu insert, the ancestral SNP state is usually unknown, though it is sometimes assumed to be the most common observed state.
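A rough calculation indicates how often a site with per-site θ of about 10^{−3} would carry more than one mutation. The Python sketch below is an approximation under stated assumptions and not a result from this paper: it assumes the standard coalescent, replaces the random total branch length by its expectation 2(1 + 1/2 + … + 1/(n−1)), and treats the number of mutations at the site as Poisson with mean θ/2 times that length. The function names are hypothetical, and n=57 is borrowed from the β-globin sample only for illustration.

```python
import math

def expected_mutations_per_site(theta, n):
    """(theta / 2) times the expected total branch length 2 * H_{n-1}."""
    harmonic = sum(1.0 / k for k in range(1, n))
    return theta * harmonic

def prob_recurrent_given_mutated(theta, n):
    """Approximate P(two or more mutations | at least one mutation) at a site."""
    m = expected_mutations_per_site(theta, n)
    p_at_least_one = -math.expm1(-m)                  # 1 - exp(-m), computed stably
    p_at_least_two = p_at_least_one - m * math.exp(-m)
    return p_at_least_two / p_at_least_one

print(prob_recurrent_given_mutated(theta=1e-3, n=57))
```

Under these assumptions only a fraction of a percent of mutated sites would have experienced a second mutation. The figure would be larger if mutation rates varied substantially across sites, which is the caveat noted in the text.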
4.1. Coalescent models as Bayesian priors
Coalescent models have revolutionized population genetics in the past decade by moving the emphasis away from prospective modelling of populations, which allows at best only crude comparisons with observed data, and towards a retrospective description of the genealogy of a sample. Although coalescent-based methods permit the specification of a likelihood function, many of the new inferential methods are not fully likelihood based, primarily because of the computational complexity (see Fu and Li (1999) for a review). Likelihood-based methods have been formulated in recent years, either via MCMC or via importance sampling approaches (Stephens, 2001), thereby bringing to population genetics problems the benefits of statistical efficiency and quantitative model comparison.
Because of the complex models and substantial background information, the Bayesian paradigm provides an appropriate framework for statistical inference in the present setting. The demographic models that were introduced in Section 3.1 specify prior distributions for the genealogical tree underlying a sample of DNA sequences, and inference about aspects of this tree proceeds via its posterior distribution given the observed data (and the priors for the mutation rate and other parameters).
The development of MCMC algorithms to approximate the required posterior distribution is challenging for several reasons, including the complexity of the likelihood computations. Wilson and Balding (1998) developed a Metropolis–Hastings algorithm, based on an augmented data approach in which the allelic states at all internal nodes of the tree (i.e. at each coalescence) were regarded as auxiliary variables. The likelihood calculations were thereby greatly simplified, at the cost of a larger parameter space. Thanks to the simpler likelihood calculations, a wide class of proposal distributions for exploring the space of genealogical trees was available, allowing an algorithm with good mixing properties to be developed. We retain these features in the algorithms that are presented here, together with additional features to incorporate the demographic and mutation models introduced earlier.
4.2.2. Proposal distributions
BATWING's proposal distribution for updating the tree topology is similar to that of MICSAT, described in Wilson and Balding (1998) and also available at the above Web site. Briefly, candidate trees are obtained by selecting an internal node at random on the current tree and choosing a new location at random, but locations near nodes of similar allelic type are more likely to be chosen than locations near dissimilar nodes. For demographic models involving splitting, and for mutation models involving UEPs, the obvious modifications are implemented: the initial state and any proposed state are disallowed if they contravene the model. Thus, for example, if the descendants of the selected node are from two different subpopulations, then the node can be relocated only before the time at which those two subpopulations split.
In addition to updates which change the topology of the tree, BATWING also proposes updates to coalescence times (keeping internal haplotypes constant) and updates to internal haplotypes (keeping coalescence times constant). For a UEP site with unknown ancestral state, proposals are made to change the root UEP haplotype at random among those consistent with the tree.
The method for updating the population supertree topology and splitting times is based on a representation of trees given in Mau et al. (1999). A planar representation of the supertree is made by randomly allocating ‘left’ and ‘right’ subpopulations at each splitting event. The tree can then be written as a list of populations, with a unique splitting time associated with each pair of neighbouring populations. Updates of these splitting times are made by choosing a split at random, and then choosing a new time for this split, uniformly between 0 and the most recent coalescence between two sequences, one from each of the populations. The updated planar representation then implies a new supertree, which may differ in topology as well as splitting times from the current supertree.
Updates to the proportions into which the ancestral population splits are made independently of the supertree updates. Two populations i and j are chosen at random, and α_{i}′, the new proportion in population i, is drawn uniformly in (0,α_{i}+α_{j}). Finally, α_{j}′=α_{i}+α_{j}−α_{i}′, so that the proportions still sum to 1.
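The proportion update can be sketched as follows. This is an illustration only, not the BATWING source: the function name `propose_proportions` is hypothetical, and the sketch assumes that population j receives whatever remains of the combined proportion, so that the proportions continue to sum to 1.

```python
import random

def propose_proportions(alpha, rng):
    """Redistribute the combined proportion of two randomly chosen populations."""
    i, j = rng.sample(range(len(alpha)), 2)   # two distinct populations
    combined = alpha[i] + alpha[j]
    proposal = list(alpha)
    proposal[i] = rng.uniform(0.0, combined)  # new proportion for population i
    proposal[j] = combined - proposal[i]      # assumption: j takes the remainder
    return proposal

rng = random.Random(1)
current = [0.5, 0.3, 0.2]
candidate = propose_proportions(current, rng)
```

Under these assumptions the move is symmetric: the reverse proposal draws from the same uniform interval, so the proposal densities cancel in the Metropolis–Hastings ratio.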
The demographic parameters (N, N_{c}, r) and μ are each updated on a logarithmic scale by independent uniform perturbations centred at the current value.
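A uniform perturbation on the logarithmic scale can be sketched as below. The half-width of the perturbation and the function names are hypothetical, since the tuning values used by BATWING are not given here. The sketch assumes that the target density is written on the natural scale, in which case the acceptance ratio carries the Jacobian factor proposal/current. As a check, the chain targets the gamma(18,8170) distribution that is used as a prior for μ.

```python
import math
import random

def propose_log_uniform(value, half_width, rng):
    """Perturb a positive parameter uniformly on the log scale, centred at its current value."""
    return value * math.exp(rng.uniform(-half_width, half_width))

def mh_update(value, log_density, half_width, rng):
    """One Metropolis-Hastings update of a positive parameter.

    log_density is the log target density on the natural scale, so the
    acceptance ratio includes the Jacobian factor proposal / value.
    """
    proposal = propose_log_uniform(value, half_width, rng)
    log_ratio = log_density(proposal) - log_density(value) + math.log(proposal / value)
    if rng.random() < math.exp(min(0.0, log_ratio)):
        return proposal
    return value

def log_gamma_density(mu, shape=18.0, rate=8170.0):
    """Unnormalised log density of a gamma(shape, rate) distribution."""
    return (shape - 1.0) * math.log(mu) - rate * mu

rng = random.Random(42)
mu = 1e-3                      # arbitrary starting value
draws = []
for step in range(20000):
    mu = mh_update(mu, log_gamma_density, half_width=0.5, rng=rng)
    if step >= 2000:           # discard an initial burn-in
        draws.append(mu)
mean_mu = sum(draws) / len(draws)   # should be near 18 / 8170
```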
4.2.3. Convergence and mixing
Our principal diagnostic tool has been to compare the results from repeat runs of BATWING with widely spaced initial trees, while all other inputs remain unchanged. For this, BATWING includes a ‘badness’ parameter b_{s} which can be set in the interval [0,1]. The value 0 corresponds to the tree obtained by applying a parsimony heuristic to the data, which is expected to produce a plausible, or ‘good’, tree, whereas 1 corresponds to a random tree under the prior model, which is almost always ‘bad’ for any given data set. With the badness parameter strictly between 0 and 1, successive coalescences in the starting tree are drawn with probability proportional to 1/(10b_{s}+d), where d is the minimum number of mutations required under our modelling assumptions to obtain one haplotype from the other. The allelic types at the new node are drawn from a discretized Gaussian distribution with SD equal to d+10b_{s}/4.
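The weighting of candidate coalescences in the starting tree can be illustrated with a small sketch. The haplotype labels and mutation distances are invented for illustration, and `coalescence_probabilities` is a hypothetical helper rather than part of BATWING; only the weight 1/(10b_{s}+d) is taken from the description above.

```python
import random

def coalescence_probabilities(distances, badness):
    """Probability of each candidate pair, proportional to 1 / (10 * badness + d)."""
    if not 0.0 < badness < 1.0:
        raise ValueError("this weighting applies only for badness strictly between 0 and 1")
    weights = {pair: 1.0 / (10.0 * badness + d) for pair, d in distances.items()}
    total = sum(weights.values())
    return {pair: w / total for pair, w in weights.items()}

# Invented minimum mutation counts d between three haplotypes.
distances = {("h1", "h2"): 0, ("h1", "h3"): 2, ("h2", "h3"): 5}
probs = coalescence_probabilities(distances, badness=0.1)

rng = random.Random(3)
pair = rng.choices(list(probs), weights=list(probs.values()), k=1)[0]
```

With a small badness value the identical pair dominates, whereas values near 1 flatten the weights so that the starting tree depends less on the data.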
In addition, we routinely inspected plots of the log-likelihood and parameter values over each run of BATWING, as well as autocorrelation plots, for signs of non-convergence. Fig. 3 shows trace and autocorrelation function plots for two parameters and the log-posterior density from one of the analyses in the simulation study (Section 5.1), thinned by retaining every fourth output. The plots suggest approximate stationarity, the trace plots showing no obvious trend or jump, and each autocorrelation function decreasing to 0. Mixing seems good for μ but slow for L, which is unsurprising because L is a function of all the branch lengths, and it depends on the population splitting times and the growth parameters. In most studies L is not a parameter of direct interest.
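An autocorrelation diagnostic of this kind can be computed in a few lines. The sketch below is generic and does not reproduce Fig. 3: the chain is a synthetic first-order autoregressive series with coefficient 0.9, chosen only as a stand-in for slowly mixing MCMC output, and it is thinned by retaining every fourth value.

```python
import random

def autocorrelation(series, max_lag):
    """Sample autocorrelation function at lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    centred = [x - mean for x in series]
    denom = sum(c * c for c in centred)
    return [
        sum(centred[t] * centred[t + k] for t in range(n - k)) / denom
        for k in range(max_lag + 1)
    ]

# Synthetic stand-in for MCMC output: AR(1) series with coefficient 0.9.
rng = random.Random(0)
chain = [0.0]
for _ in range(19999):
    chain.append(0.9 * chain[-1] + rng.gauss(0.0, 1.0))

thinned = chain[::4]                      # retain every fourth output
acf = autocorrelation(thinned, max_lag=20)
```

For this synthetic chain the lag-1 autocorrelation of the thinned series should be near 0.9^4, and the function should decay towards 0 at longer lags, which is the pattern looked for in the plots.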
For the data analyses described below, some parameters such as L, r and t_{g} did sometimes mix slowly, particularly for C96 (Cooper et al., 1996), our largest data set with more than 200 haplotypes. However, over the several years that the present paper has been in gestation, the authors have not detected any problem with convergence or mixing that was not easily diagnosed with the simple tools illustrated above and addressed by lengthening the run. In particular, we have not encountered evidence of multimodality in our models. Although our models are complex and our parameter spaces are large, we may benefit in one sense from the poor information content of the data for the parameters of interest: posterior distributions tend to be diffuse and lacking the ‘peaks’ and ‘troughs’ that can lead to pseudo-equilibria.
4.2.4. Validation
Irreducibility of the Markov chain implied by the proposal distributions described above is easy to establish when there are no UEP sites with unknown root state. Successive ‘cut-and-join’ moves can move from any tree to any other. Movement between any two population supertrees can similarly be shown by first changing the underlying genealogical tree, and then successively rearranging splits to obtain the required supertree. With UEPs the cut-and-join moves allow communication between any states that have the same root haplotype. Proposals to change the root UEP haplotype then ensure movement between any two states that are consistent with the data.
In the context of the complex models and data described here, a rigorous verification that the algorithm described above has been correctly implemented in the BATWING software, and usually converges, is not achievable. Some theoretical support is given by Aldous (2000) who showed that, in a no-data setting, a branch swapping algorithm similar to that implemented in BATWING gives convergence in a total variation sense to the appropriate uniform distribution on tree space. The number of steps required is of the order of between n^{2} and n^{3} moves (Aldous conjectured that n^{2} is correct), which is slow compared with the n log(n) moves that are required for a pack of n cards, randomizing one card at a time.
The authors are also encouraged that results from MICSAT published in Wilson and Balding (1998) have since been verified by using an independent algorithm (Stephens and Donnelly, 2000). Further support comes from a simulation study described later in Section 5.1.
4.3. Mitochondrial DNA match probabilities
Suppose that a mitochondrial DNA sequence is recovered from a sample from the crime scene and found to match the mitochondrial DNA sequence obtained from s, a suspect. This observation supports the hypothesis that s is the source of the crime sample. To assess the strength of this support, we wish to evaluate the probability that x would also match the mitochondrial DNA sequence from the crime scene, where x is an alternative possible culprit whose mitochondrial DNA sequence is unknown. For a discussion of the role of match probabilities in the assessment of forensic identification evidence, see Balding and Donnelly (1995).
If we knew the mitochondrial DNA sequence of any of the maternal ancestors of x, as well as the number of generations separating that ancestor from x, then the probability distribution for the mitochondrial DNA sequence of x could readily be calculated under a given mutation model. Of course, this information is not usually available. However, in forensic casework a reference sample is usually available consisting of the mitochondrial DNA sequences of n apparently unrelated individuals drawn from the same racial group as x. The n+1 observed mitochondrial DNA sequences, those of s and the reference sample, provide information about ancestral mitochondrial DNA sequences, and hence about the unobserved mitochondrial DNA sequence of x.
Here, we exploit this information via a modification of the algorithm described in Section 4.2. In addition to the genealogical tree underlying the n+1 observed sequences, we introduce a branch connecting the unobserved DNA sequence of x with the tree, writing z for the new node thus introduced (Fig. 4). The additional branch, and the mitochondrial DNA state at z, are updated in the same way as for the other branches, except that, since no data are available for the mitochondrial DNA sequence of x, there is no contribution to the likelihood from the branch connecting z with x.
At each iteration of the modified algorithm, the probability that the mitochondrial DNA sequence of x matches that of s, conditional on the location and state of z, is readily calculated. The average of these conditional probabilities approximates the match probability, given the observations and the standard coalescent model.
5.1. Simulation studies
A large simulation study was undertaken to check the accuracy of the posterior approximations obtained using BATWING. 100 data sets were simulated from the most complex of our models, the coalescent with splitting and growth (Section 3.1.4). We used three populations, with sample sizes 23, 22 and 15, and the data types were the same as for the C96 data set (five STR loci and one UEP). The SMM (Section 3.2) was employed to generate the STR data and the location of the UEP mutation was chosen at random in the tree. The parameters underlying each simulation (subpopulation sizes, splitting times, start-of-growth time, mutation and growth rates) were obtained via independent draws from the prior distributions that were used to analyse the C96 data set. The model assumptions and prior distributions supplied to BATWING were the same as those used to generate the data.
The average over simulations of the estimated effective sample size, defined at equation (6), is shown in the first row of Table 2.
Table 2. Results for six parameters from a simulation study consisting of 100 data sets generated from the coalescent with splitting and growth model†
p (%)  N  μ  T  L  r  t_{a} 
Average estimated effective sample size  2800  15000  5100  1300  6000  5400 
Hit rate (%) 
10  6 (3.0)  12 (3.0)  3 (3.0)  8 (3.0)  12 (3.0)  9 (3.0) 
30  24 (4.6)  25 (4.6)  25 (4.6)  30 (4.6)  30 (4.6)  30 (4.6) 
50  46 (5.0)  49 (5.0)  41 (5.0)  42 (5.0)  39 (5.0)  47 (5.0) 
70  64 (4.6)  70 (4.6)  66 (4.6)  66 (4.6)  55 (4.6)  63 (4.6) 
90  86 (3.0)  87 (3.0)  88 (3.0)  89 (3.0)  74 (3.0)  82 (3.0) 
Coverage rate (%) 
10  10 (0.3)  9 (0.2)  9 (0.3)  8 (0.5)  9 (0.6)  10 (0.6) 
30  31 (0.8)  28 (0.6)  28 (0.8)  24 (1.4)  27 (1.7)  31 (1.7) 
50  51 (1.2)  47 (1.0)  47 (1.2)  41 (2.0)  46 (2.5)  52 (2.4) 
70  71 (1.2)  66 (1.2)  66 (1.4)  60 (2.3)  65 (2.8)  72 (2.4) 
90  91 (0.7)  86 (1.3)  87 (1.0)  83 (1.7)  89 (1.8)  92 (1.2) 
Although the hit rate is valuable for ease of interpretation, Rubin and Schenker (1986) pointed to more precise methods of assessing interval coverage. We apply their ‘average probability coverage’ method to 100p% equal-tailed prior intervals. Given a parameter, say φ, write C_{x} for the posterior coverage, given data x, of an interval (l,u) with prior probability p: C_{x}=P(l<φ<u | x).
Averaging over data sets generated via random draws from the prior we have E(C_{x})=P(l<φ<u)=p, so the average of C_{x} over the simulated data sets should be close to p if the posterior approximations are accurate.
Overall, our conclusion is that, even after making an informal allowance for multiple comparisons, there are some indications of imperfect mixing or convergence or validity, but these seem unlikely to have a large effect on inferences. For most parameters of interest there is good agreement between the achieved and nominal coverage. The coverage is likely to be better for actual data analyses, because of the possibility of longer runs and more careful monitoring of convergence than is feasible in a simulation study. However, because of its large size (the study occupied 10 computer processors for about 1 week) our simulation study was limited to a relatively small total sample size of 60 chromosomes.
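The logic of the average probability coverage check can be demonstrated on a toy model in which the posterior is known exactly. The sketch below is not the simulation study reported above: it assumes a standard normal prior for a scalar parameter φ and a single normal observation, both chosen purely for illustration, and it uses the exact conjugate posterior where the study would use MCMC output.

```python
import random
from statistics import NormalDist

def average_coverage(p, noise_sd=1.0, n_datasets=20000, seed=0):
    """Average posterior coverage of the 100p% equal-tailed prior interval.

    Toy model (an assumption): phi ~ N(0, 1) and x | phi ~ N(phi, noise_sd^2).
    """
    rng = random.Random(seed)
    prior = NormalDist(0.0, 1.0)
    lower = prior.inv_cdf((1.0 - p) / 2.0)
    upper = prior.inv_cdf((1.0 + p) / 2.0)
    post_sd = (noise_sd ** 2 / (1.0 + noise_sd ** 2)) ** 0.5
    total = 0.0
    for _ in range(n_datasets):
        phi = rng.gauss(0.0, 1.0)                 # parameter drawn from the prior
        x = rng.gauss(phi, noise_sd)              # data generated given phi
        posterior = NormalDist(x / (1.0 + noise_sd ** 2), post_sd)
        total += posterior.cdf(upper) - posterior.cdf(lower)   # coverage C_x
    return total / n_datasets

coverage = {p: average_coverage(p) for p in (0.1, 0.5, 0.9)}
```

When the posterior is computed correctly, the averaged coverage should agree with p up to Monte Carlo error; a systematic departure in an MCMC setting would point to imperfect convergence or an implementation error.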
Each panel of Fig. 5 shows posterior density curves estimated from the BATWING algorithm output for a random selection of 20 of the 100 simulated data sets. Perhaps the most noticeable feature of these curves is that, even in this ideal setting in which the modelling assumptions hold exactly, only weak inferences are possible for some demographic parameters. The reasons will be discussed further in Section 6. Fig. 5 also shows (bold curves) an average posterior density obtained by choosing a random selection of 200 outputs from each of the 100 data sets; in the limit of many data sets, this curve should coincide with the prior curve (broken curves). The agreement is generally good, but a discrepancy is apparent for μ and r.
5.2.1.1. Mutation at short tandem repeat loci.
To formulate a prior for the STR mutation rate μ, Wilson and Balding (1998) drew on the Y-chromosome STR mutation study of Heyer et al. (1997), which recorded three mutations at tetranucleotide Y-STR loci from 1491 observed meioses. Adopting a gamma(1,1) (i.e. standard exponential) preprior, these data led Wilson and Balding (1998) to a gamma(4,1492) prior for μ, with mean 4/1492≈2.7×10^{−3}. More data have since become available, and combining the results of Heyer et al. (1997) with those of Bianchi et al. (1998) and Kayser et al. (2000) we obtain a total of 17 mutations from 8169 meioses, suggesting a gamma(18,8170) prior for μ with mean ≈2.2×10^{−3} (key quantiles of this and other prior distributions are shown in Table 3).
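The arithmetic behind these priors is a conjugate gamma update, which can be checked directly. The sketch assumes that the number of mutations is treated as Poisson with mean μ times the number of meioses; the helper name `update_gamma` is hypothetical.

```python
def update_gamma(shape, rate, mutations, meioses):
    """Conjugate update of a gamma(shape, rate) prior for a per-meiosis mutation rate.

    Assumes the mutation count is Poisson with mean mu * meioses.
    """
    return shape + mutations, rate + meioses

preprior = (1, 1)   # gamma(1,1), i.e. standard exponential

# Heyer et al. (1997) alone: three mutations in 1491 meioses.
wb98_prior = update_gamma(*preprior, mutations=3, meioses=1491)

# Pooled studies: 17 mutations in 8169 meioses.
pooled_prior = update_gamma(*preprior, mutations=17, meioses=8169)

wb98_mean = wb98_prior[0] / wb98_prior[1]
pooled_mean = pooled_prior[0] / pooled_prior[1]
```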
Table 3. Results from BATWING analyses of the C96 human Y-chromosome data set (Cooper et al., 1996)†
Parameter  Prior quantiles (5%, 50%, 95%)  Posterior quantiles (5%, 50%, 95%) 
(a) Standard coalescent 
Effective population size N  2.0 × 10^{3}  4.7 × 10^{3}  9.2 × 10^{3}  2.5 × 10^{3}  4.0 × 10^{3}  6.4 × 10^{3} 
Mutation rate μ per 10^{3} generations  1.4  2.2  3.1  1.3  2.0  3.1 
θ (=2Nμ)  7.8  20  44  13  16  20 
TMRCA (years)  63 × 10^{3}  200 × 10^{3}  600 × 10^{3}  21 × 10^{3}  43 × 10^{3}  120 × 10^{3} 
(b) Coalescent with growth 
Ancestral population size N  820  2.7 × 10^{3}  6.3 × 10^{3}  170  770  2.0 × 10^{3} 
Current effective population size N_{c}  15 × 10^{3}  280 × 10^{3}  28 × 10^{6}  12 × 10^{3}  27 × 10^{3}  84 × 10^{3} 
Growth rate r (% per generation)  0.089  0.42  1.2  0.59  1.2  2.3 
Mutation rate μ per 10^{3} generations  1.4  2.2  3.1  0.90  1.5  2.4 
Time since start growth, GNt_{g} (years)  7.2 × 10^{3}  28 × 10^{3}  150 × 10^{3}  4.3 × 10^{3}  7.4 × 10^{3}  14 × 10^{3} 
TMRCA (years)  48 × 10^{3}  150 × 10^{3}  460 × 10^{3}  14 × 10^{3}  25 × 10^{3}  59 × 10^{3} 
(c) Coalescent with splitting 
Effective population size N  2.0 × 10^{3}  4.7 × 10^{3}  9.2 × 10^{3}  2.8 × 10^{3}  4.1 × 10^{3}  7.1 × 10^{3} 
Mutation rate μ per 10^{3} generations  1.4  2.2  3.1  1.4  2.1  3.1 
θ  7.8  20  44  15  19  23 
Time since most recent split, GNt_{b} (years)  5.2 × 10^{3}  76 × 10^{3}  400 × 10^{3}  400  1.2 × 10^{3}  3.1 × 10^{3} 
Time since 1st split, GNt_{a} (years)  33 × 10^{3}  180 × 10^{3}  670 × 10^{3}  5.1 × 10^{3}  10 × 10^{3}  21 × 10^{3} 
TMRCA (years)  100 × 10^{3}  330 × 10^{3}  1.0 × 10^{6}  24 × 10^{3}  50 × 10^{3}  140 × 10^{3} 
(d) Coalescent with splitting and growth 
Ancestral population size N  820  2.7 × 10^{3}  6.3 × 10^{3}  260  740  1.9 × 10^{3} 
Current effective population size N_{c}  15 × 10^{3}  280 × 10^{3}  28 × 10^{6}  19 × 10^{3}  51 × 10^{3}  210 × 10^{3} 
Growth rate r (% per generation)  0.089  0.42  1.2  0.58  1.2  2.1 
Mutation rate μ per 10^{3} generations  1.4  2.2  3.1  0.93  1.6  2.5 
Time since start growth, GNt_{g} (years)  7.2 × 10^{3}  28 × 10^{3}  150 × 10^{3}  5.5 × 10^{3}  9.2 × 10^{3}  16 × 10^{3} 
Time since most recent split, GNt_{b} (years)  2.5 × 10^{3}  42 × 10^{3}  260 × 10^{3}  1.8 × 10^{3}  4.1 × 10^{3}  7.9 × 10^{3} 
Time since first split, GNt_{a} (years)  16 × 10^{3}  100 × 10^{3}  440 × 10^{3}  6.0 × 10^{3}  9.6 × 10^{3}  17 × 10^{3} 
TMRCA (years)  60 × 10^{3}  200 × 10^{3}  660 × 10^{3}  16 × 10^{3}  29 × 10^{3}  64 × 10^{3} 
We have chosen to rely on the recent, directly observed data to formulate our prior distribution. Some researchers (e.g. Forster et al. (2000)) have questioned the use of contemporary pedigree data for historic mutation rates and point to indirect evidence for a lower mutation rate than is supported by our prior; see Sigurðardóttir et al. (2000) for a discussion of the issues. Pritchard et al. (1999) adopted a gamma(10,12500) prior, with mean ≈0.8×10^{−3}. Although this prior seems inconsistent with our choice, there is some overlap of the two prior density curves and the discrepancy is less than it first appears because Pritchard et al. (1999) allowed mutation steps of size greater than 1.
The STR loci of the C96 data set are all tetranucleotide repeats. Those of Ruiz Linares et al. (1996) are mostly dinucleotides, but include one of the tetranucleotides which was studied by Cooper et al. (1996) and all three of the mutation studies cited above. Some researchers have found dinucleotide mutation rates to be lower than tetranucleotides (Weber and Wong, 1993) whereas others have found the reverse relationship. Kayser et al. (2000) found no significant difference, although they have little dinucleotide data. Here, we adopt the simplest plausible model and assume a common mutation rate for dinucleotide and tetranucleotide STRs.
5.2.1.2. Y-chromosome Alu polymorphism locus.
To allow at least partly for ascertainment bias, we condition on the existence of the YAP insert and assume a priori that it is equally likely to have occurred at any point on the genealogical tree. This assumption reflects the fact that the YAP locus was chosen to be typed because it was known in advance to be polymorphic, but it does not reflect the additional prior information that YAP is polymorphic in many human populations, and so cannot have arisen from a very recent mutation event. Although our prior could be modified to give more support to older times for the YAP mutation event, there is no obvious candidate for a specific assumption, and simulations indicate that more detailed modelling of the ascertainment scenario has no perceptible effect on posterior inferences (the results are not shown). This is because the YAP data are informative about tree topology, but their weak effect on branch lengths is overwhelmed by information from the STR data.
5.2.1.3. Population size and growth parameters.
Wilson and Balding (1998) investigated two prior distributions for N: lognormal(9,1) and gamma(5,10^{−3}). They found that the resulting posteriors of interest were very similar. For the models without population growth, we retain the gamma(5,10^{−3}) prior.
Estimates of the growth rates of human (census) population sizes over the past few thousand years can be obtained from the data reported by Cavalli-Sforza et al. (1994). From their data we estimate an average growth rate over the past few thousand years of about 1.2% per generation for the worldwide population, and 1.6% for Europe. However, effective population sizes can be very different from census sizes, because of, for example, geographical stratification at various levels, age structure and mating behaviour. Plausibly, lower growth rates are appropriate for effective population sizes, and we adopt a gamma(2,400) prior for r, centred at 0.5% but supporting values above 1%.
Under constant population size models, the effective male population size over recent human evolution is often estimated to be of the order of 5000. It thus seems likely that N, the effective ancestral population size (before growth) was somewhat smaller and we choose a gamma(3,10^{−3}) prior. To specify a prior for the current effective population size, N_{c}, we assign the gamma(5,1) distribution to log(N_{c}/N). The prior for the time t_{g} at which growth started is implied by the priors described above, together with the relationship
N_{c}=N exp(rNt_{g})
and the assumption that log(N_{c}/N), N and r are mutually independent.
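The implied prior for the time since growth started can be explored by simulation. The sketch below rests on two assumptions that are not stated in this section: that growth is exponential with N_{c}=N exp(rNt_{g}), where t_{g} is measured in units of N generations, and that the generation time G is 25 years. Under these assumptions GNt_{g}=G log(N_{c}/N)/r, so N cancels and need not be drawn.

```python
import random
import statistics

GENERATION_YEARS = 25   # assumption: generation time G, not stated in this section

def sample_growth_start_years(rng):
    """One prior draw of G * N * t_g, the time in years since growth started."""
    log_ratio = rng.gammavariate(5, 1.0)       # log(N_c / N) ~ gamma(5, 1)
    r = rng.gammavariate(2, 1.0 / 400.0)       # r ~ gamma(2, 400); scale = 1 / rate
    generations = log_ratio / r                # N * t_g, from N_c = N exp(r N t_g)
    return GENERATION_YEARS * generations

rng = random.Random(7)
draws = [sample_growth_start_years(rng) for _ in range(50000)]
median_years = statistics.median(draws)
```

Under the stated assumptions the simulated median should fall close to the prior median of 28 × 10^{3} years reported for GNt_{g} in Tables 3 and 4, which offers some support for the assumed relationship.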
5.2.1.4. Population subdivision parameters.
We employ a gamma(2,1) prior for the time t_{a} at which the earliest split occurs. Subsequent splits occur at times which are jointly uniform, given t_{a}. The prior distribution of the subpopulation sizes, expressed as proportions of the total size, is Dirichlet(2,2,…,2). At each coalescence event in the population supertree, all possible coalescences are equally likely.
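A draw from these subdivision priors might be sketched as follows. Two points are assumptions made for illustration: ‘jointly uniform, given t_{a}’ is read as independent uniform draws on (0,t_{a}) that are then sorted, and the random joining of populations in the supertree is omitted. The function name is hypothetical.

```python
import random

def sample_subdivision_prior(n_pops, rng):
    """Draw splitting times and subpopulation proportions for n_pops populations."""
    t_a = rng.gammavariate(2, 1.0)                       # earliest split, gamma(2,1)
    # Assumption: the remaining n_pops - 2 splits are independent uniforms on (0, t_a).
    later_splits = sorted(rng.uniform(0.0, t_a) for _ in range(n_pops - 2))
    # Dirichlet(2,...,2) via normalised gamma(2,1) draws.
    gammas = [rng.gammavariate(2, 1.0) for _ in range(n_pops)]
    total = sum(gammas)
    proportions = [g / total for g in gammas]
    return t_a, later_splits, proportions

rng = random.Random(11)
t_a, later_splits, proportions = sample_subdivision_prior(3, rng)
```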
5.2.1.6. Time since most recent ancestor.
The prior distribution for TMRCA cannot be independently assigned. It depends (weakly) on the sample sizes but more importantly is a function of the demographic model and prior distributions for population sizes, growth rates and splitting times. Since the splitting time parameters arise in some demographic models and not others, it seems impossible in practice to specify essentially the same prior for TMRCA over our various models. In particular, allowing population growth leads to shorter values for TMRCA, whereas population splitting tends to increase these values, and only unrealistic priors for other parameters could fully compensate for these effects. However, our choices imply priors for TMRCA which overlap substantially, so that prior medians differ by a factor of less than 2 over the four demographic models, and the interval from 120 000 years to 450 000 years lies within the equal-tailed 90% interval for all four prior distributions for both sets of sample sizes (Tables 3 and 4).
Table 4. Results from BATWING analyses of the R96 data set of Ruiz Linares et al. (1996)†
Parameter  Prior quantiles (5%, 50%, 95%)  Posterior quantiles (5%, 50%, 95%) 
(a) Standard coalescent 
Effective population size N  2.0×10^{3}  4.7×10^{3}  9.2×10^{3}  2.4×10^{3}  3.8×10^{3}  6.3×10^{3} 
Mutation rate μ per 10^{3} generations  1.4  2.2  3.1  1.3  2.0  3.1 
θ (=2Nμ)  7.8  20  44  12  15  19 
TMRCA (years)  64×10^{3}  200×10^{3}  600×10^{3}  27×10^{3}  50×10^{3}  110×10^{3} 
(b) Coalescent with growth 
Ancestral population size N  820  2.7×10^{3}  6.3×10^{3}  190  700  1.9×10^{3} 
Current effective population size N_{c}  15×10^{3}  280×10^{3}  28×10^{6}  4.6×10^{3}  8.2×10^{3}  16×10^{3} 
Growth rate r (% per generation)  0.089  0.42  1.2  0.19  0.44  0.94 
Mutation rate μ per 10^{3} generations  1.4  2.2  3.1  1.2  1.9  3.0 
Time since start growth, GNt_{g} (years)  7.2×10^{3}  28×10^{3}  150×10^{3}  6.8×10^{3}  14×10^{3}  28×10^{3} 
TMRCA (years)  48×10^{3}  150×10^{3}  450×10^{3}  17×10^{3}  31×10^{3}  63×10^{3} 
(c) Coalescent with splitting 
Effective population size N  2.0×10^{3}  4.7×10^{3}  9.2×10^{3}  2.5×10^{3}  4.2×10^{3}  6.7×10^{3} 
Mutation rate μ per 10^{3} generations  1.4  2.2  3.1  1.3  2.1  3.1 
θ  3.7  12  33  14  17  22 
Time since most recent split, GNt_{b} (years)  560  11×10^{3}  86×10^{3}  0  60  320 
Time since 1st split, GNt_{a} (years)  33×10^{3}  190×10^{3}  670×10^{3}  3.4×10^{3}  5.8×10^{3}  10×10^{3} 
TMRCA (years)  120×10^{3}  360×10^{3}  1.0 × 10^{6}  27×10^{3}  51×10^{3}  110×10^{3} 
(d) Coalescent with splitting and growth 
Ancestral population size N  820  2.7×10^{3}  6.3×10^{3}  140  540  1.7×10^{3} 
Current effective population size N_{c}  15×10^{3}  280×10^{3}  28×10^{6}  7.9×10^{3}  15×10^{3}  32×10^{3} 
Growth rate r (% per generation)  0.089  0.42  1.2  0.029  0.60  1.2 
Mutation rate μ per 10^{3} generations  1.4  2.2  3.1  1.1  1.8  2.8 
Time since start growth, GNt_{g} (years)  7.2×10^{3}  28×10^{3}  150×10^{3}  7.4×10^{3}  14×10^{3}  26×10^{3} 
Time since most recent split, GNt_{b} (years)  290  6.1×10^{3}  53×10^{3}  10  170  810 
Time since 1st split, GNt_{a} (years)  16×10^{3}  100×10^{3}  440×10^{3}  5.0×10^{3}  8.1×10^{3}  13×10^{3} 
TMRCA (years)  62×10^{3}  210×10^{3}  670×10^{3}  17×10^{3}  29×10^{3}  59×10^{3} 
5.2.2. Results: C96 data set
Wilson and Balding (1998) analysed two subsets of the C96 STR data. Here we analyse the full data set of 212 Y-chromosomes, described in Section 2.1, under all four demographic models introduced in Section 3.1. Key quantiles of the marginal posterior distributions under each of the four models are given in Table 3. In discussing these results we shall regard the posterior median of each parameter of interest as a point estimate.
The striking feature of almost all human Y-chromosome data, including the present data set, is its lack of variability, and the consequences of this are evident in the results reported in Table 3. Under the standard coalescent model, the estimated value of N is about 4000, which may seem implausibly low but is consistent with the results of other studies (Jorde et al., 2001). The estimated TMRCA is 43 000 years, which is well below the 5% quantile of the prior distribution. This result is similar to that obtained by Wilson and Balding (1998), Pritchard et al. (1999) and Thomson et al. (2000), but until recently it would have been regarded as too low since, for example, humans arrived in Australia before that time. One goal of the present analyses is to investigate to what extent this unexpected result might be due to inappropriate modelling assumptions.
Weakening the assumptions of the standard coalescent model to allow for population growth leads to even less plausible results: the estimated TMRCA falls further to only 25 000 years. The growth rate estimate is large (1.2%), but the growth started only recently (estimate 7400 years). Pritchard et al. (1999) estimated a lower growth rate (0.8%) and a longer time since growth (18 000 years); our values lie within their 95% intervals. The current effective population size is estimated at only 27 000, which is several orders of magnitude less than the census population size, but in accord with the estimate of 28 000 obtained by Thomson et al. (2000) by using sequence data analysed via GENETREE (see Section 6).
However, extending the standard coalescent model to allow for geographical structuring via the splitting model (without growth) increases the estimates of both N and TMRCA by about 10%. The first population split occurred recently (median about 10 000 years ago), and this was almost certainly the Nigerian population splitting from the common ancestral population of East Anglians and Sardinians (99.9% of MCMC outputs; the prior assigns probability 1/3 to each possible split). The more recent split is estimated to have occurred approximately 1000 years ago. Care must be taken in interpreting these times since the splitting model does not allow migration after a split. Estimates would therefore reflect the end of a gradual splitting process and could be misleading if there had been substantial migration following a splitting event.
Adding growth to the splitting model, the most recent split moves further into the past, but the TMRCA estimate is reduced substantially (29 000 years). Since the initial split, the growth rate (which is assumed common to all populations) is very high (estimate 1.2% per generation), but the estimate of the current population size remains low (51 000) because of the small size of the ancestral population (estimate 740) and the short period of growth (estimate 9200 years). The probability that it was the Nigerian population which split first from the common ancestral population is now 99.1%, with the remaining 0.9% probability being roughly equally divided between the Sardinian and East Anglian populations. The posterior mean effective size of the Nigerian population is 37% of the total effective population size, with the East Anglian and Sardinian population proportions being 31% and 32% respectively.
5.2.3. Results: R96 data set
From the 121 Y-chromosomes of the R96 data set (Ruiz Linares et al., 1996), we analysed the 115 chromosomes with no missing data at the YAP and SNP sites. Although missing STR data are relatively easy to handle, data missing from UEP sites are more problematic, and removing just six chromosomes seemed preferable to an arbitrary assumption. Sample sizes for this subset (and the full data set) are shown in Table 5.
Although the data set is very different from that of Cooper et al. (1996), some aspects of the posterior distributions summarized in Table 4 are strikingly similar. For example, the estimates of N under each model are similar for the two data sets, even though the number and distribution of source populations differ greatly. A possible interpretation is that levels of migration among human populations have been such that the current geographic location is relatively unimportant in explaining genetic diversity. More striking still is the similarity across the two data sets of the four TMRCA estimates.
One difference between the two sets of results is that the growth rate estimates are lower for the R96, worldwide, data set than for the C96 data set which is dominated by Europe.
The estimates of relative (effective) sizes of the 13 populations of the R96 data set (given in Table 5) are surprisingly uniform, with posterior means ranging from about 7% to 9% of the total, and bearing no apparent relationship with current census population sizes. The three largest values correspond to the three African populations, in accord with a greater genetic diversity in Africa than in other continents, which has been reported by many previous researchers. The two Amazonian populations had the smallest values.
Table 6 gives some probabilities for clusterings of different populations under the splitting model, with and without growth. The most recent population split was, with probability 38%, between Cambodia and China. The deepest split is estimated to have been between the African and non-African populations, a finding which has also been frequently reported from previous data sets. Although this split is assigned a posterior probability of only about 16%, note that the 13 populations are not grouped into continents a priori, so the 2^{12}−1 possible groupings at the root split are initially equally likely. Thus the data provide very strong support for the deepest split being between the African and non-African populations, as well as for the two other continental groupings shown in Table 6.
Table 6. Posterior probabilities for various population groupings from the R96 data set†
Population grouping  Posterior probability (%), splitting only  Posterior probability (%), splitting and growth 
(a) At the most recent split 
CA, CH  38  38 
LI, PZ  20  18 
(b) At the root split 
AFR versus non-AFR  17  15 
AFR + ASI + OCE versus AME + EU  14  15 
AFR + AME + EU versus ASI + OCE  7  10 
(c) At any split 
AFR  95  91 
KA + SU  69  79 
CA + CH  63  65 
LI + PZ  57  51 
AME  48  42 
ASI + NG  33  47 
AME + EU  31  45 
CA + CH + NG  46  45 
Looking at groupings which arise at any node in the tree, the three African populations form the strongest cluster, occurring in over 90% of trees. The two Amazonian populations also cluster together frequently. The strongest cross-continent clusterings involve New Guinea with the three Asian populations and Europe with the three American populations. The latter clustering seems surprising but may be due either to a substantial gene flow into both America and Europe from north or central Asia during and/or since the last ice age or perhaps very recent gene flow direct from Europe into native American populations during the period of European colonization. (The data sets consist of aboriginal peoples as far as this can be verified, but there may nevertheless be some recent admixture.)
5.3.1. Modelling assumptions
Harding et al. (1997) inferred an ancestral haplotype for their sample by comparison with a chimpanzee sequence. Using the standard coalescent model together with the infinite sites assumption, they obtained a point estimate of 2.55 for the mutation parameter θ. Conditional on this estimate, and on an estimate of the mutation rate obtained from human–chimpanzee comparisons, they inferred a point estimate of 895 000 years for TMRCA (95% confidence interval 380 000–1 410 000 years).
Our reanalyses regard the tree as unrooted a priori and draw inferences about the root without using the chimpanzee sequence. Further, we incorporate uncertainty about μ, the per sequence and per generation mutation rate. We started with a gamma(3,10^{5}) pre-prior for μ, based on a genome-wide substitution rate estimate of 10^{−8} per site. Harding et al. (1997) observed 31 sites monomorphic in humans but varying between humans and chimpanzees. Assuming that there have been about 2.5×10^{5} generations since the human–chimpanzee split, this leads to a gamma(34,6×10^{5}) distribution for μ, which we adopted as a prior distribution for our analyses. For the effective population size N, we assume respectively a gamma(5,2.5×10^{−4}) and a gamma(3,2.5×10^{−4}) prior distribution under the standard coalescent and the coalescent-with-growth models (these are the priors used for the Y-chromosome analyses, scaled up by a factor of 4).
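The step from the pre-prior to the prior for μ has the form of a standard conjugate gamma–Poisson update, which can be checked in a few lines of Python. This is only a sketch: the function name is hypothetical, and it assumes that the 31 substitutions accumulated along both the human and the chimpanzee lineages, so that the exposure is 2 × 2.5×10^{5} generations. That assumption is not stated in the text; it is adopted here because it reproduces the quoted gamma(34,6×10^{5}) distribution.

```python
def gamma_poisson_update(shape, rate, count, exposure):
    """Conjugate update of a gamma(shape, rate) prior on a Poisson rate
    after observing `count` events over `exposure` units of time."""
    return shape + count, rate + exposure

# Pre-prior gamma(3, 1e5) for mu (per sequence, per generation).
pre_shape, pre_rate = 3, 1e5

# Assumption of this sketch: substitutions accumulate on both lineages
# since the split, giving 2 * 2.5e5 generations of exposure.
generations_since_split = 2.5e5
exposure = 2 * generations_since_split

shape, rate = gamma_poisson_update(pre_shape, pre_rate, 31, exposure)
prior_mean_mu = shape / rate  # mean of the resulting gamma distribution
print(shape, rate, prior_mean_mu)
```

Under these assumptions the prior mean of μ is about 5.7×10^{−5} per generation, in line with the prior median of 5.6 per 10^{5} generations reported in Table 7.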
5.3.2. Results
Prior and posterior quantiles, under both the standard coalescent and the coalescent-with-growth models, are shown in Table 7. Our modelling assumptions are similar to those of Harding et al. (1997), including the adoption of the infinite sites mutation model, and in many respects our results are similar. We use a generation time G=25 years, for comparison with our other analyses, whereas Harding et al. (1997) used G=20 years. After allowing for this difference, our posterior median estimates of TMRCA (1 100 000 and 1 200 000 years ago) are similar to the point estimate of Harding et al. (1997), but our 90% interval is wider than even their 95% interval, because we model uncertainty about N, μ and the root of the tree. In addition, the confidence interval of Harding et al. (1997) is symmetric about the estimate (±2 SDs), whereas our posterior distribution is skew and our interval is shifted towards higher values.
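The allowance for the difference in generation time can be made explicit. The following sketch assumes that the Harding et al. (1997) estimate scales linearly with the generation time, so that a value computed with G=20 years is multiplied by 25/20; the function name is illustrative and the linear scaling is an assumption of the sketch rather than a statement from either paper.

```python
def rescale_generation_time(years, g_from, g_to):
    """Convert a time in years computed with generation time g_from
    into the equivalent time under generation time g_to."""
    return years * g_to / g_from

# Point estimate and 95% confidence interval quoted from Harding et al. (1997).
h97_point = 895_000
h97_interval = (380_000, 1_410_000)

point_25 = rescale_generation_time(h97_point, 20, 25)
interval_25 = tuple(rescale_generation_time(y, 20, 25) for y in h97_interval)
print(point_25, interval_25)
```

Under this assumption the rescaled point estimate is about 1.1×10^{6} years, with an interval of roughly 475 000 to 1 760 000 years, which can be set against the posterior quantiles for TMRCA in Table 7.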
Table 7. Results from BATWING analyses of the H97 β-globin data set (Harding et al., 1997)†
Parameter  Prior 5%  Prior 50%  Prior 95%  Posterior 5%  Posterior 50%  Posterior 95%
(a) Standard coalescent 
N  7.9 × 10^{3}  19 × 10^{3}  37 × 10^{3}  12 × 10^{3}  21 × 10^{3}  33 × 10^{3} 
μ (per 10^{5} generations)  4.2  5.6  7.4  4.3  5.7  7.3 
θ (≡ 2Nμ)  0.84  2.1  4.3  1.4  2.4  3.7 
TMRCA (years)  240 × 10^{3}  780 × 10^{3}  2.4 × 10^{6}  690 × 10^{3}  1.2 × 10^{6}  2.1 × 10^{6} 
(b) Coalescent with growth 
Ancestral N  3.2 × 10^{3}  11 × 10^{3}  25 × 10^{3}  9.8 × 10^{3}  17 × 10^{3}  28 × 10^{3} 
N_{c}  59 × 10^{3}  1.1 × 10^{6}  100 × 10^{6}  70 × 10^{3}  560 × 10^{3}  18 × 10^{6} 
r (%)  0.088  0.42  1.2  0.23  0.66  1.5 
μ  4.2  5.6  7.4  4.3  5.7  7.4 
t_{g} (years)  7.3 × 10^{3}  28 × 10^{3}  150 × 10^{3}  4.8 × 10^{3}  14 × 10^{3}  36 × 10^{3} 
TMRCA (years)  140 × 10^{3}  480 × 10^{3}  1.6 × 10^{6}  620 × 10^{3}  1.1 × 10^{6}  2.0 × 10^{6} 
Table 8 shows part of the posterior distribution for the root node sequence, based on the human data alone. The most likely root sequence is the one assumed by Harding et al. (1997), based on the chimpanzee comparisons, indicating that such outgroup comparisons are not required to estimate ancestral sequences, although these are of course helpful if they are available. Table 9 shows posterior quantiles for the age of each mutation (which is assumed unique at each segregating site under the infinite sites model). Our results are in broad agreement with those of Harding et al. (1997) except that we report (identical) marginal posterior distributions for mutations which cannot be distinguished from the data, whereas they reported, in effect, order statistics.
Table 8. Posterior probabilities of the MRCA state for the H97 data set
Sequence  Posterior probability of being root (%), standard coalescent  Posterior probability of being root (%), coalescent with growth
TTTCCTCTGGCAT  17.5  16.5 
TTTCCTCCGGCAT  6.9  6.2 
TTTTCTCTGGCAT  6.7  6.8 
TTTCCTCTGGAAT  6.4  6.5 
TCTCCTCTGGCAT  6.3  6.7 
TCTTCTCCGGAAT  6.0  6.2 
All others  50.2  51.1 
Table 9. Posterior quantiles of the age of each mutation for the H97 data set (ages in units of 10^{3} years)
Site  H97  Posterior 5%  Posterior 50%  Posterior 95%
532  18  0.69  11  84 
2634  61  6.7  71  240 
2554  63  26  100  280 
1423  100  26  100  280 
379  84  34  110  270 
1358  310  130  300  650 
2792  530  120  330  810 
2945  450  140  350  840 
1416  460  140  350  840 
2008  390  340  800  1600 
906  510  340  800  1600 
508  620  340  800  1600 
2636  730  340  800  1600 
We also performed our analyses under a finite sites mutation model, which does not require the assumption of at most one mutation at each site. For these data, weakening the infinite sites model did not lead to any substantial changes in inferences (the results are not shown).
5.4.1. Modelling assumptions
Each of the three racial groups in the FSS mitochondrial DNA data set (Section 2.3) was analysed using the F84 mutation model for the 10 SNP sites, the SMM for the STR locus and the standard coalescent model, with the same prior distributions in each case.
The prior distributions that were adopted for the analysis are shown in Table 10. Prior distributions which are symmetric on the logarithmic scale seem appropriate for the ratios π_{A}/π_{T}, π_{C}/π_{T} and π_{G}/π_{T}. We chose prior medians that were close to the sample proportions, averaged over the 10 SNP sites studied here, in the data compiled by Handt et al. (1998), consisting of several thousand human mitochondrial DNA sequences (an updated version is available at http://www.hvrbase.de). Because of shared ancestry, the prior variances should be much higher than would be justified by sampling error in the background data. We chose variances such that the prior density was at least half the modal value within a factor of 2 either side of the median.
Table 10. Prior distributions for the mutation and demographic parameters in the mitochondrial DNA analysis†
Parameter  Distribution  2.5% quantile  50% quantile  97.5% quantile
π_{A}/π_{T}  lognormal(−2.4,0.6)  0.028  0.091  0.29 
π_{C}/π_{T}  lognormal(−0.61,0.6)  0.17  0.54  1.76 
π_{G}/π_{T}  lognormal(−0.56,0.6)  0.18  0.57  1.85 
β/α  lognormal(−4.5,2)  0.00022  0.011  0.56 
α/2.5  lognormal(−11,2)  0.033 × 10^{−5}  1.7 × 10^{−5}  84 × 10^{−5} 
μ_{SNP}   0.031 × 10^{−5}  1.6 × 10^{−5}  88 × 10^{−5} 
μ_{STR}  lognormal(−7,2)  0.18 × 10^{−4}  9.1 × 10^{−4}  460 × 10^{−4} 
N  lognormal(9,1)  1100  8100  58000 
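The quantiles in Table 10 can be checked directly from the lognormal parameters. The sketch below assumes that lognormal(m,s) denotes a distribution whose logarithm is normal with mean m and standard deviation s; this reading is an assumption, although it agrees with the tabulated quantiles. The function name and the dictionary keys are illustrative.

```python
import math
from statistics import NormalDist

def lognormal_quantile(mu, sigma, p):
    """Quantile of a lognormal distribution whose logarithm is
    normal with mean mu and standard deviation sigma."""
    z = NormalDist().inv_cdf(p)  # standard normal quantile
    return math.exp(mu + sigma * z)

# A few of the priors from Table 10, as (mu, sigma) on the log scale.
priors = {
    "pi_A/pi_T": (-2.4, 0.6),
    "beta/alpha": (-4.5, 2.0),
    "N": (9.0, 1.0),
}

for name, (mu, sigma) in priors.items():
    quantiles = [lognormal_quantile(mu, sigma, p) for p in (0.025, 0.5, 0.975)]
    print(name, ["%.2g" % q for q in quantiles])
```

The median of each prior is simply exp(m), so that, for example, the prior for N has median exp(9), or about 8100.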
The database of Handt et al. (1998) was also used to formulate a prior for the ratio of transversions to transitions at these 10 sites. A rough point estimate of this ratio is obtained by taking the most frequently observed nucleotide at each site as the ancestral state and comparing the number of haplotypes that represent a transversional change to the number that represent a transitional change. For the FSS data, this ratio is 1:127, which is considerably lower than estimates for the control region as a whole (Meyer et al., 1999), which may reflect ascertainment bias in the selection of these sites. The prior chosen for β/α has a median that is close to this point estimate but also supports the higher estimates.
The prior for μ_{SNP} is implied by the definition (5) and the priors for the five parameters α, β, π_{A}/π_{T}, π_{C}/π_{T} and π_{G}/π_{T}. The priors defined so far suggest the approximation μ_{SNP}≈α/2.5. The prior for α/2.5 was chosen to be sufficiently diffuse to cover both the (low) estimates for μ_{SNP} derived from phylogenetic studies and the (high) estimates derived from pedigree and coalescent studies (Sigurðardóttir et al., 2000).
For each of the racial groups, an MCMC-based approximation to the match probability under our model was obtained, treating as the crime scene haplotype
 (a) the haplotype that is most common in that racial group,
 (b) a haplotype that is unobserved in that group but is very similar to the commonest haplotype and
 (c) a haplotype that is unobserved and is also very dissimilar to any observed haplotype (the crime scene haplotype differs from each observed haplotype by a transversion substitution at each SNP site).
Recently, evidence has been presented for recombination in mitochondrial DNA, but the consensus seems to remain that it is not subject to recombination (Eyre-Walker and Awadalla, 2001). If this proves to be wrong, the implications for our results may not be great since recombination would presumably be rare and the sites that are studied here are all from the mitochondrial DNA control region.
5.4.2. Results
The MCMC-based match probabilities are shown under ‘MCMC’ in Table 11, along with some average properties of the genealogical tree. As expected, when the crime scene haplotype is dissimilar to any common haplotype, the genealogical tree is much taller than for a ‘similar’ haplotype. The effect on the match probability is, however, less predictable: for Caucasians, the dissimilar haplotype has a lower match probability than the similar haplotype, whereas the ordering is reversed for the other two populations. Two factors may contribute to this: the larger sample size for Caucasians and the fact that the most common Caucasian haplotype has a higher relative frequency than do the most common Asian and Afro-Caribbean haplotypes.
Table 11. Match probabilities for three mitochondrial DNA haplotypes in each of three FSS databases†
Population  Haplotype type  Haplotype description  Naïve P(match) (%)  MCMC P(match) (%)  Average tree length  Average tree height  Branch length
Caucasian, n=153  ATTTG5CGTTT  Common  25  18  9.6  1.4  0.018
Caucasian, n=153  ATTTG6CGTTT  Similar  0.65  1.2  9.6  1.4  0.035
Caucasian, n=153  CAGGT6ACGAG  Dissimilar  0.65  0.79  13  3.2  0.027
Afro-Caribbean, n=104  GTCCA4CGCTC  Common  9.6  5.0  9.4  1.6  0.016
Afro-Caribbean, n=104  GTCCA5CGCTC  Similar  0.96  0.51  9.3  1.5  0.024
Afro-Caribbean, n=104  CAGGT6ACGAG  Dissimilar  0.96  0.93  13  3.3  0.059
Asian, n=43  GTCTG5CGTTT  Common  21  16  8.5  1.9  0.059
Asian, n=43  GTCTG4CGTTT  Similar  2.3  1.7  8.3  1.8  0.091
Asian, n=43  CAGGT6ACGAG  Dissimilar  2.3  2.2  13  4.2  0.064
The relative frequency of the crime scene profile among the n+1 observed sequences provides a natural approximation to the match probability. It would be the maximum likelihood estimate of the population proportion if observed sequences were treated as independent, ignoring the genealogical structure and ascertainment (the profile of interest is the last one sampled). The number of possible mitochondrial DNA sequences is vast, and many sequences which occur in the population will not be represented in the reference sample; hence the relative frequency of the mitochondrial DNA profile of s may tend to overstate the match probability. In typical forensic applications, such an overstatement would favour defendants and may be regarded as less serious than an error which tends to disadvantage defendants. It is therefore of interest to investigate situations in which the relative frequency understates the match probability; if these are sufficiently rare then relative frequency approximation may suffice in place of a more carefully calculated match probability. Table 11 suggests that the naïve match probability will usually, but not always, be conservative.
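The relative frequency approximation described above is simple to compute. The sketch below follows the description in the text: the crime scene profile is appended to a reference sample of size n, and the match probability is approximated by the relative frequency of that profile among the n+1 sequences. The function name and the `times_added` parameter are hypothetical; setting `times_added=2` corresponds to adding the crime scene profile twice, a more conservative variant.

```python
def relative_frequency_match_probability(database_count, n, times_added=1):
    """Relative frequency of the crime scene profile after it has been
    added `times_added` times to a reference sample of size n in which
    it was observed `database_count` times."""
    if n < 0 or not 0 <= database_count <= n:
        raise ValueError("need 0 <= database_count <= n")
    if times_added < 1:
        raise ValueError("the crime scene profile must be added at least once")
    return (database_count + times_added) / (n + times_added)

# A haplotype unobserved in a reference sample of 153 sequences.
unobserved = relative_frequency_match_probability(0, 153)
print(f"{100 * unobserved:.2f}%")
```

Because the crime scene profile is always counted, the approximation is never zero, even for a haplotype that is absent from the reference sample.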
The results presented here are based on the standard coalescent model. We have repeated the analyses using the coalescent-with-growth model and found no appreciable differences in the match probabilities (the results are not shown).
6. Discussion
Changes in population size and structure correspond to rescaling time in the coalescent tree. For example, relative to the standard coalescent model, the times between coalescence events are greater when the population size is large, and vice versa. Therefore the pattern of coalescence events provides indirect evidence about past demographic parameters. However, coalescences are not observed, but are inferred from the data. Moreover, these inferences are usually imprecise: if the mutation rate is high then all trace of the earliest coalescences may be eliminated by subsequent mutations, whereas if the mutation rate is low there will be insufficient mutations to ‘document’ many of the coalescences.
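The rescaling of time by population size can be illustrated with the expected waiting times between coalescence events under the standard coalescent. The sketch assumes a constant population of N chromosomes with time measured in generations, so that while k lineages remain the expected time to the next coalescence is N divided by the number of pairs of lineages, k(k−1)/2. The function names are illustrative.

```python
from math import comb

def expected_waiting_times(n, N):
    """Expected time (in generations) spent with k lineages,
    for k = n, n-1, ..., 2, in a constant population of N chromosomes."""
    return {k: N / comb(k, 2) for k in range(n, 1, -1)}

def expected_tmrca(n, N):
    """Expected time to the most recent common ancestor of n sequences."""
    return sum(expected_waiting_times(n, N).values())

# Compare two population sizes for a sample of 10 sequences.
for N in (5_000, 20_000):
    print(N, expected_tmrca(10, N))
```

Every expected waiting time is proportional to N, so a fourfold larger population gives a fourfold larger expected TMRCA; the sum simplifies to 2N(1−1/n).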
More generally, genealogical models such as coalescent models describe a complex stochastic process evolving through time, whereas most data sets, including those examined in the present paper, provide information at only a single time point. For some demographic parameters, more precise inferences can be made by using multiple unlinked loci, in which case every locus gives independent information (Beaumont, 1999). For genealogical parameters such as TMRCA, and for all inferences based on Y-chromosome and mitochondrial DNA data (which have their own unique male- and female-mediated demographies), this option is not available and only imprecise inferences can reasonably be expected on the basis of genetic information alone, even by using the most powerful methodology. However, genetic data can profitably be combined with information from other sources, such as palaeontology, archaeology and historical records.
The imprecision of inferences is reflected in the posterior distributions shown in Fig. 5, some of which are very diffuse even though the modelling assumptions hold exactly. Despite the poor prospects for precise answers, questions of human population history are of such intrinsic interest, and potential usefulness in understanding other aspects of human genetics, that they seem worth pursuing.
Although we cannot accurately estimate growth rates or splitting times from the data that were considered here, some inferences can be made with reasonable confidence. In particular, the low estimates of the TMRCA for Y-chromosome data seem reasonably robust to modelling assumptions. Our results are further supported by Thomson et al. (2000), who analysed different Y-chromosome data and employed different modelling assumptions and a different method of analysis, as well as by Pritchard et al. (1999), who analysed a larger data set of 445 chromosomes, typed at eight STRs, and used similar modelling assumptions, but employed a rejection method for approximating the posterior distribution, based on a vector of non-sufficient summary statistics.
In contrast with the results of the Y-chromosome analyses, the TMRCA estimates for the β-globin data are high relative to the prior distribution. In fact, the β-globin TMRCA is estimated to be very roughly 30-fold higher than the Y-TMRCA, whereas the simplest models would predict a fourfold difference. This may be due to homogenizing selection acting on the Y-chromosome (an advantageous Y-haplotype may have ‘swept’ through the population relatively recently), whereas balancing selection may be plausible for the β-globin locus, which would tend to increase TMRCA compared with expectations under a neutral model. However, there is substantial between-locus variability in TMRCA even under a neutral model, and so selection may not be needed to explain the observed disparity between Y-chromosome and β-globin results.
These algorithms do not exploit auxiliary variables at the internal nodes of the tree. They apply the MCMC algorithm only to the genealogical tree: the demographic and mutation parameters are fixed in each run, and a likelihood surface is inferred by using importance sampling reweighting of a number of runs with different driving values. Stephens and Donnelly (2000) have pointed out problems with this approach: the variance of the importance weights can become large for points on the likelihood surface that are not close to the driving values.
GENETREE approximates likelihood surfaces for the mutation rate under the infinite sites mutation model, an exponential growth rate and a matrix of migration parameters under the island model. The methodology underlying GENETREE can be viewed as a version of importance sampling; the run times are often very long and Stephens and Donnelly (2000) have discussed methods for choosing more efficient importance sampling weights.
For forensic mitochondrial DNA match probabilities, we have found that the current practice of reporting sample relative frequencies seems to be conservative in most but not all cases, provided that the crime scene profile is included in the calculation (because the crime scene sequence is the last sampled, this implies that a zero frequency can never occur). Tully et al. (2001) discussed the more conservative approach of using the relative frequency after the crime scene profile has been added twice to the database, as well as the upper 95% confidence limit, which is even more conservative. For Y-chromosome match probabilities, Roewer et al. (2000) employed a posterior mean with respect to a prior beta distribution, with parameters determined by the observed haplotype diversity, which leads to less conservative values. None of the methods, including our own, includes modelling regional variation on a scale that is finer than that for which databases are available, and this may be important for both mitochondrial DNA and Y-chromosome match probabilities.
We believe that the methodology presented here represents an important advance towards the goal of fully likelihood-based methods for analysing DNA sequence data: we can obtain simultaneous inferences about a wider range of demographic, evolutionary and genealogical parameters, together with more realistic assessments of uncertainty, and for a wider range of DNA data types, than has hitherto been feasible. The present work, together with other recent developments, brings closer the goal of quantitative model criticism, and model comparisons, for detailed statistical models of the genetic history of humans and other species. Under our most complex model, we can currently analyse in reasonable time (say, a few days on a modest desktop workstation) sample sizes of over 200 chromosomes with haplotypes of up to about 10 STR markers (additional UEPs typically reduce the computation time and so are effectively unlimited). Further developments of algorithms and faster computers will continue to increase the feasible sample sizes.