Haploid chromosomes in molecular ecology: lessons from the human Y


  • Matthew E. Hurles,

    1. Department of Genetics, University of Leicester, University Road, Leicester LE1 7RH, UK
    • *

      Present address: McDonald Institute for Archaeological Research, University of Cambridge, Downing Street, Cambridge, CB2 3ER, UK

  • Mark A. Jobling

    Corresponding author
    1. Department of Genetics, University of Leicester, University Road, Leicester LE1 7RH, UK

Mark A. Jobling. Fax: (+44) 116 252 3378; E-mail: maj4@leicester.ac.uk


We review the potential use of haploid chromosomes in molecular ecology, using recent work on the human Y chromosome as a paradigm. Chromosomal sex-determination systems, and hence constitutively haploid chromosomes, which escape from recombination over much of their length, have evolved multiple times in the animal kingdom. In mammals, where males are the heterogametic sex, the patrilineal Y chromosome represents a paternal counterpart to mitochondrial DNA. Work on the human Y chromosome has shown it to contain the same range of polymorphic markers as the rest of the nuclear genome and these have rendered it the most informative haplotypic system in the human genome. Examples from research on the human Y chromosome are used to illustrate the common interests of anthropologists and ecologists in investigating the genetic impact of sex-specific behaviours and dispersals, as well as patterns of global diversity. We present some methodologies for extracting information from these uniquely informative yet under-utilized loci.


Sex-specific differences in ecology, behaviour and migration are abundant in the animal kingdom (Avise 1993). Mitochondrial DNA (mtDNA) has been the most widely used marker for molecular ecological investigations, but gives information only about females (Avise et al. 1987; Bermingham & Moritz 1998). In most species, information about the male has been deduced from the differences between patterns of diversity of mtDNA and those of biparentally inherited markers (Fitz-Simmons et al. 1997; Andersen et al. 1998; Lum et al. 1998; Lyrholm et al. 1999; Nyakaana & Arctander 1999; Wilmer et al. 1999). However, the fact that sex is chromosomally determined in many organisms provides an independent source of sex-specific information. The last 10 years have witnessed an explosion in markers for human Y chromosome diversity, and the potential of these is now being realized in studies of human origins (Hammer et al. 1998), population histories (e.g. Zerjal et al. 1997; Karafet et al. 1999), sex-biased admixture (Hurles et al. 1998), and male–female differences in migration (Seielstad et al. 1998) and other behaviours (Poloni et al. 1997). The purpose of this review is to persuade those who study other organisms of the virtues of analysing the diversity of haploid chromosomes, using the human Y as a paradigm. We summarize recent developments in human Y-chromosomal diversity studies and give specific examples of anthropological research that mirror issues in ecology.

Genetics is a relative newcomer to the field of human evolutionary studies, and only one of a number of tools for studying human history; many other ‘historical’ records exist. Anthropologists, archaeologists, palaeontologists and linguists have their own records of the human past. Lack of congruence between these different records need not necessarily be taken as a failure of one compared to another. It may be the signal of something altogether more interesting, and emphasizes the complementary nature of disciplines investigating similar questions. Genetic and linguistic records as documented in extant diversity represent the outcome of processes documented contemporaneously in the archaeological record (Fix 1999). In this sense the position of genetics among these disciplines is analogous to that of molecular ecology among other ecological methods (e.g. tagging and recapture).

The human Y chromosome

In humans, as in other mammals, the male is the heterogametic sex, possessing a constitutively haploid Y chromosome which causes testis differentiation and so determines maleness in a dominant fashion through the action of a single gene, SRY (Sex determining region, Y) (Sinclair et al. 1990). The Y chromosome shares a number of regions of homology with the X, and undergoes recombination with it in the two pseudoautosomal regions, at the tips of the long and short arms (Fig. 1). Between these lie about 60 Mb (megabases) of DNA, including the SRY gene, in which recombination does not occur. It is this region that concerns us here.

Figure 1.

Ideogram of the Y chromosome, showing the locations of the pseudoautosomal regions (PAR), the testis determining gene, SRY, and the long arm heterochromatin.

This escape from the scrambling effect of recombination has profound consequences for the population genetics, mutation processes, and genes of this unique chromosome. Although we concentrate in this review on the first of these, in reality the three are inseparable: the use of molecular markers in population studies has to include a consideration of mutation dynamics, and the potential effects of selection via Y-chromosomal genes (Jobling & Tyler-Smith 2000).

Markers for human Y-chromosomal diversity

Paternal lineage analysis through studies of Y chromosome polymorphism is not a new idea: the first polymorphisms on the human Y were identified 15 years ago (Casanova et al. 1985), although subsequent progress in identifying more markers was slow. This is not the place to present a history of this process, and therefore we only discuss currently useful markers.

The human Y chromosome has a fundamental advantage over haploid chromosomes in other organisms in that a great deal of money is being invested in determining its entire DNA sequence. While this will have an undeniable effect on the future ease of marker identification, the majority of markers currently in use were found using ‘old-fashioned’ molecular biology. For examples of how to exploit the variation on haploid chromosomes without a whole chromosome sequencing project see Box 1.

Table 1. 
Box 1 Finding markers on haploid chromosomes without a genome project
Very few species of interest to ecologists have associated genome projects. Haploid-chromosome-specific libraries are equally scarce, although their construction has been facilitated by recent advances in flow cytometry and microdissection. Consequently, if the diversity of haploid chromosomes is to be exploited, strategies for efficiently identifying markers need to be developed. These studies are hampered by the rapid divergence of haploid chromosomes, which results in diminished conservation of sequence similarity between species. However, many species exhibit greater sequence diversity than Homo sapiens (Gagneux et al. 1999) with its shallow time-depth, and thus correspondingly shorter lengths of sequence need be assayed to produce an informative phylogeny.
Two main approaches to generating haploid chromosomal markers have been used. The first is to exploit the limited homology available in the few highly conserved genes to characterize sequence variants, for example the SRY and ZFY genes on the mammalian Y chromosome (Boissinot & Boursot 1997), and CHD-W (Griffiths et al. 1998) and ASW (O’Neill et al. 2000) genes on the W chromosome of nonratite birds. By September 2000 SRY sequences from 86 mammalian species were available in the EMBL database, many from the highly conserved HMG box region, though the distribution across taxonomic units is by no means uniform. Whilst cross-species homologies can be used to identify haploid chromosome-specific clones from genomic libraries, construction of such libraries is laborious; it is therefore of interest to develop PCR-based strategies for obtaining haploid chromosomal markers. To this end, degenerate PCR primers can be used to characterize homologous sequences more accurately before using inverse PCR (Raponi et al. 2000) methods to obtain longer sequences from the often short stretch of known sequence.
The second approach uses methods that identify markers positioned randomly throughout a genome, to subsequently detect haploid chromosome-specific markers through sex linkage. Amplified fragment length polymorphism (AFLP) (Griffiths & Orr 1999) and random amplified polymorphic DNA (RAPD) (Bello & Sanchez 1999) methods have been used to this end and can subsequently be used to develop sequence characterized amplified regions (SCARs), which can be screened for variants. A higher throughput method is that of reduced representation shotgun (RRS) sequencing (Altshuler et al. 2000), which can be allied to flow-sorting of chromosomes to get chromosomal-specific markers (Mullikin et al. 2000).
Once haploid-chromosome-specific sequence has been obtained, markers can be developed by screening a diverse subset of samples before developing high-throughput PCR assays for the polymorphisms discovered [e.g. restriction fragment length polymorphism (PCR–RFLP), allele-specific amplification (PCR–ASA)]. A number of mutation detection techniques developed in recent years could expedite this process enormously, for example denaturing high performance liquid chromatography (DHPLC) (Underhill et al. 1997) and DNA chips (Cargill et al. 1999), although access to these technologies remains limited.
It is worth noting that while multiple methods exist to develop sequence-based markers, the development of microsatellite markers is less easy to envisage in the absence of a chromosome-specific library. Novel screening strategies are needed.

In contrast to mtDNA, the Y chromosome is expected to have a mutation rate similar to that of other nuclear loci, as it is subject to the same repair processes (although it never passes through female mitosis or meiosis). In addition, it should contain the same types of polymorphic loci as are found on the other chromosomes. This has indeed been found to be the case (Jobling & Tyler-Smith 1995; de Knijff et al. 1997; Underhill et al. 1997).

Y-chromosomal markers are best classified on the basis of their mutation rate. This distinguishes between so-called ‘unique’ mutation events that can be considered to have occurred once in human prehistory and those that are likely to be recurrent (Jobling & Tyler-Smith 1995). Into the former class fall single nucleotide polymorphisms (SNPs) and certain insertion–deletion events, for example the insertion of an Alu element (Hammer 1994). Multiallelic markers such as microsatellites and minisatellites have much higher mutation rates (e.g. 10⁻³ per locus per generation or greater), and thus fall into the latter category.

Unique event markers

Early searches for restriction fragment length polymorphisms (RFLP) on the Y chromosome revealed little diversity (Jakubiczka et al. 1990; Malaspina et al. 1990). More recently studies have focused on sequencing the same region in a number of individuals from diverse populations, and these efforts have met with greater success: there are now in excess of 200 unique event markers available (Underhill et al. 1997, 2000; Hammer et al. 2000; Jobling & Tyler-Smith 2000; Shen et al. 2000), most of which are SNPs. Nevertheless, the general picture remains one of reduced nucleotide diversity compared to non-Y-chromosomal nuclear sequences. In one recent large-scale study, nucleotide diversity within both coding and noncoding regions of three Y-chromosomal genes encompassing 64 120 bp, was found to be about five times lower than that in the corresponding regions of autosomal genes (Shen et al. 2000).

This discrepancy can be explained in a number of ways. One factor contributing to the lack of Y-chromosomal diversity is that, for a species with a 1 : 1 sex ratio, the effective population size of the Y chromosome is a quarter that of any autosome and one third that of the X chromosome. Neutral theory predicts that Y diversity should be correspondingly reduced, purely as a function of this reduced population size (Nei 1987). A further explanation is that mating practices have in the past resulted in certain males producing many more sons than others. While these two explanations are widely accepted, the possibility that natural selection has contributed to reduced Y-chromosomal diversity in humans is more controversial. Several studies have failed to reject the null hypothesis of neutrality (Hammer 1995; Goldstein et al. 1996), suggesting that there has not been a recent global selective sweep on the Y chromosome. However, one study derives an unexpectedly young age for the most recent common ancestor (MRCA) of modern Y chromosomes (Thomson et al. 2000), a result explained either by the strong action of selection, or by the assumptions inherent within the population genetic model used. Box 2 describes some of the genes and phenotypes associated with the human Y, on which selection may be acting.
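The quarter and one-third ratios quoted above follow from counting chromosome copies in a population with equal numbers of breeding males and females. The sketch below makes that count explicit. The function name and the population of 500 males and 500 females are illustrative assumptions, and the count stands in for effective population size only under idealized conditions (for example, equal variance in reproductive success in the two sexes).

```python
from fractions import Fraction

def chromosome_copies(n_males, n_females):
    """Count copies of each chromosome type in a population with XY males."""
    return {
        "autosome": 2 * (n_males + n_females),  # two copies per individual
        "X": n_males + 2 * n_females,           # one per male, two per female
        "Y": n_males,                           # one per male, none in females
    }

# Hypothetical population with a 1 : 1 sex ratio
copies = chromosome_copies(n_males=500, n_females=500)
y_vs_autosome = Fraction(copies["Y"], copies["autosome"])
y_vs_x = Fraction(copies["Y"], copies["X"])
print(y_vs_autosome, y_vs_x)  # 1/4 1/3
```

Changing the two arguments shows how a skewed breeding sex ratio moves the Y further from, or closer to, these idealized ratios.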

Table 2. 
Box 2 Selection on the Y chromosome
Studies of human Y-chromosomal diversity implicitly assume that the variation being studied is selectively neutral. This may indeed be true of the SNPs, microsatellites, and other markers themselves; however, the absence of recombination means that all of these markers are permanently linked to genes which may themselves be the targets of natural selection. Positive selection on a Y-chromosomal gene could lead to the fixation, in a ‘selective sweep’ of a particular haplotype: this is the hitch-hiking effect, in which the ‘neutral’ markers are carried as passengers with the selected locus. If selective sweeps have occurred in human history, they would result in an unexpectedly recent common ancestor for all Y chromosomes (Whitfield et al. 1995). Under neutral expectation and given its likely long-term effective population size of 5000 (Hammer 1995), we would expect to find a common ancestor for all extant Y chromosomes, in the absence of selection, around 200 000 years ago. Most estimates based upon unique-event polymorphisms have been consistent with this expectation (Underhill et al. 1997; Hammer et al. 1998); although a more recent estimate (Thomson et al. 2000) gives a younger date, this may be a reflection of assumptions about demography, rather than selection.
The potential for negative selection on the Y chromosome is obvious given its pivotal role in male reproductive fitness: the Y bears not only the sex-determining locus, but also several genes essential in spermatogenesis (Vogt et al. 1996). Studies are underway to determine whether particular Y-chromosomal haplotypes predispose to infertility (Jobling et al. 1998b; Kuroki et al. 1999; Previderé et al. 1999) through comparison of affected and control populations. These studies provide conflicting evidence, and need careful interpretation if selective influences are not to be confused with effects due simply to population structure.
A number of other diverse phenotypes have been ascribed to the human Y chromosome, including stature (Salo et al. 1995), tooth-size (Alvesalo & de la Chapelle 1981), and cerebral asymmetry (Crow 1999). In primate evolution all of these could have been subject to sexual selection, though it is unclear to what extent this will have been important in the evolution of Homo sapiens.
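The neutral expectation of about 200 000 years quoted in this Box can be approximated with the standard coalescent result for a haploid locus, in which the expected time to the MRCA of a sample of n chromosomes is 2Ne(1 − 1/n) generations. This is a sketch under stated assumptions: the choice of formula, the 20-year generation time and the sample size are ours, since the text gives only the effective population size and the resulting age.

```python
def expected_tmrca_generations(n_e, sample_size):
    """Expected TMRCA (in generations) of a sample at a haploid locus under the
    standard neutral coalescent: 2 * Ne * (1 - 1/n)."""
    return 2 * n_e * (1 - 1 / sample_size)

N_E_Y = 5000            # long-term effective size quoted in the Box (Hammer 1995)
GENERATION_YEARS = 20   # assumed here; not stated alongside the 200 000 figure
SAMPLE_SIZE = 1000      # hypothetical; the result approaches 2 * Ne as n grows

generations = expected_tmrca_generations(N_E_Y, SAMPLE_SIZE)
years = generations * GENERATION_YEARS
print(round(years))  # 199800
```

A longer assumed generation time scales the answer proportionally, which is one reason published ages carry wide confidence intervals.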

Binary polymorphisms of unique origin can be combined into monophyletic compound haplotypes, called haplogroups, which are related to one another phylogenetically by a single most parsimonious tree (Fig. 2). This perfect phylogeny can be constructed by hand from a character-state table. Ancestral state can usually be determined by examining the homologous region in an appropriate outgroup such as the chimpanzee, and thus the tree can be rooted. The coalescence time to the MRCA of extant human Y chromosomes has been estimated by comparing human sequence diversity to that seen between humans and apes and using estimates of the date of the human-ape divergence to calibrate the measurement. Estimates vary between 50 000 and 200 000 years ago, with most lying around 150 000 years ago; all have wide confidence intervals (Underhill et al. 1997; Hammer et al. 1998; Thomson et al. 2000).
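The hand construction from a character-state table can be sketched in a few lines. With known ancestral states, a set of binary markers fits a single rooted tree only if, for every pair of markers, the sets of haplogroups carrying the derived allele are nested or disjoint. The marker names (M1 to M3), haplogroup names (A to D) and function name below are hypothetical, and the sketch assumes every marker is derived in at least one haplogroup.

```python
def build_perfect_phylogeny(states):
    """states maps haplogroup -> {marker: 0 (ancestral) or 1 (derived)}.
    Returns {marker: parent marker, or 'root'}. Raises ValueError if two
    markers cannot sit on one tree (i.e. at least one must be recurrent)."""
    markers = sorted({m for row in states.values() for m in row})
    carriers = {m: frozenset(h for h, row in states.items() if row.get(m) == 1)
                for m in markers}
    # Compatibility check: carrier sets must be nested or disjoint.
    for i, a in enumerate(markers):
        for b in markers[i + 1:]:
            ca, cb = carriers[a], carriers[b]
            if ca & cb and not (ca <= cb or cb <= ca):
                raise ValueError(f"markers {a} and {b} conflict")
    # A marker's parent is the smallest carrier set that strictly contains its own.
    # Markers with identical carrier sets share a branch and get the same parent.
    parents = {}
    for m in markers:
        enclosing = [p for p in markers if carriers[m] < carriers[p]]
        parents[m] = min(enclosing, key=lambda p: len(carriers[p]), default="root")
    return parents

# Hypothetical character-state table
states = {
    "A": {"M1": 0, "M2": 0, "M3": 0},
    "B": {"M1": 1, "M2": 0, "M3": 0},
    "C": {"M1": 1, "M2": 1, "M3": 0},
    "D": {"M1": 0, "M2": 0, "M3": 1},
}
parents = build_perfect_phylogeny(states)
print(parents)  # {'M1': 'root', 'M2': 'M1', 'M3': 'root'}
```

A raised ValueError is the computational counterpart of finding, by eye, a haplogroup that forces a reticulation into the tree.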

Figure 2.

An unrooted maximum parsimony Y chromosome tree constructed from unique-event, binary polymorphisms. This tree (Jobling & Tyler-Smith 2000) is unrooted due to uncertainty about ancestral states and uses 24 polymorphisms to define 26 compound haplotypes (haplogroups) represented by circles. Arrows point to derived states where this information is known. Mutations are named where they can be typed by PCR. Several other trees, constructed using mostly different polymorphisms, are currently in use (Underhill et al. 1997, 2000; Hammer et al. 2000).

Recurrent event markers

Unique-event polymorphisms on the Y chromosome are often relatively population-specific (e.g. Seielstad et al. 1994; Hammer et al. 1997). Consequently the necessary use of only a small number of individuals when searching for variation can result in an ascertainment bias. Useful measures of diversity require markers that are polymorphic in all populations: microsatellites, tandem arrays of 1–6 bp repeat units that are highly variable in allele length, have this property. There are at least 34 known microsatellite loci on the Y chromosome (Kayser et al. 1997; White et al. 1999; Ayub et al. 2000). Of these, seven have been widely used for both evolutionary and forensic purposes by virtue of their high levels of diversity and unambiguous allele designation (Roewer et al. 1996; Kayser et al. 1997). Microsatellites on the Y chromosome show diversity roughly equivalent to that of autosomal loci (Roewer et al. 1992; Goldstein et al. 1996), and correspondingly similar mutation rates; ascertained through pedigree analysis, the per-locus mutation rate for eight tetranucleotide repeat loci is 2.80 × 10⁻³ per generation (95% confidence intervals 1.72–4.27 × 10⁻³) (Kayser et al. 2000).

As with unique-event polymorphisms, alleles at multiple microsatellite loci on the Y chromosome can be combined into compound haplotypes. Unlike unique-event markers, however, these haplotypes cannot be used to construct a single most parsimonious tree (Roewer et al. 1996; Forster et al. 2000). Each haplogroup, in being a monophyletic lineage, is founded by a single male with, by definition, zero diversity at the multiallelic loci. Thus, recurrent polymorphisms, particularly slowly mutating ones, retain some phylogenetic information and their compound haplotypes show some statistical correspondence with the tree of haplogroups (Bosch et al. 1999; Forster et al. 2000).

As well as microsatellites, the human Y chromosome bears a number of other highly polymorphic systems, including a single hypervariable minisatellite (Jobling et al. 1998a), which can be assayed using minisatellite variant repeat (MVR) polymerase chain reaction (PCR) (Jeffreys et al. 1991). These will not be discussed further here.

How unique is unique?

The simple classification of markers as ‘unique’ and ‘recurrent’ is useful, but not definitive. First, some markers with very low mutation rates are nonetheless found to be recurrent. One of the earliest markers to be discovered (Whitfield et al. 1995), the A to G transition at SRY-1532, has reverted, such that A alleles are found on both ancestral and derived haplotypes (haplogroups 7 and 3, respectively; see Fig. 2). A second recurrent marker has been described (Shen et al. 2000), but such markers remain in a very small minority. It should be noted that no SNP on the Y chromosome can be considered to be ‘unique’ in an absolute sense, because even though the mutation rate is very low (about 2 × 10⁻⁸ per site per generation; Thomson et al. 2000), the very large number of extant human Y chromosomes outweighs this. However, given the relatively small sample sizes normally used in human population studies such recurrences will not often be observed. It is reassuring that they can be easily recognized when many markers are available, and a single most parsimonious tree can still be constructed (Hammer et al. 1998). Thus the label ‘unique’ is a relative term, as it contains a consideration of the extent of sampling of the entire intraspecific diversity.
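The argument that population size outweighs a low mutation rate can be made concrete with a rough calculation. The figure of three billion living males is our own round assumption, not a number from the text, and the sketch simplifies by letting each male transmit one Y chromosome and by counting any change at the site rather than one specific base substitution.

```python
MU_PER_SITE = 2e-8   # per site per generation (Thomson et al. 2000, as quoted above)
EXTANT_Y = 3e9       # assumed round figure for living males; not from the source

# Expected number of new mutations at one given site across all Y chromosomes
# transmitted in a single generation, if every male fathered exactly one son.
expected_new_mutations = MU_PER_SITE * EXTANT_Y
print(round(expected_new_mutations))  # 60
```

On these assumptions every site mutates somewhere in the species each generation, yet a survey of a few hundred or thousand chromosomes will rarely include one of those events.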

Second, ‘unique’ mutational events can sometimes be distinguished within the evolution of a multiallelic locus. For example the microsatellite DYS390 contains short alleles that result from the deletion of a large number of repeats (Forster et al. 1998). Such deletions assayed at the sequence level define monophyletic lineages. Furthermore, when the time-depth under investigation is sufficiently shallow (e.g. within a human pedigree containing only a few generations), specific alleles at even the fastest mutating loci can usefully define sublineages (Foster et al. 1999).

The genealogical approach

The wide range of independent polymorphic systems having different mutational properties make the Y chromosome a powerful tool for investigating anthropological questions on different time-scales. A combinatorial approach combining both unique and recurrent mutations allows optimum discrimination of both individuals and populations (Jobling & Tyler-Smith 1995; Mitchell & Hammer 1996; Santos & Tyler-Smith 1996). These markers are best combined hierarchically, defining lineages with unique event markers before investigating intralineage diversity with faster mutating recurrent markers (de Knijff et al. 1997). This kind of approach has been termed ‘genealogical’ (Richards et al. 1997).

Many recurrent mutations observed in a phylogeny of microsatellite haplotypes within a population can be eliminated, if haplotype networks are instead constructed within individual haplogroups defined by unique mutations, as illustrated in Fig. 3. This is due to the fact that all chromosomes within a haplogroup are derived from a common ancestral chromosome that is likely to be far younger than the ancestral chromosome of a geographically defined population (de Knijff et al. 1997). A recent study has confirmed this a priori expectation by showing that microsatellite haplotypes are far better structured by lineage than by population (Bosch et al. 1999). The amount of diversity observed within a haplogroup defined by unique mutations, assayed using recurrent mutations with much higher mutation rates, such as microsatellites or minisatellites, gives information on the demographic history of that haplogroup (de Knijff et al. 1997).

Figure 3.

Schematic diagram illustrating the minimizing of confusion between identity by state (IBS) and identity by descent (IBD) through the genealogical approach. The evolution of a single microsatellite locus is followed through three different lineages. Numbers indicate the length of the microsatellite allele in repeat units. Networks under each lineage are constructed from the extant diversity at the single microsatellite locus. Circles represent individual microsatellite alleles the length of which is written inside the circle. Circle area is proportional to the observed frequency of each allele. Note that the network of the young lineage faithfully recapitulates the true evolutionary connections, whereas the networks of the older lineages confuse IBD and IBS to a limited degree, but not as much as the lowest network which represents the sum of the two older networks, effectively ignoring the existence of the marker which distinguishes between the two lineages.
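The hierarchical logic of the genealogical approach, lineages first and microsatellite diversity second, can be sketched by partitioning samples by haplogroup before summarizing allele lengths. The data, the lineage labels and the use of allele-length variance as the summary are illustrative assumptions, chosen only to show that pooling lineages inflates apparent diversity.

```python
from collections import defaultdict
from statistics import pvariance

def allele_variance_by_haplogroup(samples):
    """samples: iterable of (haplogroup, repeat_count) for one microsatellite locus.
    Returns {haplogroup: population variance of repeat counts}."""
    by_group = defaultdict(list)
    for haplogroup, repeats in samples:
        by_group[haplogroup].append(repeats)
    return {hg: pvariance(alleles) for hg, alleles in by_group.items()}

# Hypothetical data: a young lineage and an older lineage typed at one locus
samples = [("young", 10), ("young", 10), ("young", 11), ("young", 10),
           ("old", 12), ("old", 14), ("old", 13), ("old", 15)]

within = allele_variance_by_haplogroup(samples)
pooled = pvariance([repeats for _, repeats in samples])
print(within, pooled)
```

In this toy data set the pooled variance exceeds the variance within either lineage, because the difference between lineage means is counted as if it were diversity accumulated within a single lineage.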

The advantages of a genealogical approach depend upon the existence of suitable polymorphic unique event markers within the population of interest. Some studies focusing on populations for which few markers are available have attempted to use multiallelic haplotypes to define lineages. Using individual haplotypes for this purpose is not meaningful, since their diversity is too high; many are private to particular populations which makes population comparisons impossible. Consequently a number of recent studies have defined sublineages on the basis of a grouping of related compound microsatellite haplotypes (e.g. Malaspina et al. 1998). At present these groupings are performed qualitatively, by inspecting haplotypes within a lineage defined by unique event markers and defining subsets of haplotypes that seem more closely related to one another than to others outside the grouping.

Whilst attempting to define monophyletic lineages using multiallelic markers can render subsequent age estimates meaningless if polyphyletic chromosomes are clumped together, the general application of such markers to lineage definition is not without merit, and, because variation is detected in all populations, goes some way to counteracting the ascertainment bias present in many collections of unique-event markers. Many haplogroups show substantial internal substructuring that can subsequently be shown to precisely recapitulate monophyletic sublineages defined by novel unique event markers (MEH, unpublished observations). What is needed is an unbiased method for defining lineages within such networks.

The genealogical approach at work

As we have previously noted, the construction of a most parsimonious tree from unique event markers is trivial. Subsequent analysis of multiallelic diversity within lineages has been performed with two aims in mind — first, to provide a graphical display of the apparent diversity in an attempt to reveal any substructure, and second to quantify the diversity within a lineage. A number of different measures of diversity have been used to relate intralineage diversity to the age of a lineage.

Lineage substructure

Diversity is displayed graphically so as to tease patterns out of the data that may not be obvious on simple inspection. The most popular methods seek to identify evolutionarily important links between haplotypes and display them in networks or trees. In principle, multivariate statistical methods can summarize such multidimensional data with minimal loss of information but are rarely used to display haplotypic diversity.

The likely mechanism of microsatellite mutation (single-step slippage) results in recurrent mutation being common (Ciminelli et al. 1995; Cooper et al. 1996), and in the existence of very many equally parsimonious trees (Roewer et al. 1996). These are of limited phylogenetic use because linkages due to recurrent mutation (identity by state, IBS) cannot be distinguished from those due to single mutational events on ancestral haplotypes (identity by descent, IBD). The likelihood that any one of the set of possible trees represents the real evolutionary relationship of haplotypes is very low. Nonetheless, unique trees are still presented in many studies (Bergen et al. 1999; Ruiz-Linares et al. 1999). Much attention has focused on the use of networks which include reticulations to represent a set of equally parsimonious trees. Many analytical applications do not require a single phylogeny but rather summaries that encompass multiple trees. Often such summaries can be highly supported by the data despite weak support for individual trees. A number of different algorithms have been applied to network construction, varying in complexity from the computationally intense to those that can be done by hand. Minimum spanning networks are probably the simplest of these methods, though their failure to include unobserved ancestral haplotypes renders trees within them evolutionarily implausible (Bandelt et al. 1995).
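As an illustration of the simplest of these methods, the sketch below computes one minimum spanning tree over observed haplotypes using Prim's algorithm, with distance measured as the total number of single repeat-unit steps. A minimum spanning network is the union of all such equally short trees; this sketch returns only one of them. The haplotypes and function names are hypothetical, and this is not the algorithm of any published network program.

```python
def step_distance(h1, h2):
    """Number of single repeat-unit steps separating two microsatellite haplotypes."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

def minimum_spanning_tree(haplotypes):
    """Prim's algorithm. Returns a list of (haplotype, haplotype, distance) links."""
    nodes = list(dict.fromkeys(haplotypes))  # unique haplotypes, original order
    in_tree = {nodes[0]}
    links = []
    while len(in_tree) < len(nodes):
        # Pick the shortest link from the growing tree to a haplotype outside it.
        a, b = min(((u, v) for u in in_tree for v in nodes if v not in in_tree),
                   key=lambda pair: step_distance(*pair))
        links.append((a, b, step_distance(a, b)))
        in_tree.add(b)
    return links

# Hypothetical three-locus haplotypes (repeat counts)
haplotypes = [(14, 12, 23), (14, 13, 23), (15, 13, 23), (14, 12, 25)]
links = minimum_spanning_tree(haplotypes)
total_steps = sum(d for _, _, d in links)
print(links, total_steps)
```

The link between (14, 12, 23) and (14, 12, 25) spans two steps because the intermediate haplotype (14, 12, 24) was not sampled, which shows in miniature why the omission of unobserved ancestral haplotypes limits the evolutionary plausibility of such trees.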

Recently two new methods, namely reduced median (RM) and median-joining (MJ) networks, have been developed. It is claimed that these methods can produce networks containing all most parsimonious trees within a given data set (Bandelt et al. 1995, 1999). Performed by a freely available computer program, these analyses also provide a number of criteria that can be invoked to reduce the number of reticulations within the network, thus making it more tree-like. A further refinement weights loci according to their apparent mutation rate (Forster et al. 2000; Helgason et al. 2000; Kayser et al. 2000), resulting in additional network reduction. Both network methods have been used in a number of mtDNA studies to cope with the homoplasies inherent in sequence data from the hypervariable sequences of the mitochondrial genome. More recently RM and MJ networks have been used to display Y-chromosomal microsatellite diversity to good effect — see Fig. 4 (Forster et al. 1998, 2000; Hurles et al. 1999).

Figure 4.

Published networks of human Y-chromosomal microsatellite haplotypes. (a) Median-Joining network of haplogroup 22 (see Fig. 2), adapted from Hurles et al. (1999). This single-step network exhibits little diversity and is identical to a minimum spanning network based on the same data. (b) Reduced Median network of global Y-chromosomal microsatellite diversity, adapted from Forster et al. (2000). This multistep network exhibits great diversity and reveals the phylogenetic information within haplotypes of slower mutating microsatellites.

The superimposition of additional information upon networks or trees of multiallelic diversity can be used to identify relationships that, after further testing for significance, can allow new inferences to be drawn and tested. Superimposing population affinity allows population structuring of haplotypes to be easily seen, which can then be confirmed statistically (Zerjal et al. 1997, 1999; Hurles et al. 1999; Helgason et al. 2000). In addition, the superimposition of allelic states whose recurrency is unknown can allow unique events to be distinguished by virtue of their defining a single cluster within a network (Hurles et al. 1998). Thus the graphical display of data represents a tool to guide further analyses rather than an endpoint in itself.


Dating the branchpoints of the tree of lineages is of prime importance in interpreting geographical distributions of lineages in terms of population history. A monophyletic lineage has by definition been founded by a single chromosome, with zero diversity. Thus the diversity within a lineage can be related to the time since the lineage was founded, or more correctly the time since the MRCA of sampled chromosomes. Multiallelic markers with high mutation rates can be used to investigate the intralineage diversity. The markers most often used for this work are microsatellites, although the minisatellite MSY1 has also been used (Jobling et al. 1998a). There are a number of confounding factors that contribute to the extent and nature of the multiallelic diversity within a lineage. Box 3 details the more popular methods used to estimate the age of the MRCA of a lineage from its extant diversity; little work has been done to compare these different methods. The impact of demography on these estimates and their confidence limits has only begun to be appreciated (Thomas et al. 1998), and much work remains before these estimates can be regarded as being robust to prehistoric demographic perturbations. A recent example has shown how the inclusion of an expanding population model within coalescent dating substantially reduces the age of the MRCA of all extant human Y chromosomes (Thomson et al. 2000).

Table 3. 
Box 3 Dating
Nodes in the Y-chromosomal phylogeny are commonly dated using intralineage diversity at fast mutating multiallelic markers. The mutation rates of these markers are sufficiently high to allow them to be determined by analysing pedigrees, and with these rate estimates in hand, a variety of different methods have been used to relate diversity to age.
Perhaps the simplest dating method rests on the assumption that the proportion of mutants within a population is equal to the product of the mutation rate and the time since the first appearance of the ancestral allele. This relationship was first hypothesized by Luria and Delbruck in their study of bacterial mutation in the 1940s (Luria & Delbruck 1943). More recently it has been adapted to microsatellite haplotypes, by taking into consideration a single-step mutational model, when dating the origin of the commonest cystic fibrosis mutation among Europeans (Bertranpetit & Calafell 1996). This method requires that a root haplotype be identified and that the number of mutational steps from each haplotype to the root be averaged over all haplotypes and all loci. This statistic, when calculated from a tree, is known as ρ (rho) (Forster et al. 1996) and is divided by the mutation rate per generation (ρ/µ = t) to give the age of the MRCA in generations.
Another method that assumes linearity of a statistic with respect to time since the founding of a lineage is average squared distance (ASD) dating (Goldstein et al. 1995a, 1995b; Thomas et al. 1998). The ASD statistic is simply the squared mutational distance between a root haplotype and any other haplotypes within the lineage averaged over all loci and all haplotypes. Again dividing this statistic by the mutation rate gives the age of the MRCA of the lineage in generations. The linearity of ASD has also been investigated for dating population splits. In this case it was found that a related statistic performed better. This statistic, known as (δµ)², is a version of ASD corrected for intrapopulation variance (Goldstein et al. 1995b).
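A corresponding sketch for ASD dating follows. It rests on the same assumptions as any such toy example: haplotypes are tuples of repeat counts, the root is taken as known, and the data and mutation rate are invented for illustration.

```python
from typing import Sequence

Haplotype = Sequence[int]  # repeat counts, one entry per microsatellite locus


def average_squared_distance(haplotypes: Sequence[Haplotype], root: Haplotype) -> float:
    """Squared repeat-count difference from the root, averaged over haplotypes and loci."""
    if not haplotypes:
        raise ValueError("need at least one haplotype")
    n_loci = len(root)
    total = 0
    for hap in haplotypes:
        if len(hap) != n_loci:
            raise ValueError("every haplotype must be typed at the same loci as the root")
        total += sum((allele - r) ** 2 for allele, r in zip(hap, root))
    return total / (len(haplotypes) * n_loci)


# Hypothetical data: three loci, five chromosomes, assumed root haplotype.
ROOT = (14, 12, 23)
SAMPLE = [(14, 12, 23), (14, 12, 23), (15, 12, 23), (14, 11, 23), (14, 12, 25)]
MU = 0.002  # assumed mutation rate per locus per generation

asd = average_squared_distance(SAMPLE, ROOT)
asd_age = asd / MU  # age of the MRCA in generations
print(f"ASD = {asd:.3f}, age = {asd_age:.0f} generations")
```

Squaring gives extra weight to alleles that lie several repeat units from the root, so a single two-step haplotype contributes four times as much as a one-step haplotype.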
Based on neutral theory, the root haplotype is assumed to be the most frequent haplotype, if the number of mutations is small; this assumption has been little discussed, however, and is highly sensitive to sampling. A number of methods do not require that a root be specified, relating the variance of microsatellite allele lengths within a lineage to time (Goldstein et al. 1996; Kittles et al. 1998). Recently, intra-allelic diversity has been analysed using a Bayesian-based coalescent methodology to estimate the age of a MRCA; such methods are in their infancy but are likely to become increasingly popular (Wilson & Balding 1998; Hurles et al. 1999).
Generation time and mutation rate are key parameters in all dating methods considered above and while confidence limits around mutation rates are being narrowed, much uncertainty in generation time remains. Although it is generally accepted that male generation time is longer than that of females, generation times varying from 20 to 30 years have been published. One group consistently uses the figure of 27 years for male generation time that comes from studies of extant hunter–gatherer societies (Weiss 1973; Underhill et al. 1996). Longer generation times have been proposed from the analysis of well-documented genealogies (Tremblay & Vezina 2000); however, the relevance of relatively modern demographies to prehistoric societies remains doubtful.
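The effect of the generation-time assumption is easy to make concrete. The sketch below converts one age in generations into years under the 20, 27 and 30 year figures mentioned above; the age of 200 generations is a hypothetical input, not an estimate from any study.

```python
def years_from_generations(generations: float, generation_time_years: float) -> float:
    """Convert an age in generations to years for a given generation time."""
    if generations < 0 or generation_time_years <= 0:
        raise ValueError("age must be non-negative and generation time positive")
    return generations * generation_time_years


AGE_GENERATIONS = 200.0  # hypothetical MRCA age in generations
GENERATION_TIMES = (20.0, 27.0, 30.0)  # years, spanning the published range

ages_in_years = {
    g: years_from_generations(AGE_GENERATIONS, g) for g in GENERATION_TIMES
}
for g, years in ages_in_years.items():
    print(f"{g:.0f}-year generations: {years:.0f} years")
```

The same genetic estimate therefore spans 4000 to 6000 years before any uncertainty in the mutation rate or in the diversity statistic itself is considered.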

Coalescent simulations can give age estimates for lineages, based on tree topology and lineage frequency alone, independent of intralineage diversity (Hammer et al. 1998). These methods use estimates of sequence divergence rates that depend on fossil record calibrations, whereas those that use intralineage multiallelic marker diversity do not. Little work has been done to compare dates calculated using these different methods within the same samples, and comparisons are problematic given that confidence intervals are so large (Bosch et al. 1999). If we ignore the confidence intervals, it seems that ages from intra-allelic diversity are younger than those from coalescent dating: for example, microsatellite diversity within haplogroup 16 yields an age of about 4000 years (Zerjal et al. 1997) while coalescent analysis gives an age of 8400 years (Karafet et al. 1999) (confidence intervals are ±7000 years!). Possible discrepancies between dating methods might relate to the differences in their underlying population and mutational models. The latter should be refined by a consideration of allele-specific mutation dynamics, given the growing evidence for the mutational freezing of small alleles (Carvalho-Silva et al. 1999). For a discussion of factors potentially causing divergent age estimates from different analyses, see Bosch et al. (1999).

Once the age of a lineage has been calculated, the question of its anthropological relevance arises. One issue that has been much debated is the equating of lineage age with population age. This practice has been widely attacked as failing to appreciate the differences between a population and a lineage MRCA (Cavalli-Sforza & Minch 1997). To adapt a published analogy (Barbujani 1997), if a future human colony were established on Mars, the Y-chromosomal lineages of the colonists might coalesce in the Palaeolithic, which would not mean that Mars had been colonized 15 000 years ago. Estimates of the age of the MRCA of a single allele are susceptible to stochastic effects and are poor estimates of population age. In general alleles are thought to be older than the populations in which they are found. This debate has unfortunately eclipsed the fact that providing a temporal aspect to patterns of lineage sharing can inform in other, anthropologically useful ways. A number of Y-chromosomal studies have used such lineage ages to exclude one of a number of competing hypotheses for a lineage’s distribution (Hurles et al. 1998, 1999).

The pattern of human Y chromosome diversity

Most genetic, palaeontological and archaeological evidence concurs in suggesting that anatomically modern humans arose in Africa around 150 000 years ago (Stringer & McKie 1996; Quintana-Murci et al. 1999). Because of the relative youth of our own species, human genetic diversity is limited compared to that of our closest relatives, the great apes (Gagneux et al. 1999), and differences between human populations are also correspondingly less marked than those between populations of chimps, for example. Studies using ‘classical’ markers [protein polymorphisms and blood groups (Lewontin 1972)], and more recent studies using nuclear DNA polymorphisms (Barbujani et al. 1997) are consistent in showing that approximately 80–85% of human genetic diversity lies within populations, rather than between them. The question then arises of whether this is equally true of all loci within the genome.
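As an illustration of what such an apportionment measures, the following sketch computes the fraction of total gene diversity found within populations from allele or haplotype frequencies. It is a simplified version of a standard gene-diversity partition (equal population weights, no sample-size correction), and the frequencies are invented.

```python
from typing import Dict, List

Frequencies = Dict[str, float]  # variant name -> frequency within one population


def gene_diversity(freqs: Frequencies) -> float:
    """Probability that two randomly drawn copies differ: 1 - sum(p_i^2)."""
    return 1.0 - sum(p * p for p in freqs.values())


def within_population_fraction(populations: List[Frequencies]) -> float:
    """H_S / H_T: share of total diversity that lies within populations."""
    if not populations:
        raise ValueError("need at least one population")
    n = len(populations)
    # Mean diversity within populations.
    h_s = sum(gene_diversity(p) for p in populations) / n
    # Diversity of the pooled population, weighting populations equally.
    variants = set().union(*populations)
    pooled = {v: sum(p.get(v, 0.0) for p in populations) / n for v in variants}
    h_t = gene_diversity(pooled)
    if h_t == 0.0:
        raise ValueError("no diversity in the pooled sample")
    return h_s / h_t


# Hypothetical frequencies for two populations at one locus.
weakly_differentiated = [{"A": 0.6, "B": 0.4}, {"A": 0.4, "B": 0.6}]
strongly_differentiated = [{"A": 0.9, "B": 0.1}, {"A": 0.1, "B": 0.9}]

weak = within_population_fraction(weakly_differentiated)
strong = within_population_fraction(strongly_differentiated)
print(f"within-population share: {weak:.2f} versus {strong:.2f}")
```

A locus behaving like the first example would place nearly all of its diversity within populations, whereas one behaving like the second would place most of it between them.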

Early analyses using Y-chromosomal polymorphisms (e.g. Torroni et al. 1990) suggested that Y haplotypes might show a relatively high degree of population specificity, and this was confirmed as more markers were discovered (e.g. Seielstad et al. 1994; Underhill et al. 1996; Zerjal et al. 1997; Hurles et al. 1999).

Carrying out a quantitative comparison across different loci of the degree of genetic differentiation between different populations is not easy. First, comparisons will not be valid unless the same population samples are analysed for all loci. Second, the markers used to analyse diversity on the Y chromosome, the rest of the nuclear genome, and mtDNA differ in their mutational properties, and this must be taken into account (Stoneking 1998). Third, although the effective population sizes of the Y chromosome and mtDNA are both expected to be around one quarter of those of any autosome, there may be less obvious differences between the effective population sizes of males and females, which may bias the results (Nei 1987). One study has been carried out (Seielstad et al. 1998) which at least takes the first of these issues into account, and this suggests that, within Africa and globally, the Y chromosome shows far more genetic differentiation with geographical distance than do either other nuclear loci or mtDNA (Fig. 5). Despite valid criticisms of this work (Stoneking 1998), research both before (e.g. Salem et al. 1996) and after (Kayser et al. 2000; MEH unpublished results) this landmark study testifies to the generality of the phenomenon.
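The one-quarter expectation, and its sensitivity to the numbers of breeding males and females, can be checked with a short calculation. The sketch assumes an idealized population and the standard formula for diploid effective size with unequal sex numbers, N_e = 4·N_m·N_f/(N_m + N_f); the population sizes are hypothetical.

```python
from typing import Dict


def effective_copies(breeding_males: float, breeding_females: float) -> Dict[str, float]:
    """Effective number of gene copies for an autosome, the Y and mtDNA."""
    nm, nf = breeding_males, breeding_females
    if nm <= 0 or nf <= 0:
        raise ValueError("both sexes must have a positive number of breeders")
    diploid_ne = 4.0 * nm * nf / (nm + nf)  # effective number of diploid individuals
    return {
        "autosome": 2.0 * diploid_ne,  # two autosomal copies per individual
        "Y": nm,                       # one Y per breeding male
        "mtDNA": nf,                   # transmitted by females only
    }


equal = effective_copies(500, 500)    # equal numbers of breeding males and females
skewed = effective_copies(250, 750)   # fewer breeding males than females

y_ratio_equal = equal["Y"] / equal["autosome"]
y_ratio_skewed = skewed["Y"] / skewed["autosome"]
mt_ratio_skewed = skewed["mtDNA"] / skewed["autosome"]
print(y_ratio_equal, y_ratio_skewed, mt_ratio_skewed)
```

With equal sex numbers both haploid loci sit at one quarter of the autosomal value. With three breeding females per breeding male the Y falls to one sixth while mtDNA rises to one half, which shows how sex differences in effective size can bias a comparison between the two loci.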

Figure 5.

Schematic graph indicating the greater geographical differentiation of the Y chromosome compared to both mtDNA and autosomal loci (adapted from Seielstad et al. 1998; see text).

In a sense it is not the Y chromosome’s high geographical differentiation that is surprising, but rather mtDNA’s lack of it, since both are expected to have similar effective population sizes. The low effective population sizes of both loci compared to autosomes ought to lead to a greater effect of drift and more rapid divergence of populations, in both time and space. The apparent difference between the two loci was explained in the study of Seielstad et al. (1998), by invoking a higher prehistoric rate of female migration, which was calculated as eight times higher than that of males. While this might seem counterintuitive, given that males often seem to be more mobile during wars and colonizations, for example, the crucial parameter is intergenerational movement. It is thought that the majority of human populations practise patrilocality (reviewed in Quinn 1977), where the children of two people who come from different places are brought up in the paternal place of origin. If this is so, we expect Y chromosomes to remain relatively static while mtDNAs are mobile. Over many generations this will have the effect of homogenizing the human mtDNA genetic landscape.
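One standard way of turning differentiation into a migration estimate is Wright's island model at equilibrium, under which F_ST ≈ 1/(1 + 2Nm) for a uniparentally inherited haploid locus, where N is the effective number of breeding individuals of the transmitting sex and m their migration rate. The sketch below applies that approximation to illustrative F_ST values. The values are invented, and the use of the island model here is an assumption rather than a statement of how the cited study performed its calculation.

```python
def effective_migrants(fst: float) -> float:
    """N*m implied by F_ST ~ 1 / (1 + 2*N*m) for a haploid, uniparental locus."""
    if not 0.0 < fst < 1.0:
        raise ValueError("F_ST must lie strictly between 0 and 1")
    return (1.0 / fst - 1.0) / 2.0


FST_Y = 0.6      # hypothetical differentiation for the Y chromosome
FST_MTDNA = 0.2  # hypothetical differentiation for mtDNA

male_nm = effective_migrants(FST_Y)
female_nm = effective_migrants(FST_MTDNA)
female_to_male_ratio = female_nm / male_nm
print(f"female N*m is {female_to_male_ratio:.1f} times male N*m")
```

If the effective numbers of males and females are equal, the ratio of the two N·m values is also the ratio of female to male migration rates. If they differ, the ratio confounds migration with effective size.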

There are other possible explanations for this difference in patterns of mtDNA and Y diversity: higher male than female mortality before reproduction, local selection on Y chromosomes, and the practice of polygyny, where some males take many wives, may all contribute. Another factor may be the intergenerational transmission of offspring number. Simulation work on the over-representation of disease alleles (Austerlitz & Heyer 1998) and mtDNA lineages (Murray-McIntosh et al. 1998) suggests that if intergenerational transmission of offspring number were stronger through paternal lineages than through maternal lineages, as might be expected in a patrilinearly based society, then this could be a powerful mechanism for limiting local Y-chromosomal diversity and could lead to extensive population structuring. All of these factors may contribute towards a significantly lower effective population size for Y chromosomes than for mtDNAs. A possible alternative explanation lies in recent suggestions that recombination between human mtDNAs operates as a confounding factor for studies of maternal lineages (Awadalla et al. 1999); however, the data used in this study have been criticised (Kivisild & Villems 2000), and multiple whole genome sequences of mitochondria show no evidence of recombination (Ingman et al. 2000).

Whatever its explanation, the geographical differentiation of the Y chromosome is such that many lineages defined by unique markers are found to have population-specific distributions (Jobling et al. 1997). Different populations tend to have very different distributions of haplogroups (Fig. 6). This means that the Y chromosome is probably the most sensitive genetic tool in the human genome for detecting admixture between populations.

Figure 6.

Haplogroup distributions found in four different continental populations, based on the tree of haplogroups shown in Fig. 5, showing only those haplogroups that can be typed by PCR.

From patterns to events

The presence on the Y chromosome of a range of polymorphic markers with different mutation dynamics and rates allows inferences to be made about events at multiple time-depths within human prehistory. Most of these events are directly analogous to issues in ecological genetics.

Origins of Species

Positioning the origin of modern humans in time and space is a common aim of global studies of diversity. A study by Hammer et al. (1998) used nested cladistic analysis to analyse the global distribution of a Y-chromosomal phylogeny. The most ancestral Y-chromosomal lineage, the root of all others, was found exclusively in sub-Saharan Africa. This finding, together with the consensus dating of this ancestral haplotype to within the past 200 000 years, has been taken to strongly support the Out-of-Africa model for modern human origins. A more recent paper has dated the MRCA of all modern Y chromosomes as being significantly younger, around 50 000 years (Thomson et al. 2000), but is still consistent with an African origin.

Range expansions into terra incognita

Human prehistory contains a number of dramatic examples of population movements into lands previously uninhabited. Historically, the timing and origin of such migrations have been studied within many disciplines. Recent studies of the Y-chromosomal origins of American Indians and their movement across the Beringian landbridge have failed to support the linguist Joseph Greenberg’s hypothesis of three separate migrations (Karafet et al. 1999; Santos et al. 1999).

Y-chromosomal studies have also challenged the dominant archaeo-linguistic model of a Taiwanese origin for the Austronesian diaspora that led to the colonization of Polynesia. Taiwan houses both the oldest archaeological evidence (Spriggs 1989) of Austronesian people and the greatest linguistic diversity (Blust 1999). A recent study demonstrated an absence of Y-chromosomal haplotypes ancestral to those found in Polynesia amongst Taiwanese aborigines (Su et al. 2000).


The issue of admixture underpins many of the contentious issues in human evolutionary genetics. Y-chromosomal studies have been used to identify sex-biased admixture with the finding of European Y chromosomes but not mtDNA in Polynesia (Hurles et al. 1998). The large genetic and geographical distances between these two populations make admixture easy to spot. More subtle sex-differences have been recently noted in the relative contributions of the more closely related populations of Scandinavia and Ireland to the colonization population of Iceland (Helgason et al. 2000). Even harder to distinguish are the relative contributions of Palaeolithic and Neolithic peoples to extant genetic diversity in Europe. With no reliable outgroups, this question of admixture in both time and space remains contentious despite contributions from Y-chromosomal studies (Semino et al. 1996; Casalotti et al. 1999).

Isolates (no admixture)

Population isolates have long been important for medical geneticists, and intriguing for anthropologists, and their genetic diversity has been much investigated. The usual suspects of Basque, Finnish and Jewish populations have all been studied using Y-chromosomal markers. Rather unexpectedly the Basques share a lineage with other Iberian populations, and diversity within this lineage indicates that gene-flow must have occurred within the past few millennia (Hurles et al. 1999). The Finns appear to owe much of their Y-chromosomal, as well as linguistic, ancestry to Central Asian populations with whom they share a diagnostic lineage at high frequency (Zerjal et al. 1997). The Y chromosome has also been used to confirm genetically the paternal inheritance within Jewish populations of the Cohanim priesthood (Thomas et al. 1998), as well as the more general tendency of Jews not to mix their genes with gentiles (Hammer et al. 2000), and the Middle-Eastern origin of the Lemba, the ‘black Jews’ of South Africa (Thomas et al. 2000).

Sex-differentiated behaviours

The power of comparing Y-chromosomal to mtDNA data in studying sex-specific behaviours has been demonstrated by an investigation into the relationship between global differences in mtDNA and Y-chromosomal diversity and linguistic affiliations (Poloni et al. 1997). Languages were found to be better correlated with paternal lineages than with maternal lineages, and it was suggested that this might be due to children adopting their father’s language rather than their mother’s, in contrast to the phrase ‘mother tongue’ (Poloni et al. 1997). Alternatively, this could be another effect of patrilocality, since mothers are more mobile than fathers, and this mobility may be accompanied by some language shift.

A similar approach was also used in an investigation of the genetic impact of the Hindu caste system (Bamshad et al. 1998). As might be expected, this study found that this mating structure, despite its relatively shallow time depth, has indeed resulted in some degree of genetic stratification of castes and that there is higher female than male gene-flow between castes.


The patrilineal inheritance of the Y chromosome has been exploited in paternity cases where the putative father is unobtainable; most famously this has supported the claim that US president Thomas Jefferson fathered a son by a mulatto slave (Foster et al. 1998). The rate of nonpaternity in humans has been further elucidated in a study, by the eponymous Sykes, of haplotype sharing among males sharing the same surname (Sykes & Irven 2000). A monophyletic origin of this surname was suggested, thus allowing the nonpaternity rate over the past 700 years to be estimated as 1.3% per generation.
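The logic linking haplotype sharing to a per-generation rate can be sketched as follows. If a surname had a single founder and each generation carries an independent probability r of nonpaternity, the expected fraction of present-day bearers who still carry the founder's Y chromosome after g generations is (1 − r)^g. The 700 years and the 1.3% rate come from the text; the 25-year generation time is an assumption, and the sketch ignores mutation, adoption and multiple founders, so it is not presented as the calculation used in the cited study.

```python
def fraction_retaining_founder(rate_per_generation: float, generations: float) -> float:
    """Expected share of lineages with an unbroken father-to-son chain."""
    if not 0.0 <= rate_per_generation < 1.0 or generations < 0:
        raise ValueError("rate must be in [0, 1) and generations non-negative")
    return (1.0 - rate_per_generation) ** generations


def rate_from_fraction(fraction: float, generations: float) -> float:
    """Invert the relationship: r = 1 - fraction**(1/g)."""
    if not 0.0 < fraction <= 1.0 or generations <= 0:
        raise ValueError("fraction must be in (0, 1] and generations positive")
    return 1.0 - fraction ** (1.0 / generations)


YEARS = 700.0
RATE = 0.013            # per generation, as quoted in the text
GENERATION_TIME = 25.0  # years; an assumed value

generations = YEARS / GENERATION_TIME
retained = fraction_retaining_founder(RATE, generations)
recovered_rate = rate_from_fraction(retained, generations)
print(f"{generations:.0f} generations, {retained:.1%} retain the founder lineage")
```

Under these assumptions roughly seven in ten present-day bearers of the surname would be expected to share the founder's haplotype. A shorter assumed generation time lowers that share for the same per-generation rate.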

Future developments

One potential problem with such comparative analyses of Y-chromosomal and mtDNA diversity is the impact of stochasticity. These loci, with their small effective population sizes, are particularly prone to the vagaries of drift. The question ‘when can differences between these two loci be ascribed to sex-specific as opposed to stochastic differences?’ has not yet been explicitly addressed.

The need to develop an analytical framework under which information from multiple phylogenies can be combined has been appreciated by both anthropologists and ecologists (Avise 1989; Bermingham & Moritz 1998; Jin et al. 1999). These phylogenies can come from multiple species, multiple nonrecombining loci within the same genome or indeed from nongenetic data (Pagel 1999). Phylogenies can be constructed from physical characteristics such as cranial measurements or indeed cultural characteristics, most notably languages. A cross-fertilization of analytical ideas between ecologists interested in inter-species comparative phylogeography, and anthropologists interested in intragenomic comparative phylogeography is to be encouraged. The human genome leads the way in terms of published phylogenies constructed for nonrecombining portions of the genome, of which the Y chromosome is but one (Harding et al. 1997; Jin et al. 1999; Kaessmann et al. 1999). One recent paper constructed a phylogeny for markers within a nonrecombining portion of chromosome 21 and compared it to a Y-chromosomal phylogeny to provide greater support for a substantial ancient back migration from Asia to Africa (Jin et al. 1999).

Diversity of haploid chromosomes in other species

Chromosomal sex-determining mechanisms have evolved several times independently among animals. Furthermore, after they have arisen, constitutively haploid chromosomes are evolutionarily labile: their gene content degrades (Rice 1994), and they tend to lose and gain material at a much greater rate than do other chromosomes, which are constrained by the requirements of pairing and recombination. As a result, useful sequence homologies between the Y chromosomes (or W chromosomes) of different species are the exception, rather than the rule, and this has limited the study of haploid chromosome diversity in species other than humans. Most effort in nonhuman animals has gone into developing markers for sex-testing purposes.

We therefore know little about the diversity of haploid chromosomes in other species. Diversity of nuclear (Kaessmann et al. 1999) and mtDNA sequences (Gagneux et al. 1999) in the great apes is much greater than that in humans, and this is expected to be true of Y-chromosomal sequences too. Preliminary studies of Y diversity among orang-utans (Altheide TK & Hammer MF, submitted) confirm that this is so, although sample sizes are small. As a result of efforts to characterize the mouse genome, polymorphic markers on the Y chromosomes of species in the Mus subgenus have been identified (the Mouse Genome Database lists 71, of which 27 can be typed by PCR) but these have not been exploited in population studies. A few Y-specific polymorphisms have been isolated in a smattering of mammalian species including field voles (a pericentric inversion, Jaarola et al. 1997), cows (microsatellites, Hanotte et al. 2000), dogs (microsatellites, Olivier et al. 1999) and North American deer (ZFY sequences, Cathey et al. 1998).

Although some markers have been isolated in bird species which are specific to either Z or W chromosomes, no studies of W chromosome diversity have been done. Such studies may seem redundant given that W chromosomes are female-specific, and therefore mirror the ancestry of the more easily studied mtDNA; however, W chromosomes should in principle bear a greater diversity of polymorphic systems than mtDNA, and differences in the patterns of diversity between the two systems would give information about any irregularities in the sex-specificity of mtDNA inheritance.

Methods to isolate polymorphic markers specific to the haploid chromosomes in any species could be designed (see Box 1), and would allow the kinds of studies which we have described here to be extended beyond humans. Traits such as stature (Salo et al. 1995), aggression (Maxson 1996), attractiveness (Brooks 2000) and the fitness of sperm in sperm competition may be linked to haploid chromosomes, and will form substrates for the action of natural selection. Phenomena such as male dominance and differential male and female dispersal will influence the effective population sizes of haploid chromosomes. These factors are likely to play a major role in patterning diversity within many species, and when disentangling them, molecular ecologists are likely to find haploid chromosomes a uniquely informative tool.


We thank Dan Bradley, Martin Jones and Chris Tyler-Smith for comments on the manuscript.

Matt Hurles is a Research Fellow in Population Genetics supported by the McDonald Institute for Archaeological Research, and his interests lie in the use of modern DNA diversity of humans and other species in the reconstruction of prehistory, focused in Oceania, and in the analysis of genome instability and genomic disorders. For further information, see http://www-mcdonald.arch.cam.ac.uk/Genetics/home.html. Mark Jobling is a Wellcome Trust Senior Research Fellow in Basic Biomedical Science (Grant no. 057559), with interests in human genetic history, infertility, mutation processes, forensics and genealogy centred on the Y chromosome. For further information, see: http://www.le.ac.uk/genetics/maj4/maj4.html.