The identification of population bottlenecks is critical in conservation because populations that have experienced significant reductions in abundance are subject to a variety of genetic and demographic processes that can hasten extinction. Genetic bottleneck tests constitute an appealing and popular approach for determining if a population decline has occurred because they only require sampling at a single point in time, yet reflect demographic history over multiple generations. However, a review of the published literature indicates that, as typically applied, microsatellite-based bottleneck tests often do not detect bottlenecks in vertebrate populations known to have experienced declines. This observation was supported by simulations that revealed that bottleneck tests can have limited statistical power to detect bottlenecks largely as a result of limited sample sizes typically used in published studies. Moreover, commonly assumed values for mutation model parameters do not appear to encompass variation in microsatellite evolution observed in vertebrates and, on average, the proportion of multi-step mutations is underestimated by a factor of approximately two. As a result, bottleneck tests can have a higher probability of ‘detecting’ bottlenecks in stable populations than expected based on the nominal significance level. We provide recommendations that could add rigor to inferences drawn from future bottleneck tests and highlight new directions for the characterization of demographic history.
In cases when long-term monitoring is not possible and historic data or samples are unavailable, genetic methods that require only a single temporal sample constitute an appealing alternative for detecting past population bottlenecks (Tajima 1989b; Cornuet & Luikart 1996; Garza & Williamson 2001). All single-sample population genetic methods are based on detecting deviations from expectations under mutation-drift equilibrium, the most common approaches for species of conservation concern being M-ratio and heterozygosity-excess tests based on multilocus microsatellite genotypes (Box 1). Thus, in practice, bottleneck tests are used to detect declines in abundance, but in principle they reflect the genetic signature of declines in effective population size (Ne). Because genetic bottleneck methods require only a single population sample and may be applied to standard genetic data, they have become a routine part of conservation genetics studies aimed at characterizing recent demographic history. However, rigorous application of bottleneck tests requires that population declines have a high probability of being detected and that bottlenecks are not regularly inferred for stable populations; in other words, tests should have high statistical power and a low probability of resulting in a Type I error (Taylor & Dizon 1997). Failure to detect recent population bottlenecks in endangered species might delay the implementation of needed conservation actions and increase the likelihood of extinction, whereas incorrectly inferring bottlenecks in stable populations could result in the misallocation of conservation resources to the detriment of more imperiled populations or species. Achieving adequate power and reasonably low Type I error rates is particularly important given that the results of genetic bottleneck tests are increasingly being incorporated into assessments of threatened species (NMFS 2010; USFWS 2010a, b).
Initial assessments indicated that genetic bottleneck tests have reasonable power to detect population declines, given adequate sample sizes of individuals and loci (Cornuet & Luikart 1996; Luikart & Cornuet 1998; Garza & Williamson 2001; Williamson-Natesan 2005). However, simulation studies have only evaluated the ability to detect very large proportional declines in Ne (10- to 1000-fold) or declines to very few individuals (Ne = 10–50), and empirical assessments often evaluated test performance in populations experiencing extreme declines (e.g. Mexican wolves, Canis lupus mexicanus). In contrast, most research on species of conservation concern aims to detect declines that are less severe and that are not obvious from ecological data, but may nevertheless affect the species’ persistence (Traill et al. 2010; Flather et al. 2011). Bottleneck detection is further complicated by the fact that several factors including the timing and duration of the bottleneck, immigration and the amount of pre-bottleneck genetic diversity can influence and potentially obscure genetic signals of population declines (Cornuet & Luikart 1996; Garza & Williamson 2001; Williamson-Natesan 2005). Thus, bottleneck tests may be less likely to detect population declines in species of conservation concern than previously suggested. Indeed, bottleneck tests have failed to detect well-known population collapses in Scandinavian lynx (Lynx lynx; Spong & Hellborg 2002), California sea otters (Enhydra lutris nereis; Aguilar et al. 2008) and Amur tigers (Panthera tigris altaica; Henry et al. 2009). In all three cases, global population size had been reduced to tens of individuals, putting these species at a high risk of extinction from demographic and genetic consequences of reduced population size.
Genetic bottleneck tests require making assumptions about microsatellite evolution to generate expected distributions for test statistics (Cornuet & Luikart 1996; Garza & Williamson 2001). Microsatellites are generally modelled as evolving according to a two-phase mutation model that consists of two parameters, the proportion (pg) and mean size (δg) of multi-step mutations (Box 2). However, results can be highly sensitive to assumed parameter values, and incorrect assumptions about these parameter values can lead to an erroneous inference that a bottleneck occurred (Boxes 2 and 3). As parameter values generally are unknown for the species of interest, values are typically ‘borrowed’ from species for which these parameters have been estimated or estimated indirectly from allele frequency distributions observed in stable populations (Piry et al. 1999; Garza & Williamson 2001). Moreover, parameter values used in bottleneck tests are typically derived from studies of microsatellite mutations in humans, and the extent to which these values apply generally to vertebrates is uncertain given the high level of interlocus and interspecific variation in microsatellite evolution (Ellegren 2000, 2004). However, several studies involving a diversity of vertebrate species have recently been published that allow for assessments of (i) how well typically assumed values reflect microsatellite evolution; and (ii) the effects of differences between assumed and true parameter values on inferences from bottleneck tests.
Here, we reviewed the use of M-ratio and heterozygosity-excess tests in the literature and used simulation analyses to assess how effectively these tests detect recent population declines in vertebrates. We limited our assessment to vertebrates because of the large number of studies that have used genetic bottleneck tests, as well as variability among higher order taxa in ploidy and microsatellite evolution that likely influence inference from bottleneck tests. Our aim was not to conduct a comprehensive analysis of all factors that might influence statistical power and Type I error rates (Williamson-Natesan 2005), but instead to determine how effectively bottleneck tests as implemented in the literature can identify populations in need of conservation action. First, we determined how effective typical sample sizes are for detecting recent bottlenecks across a range of post-bottleneck Ne and whether less extreme but potentially important population declines are likely to be detected. Second, we assessed potential biases created by incorrect choices of mutation model parameter values using parameter estimates derived from studies of microsatellite mutations in vertebrates. By highlighting potential issues in studies applying bottleneck tests, we hope to provide guidance for future genetic-based assessments of population declines and increase the reliability of bottleneck tests in conservation research and planning.
Box 1. Principles of genetic bottleneck tests
Genetic bottleneck tests are rooted in a wider class of population genetic methods aimed at detecting departures from expectations under mutation-drift equilibrium. Population genetic tests of mutation-drift equilibrium typically contrast two different indices of genetic diversity. One measure is expected to be only marginally affected by the underlying process causing deviations from mutation-drift equilibrium and represents a baseline against which the second, more sensitive, diversity index is compared. Examples of genetic diversity indices employed in mutation-drift equilibrium tests that are expected to be less affected by a population bottleneck initially are nucleotide diversity, heterozygosity and the variance and range in microsatellite allele size. In contrast, a greater reduction is expected in the number of alleles and segregating sites, particularly rare ones. Tajima’s D, a classic test of mutation-drift equilibrium that was originally developed to detect selection, is often used to infer long-term changes in Ne by contrasting the number of segregating sites with nucleotide diversity in DNA sequences (Tajima 1989b). As low-frequency alleles contribute little to the nucleotide diversity, nucleotide diversity is reduced proportionally less than segregating sites during a population bottleneck, resulting in a positive value of Tajima’s D.
The ‘M-ratio test’ developed by Garza & Williamson (2001) is based on the ratio of the number of microsatellite alleles (K) to the range in allele size (r; i.e. M = K/r). During a bottleneck, the number of alleles is expected to decline faster than the range in allele size as most alleles lost will, by chance, be intermediate in size for loci with at least five alleles (graph a). Accordingly, the M-ratio is expected to be smaller in bottlenecked populations than in equilibrium populations. Tests for reductions in M-ratios are conducted by comparing the mean observed M-ratio (across loci) with an expected distribution generated from simulations under mutation-drift equilibrium, or by using a critical value of 0.68 derived from putatively stable wild populations (Garza & Williamson 2001).
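As a minimal sketch of the M-ratio calculation, the Python example below computes M = K/r for a single locus and averages it across loci. It assumes that r counts the repeat states spanned by the smallest and largest alleles, inclusive, so that a locus with no gaps has M = 1; the function names and the allele sizes are hypothetical and are not taken from any published software.

```python
def m_ratio(allele_sizes, repeat_length):
    """Return M = K / r for one locus.

    allele_sizes: sizes (in base pairs) of the alleles observed at the locus.
    repeat_length: length of the repeat motif in base pairs.
    Assumption: r is the number of repeat states between the smallest and
    largest alleles, inclusive, so a locus without gaps gives M = 1.
    """
    sizes = sorted(set(allele_sizes))
    k = len(sizes)
    r = (sizes[-1] - sizes[0]) // repeat_length + 1
    return k / r


def mean_m_ratio(loci):
    """Mean M across loci; `loci` is a list of (allele_sizes, repeat_length)."""
    ratios = [m_ratio(sizes, rep) for sizes, rep in loci]
    return sum(ratios) / len(ratios)


# Hypothetical dinucleotide locus before and after losing intermediate alleles.
before = [100, 102, 104, 106, 108, 110]
after = [100, 102, 110]
print(m_ratio(before, 2))  # 6 alleles over 6 repeat states: 1.0
print(m_ratio(after, 2))   # 3 alleles over 6 repeat states: 0.5
```

In this toy example the range in allele size is unchanged while half of the alleles are lost, so M is halved.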
The ‘heterozygosity-excess’ test developed by Cornuet & Luikart (1996) contrasts heterozygosity expected under Hardy–Weinberg equilibrium with heterozygosity expected under mutation-drift equilibrium calculated from the observed number of alleles. The latter measure is more sensitive to the loss of low-frequency alleles during a bottleneck (graph b), and a bottleneck is therefore inferred when heterozygosity under Hardy–Weinberg equilibrium exceeds heterozygosity under mutation-drift equilibrium (i.e. heterozygosity excess is detected). Coalescent simulations are conducted assuming a constant effective population size to estimate heterozygosity under mutation-drift equilibrium, conditional on the number of alleles in the observed data (only those simulated data sets with the observed number of alleles at each locus are retained). Several statistical methods can be used to test for greater Hardy–Weinberg than mutation-drift heterozygosity, but the Wilcoxon signed-rank test is the most powerful and commonly used approach (Luikart & Cornuet 1998).
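The two ingredients of this test can be sketched as follows. The example assumes the usual sample-size-corrected estimator of Hardy–Weinberg heterozygosity and an exact one-sided Wilcoxon signed-rank test computed by enumerating sign assignments, which is practical for the small numbers of loci typical of these studies. The function names are hypothetical, and the mutation-drift expectations (Heq) are taken as given inputs rather than simulated here.

```python
from itertools import product


def expected_heterozygosity(allele_counts):
    """Hardy-Weinberg heterozygosity from allele counts at one locus.

    Assumption: uses the sample-size correction n / (n - 1), where n is
    the number of gene copies sampled.
    """
    n = sum(allele_counts)
    sum_sq = sum((c / n) ** 2 for c in allele_counts)
    return n / (n - 1) * (1.0 - sum_sq)


def wilcoxon_excess_p(observed, expected):
    """Exact one-sided P-value for observed > expected across loci.

    Loci with zero difference are dropped and tied absolute differences
    receive midranks. The null distribution is enumerated over all sign
    assignments, so this is only suitable for roughly 20 loci or fewer.
    """
    diffs = [o - e for o, e in zip(observed, expected) if o != e]
    ordered = sorted(abs(d) for d in diffs)

    def midrank(value):
        positions = [i + 1 for i, a in enumerate(ordered) if a == value]
        return sum(positions) / len(positions)

    ranks = [midrank(abs(d)) for d in diffs]
    w_obs = sum(r for r, d in zip(ranks, diffs) if d > 0)
    hits = 0
    for signs in product((0, 1), repeat=len(ranks)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if w >= w_obs:
            hits += 1
    return hits / 2 ** len(ranks)
```

With eight loci that all show heterozygosity excess, the smallest attainable P-value under this scheme is 1/256, which illustrates how a small number of loci limits the resolution of the test.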
Box 1. Graph
Illustration of (a) a reduction in M-ratios and (b) an increase in heterozygosity excess following a bottleneck, modified from Garza & Williamson (2001) and Luikart & Cornuet (1998). (a) The number of alleles (K) is reduced proportionally more than the range in allele size (r) following a bottleneck such that the M-ratio (calculated as K/r) is reduced, and (b) heterozygosity at mutation-drift equilibrium (Heq) conditional on the number of alleles (K) in the population initially experiences a greater proportionate reduction than heterozygosity under Hardy–Weinberg equilibrium (He) because rare alleles are lost faster than heterozygosity.
Box 2. Microsatellite mutation models and their effects on bottleneck tests
Genetic bottleneck tests require making assumptions about microsatellite evolution to generate expected distributions for test statistics (Cornuet & Luikart 1996; Garza & Williamson 2001). Mutation models considered in bottleneck tests range from an infinite alleles model where each new mutation results in a unique allele, to a stepwise mutation model where each new mutation results in the loss or addition of one repeat microsatellite motif. However, microsatellites are generally believed to mutate according to an intermediate, two-phase model where most mutations result in the addition or loss of a single repeat motif and a smaller proportion of mutations result in the addition or loss of a larger number of repeats (Di Rienzo et al. 1994). Under this model, the proportion of mutations that involve multi-step repeats can be denoted by the parameter pg and the mean size of multi-repeat mutations is described by the parameter δg. Under the assumption that the size of multi-step mutation follows a two-sided geometric distribution, the probability that a multi-step mutation involves x number of repeats can be expressed as
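The sketch below shows one way a single mutation could be drawn under a two-phase model with parameters pg and δg. It is not the expression from the original box: it assumes that multi-step sizes follow a geometric distribution on 1, 2, 3, ... with success probability 1/δg (so that the mean size is δg), and that additions and losses of repeats are equally likely. The function name and this parameterization are illustrative assumptions.

```python
import random


def tpm_step(p_g, delta_g, rng):
    """Draw one mutation as a signed change in repeat number.

    p_g: proportion of mutations drawn from the multi-step phase.
    delta_g: mean size of mutations in the multi-step phase (>= 1).
    Assumption: sizes in the multi-step phase are geometric on 1, 2, 3, ...
    with success probability 1 / delta_g, and the direction is symmetric.
    """
    direction = rng.choice((-1, 1))
    if rng.random() >= p_g:
        return direction  # single-step mutation
    success = 1.0 / delta_g
    size = 1
    while rng.random() >= success:
        size += 1
    return direction * size


rng = random.Random(42)
draws = [tpm_step(0.22, 3.1, rng) for _ in range(10)]
print(draws)  # signed changes in repeat number; most have absolute size 1
```

Setting p_g to 0 recovers a strict stepwise mutation model, which makes it easy to compare expectations under different assumed models.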
Recommended values for the proportion of multi-step mutations and the mean size of multi-step mutations in M-ratio tests (pg = 0.10 and δg = 3.5 respectively) were inferred by selecting values that resulted in M-ratios in simulated equilibrium populations that matched M-ratios from stable wild populations (Garza & Williamson 2001). The basis for the recommendation of pg = 0.10 and δg = 3.1 (equivalent to under a geometric distribution) for heterozygosity-excess tests is not immediately clear (Piry et al. 1999). Erroneous assumptions about these mutation model parameters can lead to incorrect inferences from genetic bottleneck tests (Luikart & Cornuet 1998; Garza & Williamson 2001; Williamson-Natesan 2005). For M-ratios, if the underlying mutation model results in more and larger multi-repeat mutations than assumed, a greater number of ‘missing’ alleles will occur between the smallest and largest alleles than expected, and M-ratios estimated for a stable population could match expectations for a bottlenecked population, resulting in a Type I error (Garza & Williamson 2001). In contrast, Type I errors are more likely to occur in heterozygosity-excess tests if the proportion of multi-step mutations is overestimated (Williamson-Natesan 2005).
Box 3. Testing for bottlenecks in marbled murrelets
We illustrate some of the potential issues associated with uncertainty in mutation models in genetic bottleneck testing using a case study based on marbled murrelets (Brachyramphus marmoratus), a seabird on the US federal threatened species list that nests primarily in old-growth forests in the Pacific Northwest. Genetic bottleneck tests constitute an appealing approach for detecting population declines in a geographically isolated population of murrelets in central California because at-sea monitoring conducted from 1999 to 2010 failed to detect a population decline, despite a century of extensive nesting habitat loss and impacts from a variety of other environmental factors (Peery et al. 2004, 2006, 2007). Blood samples were collected from 270 live murrelets and genotyped at 12 microsatellite loci as described in Peery et al. (2008), and a bottleneck was tested for using heterozygosity-excess and M-ratio methods.
Inferences about the occurrence of a putative bottleneck were complicated by the sensitivity of results to the assumed mutation model (see graph). For heterozygosity excess, tests were not statistically significant when pg = 0–0.1, but became increasingly significant with larger values of pg. The mean M-ratio across loci was 0.90 (range = 0.54–1.00) and well above the commonly used critical value of 0.68 (Garza & Williamson 2001), suggesting that the population did not deviate from mutation-drift equilibrium. However, in contrast to heterozygosity-excess tests, significance declined as pg increased and M-ratios were statistically significant when pg = 0 (Box 3, Fig. 1). Thus whether neither of the two tests, both tests or one of the two tests was statistically significant depended entirely on the assumed mutation model.
Box 3. Graph
Probability of rejecting the null hypothesis (P-value) of mutation-drift equilibrium in marbled murrelets with M-ratio and heterozygosity-excess tests as a function of the assumed proportion of multi-step mutations (pg) and mean size of multi-step mutations (δg).
Material and methods
We reviewed published articles that tested for recent population bottlenecks in vertebrates to determine how frequently bottlenecks were detected in both putatively stable populations and populations known to have declined recently. We also quantified typical sample sizes, assumed values for mutation model parameters, and bottleneck characteristics (e.g. magnitude of the bottleneck) and used these data to parameterize simulation-based assessments of statistical power and Type I error rates. We searched the ISI Web of Knowledge database and identified 1247 unique scientific articles citing Cornuet & Luikart (1996), Luikart & Cornuet (1998), Piry et al. (1999) or Garza & Williamson (2001) as of February 2010. To limit the scope of our review, we focused our efforts on the 105 studies for which some quantitative information was available about the status of the population (post-bottleneck census population size or percent decline) at the time of sampling (Appendix S1). In sum, the studies reviewed here used genetic bottleneck tests to test for recent declines in a total of 703 populations among 116 vertebrate species. We classified each population as either ‘bottlenecked’, ‘stable’ or ‘unknown’ based on the authors’ assessment of the demographic history of the population.
We also reviewed 18 published studies that characterized the size and proportion of multi-step mutations in microsatellites in 15 vertebrate species. We pooled all observed mutations when estimating mutation model parameters from these studies rather than calculating the means of species- or study-specific estimates so that studies of a single species with comparatively small sample sizes did not have a disproportionate influence on overall estimates. Specifically, we estimated pg by dividing the total number of multi-step mutations by the total number of mutations and estimated δg by calculating the average size of all multi-step mutations.
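The difference between pooling mutations and averaging study-specific estimates can be illustrated with a short sketch. The study records below are hypothetical and chosen only to show how a small study can dominate an unweighted mean; the function name is likewise an assumption.

```python
def pooled_estimates(studies):
    """Pooled p_g and delta_g across studies.

    studies: list of dicts with 'n_mut' (total mutations observed) and
    'multi_sizes' (sizes, in repeats, of each multi-step mutation).
    """
    total_mut = sum(s["n_mut"] for s in studies)
    all_multi = [size for s in studies for size in s["multi_sizes"]]
    p_g = len(all_multi) / total_mut
    delta_g = sum(all_multi) / len(all_multi)
    return p_g, delta_g


# Hypothetical data: one large study and one very small study.
studies = [
    {"n_mut": 100, "multi_sizes": [2] * 10 + [4] * 10},
    {"n_mut": 4, "multi_sizes": [5, 5]},
]
p_g, delta_g = pooled_estimates(studies)
unweighted = sum(len(s["multi_sizes"]) / s["n_mut"] for s in studies) / len(studies)
print(round(p_g, 3), round(delta_g, 2), round(unweighted, 2))
```

Here the pooled pg (22/104) stays close to the large study's value of 0.20, whereas the unweighted mean of study-specific values (0.35) is pulled upward by the four-mutation study.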
Simulation methods used to estimate power followed the approach of Williamson-Natesan (2005), where we first simulated populations in mutation-drift equilibrium using standard coalescent simulations (Hudson 1990) and then subjected simulated populations to the bottleneck process using a Wright–Fisher model to project populations forward in time after an instantaneous reduction in effective population size. The mutation process during coalescent and Wright–Fisher simulations was modelled according to a two-phase model, as described in Box 2. Note that hereafter we refer to mutation model parameter values used to generate simulated data for bottlenecked populations as ‘true’ values.
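The forward-in-time step can be sketched as a simple Wright–Fisher projection. This is an illustrative simplification rather than the simulation code used here: it takes the equilibrium gene copies as an input list of repeat numbers (instead of generating them by coalescent simulation), defaults to single-step mutations unless a mutation function is supplied, and does not constrain repeat numbers to stay positive. All names are hypothetical.

```python
import random


def wright_fisher(alleles, n_e, generations, mu, rng, mutate=None):
    """Project gene copies forward after an instantaneous drop to n_e.

    alleles: repeat numbers of gene copies in the pre-bottleneck population.
    n_e: post-bottleneck effective size (2 * n_e gene copies per generation).
    mu: per-generation mutation rate per gene copy.
    mutate: optional function rng -> signed change in repeat number;
            defaults to a single-step change (an assumption of this sketch).
    """
    if mutate is None:
        def mutate(r):
            return r.choice((-1, 1))
    population = list(alleles)
    for _generation in range(generations):
        offspring = []
        for _copy in range(2 * n_e):
            allele = rng.choice(population)  # sample a parent copy with replacement
            if rng.random() < mu:
                allele += mutate(rng)
            offspring.append(allele)
        population = offspring
    return population


rng = random.Random(3)
start = [10, 11, 12, 13, 14, 15, 16, 17] * 25  # hypothetical equilibrium sample
bottlenecked = wright_fisher(start, n_e=25, generations=10, mu=5e-4, rng=rng)
print(len(set(start)), len(set(bottlenecked)))  # alleles before and after
```

A two-phase mutation function can be passed through the `mutate` argument so that the same mutation model is used before and after the bottleneck.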
We estimated statistical power to detect a bottleneck as the proportion of simulated bottlenecked populations for which M-ratios were significantly lower, or heterozygosity excess was significantly greater, than expected for populations in mutation-drift equilibrium. We estimated expected distributions of test statistics for populations in equilibrium using coalescent simulations (i.e. omitting the Wright–Fisher projections described above). Mutation model parameter values used to generate these expected distributions were termed ‘assumed’ values and represented assumptions made by investigators conducting bottleneck tests. For M-ratio tests, we assessed significance by comparing the mean M-ratio in each bottlenecked population to the lower 5% critical value (the value exceeded by 95% of simulated values) of the mean M-ratio in 500 equilibrium populations. We estimated the expected distribution for heterozygosity by simulating single-locus microsatellite genotypes from an equilibrium population until 200 replicate genotypes were obtained for each value of K (number of alleles) present in the bottlenecked population. We used a Wilcoxon signed-rank test to test for heterozygosity excess that paired heterozygosity in the bottlenecked population at locus i with the mean heterozygosity from the 200 replicates of the simulated equilibrium population, conditional on the number of alleles at locus i (Luikart & Cornuet 1998). The difference in testing procedures between the two metrics reflected differences in the manner in which the two bottleneck tests are typically implemented in the literature. All tests were one-tailed and only tested for population bottlenecks (i.e. heterozygosity excess or a reduction in the M-ratio).
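A minimal sketch of the M-ratio decision rule and the resulting rejection rate is given below. It assumes a one-tailed test at the nominal 0.05 level in which the critical value is taken from the lower tail of the simulated equilibrium distribution; the function names and the placeholder values are hypothetical.

```python
def lower_critical_value(equilibrium_values, alpha=0.05):
    """Critical value from the lower tail of simulated equilibrium statistics.

    With 500 simulated values and alpha = 0.05, this returns the 26th
    smallest value, so at most 5% of equilibrium values fall strictly below it.
    """
    ordered = sorted(equilibrium_values)
    index = int(alpha * len(ordered))
    return ordered[index]


def rejection_rate(test_values, critical_value):
    """Proportion of simulated populations with a statistic below the cutoff.

    Applied to bottlenecked populations this estimates power; applied to
    equilibrium populations it estimates the Type I error rate.
    """
    return sum(v < critical_value for v in test_values) / len(test_values)


# Placeholder equilibrium distribution of mean M-ratios (not real output).
equilibrium = [i / 500 for i in range(500)]
cutoff = lower_critical_value(equilibrium)
print(cutoff, rejection_rate(equilibrium, cutoff))  # 0.05 0.05
```

Using a strict inequality keeps the rejection rate in the equilibrium simulations at or below the nominal level when the assumed and true mutation models coincide.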
Parameter values used to conduct simulations quantifying the effects of sample size and bottleneck characteristics are provided in Table 1, and we describe the rationale for these values here. We set both assumed and true pg and δg to estimates derived from the 18 studies of microsatellite mutations we reviewed (0.22 and 3.1 respectively; see below), such that the mutation model was assumed ‘correctly’ for all simulations evaluating the statistical power of bottleneck tests. We simulated equilibrium (pre-bottlenecked) populations assuming θ was 1, 5 or 10 (where θ was pre-bottleneck genetic diversity, θ = 4Neμ in a diploid organism, and μ was the per generation mutation rate). We considered four post-bottleneck effective population sizes (Ne = 25, 50, 100 and 500) to estimate power to detect a large bottleneck (Ne = 25) as well as a range of smaller bottlenecks that included thresholds where inbreeding depression (Ne = 50) and the loss of additive genetic diversity (Ne = 500) may threaten viability (Franklin 1980; Soulé 1980). Note that the bottleneck scenarios considered here represent large percentage declines; for example when θ = 5, a post-bottleneck Ne of 25 represents a 98–99.9% decline and a post-bottleneck Ne of 500 represents a 60–99.6% decline assuming μ = 10⁻⁵–10⁻³. Populations were simulated for 1, 5, 10, 20, 30, 40 and 50 generations forward after the bottleneck to evaluate the effect of bottleneck duration on the ability of bottleneck tests to detect reductions in effective population size. We sampled the approximate median number of individuals and loci used in bottleneck studies (35 individuals and eight loci; see below) from simulated populations and also explored the ability to increase power by increasing both the number of individuals and loci (100 individuals and 16 loci).
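The percentage declines implied by a given θ follow directly from θ = 4Neμ. The short sketch below, with hypothetical function names, converts θ and μ to a pre-bottleneck Ne and then to a percentage decline for a chosen post-bottleneck Ne.

```python
def pre_bottleneck_ne(theta, mu):
    """Pre-bottleneck effective size implied by theta = 4 * Ne * mu (diploid)."""
    return theta / (4.0 * mu)


def percent_decline(theta, mu, post_ne):
    """Percentage reduction in Ne for a given post-bottleneck effective size."""
    return 100.0 * (1.0 - post_ne / pre_bottleneck_ne(theta, mu))


for mu in (1e-3, 1e-5):
    for post_ne in (25, 500):
        print(mu, post_ne, round(percent_decline(5, mu, post_ne), 2))
```

For θ = 5 and μ = 10⁻³ the implied pre-bottleneck Ne is 1250, so declines to Ne = 25 and Ne = 500 correspond to reductions of 98% and 60%; lower mutation rates imply larger pre-bottleneck populations and therefore larger percentage declines.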
Table 1. Simulation parameters used to estimate statistical power and Type I error rates as a function of sampling effort, demographic history and mutation model assumptions. Values in bold indicate approximate median values in published studies and n/a = not applicable.
Number of individuals
Number of loci
Pre-bottleneck genetic diversity (θ): 1, 5, 10
Post-bottleneck Ne: 25, 50, 100, 500
Number of generations since bottleneck: 1, 5, 10, 20, 30, 40, 50
Assumed proportion of multi-step mutations (pg)
True proportion of multi-step mutations (pg), Type I error rate assessment: 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7
Assumed mean size of multi-step mutations (δg)
True mean size of multi-step mutations (δg), Type I error rate assessment: 2.5, 3, 3.5, 4
Simulations used to assess the effect of erroneous assumptions about mutation model parameters on Type I error rates were conducted similarly, except that simulated populations did not go through a bottleneck and individuals were sampled from populations in mutation-drift equilibrium. Moreover, true and assumed values for pg and δg were allowed to differ to reflect mistaken assumptions about mutation model parameters. Type I error rates were estimated as the proportion of equilibrium populations that yielded significant M-ratio or heterozygosity-excess tests, using the testing methods described above. Deviations of assumed from true parameter values were interpreted to have an impact on tests when Type I error rates were greater than the nominal rate of 0.05.
Parameter values used to conduct simulations assessing the effects of erroneous assumptions about mutation models are presented in Table 1, and we describe the rationale for these values here. In separate simulations, we varied true pg from 0.1 to 0.7 by increments of 0.10 to reflect the approximate range of values observed among species in studies of microsatellite mutations (see below). We set assumed pg to 0.10 to represent the most commonly used value in bottleneck studies (see below) and assess the impacts of what are likely typical errors in assumptions about mutation models. We also conducted simulations with assumed pg set to 0.22 to reflect the estimate derived from the microsatellite mutation studies we reviewed (see below) and determine if using a more realistic value could improve error rates. For simulations evaluating the effects of erroneous assumptions of pg, we set both true and assumed δg to 3.1. We evaluated the impacts of mistaken assumptions about the size of multi-step mutations in a separate set of simulations, where we set true δg from 2.5 to 4.0 by increments of 0.5 to reflect the approximate range of values observed in microsatellite mutation studies (see below). We set assumed δg to 3.5 to reflect the most commonly used value in M-ratio tests and assess what are likely typical errors in assumptions about mutation models. We also set assumed δg to 3.1 to reflect the best estimate from the microsatellite mutation studies (see below) and determine if using a more realistic value could improve error rates. We set both the true and assumed pg to 0.22 when simulating the effects of mistaken assumptions about δg on Type I error rates.
Literature review of the application and effectiveness of bottleneck tests
Our literature review indicated that genetic bottleneck tests were generally conducted with modest sample sizes of both individuals (median = 38 and 31 for M-ratio and heterozygosity-excess tests respectively) and loci (median = 9 and 8 for M-ratio and heterozygosity-excess tests respectively; Table 2). In cases where demographic information was available, estimated bottlenecks were large both in terms of the percentage decline in census population size (median = 80 and 95% for M-ratio and heterozygosity-excess tests respectively) and the post-bottleneck census population size (median = 24 and 100 individuals for M-ratio and heterozygosity-excess tests respectively). Most (86%) of the studies that reported mutation model parameters only explored a single mutation model for both tests. The most commonly assumed pg was 0.10 for both test types, whereas the most commonly assumed δg was 3.5 for M-ratio tests and 3.1 for heterozygosity-excess tests.
Table 2. Sampling effort, bottleneck characteristics and mutation models in 105 peer-reviewed genetic bottleneck studies that used either M-ratios or heterozygosity excess to test for recent population declines. For sampling effort and bottleneck characteristics, n reflects the number of populations tested. For two-phase mutation model parameters, n reflects the number of unique tests conducted with different mutation models. θ = genetic diversity, Nc = census population size and n/a = not applicable. References for studies included in this summary are provided in Appendix S1
Columns (for each test type): median or mode; 10th, 90th percentile
Sampling effort and bottleneck characteristics: median number of individuals; median number of loci; median percent decline; median post-bottleneck Nc
Two-phase mutation model parameters: modal percent multi-step mutations (pg); modal mean multi-step mutation size (δg)
Our literature review indicated that genetic bottleneck tests often failed to detect bottlenecks in populations that investigators believed had experienced a reduction in abundance. M-ratios were significantly lower than expected under equilibrium in only 43–56% of 54 bottlenecked populations (n = 22 species; Tables 3 and S1, Supporting information). Moreover, the median M-ratio among 48 bottlenecked populations (n = 18 species) for which this value was available (0.710) was similar to the median M-ratio (0.749) among the 22 putatively stable populations (n = 6 species), and variability was high among bottlenecked populations (range = 0.170–0.986; Table S1, Supporting information). Statistically significant heterozygosity excess was detected in only 26–47% of 91 bottlenecked populations (n = 33 species) we reviewed, depending on whether significance was required for all mutation models considered, or whether only a single mutation model was required to yield a statistically significant result (Tables 3 and S1, Supporting information). Note that population declines often went undetected with both methods despite the fact that many bottlenecks were large. Indeed, the median reduction in census population size reflects declines to effective population sizes of only 3–14 individuals, assuming typical effective to census population size ratios of 0.11–0.14 (Frankham 1995; Palstra & Ruzzante 2008).
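The conversion from census to effective population size used in this argument is a simple scaling, sketched below with a hypothetical function name. The interpretation that the quoted range of 3–14 individuals combines the smaller median census size with the lower ratio and the larger median census size with the upper ratio is an assumption of this sketch.

```python
def effective_size_range(census_size, ratio_low=0.11, ratio_high=0.14):
    """Range of effective sizes implied by a census size and Ne/Nc ratios."""
    return census_size * ratio_low, census_size * ratio_high


# Median post-bottleneck census sizes reported for the two test types.
for census in (24, 100):
    low, high = effective_size_range(census)
    print(census, round(low, 1), round(high, 1))
```

A census size of 24 gives an effective size of roughly 3 at the lower ratio, and a census size of 100 gives 14 at the upper ratio.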
Table 3. Proportion of bottlenecked and putatively stable populations for which M-ratio and heterozygosity-excess tests were statistically significant derived from 105 peer-reviewed articles. When multiple mutation models were explored, we determined significance based on whether all mutation models yielded significant tests or whether at least one mutation model yielded a significant test. npop and nsp were the sample sizes of populations and species respectively. The proportion of significant tests was calculated using npop
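Population status, significance criterion: M-ratio (npop, nsp); Heterozygosity excess (npop, nsp)
Bottlenecked, at least one mutation model significant: 0.56 (54, 22); 0.47 (91, 33)
Bottlenecked, all mutation models significant: 0.43 (54, 22); 0.26 (91, 33)
Stable, at least one mutation model significant: 0.13 (32, 10); 0.20 (41, 16)
Stable, all mutation models significant: 0.13 (32, 10); 0.04 (41, 16)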
Both methods detected bottlenecks in putatively stable populations somewhat more frequently than expected assuming a significance level of 0.05 (Tables 3 and S2, Supporting information). Specifically, the frequency of Type I errors for M-ratio and heterozygosity-excess tests was 0.13 (n = 32 populations of 10 species) and 0.05–0.20 (n = 41 populations of 16 species) respectively, depending on whether one or all mutation models were required to yield a significant result. While these Type I error rates were not excessively high, limited sample sizes could have prevented violations of assumptions from resulting in more Type I errors (see below).
Review of studies of microsatellite mutations in vertebrates
Information associated with 592 mutations detected in the 18 studies of microsatellite evolution in vertebrates is summarized in Table 4. Some ‘non-model’ species had higher mutation rates than usually cited for microsatellites (the highest being 1.8 × 10⁻² mutations per generation in barn swallows; Hirundo rustica), and it could be argued that mutations at these loci were (i) not generally representative of mutations in microsatellites; or (ii) the result of allelic mismatches caused by incorrect paternity assignments. Indeed, high mutation rates in some species may have reflected ascertainment bias during marker development where the most polymorphic loci were targeted for use in mutation studies. However, we did not detect a relationship between the mutation rate and either the proportion or the mean size of multi-step mutations using linear regression (pg: P = 0.35, R² = 0.12; δg: P = 0.92, R² < 0.01) based on the studies with at least 20 mutations. Moreover, highly polymorphic loci are also frequently used in conservation and population genetic studies; for example the loci used to characterize mutations in kangaroo rats (Dipodomys spectabilis) were also used to test for bottlenecks in this species (Busch et al. 2007). It is also unlikely that the mutations we considered were an artefact of incorrect paternity assignments because 68% of single-step mutations and 66% of multi-step mutations that could be attributed to an individual parent were inherited from the known rather than the inferred parent. For these reasons, we believe that our estimates of mutation model parameters provided a reasonable representation of microsatellite evolution in vertebrates.
Table 4. Estimates of microsatellite mutation parameters in 15 vertebrate species derived from 18 studies using known or inferred pedigrees
pg = the proportion of multi-step mutations, δg = the mean size of multi-step mutations, nmeioses = number of meiotic events, nmut = number of mutations, nmulti = number of multi-step mutations. The per generation mutation rate was estimated by dividing nmut by nmeioses and pg was estimated by dividing nmulti by nmut. n/p = not provided by the authors of the study and n/a = not applicable.
We estimated that pg = 0.22 based on the studies summarized in Table 4, a value that was considerably greater than (i) the values recommended by Garza & Williamson (2001) and Piry et al. (1999) for M-ratio and heterozygosity-excess tests; and (ii) the most commonly used value in bottleneck studies (i.e. pg = 0.10). The proportion of multi-step mutations varied significantly among species for which at least 20 mutations were characterized based on a chi-square test (P < 0.001), and ranged from 0.12 to 0.68 (Table 4). We estimated that δg = 3.1, a value that was identical to the recommendation of Piry et al. (1999) for heterozygosity-excess tests, but slightly lower than the value that Garza & Williamson (2001) recommended for M-ratio tests (δg = 3.5). The mean size of multi-step mutations was significantly different among species for which at least nine multi-step mutations were observed based on one-way analysis of variance (F4,73 = 3.74, P = 0.007), and ranged from 2.6 to 4.0 repeats (Table 4).
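The parameter estimates described above follow directly from pedigree counts, as defined in the notes to Table 4. A minimal sketch of that arithmetic is given below; the function names and the example counts are hypothetical and are not values from Table 4.

```python
def mutation_parameters(n_meioses, n_mut, n_multi):
    """Return (per-generation mutation rate, pg) from pedigree counts."""
    if n_meioses <= 0 or n_mut <= 0:
        raise ValueError("n_meioses and n_mut must be positive")
    if not 0 <= n_multi <= n_mut:
        raise ValueError("n_multi must lie between 0 and n_mut")
    rate = n_mut / n_meioses  # mutations per meiosis
    pg = n_multi / n_mut      # proportion of multi-step mutations
    return rate, pg


def mean_multistep_size(step_sizes):
    """Mean absolute size (in repeats) of mutations of two or more steps."""
    multi = [abs(s) for s in step_sizes if abs(s) >= 2]
    if not multi:
        return float("nan")
    return sum(multi) / len(multi)


# Hypothetical study: 2000 meioses, 25 mutations, 5 of them multi-step.
rate, pg = mutation_parameters(n_meioses=2000, n_mut=25, n_multi=5)
# Hypothetical signed step sizes; only |step| >= 2 contribute to delta_g.
delta_g = mean_multistep_size([1, -1, 2, 1, -3, 1, 4])
print(rate, pg, delta_g)
```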
Statistical power of genetic bottleneck tests
Simulation analyses indicated that genetic bottleneck tests can have limited power to detect population declines, particularly given the limited sample sizes of individuals and loci typically employed in bottleneck tests. For M-ratios, power to detect large reductions in effective population size (Ne = 25) 10 generations after the bottleneck was modest (0.63) using approximate median sample sizes and assuming θ = 5, but power to detect smaller declines was considerably lower (0.38–0.08 for Ne = 50–500; Fig. 1a). Power to detect bottlenecks was greater when θ = 10, but was still only 0.38 when post-bottleneck Ne = 100. Power was very low (≤0.11) for the range of post-bottleneck Ne considered when θ = 1 (Fig. 1a). Power to detect bottlenecks of Ne = 50 could be increased to reasonable levels (0.80) by increasing sampling effort to 100 individuals and 16 loci when θ = 5, but remained low for larger bottlenecks when θ = 1 (Fig. 1b). Power to detect bottlenecks with M-ratios was greater when populations were sampled longer (up to 50 generations) after the bottleneck, but was generally very low when populations were sampled only 1 or 5 generations after the bottleneck (Fig. 2a).
Power to detect bottlenecks after 10 generations with heterozygosity-excess tests was low (≤0.27) for all post-bottleneck effective population sizes considered, given the approximate median sample sizes when θ = 1 or 5 (Fig. 1c). Moreover, power was modest (<0.60) for all post-bottleneck effective population sizes even with elevated sampling effort (100 individuals genotyped at 16 loci) when θ = 5 (Fig. 1d). Power to detect declines to Ne = 25 and 50 could be increased to reasonable levels (>0.70) with greater sampling effort when θ = 1. It was not possible to estimate statistical power for heterozygosity-excess tests conditional on allele number when θ = 10 because bottlenecks in populations with high genetic diversity often resulted in allele numbers that were lower than allele numbers in the expected data (Fig. S1, Supporting information). Low power for heterozygosity-excess tests was not an artefact of arbitrarily modelling a bottleneck duration of 10 generations; power was <0.40 for bottlenecks lasting from 1 to 50 generations with the approximate median sample size and θ = 5, regardless of post-bottleneck Ne (Fig. 2b).
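For reference, the two summary statistics whose power is reported above can be computed per locus as sketched below. The M-ratio is the number of alleles k divided by the allele size range r in repeat units (r = largest minus smallest allele size, plus one; Garza & Williamson 2001), and gene diversity is computed here with the usual small-sample correction. The function names and the example locus are hypothetical. The equilibrium heterozygosity against which an observed value is compared in a heterozygosity-excess test must be obtained by simulation under an assumed mutation model and is not shown.

```python
from collections import Counter


def m_ratio(allele_sizes):
    """M = k / r for one locus; allele sizes are given in repeat units."""
    k = len(set(allele_sizes))                     # number of distinct alleles
    r = max(allele_sizes) - min(allele_sizes) + 1  # allele size range
    return k / r


def expected_heterozygosity(allele_sizes):
    """Unbiased gene diversity: n / (n - 1) * (1 - sum of squared frequencies)."""
    n = len(allele_sizes)  # number of sampled gene copies
    if n < 2:
        raise ValueError("at least two gene copies are required")
    counts = Counter(allele_sizes)
    sum_p2 = sum((c / n) ** 2 for c in counts.values())
    return n / (n - 1) * (1 - sum_p2)


# Hypothetical locus: 10 gene copies carrying alleles of 10, 11, 12 and 15 repeats.
locus = [10, 10, 10, 11, 11, 12, 12, 12, 15, 15]
m_value = m_ratio(locus)
he_value = expected_heterozygosity(locus)
print(round(m_value, 3), round(he_value, 3))
```

In the hypothetical locus, the missing 13- and 14-repeat alleles leave gaps in the size range, which lowers M below 1.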
Effects of violations of mutation model assumptions on genetic bottleneck tests
Simulation analyses indicated that heterozygosity-excess tests were reasonably robust to incorrect assumptions about mutation models, but M-ratio tests were quite sensitive to differences between the true and assumed proportion of multi-step mutations (Fig. 3). Specifically, Type I error rates for heterozygosity-excess tests were ≤0.11, regardless of assumptions of pg, whereas error rates for M-ratio tests increased rapidly with increasing differences between true and assumed values for pg. The error rate for M-ratio tests was estimated to be 0.20 when pg was assumed to be 0.10 (the most common value used in bottleneck studies), but the true value for pg was 0.22 (the overall estimate across microsatellite mutation studies). However, pg is generally unknown in the species of interest and true pg may often be >0.22 (Table 4), which can result in even higher error rates (Fig. 3). In an extreme case, the error rate was estimated to be 0.95 when true pg equalled 0.70 (the approximate estimate for zebrafish, Danio rerio) and pg was assumed to be 0.10. Error rates could be reduced considerably by assuming that pg = 0.22 instead of 0.10; for example when true pg = 0.3, error rates could be reduced from 0.43 to 0.10. Nevertheless, error rates were high (≥0.28) when the true pg was 0.40 or greater, even if pg was assumed to be 0.22.
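To make the roles of pg and δg concrete, the sketch below draws mutation step sizes under one possible two-phase model. It assumes that a mutation is multi-step with probability pg, that multi-step sizes follow a geometric distribution shifted to start at two repeats with mean δg, and that gains and losses of repeats are equally likely. These are illustrative assumptions; software implementing bottleneck tests may parameterize the model differently, and the function name is hypothetical.

```python
import math
import random


def tpm_step(pg, delta_g, rng):
    """Draw one signed mutation step (in repeat units) under a two-phase model."""
    if not 0.0 <= pg <= 1.0:
        raise ValueError("pg must lie in [0, 1]")
    if delta_g < 2.0:
        raise ValueError("delta_g must be at least 2 repeats")
    sign = rng.choice((-1, 1))  # assumed symmetric gains and losses
    if rng.random() >= pg:
        return sign             # single-step mutation
    # Multi-step size = 1 + G, where G is geometric on {1, 2, ...}
    # with mean delta_g - 1, so the size has mean delta_g.
    p = 1.0 / (delta_g - 1.0)
    if p >= 1.0:
        return 2 * sign
    u = 1.0 - rng.random()      # uniform on (0, 1], avoids log(0)
    g = int(math.log(u) / math.log(1.0 - p)) + 1
    return sign * (1 + g)


rng = random.Random(1)
steps = [tpm_step(pg=0.22, delta_g=3.1, rng=rng) for _ in range(20000)]
multi = [abs(s) for s in steps if abs(s) >= 2]
observed_pg = len(multi) / len(steps)
observed_delta_g = sum(multi) / len(multi)
print(round(observed_pg, 3), round(observed_delta_g, 2))
```

Simulating with one value of pg and testing with another is the kind of mismatch whose effect on Type I error is summarized in Fig. 3.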
Both M-ratio and heterozygosity-excess tests were reasonably robust to incorrect assumptions of δg when pg was assumed correctly. Specifically, Type I error rates did not exceed 0.12 across the range of δg observed in studies of microsatellite mutations for either test type (data not shown).
Our review suggests that, as typically applied, bottleneck tests often fail to detect declines in populations known to have experienced population reductions and typically have low statistical power as a result of limited sample sizes. Indeed, power to detect even very large bottlenecks based on the level of heterozygosity excess is generally limited, whereas power to detect bottlenecks with M-ratios depends on many factors, but was likely low in many empirical studies of vertebrates of conservation concern. Our results are similar to those of a recent analysis by Girod et al. (2011), who found limited power to detect 10- to 1000-fold population declines with heterozygosity-excess tests and 10-fold declines with M-ratio tests using similar sample sizes of loci and individuals. However, Girod et al. (2011) used a stepwise mutation model, whereas we showed that power can be low for a more realistic two-phase mutation model. In addition, we showed that power to detect population declines with M-ratio tests was particularly low when populations were sampled one and five generations after the bottleneck and generally did not reach an asymptote until 20 generations under typical sampling designs. Thus, bottlenecked populations could experience the negative impacts of small population processes for many generations before even the more powerful of the two tests we evaluated has a reasonable probability of detecting the bottleneck.
Studies employing bottleneck tests should, when possible, employ larger sample sizes of individuals and loci than used in the studies we reviewed (median = 8–9 loci and 31–38 individuals). In cases where a bottleneck test fails to reject mutation-drift equilibrium, inference may be improved by estimating power to detect a bottleneck across a range of possible demographic histories given the available sample size. Only one of the 105 studies we reviewed conducted such an assessment. If a test fails to detect a bottleneck but simulations reveal that statistical power was high, inference that a bottleneck did not occur is more robust. Nevertheless, a thorough discussion of alternative biological explanations for non-significance such as the potential effects of immigration or a brief bottleneck is warranted. Quantitative assessments of the possible role of factors such as immigration and bottleneck duration in determining statistical significance can help provide support for alternative hypotheses (Hundertmark & Van Daele 2010). However, none of the studies we reviewed conducted a quantitative analysis of the possible confounding effects of immigration on their findings, although wild populations are rarely completely closed and even small numbers of migrants can mask the genetic signature of bottlenecks (Keller et al. 2001; Busch et al. 2007). Inference can also be strengthened by comparing genetic variation before and after the bottleneck [e.g. with museum specimens; Beaumont (2003); Leonard (2008)] when such data are available because multi-point sampling methods are generally more powerful in a statistical sense (Ramakrishnan et al. 2005). Comparing genetic variation in the population of interest to variation in putatively stable populations of the same species can also add insight (Ennen et al. 2010; Hundertmark & Van Daele 2010), although the choice of comparative data needs to be made with care as the evolutionary history of populations can differ in many ways.
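The power assessment recommended above can be organized as a simple Monte Carlo procedure: simulate the test statistic under a stable population to obtain a critical value, then simulate it under a hypothesized bottleneck and record how often the critical value is exceeded. The sketch below shows only this bookkeeping for a lower-tailed test (as for M-ratios, where small values indicate a bottleneck). The two simulators are toy normal distributions used as stand-ins; in a real analysis they would be replaced by coalescent or forward simulations of microsatellite data under the actual sampling design. All names and numerical values are hypothetical.

```python
import random


def estimate_power(simulate_null, simulate_alt, n_reps=2000, alpha=0.05):
    """Monte Carlo power of a lower-tailed test on a summary statistic."""
    # Critical value: the empirical alpha quantile of the null distribution.
    null_stats = sorted(simulate_null() for _ in range(n_reps))
    critical = null_stats[int(alpha * n_reps)]
    # Power: fraction of alternative replicates falling below the critical value.
    rejections = sum(simulate_alt() < critical for _ in range(n_reps))
    return rejections / n_reps


rng = random.Random(42)


def stable():
    """Toy stand-in for the statistic in a stable population."""
    return rng.gauss(0.80, 0.05)


def bottlenecked():
    """Toy stand-in for the statistic after a hypothesized decline."""
    return rng.gauss(0.70, 0.05)


power = estimate_power(stable, bottlenecked)
type_one = estimate_power(stable, stable)  # should be close to alpha
print(round(power, 2), round(type_one, 2))
```

Repeating the calculation over a grid of post-bottleneck Ne, bottleneck durations and sample sizes would give the range of demographic histories over which a non-significant result is informative.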
We also showed that bottleneck studies often assume values for mutation model parameters that may not be representative of microsatellite evolution in vertebrates and that the proportion of multi-step mutations is, on average, underestimated by a factor of approximately two. Whereas heterozygosity-excess tests are reasonably robust, M-ratio tests can be quite sensitive to mistaken assumptions about the proportion of multi-step mutations, which can lead to high probabilities of inferring that bottlenecks occurred in stable populations. Moreover, assuming a single set of mutation model parameter values (as was the case in 86% of bottleneck studies) is not sufficient to infer a bottleneck, given that parameter values appear to vary widely among vertebrates. Inference is stronger when results are consistent across a range of possible values for the proportion of multi-step mutations (Guinand & Scribner 2003). Requiring significance across the full range of values for pg observed in microsatellite mutation studies (up to 0.68) will, however, likely result in highly stringent tests. Rather, we suggest that future studies explore a range of values up to at least 0.22, given that this value represented the ‘average’ estimate from studies of microsatellite evolution in vertebrates. Indeed, assuming that pg = 0.22, as opposed to pg = 0.10, reduces Type I error rates for M-ratio tests considerably across a range of values for true pg, although Type I error rates will still likely be high if true pg ≥ 0.40 (Fig. 3). If statistical significance depends on the assumed proportion of multi-step mutations within this range, we believe the most appropriate inference is that the possible occurrence of a bottleneck remains uncertain.
Differences in sensitivities to violations of assumptions about mutation models between M-ratio and heterozygosity-excess tests also have important implications for assessing the timing of population declines with bottleneck tests. In principle, heterozygosity excess is expected to regain mutation-drift equilibrium more rapidly than M-ratios because new alleles arising from mutations do not necessarily increase M-ratios (Garza & Williamson 2001) and M-ratios change more slowly than heterozygosity immediately following a population bottleneck (Williamson-Natesan 2005). As a result, researchers often infer that a bottleneck occurred historically when M-ratio but not heterozygosity-excess tests are significant (Spear et al. 2006; Henry et al. 2009; Marshall et al. 2009), and conversely, that the population decline occurred more recently when heterozygosity excess, but not M-ratios, is significantly different from expectations (Funk et al. 2010; Hundertmark & Van Daele 2010). Our simulations indicated that inferring the timing of a population bottleneck based on differences in statistical significance between tests is probably not reasonable because M-ratio and heterozygosity-excess tests are affected differently by violations of assumptions regarding mutation models. Specifically, for M-ratio tests the probability of incorrectly rejecting the null hypothesis can increase dramatically as the assumed proportion of multi-step mutations falls below the true value, but heterozygosity-excess tests are relatively robust to assumptions about this parameter (Fig. 3). Even when the mutation model is assumed correctly, power to detect a bottleneck can differ between M-ratio and heterozygosity-excess tests for reasons other than the timing of the decline. For example, statistical power for heterozygosity-excess tests may be greater than for M-ratio tests when pre-bottleneck genetic diversity is low, but heterozygosity-excess tests may be less powerful than M-ratio tests when pre-bottleneck genetic diversity is high (Fig. 1).
Current trends and future directions for characterizing demographic history
The use of summary statistics such as M-ratios and heterozygosity to detect population bottlenecks is transitioning to less ad hoc methods for characterizing demographic history. In particular, Bayesian methods that estimate posterior distributions of demographic parameters and evaluate competing models of demographic history are undergoing rapid development and are increasingly being adopted by empirical studies (Beaumont 1999, 2010; Bertorelle et al. 2010). Full Bayesian methods make use of coalescence theory and Markov chain Monte Carlo sampling to estimate the likelihood of genealogical trees of sampled genes conditional on model parameters such as effective population size and the timing of demographic changes. Indeed, Bayesian methods implemented in program MSvar (Beaumont 1999; Storz & Beaumont 2002) can have a higher probability of detecting bottlenecks than M-ratio or heterozygosity-excess tests, and estimates of population size and bottleneck duration seem more robust to certain violations of mutation model assumptions (Girod et al. 2011). Moreover, Bayesian skyline plots can provide estimates of effective population size between individual coalescent events (Drummond et al. 2005; Drummond & Rambaut 2007). Approximate Bayesian Computation (ABC) methods allow for the implementation of more complex population models as they do not rely on computing the full likelihood (Beaumont 2010; Bertorelle et al. 2010). Hoffman et al. (2011) recently used ABC to reconstruct accurately the demographic history of the heavily exploited Antarctic fur seal (Arctocephalus gazella) when the bottleneck tests described herein failed to detect a decline. Moreover, statistical software packages have recently been developed for both microsatellite and sequence data that make these methods accessible to non-experts (Cornuet et al. 2008, 2010; Wegmann et al. 2009). Nevertheless, ABC methods for inferring demographic history require careful consideration of a number of factors including alternative models, which summary statistics to use in the estimation procedure and the number of simulations, as well as the sensitivity of results to these choices (Bertorelle et al. 2010). Moreover, additional assessments will be necessary to determine how much sampling effort is required to achieve adequate levels of precision in parameter estimates and reliably discriminate among competing models, as well as to understand how sensitive results are to violations of model assumptions.
The ability to sequence a large fraction of the genome of non-model species with next-generation sequencing technologies has the potential to improve the characterization of demographic history in several ways. Indeed, these technologies offer the opportunity to sequence hundreds of nucleotides at thousands of loci across individual genomes (Davey et al. 2011), which will increase power to detect recent bottlenecks and characterize demographic history by increasing the number of loci that can be analysed with methods described above (Allendorf et al. 2010). Next-generation sequencing technologies now make it possible to identify large numbers of single nucleotide polymorphism (SNP) loci, but it remains unclear if SNPs will improve the power of bottleneck tests because the loss of alleles in bottlenecked populations generally results in monomorphic loci (Morin et al. 2004). However, sequencing large fractions of genomes makes it possible to apply at least two newer classes of genetic methods that are based on haplotype data to questions of demographic history. The first of these involves linkage tract-based methods, where linkage tracts represent regions of DNA that are identical by descent (IBD) between gene copies present in two individuals (Albrechtsen et al. 2009; Gusev et al. 2009; Pool et al. 2010). The length of linkage tracts is a function of recent effective population size because gene copies are more likely to be identical by descent in small populations where, on average, individuals are more closely related than in large populations. Indeed, a recent study was able to estimate effective population size based on the distribution of linkage-tract lengths in an isolated human population only 23 generations ago (Gusev et al. 2012). The second approach involves the use of the allele frequency spectrum, which is a measure of the observed frequency distribution of alleles at a large number of loci [Pool et al. (2010); see also Luikart et al. (1998) for a similar test using microsatellites]. In principle, a bottleneck can be inferred if the allele frequency spectrum in a sample of individuals taken from a population of interest contains fewer rare alleles than expected. To the best of our knowledge, no study has yet tested for population bottlenecks in a species of concern using either allele frequency spectrum or linkage tract data derived from next-generation sequencing, but we suspect that the use of such methods in conservation will increase dramatically over the coming years.
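The allele frequency spectrum comparison described above can be illustrated with a short sketch for biallelic SNP data. It assumes that the derived allele at each locus is known, and it uses the standard neutral expectation for a population of constant size, in which the expected number of loci with i copies of the derived allele in the sample is proportional to 1/i. The function names and the example counts are hypothetical, and a formal test would also need to account for sampling variance and ascertainment of loci.

```python
from collections import Counter


def site_frequency_spectrum(derived_counts, n_copies):
    """Proportion of polymorphic loci in each derived-allele count class 1..n-1."""
    counts = Counter(c for c in derived_counts if 0 < c < n_copies)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no polymorphic loci in the sample")
    return [counts.get(i, 0) / total for i in range(1, n_copies)]


def neutral_expectation(n_copies):
    """Expected proportions under constant size and neutrality (weights 1/i)."""
    weights = [1 / i for i in range(1, n_copies)]
    norm = sum(weights)
    return [w / norm for w in weights]


# Hypothetical derived-allele counts at nine SNP loci in six sampled gene copies.
observed = site_frequency_spectrum([1, 1, 2, 3, 3, 4, 5, 2, 3], n_copies=6)
expected = neutral_expectation(n_copies=6)
# A deficit in the first class (singletons) is the pattern of interest.
print(round(observed[0], 3), round(expected[0], 3))
```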
BNR was funded in part by a STAR Fellowship from the US Environmental Protection Agency, EDB was funded by a National Science Foundation IGERT Fellowship and PJP was funded in part by a Hrdy Fellowship in Conservation Biology at Harvard University. We thank four anonymous reviewers for helpful comments on earlier drafts of this work.
M.Z.P. is a conservation biologist who investigates the demography and genetics of endangered species. R.K. studies the molecular and trophic ecology of mammals and birds. B.N.R. studies comparative population dynamics and landscape genetics in freshwater turtles. R.S. is an applied ecologist interested in the conservation of reptiles, amphibians, and birds. S.J.R. is a wildlife ecologist using molecular methods to investigate gene flow and disease dynamics in wildlife populations. C.V.-C. uses molecular methods to support the conservation of Neotropical mammals threatened by illegal traffic and exploitation. J.N.P. is a mammalian ecologist interested in a range of applied and basic questions in mammalian population and trophic ecology. P.J.P. is interested in the application of genetic data to conservation, ecology and evolution.