Distribution modelling and statistical phylogeography: an integrative framework for generating and testing alternative biogeographical hypotheses

Authors

  • Corinne L. Richards,

    Corresponding author
  • Bryan C. Carstens,

    1. Department of Ecology and Evolutionary Biology, 1109 Geddes Ave, Museum of Zoology, University of Michigan, Ann Arbor, MI 48109-1079, USA. New address Department of Biological Sciences, 202 Life Sciences Building, Louisiana State University, Baton Rouge, LA 70803, USA
  • L. Lacey Knowles

    1. Department of Ecology and Evolutionary Biology, 1109 Geddes Ave, Museum of Zoology, University of Michigan, Ann Arbor, MI 48109-1079, USA. New address Department of Biological Sciences, 202 Life Sciences Building, Louisiana State University, Baton Rouge, LA 70803, USA

*Corinne L. Richards, Department of Ecology and Evolutionary Biology, 1109 Geddes Ave, Museum of Zoology, University of Michigan, Ann Arbor, MI 48109-1079, USA. E-mail: clrichar@umich.edu

Abstract

Statistical phylogeographic studies contribute to our understanding of the factors that influence population divergence and speciation, and that ultimately generate biogeographical patterns. The use of coalescent modelling for analyses of genetic data provides a framework for statistically testing alternative hypotheses about the timing and pattern of divergence. However, the extent to which such approaches contribute to our understanding of biogeography depends on how well the alternative hypotheses chosen capture relevant aspects of species histories. New modelling techniques, which explicitly incorporate spatio-geographic data external to the gene trees themselves, provide a means for generating realistic phylogeographic hypotheses, even for taxa without a detailed fossil record. Here we illustrate how two such techniques – species distribution modelling and its historical extension, palaeodistribution modelling – in conjunction with coalescent simulations can be used to generate and test alternative hypotheses. In doing so, we highlight a few key studies that have creatively integrated both historical geographic and genetic data and argue for the wider incorporation of such explicit integrations in biogeographical studies.

Introduction

Biogeographical research seeks to identify the processes structuring organismal diversity at a variety of geographic and taxonomic scales, from community patterns of species richness to higher-order taxonomic study. Molecular data feature prominently in contemporary biogeographical studies because patterns of genetic variation, when interpreted in the context of geography, can provide insights into the demographic and biogeographical history of species (Avise et al., 1987; Avise, 2000; Knowles & Maddison, 2002). However, whereas the relationship between geographic distribution and genetic variation is central to biogeography, as Kidd & Ritchie (2006) recently noted, phylogeographic research has to date placed most of its emphasis on the ‘phylo’ component, and much less on ‘geography’, despite the inherent information that the spatial-geographic component contains about the evolutionary past. These authors illustrate the potential of new GIS-based techniques to bring phylogeography back into balance, not only allowing a more powerful investigation of the geographic components of genetic variation, but also facilitating the formation of historical biogeographical hypotheses. We argue that GIS-based approaches to generating such alternative hypotheses, when coupled with genetic approaches to testing them, have the potential to increase profoundly the rigour of phylogeographic research. Herein we aim to provide readers with the necessary tools and conceptual background to take advantage of this powerful combination of distribution and coalescent-based modelling techniques in generating and testing biogeographical hypotheses. This approach has broad utility given that the required data can be readily generated for many taxa.

Improving phylogeographic studies through hypothesis testing

In phylogeography, intraspecific genetic data are interpreted in a geographic context to infer historical and contemporary population structure and demography (Avise et al., 1987; Avise, 1989, 2000). The processes generating such genetic structure will differ among species, and may include demographic events such as population bottlenecks and expansions, as well as various types of population divergence, ranging from vicariant events to differentiation with migration (reviewed in Knowles, in press). Whereas traditional phylogeographic studies have been applied in many contexts, they have been particularly informative about the biogeographical consequences of climate change. For example, a number of studies have detected population bottlenecks coincident with the restriction of species distributions to disjunct refugia during the Earth’s most recent glacial cycles (Cook et al., 2001; McCracken et al., 2001; Fedorov & Stenseth, 2002; Carstens et al., 2004; Knowles & Richards, 2005; Steele & Storfer, 2006). Other applications of phylogeographic analyses include inferring post-glacial colonization routes (Bernatchez & Wilson, 1998; Taberlet et al., 1998; Hewitt, 2000), defining species boundaries (da Silva & Patton, 1998), and assigning and assessing conservation priorities (Avise, 1992; Moritz & Faith, 1998; Richards & Knowles, 2007). Phylogeographic comparisons across codistributed taxa can also be informative about changes in the community structure of biogeographical regions over time (e.g. Schneider et al., 1998; Riddle et al., 2000; Sullivan et al., 2000; Carstens et al., 2005a; Riginos, 2005). To date, most descriptions of genetic variation and the underlying processes generating it have focused on the contemporary geographic distribution of the focal taxon (but see Hugall et al., 2002).

Because biogeography and phylogeography are concerned with historical events that cannot be directly observed or experimentally replicated, our understanding of these fields is necessarily shaped by the identification of positive evidence, that is, evidence by which one of several competing historical hypotheses is identified as more probable than the others (Cleland, 2001). In this situation, tests of competing hypotheses that represent a range of possible explanations for a given phenomenon (Chamberlin, 1890) provide a framework for exploring alternative historical scenarios. Whereas phylogeographers have traditionally formulated hypotheses about the events (e.g. vicariance or migration) leading to an observed population genetic structure by comparing the shape of the genealogy with the geographic distribution of the species (e.g. Avise, 2000), this descriptive approach is prone to over-interpretation (Edwards & Beerli, 2000; Knowles & Maddison, 2002; Hudson & Turelli, 2003; Wakeley, 2003; Knowles, 2004). Because of the stochasticity of gene-lineage coalescence (Kingman, 1982; Hudson, 1992), the geographic distribution of genetic variation may not accurately reflect the population history (Pamilo & Nei, 1988; Takahata, 1989; Hudson & Coyne, 2002).

To avoid the potential problems that arise when the genealogical history of a locus is implicitly equated with the population history (i.e. interpretations concerning the biogeographical and demographic past are based on a visual inspection of a gene tree), the analysis of genetic data can proceed by means of statistical phylogeographic approaches (Knowles & Maddison, 2002), whereby the stochasticity of genetic processes is explicitly considered (Hudson, 1990; Wakeley, 2007). However, statistical phylogeographic inferences rely on explicit models of historical scenarios (e.g. divergence with gene flow, isolation by distance, or population expansion). The choice of a model may be guided by a variety of factors. For example, decisions regarding the potential geographic configuration and temporal sequence of population divergence could be based on fossil data (e.g. Brunhoff et al., 2003), packrat middens (Cognato et al., 2003), palaeoenvironmental data (Tribsch & Schonswetter, 2003), or possibly be estimated from multi-locus data sets (Knowles & Carstens, 2007). However, such data are not available for all species. Herein we provide a step-by-step demonstration of how species distribution modelling techniques, coupled with palaeoclimate estimates, can provide the information necessary for generating alternative models (e.g. hypotheses about past population structure and likely corridors for migration) in cases for which no external information on past distributions has previously been available. We then walk through the steps involved in using empirical genetic data to test such hypotheses in a coalescent framework.

A brief methodological outline

There are two major components to the coupled distribution and genetic-modelling approach: (1) generating alternative phylogeographic hypotheses for the empirical data, and (2) statistically testing these hypotheses. Each of these components involves a series of steps (see Fig. 1) and one or more modelling techniques, which will be described in detail in the following sections.

Figure 1.

 Schematic describing the process of generating alternative biogeographical hypotheses using palaeodistribution models and of testing them using coalescent simulations and empirical genetic data. In (e) and (f) the gradient from red to white differentiates areas with predicted high to low suitability, respectively, for the species in question.

In terms of generating alternative biogeographical hypotheses (Component I below), the necessary data consist of a set of GIS layers containing information about the pertinent aspects of the current environment for the geographic area and species of interest (Fig. 1a), a set of georeferenced localities that describe where the species has been documented to occur (Fig. 1b), and, for the case of palaeodistributions, a second set of GIS layers describing an estimate of the environment at a particular time period of interest in the past (Fig. 1d). Using these inputs and any of several species distribution modelling algorithms (Fig. 1c), both the current (Fig. 1e) and past (Fig. 1f) distributions of the focal species can be estimated. These estimates of a species’ past distributions, or palaeodistribution models, can then guide the generation of alternative biogeographical hypotheses (Fig. 1g).

The testing of alternative biogeographical hypotheses requires two inputs: a set of data simulated under the respective population models that represent the biogeographical hypotheses (Fig. 1h,i), and an empirical genetic data set. Each replicate of the simulated data can be characterized using a summary statistic (see Knowles, in press), generating an expectation for the pattern of genetic variation under a specific biogeographical hypothesis (Fig. 1j). The same summary statistic can then be computed for the empirical genetic data and compared with that of the simulated data for a statistical evaluation of the biogeographical hypotheses (Fig. 1k). These steps are explained in detail below in the subsection Component II.
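The comparison in Fig. 1k amounts to asking how often data simulated under a hypothesis yield a summary statistic at least as extreme as the empirical value. A minimal Python sketch of that calculation follows; the function name, the one-tailed design, and the add-one correction are illustrative assumptions rather than part of the procedure described here.

```python
def empirical_p_value(simulated, observed, tail="upper"):
    """Proportion of simulated statistics at least as extreme as the observed one.

    simulated : summary statistics, one per simulated replicate (hypothetical input)
    observed  : the same statistic computed from the empirical data
    tail      : "upper" if large values contradict the hypothesis, "lower" otherwise
    """
    if tail == "upper":
        extreme = sum(1 for s in simulated if s >= observed)
    elif tail == "lower":
        extreme = sum(1 for s in simulated if s <= observed)
    else:
        raise ValueError("tail must be 'upper' or 'lower'")
    # Add-one correction (an assumption of this sketch): with a finite number
    # of replicates the estimated p-value is never exactly zero.
    return (extreme + 1) / (len(simulated) + 1)
```

A small p-value would suggest that the empirical data are unlikely under the simulated hypothesis; which tail is appropriate depends on the statistic chosen.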

Component I: Generating Alternative Biogeographical Hypotheses

Generating a set of alternative hypotheses about the biogeographical history of a taxon of interest should be one of the first steps in any phylogeographic study. However, this task has historically been difficult as information about past distributions, other than what might be inferred from the empirical genetic data (e.g. Avise, 2000), is sparse to non-existent for many taxa. In this section we describe how species distribution modelling techniques can be used to generate models of species’ past distributions. First we provide a brief introduction to species distribution modelling, including empirical applications that illustrate how the integration of phylogeographic and species distribution modelling techniques can improve our understanding of the processes influencing contemporary patterns of biodiversity. The available algorithms and data sources for distribution modelling, as well as those relevant to generating palaeodistribution models, are then discussed along with the potential sources of error and limitations of these approaches. Finally, we describe how the resulting palaeodistribution estimate can be translated into a set of alternative biogeographical hypotheses, which can then be statistically tested using coalescent simulations.

Applications of species distribution modelling to phylogeography

Species distribution models have been applied to a variety of research questions, including explorations of hybridization (Swenson, 2006), speciation (Losos & Glor, 2003; Graham et al., 2004a), diversity gradients (Graham et al., 2005, 2006; Wiens et al., 2006), and extinction (Martínez-Meyer et al., 2004; Bond et al., 2006). Because phylogeography and species distribution modelling both seek to understand biogeographical patterns and the processes generating them through studies of spatial-geographic variation, they each provide independent, but complementary, information. For this reason, studies that integrate these two sources of information are particularly powerful at detecting biogeographical patterns and inferring their causes. For example, Rissler et al. (2006) found concordant phylogeographic patterns among Californian reptiles and amphibians, suggesting that geographic features such as the Central Valley and the San Francisco Bay represent important barriers to dispersal. Maps of the predicted distributions of these species and lineages, generated using a species distribution modelling algorithm, were then used to identify areas of endemism and their geographic relationships to these barriers. As is the case for most phylogeographic studies, Rissler et al. (2006) generated hypotheses about the effects of specific geographic features on gene flow using patterns of genetic variation alone. However, their use of species distribution models, which draw upon a different set of data, supported these hypotheses from an ecological standpoint as well, revealing similar discontinuities in species distributions, and, conversely, routes of interconnectedness. Another example illustrating how phylogeography and distribution modelling can be integrated is the study by Bond et al. (2006), which investigated the role that population extinction has played in defining the current distribution of Apomastus spiders in the Los Angeles basin. Phylogeographic data were used to detect genetic structure and signatures of population extinction, and species distribution models were used to identify regions where the spiders would probably have been found had the area’s habitat not been altered by urban development.

Distribution modelling techniques and available data sources

To generate a species distribution model, the set of conditions that offer the best prediction of the geographic distribution of a species are identified using environmental data from sites of known species occurrence (Austin, 1985; Peterson, 2001; Pearson & Dawson, 2003; Elith et al., 2006). Models can be based on a variety of climatic or other environmental variables, for example measures of temperature, precipitation, elevation, ground cover, or soil type. The spatial distributions of these variables (usually captured in a set of GIS data layers, see Fig. 1a), along with a set of georeferenced sites of known species occurrence (see Fig. 1b), are then evaluated by one of several possible modelling algorithms (Fig. 1c). Each algorithm is designed to extract the relationship between environmental variation and species occurrence, although they differ in methodology and input formats (see Table 1; see also Elith et al., 2006, for a recent review and comparison among techniques). This relationship is then used to predict the species’ distribution given the environmental conditions of the area and time period of interest. These could be current climate measurements (Fig. 1e) or estimated climatic conditions at some time in the past (Fig. 1f) or future.
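To make the workflow concrete, the following Python sketch implements a deliberately simple rectangular envelope model, in the spirit of the envelope approach listed in Table 1 but not a reimplementation of any of the packages named there. Layers are represented as plain nested lists, occurrences as (row, column) cell indices, and all values are invented for illustration. The model records the range of each variable at the occurrence sites and predicts presence wherever every variable falls inside that range; the same fitted envelope is then applied to a second, hypothetical set of past climate layers.

```python
def fit_envelope(layers, occurrences):
    """Return {variable: (min, max)} over the cells where the species was recorded."""
    envelope = {}
    for name, grid in layers.items():
        values = [grid[r][c] for r, c in occurrences]
        envelope[name] = (min(values), max(values))
    return envelope


def predict_envelope(layers, envelope):
    """Return a 0/1 grid: 1 where every variable lies inside its fitted range."""
    first = next(iter(layers.values()))
    n_rows, n_cols = len(first), len(first[0])
    return [
        [
            int(all(lo <= layers[name][r][c] <= hi
                    for name, (lo, hi) in envelope.items()))
            for c in range(n_cols)
        ]
        for r in range(n_rows)
    ]


# Invented 3 x 3 example layers (temperature in deg C, precipitation in mm).
current = {
    "temp":   [[10, 12, 14], [11, 13, 15], [16, 18, 20]],
    "precip": [[800, 900, 1000], [700, 850, 950], [300, 400, 500]],
}
past = {
    "temp":   [[5, 7, 9], [6, 8, 10], [11, 13, 15]],
    "precip": [[600, 700, 800], [500, 650, 750], [800, 850, 900]],
}
occurrences = [(0, 0), (0, 1), (1, 1)]  # cells with documented records

envelope = fit_envelope(current, occurrences)
current_prediction = predict_envelope(current, envelope)
past_prediction = predict_envelope(past, envelope)
```

In this toy example the predicted range moves from the upper-left cells under current conditions to the bottom row under the cooler past conditions. The algorithms in Table 1 produce graded rather than binary predictions and weight variables in more sophisticated ways.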

Table 1.   Examples of species distribution modelling algorithms available on the Internet.

| Algorithm | Description | (X,Y) Input* | Software | URL | Reference |
|---|---|---|---|---|---|
| BIOCLIM | Envelope model | P | diva-gis | http://www.diva-gis.org/ | Nix (1986), Busby (1991) |
| Domain | Gower distances | P | diva-gis | http://www.diva-gis.org/ | Carpenter et al. (1993) |
| GARP | Genetic algorithm | P | DesktopGarp | http://www.nhm.ku.edu/desktopgarp/index.html | Stockwell & Peters (1999) |
| Generalized additive model (GAM) | Regression | PA | grasp | http://www.unine.ch/cscf/grasp/ | Lehmann et al. (2002) |
| Generalized linear model (GLM) | Regression | PA | grasp | http://www.unine.ch/cscf/grasp/ | Lehmann et al. (2002) |
| MAXENT | Maximum entropy | P | maxent | http://www.cs.princeton.edu/~schapire/maxent/ | Phillips et al. (2006) |

*P, presence only; PA, presence and absence.

Many GIS-based environmental layers are publicly available, and an appropriate data set can often be assembled from these sources (see Table 2 for a list of data sets commonly used in distribution modelling). Species distribution data may be collected in the field or, for many taxa, gleaned from one of a number of searchable Internet data bases (see Table 3 for examples). Some data bases provide georeferenced data (i.e. X, Y coordinates corresponding to a geographic coordinate system, such as decimal degrees or UTM), but in most cases only verbal descriptions of localities are provided and georeferencing is left to the user. A set of georeferencing guidelines for the MANIS/HerpNET/ORNIS distributed natural history networks can be found at http://manisnet.org/GeorefGuide.html. See Graham et al. (2004b) for a review of the various promises and challenges of using specimen data from natural history collections for distribution modelling.

Table 2.   Examples of commonly used environmental data sets.

| Data set | Description | Source | URL |
|---|---|---|---|
| WORLDCLIM | Interpolated climate layers for global land areas | Hijmans et al. (2005) | http://www.worldclim.org/ |
| SRTM 90m DEMs | 90-m-resolution digital elevation data for global land areas | The Consultative Group for International Agriculture Research’s Consortium for Spatial Information (CGIAR-CSI) | http://srtm.csi.cgiar.org/ |
| Several available | Global current climate, environmental variables, and future climate scenarios | Intergovernmental Panel on Climate Change (IPCC) | http://www.ipcc.ch/ |
| HYDRO1k | Global topographically derived data (e.g. streams, drainage basins, etc.) | United States Geological Survey (USGS)* | http://edc.usgs.gov/products/elevation/gtopo30/hydro/index.html |

*Several other useful data sets, including some with global coverage, are available from the USGS (edc.usgs.gov/).

Table 3.   Examples of species distribution data bases available on the Internet.

| Name | Taxon specific? | Geographic coverage | URL |
|---|---|---|---|
| Global Biodiversity Information Facility (GBIF) | No | Global | http://www.gbif.org/ |
| World Information Network on Biodiversity (REMIB) | No | 146 countries | http://www.conabio.gob.mx/remib_ingles/doctos/remib_ing.html |
| European Natural History Specimen Information Network (ENHSIN) | No | Europe | http://www.nhm.ac.uk/research-curation/projects/ENHSIN/ |
| Australian Biodiversity Information Facility (ABIF) | No | Australia | http://www.abif.org/ |
| The Biota of Canada Information Network (CBIF) | No | Canada | http://www.cbif.gc.ca/ |
| Distributed Information for Biological Collections (SpeciesLink) | No | Brazil | http://splink.cria.org.br/index?&setlang=en |
| Instituto Nacional de Biodiversidad (INBio) | No | Costa Rica | http://www.inbio.ac.cr/en/default.html |
| HerpNET | Yes (reptiles and amphibians) | Global | http://www.herpnet.org/ |
| Ornithological Information System (ORNIS) | Yes (birds) | Global | http://olla.berkeley.edu/ornisnet/ |
| Mammal Networked Information System (MANIS) | Yes (mammals) | Global | http://manisnet.org/ |
| System-wide Information Network for Genetic Resources (SINGER) | Yes (crop, forage and tree species) | Global | http://singer.grinfo.net/ |
| Ocean Biogeographic Information System (OBIS) | Yes (marine taxa) | Global | http://www.iobis.org/ |
| Missouri Botanical Garden (Tropicos) | Yes (plants) | Global | http://mobot.mobot.org/W3T/Search/vast.html |

Methods for modelling species distributions differ in a number of ways, including in how they select relevant predictor variables, weight the individual variables’ contributions, and predict patterns of occurrence (see Guisan & Zimmerman, 2000; Elith et al., 2006). Whereas some algorithms require only records of species presence, others require both presence and absence data (see Table 1 for examples of each). Ultimately, the choice of modelling algorithm should be based on both the resulting distribution estimate’s intended use and the available data (Fielding & Bell, 1997; Loiselle et al., 2003; Graham et al., 2004b; Elith et al., 2006). However, newer algorithms, such as boosted regression trees and maximum entropy methods (e.g. MAXENT), appear to outperform several of the more established methods (e.g. GARP, BIOCLIM) in comparisons across a number of species and geographic regions (Elith et al., 2006).

As with any modelling approach, the amount and type of data used can influence the accuracy of the predicted distributions. For example, generating an accurate projection of a species’ distribution typically requires samples from at least 20 localities (Stockwell & Peterson, 2002; but see Pearson et al., 2007). Biases in where the samples are collected can affect the model’s output, particularly if some areas are more accessible than others (reviewed in Graham et al., 2004b), as can the choice of environmental data and modelling algorithm (Araújo & Guisan, 2006). Likewise, to the extent that recent habitat changes (e.g. ground cover) affect the presence/absence of a species, distribution models based on such rapidly changing variables run the risk of being inaccurate. For current climate layers based on multi-year averages (e.g. WorldClim: 1950–2000), however, such short-term fluctuations are less likely to unduly influence the projected distributions. Species distribution models do not take into account the potential effects of biotic exclusion, dispersal limitation, or historical contingency on species ranges. As such, it is important to recognize that these models reflect species’ potential ranges rather than their realized ranges (Araújo & Guisan, 2006). This distinction can be important for some applications, for example in conservation planning.

Applications of palaeodistribution modelling to phylogeography

Whereas species distribution models are generally built on current environmental and species occurrence data, the inferences drawn from this approach are not limited to the present. As discussed above, distribution models can be projected onto models of the climate at some future time, for example to predict species invasions (Roura-Pascual et al., 2004) or to understand how future climate change might influence species distributions (Parra-Olea et al., 2005). Similarly, models of the current niche can be projected onto models of the past climate (e.g., Hugall et al., 2002; Carstens & Richards, 2007; Knowles et al., 2007) to reconstruct the distribution of suitable habitat at that point in the past (see Fig. 1f). For example, Hugall et al. (2002) used this approach to estimate the historical range of a snail in the Australian wet tropics. Comparisons between the snail’s probable past distribution and its population-genetic structure, as well as the population-genetic structuring of several codistributed vertebrates, identified a common vicariant history among the species of vertebrates. Palaeodistribution models have also been used to identify putative locations for Pleistocene refugia (Peterson et al., 2004; Carstens & Richards, 2007; Knowles et al., 2007), to identify historical migration pathways (Ruegg et al., 2006), and to provide information about potential dispersal corridors (Carstens & Richards, 2007). In other studies, palaeodistribution models have shed light on the degree to which organismal ranges have changed over time (Lawton, 1993; Gaston, 1996).

Generating a palaeodistribution model

Palaeodistribution models can be generated using the algorithms and data sets described above. The only additional requirement is a set of palaeoclimate estimates on which to project the species distribution (Fig. 1d; see also Cane et al., 2006, for a review of recent progress in palaeoclimate modelling). Because projecting species distributions onto palaeoclimatic conditions requires the set of current and historical climate layers to be congruent, palaeodistribution studies are limited to those data for which both current measurements and palaeoclimate estimates are available. At present, we are aware of only a few publicly available palaeoclimate model outputs, and none is provided in a ready-to-use format for palaeodistribution modelling. As described below, however, these publicly available data can be re-formatted for this purpose.
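The congruence requirement can be checked before any model is fitted. The sketch below is one hypothetical way to do so in Python, again treating each layer as a nested list keyed by variable name: it keeps only the variables present in both the current and the palaeoclimate sets and refuses to proceed if the grids for a shared variable differ in shape. The function name and data layout are assumptions made for illustration.

```python
def restrict_to_shared(current_layers, palaeo_layers):
    """Keep only variables present in both layer sets and verify grid shapes.

    Returns (current_subset, palaeo_subset); raises ValueError if nothing is
    shared or if a shared variable has differently shaped grids.
    """
    shared = sorted(set(current_layers) & set(palaeo_layers))
    if not shared:
        raise ValueError("no variables are available for both time periods")
    for name in shared:
        cur, pal = current_layers[name], palaeo_layers[name]
        cur_shape = (len(cur), len(cur[0]))
        pal_shape = (len(pal), len(pal[0]))
        if cur_shape != pal_shape:
            raise ValueError(
                f"{name}: current grid {cur_shape} does not match palaeo grid {pal_shape}"
            )
    return (
        {name: current_layers[name] for name in shared},
        {name: palaeo_layers[name] for name in shared},
    )
```

Real GIS layers would also need matching extent, resolution, and coordinate system, which this sketch does not examine.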

The US National Oceanic & Atmospheric Administration’s National Climatic Data Center (NOAA-NCDC) runs a World Data Center (WDC) for Paleoclimatology (http://www.ncdc.noaa.gov/paleo/) from which the outputs of several palaeoclimate models can be downloaded and viewed (http://www.ncdc.noaa.gov/paleo/modelvis.html). The available model runs include some from the Paleoclimate Modelling Intercomparison Project (PMIP) as well as some from other modelling groups. These raw outputs can be downscaled and calibrated for use with a set of current climate layers in palaeodistribution modelling. An example set of palaeoclimate layers, generated using the CCM1 model (Kutzbach & Guetter, 1986; Wright et al., 1993) for the last glacial maximum (21,000 yr bp), can be found in the Supplementary Material (Appendix S1), along with details of the downscaling and calibration procedure used (Appendix S2). This data set is formatted for use with the WorldClim (Hijmans et al., 2005) current climate layers.
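The procedure actually used for the example layers is given in Appendix S2 and is not reproduced here. Purely as an illustration of what downscaling and calibration can involve, the Python sketch below shows one widely used generic approach, often called the delta or change-factor method: the difference between a coarse palaeoclimate simulation and a coarse simulation of modern climate is resampled to the fine grid and added to the observed current layer. The function names, the nearest-neighbour resampling, and the numbers are assumptions of this sketch and may differ from the authors' procedure.

```python
def upsample(grid, factor):
    """Nearest-neighbour upsampling: each coarse cell becomes a factor x factor block."""
    return [
        [value for value in row for _ in range(factor)]
        for row in grid
        for _ in range(factor)
    ]


def delta_downscale(current_fine, modern_coarse, palaeo_coarse, factor):
    """Add the coarse (palaeo - modern) anomaly to a fine-resolution current layer."""
    anomaly = [
        [p - m for p, m in zip(p_row, m_row)]
        for p_row, m_row in zip(palaeo_coarse, modern_coarse)
    ]
    anomaly_fine = upsample(anomaly, factor)
    return [
        [c + a for c, a in zip(c_row, a_row)]
        for c_row, a_row in zip(current_fine, anomaly_fine)
    ]


# Invented example: one coarse cell covering a 2 x 2 block of fine cells.
current_fine = [[10, 11], [12, 13]]   # observed current temperature
modern_coarse = [[12]]                # climate model, present day
palaeo_coarse = [[7]]                 # climate model, past time slice
palaeo_fine = delta_downscale(current_fine, modern_coarse, palaeo_coarse, 2)
```

Because the anomaly is added to observed values, fine-scale spatial structure in the current layer is carried over to the past layer; smoother interpolation of the anomaly is a common refinement.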

Two important caveats associated with palaeodistribution modelling are that a literal interpretation of the projected past distribution assumes that: (1) the palaeoclimate predictions are accurate, and (2) the physiological limits of species are constant (Hadly et al., 1998; Davis & Shaw, 2001). Whereas recent work has demonstrated niche conservatism in several groups (Peterson & Vieglais, 2001; Martínez-Meyer et al., 2004; Kozak & Wiens, 2006; Martínez-Meyer & Peterson, 2006), it is not known whether this assumption holds true for most organisms.

From palaeodistributions to testable hypotheses

When a species distribution model is projected onto palaeoclimate estimates, the result is a GIS layer with continuous values indicating the predicted suitability of each cell for the species at one time in the past (i.e. a palaeodistribution model). Regions of core habitat (red in Fig. 1f), other less suitable areas (yellow in Fig. 1f), as well as regions that would probably have been uninhabitable (white in Fig. 1f) can be inferred from these continuous predictions, or, if desired, the predictions can be converted into binary presence–absence maps by setting minimum thresholds for species distributions (see Liu et al., 2005, for a comparison among various types of thresholds and their applications).
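Converting continuous predictions into a presence–absence map is a one-line operation once a threshold has been chosen. The Python sketch below shows the conversion together with one simple, illustrative way of choosing a threshold: the lowest suitability predicted at any known occurrence cell. Both function names and the example values are hypothetical, and the threshold rule is only one of several possibilities.

```python
def to_presence_absence(suitability, threshold):
    """Return a 0/1 grid: 1 where predicted suitability meets the threshold."""
    return [[int(value >= threshold) for value in row] for row in suitability]


def lowest_presence_threshold(suitability, occurrences):
    """Lowest suitability predicted at any (row, col) cell with a known record."""
    return min(suitability[r][c] for r, c in occurrences)


# Invented suitability scores on a 0-100 scale.
suitability = [[80, 55, 10], [60, 35, 5], [20, 0, 0]]
occurrences = [(0, 0), (1, 0)]

threshold = lowest_presence_threshold(suitability, occurrences)
binary_map = to_presence_absence(suitability, threshold)
```

For a palaeodistribution, a threshold of this kind would have to be derived from the current-day prediction, where occurrence records exist, and then applied to the projected layer.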

By providing a range of predicted areas of low and high suitability (e.g. from 0 to 100), a palaeodistribution estimate facilitates the formulation of alternative models of historical population structure. For example, consider the current-day distribution prediction in Fig. 1e. The four most suitable areas are discrete and similar in size. Because past population distributions, and not only the contemporary geographic configuration and its associated demographic effects, may leave a signature on patterns of genetic variation, such historical population structure needs to be taken into account. A palaeodistribution model (e.g. Fig. 1f) can add this critical historical perspective, providing information about past population associations that might have contributed to patterns of genetic variation. Given the palaeodistribution model in Fig. 1f, we might hypothesize that current-day populations i and ii are descended from a refugial population in area W, population iii from the refugial population X, and populations iv and v from refugial population Y, corresponding to the left-hand model of population structure in Fig. 1g. Alternatively, population ii could have descended from refugial population X along with population iii, corresponding to the right-hand model in Fig. 1g. Alternative hypotheses, such as one in which population v descended from refugial population Z as opposed to Y, could be envisioned and tested as well. In the next section we will describe how the alternative hypotheses generated from palaeodistribution models can be tested statistically using coalescent simulations and empirical genetic data.
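The two models of population structure described for Fig. 1g can be written down as simple data structures before any simulation is attempted. In the Python sketch below each hypothesis maps a refugium to the current-day populations assumed to descend from it. The helper functions are hypothetical, and because the text does not specify how the refugial populations are related to one another, the Newick string leaves them as an unresolved polytomy.

```python
# Refugium -> descendant current-day populations, as described for Fig. 1g.
left_model = {"W": ["i", "ii"], "X": ["iii"], "Y": ["iv", "v"]}
right_model = {"W": ["i"], "X": ["ii", "iii"], "Y": ["iv", "v"]}


def to_newick(hypothesis):
    """Population-tree topology with one clade per refugium (deeper nodes unresolved)."""
    clades = []
    for refugium in sorted(hypothesis):
        pops = hypothesis[refugium]
        clades.append(pops[0] if len(pops) == 1 else "(" + ",".join(pops) + ")")
    return "(" + ",".join(clades) + ");"


def refugium_of(hypothesis):
    """Invert the mapping: population -> refugium."""
    return {pop: ref for ref, pops in hypothesis.items() for pop in pops}


def differing_populations(h1, h2):
    """Populations assigned to different refugia under the two hypotheses."""
    a, b = refugium_of(h1), refugium_of(h2)
    return sorted(pop for pop in a if a[pop] != b.get(pop))
```

Here `differing_populations(left_model, right_model)` identifies population ii as the only population whose ancestry distinguishes the two hypotheses, which indicates where sampling effort is most informative.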

Component II: Testing Alternative Biogeographical Hypotheses

Once a set of historical biogeographical hypotheses has been identified, the next step is to evaluate statistically the extent to which the empirical genetic data support a given hypothesis. A variety of statistical phylogeographic approaches could be used to test alternative population models. Here we emphasize those approaches that employ summary statistics (reviewed in Knowles, in press), as opposed to evaluating the full probability of the observed genetic data (reviewed in Excoffier & Heckel, 2006), because of the great flexibility and ease of computation that the summary-statistic approach offers. In this section we highlight the biogeographical hypotheses that can be addressed, with reference to some recent empirical investigations, to illustrate the synergy that results when palaeodistribution models are used to generate a predictive framework that can be tested in statistical phylogeographic studies. We then provide a step-by-step guide to the process of testing alternative hypotheses with coalescent simulations, mentioning the available software.

Using coalescent models to test alternative hypotheses

Coalescent models have proved to be a useful tool for phylogeographic research even in the absence of explicit reconstructions of species’ past ranges (e.g. Milot et al., 2000; Knowles, 2001; Carstens et al., 2005b; DeChaine & Martin, 2005; Russell et al., 2005; Steele & Storfer, 2006). For example, using a statistical evaluation of five separate potential population models, Steele & Storfer (2006) were able to show that populations of Pacific giant salamander (Dicamptodon tenebrosus) were isolated in separate glacial refugia during the Pleistocene glaciation. Whereas the structure of the genealogy was suggestive of this disjunction, the coalescent modelling provided details that would not otherwise have been known. These included an evaluation of the timing of divergence, which was consistent with a mid-Pleistocene divergence, thereby providing corroborative evidence for the biogeographical hypothesis of divergence among Pleistocene refugia. By using a coalescent framework, the authors could be assured that the observed geographic distribution of genetic variation reflected the population history, rather than simply the stochasticity of genetic processes. Nonetheless, palaeodistribution modelling could have added rigour to this (and other) phylogeographic studies by guiding the formation of realistic alternative hypotheses. In the case of Steele & Storfer’s (2006) study, this information would help to confirm that the specified refugia probably contained suitable habitat for the focal species, as well as facilitating inferences about the sizes and locations of other putative refugia. Consequently, inferences about the relative contributions of past events, such as the effect of climate-induced shifts in species distributions, on population genetic structure would be not only more accurate, but also more detailed.

The potential benefits of this approach extend to comparative phylogeographic studies, in which general regional hypotheses provide a metric for comparisons among organisms with different life-history traits (Arbogast & Kenagy, 2001). For example, Carstens & Richards (2007) generated palaeodistribution models for eight codistributed lineages from the Pacific Northwest mesic forests of North America and used the fit of genetic data to the alternative models, as determined with coalescent simulations, to evaluate whether there was congruence in the location and structure of Pleistocene refugia and post-Pleistocene dispersal corridors among the taxa. Such a framework is critical for identifying whether differences in the patterns of genetic variation among species reflect varying responses to common historical events or, despite shared distributions today, reflect incongruence among the species’ past distributions.

A step-by-step guide to testing alternative hypotheses with coalescent models

The alternative population structures suggested by palaeodistribution models can be evaluated by constructing null distributions for expected patterns of genetic variation (or a summary statistic that is used to characterize the data) from data simulated by a neutral coalescent process under a specific population model. For example, at least two testable hypotheses are suggested by the model shown in Fig. 1f. Coalescent models that correspond to these hypotheses may be conceptualized by the respective population trees, in which branch lengths reflect the timing of divergence and branch widths correspond to the effective population size (Fig. 1g). Coalescent models can be designed with varying degrees of complexity. However, an excessively complex model may have limited utility, because the available genetic data may not be sufficient to evaluate it: complex models can require large amounts of genomic data (Knowles & Maddison, 2002). Furthermore, since the use of summary statistics necessarily involves a loss of information, it may not be possible to distinguish among various complicated models of a species’ history, because the expected value of the summary statistic may not differ between the models (Wakeley, 2003). The key is to identify the simplest model that captures the relevant features of the organism’s history (Knowles, 2004).

Whereas the palaeodistribution models provide crucial information for erecting a coalescent model that captures the geography of divergence, as illustrated in Fig. 1g, demographic aspects of the population history are also important, because they too shape the pattern of genetic variation across the landscape by influencing the rate of gene-lineage loss (i.e. the amount of genetic drift). These include the timing of divergence and the effective population size, which may or may not have been constant over time. Whereas the timing of divergence may be derived from the palaeoclimatic information (e.g. the last or preceding glacial maxima), other demographic parameters are estimated directly from the genetic data. For example, the effective population size, Ne, can be calculated from the population-mutational parameter θ, which is 4Neμ, when there is an estimate of the mutation rate μ (e.g. the commonly used rate of divergence of 2% per million years for insect mitochondrial DNA). The parameter θ might be estimated using a coalescent-based program (e.g. LAMARC: Kuhner, 2006), as might a population growth parameter in the event that a constant effective population size is not a reasonable assumption. Otherwise, θ might be estimated directly from the number of segregating sites (e.g. using Watterson’s estimator of θ) or the pairwise differences (e.g. based on nucleotide diversity π) among DNA haplotypes (e.g. using DnaSP: Rozas et al., 2003).
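
The arithmetic behind these estimates can be illustrated with a minimal Python sketch. It is not a substitute for the programs cited above; the function names, the toy alignment and the numerical values are hypothetical and serve only to show how the number of segregating sites, the mean number of pairwise differences and the relationship θ = 4Neμ fit together. The conversion to Ne assumes that θ and μ are expressed in the same units (here, per site and per site per generation); the scalar 4 follows the text and is exposed as a parameter because other scalars are often used for non-diploid loci.

```python
from itertools import combinations


def segregating_sites(seqs):
    """Count alignment columns that contain more than one distinct base."""
    return sum(1 for column in zip(*seqs) if len(set(column)) > 1)


def watterson_theta(seqs):
    """Watterson's estimator for the whole locus: S / a_n, a_n = sum_{i=1}^{n-1} 1/i."""
    n = len(seqs)
    if n < 2:
        raise ValueError("at least two sequences are required")
    a_n = sum(1.0 / i for i in range(1, n))
    return segregating_sites(seqs) / a_n


def mean_pairwise_differences(seqs):
    """Mean number of differences between all pairs of sequences (per locus)."""
    pairs = list(combinations(seqs, 2))
    if not pairs:
        raise ValueError("at least two sequences are required")
    total = sum(sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs)
    return total / len(pairs)


def effective_size(theta_per_site, mu_per_site_per_generation, scalar=4.0):
    """Solve theta = scalar * Ne * mu for Ne (scalar = 4 as in the text)."""
    return theta_per_site / (scalar * mu_per_site_per_generation)


# Hypothetical toy alignment: four aligned haplotypes of eight sites.
toy_alignment = ["ACGTACGT", "ACGTACGA", "ACGAACGT", "ACGTACGT"]
theta_w = watterson_theta(toy_alignment)            # per locus
pi = mean_pairwise_differences(toy_alignment)       # per locus
theta_w_per_site = theta_w / len(toy_alignment[0])  # per site

# Hypothetical rate: 2% divergence per million years is 1e-8 substitutions
# per site per year per lineage; one generation per year is assumed here.
ne = effective_size(theta_w_per_site, 1e-8)
```

Dividing the per-locus estimate by the sequence length before converting to Ne keeps the units of θ and μ consistent; mixing per-locus and per-site quantities is an easy way to misestimate Ne by orders of magnitude.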

Coalescent simulations are used to evaluate the fit of the empirical data to a particular historical model (Fig. 1h,i). For such tests, the data should be simulated under conditions that mirror the empirical data, including the amount of data and the mutational model underlying the observed patterns of genetic variation. For example, if a researcher sequenced 983 basepairs of a gene that evolved under an HKY+Γ model of sequence evolution from 129 individuals, the simulated data should share these characteristics. The program ms (Hudson, 2002), in combination with Seq-Gen (Rambaut & Grassly, 1997), allows users to specify θ, the number of basepairs, the model of sequence evolution, and the number of individuals in order to generate simulated data that provide an expectation for the pattern of genetic variation under a specific population history. Mesquite (Maddison & Maddison, 2006) includes modules with similar capabilities, along with several analytical tools that allow users to calculate a summary statistic (such as the number of deep coalescents or Slatkin and Maddison’s s) for each simulated data set; these values can then be used to construct a null distribution for the summary statistic (Fig. 1j).
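
To make the logic of such simulations concrete, the following Python sketch simulates the number of deep coalescents under a simple two-population divergence model. It is a deliberately reduced stand-in for the software named above, not a description of it: it tracks only the true genealogies (no sequences are simulated and no gene trees are estimated), it assumes a constant effective population size and no migration, and it measures time in units of 2Ne generations. All function names and parameter values are hypothetical, including the split of 129 sampled individuals into 65 and 64.

```python
import random


def lineages_surviving(n, t_split, rng):
    """Run a Kingman coalescent backwards from n lineages and return how many
    remain when t_split (in units of 2Ne generations) is reached."""
    k, elapsed = n, 0.0
    while k > 1:
        # With k lineages, the waiting time to the next coalescence is
        # exponential with rate k(k-1)/2.
        elapsed += rng.expovariate(k * (k - 1) / 2.0)
        if elapsed >= t_split:
            break
        k -= 1
    return k


def deep_coalescents(n1, n2, t_split, rng):
    """Extra lineages that fail to coalesce within each population before the
    split: (k1 - 1) + (k2 - 1) for a two-population tree."""
    k1 = lineages_surviving(n1, t_split, rng)
    k2 = lineages_surviving(n2, t_split, rng)
    return (k1 - 1) + (k2 - 1)


def simulate_null(n1, n2, generations, ne, replicates=1000, seed=1):
    """Simulate the statistic for many independent genealogies."""
    rng = random.Random(seed)
    t_split = generations / (2.0 * ne)
    return [deep_coalescents(n1, n2, t_split, rng) for _ in range(replicates)]


# Hypothetical scenario: 65 + 64 sampled individuals, a split 20,000
# generations ago, and Ne = 50,000 in each population.
null_values = simulate_null(65, 64, generations=20000, ne=50000)
```

The sample sizes enter the simulation directly, which is the sense in which simulated data should mirror the empirical sampling; a fuller analysis would also simulate sequences of the observed length under the fitted substitution model and estimate gene trees from them.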

This coalescent-based hypothesis-testing process involves first generating a large number of genealogies simulated by a neutral coalescent process under each model of population history (e.g. Fig. 1h). Sequence data are then simulated on these genealogies (Fig. 1i). A summary statistic is calculated for each replicate data set, and together the values (e.g. from each of 1000 simulated data sets) generate a null distribution for the summary statistic (Fig. 1j) (see Voight et al., 2005, and Hickerson et al., 2006, for examples in which multiple summary statistics are considered simultaneously). Comparing the value of the summary statistic estimated from the empirical genetic data with the null distribution provides a statistical framework for evaluating the fit of the data to one or more models (for example, the red and green distributions in Fig. 1j reflect the expected number of deep coalescents under the respective population models, Fig. 1g). In Fig. 1, the number of deep coalescents observed in the empirical data differs significantly from what would be expected had the data evolved under a model in which population i was not colonized from the same ancestral population as population ii (i.e. the population model on the right in Fig. 1g): less than 5% of the simulated data sets exhibited a value for the number of deep coalescents that was equal to or greater than that observed for the empirical data. However, the data are consistent with the alternative population model (i.e. the population model on the left in Fig. 1g), because the probability of observing a number of deep coalescents at least as large as that calculated for the empirical data was greater than 5% (i.e. P > 0.05).
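
The comparison of an observed statistic with its null distribution reduces to a simple proportion, sketched below in Python. The function names and the example values are hypothetical; the sketch assumes a one-tailed test in which large values of the statistic (e.g. many deep coalescents) indicate a poor fit, and a significance threshold of 0.05 as in the text.

```python
def upper_tail_probability(null_values, observed):
    """Proportion of simulated values equal to or greater than the observed one."""
    if not null_values:
        raise ValueError("the null distribution is empty")
    as_extreme = sum(1 for value in null_values if value >= observed)
    return as_extreme / len(null_values)


def model_rejected(null_values, observed, alpha=0.05):
    """True when the observed value falls in the upper alpha tail of the null."""
    return upper_tail_probability(null_values, observed) < alpha


# Hypothetical null distribution from 20 simulated data sets and a
# hypothetical observed value; real analyses would use many more replicates.
example_null = [3, 4, 4, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 8, 8, 9, 9, 10, 11, 14]
p_value = upper_tail_probability(example_null, observed=14)
rejected = model_rejected(example_null, observed=14)
```

Applying the same function to the null distribution of each candidate model gives one probability per model, so a data set can be inconsistent with one population history while remaining consistent with another.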

Benefits of the integrative approach

Whereas the integration of palaeodistribution and coalescent modelling techniques represents a new and informative development in biogeographical research (Stigall & Lieberman, 2006), it has yet to be widely employed. However, predictive models of the type advocated here can lead to important biogeographical insights at a variety of spatial and temporal scales. This is because the genetic data are used to test hypotheses built with explicit reference to the species under study, as opposed to relying on generic models. Moreover, the coupling of palaeodistribution and coalescent models provides a flexible framework with which to evaluate patterns of genetic variation under the diverse and varied historical conditions that have contributed to contemporary patterns of species diversity.

There are challenges associated with palaeodistribution modelling (Araújo & Guisan, 2006; Hijmans & Graham, 2006), as well as with statistical phylogeographic tests (reviewed in Knowles, in press), but these difficulties are offset by the potential benefits of improving studies of the population processes that contribute to regional patterns of biodiversity. Indeed, it is only when present and historical geo-spatial and genetic data are integrated in such a predictive, hypothesis-testing framework that the discipline of phylogeography will fulfill its promise as an integrative field capable of connecting microevolutionary processes to macroevolutionary patterns (Bermingham & Moritz, 1998).

Acknowledgements

Training in species distribution modelling was provided to C.L.R. by the Center for Biodiversity and Conservation at the American Museum of Natural History and was funded by the University of Michigan’s Rackham Graduate School. The research was funded by a National Science Foundation grant (DEB-0447224) to L.L.K.

Biosketches

Corinne Richards is a PhD candidate at the University of Michigan whose dissertation research integrates studies of molecular and phenotypic variation among populations of Panamanian golden frogs (Atelopus varius and A. zeteki). She is interested in the application of phylogeography and landscape genetics to conservation, the role of selection in the evolution of morphological variation, and the effects of climate change and disease on declining amphibian populations.

Bryan Carstens is interested in the evolution of ecological communities and the methodological approaches used in comparative phylogeography.

L. Lacey Knowles’ studies of the processes that initiate or contribute to population divergence span a wide range of temporal and spatial scales. Her primary research interests include the relative contributions of selection and drift to speciation, the evolution of reproductive isolation, the processes generating macroevolutionary patterns of diversity, and the use of statistical approaches (especially coalescent models) to infer the biogeographical, demographic and temporal contexts of lineage divergence.

Editor: Michael Patten
