Phylogeographic information systems: putting the geography into phylogeography

Authors


*David Kidd, Environmental and Evolutionary Biology, School of Biology, Sir Harold Mitchell Building, University of St. Andrews, St. Andrews, Fife KY16 9TH, UK.
E-mail: dk@nescent.org

Abstract

Phylogeography is concerned with the observation, description and analysis of the spatial distribution of genotypes and the inference of historical scenarios. In the past, the discipline has concentrated on the historical ‘phylo-’ component through the utilization of phylogenetic analyses. In contrast, the spatial ‘-geographic’ component is not a prominent feature of many existing phylogenetic approaches and has often been dealt with in a relatively naive fashion. Recently, there has been a resurgence of interest in the importance of geography in evolutionary biology. Thus, we believe that it is time to assess how geographic information is currently handled and incorporated into phylogeographical analysis. Geographical information systems (GISs) are computer systems that facilitate the integration and interrelation of different geographically referenced data sets; however, so far they have been little utilized by the phylogeographical community to manage, analyse and disseminate phylogeographical data. However, the growth in individual studies and the resurgence of interest in the geographical components of genetic pattern and biodiversity should stimulate further uptake. Some advantages of GIS are the integration of disparate data sets via georeferencing, dynamic data base design and update, visualization tools and data mining. An important step in linking GIS to existing phylogeographical and historical biogeographical analysis software and the dissemination of spatial phylogenies will be the establishment of ‘GeoPhylo’ data standards. We hope that this paper will further stimulate the resurgence of geography as an equal partner in the symbiosis that is phylogeography as well as advertise some benefits that can be obtained from the application of GIS practices and technologies.

Introduction

Phylogeography is concerned with the principles and processes governing the geographical distributions of genealogical lineages, especially those at the intraspecific level (Avise, 1998, 2000). Phylogeographic study is undergoing a major expansion as the range of convenient genetic markers available to study geographic variation increases. Ultimately, the aim of phylogeography might be defined as a means to understand microevolution and speciation in its geographic or spatiotemporal context. Detecting coincidence or concordance of geographic variation in genotypes, or their genealogies, and the environment is therefore at the heart of phylogeographic inference. Geographical concordance between genealogies may be across sequence characters within a gene, between significant genealogical partitions across multiple genes within a species, or in the geography of gene-tree partitions across multiple co-distributed species (Fig. 1). The latter has been called comparative phylogeography (Bermingham & Moritz, 1998) and may provide a ‘bridge’ between the historically separate disciplines of phylogeography and historical biogeography (Riddle & Funk, 2004; Riddle & Hafner, 2004; Riddle, 2005). Any of these three types of genetic pattern may be geographically concordant with an environmental feature, for example, a mountain range, river or climatic barrier. A wider definition of phylogeography includes genetically controlled traits, such as morphology or behaviour, for which comparable concordance patterns can be studied (Avise, 1998).

Figure 1.

 Phylogeographic Concordance (adapted from Fig. 5.1, p. 217, Avise, 2000). There are three different aspects of genealogical concordance which may also be concordant with externally defined environmental partitions or parameters, here depicted as two parapatric geographical areas A and B. (a) Across sequence characters within a gene, (b) in significant genealogical partitions across multiple genes within a species, and (c) in the geography of gene-tree partitions across multiple co-distributed species (comparative phylogeography).

Practically, phylogeographers usually survey the geographical variation of one or more organisms in the field to which they apply one or more of a variety of analytical techniques. The results are subsequently compared to external data or information describing the present and past landscape across which the variation evolved. Avise (1998, 2000) reviews the considerable progress that has occurred in the genealogical aspects of phylogenetic concordance. However, it is clear that there has existed a bias towards the ‘phylo-’ component of phylogeography in preference to the ‘-geographic’ component throughout the development of this field. This bias is clear in the literature where gene trees, networks, population genetic statistics and measures of trait covariation must be quantitative and statistically testable, whereas geographical relationships are usually only qualitatively addressed (e.g. studies reviewed in Avise, 2000).

This situation has arisen from the disciplinary history of phylogeography, which primarily emerged from population genetics and phylogenetics rather than as a truly integrated combination between these disciplines and geography. The challenge for phylogeography is to integrate both phylogeny and geography within a quantitative analytical framework that encompasses the diverse aspects of phylogeographic concordance. Without such a framework, phylogeography will remain essentially a narrative biogeographic approach (Humphries, 2000; however, see Arbogast & Kenagy, 2001; Cleland, 2001; Riddle & Funk, 2004; Riddle & Hafner, 2004, for alternative reasoning).

A number of recent publications have described analytical techniques that can be applied to phylogeographic problems and emphasized the need for quantitative techniques (e.g. Sork et al., 1999; Barbujani, 2000; Knowles & Maddison, 2002; Epperson, 2003; Manel et al., 2003; Sites & Marshall, 2003; Thompson, 2005). Instead of providing detailed descriptions of the merits of specific analytical techniques, here we aim to give an overview of how geography is incorporated – and could be better incorporated – into phylogeography. Within the context of this paper, space refers to the location of samples in two or three spatial dimensions without reference to environmental variation within that space, whereas geography is space with environmental variation. Geographical information systems (GISs) constitute an established technology used for the storage, analysis and visualization of geographic data in science, business and local government (Longley et al., 1999). However, GISs have not been exploited extensively by evolutionary biologists (exceptions are Kidd & Ritchie, 2000; Swenson & Howard, 2004) despite their potential to provide a data storage and computational framework within which phylogenetic data can be intimately connected with the present and past landscape across which the evolutionary patterns evolved. The use of GIS may nevertheless be greater than the literature suggests, because its application is simply not reported.

Phylogeography as a geographical information system

Information science pertains to the capture, representation, analysis, visualization and dissemination of data. If the data have a spatial component then the principles of geographical information science should be applied (Goodchild, 1992; Mark, 2003). Phylogeographic information is spatial and therefore falls within the realm of geographical information science. Organisms are sampled in the field followed, usually, by some form of laboratory analysis to create georeferenced data on organism variation. These data are then analysed and historical scenarios inferred using aspatial evolutionary analytical approaches: techniques from phylogenetics or population genetics, or the coalescent (Rosenberg & Nordborg, 2002; Hey & Machado, 2003).

In the phylogenetic/population genetic approach, graphical phylogenetic trees, networks or clades are calculated from the observed variation data. These graphical representations of evolutionary relationships are subsequently compared to external information that could be another trait of the same organism, a trait or distribution of a different organism, or an environmental variable, to identify congruent spatial and phylogenetic pattern (Fig. 1) facilitating the inference of, usually qualitative, historical scenarios (e.g. Taberlet et al., 1998; Hewitt, 1999).

In contrast, in the coalescent approach external information is used, again often qualitatively, to develop hypothetical evolutionary scenarios as formal structures within which the coalescent simulations are run. The structure may include hypotheses of population division, demography and migration. Thus, phylogeographers must manipulate a wide range of data and information types including ‘raw’ character matrices and DNA sequences, derived trees and network graphic representations, coalescent hypotheses as well as external contextual data describing environmental variation and landscape structure.

Phylogeography therefore utilizes a heterogeneous set of quantitative and qualitative data obtained from a wide variety of sources with differing data structures, e.g. trees, networks, spatial grids or vector points, lines or polygons. Individual studies follow an ‘information pipeline’ from data collection (from field survey or existing publications and data bases) through analysis to the eventual dissemination of results and conclusions.

The principles of geographical information science (Goodchild, 1992; Mark, 2003) are applied in a variety of programs that can store and manipulate spatial data; however, it is generally within GIS that they are most fully implemented. GIS has revolutionized quantitative studies in geography (Longley et al., 1999) in much the same way as bioinformatics is opening new fields in genetics. GIS facilitates the integration and interrelation of data sets of different origins via a common georeferencing system. In addition, GIS provides a set of spatial data models and tools, e.g. the ArcObjects™ Component Object Model (COM) compliant library (ESRI, Redlands, CA, USA), facilitating complex customized representations of geographical phenomena, as well as analysis, visualization and reporting facilities. The challenge is to apply these tools and principles to phylogeographic problems to create an information system framework within which phylogeography can become a better unification of phylogenetics and geography.

Integration of GIS with existing software will vary from weak linkage with file exchange, through data base access protocols, to strong linkage where GIS functionality is embedded in phylogenetic programs or vice versa. Two examples of GIS-mediated phylogeography briefly illustrate how old questions can be addressed in a new way, and new questions addressed with GIS. The first example is Swenson & Howard's (2004) re-examination of Remington's North American suture zones (Remington, 1968), which concluded that only 2 of the 13 zones were consistent with the distribution of recorded hybrid zones. The second example is the quantitative inference of niche evolution (Peterson et al., 1999; Rice et al., 2003; Graham et al., 2004) and the postdiction of historical distributions using historical environment reconstructions (Hugall et al., 2002, 2003; Martínez-Meyer et al., 2004), which has only recently become possible with the increasing number of digital environmental data sets and data bases of present and historical taxon distributions (from fossil records) brought together in GIS. Such palaeodistribution reconstructions provide independent quantitative hypotheses of the spatiotemporal environments within which genetic variation evolved.

In the remainder of this paper we consider the phylogeographic information pipeline, from data capture and storage to analysis and dissemination, from a geographical information system/science perspective and further describe how geography is presently integrated into phylogeography.

Data capture

All phylogeographic studies are based on data describing organism variation across a geographical region. The geographical sampling strategy of most phylogeographical studies is determined solely on a study-specific basis without reference to other potential uses of the data; for example, Taberlet et al. (1998) were unable to use taxa sampled only within a subset of their range in a Brooks parsimony analysis (BPA) (Brooks, 1985, 1990) of European taxa. Two sampling issues must be addressed when designing individual phylogeographical studies: what to sample (number of species and traits at each location) and where to sample (total area to sample and the distribution of locations within that area). Most phylogeographical studies examine only a few characters within a single taxon (the phylogeography section of the average issue of Molecular Ecology is dominated by studies of mtDNA variation in single species). However, this is beginning to change, with multi-species (Petit et al., 2003) and multi-character (Manel et al., 2003; Sivasundara & Hey, 2003) studies becoming more common. Here we are concerned with the where rather than the what: explicitly, the geographical sampling strategy and the methodology of recording sample locations.

The cost and time required to obtain genetic profiles creates a tension between depth of sampling per population and number of populations sampled. In addition, sampling is often biased towards regions expected to show significant pattern and that are easily accessible. Adequate assessment of variation within populations is a prerequisite for the assessment of variation between locations; however, the frequency and distribution of spatial sampling must also be considered because it affects both the variety of analyses that can be undertaken and the range of historical scenarios that can be tested.

Spatial interpolation algorithms (Lam, 1983; Mitas & Mitasova, 1999) estimate parameter values at unsampled locations from a spatial distribution of observed points, providing a means of integrating data sampled at different sets of locations, as well as visualization (Cavalli-Sforza et al., 1994; Cesaroni et al., 1997; Kidd & Ritchie, 2000; Rosser et al., 2000; Miller et al., 2006). The number and distribution of sample locations affect the quality of surfaces interpolated from point samples, with spatially regular patterns generally outperforming clustered patterns (Cox et al., 1997; Rempel & Kushneriuk, 2003; Di Zio et al., 2004; Paez et al., 2005). Grid sampling is uncommon in phylogeography; however, Petit et al. (1997) provide an exceptional example of grid sampling and spatial interpolation, which identified cpDNA haplotype patches for two species of Quercus (oak) that were inferred to be the residual signal of long-distance dispersal events during post-glacial recolonization.
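To make the idea of estimating values at unsampled locations concrete, the following is a minimal sketch of inverse distance weighting (IDW), one of the simplest interpolators. It is offered only as an illustration and is not claimed to be the algorithm used in any of the studies cited above; the function names, the `power` parameter and the sample values are assumptions made for this example, and coordinates are treated as planar.

```python
import math

def idw_interpolate(samples, x, y, power=2.0):
    """Estimate a value at (x, y) from (sx, sy, value) samples.

    Each sample is weighted by 1 / distance**power, so nearby samples
    dominate the estimate. Coordinates are assumed to be planar.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for sx, sy, value in samples:
        d = math.hypot(x - sx, y - sy)
        if d == 0.0:
            return value  # the target coincides with a sample location
        w = 1.0 / d ** power
        weighted_sum += w * value
        weight_total += w
    return weighted_sum / weight_total

def idw_grid(samples, xs, ys, power=2.0):
    """Interpolate onto a regular grid; rows follow ys, columns follow xs."""
    return [[idw_interpolate(samples, x, y, power) for x in xs] for y in ys]

# Hypothetical haplotype frequencies at four sample sites: (x, y, frequency).
sites = [(0.0, 0.0, 0.1), (10.0, 0.0, 0.9), (0.0, 10.0, 0.2), (10.0, 10.0, 0.8)]
surface = idw_grid(sites, xs=[0.0, 5.0, 10.0], ys=[0.0, 5.0, 10.0])
```

With regularly spaced sites such as these, every grid cell is informed by nearby samples; with clustered sites, cells far from any cluster are estimated almost entirely from distant values, which is one way of seeing why regular sampling tends to give better surfaces.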

Synthetic maps are multivariate maps created by interpolating character surfaces from a set of point samples and then undertaking Principal Components Analysis on the gridded values; they have been used to identify clines in multiple characters (e.g. Cavalli-Sforza et al., 1994; Cesaroni et al., 1997; Kidd & Ritchie, 2000). Because the interpolation of character surfaces is part of the process of creating synthetic maps, spatial sampling should be as regular and dense as practicable if a synthetic map is desired. When surfaces are interpolated from an irregular distribution of points, care must be taken to identify those components of the pattern that are potential artefacts of the point distribution and interpolation algorithm (Sokal et al., 1999a,b; Kidd & Ritchie, 2000).
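The second step of the synthetic map procedure can be sketched as follows, assuming the character surfaces have already been interpolated to a common grid and flattened to lists of cell values. The first principal component is found here by power iteration on the correlation matrix, which is a simple stand-in for a full Principal Components Analysis; the function names and the three example surfaces are hypothetical, and each surface is assumed to vary across the grid.

```python
import math

def standardize(values):
    """Rescale one gridded variable to zero mean and unit sample variance."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]

def first_component(surfaces, iterations=200):
    """Return (loadings, scores) for the first principal component.

    surfaces: one list of grid-cell values per character, all equal length.
    scores:   one value per grid cell, i.e. the synthetic map.
    """
    columns = [standardize(s) for s in surfaces]
    k, n = len(columns), len(columns[0])
    # Correlation matrix between characters, computed across grid cells.
    corr = [[sum(a * b for a, b in zip(columns[i], columns[j])) / (n - 1)
             for j in range(k)] for i in range(k)]
    # Power iteration; the uneven start vector avoids starting exactly
    # orthogonal to the leading eigenvector in simple symmetric cases.
    vec = [1.0 / (i + 1) for i in range(k)]
    for _ in range(iterations):
        nxt = [sum(corr[i][j] * vec[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm == 0.0:
            raise ValueError("power iteration collapsed; try another start vector")
        vec = [v / norm for v in nxt]
    scores = [sum(vec[i] * columns[i][c] for i in range(k)) for c in range(n)]
    return vec, scores

# Hypothetical gridded frequencies for three characters along a four-cell transect.
char_a = [0.1, 0.2, 0.3, 0.4]
char_b = [0.2, 0.4, 0.6, 0.8]
char_c = [0.9, 0.7, 0.5, 0.3]
loadings, synthetic_map = first_component([char_a, char_b, char_c])
```

In this contrived example all three characters follow the same cline (the third in the opposite direction), so the first component scores change monotonically along the transect. With real data the caution given above still applies: a cline in the scores may reflect the sampling distribution and interpolator rather than the organisms.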

When recording sample locations, two issues must be addressed. First, the type of geographical representation to be used to store and analyse the spatial component of the data must be determined, i.e. points, lines, or polygons; and second, how to record their geographical position. The choice of representation may have significant effects on the types of analyses that can be undertaken as well as their results (Fig. 2). While it is good practice that precise coordinates (often latitude/longitude or Universal Transverse Mercator coordinates) for sample locations are published, the use of location names persists. It must be emphasized that names alone are often ambiguous due to duplication; one British Road Atlas lists 23 occurrences of ‘Newtown’, of which five are in the county of Hampshire. Confusion may also arise from alternative spellings or names, for example, in the Pyrenees (where the authors work), the same place may have Catalan, French and Spanish names. Obtaining coordinates for location names is a time-consuming burden that is completely avoidable by publishing full coordinates for sample locations.

Figure 2.

 The effect of spatial representation on analysis. Two alternative representations of populations as point and polygon objects are depicted along with their respective pairwise distance matrices. When populations are represented as points, B is closer to C than to A, whereas when represented as polygons, the opposite is true, with A being closer to B than C.
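The effect described in this caption can be reproduced with a few lines of planar geometry. The sketch below compares distances between populations represented as points (the mean of the polygon vertices) with distances between the same populations represented as polygons (the minimum boundary-to-boundary distance). The three rectangles are hypothetical and are not the shapes drawn in Fig. 2; the polygon distance routine assumes the polygons do not overlap.

```python
import math

def point_segment_distance(p, a, b):
    """Shortest planar distance from point p to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0.0:
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp to its end points.
    t = ((px - ax) * dx + (py - ay) * dy) / length_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def edges(polygon):
    """Consecutive vertex pairs, closing the ring."""
    return [(polygon[i], polygon[(i + 1) % len(polygon)])
            for i in range(len(polygon))]

def polygon_distance(poly1, poly2):
    """Minimum boundary distance between two non-overlapping polygons."""
    best = math.inf
    for vertices, other in ((poly1, poly2), (poly2, poly1)):
        for p in vertices:
            for a, b in edges(other):
                best = min(best, point_segment_distance(p, a, b))
    return best

def vertex_mean(polygon):
    """Point representation: the mean of the vertices (exact for rectangles)."""
    n = len(polygon)
    return (sum(x for x, _ in polygon) / n, sum(y for _, y in polygon) / n)

def point_distance(poly1, poly2):
    (x1, y1), (x2, y2) = vertex_mean(poly1), vertex_mean(poly2)
    return math.hypot(x2 - x1, y2 - y1)

# Hypothetical populations: A is elongated, B and C are compact.
pop_a = [(0, 0), (10, 0), (10, 2), (0, 2)]
pop_b = [(11, 0), (13, 0), (13, 2), (11, 2)]
pop_c = [(16, 0), (18, 0), (18, 2), (16, 2)]

as_points = {"AB": point_distance(pop_a, pop_b), "BC": point_distance(pop_b, pop_c)}
as_polygons = {"AB": polygon_distance(pop_a, pop_b), "BC": polygon_distance(pop_b, pop_c)}
```

As points, B is nearer to C than to A; as polygons, the ordering reverses because the edge of the elongated population A lies close to B. Any analysis driven by a pairwise distance matrix would inherit this difference.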

With the availability of inexpensive handheld global positioning systems (GPS) there is no excuse for not providing accurate and precise coordinates for sample locations (see Lange & Gilbert, 1999, for further information on using GPS for GIS data capture). However, it must be recognized that GPS performance varies; for example, in mountainous, built-up and forested areas, sky visibility may be restricted, thus limiting the number of satellites that can be locked onto. In addition, signal interference from nearby objects can degrade positional accuracy. Handheld GPS specifications typically give positional accuracies of ±10 m in the horizontal and ±35 m in the vertical; however, ground-truthing can reveal coordinates to be considerably more accurate. For studies where the distances between sample sites are below the accuracy of a handheld GPS, other more accurate survey techniques should be applied; for example, differential or Wide Area Augmentation System enabled GPS, or alternatively more traditional surveying methods. Positional coordinates must always be accompanied by information identifying the geographic map projection or geodetic spheroid and datum used. Alternative spheroids and projections distort the Earth's surface in different ways, so coordinate systems and map projections must be made compatible before data from different sources are integrated, and must be selected carefully so as to be suitable for any analysis undertaken (Muehrcke & Muehrcke, 1998).
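One way to enforce the requirement that coordinates travel with their datum is to store the datum in the location record itself and to refuse to combine records that disagree. The sketch below does this for a great-circle distance calculated with the haversine formula. The class and function names are hypothetical, the site coordinates are invented, and the distance assumes a spherical Earth of mean radius 6371.0088 km, which is an approximation; reconciling different datums would require a proper coordinate transformation, which is deliberately not attempted here.

```python
import math
from dataclasses import dataclass

MEAN_EARTH_RADIUS_KM = 6371.0088  # spherical approximation assumed for this sketch

@dataclass(frozen=True)
class SampleLocation:
    name: str
    latitude: float   # decimal degrees, north positive
    longitude: float  # decimal degrees, east positive
    datum: str        # e.g. "WGS84"; stored so that sources can be reconciled

def great_circle_km(a, b):
    """Haversine distance between two locations recorded on the same datum."""
    if a.datum != b.datum:
        raise ValueError(
            f"datum mismatch ({a.datum} vs {b.datum}): transform coordinates first"
        )
    lat1, lat2 = math.radians(a.latitude), math.radians(b.latitude)
    dlat = lat2 - lat1
    dlon = math.radians(b.longitude - a.longitude)
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    # Clamp guards against rounding just above 1 for near-antipodal points.
    return 2 * MEAN_EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, h)))

# Hypothetical sample sites.
site_1 = SampleLocation("site 1", 0.0, 0.0, "WGS84")
site_2 = SampleLocation("site 2", 0.0, 90.0, "WGS84")
site_3 = SampleLocation("site 3", 42.6, 1.5, "ED50")

quarter_equator_km = great_circle_km(site_1, site_2)
```

Calling `great_circle_km(site_1, site_3)` raises a `ValueError`, which is the intended behaviour: the mismatch is reported rather than silently producing a distance from incompatible coordinates.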

Conservation management, commercial confidentiality or other issues may prevent the general dissemination of precise coordinates; however, the presumption should always be in favour of freedom of information and of recording, if not disseminating, accurate and precise coordinates.

Storage, query and dissemination

Detailed design criteria for phylogeographical data bases are beyond the scope of this paper; however, we describe here some data and information requirements for phylogeographical information systems. In common with most modern data bases, once a data structure has been developed, GISs are dynamic systems that immediately reflect the addition of new data and the amendment of existing data. GIS data bases can be easily linked to other spatial or non-spatial data bases, and the data structures themselves are readily editable without affecting the stored data.

Phylogeographic information systems will store data on the geographical location of sampled individuals and populations as points, lines or polygons. Character data such as DNA sequences, AFLP or RAPD bands or other phenotypic measurements will be stored in data tables linked to these explicit spatial map features. Phylogeographers undertake a wide variety of analyses on these ‘raw’ geographical and character data to create phylogeographical models and information in a wide variety of formats including phylogenetic trees and networks, pairwise matrices, spatial autocorrelation statistics and interpolated surfaces. As data are created and combined it is important that metadata describing the data lineage, i.e. its origins and history of manipulation (Clarke & Clark, 1995), are stored with the data set. External data, e.g. organism distributions, climate and models of landscape structure, may be integrated quantitatively into phylogeographic analysis; however, they are more often used in some form of secondary analysis or qualitatively compared to phylogeographic pattern using some form of visualization. Phylogeographical analysis may directly incorporate concordance with external information and give rise to historical scenarios, which are usually inferred as qualitative, semantic descriptions of a spatiotemporal sequence of cause and effect. We know of no data base system that can manage all these aspects of phylogeographical data.
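As an illustration of character data "linked to explicit spatial map features" with lineage metadata stored alongside, the following sketch builds a very small relational data base with Python's standard `sqlite3` module and retrieves records by geographical bounding box. The table names, column names and demonstration rows are all hypothetical; a real system would use a spatially enabled data base with proper geometry types, and the simple bounding-box test assumes the box does not cross the 180° meridian.

```python
import sqlite3

SCHEMA = """
CREATE TABLE site (
    site_id   INTEGER PRIMARY KEY,
    name      TEXT,
    latitude  REAL NOT NULL,
    longitude REAL NOT NULL,
    datum     TEXT NOT NULL
);
CREATE TABLE character_data (
    record_id INTEGER PRIMARY KEY,
    site_id   INTEGER NOT NULL REFERENCES site(site_id),
    marker    TEXT NOT NULL,
    value     TEXT NOT NULL
);
CREATE TABLE lineage (
    step_id   INTEGER PRIMARY KEY,
    record_id INTEGER NOT NULL REFERENCES character_data(record_id),
    operation TEXT NOT NULL
);
"""

def build_demo_database():
    """Create an in-memory data base holding three invented sample sites."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO site VALUES (?, ?, ?, ?, ?)",
        [(1, "Site A", 42.6, 1.5, "WGS84"),
         (2, "Site B", 42.7, 0.5, "WGS84"),
         (3, "Site C", 56.3, -2.8, "WGS84")],
    )
    conn.executemany(
        "INSERT INTO character_data VALUES (?, ?, ?, ?)",
        [(1, 1, "mtDNA", "ACGT"), (2, 2, "mtDNA", "ACGA"), (3, 3, "mtDNA", "TCGA")],
    )
    # Lineage rows record how each character record was produced.
    conn.executemany(
        "INSERT INTO lineage VALUES (?, ?, ?)",
        [(1, 1, "field collection"), (2, 1, "sequence alignment")],
    )
    return conn

def records_in_bounding_box(conn, south, west, north, east):
    """Return (site name, marker, value) for sites inside the box."""
    query = """
        SELECT s.name, c.marker, c.value
        FROM site AS s
        JOIN character_data AS c ON c.site_id = s.site_id
        WHERE s.latitude BETWEEN ? AND ?
          AND s.longitude BETWEEN ? AND ?
        ORDER BY s.site_id, c.record_id
    """
    return conn.execute(query, (south, north, west, east)).fetchall()

conn = build_demo_database()
southern_records = records_in_bounding_box(conn, south=42.0, west=0.0, north=43.0, east=2.0)
```

Keeping geometry, characters and lineage in separate linked tables means that new sites or markers can be added without restructuring existing data, which is the dynamic behaviour described above.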

The need to develop data bases to manage the growth in phylogenetic information was recognized over 10 years ago (Sanderson et al., 1993). Since then a number of data bases have been developed for both raw genetic data and phylogenies (Table 1). The success of these data bases depends not only on overcoming technical difficulties such as the representation and querying of phylogenies using relational or object-oriented data base models (Nakhleh et al., 2002) but also in the winning of ‘hearts and minds’ through the creation, or enforcement by journals and funding councils, of a culture of mutual benefit through data sharing.

Table 1. Data bases and data browsers for phylogenetic and character data

| Data base | Description | WWW address | Reference |
|---|---|---|---|
| TreeBASE | Relational data base that can store a wide range of character types (sequences, RFLPs, morphological, etc.) with associated phylogenetic trees. Trees stored as text strings in NEWICK format | http://www.treebase.org | Morell (1996), Nakhleh et al. (2002) |
| HOGENOM | A data base of homologous genes from fully sequenced organisms. Allows the selection of homologous genes among species, and the visualization of multiple alignments and phylogenetic trees | http://pbil.univ-lyon1.fr/databases/hogenom.html | Dufayard et al. (2005) |
| PALI | Data base of structure-based sequence alignments and phylogenetic trees derived from the three-dimensional structures of homologous proteins | http://pauling.mbu.iisc.ernet.in/~pali | Sujatha et al. (2001) |
| PANDIT | Data base of protein sequences, corresponding DNA sequences and derived trees | http://www.ebi.ac.uk/goldman-srv/pandit/ | Whelan et al. (2003) |
| Prometheus | Prototype object-oriented taxonomic data base that separates taxonomy from classification in two related hierarchies. Importantly uses specimen (rather than taxon) as basis of hierarchies. | http://www.dcs.napier.ac.uk/~prometheus/ | Pullan et al. (2000) |
| HICLAS | Data base that implements a ‘taxon view’ of taxonomy that has also been developed to store phylogenetic trees | | Zhong et al. (1999) |
| PGDB | Map interface to phylogeographic studies | http://seahorse.louisiana.edu/PGDB/ | |

GenBank (http://www.ncbi.nlm.nih.gov/Genbank/index.html) and other data bases for nucleotide and protein sequences provide good models for mutually collaborative phylogenetic or phylogeographic data bases. These phylogenetic data bases exhibit many of the data structures and tools required to store, query and disseminate phylogeographic information, including storing multiple character types and phylogenetic trees as well as allowing Internet access. However, none store geographic location data or information describing historical scenarios. TreeBASE (Sanderson et al., 1993; Morell, 1996; Nakhleh et al., 2002) is probably the most suitable data base system for adaptation to phylogeographic data because it already stores multiple character types as well as associated phylogenetic trees. However, it does not support phylogenetic networks, or the output of other analysis techniques, such as pairwise Fst matrices. In addition, TreeBASE stores phylogenetic trees as ASCII NEWICK-formatted strings, which limits the use of the Structured Query Language in data base queries. These latter non-geographic issues are being addressed (Nakhleh et al., 2002). The prototype Population Genetics Data Browser (PGDB; Table 1) provides a different approach to data dissemination based on a simple cartographic interface for viewing the geographical distribution of mtDNA sequence data from different studies. The PGDB, however, does not store trees or other derived data sets, nor does it allow searching on the basis of geographic location or the display of locations from more than a single study at a time.
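The limitation of storing trees as NEWICK strings can be seen by contrast with a row-per-node representation, in which each node records its parent and can therefore be queried relationally. The sketch below converts a simple NEWICK string into such rows. It is an illustration only, not a description of how TreeBASE or any other data base handles trees: the function names are hypothetical, and the parser handles only unquoted labels and optional branch lengths.

```python
def newick_to_rows(newick):
    """Return (node_id, parent_id, label, branch_length) rows for a tree.

    Supports nested parentheses, unquoted labels and ':length' suffixes.
    The root has parent_id None; missing labels and lengths are None.
    """
    text = newick.strip()
    if not text.endswith(";"):
        raise ValueError("a NEWICK string must end with ';'")
    text = text[:-1]
    rows = []
    pos = 0

    def parse_node(parent_id):
        nonlocal pos
        node_id = len(rows)
        rows.append(None)  # reserve the slot so parents precede children
        if pos < len(text) and text[pos] == "(":
            pos += 1
            while True:
                parse_node(node_id)
                if pos >= len(text):
                    raise ValueError("unbalanced parentheses")
                if text[pos] == ",":
                    pos += 1
                elif text[pos] == ")":
                    pos += 1
                    break
                else:
                    raise ValueError(f"unexpected character at position {pos}")
        start = pos
        while pos < len(text) and text[pos] not in ",():":
            pos += 1
        label = text[start:pos].strip() or None
        length = None
        if pos < len(text) and text[pos] == ":":
            pos += 1
            start = pos
            while pos < len(text) and text[pos] not in ",()":
                pos += 1
            length = float(text[start:pos])
        rows[node_id] = (node_id, parent_id, label, length)

    parse_node(None)
    if pos != len(text):
        raise ValueError("unexpected trailing text")
    return rows

def leaves_under(rows, label):
    """Labels of all terminal nodes descended from the node named `label`."""
    children = {}
    for node_id, parent_id, _, _ in rows:
        children.setdefault(parent_id, []).append(node_id)
    names = {node_id: name for node_id, _, name, _ in rows}
    stack = [next(n for n, _, name, _ in rows if name == label)]
    leaves = []
    while stack:
        node = stack.pop()
        kids = children.get(node, [])
        if kids:
            stack.extend(kids)
        else:
            leaves.append(names[node])
    return sorted(leaves)

# Hypothetical three-taxon tree with a named internal node.
tree_rows = newick_to_rows("((A:1,B:2)AB:0.5,C:3)root;")
clade_members = leaves_under(tree_rows, "AB")
```

Once a tree is held as rows, a question such as "which sampled haplotypes belong to this clade?" becomes a join against a table of sample locations rather than a string-matching exercise.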

A phylogeographic information system must be more than simply the addition of geographical representations to phylogenies because phylogeography is an integrative discipline that employs information derived from a wide range of other disciplines (Avise, 1998, 2000); GIS can provide this integration. For example, organism distribution has been employed in phylogeographical analysis (Hugall et al., 2002, 2003) and is becoming increasingly accessible through the online provision of digital museum inventories as well as dedicated organism-mapping projects (Table 2). A considerable quantity of environmental data sets is now available in digital format. Of particular interest are those freely available from government and inter-government organizations. Presently, most free data sets are at global or continental scales with regional and local scale data less easily available and often costly to obtain. Geodata portals are good ways to find out what data are available from whom (Table 2). However, copyright, cost, proprietary ownership and restrictions due to conservation status may limit the availability of some data.

Table 2. Digital data sets available online of interest to phylogeographers. This table provides only examples of the types of data and data search engines available

| Data source | WWW address |
|---|---|
| Species distribution maps | |
| Atlas Florae Europaeae – European flora | http://www.fmnh.helsinki.fi/english/botany/afe/ |
| Fauna Europaea – European fauna | http://www.faunaeur.org/ |
| United States Geological Survey – distribution of North American trees | http://climchange.cr.usgs.gov/data/atlas/little/ |
| Centre for Ecology and Hydrology and Joint Nature Conservation Committee – United Kingdom flora and fauna | http://www.searchnbn.net |
| Centre Suisse de Cartographie de la Faune – Swiss fauna | http://www.cscf.ch/ |
| Sprinkhanenwerkgroep van de Benelux – Orthoptera in Belgium and Luxembourg | http://www.saltabel.org/default.htm |
| Geodata portals | |
| Australian Spatial Data Directory | http://asdd.ga.gov.au/asdd/ |
| New York State, USA | http://www.nysgis.state.ny.us/index.html |
| United Kingdom | http://www.gigateway.org.uk/default.html |
| United Nations Environment Programme | http://geodata.grid.unep.ch/ |
| United States of America Government | http://www.geodata.gov/ |

Analysis

There has been a recent resurgence of interest in how space and geography are incorporated in phylogeographic analyses (e.g. Sork et al., 1999; Barbujani, 2000; Knowles & Maddison, 2002; Epperson, 2003; Manel et al., 2003; Sites & Marshall, 2003; Thompson, 2005). Here we classify analytical techniques according to their function (Table 3). The first class of techniques extracts spatial pattern from geographically distributed genetic data to identify either geographical partitions or clines (first order effects, in the terminology of spatial statistics), or alternatively, patterns of isolation-by-distance (second order effects). The second class of techniques attempts to infer historical scenarios directly from observed distributions of genes or taxa and one or more phylogenetic models (though arguably such inferences are typically made in an explicit model-free post hoc manner in most phylogeographic studies). This class contains the historical biogeographic techniques of ancestral area analysis (Bremer, 1992, 1995), dispersal vicariance analysis (Ronquist, 1997) and secondary BPA (Brooks, 1990; Brooks et al., 2001) because they have been (Bremer, 1992; Taberlet et al., 1998), or potentially could be (Riddle & Hafner, 2004), applied to phylogeographic problems if allele distributions are substituted for species distributions. A third class provides statistical testing of specific historical population genetic scenario hypotheses. Space limitations prevent examination of the pros and cons of these various methods (see cited reviews and original papers); instead we concentrate on how GIS have been or could be employed to enhance and extend these phylogeographic analyses.

Table 3. Spatial phylogeographic analysis techniques

| Function/method | Description | Software | References |
|---|---|---|---|
| Pattern identification | | | |
| First order partitions and clines | | | |
| Analysis of Molecular Variance (AMOVA) | Hierarchical partitioning of variance into predefined geographical or otherwise defined groups. | ARLEQUIN, http://lgb.unige.ch/arlequin/ | Excoffier et al., 1992; Machado et al., 2002 |
| Spatial Analysis of Molecular Variance (SAMOVA) | Identifies adjacent sample locations by creating a Voronoi polygon tessellation from the sample points, which is subsequently partitioned into neighbourhood clusters using a simulated annealing approach. | http://wwwpeople.unil.ch/Isabelle.Dupanloup/samova.html | Dupanloup et al., 2002 |
| Monmonier's Maximum Difference Algorithm | Identifies lines of maximum genetic distance between groups of sample locations. A connectivity network is generated between sample locations and genetic distances calculated between locations connected by the network. A path is threaded through the network from the site of maximum pairwise differentiation, following the line of maximum differentiation at each network node. | Alleles in Space, http://www.marksgeneticsoftware.net/ | Miller, 2005; Dupanloup et al., 2002 |
| Wombling | Finds areas of abrupt change in multivariate variables. Irregularly geographically sampled variables are interpolated to a grid and surfaces differentiated to determine the rate of change at each grid point. Two maps are output showing the overall rate of change of the variables with distance and significance via permutation. | ORINOCO (M.E. Hurles, PhD thesis, University of Leicester, 1999); BoundarySeer, http://www.terraseer.com/boundaryseer.htm | Womble, 1951; Rosser et al., 2000 |
| Synthetic Maps | Variables are standardized and interpolated to a grid surface. A Principal Components Analysis is then undertaken on these surfaces to create synthetic (heuristic) maps of multivariate change in space. Geographical clines observed in the synthetic maps need further analysis to determine if they are artefacts of the sampling distribution. | | Cavalli-Sforza et al., 1994; Cesaroni et al., 1997; Rendine et al., 1999; Sokal et al., 1999a,b; Kidd & Ritchie, 2000; Kidd, 2001 |
| Hybrid Zone Cline Fitting | Uses the Metropolis algorithm and maximum likelihood to fit the centreline of a tanh cross-section cline in continuous space to irregularly spaced genetic data. | Analyse, http://helios.bto.ed.ac.uk/evolgen/Mac/Analyse/ | Bridle et al., 2001 |
| Second order Isolation-by-Distance (IBD) | | | |
| Spatial autocorrelation (Moran's I, Geary's c), semivariograms, pairwise Fst v. geographical distance, Mantel test | Various ways of examining the relationship between genetic differentiation and geographical distance. Significance of IBD is often determined via a Mantel test. | IBD, http://www.bio.sdsu.edu/pub/andy/IBD.html; SPAGeDI, http://www.ulb.ac.be/sciences/lagev/spagedi.html; SGS, http://kourou.cirad.fr/genetique//software.html; Phylogeographer 1.0, http://www.maizegenetics.net/bioinformatics/phyloindex.htm; R (spdep package), http://cran.r-project.org/src/contrib/PACKAGES.html; Alleles in Space, http://www.marksgeneticsoftware.net/ | Buckler, 1999; Barbujani, 2000; Bohonak, 2002; Epperson, 2003; Miller, 2005 |
| Allelic Aggregation Index | A modification of Clark & Evans' (1954) test for non-random distribution of genetic diversity across a landscape. Test identifies random, aggregated and spatially uniform patterns. | Alleles in Space, http://www.marksgeneticsoftware.net/ | Miller, 2005 |
| Combined Analysis | | | |
| Partial Mantel test | Regression analysis of pairwise matrices with significance testing via permutation test. Often used to test significance of IBD, but more variables can be included in a partial Mantel test, allowing the incorporation of environmental or historical parameters. | ZT, http://www.psb.rug.ac.be/~erbon/mantel/; Phylogeographer 1.0, http://www.maizegenetics.net/bioinformatics/phyloindex.htm | Thorpe, 1996; Ritchie et al., 2001 |
| Direct historical inference | | | |
| Ancestral area analysis (AAA) | Infers an ancestral area for a clade from the distribution of extant members of the clade in defined areas by comparing the gains and losses of areas during Camin–Sokal or reversible Fitch parsimony. | | Bremer, 1992, 1995; Ronquist, 1994, 1995 |
| Dispersal vicariance analysis (DIVA) | Infers the vicariance, dispersal and extinction history of internal clade nodes using a method that minimizes dispersal and extinction in relation to vicariance, derived from Page's ‘maximum co-speciation method’ for inferring co-evolutionary histories. Allows multiple reticulate relationships between areas. | DIVA, http://www.ebc.uu.se/systzoo/research/diva/diva.html | Ronquist, 1997; Anderson, 2002 |
| Secondary Brooks Parsimony Analysis (BPA) | Infers the vicariance, dispersal and extinction history of internal clade nodes through the elimination of area homoplasy via iterative application of parsimony and area definition re-evaluation. | PAUP/NONA | Brooks, 1990; Brooks et al., 2002 |
| Nested-Clade Analysis (NCA) | The nested relationships between clades in a haplotype network and relative spatial distances are used to infer the evolutionary history of an organism (whether it is undergoing range expansion, or has had a history of population fragmentation, vicariance, etc.) through an inference key. | GeoDis 2.0, http://darwin.uvigo.es/software/geodis.html | Templeton et al., 1995; Templeton, 1998, 2001; Knowles, 2001 |
Historical population genetic hypotheses testing
 Global Fst and derivativesGenerally only determine panmictic from non-panmictic, simple population structure (island, stepping-stone). Often the first (or only) step of any analysis because if populations are panmictic no significant spatial (or other) structure exists.ARLEQUIN, http://lgb.unige.ch/arlequin/ 
 Structured coalescent modelsTests concordance between observed and simulated gene trees constrained within a population model consisting of a geographic (island), demographic, and possibly migration and recombination models.MESQUITE, http://mesquiteproject.org/mesquite/mesquite.html TREEVOLVE v1.3, http://evolve.zoo.ox.ac.uk/software/Treevolve/main.html SIMCOAL v.1.0, http://cmpg.unibe.ch/software/simcoal/Knowles & Maddison, 2002; Knowles, 2001

First and second order pattern identification

First order genetic pattern in multiple traits has been investigated using variations of AMOVA (Excoffier et al., 1992), Wombling (Womble, 1951; Rosser et al., 2000), synthetic maps (Cavalli-Sforza et al., 1994; Cesaroni et al., 1997; Rendine et al., 1999; Sokal et al., 1999a,b; Kidd & Ritchie, 2000; Kidd, 2001), Monmonier's Maximum Difference Algorithm (Monmonier, 1973) and the fitting of cline models using maximum likelihood techniques (Barton & Baird, 1998; Bridle et al., 2001). Wombling, synthetic maps and Monmonier's Algorithm require the spatial interpolation of individual character surfaces from point data. A considerable range of interpolation algorithms is available in a variety of software packages, including GIS. Character surfaces have been interpolated using a number of algorithms including inverse-distance weighting (Kidd & Ritchie, 2000), kriging (Cesaroni et al., 1997; Petit et al., 1997) and Delaunay triangulation (Miller, 2005). However, these algorithms assume spatial homogeneity in the distribution of the causal phenomena – a criterion not normally met by organisms. Individuals are rarely distributed uniformly in space, population density may vary greatly throughout the range, and ranges often enclose unsuitable habitat where the species is not present. For example, density influences the spatial pattern and clines characteristic of tension zones, with clines expected to settle in low-density regions (Nichols, 1988). Standard interpolation algorithms are very poor at incorporating such geographical heterogeneity in the interpolated surfaces (Kidd, 2001). The Network Surfacing approach (Kidd, 2001; Kidd & Ritchie, 2001), however, allows the a priori inclusion of partial or absolute barriers in surface generation, providing a way of incorporating some elements of demography into interpolated surfaces.
In addition to issues of interpolation algorithm choice, techniques such as synthetic maps and Wombling that combine interpolated surfaces have been criticized because the interpolation process itself introduces additional spatial autocorrelation above that inherent in the phenomena being mapped. This effect can result in even spatially randomized data producing apparent trends and patterns (Rendine et al., 1999; Sokal et al., 1999a,b).
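To make the homogeneity assumption of standard interpolators concrete, the following is a minimal sketch of inverse-distance weighting, one of the algorithms mentioned above. It is illustrative only: the function name `idw`, the power parameter of 2 and the three sample sites are assumptions made for the example and are not taken from any of the cited studies or packages.

```python
import math

def idw(samples, x, y, power=2.0):
    """Inverse-distance-weighted estimate of a character value at (x, y).

    samples: list of (x, y, value) tuples, e.g. an allele frequency per site.
    power:   distance exponent (2 is a common default, assumed here).
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for sx, sy, value in samples:
        d = math.hypot(x - sx, y - sy)
        if d == 0.0:
            return value  # exactly on a sample site: return the observation
        w = 1.0 / d ** power
        weighted_sum += w * value
        weight_total += w
    return weighted_sum / weight_total

# Hypothetical sites: (x, y, allele frequency)
sites = [(0.0, 0.0, 0.10), (10.0, 0.0, 0.90), (0.0, 10.0, 0.50)]

# Interpolate onto a coarse 3 x 3 grid covering the sampled area
grid = [[idw(sites, gx, gy) for gx in (0, 5, 10)] for gy in (0, 5, 10)]
```

Every site contributes to every grid cell according to straight-line distance alone, so a barrier or an area of unsuitable habitat lying between a site and a cell has no effect on the estimate. Neighbouring cells also share almost the same weights, which is one way of seeing how interpolation adds spatial autocorrelation to the surface.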

Second order phylogeographic pattern is often investigated through the analysis of distance matrices; however, the Allelic Aggregation Index (Clark & Evans, 1954; Miller, 2005) provides an alternative approach (Table 3). Geographic distance matrices can be created from direct measurement in the field or from a map; however, it is more common that they are calculated from explicit spatial coordinates (e.g. from a GPS). Taking populations as an example, they may be spatially represented either as points or polygons. Lines may be appropriate for linear populations such as those associated with river corridors. If a study covers a local area (100s of kilometres) then Euclidean distances between point locations can be calculated relatively easily and accurately using Pythagoras’ formula. At coarser scales, the curvature of the earth becomes significant and great circle distances should be used; alternatively, coordinates can be projected to an equidistant map projection before distance measurement.
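As an illustration of the great circle alternative, the sketch below builds a pairwise geographic distance matrix from latitude and longitude using the haversine formula. It assumes a spherical Earth with a mean radius of 6371 km; the function names and the example coordinates are illustrative only.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius; assumes a spherical Earth

def great_circle_km(lat1, lon1, lat2, lon2):
    """Great circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def distance_matrix(sites):
    """Symmetric matrix of great circle distances for a list of (lat, lon) pairs."""
    n = len(sites)
    return [[great_circle_km(*sites[i], *sites[j]) for j in range(n)]
            for i in range(n)]

# Illustrative sample locations as (latitude, longitude) in decimal degrees
sample_sites = [(56.34, -2.79), (36.00, -78.94), (52.63, -1.13)]
geo = distance_matrix(sample_sites)
```

The resulting matrix can be used directly as the geographic distance matrix in an isolation-by-distance analysis. For studies where an ellipsoidal model matters, a geodesic routine from a GIS or projection library would be the more appropriate choice.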

GIS have been demonstrated to be very useful in the investigation of second-order isolation-by-distance patterns through exploration of non-Euclidean geographic distances between sample locations that may better represent the interaction of an organism and the landscape in which it exists. Standard GIS functions can be used to generate cost–distance matrices from ecological models or data on organism distribution. Grid-based minimum cost–distance analysis identified streams and roads as significant barriers to gene flow for prairie dogs, whereas the highly erosive cliffs, characteristic of the badlands environment, were not (Bowser, 1996). Raster (grid) based methods tend to overestimate distance because routes traverse the grid structure via adjacent raster cells (Kidd, 2001). This problem can be overcome by representing geographical proximity hypotheses as spatial networks connecting sample locations. Such networks are not confined to an orthogonal grid structure, and hence, the paths between sample locations will be the correct length whatever the direction. Networks can be designed to incorporate additional connectivity hypotheses including barriers to dispersal such as mountain ranges or coastlines (Buckler, 1999; Kidd, 2001; Kidd & Ritchie, 2001). GIS has also been used for examining the effect of isolation-by-distance on genetic relatedness of red squirrel populations where woodland is represented as polygons (Hale et al., 2001). GIS-based grouping of woodland patches using different inter-woodland clustering distances showed that Fst increases significantly when the clustering distance is greater than 1500 m. The authors recognized that their analysis was only made possible by the availability of accurate woodland maps and genetic sampling over the same time period, and the integration of these data within a GIS.
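The grid-based cost–distance idea, and the overestimation it can introduce, can be sketched with a small least-cost search over a raster. This is a simplified illustration rather than the algorithm of any particular GIS: the function name, the eight-neighbour movement rule, the averaging of adjacent cell costs and the use of `None` for an absolute barrier are all assumptions made for the example.

```python
import heapq
import math

def least_cost_distance(cost, start, goal):
    """Accumulated least cost between two cells of a raster (Dijkstra search).

    cost:  2D list of per-cell traversal costs; None marks an absolute barrier.
    start, goal: (row, col) tuples; both are assumed not to be barrier cells.
    A move to any of the eight neighbours costs the step length multiplied
    by the mean cost of the two cells involved.
    """
    rows, cols = len(cost), len(cost[0])
    best = {start: 0.0}
    queue = [(0.0, start)]
    while queue:
        dist, (r, c) = heapq.heappop(queue)
        if (r, c) == goal:
            return dist
        if dist > best.get((r, c), math.inf):
            continue  # stale queue entry
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if (dr, dc) == (0, 0) or not (0 <= nr < rows and 0 <= nc < cols):
                    continue
                if cost[nr][nc] is None:
                    continue  # barrier cell cannot be entered
                step = math.hypot(dr, dc) * (cost[r][c] + cost[nr][nc]) / 2.0
                new_dist = dist + step
                if new_dist < best.get((nr, nc), math.inf):
                    best[(nr, nc)] = new_dist
                    heapq.heappush(queue, (new_dist, (nr, nc)))
    return math.inf  # goal unreachable

# Uniform surface: the grid route is longer than the straight line
uniform = [[1.0] * 3 for _ in range(3)]
grid_route = least_cost_distance(uniform, (0, 0), (1, 2))
straight_line = math.hypot(1, 2)

# A partial barrier (e.g. a river with one crossing) forces a detour
barrier = [[1.0, None, 1.0],
           [1.0, None, 1.0],
           [1.0, 1.0, 1.0]]
detour = least_cost_distance(barrier, (0, 0), (0, 2))
```

On the uniform surface the route between cells (0, 0) and (1, 2) must combine one diagonal and one orthogonal step, so it is longer than the straight-line distance between the cell centres. This is the directional bias that the network representation described above is intended to avoid.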

Partial Mantel tests partition variation in a dependent pairwise distance matrix between multiple independent matrices and thus may be used in combined analysis of both first and second order pattern. Independent matrices may represent a geographical distance hypothesis (typically Euclidean, but ideally from a more realistic network), or environmental or historical contrasts between sites (Thorpe, 1996). GIS has been used to create all of these matrix types (e.g. Ritchie et al., 2001).
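A partial Mantel test with a single controlling matrix can be sketched as follows. This is a minimal illustration and not the implementation used by ZT or Phylogeographer: it assumes the partial correlation form of the statistic, a one-tailed test, and permutation of the rows and columns of the dependent matrix. All function names and the example data are hypothetical.

```python
import math
import random

def upper(m):
    """Flatten the upper triangle of a square distance matrix."""
    n = len(m)
    return [m[i][j] for i in range(n) for j in range(i + 1, n)]

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_r(a, b, c):
    """Correlation of a and b after removing the linear effect of c."""
    rab, rac, rbc = pearson(a, b), pearson(a, c), pearson(b, c)
    return (rab - rac * rbc) / math.sqrt((1 - rac ** 2) * (1 - rbc ** 2))

def partial_mantel(gen, geo, env, permutations=999, seed=0):
    """Return (partial r of gen and geo given env, one-tailed permutation p)."""
    rng = random.Random(seed)
    n = len(gen)
    b, c = upper(geo), upper(env)
    observed = partial_r(upper(gen), b, c)
    hits = 1  # the observed arrangement counts as one permutation
    for _ in range(permutations):
        order = list(range(n))
        rng.shuffle(order)
        # Permute sites, i.e. rows and columns of the dependent matrix together
        shuffled = [[gen[order[i]][order[j]] for j in range(n)] for i in range(n)]
        if partial_r(upper(shuffled), b, c) >= observed:
            hits += 1
    return observed, hits / (permutations + 1)

# Hypothetical data for five sites along a transect
xs = [0.0, 1.0, 3.0, 6.0, 10.0]          # position (km)
env_values = [2.0, 0.0, 5.0, 1.0, 3.0]   # an environmental variable
geo = [[abs(a - b) for b in xs] for a in xs]
env = [[abs(a - b) for b in env_values] for a in env_values]
# Genetic distances constructed as an exact mix of the two, for illustration
gen = [[0.1 * geo[i][j] + 0.05 * env[i][j] for j in range(5)] for i in range(5)]

r, p = partial_mantel(gen, geo, env)
```

Because sites rather than individual matrix cells are permuted, the non-independence of pairwise distances is respected. The geographic matrix passed in could equally be a network or cost–distance matrix generated within a GIS.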

Direct inference of historical scenarios

One of the main aims of historical biogeography is to infer past events (often only vicariance events but also dispersal events) and distributions from present taxon distributions and phylogenies (see Table 3). In phylogeography, direct inference techniques attempt an analogous step except that we are generally interested in the geographical distribution of clades within a taxon. Some direct inference techniques infer histories from single phylogenies (AAA, Bremer, 1992, 1995; NCA, Templeton et al., 1995; Templeton, 1998, 2001, 2004; and DIVA, Ronquist, 1997; Anderson, 2002), while others (secondary BPA; Brooks, 1990; Brooks et al., 2001) require three or more phylogenies to differentiate general pattern (i.e. common history) from specific pattern (i.e. individual history). With the exception of NCA, direct inference techniques are rarely employed in phylogeography; examples include: Taberlet et al. (1998), who used BPA to look for general pattern in genetic variation between ten species across Europe; and Bermingham & Martin (1998), who investigated the common history of three genera of Central American fish with the COMPONENT (Page, 1993) tree comparison program. NCA is a single taxon technique specifically designed for the analysis of phylogeographic pattern, which infers an historical scenario through a complex logical-deductive key that compares the spatial spread of individuals within clades with distances between the spatial centroids of individuals within hierarchical clades (but see Knowles & Maddison, 2002).

The success of direct inference techniques is determined by: (1) the information content of the phylogeographic data, and (2) the applicability of the assumptions to the specific problem under investigation. All phylogeographic and historical biogeographic signals degrade over time; however, some signals can still be detected after many millions of years. Phylogeographic studies are limited in both spatial sampling density and within-site depth by cost and time. Hence, techniques should take the quality of the phylogeny and spatial sampling into account when determining the ‘plausibility’ of historical scenarios. None of the identified techniques do this quantitatively; however, qualitative assessment of data quality is integral to NCA (Templeton, 1998, 2004). Many techniques employ quite naive models of both geographical space and biogeographical process. In AAA, BPA and DIVA geography is reduced to independent user-defined areas and biogeographic process to vicariance and/or inter-area dispersal. While such simple models may be suitable for continental scale plate tectonic induced patterns, they are unable to include known adjacency constraints on area relationships that are common both within and between tectonic plates. In NCA, the spatial distribution of clades and the spatial relationships between clades are quantified with two indices of spatial proximity: the clade distance and the nested clade distance. These indices are spatial rather than geographical, as they do not incorporate information on clade shape or the topological relationships between clades. Geographical adjacency, however, is integral to Hovenkamp's (2001) graphical approach to locating and temporally ordering vicariance events, as well as the recently published likelihood framework for inferring taxon range evolution (Ree et al., 2005).
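The two NCA indices can be sketched in a few lines, which also makes clear why they are spatial rather than geographical. The sketch assumes the usual definitions (the clade distance is the mean distance of individuals in a clade from that clade's geographical centre; the nested clade distance is their mean distance from the centre of the nesting clade), planar coordinates, and one coordinate pair per sampled individual. It is not the GeoDis implementation, and the function names and coordinates are illustrative.

```python
import math

def centre(points):
    """Geographical centre (mean coordinate) of a list of (x, y) individuals."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

def mean_distance(points, origin):
    """Mean straight-line distance of the individuals from a given origin."""
    return sum(math.hypot(x - origin[0], y - origin[1]) for x, y in points) / len(points)

def clade_distances(clade, nesting_clade):
    """Return (clade distance, nested clade distance) for one clade.

    clade:         coordinates of individuals in the clade of interest.
    nesting_clade: coordinates of all individuals in the next-higher clade,
                   including those in `clade`.
    """
    dc = mean_distance(clade, centre(clade))
    dn = mean_distance(clade, centre(nesting_clade))
    return dc, dn

# Two hypothetical sister clades nested within one higher-level clade
clade_a = [(0.0, 0.0), (2.0, 0.0)]
clade_b = [(10.0, 0.0), (12.0, 0.0)]
dc_a, dn_a = clade_distances(clade_a, clade_a + clade_b)
```

Only point-to-centre distances enter the calculation, so two clades with very different shapes, or separated by very different terrain, can return identical values.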

GIS has potential as a flexible computing environment in which data can be pre-processed for input to these approaches, e.g. preparing species-area tables. However, these techniques may be viewed as historical scenario hypothesis generators; the hypotheses subsequently being tested with pattern analysis (described above) or hypothesis testing techniques (described below and Riddle & Hafner, 2004).

Historical scenario hypothesis testing

The coalescent originated as a statistical model of neutral genes within a lineage (Kingman, 1982); however, it is only mathematically tractable for very simple population models, and the term is more generally used to refer to reverse-time simulation models of lineages within some population history structure (Knowles, 2001; Knowles & Maddison, 2002; Rosenberg & Nordborg, 2002; Hey & Machado, 2003). Many lineages are simulated within a structure to generate an expected distribution of neutral phylogenies from which test statistic distributions, e.g. Slatkin's s (Slatkin & Maddison, 1989), are extracted and compared to the observed value. GIS has huge potential as the environment within which structured geo-historical hypotheses can be developed from external information concerning environmental or general biotic history. For example, recent advances in ecological niche modelling (Scott et al., 2002) and the availability of present and past climatic and other environmental data have made it possible to estimate potential historic distributions that are completely independent from organism trait data. Using such models, Hugall et al. (2002, 2003) identified congruence between Pleistocene vicariance of the ranges of Australian wet tropical rain forest land snails and their phylogeographic pattern (Fig. 3). These palaeodistribution predictions constitute an historical scenario hypothesis that can be translated into a structure and subsequently tested against observed genetic variability with coalescent simulation.
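The test statistic mentioned above, Slatkin's s, is the minimum number of migration events needed to explain the sampling locations of the tips of a gene tree, obtained by parsimony. A minimal sketch of that calculation is given below using Fitch parsimony. It assumes a rooted, strictly bifurcating tree written as nested tuples with location labels at the tips; the representation, function names and example trees are illustrative.

```python
def fitch(tree):
    """Return (set of possible locations at this node, migration events below it).

    tree: a location label (str) for a tip, or a 2-tuple of subtrees.
    """
    if isinstance(tree, str):
        return {tree}, 0
    left, right = tree
    left_set, left_events = fitch(left)
    right_set, right_events = fitch(right)
    shared = left_set & right_set
    if shared:
        # Descendants can share a location: no extra migration required
        return shared, left_events + right_events
    # No shared location: one migration event must be inferred at this node
    return left_set | right_set, left_events + right_events + 1

def slatkin_s(tree):
    """Minimum number of inter-location migration events implied by the tree."""
    return fitch(tree)[1]

# Hypothetical gene trees for samples from a northern (N) and southern (S) region
structured = ((("N", "N"), "N"), (("S", "S"), "N"))
mixed = (("N", "S"), ("N", "S"))

s_structured = slatkin_s(structured)
s_mixed = slatkin_s(mixed)
```

In the hypothesis-testing framework described above, the same function would be applied to each gene tree simulated under the historical scenario, and the observed value compared with the resulting distribution. The simulation step itself is not shown here.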

Figure 3.

 Combining palaeoclimate and phylogeographic data, an example. BIOCLIM habitat suitability distribution models for the Australian Wet Tropics rain forest land snail Gnarosophia bellendenkerensis (Brazier 1875) for (a) the last glacial maximum (LGM) and (b) current distribution, with mtDNA haplotype phylogeny, showing the geographical association of the major clades. Letters refer to regions discussed in the original paper. Figure adapted from Fig. 1 in Hugall et al. (2002); copyright (2002) National Academy of Sciences, USA.

Visualization

Scientific visualization of phylogenies and phylogeographic patterns has a long history (Hewitt, 2001) and is an important but sometimes overlooked aspect of phylogeography. A common visualization is the simple cartographic display of trait data as pie charts, scaled dots or isolines over a base map. Hoffmann et al. (2003) provide an example of cartographic visualization of genetic character states and differentiation index scores. Such maps can be created simply and efficiently within GIS. Phylogenetic trees and networks are often visualized over a cartographic background. Trees with associated geographical coordinates can be imported and displayed with the Mesquite Cartographer (Maddison & Maddison, 2005) and Phylogeographer (Buckler, 1999) packages. When phylogenetic trees and networks are displayed over geography it is usual for tree branch length to be distorted to fit the geography. An alternative would be to distort the map to fit the phylogeny – an interesting means of visualizing geography in terms of organism gene flow or dispersal. More complex interactive visualization of geographical hierarchies can be undertaken with linked graphics windows displaying colour-coded trees and maps; selection of a level of hierarchy in the tree results in a map of geographic distribution of the tree branches at that level (e.g. White & Sifneos, 2002). Many GIS provide tools for visualizing three-dimensional spatial data that can be used to visualize spatiotemporal data in two dimensions of space and one of time (e.g. Fig. 4; see also Fig. 6 in Perez et al., 2002). One of the authors (D.M.K.) is presently developing an application for ArcGIS (http://www.esri.com) to facilitate such visualizations.

Figure 4.

 Visualizing phylogenies in ArcGIS ArcScene® (ESRI, Redlands, CA, USA, 2004). Phylogenies of the freshwater fish family Goodeidae (shades of purple by tribe; Webb et al., 2004) and genera Poeciliopsis (green; Mateos et al., 2002) and Notropis (blue; Schönhuth & Doadrio, 2003) with modern elevation and drainage, and Pliocene and Miocene drainage and palaeolakes from de Cserna & Alvarez (1995). The visualization can be rotated and zoomed, phylogenies can be rescaled in z, and additional information can be added (e.g. distributions or fossil locations) to aid the identification of spatio-temporal congruence.

Spatial interpolation is a useful visualization approach for continuous phylogeographic pattern, e.g. allele frequencies. The Alleles in Space package (Miller, 2005) has a ‘genetic landscape shapes’ algorithm designed specifically to visualize patterns of genetic diversity. In genetic landscape shapes, pairwise genetic distances are calculated at the intermediate points of a connectivity network and an inverse-distance interpolation algorithm is then used to generate a continuous spatial grid of estimated genetic distances. The genetic distance surfaces can then be visualized as a three-dimensional surface with height representing genetic distance (e.g. Miller et al., 2006).

Discussion

Phylogeography attempts to infer history from the geographical variation of genes and genetically controlled characters. Understanding the geographical context of the observed pattern is essential in the development of evolutionary models of real organisms. GIS provides a ready-made computer environment for the capture, storage, analysis, visualization and dissemination of geographic data and has the potential to form the basis of an integrated phylogeographical information system. Internet GIS server technologies (Peng & Tsou, 2003) allow data bases to be disseminated and managed via the Internet, eventually perhaps to provide a global, multi-species, phylogeographical information system greater than any one institution could create. Current sampling approaches may not match the ideal required by many available spatial analysis techniques and no existing data base fulfils all the requirements for storing and disseminating phylogeographic information. Given the near ubiquity of GIS applications in dealing with spatial data in a wide range of scientific disciplines, as well as business and government, its potential application should be seriously examined by the phylogeographic community.

The review of spatial analysis methods and software (Table 3) reveals a growing interest in spatial and geographic analysis by phylogeographers. Each software package examined, however, has its own unique file formats for importing geographical data. The lack of a common ‘GeoPhylo’ data exchange format limits the use of these programs and makes comparative studies to assess their relative merits time consuming. It is time that a standard ‘GeoPhylo’ data exchange format was developed to allow easy exchange of spatial phylogenies, historical reconstructions, diversity surfaces and their associated metadata between GIS and other programs. Any format should be compliant with developing standards for geographical data exchange, in particular, the Geography Markup Language (GML; Open GIS Consortium, Inc., 2004).

In this article, we have provided an introduction to GIS-based spatial representation and tools, as well as describing examples of GIS-mediated analysis and visualization of phylogeographic data. It seems highly likely that GIS will become a central tool in the future of phylogeography through its ability to link disparate organism studies explicitly to each other and to the environmental context in which the variation formed, through the common language of geospatial referencing.

Acknowledgements

We are grateful to Ruth Hamill, Constantino Macias-Garcia and Pablo Gesundheit for some valuable comments concerning this article and Andrew Hugall for supplying Fig. 3. We also thank two anonymous referees for their extensive and valued comments. This work was supported by NERC standard grant NER/A/S/2001/00450.

Biosketches

David M. Kidd has worked with GIS for 12 years in local government, industry and latterly academia. He was recently employed at the University of St. Andrews applying GIS to phylogeographic and historical biogeographic problems in a genus of European bushcrickets, and is now on a fellowship at the National Evolutionary Synthesis Center, based at Duke University, North Carolina, USA, developing GIS applications for phylogeographers.

Michael G. Ritchie is Professor of Evolution at the University of St. Andrews. He has research interests in behavioural genetics and phylogeography of a broad range of species.

Editor: Brett Riddle
