Spatial modelling and landscape-level approaches for visualizing intra-specific variation



    1. Center for Tropical Research, Institute of the Environment, University of California, Los Angeles, La Kretz Hall, Suite 300, 619 Charles E. Young Dr. East, Los Angeles, CA 90095, USA
    2. School of Biological Sciences, University of Nebraska, Lincoln, NE 68588, USA
    3. Department of Ecology and Evolutionary Biology, University of California, Los Angeles, 621 Charles E. Young Dr. East, Los Angeles, CA 90095, USA

Henri A. Thomassen, Fax: +1 310 825 5446; E-mail:


Spatial analytical methods have been used by biologists for decades, but with new modelling approaches and data availability their application is accelerating. While early approaches were purely spatial in nature, it is now possible to explore the underlying causes of spatial heterogeneity of biological variation using a wealth of environmental data, especially from satellite remote sensing. Recent methods can not only make inferences regarding spatial relationships and the causes of spatial heterogeneity, but also create predictive maps of patterns of biological variation under changing environmental conditions. Here, we review the methods involved in making continuous spatial predictions from biological variation using spatial and environmental predictor variables, provide examples of their use and critically evaluate the advantages and limitations. In the final section, we discuss some of the key challenges and opportunities for future work.


New developments in spatial analysis of biodiversity and the increasing availability of georeferenced environmental data (Box 1) have led to their rapid integration into the field of molecular ecology. For example, recent extensions of classic phylogeographic analyses have integrated geographic information system (GIS) analyses of environmental data and ecological niche modelling techniques into phylogeographic inference on both contemporary and historical timescales (Kidd & Ritchie 2006; Carstens & Richards 2007; Knowles et al. 2007; Kozak et al. 2008; Swenson 2008). Indeed, the unification of environmental and genetic data has given rise to an entire subdiscipline within the field of molecular ecology, landscape genetics, which uses these landscape-level data in spatially explicit studies to better understand the historical and contemporary factors that influence the distribution of intra-specific diversity (Manel et al. 2003).

Much of the work in landscape genetics and phylogeography seeks to identify the historical and contemporary landscape features that have shaped current patterns of biological diversity. Recent advances in GIS technologies and spatial statistics have increased the predictive power of spatial analyses by refining approaches that identify and quantify associations between environmental variables and intra-specific genetic or phenotypic diversity (e.g. Foll & Gaggiotti 2006; Joost et al. 2007) and use this information to project the response variable patterns in space (the landscape) and time. These statistical associations can be used to project patterns of diversity across unsampled areas of the landscape, resulting in continuous predictions of (genetic or phenotypic) variation. When projected onto historical or estimated future environmental conditions, these predictions can be used to understand the spatio-temporal dynamics of biological variation (i.e. any type of variation within or between species or communities, such as alpha or beta diversity) under various scenarios of changing environmental conditions. Detailed maps of genetic, phenotypic and demographic variation have recently been used to address a broad array of topics, such as the nature of microevolutionary processes that promote diversification (Carnaval et al. 2009; Pease et al. 2009; Freedman et al. 2010), the prioritization of areas for conservation (Vandergast et al. 2008; Thomassen et al. 2010) and the identification of current and future hotspots of disease outbreaks (Gilbert et al. 2008).

An understanding of the utility, predictive power and limitations of these analyses is an important prerequisite for their application in molecular ecology. As such, our aim here is to review the techniques that generate spatially explicit predictions of biological variation (Table 1). In doing so, we specifically focus on techniques that explicitly integrate environmental information (such as climate or vegetation characteristics; see Box 1), as opposed to methods that perform purely spatial analysis (and thus focus on the relative position of divergent populations, without taking into account habitat properties). We exclude from our review those analyses that do not visualize the patterns of intra-specific variation and instead refer the reader to a number of excellent reviews of these topics (Manel et al. 2003; Kidd & Ritchie 2006; Carstens & Richards 2007; Knowles et al. 2007; Storfer et al. 2007; Holderegger & Wagner 2008; Kozak et al. 2008; Balkenhol et al. 2009a,b; Manel & Segelbacher 2009). Similarly, in a recent review of spatial statistical methods, Guillot et al. (2009) focused on isolation-by-distance and clustering methods, but did not discuss approaches that incorporate environmental data. The aim here is to fill this gap by focusing on the integration of spatial environmental information with the analysis of intra-specific variation in order to project and visualize patterns of variation across landscapes. While we will focus on the analysis of intra-specific genetic and phenotypic variation, it is important to note that these methods are often suitable for the analysis of species-level data, such as community composition, or alpha or beta diversity. For each of the methods discussed we: (i) provide a technical description; (ii) present an illustrative example; and (iii) critically assess each method’s advantages and limitations. Finally, in an effort to stimulate thought and spur further development of these powerful analytical approaches, we identify specific theoretical and practical challenges that confront researchers interested in projecting biological variation on heterogeneous landscapes.

Table 1. Approaches, utility, input data types, examples and associated software packages

Method | Utility | Data type | Examples | Software and web address
Canonical trend surface analysis | Modelling of biological variation across landscape | Spatial coordinates, environmental variables | Wartenberg (1985); Grivet et al. (2008); Sork et al. (2010) | Implemented in SAS; trend surface analysis implemented in SAM
Principal coordinates of neighbour matrices | Modelling of biological variation across landscape; purely spatial modelling step of which the results are used in subsequent (regression) analyses | Spatial coordinates | Borcard & Legendre (2002); Dray et al. (2006); Legendre et al. (2009); Ruggiero et al. (2009) | Implemented in SAM
Tree regression, random forest | Modelling of biological variation across landscape; relating environmental heterogeneity to biotic differences | Any type of location-specific measurement | Breiman et al. (1984); Breiman (2001a,b); Prasad et al. (2006); Cutler et al. (2007) | R-packages tree and randomForest available from the R site
GDM | Modelling of biological variation across landscape; relating environmental heterogeneity to biotic differences | Any type of dissimilarity matrix and spatial coordinates | Ferrier et al. (2007); Thomassen et al. (2010); Freedman et al. (2010) | ArcView 3.2 in conjunction with SPlus (no official release of the software); R-package GDM_R_ Distribution_Pack_V1.1, but not full utility; authors are working on a version with full utility

Simple regression methods

Technical description

One of the most straightforward methods to spatially predict biological variation based on environmental variables is by using classical, linear or non-linear regression methods or more sophisticated methods, such as logistic regression (used in, for example, Foll & Gaggiotti 2006). In these types of analyses, the regression equation that describes the relationship between the biological and environmental variables is used to assign each grid cell on the map an expected response value for the biological trait of interest based on the value of the relevant environmental predictor in that cell. The use of multiple-regression models allows for the inclusion of the effects of several different predictor variables.
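The mapping step described above can be illustrated with a minimal sketch. It assumes a single environmental predictor, an ordinary least-squares fit and a tiny raster; the function names, the rainfall values and the trait values are all hypothetical and serve only to show how a fitted equation is carried from sampled sites to grid cells.

```python
# Minimal sketch (hypothetical names and data): fit y = a + b*x at sampled
# sites, then assign each grid cell its expected response value.

def fit_simple_regression(x, y):
    """Ordinary least squares for one predictor; returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict_grid(env_grid, intercept, slope):
    """Apply the regression equation to the predictor value of every cell."""
    return [[intercept + slope * cell for cell in row] for row in env_grid]


# Hypothetical sampled sites: annual rainfall (mm) and a site-level trait mean.
site_rainfall = [200.0, 400.0, 600.0, 800.0]
site_trait = [1.0, 2.0, 3.0, 4.0]
intercept, slope = fit_simple_regression(site_rainfall, site_trait)

# Hypothetical 2 x 3 rainfall raster covering the study area.
rainfall_grid = [[100.0, 300.0, 500.0],
                 [700.0, 900.0, 1000.0]]
trait_map = predict_grid(rainfall_grid, intercept, slope)
```

Note that some cells in this toy raster lie outside the sampled rainfall range, so their values are extrapolations and should be read with corresponding caution.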

In order for genetic data to be regressed against environmental variables, the data often need to be transformed to a measure that is informative at the site level. For instance, if the genetic data are available in the form of haplotypes, haplotype diversity could serve as the response variable in a regression framework. Similarly, genotypic data could be regressed as a frequency of a given genotype. Ordination methods are often used to transform genetic data into scores that can be used in regression. Ordination methods cluster sets of samples in multidimensional space, based on a set of associated characteristics along orthogonal axes. Genetic data can subsequently be summarized based on their scores along one of these axes. One of the best known ordination methods is principal components analysis (PCA). However, a recent study highlighted issues associated with the use of PCA on spatially autocorrelated data in a spatial context (Novembre & Stephens 2008; see below). An overview of commonly used ordination methods in genetics is provided by Jombart et al. (2009).
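As one illustration of turning genetic data into a site-level response, the sketch below computes haplotype diversity per sampling site using the common unbiased estimator h = n/(n − 1) × (1 − Σp²). The choice of estimator, the site names and the haplotype labels are assumptions made for the example; the text above does not prescribe a particular formula.

```python
from collections import Counter


def haplotype_diversity(haplotypes):
    """Haplotype diversity of one site: n/(n-1) * (1 - sum of squared frequencies)."""
    n = len(haplotypes)
    if n < 2:
        raise ValueError("need at least two sampled haplotypes per site")
    freqs = [count / n for count in Counter(haplotypes).values()]
    return (n / (n - 1)) * (1.0 - sum(p * p for p in freqs))


# Hypothetical haplotype calls for two sampling sites.
sites = {
    "site_1": ["H1", "H1", "H2", "H2"],
    "site_2": ["H1", "H1", "H1", "H1"],
}

# One value per site, ready to be regressed against environmental variables.
diversity = {name: haplotype_diversity(calls) for name, calls in sites.items()}
```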

In landscape-level studies, it is desirable to distinguish spatial structure in biological data unrelated to environmental variation from structure associated with environmental components. Spatial ordination techniques have been developed to separate the different spatial components of genetic variation along a set of independent axes. Canonical trend surface analysis (Wartenberg 1985) uses canonical correlation analysis to assess the relationship between predictor and response variables. The methodology is used to estimate a purely spatial (latitude, longitude and elevation) response surface across a landscape, but its results can also be used as input in subsequent analyses that include environmental variables. For instance, the residuals from two independent canonical trend surface analyses on morphological characters and environmental variables were used in a regression to explain morphological variation in honey bees by environment, independent of space (Diniz-Filho & Bini 1994). First, a canonical correlation analysis (CCA) is performed on the data. Canonical correlation analysis finds the linear combinations of the variables in each of two sets (here, predictors and responses) such that the correlation between the combinations is maximized. The so-called redundancy coefficients may be examined in order to assess the amount of variation that is explained by each set of correlated variables. This is important, because two sets of variables may be highly correlated, yet explain only a small proportion of the variation (Wartenberg 1985). The scores resulting from the CCA are subsequently used to compute the equation that describes the relationship between predictor and response variables. This equation is used in the final step to define the shape of the response surface across the entire landscape.
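The canonical correlation step can be written compactly. The notation below is an assumption introduced for illustration: X holds the spatial terms (for example, polynomial terms of the coordinates) at the sampled sites, Y holds the response variables, S denotes the corresponding sample covariance matrices, and x(s) is the vector of spatial terms at an arbitrary location s.

```latex
% First canonical correlation between spatial terms X and responses Y
\rho_1 \;=\; \max_{a,\,b}\ \operatorname{corr}(Xa,\ Yb)
       \;=\; \max_{a,\,b}\ \frac{a^{\top} S_{XY}\, b}
                                {\sqrt{a^{\top} S_{XX}\, a}\;\sqrt{b^{\top} S_{YY}\, b}}

% Response surface for the first canonical axis, evaluated at any location s
\hat{u}_1(s) \;=\; x(s)^{\top} a_1
```

Subsequent canonical axes are obtained in the same way, under the constraint that they are uncorrelated with the earlier ones.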

Since its development, several issues with canonical trend surface analysis have been identified (see below), and principal coordinates of neighbour matrices [PCNM; implemented in the program ‘Spatial Analysis in Macroecology’ (SAM; Rangel et al. 2006)] has been proposed as an alternative (Borcard & Legendre 2002; Dray et al. 2006). The PCNM base functions are computed by means of a principal coordinate analysis (PCoA) on a matrix of geographic distances among the sampling locations. This distance matrix is first truncated such that only the smaller values (i.e. distances between neighbours) are retained and the remaining values are replaced by an arbitrarily large number. The PCNM approach often results in a relatively large number of PCNM variables [for instance, between 28 and 176 in four examples in Borcard et al. (2004)] that describe the dominant spatial scales at which the variation is observed. Forward selection is then used to identify the significant PCNM variables, which can be used in subsequent regression analyses that also include environmental variables (Manel et al. 2010b).
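The truncation step can be sketched as follows. This covers only the construction of the truncated distance matrix; the subsequent PCoA (an eigen-decomposition of the double-centred matrix) is left to a numerical library. The function name, the planar coordinates and the threshold are hypothetical, and the replacement value of four times the threshold is an assumption standing in for the 'arbitrarily large number'.

```python
import math


def truncated_distance_matrix(coords, threshold, fill_factor=4.0):
    """Keep distances between neighbours; replace larger ones by a large constant.

    fill_factor * threshold is an assumed choice for the large constant.
    """
    n = len(coords)
    large = fill_factor * threshold
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(coords[i], coords[j])  # Euclidean distance
            value = d if d <= threshold else large
            matrix[i][j] = matrix[j][i] = value
    return matrix


# Hypothetical planar coordinates (e.g. km) of three sampling locations.
coords = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
truncated = truncated_distance_matrix(coords, threshold=1.5)
```

Here only the first two locations count as neighbours, so the other two pairwise distances are replaced by the large constant.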


In a study of the environmental factors related to the incidence of two kinds of infectious diseases in humans, trachoma and trichiasis, Schémann et al. (2007) employed multiple linear regression analysis on the untransformed predictor and response variables to define the relationship between disease prevalence and various environmental variables in Mali (Fig. 1). The authors found significant relationships between environmental predictors and disease prevalence. Their regression models explained a maximum of 19.6% of total observed variation and were used to map disease prevalence across the entire study area. Different environmental variables were important in the two diseases, resulting in geographic gradients running perpendicular to each other. These contrasting spatial patterns could be related to differences in disease ecology and natural history, as well as human population susceptibility and behaviour.

Figure 1.

 Predictive map of trachoma in Mali, based on multiple linear regression of trachoma prevalence with several environmental factors (reprinted from Schémann et al. 2007). The inferred model was as follows: trachoma prevalence = 14.23 + 3.64 (latitude) – 0.24 (longitude) + 0.04 (altitude) + 0.002 (rainfall) – 0.08 (humidity) – 1.25 (mean temperature). Grey scale indicates prevalence. Black dots indicate sampling sites.

Recently, Grivet et al. (2008) used canonical trend surface analysis to assess spatial genetic structure of the threatened California valley oak (Quercus lobata). The specific goal was to determine the potential impact of habitat destruction and climate change on the genetic composition of the species. The authors tested the relationship between geographic location and elevation as predictor variables and multilocus genetic data as response variables. While this is in effect a purely spatial analysis in three dimensions, elevation is likely to correlate with a number of environmental variables important in the ecology of the oak species. Separate analyses were carried out on microsatellite data and chloroplast DNA sequences. Because of the high number of variables resulting from microsatellite analyses (78), a PCA was carried out to reduce the number of variables to the first 15 principal components, which explained 48% of the variation. These 15 components were used as the response variable set in canonical trend surface analysis. Maps of scores for the first two canonical axes showed distinct patterns for the two types of genetic markers (Fig. 2). For both marker sets, significant spatial heterogeneity in genetic composition was suggested by the trend surface analyses. The authors detected both patterns of isolation-by-distance and distinct genetic clusters that could be related to cycles of glacial and interglacial periods, which caused contraction of populations into refugial areas, population divergence in allopatry and subsequent expansion to their current ranges. The resulting maps were used to identify distinct genetic groupings that could be lost without appropriate conservation action. This conclusion was examined in greater detail in a subsequent study using canonical trend surface analyses to identify associations between genetic structure and climate variables (Sork et al. 2010).

Figure 2.

 Results from canonical trend surface analysis on chloroplast genetic markers of California valley oak (Quercus lobata) using the population mean standardized canonical scores (reprinted from Grivet et al. 2008). Results are shown for the first canonical axis. Grey scale indicates classes of canonical scores. White dots indicate sampling sites.

Advantages and limitations

Simple regression methods can be useful in mapping biological variation, in large part because of their relative ease of implementation and interpretation. Additionally, models of varying complexity (with more or fewer predictor variables included) can be compared to one another using standard information criteria that penalize model complexity, such as Akaike’s information criterion (AIC) or the Bayesian information criterion (BIC), in an attempt to identify the smallest set of predictors that adequately describes variation in a response variable. However, an important caveat of this methodology is that it becomes increasingly difficult to include multiple environmental variables, each having a non-linear, and perhaps different, relationship with the response variable. Data need to be checked for normality and multicollinearity, and scale differences between predictor variables can drastically affect which variables appear important in regression models. In reality, the spatial heterogeneity of biological diversity is the result of a complex interaction between geographic distance and various environmental and ecological factors. Unless there is a tight relationship between a biological trait and a single environmental variable, spatial predictions of biological variation using classical regression methods are potentially compromised by a lack of predictive power, because a single function in multivariate space is often able to explain only a small amount of the total observed variation, as opposed to methods that fit a function to each individual variable.
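For least-squares regressions, the comparison of candidate models by information criteria can be sketched as below. The formulas are the standard Gaussian-error forms (up to an additive constant); the candidate models, their residual sums of squares and the sample size are hypothetical numbers chosen for illustration.

```python
import math


def aic_least_squares(rss, n, k):
    """AIC for a least-squares fit: n*ln(RSS/n) + 2k, with k estimated parameters."""
    return n * math.log(rss / n) + 2 * k


def bic_least_squares(rss, n, k):
    """BIC for a least-squares fit: n*ln(RSS/n) + k*ln(n)."""
    return n * math.log(rss / n) + k * math.log(n)


# Hypothetical candidate models fitted to the same n = 30 sites:
# name -> (residual sum of squares, number of estimated parameters).
n_sites = 30
candidates = {
    "rainfall": (12.0, 2),
    "rainfall + temperature": (11.5, 3),
}

aic = {name: aic_least_squares(rss, n_sites, k) for name, (rss, k) in candidates.items()}
best_model = min(aic, key=aic.get)  # lower AIC indicates the preferred model
```

In this toy comparison the extra predictor reduces the residual sum of squares only slightly, so the penalty for the additional parameter outweighs the gain in fit.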

Recently, Novembre & Stephens (2008) discussed potential caveats of using PCA in spatial analyses when data are spatially autocorrelated. They found that clines inferred from PCA applied to spatial data may actually be the result of artefacts of the methodology (Fig. 3). The mathematical artefacts tend to be more pronounced when the effect of isolation-by-distance is stronger, because these data have highly structured covariance matrices, with eigenvectors related to sinusoidal waves, producing similar waves in the maps generated from PCA-transformed genetic data. Moreover, such artefacts are not limited to genetic data, but apply to all data that are highly spatially autocorrelated (Novembre & Stephens 2008).

Figure 3.

 Results from spatial interpolation of the principal components of human genetic variation in Asia, Europe and Africa (right three columns, Menozzi et al. 1978; Cavalli-Sforza et al. 1994) in comparison to artefactual patterns resulting from principal component transformation of the genetic data (left two columns, Novembre & Stephens 2008). The first column shows the theoretical expected PC maps for a class of models in which genetic similarity decays with geographic distance. The second column shows PC maps for population genetic data simulated with no range expansions, but constant homogeneous migration rate, in a two-dimensional habitat (figure reprinted from Novembre & Stephens 2008).

With respect to methods that are used to include a purely spatial component along with environmental variables to explain variation among populations, canonical trend surface analysis is typically only useful for describing broad-scale relationships, because of identified limitations inherent to the method (Wartenberg 1985). One should be cautious in extrapolating trend surfaces to areas that have not been sampled, because data closer to the centre are fitted more accurately than those at the edges of the study area (Wartenberg 1985). The main disadvantage of trend surface analysis is its sensitivity to cross-correlations between predictor variables (Borcard & Legendre 2002; Dray et al. 2006). In addition, the user of canonical trend surface analysis needs to make an arbitrary choice about the degree of the polynomial function used to estimate the relationship between predictor and response variables (Dray et al. 2006). Because of these issues, Borcard & Legendre (2002) and Dray et al. (2006) proposed PCNM as an alternative to trend surface analysis. The utility of PCNM in intra-specific studies is currently not well understood, because to our knowledge, in a mapping context, this approach has only been applied to model species richness (e.g. Legendre et al. 2009) and abundance (Ruggiero et al. 2009).

Decision trees and random forests

Technical description

Although decision tree techniques and their associated iterative approaches to test for variable importance, such as random forests (Breiman 2001a,b), have had a long history of use in non-biological fields, they have not been widely applied in ecology and evolutionary biology. Nevertheless, decision trees and random forests are useful modelling techniques, because of their high accuracy and predictive power (Breiman 2001a; Cutler et al. 2007).

Decision tree models use binary recursive partitioning procedures to measure the amount of variation in a response variable explained by each predictor variable used in the model. No a priori assumptions are made about the relationship between predictor and response variables, allowing for the possibility of non-linear relationships with complex interactions and for different predictor variables to assume importance at different points in the range of a response variable. Decision trees can handle both categorical and continuous response variables, referred to as tree classifications and tree regressions respectively. The resulting model is presented as a bifurcating tree in which the nodes represent the predictor variables that split the response variable data set into two partitions such that the homogeneity within each partition is maximized (Fig. 4). Homogeneity is measured by the Gini index (Breiman et al. 1984) and splitting continues until further partitioning does not reduce this index. The length of the branches following each partition indicates the relative importance of the partitioning predictor variable, which is indicated on the nodes, along with its splitting value. Unlike typical regressions, these non-linear, non-parametric functions provide a robust means for assessing the relative importance of each variable included in explaining the observed variation (Breiman 2001b).
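A single partitioning step can be sketched as a search for the threshold on one predictor that minimizes the weighted Gini index of the two resulting partitions. The sketch uses class labels, for which the Gini index is defined; for continuous responses, implementations commonly substitute a variance-based criterion, which is not shown here. The function names and the data are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini index of a set of class labels (0 means perfectly homogeneous)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, weighted Gini) of the best binary split on one predictor.

    The threshold is None when no split improves on the unsplit data.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))
    for i in range(1, n):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue  # identical predictor values cannot be separated
        threshold = (low + high) / 2.0
        left = [lab for val, lab in pairs if val <= threshold]
        right = [lab for val, lab in pairs if val > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical sites: percent tree cover and a two-class response.
tree_cover = [20.0, 35.0, 55.0, 65.0, 80.0, 90.0]
response = ["low", "low", "low", "high", "high", "high"]
threshold, impurity = best_split(tree_cover, response)
```

A full tree repeats this search over all predictors within each partition until no further split reduces the index.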

Figure 4.

 Example of a regression tree. The nodes represent the predictor variables that split the response variable data set into two partitions such that the homogeneity within each partition is maximized (measured by the Gini index) and splitting continues until further partitioning does not reduce this index. The length of the branches following each partition indicates the relative importance of the partitioning predictor variable, which is indicated on the nodes, along with its splitting value. In this hypothetical example, ‘percent tree cover’ is the most important variable, splitting the dataset at 60%. The next most important variable is ‘surface moisture’, followed by ‘elevation’. Values at the terminal ends of the tree indicate the means of that partition of the response variable (for instance haplotype diversity).

Several approaches to assess model performance and variable importance exist, all of which are based on an iterative randomization procedure. For example, in ‘bagging’ (Breiman 1996) random subsamples are repeatedly drawn from the original dataset, and a tree regression is performed on each of them. The samples that are not included in a given random subsample, the so-called out-of-bag samples, are subsequently used to test the predictions of the model built from the corresponding bagged samples. In random split selection (Dietterich 2000), bifurcations are selected at random from a set of best splits, and Breiman (1999) describes a procedure to randomize the full original dataset. The most recently developed procedure, random forests, combines several of these approaches and in addition takes random subsets of predictor variables to construct a large number of trees (Breiman 2001b).
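The out-of-bag idea can be sketched as follows, assuming bootstrap resampling (drawing with replacement) for the random subsamples. To keep the example short, a trivial 'predict the mean' learner stands in for the tree regression; it is a placeholder, not a tree, and all names and data are hypothetical.

```python
import random


def bootstrap_indices(n, rng):
    """Draw n indices with replacement; return (in-bag, out-of-bag) index lists."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n) if i not in chosen]
    return in_bag, out_of_bag


def oob_mean_squared_error(x, y, fit, predict, n_bags=200, seed=1):
    """Fit one model per bag and score it on that bag's out-of-bag samples."""
    rng = random.Random(seed)
    squared_errors = []
    for _ in range(n_bags):
        in_bag, oob = bootstrap_indices(len(y), rng)
        model = fit([x[i] for i in in_bag], [y[i] for i in in_bag])
        squared_errors.extend((y[i] - predict(model, x[i])) ** 2 for i in oob)
    if not squared_errors:
        raise ValueError("no out-of-bag samples were produced")
    return sum(squared_errors) / len(squared_errors)


# Placeholder learner (NOT a tree): always predicts the in-bag mean response.
def mean_fit(xs, ys):
    return sum(ys) / len(ys)


def mean_predict(model, x_value):
    return model


# Hypothetical predictor and response values at eight sites.
x_sites = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0]
y_sites = [0.2, 0.3, 0.3, 0.5, 0.6, 0.6, 0.8, 0.9]
oob_mse = oob_mean_squared_error(x_sites, y_sites, mean_fit, mean_predict)
```

Replacing the placeholder learner with a tree regression, and additionally sampling predictor variables at random, would move this sketch in the direction of a random forest.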

In the application of tree regressions and random forests to spatial modelling of biological diversity, a tree and associated random forest are generated in the first step from the predictor and response variables. Using the splitting rules determined with the training dataset (whereby the same splitting values for each predictor variable are used and the relative importance of variables determines their order in the splitting procedure), a prediction can subsequently be made for the response variable across the entire landscape. Depending on the size of the study region and the resolution of the environmental data, predictions can be made for each grid cell or a number of random grid cells followed by spatial interpolation.
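Applying the splitting rules of a fitted tree to every grid cell can be sketched as below. The tree is hypothetical and only loosely follows Fig. 4: the 60% split on tree cover is taken from the figure caption, whereas every other threshold, the leaf values and the raster cells are invented for illustration.

```python
# Hypothetical fitted tree. Internal nodes hold a predictor and its splitting
# value; leaves hold the mean response of their partition.
TREE = {
    "variable": "tree_cover", "threshold": 60.0,
    "left": {"variable": "surface_moisture", "threshold": 0.3,
             "left": {"value": 0.20}, "right": {"value": 0.45}},
    "right": {"variable": "elevation", "threshold": 1200.0,
              "left": {"value": 0.60}, "right": {"value": 0.85}},
}


def predict_cell(node, cell):
    """Follow the splitting rules from the root to a leaf for one grid cell."""
    while "value" not in node:
        goes_left = cell[node["variable"]] <= node["threshold"]
        node = node["left"] if goes_left else node["right"]
    return node["value"]


def predict_map(tree, grid):
    """Predict the response for every cell of a raster of environmental values."""
    return [[predict_cell(tree, cell) for cell in row] for row in grid]


# Hypothetical 1 x 2 raster; each cell carries its environmental values.
grid = [[
    {"tree_cover": 40.0, "surface_moisture": 0.2, "elevation": 500.0},
    {"tree_cover": 80.0, "surface_moisture": 0.5, "elevation": 1500.0},
]]
predicted = predict_map(TREE, grid)
```

For a random forest, the same traversal would be repeated for every tree and the per-cell predictions averaged.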


Due to its relatively recent introduction to ecology (Cutler et al. 2007), there are few studies in molecular ecology that have utilized random forests and fewer still that have produced predictive maps. In one example, Prasad et al. (2006) used random forest to predict suitable habitat for 135 tree species under current and future climate conditions. The predictor variables included climate and remote sensing variables (such as land-use and topography), as well as detailed information on soil composition. The response variable was not a simple presence–absence classification, as is often used in distribution modelling, but was based on the number of individual trees and the relative area occupied in each of approximately 100 000 survey plots in the eastern USA. Indicators of model performance, as measured by kappa statistics—the level of agreement between two different maps—suggested that random forest models consistently and robustly predicted current habitat suitability. The models indicated the best environmental determinants of occurrence for each species and generally suggested a northward shift in suitable habitat under changing climate conditions (Fig. 5).

Figure 5.

 Results from random forest models of the abundance of loblolly pine in the eastern USA (reprinted from Prasad et al. 2006). (A) Observed pattern based on records from the Forest Inventory and Analysis (FIA). (B) Predicted abundance under current climate conditions using FIA records in a random forest model in conjunction with environmental variables. (C) Projection of the pattern of abundance on future climate conditions.

Recently, tropical biomass has been modelled and mapped at a large scale in Africa using random forest models (Baccini et al. 2008). Using a suite of satellite data from across multiple years, approximately 82% of the variation in above-ground biomass was explained by variables that included tree height and height of median energy (HOME). These models were then used to generate a predictive ‘first map’ of biomass at a continental scale. While predictions may vary in their accuracy depending on sample numbers and breadth of habitats sampled, this study demonstrates the ability of satellite data and flexible, non-parametric regression techniques to provide insight into the mechanisms that drive continent-wide variations in biological communities.

Although few examples are available that highlight the utility of decision trees and random forest analyses to questions regarding intra-specific variation, it is important to note that these types of analyses are generally applicable to intra-specific level morphological and genetic data. For instance, Murphy et al. (2010) used random forests to distinguish the effects of habitat properties at multiple spatial scales on population connectivity in toads (Bufo boreas) in Yellowstone National Park. While the authors did not provide predictive maps of connectivity among toad populations, their study exemplifies the potential for random forest analyses in landscape genetics.

Advantages and limitations

Tree regressions and random forests are powerful tools for assessing spatial heterogeneity, in particular for predictive purposes. In a comparison of random forests, decision trees, logistic regression and discriminant function analysis applied to three different datasets, random forests consistently outperformed the other methodologies as assessed by several different accuracy measures (Cutler et al. 2007). In addition, random forests predicted the abundance of a large set of tree species based on environmental variables better than did tree regressions, tree bagging or Multivariate Adaptive Regression Splines (Prasad et al. 2006). In a multi-model comparison, the random forest approach was among those least sensitive to spatial autocorrelation (Marmion et al. 2009). However, there are a number of issues concerning these methods that users should consider. First, models produced by decision trees and random forests are based on a bifurcation process and no regression lines are fitted. Thus, these models are not explicit, in that standard single functions for a model are not assigned. As a consequence, while these methods often produce highly accurate predictions of response variables, interpretation of the results may not be straightforward. A number of papers cover the technical description of these procedures and the theoretical contrast between these models and traditional regression techniques (Breiman 2001a,b). In addition, tree regression techniques require response variables that are meaningful at the point locality level, such as morphological measurements, prevalence levels or indices of diversity. The approach is therefore not suited for examining genetic differentiation between sites without first transforming the data by, for example, PCA. Finally, geographic distance can only be included by means of two separate vectors for latitude and longitude or the one-dimensional distance from a fixed point, which is problematic in assessing the effect of isolation-by-distance, for which pairwise distances are often more appropriate.

Generalized dissimilarity modelling

Technical description

Generalized dissimilarity modelling (GDM) is a recently developed matrix regression technique (Ferrier et al. 2007) that relates dissimilarities in predictor variables (e.g. climate and vegetation variables) to dissimilarities in response variables (e.g. genetic distances or morphological differentiation among populations). While GDM was originally developed to study species beta diversity, it can be used to study turnover in virtually any trait that can be represented as dissimilarities among sampling localities.

Two types of non-linearity are often encountered in ecological modelling of biological traits. First, dissimilarities, such as Fst values, are often scaled between 0–1. Thus, population divergence cannot extend beyond Fst = 1, even if habitat differentiation or geographic distance keep increasing. The relationship between habitat heterogeneity and population divergence is therefore non-linear. The second non-linearity encountered is that the rate of change in response variables along environmental gradients is often not constant. GDM allows for modelling such non-linear relationships by fitting a linear combination of I-spline basis functions to the environmental variables in a linear predictor η. A link function is subsequently used to define the relationship between response and η. In an iterative process, predictor variables are added to and removed from the model. The significance of a predictor variable is determined by computing the difference in deviance explained by a model with and a model without the variable concerned. This difference in deviance explained is then compared to a null-distribution of differences in deviance resulting from a large number of models for which the order of the sites has been permuted. Only the variables that significantly improve the model are retained. The response to the individual predictor variables that are selected by the algorithm can be assessed by examining the response curves that are provided as output from GDM (Fig. 6). The maximum height reached by the response curves is indicative of the relative importance of the variables retained in the model. In addition, the slopes of the response curves indicate the rate of change in the response variable along the environmental gradient concerned. For those interested in the details of the mathematics underlying GDM, Ferrier et al. (2007) provide an excellent overview.
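The first non-linearity, the ceiling on dissimilarities scaled between 0 and 1, can be illustrated with a short Python sketch. It assumes the link function 1 − exp(−η), which is the form usually described for GDM (Ferrier et al. 2007) but is not specified in this section; the function name and the example values of η are hypothetical.

```python
import math


def predicted_dissimilarity(eta: float) -> float:
    """Map a non-negative linear predictor eta to a dissimilarity in [0, 1).

    Assumption: the link is 1 - exp(-eta); this section does not state it.
    """
    if eta < 0:
        raise ValueError("eta is assumed to be non-negative")
    return 1.0 - math.exp(-eta)


# Hypothetical values of eta: the prediction rises quickly at first and then
# levels off below 1, however large eta becomes.
for eta in (0.0, 0.5, 1.0, 2.0, 5.0):
    print(f"eta = {eta:3.1f}  predicted dissimilarity = {predicted_dissimilarity(eta):.3f}")
```

Under this assumption, doubling the environmental difference between two sites never doubles a predicted dissimilarity that is already close to 1, which is the saturation behaviour described for Fst.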

Figure 6.

 Example of generalized dissimilarity modelling response curves. The height reached by each of the response curves provides an indication of the relative importance of the corresponding predictor variables. In this hypothetical example, ‘precipitation of the driest quarter’ is the most important variable, followed by ‘elevation’ and ‘geographic distance’.
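How such response curves are read can be sketched in a few lines of Python. The curves below are invented values chosen only to mimic the hypothetical example in Fig. 6, and the function and variable names are assumptions; the sketch ranks predictors by the maximum height of their curves and reports the interval of steepest change.

```python
def curve_summary(curve):
    """Return (maximum height, interval of steepest slope) for one curve.

    curve: list of (gradient value, partial response) pairs, sorted by
    gradient value.
    """
    height = max(y for _, y in curve)
    slopes = [
        (y1 - y0) / (x1 - x0)
        for (x0, y0), (x1, y1) in zip(curve, curve[1:])
    ]
    steepest = max(range(len(slopes)), key=lambda i: slopes[i])
    return height, (curve[steepest][0], curve[steepest + 1][0])


# Hypothetical response curves (not taken from any fitted model).
curves = {
    "precipitation of the driest quarter": [(0, 0.0), (50, 0.6), (100, 1.0), (150, 1.2)],
    "elevation": [(0, 0.0), (1000, 0.1), (2000, 0.5), (3000, 0.8)],
    "geographic distance": [(0, 0.0), (100, 0.15), (200, 0.25), (300, 0.3)],
}

# Rank predictors by the maximum height reached by their response curves.
ranking = sorted(curves, key=lambda name: curve_summary(curves[name])[0], reverse=True)
for name in ranking:
    height, interval = curve_summary(curves[name])
    print(f"{name}: height = {height}, steepest change between {interval}")
```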

The first step in GDM is to fit dissimilarities among sampling sites in the response variable to differences in the local environments. The result of this procedure is a function describing the relationship between predictor and response variables. As the environmental conditions are known for every location in the study area, the response variable can subsequently be projected across the entire area. Theoretically, the response can be predicted for each individual grid cell, but this is often unfeasible due to computational limitations. The software that implements GDM, therefore, randomly picks any number of points (the default is 5000) across the landscape and interpolates the predictive surface among those points. For visualization purposes, classes of similar responses are colour coded, where larger differences between any two localities on the map represent larger differences in the response variable.


Recently, GDM was applied to conservation biology in a study aimed at incorporating evolutionary processes in conservation prioritization (Thomassen et al. 2010). The authors used GDM to model intra-specific genetic and morphological variation in the wedge-billed woodcreeper (Glyphorynchus spirurus) in Ecuador.

Nei’s D genetic distances from amplified fragment length polymorphism (AFLP) markers and Euclidean distances for several morphological variables were used as measures of population differentiation. Generalized dissimilarity models showed that population divergence was strongly associated with environmental heterogeneity on both sides of the Andes (>60% of total observed variation explained; Fig. 7). The results suggest that environmental variation was likely important in shaping a large proportion of the observed intra-specific variation, whereas simple isolation-by-distance seemed to play only a minor role. Regions that comprise the highest rates of genetic and phenotypic change in wedge-billed woodcreepers in Ecuador occurred along steep elevation gradients in the Andes. These areas may be particularly important for conservation, because not only do they harbour high levels of phenotypic and genetic variation, but they also allow for migration along environmental gradients should climate conditions become unfavourable along part of the gradient.

Figure 7.

 Predicted patterns of morphological variation in the wedge-billed woodcreeper for separate generalized dissimilarity models of: (A) wing length; (B) bill depth; (C) tail length; and (D) tarsus length (reprinted from Thomassen et al. 2010). Grey indicates areas where the species is not present. Pairwise comparisons of colours between any two sites in the landscape indicate morphological differences, where large colour differences (see bars) represent large morphological differences. Blue (A) and red (B–D) dots indicate sampling localities.

Advantages and limitations

Dissimilarity approaches such as the one employed in GDM have two main advantages. First, they are particularly suitable for genetic data, for which pairwise comparisons of population genetic differentiation are often used. Because of issues discussed above with the use of PCA to summarize genetic data and because PCA does not take into account models of molecular evolution, pairwise comparisons that calculate population divergence based on assumptions specific to genetic data may be more informative. Any one of those pairwise genetic distances can be readily used in generalized dissimilarity models. The second advantage of the generalized dissimilarity approach is that a true measure of geographic distance (rather than a combination of two separate vectors for latitude and longitude) can be included to incorporate hypotheses regarding isolation-by-distance. In addition to geographic distance, it is also possible to include other measures of distance, such as least-cost-paths (Ray 2005) or resistance distances (McRae 2006), which are both based on a friction grid. However, if one of these alternative measures of distance is selected as an important variable in explaining the observed variation, it is difficult to visualize the predicted patterns on a map, because the calculation of least-cost-paths or resistance distances between a large number of localities is computationally intractable. Another advantage specific to GDM is its ability to model non-linear relationships between predictor and response variables.
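As an illustration of the second advantage, pairwise geographic distances can be computed directly from site coordinates instead of entering latitude and longitude as two separate vectors. The Python sketch below uses the haversine great-circle formula on a spherical Earth (mean radius 6371 km); the site names and coordinates are hypothetical.

```python
import math
from itertools import combinations

EARTH_RADIUS_KM = 6371.0  # mean radius; a spherical Earth is assumed


def great_circle_km(a, b):
    """Haversine distance in km between two (latitude, longitude) pairs in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    # min() guards against rounding slightly above 1 for antipodal points.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(h)))


def pairwise_distances(sites):
    """Return {(site_i, site_j): distance in km} for every pair of sites."""
    return {
        (i, j): great_circle_km(sites[i], sites[j])
        for i, j in combinations(sorted(sites), 2)
    }


# Hypothetical sampling localities (latitude, longitude in decimal degrees).
sites = {"site_A": (-0.5, -77.8), "site_B": (-1.1, -78.6), "site_C": (0.3, -79.2)}
for pair, km in pairwise_distances(sites).items():
    print(pair, round(km, 1))
```

The resulting pairwise values have the same site-by-site structure as a matrix of genetic distances, which is what allows geographic distance to enter a dissimilarity model as a single predictor.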

One disadvantage of the dissimilarity method is that the resulting maps are not always easy to interpret. This is particularly true when the approach is used to model morphological traits. A simple regression between morphology and habitat can provide information on the direction of their relationship, yet the dissimilarity method can only be used to assess whether a difference in habitat may result in a difference in morphology; the resulting model provides no information on the direction of that change. However, direct inspection and analysis of response variables can generally clarify this issue.

Another disadvantage is that there is no formal significance testing procedure implemented so far, forcing the user to interpret the meaning of the ‘percent variation explained’. For example, is a model that explains 50% of the total observed variation a good model? One way to assess significance in relation to the percent variation explained is to first randomize the dataset repeatedly, subsequently recompute dissimilarity models for each repetition and generate a null distribution of the variance explained at random for comparison to the original model.
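The randomization procedure can be sketched as follows. This is a simplified Python illustration rather than the GDM software itself: a squared Pearson correlation between two dissimilarity matrices stands in for the percent variation explained by a fitted generalized dissimilarity model, sites are permuted by reordering rows and columns of the response matrix together, and all data and function names are hypothetical.

```python
import math
import random


def upper_triangle(matrix):
    """Flatten the upper triangle (i < j) of a square matrix into a list."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def r_squared(x, y):
    """Squared Pearson correlation; a stand-in for 'variation explained'."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy * sxy / (sxx * syy)


def permute_sites(matrix, order):
    """Reorder rows and columns together, i.e. shuffle the site labels."""
    return [[matrix[i][j] for j in order] for i in order]


def permutation_test(response, predictor, n_perm=999, seed=1):
    """Return (observed statistic, permutation p-value)."""
    rng = random.Random(seed)
    x = upper_triangle(predictor)
    observed = r_squared(x, upper_triangle(response))
    order = list(range(len(response)))
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(order)
        null = r_squared(x, upper_triangle(permute_sites(response, order)))
        if null >= observed - 1e-12:  # tolerance for floating-point ties
            exceed += 1
    return observed, (exceed + 1) / (n_perm + 1)


# Hypothetical data: six sites along one environmental gradient.
env = [0.0, 1.0, 2.0, 3.5, 5.0, 7.0]
predictor = [[abs(a - b) for b in env] for a in env]
response = [[1 - math.exp(-0.3 * abs(a - b)) for b in env] for a in env]

observed, p_value = permutation_test(response, predictor)
print(f"variation explained = {observed:.2f}, permutation p = {p_value:.3f}")
```

In practice the statistic would be the deviance explained by the dissimilarity model refitted to each randomized dataset, which is considerably more expensive than the correlation used here.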

Finally, the current version of the software does not provide an indication of the uncertainty of predictions into areas that have not been sampled. This is of particular concern when the environmental conditions of those areas are outside the range spanned by the sampling sites and the model needs to extrapolate the response curves. Such extrapolation could potentially result in inaccurate predictions or at least lower levels of confidence in predicted variation in those areas. Despite its disadvantages, GDM is a promising addition to methodologies that focus on mapping biological variation continuously across the landscape, in particular because it is to our knowledge the only method that attempts to model population divergence in a non-linear fashion, simultaneously taking into account the effects of geographic distance and habitat heterogeneity. We believe the method is particularly useful for managers and reserve designers who are interested in mapping biological patterns and processes across the landscape (Thomassen et al. 2010).
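Until uncertainty estimates are available, a simple check is to flag the parts of the study area whose environmental conditions fall outside the range spanned by the sampling sites. The Python sketch below does this per variable; it is not a feature of the GDM software, and the variable names and values are hypothetical.

```python
def sampled_range(site_values):
    """Return {variable: (min, max)} across the sampling sites.

    site_values: list of dicts mapping variable name -> value at one site.
    """
    variables = site_values[0].keys()
    return {
        v: (min(s[v] for s in site_values), max(s[v] for s in site_values))
        for v in variables
    }


def extrapolated_variables(cell, ranges):
    """List the variables for which a grid cell lies outside the sampled range."""
    return [v for v, (lo, hi) in ranges.items() if not lo <= cell[v] <= hi]


# Hypothetical environmental values at three sampling sites.
sites = [
    {"elevation_m": 250, "dry_quarter_precip_mm": 180},
    {"elevation_m": 900, "dry_quarter_precip_mm": 320},
    {"elevation_m": 1800, "dry_quarter_precip_mm": 410},
]
ranges = sampled_range(sites)

# Hypothetical grid cells to be predicted.
cells = {
    "cell_1": {"elevation_m": 1200, "dry_quarter_precip_mm": 300},
    "cell_2": {"elevation_m": 2600, "dry_quarter_precip_mm": 150},
}
for name, cell in cells.items():
    outside = extrapolated_variables(cell, ranges)
    print(name, "extrapolated for:", outside if outside else "none")
```

Cells flagged in this way are those for which the response curves must be extrapolated, so predictions there deserve less confidence.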

Future challenges

Undoubtedly, with the availability of a growing number of spatially continuous environmental variables that capture the myriad habitat and climate properties, landscape-level modelling of biotic diversity has the potential to be useful for many fields. Despite advances in modelling, there are a number of challenges that need to be overcome in order to make full use of these methods. A multitude of software programs are now available that focus on specific aspects of spatial analyses, yet there exists no single platform that integrates both basic and sophisticated approaches. In the past, spatially explicit predictive modelling of biological variation has been used without a clear sense of the workflow, data requirements and suitability of the existing methods for the types of questions asked. Perhaps these factors contribute to a relative lack of hypothesis-driven approaches, as opposed to attempts to find and describe correlations between biotic and environmental heterogeneity and then interpret them a posteriori.

Sampling design for spatial analyses

As researchers become increasingly interested in testing hypotheses regarding the influence of environmental variation on patterns of biological variation and projecting these associations across landscapes, careful consideration of sampling regimes will be important for separating the effects of distinct, but co-varying environmental factors. In particular, four issues are of primary importance: (i) spatial autocorrelation of environmental variables; (ii) the spatial and temporal scale of environmental data; (iii) adequate sampling of environmental parameter space; and (iv) accurate georeferencing. The design of efficient sampling regimes will be specific to the taxa and questions under study and Anderson et al. (2010) provide an excellent overview of these general issues.

Model performance and requirements

There is a general need for understanding the accuracy of the available modelling methods, specifically the minimum data requirements, the sensitivity to differences in sample sizes and how a model performs when the range of environmental space spanned by the predicted areas exceeds the range spanned by sampling locations. Analyses of data requirements using simulated datasets are essential for a better understanding of these issues. At the same time, there is a need for more rigorous statistical significance testing of the associations between predictor and response variables, such as the computation of confidence intervals, mapping of uncertainty levels and the use of information criteria to differentiate between alternative hypotheses regarding the influence of predictor variables and to avoid overfitting.

Integrative approaches to test alternative hypotheses

A further challenge is to develop a better integration of the effects of isolation-by-distance, landscape connectivity and environmental variables into a single analysis. Cushman et al. (2006) nicely demonstrated the importance of including landscape elements in spatial analyses by showing that even though models for isolation-by-distance were statistically significant, the addition of landscape effects much improved their performance. Although this study takes important steps towards disentangling the effects of spatial and environmental predictor variables, the authors used several variables as covariates, removing their respective effects in subsequent analyses, which may also remove the effects of other, correlated, predictor variables. Only by performing simultaneous analyses of spatial and environmental predictor variables can one assess their relative importance in population divergence. So far, however, most studies that test hypotheses regarding geographic distance and environmental variables do so in separate analyses. Whereas a number of other methodologies can include longitude and latitude as separate vectors, these are often inaccurate representations of geographic distance. To our knowledge, the only method currently available that is capable of including distance and environment simultaneously is GDM. Because the environment is spatially heterogeneous and not all areas in the landscape require equal dispersal effort, geographic distance is not necessarily an accurate estimate of the level of population connectivity. This fact has been acknowledged by the development of least-cost-paths (Ray 2005) and resistance distances (McRae 2006) that take into account the various levels of conductivity imposed by heterogeneous habitats on dispersal abilities (see also Guillot et al. 2009; Spear et al. 2010). 
Although it is possible to include these types of distance in methods such as GDM and assess their relative roles in explaining the observed variation, predictive maps cannot currently be generated based on these distances, because of the computational limitations of computing least-cost-path and resistance distances among a large number of localities. The ability to create such predictive maps could considerably advance our ability to assess the influence of landscape elements on population connectivity.

Further development of spatially explicit genetic tools

An important challenge facing molecular ecologists interested in applying spatial analyses to their work is that many of these methods were not originally developed for genetic data. On the one hand, current methodologies specific to genetic data exist that take into account the spatial configuration of sampling localities in population assignments. On the other, there are methodologies that integrate transformed genetic data with geographic and environmental data, such as GDM. Genetic characterizations in many of these methods are reliant on summary statistics of genetic distances, such as Nei’s D and Fst. The reality is that the relationship between genetic diversity and landscape elements is highly dynamic and complex, which requires tracking population divergence, connectivity and migration at the landscape-genetic level. Although distance-based metrics can provide important insight into the relationships between genetic differentiation and environmental variables, they fail to take into account the demographic and genealogical histories of alleles within populations which can confound inferences of population genetic parameters, such as gene flow and effective population sizes or the fitness values of particular alleles, that are often of interest to molecular ecologists (Kingman 1982; Rosenberg & Nordborg 2002). In fact, Dyer et al. (2010) showed the effects of phylogeographic history on population genetic differentiation and removed these effects from subsequent landscape genetic analyses. The development of new spatial methods that readily incorporate coalescent-based inference of population genetic parameters would be a useful addition to current modelling efforts. Such analytical techniques may allow for more robust analysis of the environmental factors influencing historical and contemporary population declines and gene flow patterns and be more effective in placing them in an adaptive context. 
Similarly, it is well-known that a coalescent framework has powerful statistical properties for summarizing patterns of divergence among multiple unlinked loci (Wakeley 2008). Such a framework may prove useful in developing analyses to summarize spatial patterns of population genetic divergence in multi-locus datasets, analogous to efforts to reconcile gene trees and species trees (Degnan & Rosenberg 2009). Within this spatially explicit framework, methods are also needed that identify outlier loci that are potentially under selection. Finally, coalescent-based approaches are a powerful means to simulate the evolution of molecular data, which can be used to assess the probability that a given historical scenario produced the extant spatial patterns of genetic diversity (e.g. Knowles & Alvarado-Serrano 2010). Analogous forward simulations based on classical population genetics would be particularly powerful for predicting the influence of future environmental change on patterns of genetic diversity, because they would allow for the inclusion of stochastic processes that influence patterns of population genetic diversity and structure. These stochastic processes are largely independent of the environmental changes of interest, but they can have profound effects on population genetic patterns and cannot be ignored if the goal is to predict future patterns of genetic diversity. As with many forward simulations in population genetics, these analyses are likely to be extremely computationally intensive and would require investment in dedicated bioinformatics specialists for successful implementation. In summary, spatially explicit modelling of biological variation based on environmental variables would benefit from the power of simulation studies. For further information, we refer to Epperson et al. (2010), who provide a detailed review of computer simulations of population genetics, demography and environmental stochasticity in a spatially explicit context.

Handling large genomic datasets

An additional and obvious challenge is the expansion of our analytical toolkit to deal with population genomic datasets that are generated using next-generation sequencing technologies (an issue that is discussed in detail by Manel et al. 2010a). While it is becoming increasingly feasible to genotype or sequence multiple individuals at thousands of loci, computational limitations to the processing of such large data sets will likely hinder the application of many techniques currently used to model relationships between spatial genetic variation and environmental heterogeneity. Together with similar restrictions imposed on other disciplines, this should spur advances in algorithm efficiency and parallel computing (e.g. Suchard & Rambaut 2009). Despite the computational challenges, the genome-wide perspective offers great promise for spatial analysis of gene-environment relationships. For example, it will expand the current focus on inter-population differentiation with respect to intra-locus diversity so as to include spatial variation in linkage disequilibrium, chromosomal inversion frequencies, copy number variation and other genome-wide structural features.

Spatial modelling of adaptive genetic variation

Neutral and adaptive genetic variation distributed across space are net manifestations of different, if overlapping, sets of evolutionary processes (e.g. even though both are shaped by demographic history, its relative importance differs between them). Thus, it is important not to confound these two genetic partitions or the meaning of their respective associations with environmental variables when making inferences in a landscape genetics framework (Holderegger et al. 2006). Genome scans that distinguish loci under selection from background patterns of neutral genetic variation (e.g. Beaumont & Balding 2004; Excoffier et al. 2009) are elegantly revealing the multiple ways that natural selection influences the origin and maintenance of genetic variation (Nielsen 2005; Sabeti et al. 2006). Contrasting the environmental signals in neutral loci vs. those under selection should thus be a powerful way of testing hypotheses concerning the evolutionary processes that operate across environmentally heterogeneous space. For instance, for an African rainforest lizard, Freedman et al. (2010) constructed generalized dissimilarity models for pooled neutral AFLP markers and separate GDMs of individual loci bearing a signature of natural selection. Combining GDMs with ecological niche models of species distribution change since the Last Glacial Maximum and demographic reconstruction from sequence data, they tested alternative diversification hypotheses, finding more support for ecologically-mediated divergence along the rainforest-savanna gradient than for divergence via isolation in rainforest refugia. In addition, Eckert et al. (2010) used Bayesian generalized linear mixed models to remove demographic effects among a set of single nucleotide polymorphisms (SNPs) typed from loblolly pine and identified SNPs that were associated with environmental variation and are potentially under selection. 
With the exception of these studies, novel analyses of genomic data have focused more on the discovery of candidate loci under selection and their environmental associations in a non-spatially explicit context, rather than on generating predictive spatial maps of the variation resulting from selective processes. For example, Hancock et al. (2008) investigated how adaptation to climate influences spatial variation in allele frequencies for candidate genes playing a role in metabolic disorders in humans. The authors used a newly developed Bayesian statistical framework to relate allele frequencies to a set of environmental variables, while accounting for differences in sample sizes, as well as neutral covariance of allele frequencies among populations. Significant associations with environmental variables were found for loci linked to metabolic disorders, but not for control SNPs.

Overall, we suggest that studies of adaptive evolution at large spatial scales will benefit from modelling the spatial and environmental correlates of genetic variation. The development of spatial analytical methods that can incorporate functional molecular genetic information, such as gene ontology categories and molecular pathways or known coding and regulatory elements, will allow for comparisons of patterns of differentiation among genes that operate within interacting chemical pathways. In general, contrasting model predictions for adaptive and neutral loci will help determine whether different processes are responsible for the spatial patterns observed in these two types of loci, each of which may experience vastly different evolutionary pressures.


Spatially explicit models of biological variation across the landscape are increasingly being used and have great potential for addressing fundamental and applied questions in ecology and evolution. Recent advances in modelling techniques and the enhanced access to climate and high-resolution remote-sensing variables have widely benefited modelling approaches. To overcome the challenges we have described here, it is important for researchers to start considering clear workflows for spatially explicit analyses in their studies, including a sound sampling design that takes into consideration the input needs and performance capabilities of spatial analysis techniques and the development of new integrative modelling approaches. To move the field forward, it will be necessary to develop a better integration of tests for alternative hypotheses regarding the effects of distance, barriers and habitats. Specific to molecular data, there is a need for methods that not only include the spatial structure of sampling localities, but also landscape elements, population connectivity, demographic features and varying forces influencing marker types. Such methods should preferably be embedded in a coalescence-based framework aimed at the analysis of the large amounts of data that are becoming available through next generation sequencing technologies.

Box 1 Sources of environmental data

Recent developments in the availability and accessibility of spatially explicit, continuous surfaces of environmental data have boosted their use in a variety of ecological research topics. Below we will list the most important and often used environmental datasets.

Ground-based interpolated climate data

Global climate data is available free of charge from the WorldClim group (Hijmans et al. 2005). This dataset is based on a 30- to 50-year climatology (1960–1990 for areas with high-quality data and 1950–2000 for areas where data were spurious or missing), computed from ground stations. The ground-station measurements used to generate the WorldClim dataset are maintained by the Global Historical Climatology Network, the Food and Agriculture Organization of the United Nations, the World Meteorological Organization, the International Center for Tropical Agriculture, R-HydroNET and a number of smaller, regional organizations. The dataset comprises 11 temperature and eight precipitation variables such as annual means, minima, maxima and seasonality. These variables are available at a resolution of 1 km, but the true resolution depends on the density of the regional ground station network. The WorldClim group also provides palaeoclimate reconstructions (approximately 6000, 21 000 and 130 000 years bp) and future predictions for 2020, 2050 and 2080 under various emission scenarios. The WorldClim data are ready for use in spatial analyses with minimal processing and manipulation.

Remotely-sensed climate, topography, and vegetation data

An ever increasing set of satellite-borne and air-borne remotely sensed variables and data products is available either commercially or freely at various native resolutions, ranging from 15 m to approximately 2.5 km (Turner et al. 2003).

Remotely sensed rainfall data is available for free through the Tropical Rainfall Measuring Mission (TRMM) at a 3-day, 0.25° resolution for a band spanning 50° N to 50° S, and a temporal resolution of about 7 h. TRMM data are particularly useful over areas with low density of weather stations on the ground. However, because TRMM data are provided as 3-day accumulations, substantial processing of these data is required for most applications in spatial ecology.

The QuickScat product (coverage: 1999-present in 3-day intervals and 2.25 km resolution) is a measure of vegetation biomass and surface moisture (in areas with low-density vegetation cover) or canopy roughness (mainly in tropical forest systems) and is sensitive to seasonal changes, such as vegetation deciduousness (Frolking et al. 2006). Because this is a radar instrument, it is insensitive to cloud cover. QuickScat data are available for download, but they in particular need to be checked for potential errors and processed before further use in spatial analyses.

Elevation data from the Shuttle Radar Topography Mission (SRTM) at 30 m, 90 m or 1 km horizontal resolution and a dataset that identifies different classes of landcover across the globe are available from the Global Landcover Facility; the SRTM data are also downloadable from NASA/JPL. In addition, SRTM elevation data is available from other sources at 90 m and 250 m resolution, and at 1 km resolution from WorldClim. All elevation data is available free of charge.

Several environmental layers can be extracted from images acquired by the moderate resolution imaging spectroradiometer (MODIS) aboard two separate satellites orbiting Earth. These images have a native resolution of 500 m–1 km and cover the entire Earth’s surface every 2–3 days. Examples of these layers include surface temperature and emissivity (coverage: 2000), percent tree cover (coverage: 2001), leaf area index (LAI; available for 2000-present as monthly composites of measurements at 8-day intervals), vegetation continuous fields (VCF; coverage: 2001–2005) and normalized difference vegetation index (NDVI; coverage: 2001–2006 in 16-day intervals, see above), which capture properties such as forest cover and greenness (photosynthetic activity). These MODIS layers are available from a variety of online sources. For example, several free-of-charge land products (temperature and emissivity, albedo, land cover type) are available from the Land Processes Distributed Active Archive Center (LP DAAC), with others (tree cover, VCF, and NDVI) provided by either the Global Landcover Facility or university-based research groups (LAI). NDVI data covering irregular 10-day intervals from 1992–1996 are available from the advanced very high resolution radiometer. In addition, a comprehensive 22-year long (1982–2002) monthly NDVI dataset at quarter degree, half degree or one degree resolutions is available from the global inventory modelling and mapping studies (GIMMS) in the International Satellite Land-Surface Climatology Project (ISLSCP) Initiative II archive. Most of these NDVI data need to be processed according to the study needs (e.g. maxima, minima, means, phenology) before use in spatial analyses. 
NDVI is a vegetation index that correlates well with photosynthetic activity, is useful for calculating productivity and can be computed from the raw data from a variety of instruments as the normalized difference in surface reflectances at near-infrared (NIR) and red wavelengths. It can be derived by first atmospherically correcting the red and infrared bands using the software ATCOR 2 (Atmospheric Correction for Flat Terrain, ReSe Applications Schläpfer) with the default coefficients, in order to obtain accurate surface reflectance values, and then applying the equation:

$$\mathrm{NDVI} = \frac{\rho_{\mathrm{NIR}} - \rho_{\mathrm{red}}}{\rho_{\mathrm{NIR}} + \rho_{\mathrm{red}}}$$

where $\rho_{\mathrm{NIR}}$ and $\rho_{\mathrm{red}}$ are the surface reflectances in the near-infrared and red bands, respectively.
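The normalized difference of near-infrared and red reflectances, and the kind of per-pixel summary (maxima, minima, means) that is typically derived from a series of NDVI composites, can be sketched in Python as follows. The sketch assumes that atmospherically corrected surface reflectances are already available; the function names and reflectance values are hypothetical.

```python
from statistics import mean


def ndvi(nir, red):
    """Normalized difference of NIR and red surface reflectance.

    Returns None when both reflectances are zero (e.g. a masked pixel),
    because the index is undefined there.
    """
    total = nir + red
    if total == 0:
        return None
    return (nir - red) / total


def summarize_series(values):
    """Minimum, maximum and mean of the valid NDVI values for one pixel."""
    valid = [v for v in values if v is not None]
    if not valid:
        return None
    return {"min": min(valid), "max": max(valid), "mean": mean(valid)}


# Hypothetical (NIR, red) reflectance pairs for one pixel across four composites;
# the last pair represents a masked observation.
composites = [(0.45, 0.08), (0.50, 0.10), (0.30, 0.12), (0.0, 0.0)]
series = [ndvi(nir, red) for nir, red in composites]
print(summarize_series(series))
```

Returning None for masked observations keeps them out of the summary statistics instead of biasing the mean towards zero.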


LAI, VCF, and NDVI are sensitive to cloud contamination, and may not always be useful for all regions and one should be particularly cautious using these data over rainforest areas.

Advanced spaceborne thermal emission and reflection radiometer (ASTER) imagery is available for a fee at 15 m–90 m resolution and captures data in 14 bands, ranging from visible to thermal infrared. Depending on the research questions (e.g. vegetation characteristics or phenology), these data may need to be processed before they are usable for any spatial analysis. For instance, high-resolution NDVI can be computed from these data as described above. In addition to the 14 bands mentioned, ASTER-based elevation data (digital elevation model; DEM) is available free of charge at 30 m horizontal resolution.

Raw data from the phased array type L-band synthetic aperture radar (PALSAR) aboard the advanced land observing satellite (ALOS) can be used to estimate, for instance, elevation, biomass and soil moisture. These data have very high spatial resolutions (10 m) and are available for a fee. However, because these are raw data, substantial processing is required to create useful data layers for landscape genetic analyses.

In addition to vegetation cover, data on vegetation structure is increasingly becoming available through the use of very-high-resolution light detection and ranging (LIDAR) instruments. First, the Laser Vegetation Imaging Sensor (LVIS) provides detailed 3D information on elevation, biomass and vegetation structure. However, because these data are based on an airborne instrument, the data are available for only a limited set of regions. The earliest missions were flown in 1998 and new flights continue to be scheduled. As a derived product from LVIS, ICEsat/GLAS, ALOS/PALSAR and other sensors, three-dimensional canopy height maps with near-global coverage are also available. These data provide a much better spatial coverage, but at lower resolutions. The geoscience laser altimeter system (GLAS) aboard the Ice, Cloud and land Elevation satellite (ICEsat) is another often used lidar instrument. However, these data require substantial processing before they can be used in spatial analyses. Examples of products based on ICEsat/GLAS data are forest canopy height (Lefsky et al. 2005) and biomass (Lefsky et al. 2005; Baccini et al. 2008).

Finally, information about the environment does not need to be limited to direct measurements of habitat characteristics. Useful information for conservation biology and infectious disease research includes, for example, human population density, road density or economic variables. The availability of these types of data is highly dependent upon the location of the study region. For many developed countries, detailed socio-economic information is available, for instance for the US from the US Census Bureau.


This research was partially funded by grants to TBS from the joint NSF-NIH Ecology of Infectious Diseases Program award EF-0430146, the National Institute of Allergy and Infectious Diseases award EID-1R01AI074059-01, NSF (IRCEB9977072), EPA (#FP-61669701) and NASA (IDS/03-0169-0347). We thank L.P. Waits and two anonymous reviewers for comments on a previous version of this manuscript.

Henri Thomassen is interested in the evolutionary processes generating and maintaining biodiversity. Zachary Cheviron studies population genetics, ecological genomics, and evolutionary physiology. Adam Freedman is interested in spatial population genetics, the interplay between natural selection and demography, and speciation. Ryan Harrigan’s main research interests include the evolutionary biology and ecology of species complexes. Robert Wayne applies molecular genetic techniques to study questions in ecology and evolutionary biology. Thomas Smith is interested in evolutionary genetics, speciation and the conservation of tropical vertebrates.