Correlation and process in species distribution models: bridging a dichotomy
Carsten F. Dormann,
Helmholtz Centre for Environmental Research – UFZ, Department Computational Landscape Ecology, 04318 Leipzig, Germany
Biometry and Environmental System Analysis, Faculty of Forest and Environmental Sciences, University of Freiburg, D-79106 Freiburg, Germany
Carsten Dormann, Biometry and Environmental System Analysis, Faculty of Forest and Environmental Sciences, University of Freiburg, Tennenbacherstrasse 4 D-79106 Freiburg, Germany. E-mail: firstname.lastname@example.org
Within the field of species distribution modelling an apparent dichotomy exists between process-based and correlative approaches, where the processes are explicit in the former and implicit in the latter. However, these intuitive distinctions can become blurred when comparing species distribution modelling approaches in more detail. In this review article, we contrast the extremes of the correlative–process spectrum of species distribution models with respect to core assumptions, model building and selection strategies, validation, uncertainties, common errors and the questions they are most suited to answer. The extremes of such approaches differ clearly in many aspects, such as model building approaches, parameter estimation strategies and transferability. However, they also share strengths and weaknesses. We show that claims of one approach being intrinsically superior to the other are misguided and that they ignore the process–correlation continuum as well as the domains of questions that each approach is addressing. Nonetheless, the application of process-based approaches to species distribution modelling lags far behind more correlative (process-implicit) methods and more research is required to explore their potential benefits. Critical issues for the employment of species distribution modelling approaches are given, together with a guideline for appropriate usage. We close with challenges for future development of process-explicit species distribution models and how they may complement current approaches to study species distributions.
The most commonly used approaches to describe distributions of species and biodiversity are known as correlative (syn. phenomenological) species distribution models (Elith & Leathwick, 2009). These methods aim to describe the patterns, not the mechanisms, in the association between species occurrences and environmental data (mainly climatic data). They have provided useful insights for conservation of biodiversity and ecological understanding of large-scale patterns. However, predictions based on such correlative models are usually limited in their biological realism and their transferability to novel environments (Loehle & Leblanc, 1996; Davis et al., 1998; Vaughan & Ormerod, 2003).
Process-based distribution models (here used synonymously with mechanistic models) can address these deficits by explicitly including processes omitted from the correlative approach (Kearney & Porter, 2009). However, process-based models often demand a large number of parameters to be estimated, many requiring data of limited availability at often high spatio-temporal resolution. Thus, to date, such models have been used for far fewer species than have correlative models.
Here we will show that current approaches to modelling species distributions represent a continuum with respect to the explicit inclusion of processes. The aim of this paper is to compare correlation and process in species distribution modelling, thereby exposing strengths and weaknesses, and differences and similarities, between these approaches. From this comparison we identify some key challenges for species distribution modelling and indicate promising avenues of integration.
Correlative species distribution models statistically relate environmental variables directly to species occurrence or abundance. In contrast, process-based models formulate the ecology of a species as mathematical functions in a reductionist sense, defining causality; the species’ occurrence or abundance is an indirect, emergent consequence. These functions are also often empirical correlations, not related to the species’ occurrence or abundance, but to the species’ functional traits (morphology, behaviour and physiology) and associated life history (development, growth, reproduction).
Some process-based models are developed entirely ‘forward’, i.e. without any calibration of the model (Kleidon & Mooney, 2000; Morin et al., 2007), while correlative models are necessarily data-driven. However, correlative models employ explanatory variables that are expected to represent causal mechanisms (Austin, 2002). Furthermore, many process-based models also use distributional data to evaluate model structure or to calibrate and fine-tune some unmeasurable parameters. The common perception that process-based models are generally more complex is not true either, as machine learning-based correlative models are usually of high complexity (Elith et al., 2006) while process-based models can be structurally simple (e.g. Kleidon & Mooney, 2000). We hence propose the following criteria and definitions.
In correlative models, parameters have no a priori defined ecological meaning and processes are implicit. In contrast, process-based models are built around explicitly stated mechanisms and parameters have a clear ecological interpretation that is defined a priori. Functional relationships in process-based models are specified as causal: x affects y. This is not the case in correlative models, although their post hoc interpretation is usually (and sometimes erroneously) causal.
This definition allows us to differentiate models that are described as process models (e.g. Heisey et al., 2010), but are not always seen as such (Hodges, 2010; Lele, 2010), from models that are explicitly process-based. While some models can clearly be placed at the extreme ends of the correlation–process continuum, most models will fall somewhere in between, depending on the extent to which they represent processes explicitly (Fig. 1). For example, by adding dispersal to the results of a correlative projection, hybrid models can be constructed (see Appendix S1 in Supporting Information for examples). There is, as yet, no consensus on what defines hybrid models, as opposed to integrated models. Here, we use the term ‘hybrid model’ to refer to the sequential application of different models (e.g. dispersal after correlation, Thuiller et al., 2006; process-derived explanatory variables subsequently used in correlative models, Rickebusch et al., 2008). As a subset of hybrid models, ‘integrated models’ refer to models where both modelling strategies are fitted simultaneously to data (e.g. demography within suitable habitats, Pagel & Schurr, 2011).
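To make the sequential character of a hybrid model concrete, the following minimal sketch applies a simple dispersal rule to the output of a correlative model. It is an illustration only: the one-dimensional landscape, the function name `spread` and the parameter `max_distance` are assumptions introduced here, and the suitability values stand in for the thresholded projection of any correlative model.

```python
def spread(occupied, suitable, steps, max_distance=1):
    """Spread occupancy across a 1-D row of cells (hypothetical helper).

    occupied     -- indices of cells occupied at the start
    suitable     -- list of booleans, e.g. a thresholded correlative projection
    steps        -- number of dispersal events (e.g. generations)
    max_distance -- furthest cell a disperser can reach in one step
    """
    current = set(occupied)
    for _ in range(steps):
        reachable = set(current)
        for cell in current:
            for offset in range(-max_distance, max_distance + 1):
                target = cell + offset
                # Colonize only cells that exist and that the correlative
                # model classified as suitable.
                if 0 <= target < len(suitable) and suitable[target]:
                    reachable.add(target)
        current = reachable
    return current


# Assumed projection: cell 3 is unsuitable and acts as a dispersal barrier.
suitable = [True, True, True, False, True, True]

short_range = spread({0}, suitable, steps=5, max_distance=1)
long_range = spread({0}, suitable, steps=5, max_distance=2)
print(sorted(short_range))
print(sorted(long_range))
```

With short-range dispersal the species stays behind the barrier although suitable cells exist beyond it, which a static correlative projection alone would not show.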
Process-based models also differ in the degree of calibration. We distinguish between ‘forward’ process-based models, where no parameter is fitted to the data to be explained, and fitted (statistically calibrated or manually tuned) process-based models, where some parameters are adjusted to match at least a subset of the data to be predicted. Examples of the former include PHENOFIT (Chuine, 2000; Chuine & Beaubien, 2001), most individual-based models (Grimm & Railsback, 2005) and the Jena diversity (JeDi) model developed by Kleidon & Mooney (2000). The latter case is more common, as unknown parameters are always easier to fit to data than to estimate independently. Examples in the context of the distribution of species functional types include CLIMEX (Sutherst & Maywald, 1985), LPJ (Sitch et al., 2003), LPJ-GUESS (Smith et al., 2001) and ORCHIDEE (Krinner et al., 2005). In the extreme, a process-based model may be completely parameterized by distribution data (e.g. Van Oijen et al., 2005; Arhonditsis et al., 2007; also Hartig et al., unpublished).
Model domain, (micro-)evolution, constancy of limiting factors and interactions
Transferability to other species, sites and times | Functional types, correlation structure
When to stop: accuracy versus complexity | Deployment time, re-parameterization, sensitivity analysis
Communicability/transparency of the model | Documentation, open source code/software
Knowledge potentially gleaned from the model
Common errors and misuses | Lack of uncertainty analysis, use beyond purpose, overconfidence in communication
Purposes of species distribution models
An obvious but sometimes forgotten point is that the usefulness of a model must be assessed with respect to its purpose. Species distribution models are used to ask a wide range of questions that can be broadly categorized as seeking understanding or seeking prediction (see also Perry & Millington, 2008). For example, understanding what limits the distribution and abundance of species is a classical question in ecology and evolution as well as in conservation and biosecurity. In very few cases can we claim to understand distributional constraints, and species distribution models are thus valuable for generating hypotheses that may be tested experimentally (e.g. Angert & Schemske, 2005; Doak & Morris, 2010). Process-based models can also be used to falsify hypotheses by formulating a hypothesis as a model and comparing it formally with data (e.g. Morin et al., 2007).
Identification of the environmental factors influencing distributional limits is also useful in conservation and management. Addressing these questions using correlative models may involve interpolation within the environmental domain used to develop the model, and extrapolation beyond this domain. Correlative models often provide a single, static prediction of a species distribution in these human-driven environmental scenarios. In contrast, process-based (and hybrid) models are often used to predict dynamic features of species distributions, such as invasion rate (Kearney et al., 2009a), succession and the influence of disturbance, land use and management measures on species persistence (Schumacher & Bugmann, 2006; Jeltsch et al., 2011). In general, correlative models are essentially static and without access to the description of non-equilibrium, periodic, chaotic or alternative stable states (but see claims by De Marco et al., 2008).
Thirteen features common to or different between correlative and process-based models
Both purely correlative models and fitted process-based models have the same statistical analysis assumptions: error structure assumptions (such as independence of data, homogeneity and stationarity of variance), homogeneity of sampling effort and constant observation error. Both approaches can be adjusted to accommodate violations of these assumptions (e.g. spatial autocorrelation, Dormann et al., 2007; non-stationarity, Hothorn et al., 2011; detection probability and observer error, Royle et al., 2005), but this is rarely done (for example see Latimer et al., 2006; Bierman et al., 2010). Both approaches assume that the relevant mechanisms influencing a species’ distribution are captured. Usually, the assumption is made that the functional forms of the relationships between species occurrence and environmental variables are correct. In contrast, forward process-based models do not use distribution data for their development, and therefore such data can be used to validate forward-process models (Chuine & Beaubien, 2001).
Correlative models require species to be in equilibrium with their environment, i.e. occurring throughout the suitable environmental space (although not to fill the geographic space completely). Process-based models can abolish the equilibrium assumption and use data from the non-equilibrium trajectory to fit the model (for potential bias resulting from transient phase data see Moilanen, 2000). When the equilibrium assumption is removed it is possible to assess range dynamics under climate change (Kearney et al., 2008; Keith et al., 2008) or commercial exploitation (Cabral et al., 2011).
2. Information required
Ecological knowledge of a given target species is used in both correlative and process-based models. In correlative models, it guides the pre-selection of explanatory variables. For example, when they are correlated, direct variables are preferable over indirect (e.g. temperature has a potential direct effect on thermal regulation, while elevation serves as a less specific proxy; Austin, 2002; Guisan & Thuiller, 2005). In process-based models, ecological knowledge guides the selection and formulation of processes represented in the model. Relevant ecological knowledge can be derived from experimental results or process observations (e.g. seed dispersal, phenology, growth). Correlative models require data on species distribution and relevant environmental factors for deriving correlations, while fitted process-based models require these same data to calibrate their unknown parameters. Data on species distribution and environmental factors are also used in model validation (see below).
Processes that occur at smaller spatial or temporal scales than environmental data used in model building can be difficult to capture. For instance, a species’ range may be limited by a few days of frost not captured in commonly used monthly temperature averages. A small refuge may result in a species’ presence in a 100-km² cell of unsuitable habitat, but land-cover data might not be at a sufficient resolution to capture this important microhabitat (Trivedi et al., 2008). The effect of temperature on many organisms is complex and cannot be represented by interpolated monthly mean air temperature from weather station records. The latter problem has been dealt with by combining models of microclimate with models of behavioural thermoregulation (Kearney et al., 2009b). Finding additional ways to represent subscale heterogeneity for the analysis at larger scales is an open challenge for both correlative and mechanistic models.
3. Determination of model structure
At first glance, it would seem that correlative and process-based models have little in common when it comes to the selection of which variables/processes are incorporated in the model. If, however, a likelihood can be formulated for the process-based model, information-theoretical model selection (Burnham & Anderson, 2002; Johnson & Omland, 2004) can proceed similarly for both approaches (see O’Hara & Sillanpää, 2009 for a Bayesian perspective). With Bayesian approaches becoming more widespread, more similarities with respect to the determination of model structure between correlative and process-based models may emerge, informing the analyst on processes relevant for the given data (e.g. Van Oijen et al., 2005).
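As a hedged illustration of this point, the sketch below computes Akaike's information criterion, AIC = 2k − 2 ln L, from predicted occurrence probabilities. The likelihood only needs the predictions, so it does not matter whether they come from a correlative or a process-based model. The Bernoulli likelihood, the function names and the example numbers are assumptions made for illustration and are not taken from the cited studies.

```python
import math


def bernoulli_log_likelihood(observed, predicted):
    """Log-likelihood of presence (1) / absence (0) data given predicted probabilities."""
    eps = 1e-12  # keeps log() finite if a model predicts exactly 0 or 1
    total = 0.0
    for y, p in zip(observed, predicted):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


def aic(log_likelihood, n_parameters):
    """Akaike's information criterion; lower values indicate better support."""
    return 2 * n_parameters - 2 * log_likelihood


# Hypothetical presence/absence records and predictions of two candidate models.
observed = [1, 1, 1, 0, 0, 0]
simple_model = [0.8, 0.7, 0.6, 0.4, 0.3, 0.2]   # assumed to have 2 fitted parameters
complex_model = [0.9, 0.7, 0.6, 0.4, 0.3, 0.1]  # assumed to have 5 fitted parameters

aic_simple = aic(bernoulli_log_likelihood(observed, simple_model), 2)
aic_complex = aic(bernoulli_log_likelihood(observed, complex_model), 5)
print(aic_simple, aic_complex)
```

In this made-up case the slightly better fit of the more complex model does not compensate for its additional parameters, so the simpler model receives the lower AIC.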
A more crucial distinction is how processes are included. In a correlative approach, allowing for nonlinearity, functional relationships are derived by fitting species occurrences or abundance to environmental data. In process-based models, choices about the specific process structure have to be made based on theory or observation, with potentially large ramifications for the model output even from seemingly small choices (such as between frequency- and density-dependence of disease transmission: Wasserberg et al., 2009). Evaluation of alternative choices is rare (or rarely published), however, and Smith et al.’s (2008) study on different density-dependence schemes for cormorant population dynamics is a rare exception. Effectively, this means that in addition to the validation of the complete model all its components need to be validated as well (e.g. LaDeau, 2010).
4. Verification
Verification refers to testing the technically correct implementation of the model, i.e. that the model does what it was specified to do (Schmolke et al., 2010). The use of this technical term is somewhat unfortunate, because, philosophically, verification of models is impossible (Oreskes et al., 1994), but we use it in line with other publications. Verification of a process-based model is usually carried out by running the model using settings for which the outcome is known, or can be derived analytically, and by comparing the model output with the expected result. Dimensional analysis is also a crucial ingredient, i.e. checking that the units of the right- and left-hand sides of model equations are identical. Essentially, the aim of verification is to try hard to find a flaw in the implementation by producing inconsistent results. In correlative models, verification comprises double-checking of settings, assumptions (error distributions) and pre-processing steps. In that sense, model diagnostics (i.e. distribution of residuals, check for spatial autocorrelation) are the most common steps in the ‘verification’ of statistical models. For process-based models, besides consistency with fundamental physical laws such as conservation of mass and energy, reproduction of analytical results or simulations using, for example, the virtual ecologist approach (Zurell et al., 2010) are options. Model verification is more complex than these lines suggest, and we thus refer to Starfield et al. (1990) and Grimm & Railsback (2005) for further details.
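One such verification step can be sketched as follows: a numerical implementation is run for a setting whose outcome can be derived analytically, and the two results are compared. Logistic population growth and forward Euler integration are chosen here purely as a familiar example; they, and all function names and parameter values, are assumptions and not part of any model cited above.

```python
import math


def logistic_analytical(n0, r, k, t):
    """Closed-form solution of dN/dt = r * N * (1 - N / K)."""
    return k / (1.0 + (k - n0) / n0 * math.exp(-r * t))


def logistic_euler(n0, r, k, t_end, dt):
    """Forward Euler integration of the same equation (the 'implementation')."""
    n = n0
    for _ in range(int(round(t_end / dt))):
        # Units: r [1/year] * dt [year] is dimensionless, so n stays in individuals.
        n += dt * r * n * (1.0 - n / k)
    return n


expected = logistic_analytical(n0=10.0, r=0.5, k=1000.0, t=10.0)
fine = logistic_euler(n0=10.0, r=0.5, k=1000.0, t_end=10.0, dt=0.001)
coarse = logistic_euler(n0=10.0, r=0.5, k=1000.0, t_end=10.0, dt=0.5)

# Relative deviation of each numerical run from the analytical result.
error_fine = abs(fine - expected) / expected
error_coarse = abs(coarse - expected) / expected
print(error_fine, error_coarse)
```

A large deviation at a small time step would point to an implementation flaw, whereas a deviation that shrinks as the time step is reduced is the expected behaviour of the integration scheme.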
5. Validation
Validation refers to the assessment of the correctness of model predictions using data not used for the building or calibration of the model. When independent data are available (ideally from another time and region; Lebreton et al., 1992; Schröder & Richter, 1999; Araújo et al., 2005), both modelling approaches can be validated externally. The commonly used cross-validation (also called internal validation) of correlative models is intrinsically optimistic compared with external validation, because it ‘only’ validates the model for data from the same region and time. The generality of the model hence remains unassessed. For fitted process-based models, external validation can also be carried out by comparing their parameter estimates with independent parameter estimates (Cabral & Schurr, 2010; Hartig et al., unpublished). In contrast to tuned process-based and correlative models, external validation is the rule in forward process-based models, where processes are usually parameterized on separate data sets. The absence of independent parameter estimates in the literature gives important feedback to empiricists for improving the knowledge of the species’ biology.
In addition to validation, both model approaches should also be assessed for their sensitivity and specificity (in the statistical sense). First, by using a simulated species, the ability of the model to recover the known true distribution/parameters can be assessed (i.e. the sensitivity of a method; Reineking & Schröder, 2006). Second, by randomizing the validation data, the model’s tendency to find patterns in data where there are none can be gauged (specificity). These tests seem to be more established (but not necessarily published; Grimm & Railsback, 2005) for process-based models than for correlative models (but see Dormann et al., 2008a).
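Both tests can be sketched with a deliberately simple ‘virtual species’. In the illustration below, the species occurs wherever temperature reaches a known threshold; a one-parameter threshold model is fitted, first to the true data to see whether the known threshold is recovered, and then to shuffled occurrences to see how well the method ‘fits’ data without any pattern. The threshold model, the temperature gradient, the simplification of shuffling the fitting data, and all names are assumptions made for this sketch.

```python
import random


def fit_threshold(temperature, presence):
    """Return (threshold, accuracy) of the rule 'present if temperature >= threshold'."""
    best_threshold, best_accuracy = None, -1.0
    for candidate in sorted(set(temperature)):
        hits = sum((t >= candidate) == bool(p) for t, p in zip(temperature, presence))
        accuracy = hits / len(presence)
        if accuracy > best_accuracy:
            best_threshold, best_accuracy = candidate, accuracy
    return best_threshold, best_accuracy


# Virtual species with a known truth: present where temperature >= 12 degrees C.
TRUE_THRESHOLD = 12.0
temperature = [i / 10 for i in range(301)]           # gradient from 0 to 30 degrees C
presence = [int(t >= TRUE_THRESHOLD) for t in temperature]

# Test 1: can the method recover the known truth?
recovered, accuracy_true = fit_threshold(temperature, presence)

# Test 2: how well does the method 'fit' data without any pattern?
rng = random.Random(1)          # fixed seed so the sketch is reproducible
shuffled = presence[:]
rng.shuffle(shuffled)
_, accuracy_shuffled = fit_threshold(temperature, shuffled)

print(recovered, accuracy_true, accuracy_shuffled)
```

The accuracy obtained on the shuffled data provides a baseline: it cannot fall below the prevalence of the virtual species, because predicting ‘present everywhere’ is always available to the method.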
6. Sources of uncertainty in model predictions
Generally, model uncertainty is poorly quantified (Clark et al., 2001). Several sources of reducible (or epistemic) uncertainty pertain to both modelling approaches (Beven & Freer, 2001; Barry & Elith, 2006; Refsgaard et al., 2007): input data uncertainty, model misspecification, equifinality (see next section), parameter uncertainty, model stochasticity and regression dilution. Even worse, these errors may be non-independent, thus amplifying their effects rather than cancelling each other out. Obviously, incorrect input data used for parameterization of processes or fitting of the correlative model will bias predictions. For correlative models it has been shown that bias in presence–absence data (e.g. due to the so-called botanist effect; Applequist et al., 2007; Pautasso & McKinney, 2007) is more serious than undersampling per se (Dennis et al., 1999; Royle et al., 2005). Similarly, an incorrectly specified model, for example one where a nonlinear process is represented by a linear relationship or a relevant predictor/process is absent, can distort the output. In correlative and fitted process-based models the distortion may be difficult to detect, as it may be compensated by altered values of other fitted parameters, leading to good model fits. In forward process-based models, the incorrect representation of processes is likely to yield stronger bias in model predictions because there is no room for such compensation. Because correlative models are typically more flexible than process-based models, the latter tend to be more biased (Hartig et al., unpublished). However, this also implies that a fitted model may be giving the right results for the wrong reasons (see next section ‘Equifinality’). The inclusion of stochastic processes in process-based models (e.g. dispersal, mortality) or the use of randomization steps in correlative models (cross-validation, bootstrap aggregation) will yield different model outcomes despite identical initial conditions/data.
Thus, even attempts to include more and more processes in order to make a model more realistic are ultimately confronted with such stochastic and irreducible (= aleatory) variability, defining the fundamental limit of a model’s accuracy.
A large, but in its effects difficult to quantify, uncertainty derives from the lack of representation of small-scale processes in large-scale data. Animals can avoid microclimatically adverse conditions so that the climate encountered by an organism is often different from the regional average. The difficulty of deciding whether to include a microscale process into a process model is conceptually similar to having an indirect explanatory variable in a regression model. In addition to this scale problem, coarse environmental data usually have large errors, which leads to ‘regression dilution’ and hence underestimation of the strength of a relationship (McInerny & Purves, 2011). (Note that error in the response variable does not cause a bias in ordinary regression, while error in the predictor variable does; Draper & Smith, 1998.)
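The attenuation caused by error in the predictor can be demonstrated with a small simulation. The sketch below is illustrative only: the true slope of 2, the Gaussian errors and the helper `ols_slope` are assumptions chosen for this example. For this classical setting the expected slope is the true slope multiplied by var(x) / (var(x) + var(error)), i.e. 2 × 1 / (1 + 1) = 1 here.

```python
import random


def ols_slope(x, y):
    """Ordinary least-squares slope of y regressed on x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    variance = sum((a - mean_x) ** 2 for a in x)
    return covariance / variance


rng = random.Random(42)  # fixed seed for reproducibility
n = 5000
true_slope = 2.0

x_true = [rng.gauss(0.0, 1.0) for _ in range(n)]             # true environment
y = [true_slope * x + rng.gauss(0.0, 0.5) for x in x_true]   # response with noise
x_coarse = [x + rng.gauss(0.0, 1.0) for x in x_true]         # predictor measured with error

slope_clean = ols_slope(x_true, y)      # close to 2: error in y does not bias the slope
slope_diluted = ols_slope(x_coarse, y)  # close to 1: error in x attenuates the slope
print(slope_clean, slope_diluted)
```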
7. Equifinality
For a given data set, several parameterizations may exist that equally fit the data (‘non-identifiability’). This equifinality (Beven & Freer, 2001) is the consequence of a statistically ill-posed problem, where the information content in the calibration data set is insufficient to filter out a single parameter set from all possible sets. One consequence is that we cannot identify a single best parameter set that is likely to produce the right results for the right reasons (Kirchner, 2006). The causes of equifinality differ between correlative and process-based models (collinearity and over-parameterization, respectively), but the problem of resulting prediction uncertainty is the same. In ecology, this problem has not received much attention (but see Penteriani, 2008; Luo et al., 2009), mainly because ecological models are complex and data are sparse, and hence fitting models other than very simple models is rare (Schulz et al., 2001; Lele et al., 2010). Even though we can use the full set of equifinal solutions for averaged prediction (in both a Bayesian and a frequentist setting; Link & Barker, 2006; Dormann et al., 2008b), we do not learn much about our system from this fitting exercise.
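A minimal, hypothetical example of non-identifiability: if a model contains two parameters that only ever act as a product, distribution or abundance data cannot separate them. In the sketch below, population growth is assumed to depend on fecundity × survival; the model form, the names and the numbers are assumptions made for illustration.

```python
def predicted_growth(fecundity, survival, habitat_quality):
    """Toy process model: growth rate = fecundity * survival * habitat quality."""
    return [fecundity * survival * q for q in habitat_quality]


def sum_of_squares(observed, predicted):
    """Misfit between observations and predictions (lower is better)."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted))


habitat_quality = [0.2, 0.5, 0.8, 1.0]
observed_growth = [0.41, 0.98, 1.63, 1.97]   # hypothetical field data

# Two ecologically very different parameter sets ...
fit_a = sum_of_squares(observed_growth, predicted_growth(4.0, 0.5, habitat_quality))
fit_b = sum_of_squares(observed_growth, predicted_growth(10.0, 0.2, habitat_quality))

# ... that the data cannot tell apart, because only the product (2.0) matters.
print(fit_a, fit_b)
```

In such a case an independent estimate of one of the two parameters would be needed to resolve the ambiguity; the calibration data alone cannot do so.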
When using distribution models for prediction beyond the data range (extrapolation in geographical space or in time, where new environmental conditions occur), more assumptions become relevant for both approaches. So far, studies have commonly considered stationarity, i.e. that model parameter estimates remained constant through space and time (but see Kearney et al., 2009a; Hothorn et al., 2011). Specifically, this means that the environmental niche of the species does not change (e.g. through microevolution, genetic drift or acclimation; Aitken et al., 2008). Process-based models can alleviate this problem by trying to explicitly represent microevolutionary environmental niche shifts in the model (Kearney et al., 2009a; Chevin et al., 2010).
Furthermore, both correlative and process-based approaches assume that the way variables/processes interact will be the same in the extrapolated case as they were with the original data. For correlative models this means that the correlations found when the model was built will remain the same in the (far) future. For process-based models this means, for example, that the functional forms of the processes and parameter values stay the same. This may be quite likely for some processes (e.g. those depicting thermodynamic laws, such as body temperature or water balance), but less likely for others (e.g. a dispersal function or biotic interactions). The palaeoecological record indicates clearly that plant species in the past have reacted idiosyncratically to climatic changes (Huntley, 1991). Furthermore, the extent to which it is reasonable to extrapolate also depends on whether a process has been described empirically or from the structure and assumptions of a general theory. For correlative models, the model should not be extended outside the conditions under which the measurements were performed (e.g. elevated CO2). For process-based models, it seems reasonable to extrapolate to conditions under which the general theory is supposed to hold.
9. Transferability to other species, sites and times
We are not aware of many comparative studies of this type for process-based models (but see Bugmann & Solomon, 1995, 2000; Bugmann, 1996), possibly because the choice of parameters is usually tailored to a certain species (see ‘Information required’, above).
10. When to stop: accuracy versus complexity
Correlative models can be fitted to data in a matter of minutes to hours. In fact, preparations of environmental and occurrence data usually take much longer than the actual statistical modelling process itself. This fast deployment time is probably the main cause of the proliferation of statistical methods in our data-rich times.
Process-based models commonly take a long time to develop, as they often simulate nonlinear dynamics and hence have to deal with issues such as numerical diffusion and time stepping (Press et al., 2007). Furthermore, they are usually very sensitive to initial conditions and need burn-in periods to achieve a reproducible, stable steady state. This can take considerable computation time and hence slow down the developmental cycle even more. Even the use of an existing process model for a new species can take considerable time and effort, as the parameterization requires either collection of experimental or observational data with respect to the phenology and physiology of the species or model calibration to an existing distribution data set.
Scaling-up and sensitivity analyses of complex process-based models can be time-consuming (Bolker et al., 1998; Pagel et al., 2008). Due to the computational demand of dynamic process-based models, this cannot always be achieved using automated tools. In fitted process-based models, an accuracy–complexity return curve (depicting gain in accuracy over model complexity) is likely to be similar to that of a correlative model, levelling off fast once a ‘sufficient’ level of complexity is reached. For correlative models this is described by the ‘variance–bias trade-off’ (Hastie et al., 2009), but for process-based models we are not aware of any modelling study systematically investigating the accuracy–complexity curve (but see the studies of Cox et al., 2006; Crout et al., 2009; and Martínez et al., 2011, for systematic exploration of simplified versions of their process-based models).
11. Communicability/transparency of the model
Communication of a model requires: (1) a precise documentation of the steps/processes included, and (2) sufficient scientific background of both writer and reader to be able to judge their appropriateness. Model documentation is traditionally poor in process-based models and many efforts have been made to improve this situation (reviewed in Schmolke et al., 2010). The reluctance of many ecological modellers to make their code publicly available also contributes to low reproducibility of all but the simplest of models. Reasons for closed code include inelegant coding or insufficient documentation (Barnes, 2010) as well as the wish of the scientist to prevent others from using the model inappropriately or in ways that diminish the modeller’s own publishing prospects. For models of moderate complexity, re-implementation is actually a good way of testing the implementation, because potential errors are unlikely to be reproduced identically.
For correlative models the de facto standard statistical tool is R (R Development Core Team, 2010), and hence analyses can be transparently communicated through the exchange of software code (which is similarly true for other code-based software such as Python, Sage, Matlab or Mathematica). Software tools that are configured through a graphical user interface have the disadvantage that they often do not record the choices made by the user and hence require special care by the user to record and communicate all the chosen settings. Unfortunately, this is rarely carried out and therefore efficient logging of the chosen options by the program should be the standard (as it is for example in Maxent: Phillips & Dudík, 2008). Given the many choices available in correlative models, an analysis of the sensitivity of the results to alternative choices would be desirable, as stated above for process-based models.
12. Knowledge potentially gleaned from the model
Any useful model should be able to reproduce expected outcomes (see section 4), but it may also yield counterintuitive results, which are actually one of the most useful outcomes of modelling. They may identify new connections between processes and should generate new hypotheses that can be confirmed by experiment or other data. Surprising results obtained using a correlative model, such as an unexpected correlation between a species’ distribution and a particular environmental variable, may in fact lead to the discovery of a new process, while surprises resulting from the use of process-based models usually relate to unexpected, emergent patterns as a result of nonlinear interactions between processes that are already in the model (for an example of this see Eisinger & Thulke, 2008).
We speculate that, in general, correlative models used in the exploratory sense are more likely to result in discoveries of new processes or process interactions than process-based models, where the processes and interactions have to be defined a priori. Formally comparing forward process-based models with data may detect process deficits, but will not necessarily identify the missing processes.
13. Common errors and misuses
The most common ‘error’ of any modeller is to ‘believe’ a model. Models are abstractions of reality and their correctness of abstraction has to be demonstrated (Krakauer et al., 2011). No ecological model can be right a priori, because fundamental laws do not exist in ecology (Lawton, 1999), and, with respect to species distributions, there is no ‘quantum biogeography’. From this first error in ‘attitude’ follow three common misuses. Firstly, either model type – correlative or process – is often stretched beyond (sometimes far beyond) the range of data underlying it. For example, a model correlating the abundance of a freshwater fish species in temperate Europe with environmental data cannot be expected to ‘work’ when transferred to the Mediterranean. This is not because of the distance involved, but because the wet season is the cold season in the Mediterranean, while it is the warm season in Central Europe. We would extrapolate the parameter estimates to combinations of temperature and precipitation never encountered in the region for which the model was developed. Climate change predictions using correlative models often fall into this category (Ohlemüller et al., 2006).
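Whether a prediction site represents a combination of conditions never encountered during model building can be checked before predicting. The sketch below is one simple, assumed way of doing so: climate space is divided into grid cells, and a site is flagged as novel if no training record falls into its cell. The bin widths, function names and climate values are hypothetical and would need to be chosen for the study at hand.

```python
import math


def occupied_cells(training_sites, temp_bin=2.0, precip_bin=20.0):
    """Grid cells of (temperature, precipitation) space covered by the training data."""
    return {
        (math.floor(temp / temp_bin), math.floor(precip / precip_bin))
        for temp, precip in training_sites
    }


def is_novel(site, cells, temp_bin=2.0, precip_bin=20.0):
    """True if the site's climate combination falls in a cell without training data."""
    temp, precip = site
    return (math.floor(temp / temp_bin), math.floor(precip / precip_bin)) not in cells


# Hypothetical monthly climates (degrees C, mm) where warm months are also wet.
training = [(2.0, 40.0), (8.0, 55.0), (14.0, 70.0), (18.0, 85.0)]
cells = occupied_cells(training)

warm_and_dry = (18.0, 45.0)   # each value lies within the training range ...
cool_and_dry = (8.5, 50.0)

print(is_novel(warm_and_dry, cells))   # ... but the combination was never sampled
print(is_novel(cool_and_dry, cells))
```

Checking each variable's range separately would not flag the warm and dry site, because both of its values lie within the ranges covered by the training data; only the combination is new.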
The second misuse is to employ the model for an application for which it was not developed, without due validation/justification. If an ecologist builds a model to understand home-range size of a passerine bird and includes landscape composition as a parameter, then this model does not automatically qualify as an assessment tool for landscape structure with respect to bird abundance. The reason is simply that for his or her initial purpose the ecologist may not have looked at abundance at all, instead inferring it as a by-product of home-range packing. But where in the model does it state that each of these ‘virtual home ranges’ must be occupied?
The final common misuse follows from overconfident communication of a model’s predictions (even within the parameter range). A simple diagnostic is whether uncertainty was quantified or discussed: if it was not, the user/modeller is likely to be overconfident in the model’s prediction. Not an error, but a missed opportunity for any model, is the failure to specify a set of predictions for sites with environmental conditions not encountered when assembling the model. Collecting data in exactly these conditions would then serve as a critical test.
Critical issues for species distribution models
Critical issues for correlative models
Data! Everything depends on the amount, quality and appropriateness of data. Many statistical papers have developed fixes for biased sampling, missing values, unbalanced designs and so forth, but when the relevant ecological driver has not been quantified, no amount of data will be able to generate ecologically interesting hypotheses.
The causality of detected correlations is a critical issue for the use of correlative models, where the input variables are often correlated among themselves. For example, the observation that the occurrence of a species is correlated with mean annual temperature does not necessarily imply that temperature is itself a direct limiting factor. The limiting factor could also be solar radiation, which is usually not represented in the data sets as accurately as temperature, or the presence of a competitor that is itself limited by temperature. If the temperature correlation is used to make a prediction of the species distribution under climate change, this could lead to incorrect results, as climate change causes temperatures to increase, but not solar radiation.
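The temperature and solar radiation example can be made concrete with a small simulation. This is a sketch under stated assumptions: all data are simulated, the variables are standardised, a continuous abundance index stands in for occurrence so that ordinary least squares suffices, and the effect sizes are arbitrary. The response depends on radiation only, but the model is fitted to temperature, the correlated proxy.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000

# Hypothetical present-day data: radiation is the true driver,
# temperature is merely correlated with it.
radiation = rng.normal(0.0, 1.0, n)
temperature = radiation + rng.normal(0.0, 0.5, n)
abundance = 2.0 * radiation + rng.normal(0.0, 0.5, n)

# Correlative model fitted on temperature alone (np.polyfit returns slope first).
slope, intercept = np.polyfit(temperature, abundance, 1)

# Climate change scenario: temperature rises by 3 units, radiation is unchanged.
warming = 3.0
predicted_change = slope * warming  # what the correlative model projects
true_change = 0.0                   # by construction: radiation did not change

print(f"fitted temperature slope: {slope:.2f}")
print(f"projected change in abundance: {predicted_change:.2f}")
print(f"true change in abundance: {true_change:.2f}")
```

The fit to present-day data is good, and nothing in those data reveals that the projected change under warming is spurious.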
Problems can arise when extrapolating in space, as the correlations between input parameters may be different in different places and hence a non-causal correlation found in one place could lead to incorrect results when extrapolated to another place. An additional problem when extrapolating in time is that the increasing atmospheric CO2 concentrations are likely to have an impact on plant species ranges due to the alleviation of water stress (Farquhar, 1997). This effect is unlikely to be detected in the present data because the spatial variability of CO2 is negligible. For this reason, cross-validation using present data sets that were not used for the derivation of the model may shed some light on the uncertainty of the model predictions when extrapolating in space, but it does not necessarily serve as an indication of the uncertainty when extrapolating in time, particularly under climate change.
The use of historical records for validation (e.g. pollen records in combination with historical climate, where available), however, could give some indication of the uncertainty when extrapolating in time, but this is rarely carried out (e.g. Pearman et al., 2008). Also, with genetic information becoming increasingly available, genetic structure may reflect historical developments and hence provide additional opportunity for validation.
Critical issues for process-based models
Fitted process-based models rely on the implicit assumption that the model structure and process formulations are correct and that the unknown parameter values can be obtained by inverse modelling or available observations. Because the model parameters are fitted to reproduce observations, the same observations cannot be used to test for the correctness of the model structure and process formulations. The accuracy to which a tuned model reproduces the data is not an indication of correct process representations, as any data stream can be reproduced to an arbitrary level of accuracy using, for example, a polynomial function with sufficient degrees of freedom. On the other hand, even if a model was a true representation of the relevant processes, there is no guarantee that the correct parameter values can be obtained through inverse modelling, as the available data may not be sufficient to allow identification of an unambiguous set of parameters that best reproduces the data. In fact, many different parameter sets can reproduce the data equally well (see Equifinality). The different, apparently equally valid parameter sets can yield very different predictions when used under changed conditions (e.g. Schulz et al., 2001).
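Both arguments in the paragraph above can be sketched in a few lines. The example assumes simulated data and a deliberately trivial, hypothetical ‘process model’ whose rate is `a + b * z`; it is not taken from Schulz et al. (2001). The first part shows a polynomial with as many coefficients as data points reproducing pure noise; the second shows parameter sets that are indistinguishable under calibration conditions (`z = 1`) but diverge under changed conditions (`z = 2`).

```python
import numpy as np

rng = np.random.default_rng(2)

# Part 1: a flexible function reproduces any data stream.
x_noise = np.linspace(0.0, 1.0, 8)
y_noise = rng.normal(size=8)  # pure noise, no process at all
poly = np.polynomial.Polynomial.fit(x_noise, y_noise, deg=7)
max_resid = np.max(np.abs(poly(x_noise) - y_noise))
print(f"largest residual of degree-7 fit to 8 noise points: {max_resid:.1e}")

# Part 2: equifinality in a hypothetical two-parameter process model.
def model(x, z, a, b):
    """Response increases with x at rate (a + b * z); z is an environmental state."""
    return (a + b * z) * x

x = np.linspace(0.0, 10.0, 50)
z_cal = 1.0  # z never varied during calibration, so only a + b is identifiable
obs = model(x, z_cal, 1.0, 2.0) + rng.normal(0.0, 0.1, x.size)

candidates = [(1.0, 2.0), (2.5, 0.5), (0.0, 3.0)]  # all have a + b = 3
rmse = [np.sqrt(np.mean((model(x, z_cal, a, b) - obs) ** 2)) for a, b in candidates]

z_new = 2.0  # changed conditions
pred = [model(10.0, z_new, a, b) for a, b in candidates]

for (a, b), r, p in zip(candidates, rmse, pred):
    print(f"a={a}, b={b}: calibration RMSE={r:.3f}, prediction at z=2: {p:.1f}")
```

The three parameter sets share the same calibration error, so the calibration data give no basis for choosing among their differing predictions.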
Forward process-based models often rely on empirical parameterizations of the processes considered. This again introduces problems akin to those described above for correlative models, as the causality of observed correlations is not necessarily assured. If, for example, the observed correlation between mean daily temperatures and the onset of leafing or flowering was due to a cross-correlation between solar irradiance and temperatures, while solar radiation was the directly responsible variable, the model could lead to a wrong prediction under climate change, where temperatures increase but not solar radiation. Furthermore, neither empirical process-based nor correlative models would capture the adaptation of a species’ phenology to changed climate.
Conclusions and outlook
Our review of the similarities and differences between correlative and process-based species distribution models emphasizes that they sit on a continuum defined by the extent to which processes are explicitly represented. When these two broad types of models are fitted to observed data, there is considerable overlap in their assumptions, validation challenges and reproducibility problems. Although representing two very different conceptual starting points for species distribution modelling, they may well converge onto the same problems with respect to prediction of environmental and management change. Neither approach warrants the inference that reproduction of observations is indicative of the model being ‘true’ (‘right for the wrong reason’; Judd, 2003). Both the causality of correlations found using a correlative model and the interplay of mechanisms proposed in a process-based model should be considered as hypotheses. However, in the former, the model itself and the data cannot be used to test the hypotheses, as they have already been used to generate the hypotheses. In a forward process model, on the other hand, mechanisms are proposed based on theoretical grounds or independent data, and hence, in theory, they can be tested using the match between model results and observations. However, in practice, most process-based models have a large number of adjustable parameters that need to be calibrated against observations. This precludes the use of the same data for hypothesis testing and reduces the use of the model to an extrapolation tool.
The future development of both correlative and process-based approaches is likely to see a mixing of their strengths (Mokany & Ferrier, 2010): data-driven implementation conveys trustworthiness because it is based on ‘real data’; modelling of actual processes emanates scientific rigour and mechanistic understanding. The key test for either approach, however, is its usefulness for the question at hand. Can either of the two approaches identify previously unknown mechanisms thus generating knowledge? Are models accurately predicting suitable sites as confirmed by transplant experiments? Are uncertainties small enough to allow selection between different management scenarios?
We would like to highlight three avenues for research on species distributions, as follows.
1 Bayesian fitting of process-based models. To understand how certain the knowledge we put into the model actually is, we can fit the model to observed data sets (‘model inversion’). When the uncertainty of model parameters enters the fitting process as priors, the estimated posterior distribution of each parameter indicates the statistical support of the data for that parameter. Note, however, that such Bayesian process modelling is still being developed and that we may ‘fit models that are far beyond our ability to understand them’ (Hodges, 2010, p. 3497). If a parameter’s posterior distribution largely overlaps with 0, we would conclude that under this model there is no evidence for the process in this data set. For a given system, generic models could thus be tailored and simplified. Model inversion could hence be used in an inferential way.
2 ‘Forward’ process-based models. To avoid the need for parameter fitting, unknown parameters in process-based models can be determined using detailed observations or experiments, such as in PHENOFIT (Chuine & Beaubien, 2001) or Niche Mapper (Kearney et al., 2008). Alternatively, models can be formulated that simulate natural selection from a randomly generated pool of virtual species resembling the species of interest, akin to the JeDi model (Reu et al., 2011). Forward process-based models avoid the problem of equifinality and can be used for hypothesis testing as they are much less likely to produce the right result for the wrong reasons. Forward process-based models, if based on first principles, may also lead to more reliable predictions of species distributions under environmental change, as their probability of matching species distributions under new conditions should be comparable with their probability of matching them at present (first principles do not change).
3 Combined workflow. Although we juxtaposed correlative and process-based models, it may actually be fruitful to join them in a combined workflow (Mokany & Ferrier, 2010; Peng et al., 2011). Scientific understanding of nature starts with observations, i.e. descriptive data. Correlative models efficiently sift through such data, thereby generating hypotheses on potentially underlying processes. These can then be taken up by process-based models, along with ecological theory and experimental evidence. Unknown parameters in process-based models could guide experimental and theoretical research to gather relevant knowledge for their quantification. The resulting process-based models can then generate predictions specifically designed for a formal test on independent data. In such a comprehensive approach, researchers with different interests, expertise and focus can synergistically progress the field in a way neither correlative nor process-based approaches can do by themselves.
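The inferential use of model inversion described under the first avenue can be sketched with a toy stand-in. The assumptions are considerable: the ‘process model’ is linear in two hypothetical process strengths, the observation error is Gaussian with known standard deviation, and the priors are Gaussian, so that the posterior is available in closed form. A realistic process-based model would typically require sampling methods such as MCMC instead. All data are simulated, and the second process is given no effect.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200

# Hypothetical drivers of two candidate processes and simulated observations.
X = rng.normal(size=(n, 2))
true_beta = np.array([1.0, 0.0])  # process 2 has no effect in this data set
sigma = 1.0                       # observation error sd, assumed known
y = X @ true_beta + rng.normal(0.0, sigma, n)

# Prior knowledge on process strengths: independent Normal(0, tau^2).
tau = 2.0

# Conjugate Gaussian posterior for the linear model with known sigma.
post_cov = np.linalg.inv(X.T @ X / sigma**2 + np.eye(2) / tau**2)
post_mean = post_cov @ X.T @ y / sigma**2
post_sd = np.sqrt(np.diag(post_cov))

# Central 95% credible intervals for each process strength.
lower = post_mean - 1.96 * post_sd
upper = post_mean + 1.96 * post_sd

for name, m, lo, hi in zip(["process 1", "process 2"], post_mean, lower, upper):
    verdict = "overlaps 0" if lo <= 0.0 <= hi else "excludes 0"
    print(f"{name}: posterior mean {m:.2f}, 95% interval [{lo:.2f}, {hi:.2f}] ({verdict})")
```

In this setting, a posterior for the second process that is concentrated around 0 would be read as no evidence for that process in this data set, which is the basis for simplifying a generic model for a given system.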
In conclusion, we find no reason why a proponent of either of the two extremes of correlative and process-based species distribution modelling should hold the moral high ground. ‘Correlationists’ should be humble: their model’s success may be due to spurious correlations. ‘Mechanists’ should be unassertive about their approach, because they will only find effects of processes that they included. Either approach must comply with nature, statistically or mechanistically, and be aware of the kinds of questions they are best suited to answer.
We are grateful to the following colleagues whose comments helped us to improve the clarity and focus of this publication: Rampal Etienne, Lee Hannah, Thomas Hickler, Steven I. Higgins, Bob O’Hara, Peter Linder, Greg McInerny, Frank Schurr, Ralf Seppelt and Konstans Wels, as well as Jens-Christian Svenning, Robert Whittaker and two anonymous referees. The work was initiated through a workshop ‘The ecological niche as a window to biodiversity’, organized by Christine Römermann, Bob O’Hara and Steven Higgins and funded by the LOEWE-BiK-F Biodiversity and the Climate Research Centre Frankfurt. Funding to C.F.D. by the Helmholtz Association (VH-NG 247) and the German Research Foundation DFG (DO 686/5-1), to S.J.S. by the Max Planck Society and to C.R. by the DFG (RO 3842/1-1) is gratefully acknowledged.
Carsten Dormann is a statistical ecologist with an interest in extending correlative approaches into process-based modelling. His fields of research comprise non-predictive areas of species distribution modelling, experimental community ecology and ecological networks.
Stan Schymanski is an ecohydrological process-based modeller. He seeks common thermodynamic principles behind the organization and growth of vegetation. Together with their co-authors they represent a diverse background and attitude towards the correlative–process continuum of species distribution models.
Author contributions: C.F.D. and S.J.S. led the discussion and wrote the first draft. All authors structured the study and co-wrote the final manuscript.
Editor: Jens-Christian Svenning
The papers in this Special Issue arose from two workshops entitled ‘The ecological niche as a window to biodiversity’ held on 26–30 July 2010 and 24–27 January 2011 in Arnoldshain near Frankfurt, Germany. The workshops combined recent advances in our empirical and theoretical understanding of the niche with advances in statistical modelling, with the aim of developing a more mechanistic theory of the niche. Funding for the workshops was provided by the Biodiversity and Climate Research Centre (BiK-F), which is part of the LOEWE programme ‘Landes-Offensive zur Entwicklung Wissenschaftlich-ökonomischer Exzellenz’ of Hesse’s Ministry of Higher Education, Research and the Arts.