Towards causal relationships for modelling species distribution

1. Understanding the processes underlying the distribution of species through space and time is fundamental in several research fields spanning from ecology to spatial epidemiology. Correlative species distribution models (SDMs) involve popular statistical tools to infer species geographical distribution thanks to spatiotemporally explicit observations of species occurrences coupled with a set of environmental predictors. 2. So-called SDMs rely on the niche concept to infer or explain the distribution of species, though often focusing only on the abiotic component of the niche (e.g., temperature, precipitation), without clear causal links to the biology of species under investigation. This might result in an over-simplification of the complex niche hypervolume, resulting in a single model formula whose estimates and predictions lack ecological realism. 3. We believe that a causal perspective associated with a finer definition of the modelling target is necessary to develop ecologically more realistic outputs. Here, we propose to infer the geographical distribution of a species by applying the modelling relation approach, a causal conceptual framework developed by the theoretical biologist Robert Rosen, which can be formalized through structural equation modelling (SEM). 4. Implementing the modelling relation into SDMs would improve the inclusion of the causal processes underlying the spatial distribution of species into an inferential formal system, potentially highlighting the methodological steps where uncertainty arises and eventually resulting in model outputs which are tightly linked to the ecology of the target species.


| INTRODUC TI ON
Understanding the processes underlying the distribution of species through space and time is a fundamental topic in several research fields including ecology, epidemiology and biodiversity conservation (Franklin, 2023).The geographical distribution of a species is commonly inferred using the so-called species distribution models (SDMs).Here, we define SDMs as correlative models (e.g., generalized linear models, random forest, maxent) that establish a statistical relationship between an observed response variable describing the species distribution in the geographical space (e.g., presence-absence) and a set of predictors describing the environmental space occupied by the species over large geographical extents.The rapid availability of open-access biodiversity data (e.g., BIEN, sPlotOpen, GBIF; Enquist et al., 2016;GBIF: The Global Biodiversity Information Facility, 2023;Sabatini et al., 2021), environmental predictors (e.g., WorldClim, Fick & Hijmans, 2017) and open-source statistical languages like R, contributed to the tremendous diffusion of these correlative approaches over the past two decades (Araújo et al., 2019;Franklin, 2023).
Nevertheless, numerous authors have raised concerns regarding the capacity of SDMs to accurately infer species distributions (Araújo et al., 2019;Kearney & Porter, 2009;Lee-Yaw et al., 2021), expressing specific criticisms about (i) the conceptual background of correlative SDMs (Austin, 2007;Kearney, 2006); (ii) the quality of the input data used to train the models (e.g.spatial and temporal biases when sampling distribution data; Hortal et al., 2008;Fourcade et al., 2014;Rocchini et al., 2023);(iii) the mismatch between the environmental conditions actually experienced by the target species and the spatial and temporal resolution of the abiotic predictors used in SDMs ( Urban et al., 2016;Lembrechts et al., 2020); and (iv) the ecological realism of SDMs outputs (Lee-Yaw et al., 2021).
These pitfalls have been widely discussed in the scientific literature and several methodological papers on the best practices were proposed (see for instance Araújo et al., 2019;Sillero et al., 2021;Zurell et al., 2020).The correlative aspect of these modelling exercises however remains, making SDM predictions often interpreted and evaluated mostly from a statistical perspective (e.g.models' predictive accuracy) rather than from their ecological realism (Austin et al., 2006;Hellegers et al., 2020;Merow et al., 2014).
In contrast, many scientists have argued for a causal approach to SDMs, incorporating biological knowledge into the models, and defining the hierarchical structure among the various factors influencing the geographical distribution of species (e.g., Austin, 2007;Chapman et al., 2019;Kearney & Porter, 2009;Purse & Golding, 2015;Urban et al., 2016).For instance, models based on species life-history traits (i.e., the characteristics influencing individuals' performance or fitness; Dawson et al., 2021;Nock et al., 2016) have been proposed as an implementation of classic correlative SDMs, since these life-history traits may reflect the different responses of a species to processes that modulate its distribution (Regos et al., 2019).
These models have the advantage of making explicit the causal links between the biology of the target species and its environment, although their complexity and the huge amount of information they require for parameterisation make them less tractable.
The use of Bayesian approaches and the tuning of Bayesian priors, which entail the incorporation of prior knowledge through the use of Bayes' rule, constitute another method to include causal mechanisms while remaining within the framework of correlative methods (van de Schoot et al., 2021).These approaches proved particularly useful when hierarchical structures had to be incorporated into models, as when dealing with complex spatiotemporal dynamics or when sampling efforts varied (Mäkinen & Vanhatalo, 2018).
An alternative approach to account for prior knowledge and hierarchical structure relies on the use of structural equation modelling (SEM).The SEM approach provides a comprehensive framework for modelling and analysing complex systems by incorporating both observed and unobserved variables, allowing researchers to go beyond simple correlations and examine the underlying structural relationships among variables (Grace, 2006).A central concept in SEM is the meta-model, which defines the hierarchical structure among several response and explanatory variables.This meta-model is essentially a theoretical framework that represents the researcher's understanding of how the variables are interconnected, describing the relationships between the variables based on prior knowledge, theoretical foundations or empirical evidence.Such a graphical representation of the links and interconnections among several response and explanatory variables is borrowed from graph theory and computer science, usually referred to as directed acyclic graphs (DAGs) with a set of rules that can be applied for observational causal inference in ecology (Arif & MacNeil, 2022).
Independently from the type of algorithm or statistical approach used in SDMs, incorporating causal relationships and drawing a DAG diagram for SDMs' applications require a deeper understanding of the species biology and the formulation of clear causal hypotheses about the drivers underlying the geographical distribution of the focal species.Given the widespread use of SDMs and their critical role in various research fields, we believe that embracing a causal perspective in SDMs is not only timely but also essential.Therefore, in this paper, we propose a conceptual and technical solution, borrowed from the SEM approach and graph theory relying on DAG representations, to take causal relationships into account in SDMs exercises.From a pure conceptual-level perspective, we introduce Robert Rosen's modelling relation framework (Rosen, 1978(Rosen, , 1986(Rosen, , 1993) ) as a causal scheme to guide the design of SDMs.Robert Rosen (1934Rosen ( -1998)), a theoretical biologist, introduced the conceptual framework called 'modelling relation' as a fundamental principle in understanding and representing complex systems like living organisms, arguing that traditional mathematical models often fall short in capturing their complexity (Rosen, 1978(Rosen, , 1986)).The modelling relation highlights the idea that a model should capture the essential organizational relationships and constraints of a system, capturing the underlying organizational principles that guide the system's behaviour rather than merely describing its components and interactions (Rosen, 1993).Rosen's emphasis on organization was a reaction against reductionist approaches that focus solely on the individual components of a system without considering a more holistic view of the systemic interactions and causal constraints that give rise to the system's properties.
From a more technical viewpoint, we propose to use SEM as the inferential approach within the modelling relation framework (the formal system in Robert Rosen's modelling relation scheme; Figure 1), aiming to better integrate the underlying causal processes behind the distribution of a species.We highlight the importance of a carefully constructed conceptual model, using SEM approaches or DAGs that are built upon the hierarchical nature of the relations linking a species distribution with its environment, to implement meaningful causal relationships and increase the ecological realism of SDMs.To illustrate this, we use a set of virtual species, transferring our hypothesized causal diagram or DAG into an SEM framework and comparing its results with those of a generalized linear model (GLM), a common algorithm used in correlative SDMs.

| IN CORP OR ATING HYP OTHE S IZED C AUSAL REL ATI ON S HIPS INTO S DMS
The niche concept is a fundamental notion in ecology and represents the conceptual backbone of SDMs.Different definitions of the niche concept have been proposed (Pocheville, 2015;Sales et al., 2021) but, essentially, the niche concept aims to define the environmental space in which a species could exist, persist and reproduce, allowing  and Peterson (2005), the so-called biotic, abiotic and movement (BAM) framework.According to the BAM framework, biotic and abiotic factors, as well as species dispersal limitations, determine the geographical distribution of a species.The intersection between the biotic and abiotic components returns the realized niche of the species (sensu Hutchinson, 1957).Consequently, the intersection between the realized niche and the accessible areas defines the actual or realized geographical distribution of the species (Soberón & Peterson, 2005).The BAM framework provides a way to operationalize the niche concept in the geographical space, making it appealing for inferring the distribution of a species through SDMs.Since its introduction in 2005, the BAM framework has become a mainstay in correlative SDMs exercises and has been applied in multiple scientific fields (e.g., Bible & Peterson, 2018;Escobar & Craft, 2016;Franklin, 2023).
Correlative SDMs' outputs depict (and synthesize) the distribution of a species as a detailed and spatially contiguous map representing an index of environmental/habitat suitability (Guisan et al., 2017), with the maximum values of this index typically interpreted as the areas that are most suitable for the target species.These maps are often visually attractive and are assumed to be straightforward to read and interpret, thus contributing to the promotion and dissemination of SDMs.These outputs, however, are primarily assessed from a statistical perspective (e.g., the model's predictive accuracy) rather than in terms of their ecological realism.Many efforts have been devoted to solving various methodological issues of SDMs, mainly dealing with statistical techniques, spatial and temporal autocorrelation in the data, spatial and temporal sampling bias of the response variable, variable selection, model selection, and predictive accuracy.The scientific literature is very rich in that respect (e.g., Aiello-Lammens et al., 2015;Bazzichetto et al., 2023;Brun et al., 2020;Fourcade et al., 2014;Hallgren et al., 2019;Muscarella et al., 2014;Qiao et al., 2015Qiao et al., , 2019;;Simmonds et al., 2020;Varela et al., 2014;see Sillero & Barbosa, 2020 for a summary of common methodological pitfalls of SDMs and Sillero et al., 2021 for a step by step methodological guide to SDMs).
However, the conceptual background necessary for generating meaningful and hypothesis-driven SDMs has been much less discussed (but see Araujo & Guisan, 2006;Austin, 2007;Thuiller et al., 2013).Interest in alternative modelling approaches looking for deeper causal relationships between the distribution of a species and its potential determinants has been growing (Arif & MacNeil, 2023;Briscoe et al., 2019;Feng & Papeş, 2017;Hartemink et al., 2011;Kearney & Porter, 2009;Kraemer et al., 2019;Staniczenko et al., 2017;Urban et al., 2016).Indeed, a modelling perspective based on the biology of the target organism and associated with a finer definition of the objective of the model might help to develop more ecologically realistic outputs with explicit causal links.This would help to avoid correlative SDMs outputs biased by spurious correlative spatial structure underlying both the response variable and predictors, especially when the predictors have no direct causal links with the response variable (Fourcade et al., 2018;Journé et al., 2020;Lozier et al., 2009), and to foster more meaningful and scale-appropriate interpretation of the results.
Incorporating causal relations into a model requires a basic knowledge of the study system or organism under investigation in order to formulate specific hypotheses that can later be translated into model equations.In this paper, we define a causal relationship as one for which scientists have a mechanistic basis for expecting that variations occurring in a predictor variable can lead to a change in the distribution of a response variable.This definition corresponds to the general scientific definition employed in the natural sciences and is the definition associated with the enterprise of causal modelling (Grace & Irvine, 2020).We recognize that the alternative enterprise of inferring causal relations from data in the absence of mechanistic knowledge, a common situation in the social sciences, introduces additional requirements.
Several authors have proposed practical suggestions or guidelines to clarify the model assumptions and increase the model's biological realism (e.g.Araújo et al., 2019;Chapman et al., 2019;Srivastava et al., 2021;Zurell et al., 2020).Conceptually speaking, we believe the so-called modelling relation framework developed by Robert Rosen in the 1980s (Rosen, 1986) could be especially relevant to incorporate causal relationships into SDMs.

| Rosen's modelling relation
Robert Rosen's modelling relation framework is a conceptual framework designed to understand how a biological system could be coded into an inferential mathematical system through causal inference (Mikulecky, 2001).The modelling relation can be defined as the process of relating two structures, a material one governed by causality and a mathematical one governed by inferential rules (see Chapt. 2-3 in Rosen, 1986).The former is the natural system, hence the causal system of investigation, while the latter is the formal system used to infer the natural one (Figure 1a).The relation between these two structures is given by 'encoding' the causality of the natural system into a formal system of inference and by 'decoding' such inference back to the causal phenomenon.The encoding arrow drawn from left to right of Figure 1 represents the observations and measurements of the natural systems aiming to capture its causality, while the arrow from the formal system towards the natural system represents the decoding operation of the prediction into the natural system made by the mathematical formal system.
Although the view of an inferential model in Rosen's modelling relation is not completely new (Pattee, 2007) and shares the same rationale of the backdoor criteria used when building DAGs (i.e., it uses domain knowledge, above all else, to determine the best causal model for a given causal query; see Arif & MacNeil, 2022), the modelling relation framework represents a valid epistemological tool to guide (and refine) the incorporation of ecological knowledge into biologically more realistic SDMs.To design the inferential model structure, the encoding section requires that the user summarizes the main assumptions and the uncertainties about the natural system (e.g., the main determinants of the distribution of a given species following the niche theory, such as the BAM diagram; Figure 1a), and to define them as mathematical equations and relations (e.g., translating the BAM diagram into a causal and mathematical diagram; Figure 1b).Clearly, if these assumptions are wrong or imprecise, we would obtain biased predictions, eventually resulting in a lack of ecological realism.In this view, Siekmann (2018) proposes Rosen's modelling relation as a type of process-based model where the model outputs from the formal system can be compared to the natural system and used to validate the assumptions.Similarly, an ecological process-based model generally focuses on a particular aspect of the natural system such as a given life-history trait of the target species, thus providing a possible explanation according to the underlying assumptions of the formal system (Siekmann, 2018).It follows that various models can be built under different assumptions (e.g., different and competing causal diagrams), and their results are compared and interpreted in the light of the ecological assumptions they, respectively, made on the natural system (Fudge & Turko, 2020).
Rosen's modelling relation can thus be used to design and compare different competitive hypotheses about the investigated natural system, therefore treating modelling as an experimental exercise (Metcalf, 2019;Siekmann, 2018).

| Applying Rosen's modelling relation
To date, few attempts have been made to include the modelling relations in SDMs exercises.For instance, Kineman (2007Kineman ( , 2009) ) as well as Kineman and Wessman (2021) applied a correlative approach where response curves between the predicted habitat suitability and the environmental factors were mostly tuned by visual interpretation and expert-based assessment.In particular, Kineman (2007) highlighted how his approach was mainly designed as an exploratory tool to learn about ecological relationships and test ecological hypotheses.However, we could not find a broader application of Rosen's modelling relation aiming at modelling species distribution.
As a conceptual framework, the modelling relation is independent of the statistical method used (Metcalf, 2019;Siekmann, 2018), but we suggest that the rationale behind the SEM approach (Grace, 2006) fits well within the modelling relation of the formal system.
The SEM approach provides a comprehensive framework for analysing complex relationships (both direct and indirect) among variables by combining elements of factor analysis, regression analysis and path analysis (Grace, 2006).All relies on a causal diagram, a graphical representation of the hypothesized causal structure of the studied system (Fan et al., 2016;Garrido et al., 2022).One effective approach is the utilization of DAGs (Greenland et al., 1999;Pearl et al., 2016), which are constructed to represent researchers' hypotheses regarding how explanatory variables influence the response variable(s).Each variable can be defined as exogenous, endogenous or mediator.Exogenous variables are only independent variables (i.e., only pointed towards other variables).Endogenous variables are dependent variables (i.e., pointed at by other variables), but can also be used as independent variables pointing towards other endogenous variables in more complex structures, playing a mediating effect (i.e., mediators).For instance, variable A may affect variable C either directly or indirectly via a mediating effect from variable B, which means that variable A is exogenous while B and C are endogenous.Through SEM, DAGs can unveil confounding factors that must be considered in regression analysis to obtain unbiased coefficients.Moreover, they can reveal mediation pathways or situations involving multiple response variables (Grace, 2006).
The strength of SEM relies on testing different hypotheses (i.e., different causal diagrams that can be used as candidates and competing 'meta-models') about the causal relationships between the variables considered in the studied system.Recent advances in SEM allow us to deal with a wide range of error distributions (e.g., Poisson and binomial families) and data structures (e.g., hierarchical or longitudinal data set), thanks to the piecewiseSEM r package (Lefcheck, 2016;Lefcheck et al., 2020).Indeed, the hypothesized set of causal pathways can be validated only if the proposed model is consistent with the observations.In other words, if the modelestimated variance-covariance matrix can predict the variance-covariance matrix of the observational data set (Equation 1): where Σ is the observed variance-covariance matrix, and Σ(Φ) is the model-estimated variance-covariance matrix expressed in terms of Φ, the matrix of model-estimated parameters (i.e., coefficients).Austin (2007) was one of the very first scientists to propose the application of SEM to SDMs, advocating the importance of including and evaluating a causal structure in the modelling exercise.However, due to technical limitations such as the application of SEM to data not fitting a Gaussian error distribution and the estimate of only linear relationships prevented a broader application of this methodology to data types commonly found in ecological studies (Grace, 2022;Lefcheck, 2016).

| C A S E S TUDY
To illustrate the potential of using SEM directly embedded into Rosen's modelling relation (cf. the formal system) and rooted in the BAM framework of the niche theory used in most SDM studies (cf. the natural system), we used a virtual species approach (Leroy et al., 2016;Meynard et al., 2019).We first simulated the geographical distribution of two virtual species.The first one is fully dependent on the abiotic conditions while the second one is influenced by both the abiotic conditions and the presence of the first species.
Then, we provided a causal diagram or DAG aiming to explain the spatial distribution of the second virtual species employing both direct and indirect (mediating) effects from both abiotic and biotic (the first virtual species) constraints.

| Virtual species
The virtual species approach provides the great advantage of knowing precisely the species' ecological niche and its predicted distribution into the geographical space (Meynard et al., 2019).
Here, for the sake of simplicity, we considered only two bioclimatic variables retrieved from the WorldClim2 database (BIO1 for mean annual temperature and BIO12 for mean annual precipitation; Fick & Hijmans, 2017).The spatial extent of the area of interest (AOI; spatial resolution of ~10 minutes, ~18.6 km at the Equator) was cropped to match that of Central and Southern Europe to reduce the computational effort of this illustrative application (Figure 2a,b).Specifically, we created a virtual tree species whose geographical distribution depends on its response to both BIO1 (thermal range: 5-13 °C) and BIO12 (precipitation range: 526-1257 mm; Figure S1.1a,b).This results in a tree species mostly distributed in the mountainous area of Europe (Figure 2d), displaying a continentality gradient (East-West macroclimatic gradient) coupled with higher suitability at the cold end of the BIO1 gradient.The geographical distribution of the second virtual species, a shadetolerant herbaceous species, is driven by the same abiotic variables as the virtual tree species, but favoured by a warmer range of mean across the AOI was generated using binomial GLMs, or logistic regressions, assuming sigmoid (i.e., non-quadratic) response curves between the occurrence of the species and the chosen predictors (Equation 2), and following the approach described in Bazzichetto et al. (2023).
where logit(p i ) is the natural logarithm of the odd ratio p i /(1-p i ), α is the model intercept, β pr is the regression parameter for the linear term (i.e., sigmoid shape) of precipitation and β tm is the regression parameter for the linear term (i.e., sigmoid shape) of temperature.Regression parameters for the tree species were set to 1 (α), 0.01 (β pr ) and −1 (β tm ), while for the herb species, they were set to 1 (α), 0.015 (β pr ) and −0.85 (β tm ).
Logit-transformed probabilities were turned to the unit interval [0,1] using the logistic function available through the plogis function in the stats r package (R Core Team, 2023).
We decided to constrain the geographical distribution of the herb species by the occurrence of the virtual tree species, to simulate an obligate biotic interaction (i.e. the herbaceous species benefits from growing in the shade of the virtual tree species).To simulate this biotic constraint, we computed the germination rate of the virtual herbaceous species as a function of the habitat suitability of the virtual tree species: namely, the germination rate of the virtual herbaceous species increased logarithmically with the habitat suitability provided by the virtual tree species (Figure S1.1c). (2) The set of abiotic variables (BIO1 and BIO1) used to create the two virtual species.(c) The germination rate of the virtual herb species is computed as a function of the habitat suitability of the virtual tree species.(d) The habitat suitability of the virtual tree species.(e) The habitat suitability of the virtual herb species.(f) Sampling locations.The geographic projection used is the WGS84-World Geodetic System 1984, EPSG: 4326.
Eventually, the resulting geographical distribution of the virtual herbaceous species (Figure 2e) was defined by the intersection between its climatic niche and the biotic constraint of its germination rate depending on the habitat suitability of the virtual tree species (Figure 2a-c).The obtained habitat suitability maps of the two virtual species (Figure 2d,e) were then converted into presence-absence maps using the function convertToPA of the virtualspecies R package.
To add stochasticity in this simulation exercise, we generated three different scenarios for the dispersal capacity of the virtual herb species, by varying its geographical prevalence (the number of pixels occupied by the species out of the total number of pixels available in the geographical space), while keeping fixed the virtual tree species geographical prevalence.As a result, we assigned a fixed geographical prevalence equal to 0.4 to the virtual tree species, while for the herbaceous species, we simulated three dispersal scenarios (low, medium, high) whose underlying geographical prevalence was set to 0.25, 0.50 and 0.75, respectively (Figure S1.2).We then randomly sampled 500 locations across the AOI to extract information on the presence-absence of each of the two virtual species, the value of the germination rate of the virtual herbaceous species, as well as the values of BIO1 and BIO12 (Figure 2f).We repeated this operation 10 times, and the predictive accuracy of each simulation was estimated using spatial cross-validation with 15 spatial folds retaining 80% of the observations for training and 20% for testing.This allowed us to generate a toy data set to calibrate our SEM models built within Rosen's modelling relation.A detailed description of the virtual species simulation, the sampling methodology and the R codes used to generate this modelling exercise are available on GitHub https:// github.com/ danddr/ SEM_ SDMs.

| Statistical analysis
The main goal of this modelling exercise is to demonstrate the applicability of the SEM approach (cf.causal diagrams) within Rosen's modelling relation and to compare its predictive accuracy along with the stability of the models' coefficients with respect to a traditional SDM algorithm not relying on causal diagrams such as GLMs.By presenting the modelling relation as a hypothesis testing conceptual exercise, we hypothesized a causal diagram aiming to describe the distribution of the target forest herb species (Figure 3), whereby the geographical distribution of the forest herb species represents the natural system and the causal diagram from the SEM approach represents the formal system.In the causal diagram or DAG (Figure 3): 1. BIO1 and BIO12 (abiotic components) have a direct effect on both the virtual tree and the virtual herb species distribution (Equations 3 and 5); 2. the occurrence of the virtual tree species has a direct effect on the germination rate of the herb species and an indirect (via the germination rate) effect on the actual distribution of the virtual herb species (Equation 4); 3. the germination rate (biotic component) of the virtual herb species has a direct effect on the actual distribution of the virtual herb species (Equation 5).
The causal diagram was then converted into a set of candidate models (Equations 3-5) using the piecewiseSEM and semEff R packages (Lefcheck, 2016;Murphy, 2020).The congruence of the estimated variance-covariance matrix hypothesized in the SEM with the observed variance-covariance matrix in the data was evaluated for each geographic prevalence and cross-validation iterations using a Fisher's C test, whose null hypothesis (H0) is that the model variance-covariance matrix can predict the observed variance-covariance matrix.Hence, a p-value > 0.05 for the Fisher's C test implies that the estimated variance-covariance matrix from the causal diagram mirrors the observed one in the data, therefore validating it (Lefcheck, 2016).
Finally, for comparison purposes and as an example of a classic non-hierarchical SDM, we computed a binomial GLM, where the presence-absence of the virtual herb species (cf. the only response variable) was modelled as a function of three predictor variables: BIO1, BIO12 and the germination rate.We also computed a set of metrics routinely used to assess the predictive performance of SDMs: (i) the area under the ROC curve (AUC); (ii) sensitivity; (iii) specificity; (iv) the true skill statistic (TSS); (v) the coefficient of determination (R 2 , here to be intended as a pseudo-R 2 computed using the Nagelkerke approach); (vi) and the root mean squared error (RMSE).The R 2 and the RMSE were computed by comparing the true (i.e., simulated) habitat suitability of the virtual herb species with the one predicted by each combination of models and geographical prevalence (Meynard & Kaplan, 2012).A detailed description of the validation metrics is available in Guisan et al. (2017).

| Results
The Fisher's C test did not initially support the causal diagram proposed in Figure 3  The predictive accuracy metrics computed for the models of the virtual herb species on the testing data set showed comparable outcomes for both SEM and GLM, whose variation was mainly related to the geographical prevalence of the virtual herb species rather than to the modelling technique used (Figure S1.3).The RMSE values of the SEM, in particular, showed a rather stable behaviour across the different geographical prevalence values, whereas in the GLM, these RMSE values tended to increase with the geographical prevalence.Furthermore, the SEM showed more stable coefficient estimates with different geographic prevalences compared to the GLM: while the coefficients estimated by the SEM are stable and always significant, coefficients estimated by the GLM varied greatly across the cross-validation iterations and geographical prevalences (Figure S1.4).The variation in the estimated coefficients affected the spatial predictions: the inclusion of a mediating effect may lead to more stable spatial predictions of the SEM across the three dispersal scenarios compared to the spatial predictions of the GLM (Figure 4).As a consequence, the spatial variability of the RMSE computed between the observed (i.e., simulated) herb suitability and the median of predicted cross-validated iterations for each geographical prevalence and models showed a similar spatial pattern, but the magnitude of the RMSE tended to increase across the different geographical prevalences more for the GLM than for the SEM (Table S1.5).

| DISCUSS ION
We introduced Rosen's modelling relation and proposed its application for SDMs by means of causal diagrams or DAGs borrowed from the SEM approach.Based on the results of our virtual species exercise, the modelling relation and SEM approach are valuable tools to incorporate biological knowledge into correlative SDMs and to account for the hierarchical structure of the links between variables, by encoding the assumptions related to the distribution of a species (natural system) into the formal system of Rosen's modelling relation.effective way to assess how appropriate different models are.Prior studies demonstrated that these metrics are influenced by a variety of factors, such as sample prevalence (Guisan et al., 2017;Leroy et al., 2018;Marchetto et al., 2023), sample location bias (Dubos et al., 2022;Fourcade et al., 2018;Jiménez-Valverde, 2021;Rocchini et al., 2023) and the size of the study region (Lobo et al., 2010).
Predictive models and causal inference are two different tools, the former attempting to find the best model predicting the response variable and the latter attempting to disentangle the effects of the predictors on the response variable (Arif & MacNeil, 2022;Pichler and Harting, 2023).Therefore, our SEM application for SDMs might be used to assess causal relationships between variables affecting the geographical distributions of species (i.e., attribution) but may not always be the most appropriate tool for generating accurate predictions on the actual species distribution.In other words, model prediction and model attribution are two different applications that may prove complementary but one cannot replace the other.
In our view, one of the most interesting aspects of SEM application to SDMs is the capacity to unravel unanticipated mechanisms through conditional independence testing, for example, that there are direct effects between species that were not considered before, or revealing the effect of a latent variable not yet measured or discovered (Arif & MacNeil, 2022;Lefcheck, 2016;Lefcheck et al., 2020).
While the natural-to-formal systems relationships presented in Rosen's modelling relation are made explicit in the SEM rationale (causal diagrams), the modelling relation can be applied in any correlative method to introduce causality into ecological modelling.Rosen's modelling relation can help modellers in their conceptual definition of a causal model, which can then be put into practice using different modelling approaches (correlative and process-based).However, other methodological approaches aiming to include biological realism or accounting for causality in correlative models exist, even though their application in ecology is extremely limited.For instance, the parametric g-formula proposed by Robins and Hernan (2008) employs a causal diagram to account for time-varying factors and time-varying confounder effects.Specifically, the g-formula allows for estimating the causal effects of sustained treatment strategies from observational data with time-varying treatments and has been applied prevalently in epidemiological studies (Keil et al., 2014;Meisner et al., 2022;Naimi et al., 2017).Bayesian SDMs are another way of introducing hypothesized causality by adding ecological or physiological knowledge in the model using informative priors, representing a prior belief regarding the probability distribution of an unknown parameter.For instance, Feng et al. (2019) gathered thermal limits and survival information for the zebra mussel Dreissena polymorpha from the literature and used these to calibrate correlative Bayesian models.
Unlike correlative models, process-based models are usually independent of geographical observations of the taxa under investigation.These typically express biological (or other) processes by a mathematical equation (e.g., ordinal differential equation or matrix population models) relating an indicator of the process (e.g., a lifehistory trait such as the number of offspring) to different factors affecting its performance (e.g., environmental conditions) (Da Re et al., 2022;Kearney et al., 2010).For instance, Larter et al. (2017) showed how a single plant functional trait (xylem resistance to cavitation) displayed a strong statistical relationship with its species distribution in relation to aridity across the climatic range of the species.Process-based SDMs have also been successfully used in invasion ecology to simulate and forecast invasion risk under different global change scenarios (Carboni et al., 2018;Strubbe et al., 2023).
Within the family of process-based models, agent-based models (ABMs) aim to predict species population or community dynamics by modelling multiple individuals (agents) that interact with their environment and among each other.For each agent, ABMs require the specification of state variables, which can include age, size and spatial location, as well as physiological and behavioural traits (Zhang & DeAngelis, 2020).
Rosen's modelling relation coupled with the SEM approach, as advocated here, is one of the methods allowing to design and refine ecological hypotheses, thus treating modelling as an experimental exercise.Within the field of SDMs, the modelling relation can represent a wider conceptual tool to model species distribution based on causal and ecologically based assumptions, potentially increasing the ecological realism of SDMs.Inferring the spatial distribution of a species of high interest (e.g., a vector-borne species, a species of conservation concern, an invasive alien species) using a correlative approach and bioclimatic variables only, not accounting for uncertainty in the data and without a solid causal approach, may ultimately lead to ecological inconsistencies and subsequently to inaccurate estimates, with strong ecological and even socio-economic repercussions (Escobar & Craft, 2016;Hellegers et al., 2020).Furthermore, such inconsistencies in the outcomes generated by ecological models may undermine trust in ecological research (Currie, 2019;Lee-Yaw et al., 2021;O'Grady, 2020).Certainly, when knowledge of the target organism is scarce, a correlative approach may be the only option available, but a causal-oriented definition of the modelling exercise is crucial to enhance the ecological realism of the models (Getz et al., 2018) and to ensure the models' transferability to novel conditions.
Ecologists aspire to foster knowledge of global environmental changes induced by human activities, such as climate change, biological invasions and habitat loss.To efficiently tackle such challenges, clear, robust and well-defined epistemological premises about the main determinants of species distribution and species distribution changes are needed to design realistic experiments (Currie, 2019;Pigliucci, 2002).Epistemological premises are not just philosophical murmuring but allow us to set the boundaries of the modelling exercise, increasing model robustness in depicting natural patterns and resulting in clear practical applications (Currie, 2019;Dawson et al., 2023).Rosen's modelling relation and its implementation by means of the SEM approach requires to clearly define the natural system (the key response variable of interest), such as the niche, habitat or biome (see Box 1), which inherently defines different biological BOX 1 Natural systems glossary.
Biotic abiotic movement (BAM): Heuristic framework which defines the species population distribution as those areas where abiotic, biotic and accessible areas intersect.
Biome: A large cluster of plant species that are defined in terms of the recognizable physiognomy of the dominant species (e.g., tundra), sensu Pennington et al., 2004).
Ecophysiology: A branch of biology studying how the environment surrounding an organism (both abiotic and biotic components) interacts with its physiology.Habitat: The actual spatio-temporal configuration of environmental conditions where an organism either actually or potentially lives (sensu Kearney, 2006).
Hutchinsonian niche concept: n-dimensional space (hypervolume), where each dimension is an abiotic or biotic condition and the relations among them allow the species to exist in a self-maintained population without immigration.
Mechanistic niche: Those sets of environmental conditions that allow an organism to complete its life cycle and successfully reproduce (sensu Kearney, 2006).
Realized niche: A smaller fraction of the fundamental niche constrained by biotic interactions.
us to identify the geographical area where those environmental conditions are met.The design and interpretation of correlative SDMs are usually framed within the niche concept provided by Soberón F I G U R E 1 (a) Robert Rosen's modelling relation.(b) Example of application of the modelling relation to model the distribution of a species, depicted in green within the biotic abiotic movement diagram (BAM; natural system), by means of a structural equation model (SEM; formal system).
annual temperature conditions (thermal range: 11-20 °C) and a drier range of mean annual precipitations (precipitation range: 255-739 mm; Figure S1.1a,b), resulting in a wider potential geographical distribution compared to the three species if considering abiotic component only.The true species habitat suitability (p) as a valid hypothetical causal structure representing the variance-covariance matrix observed in the training data set (p < 0.05), suggesting the inclusion of direct effects for both BIO1 and BIO12 on the germination rate of the herb species (Equation 4).Once these two additional direct effects were integrated in Equation 4, Fischer's C test supported the updated causal diagram (p > 0.05).
Our findings suggest that building a model relying on a strong conceptual basis improves the stability of the estimated model's coefficients, without necessarily increasing the predictive accuracy metrics of the model.We speculate that the hierarchical structure of the causal diagram helped to reveal the relationships between the virtual herb species and its determinant, independently of the sampling (cross-validation iteration) and the geographic prevalence of the species.Despite the generally favourable results in terms of predictive performance for both modelling approaches, we argue that comparing predictive accuracy metrics may not be the most F I G U R E 3 Hypothesized causal diagram explaining the distribution of the virtual herb species.Purple boxes indicate abiotic variables, orange boxes indicate biotic variables and green box displays the main response variable.FI G U R E 4The observed (A) and predicted (B) habitat suitability values for the virtual herb species in a subset of the study area under different combinations of geographic prevalences and models (SEM and GLM).The geographic projection used is the WGS84-World Geodetic System 1984, EPSG: 4326.

Fitness
: individual reproductive success.Functional trait: Those characteristics influencing the performance or fitness of an individual (sensu Nock et al., 2016).Fundamental niche: The region of the n-dimensional space (Hutchinsonian hypervolume) where the biotic interactions are excluded, and thus, only the abiotic conditions affect the fitness.