Causal inference and large- scale expert validation shed light on the drivers of SDM accuracy and variance

Aim: To develop a causal understanding of the drivers of Species distribution model


| INTRODUC TI ON
Species distribution models (SDMs; also known as habitat suitability models) estimate species' environmental preferences. Put very simply, they do so by comparing the environment at locations where a species was observed with the environment at locations where it was not. Once this comparison has been made, the SDM can be used to predict habitat suitability at any geographic location and point in time for which the relevant environmental data are available.
SDMs make several assumptions. One example is that species are at equilibrium with their environment (Zurell et al., 2020); otherwise, there will be habitats that are suitable for a species but in which it has not been recorded (because it does not occupy them).
In this case, the model might erroneously predict that those habitats are unsuitable. This is just one example of how violations of a model's assumptions can affect its predictive performance.
The predictive performance of a SDM may be decomposed broadly into its accuracy and precision (Bazzichetto et al., 2022).
Accuracy is a measure of how close the model's predictions are to the "truth" on average. The most commonly used measure of a SDM's accuracy is its discrimination ability, that is, its ability to predict higher habitat suitability at locations where the species was observed than locations where it was not (Jiménez-valverde et al., 2013). Precision, on the other hand, is a measure of the variability among predictions from replicate model fits, which might include variability among SDM algorithms where multi-model ensembles are constructed . Models with high accuracy and precision will consistently make predictions that are close to the truth; clearly, it is desirable to know the situations in which this might be expected.
Assessing the accuracy of a SDM generally involves comparing its predictions to data. These data might be the same data that were used for model fitting, data withheld when fitting the model or completely independent data (e.g. from a separate survey). Alternatively, in simulation studies, where virtual species are used, SDM predictions can be compared to those species' true distributions directly. Regardless, predictive accuracy is typically evaluated using skill statistics, such as the area under the receiver operator curve (AUC), the true skill statistic (TSS) and Cohen's Kappa (Allouche et al., 2006;Leroy et al., 2018).
Although widely-used, AUC, Kappa and TSS have been criticized on several grounds. They measure rates of false presences and false absences so are challenging to interpret where presence-only datasets are used for evaluation. Of course, this limitation does not apply where presence-absence data are available (e.g. Phillips et al., 2009;Valavi et al., 2022). AUC, Kappa and TSS also depend on the ratio of presences to (pseudo-)absences, that is, sample prevalence, in the dataset (Jiménezvalverde et al., 2013;Leroy et al., 2018;Lobo et al., 2008). Alternative metrics have been developed to circumvent this latter issue (e.g. Kaymak et al., 2012), but we find that they are rarely used for SDM evaluation.
Whilst most studies evaluate SDM accuracy using skill statistics, an alternative is to solicit expert opinion. For example, Smart et al. (2019) sought expert opinion on the realism of species response curves estimated by small-scale niche models for vascular plants and bryophytes in the United Kingdom (UK). Similarly, Beck et al. (2014) sought expert opinion on the spatial predictions produced by various SDMs for a European butterfly. These latter authors found that model accuracy increased when the occurrence data were thinned to reduce spatial clustering. However, this finding was evident only to the expert: it was not reflected by an increase in AUC. This clearly demonstrates that expert validation can, at the very least, provide a different perspective to skill statistics on what determines SDM accuracy. Whether using expert opinion or skill statistics, appropriately quantifying SDM performance is only the first step towards understanding its determinants. The researcher must then quantify the relationships between the performance measures and predictors thereof. This is often achieved using some form of regression analysis-e.g. multiple regression, partial regression, ANOVA or t-tests ( Barbet-Massin et al., 2012;De Marco & Nóbrega, 2018;Feeley & Silman, 2011;Steen et al., 2019;Tessarolo et al., 2021;Watling et al., 2015;Wisz et al., 2008)-or even simpler measures of correlation (Hernandez et al., 2006).
Whilst clearly useful, regression does not necessarily tell the full story when it comes to ascertaining the effects of independent variables on a response variable. It is well known that regression coefficients vary as independent variables are added to and removed from the model (Angrist & Pischke, 2009). Indeed, using regression for causal inference requires assumptions about all confounders having been measured and included in the model (Gelman & Hill, 2006;McElreath, 2020). Another limitation is that, as it is typically usedthat is with one response variable-regression cannot deal with indirect effects, which occur where one variable mediates the effect of a second variable on the response (Baron & Kenny, 1986).
In other disciplines, and to a lesser extent in ecology (but see Grace, 2006), the limitations of regression mentioned above have been long recognized and overcome using graph theory and causal analysis. Directed Acyclic Graphs (DAGs; Greenland et al., 1999;Pearl et al., 2016) are constructed to codify researchers' theories about how the explanatory variables affect the response variable(s).
DAGs might reveal confounders that must be included in a regression analysis in order to produce unbiased coefficients. They might also reveal mediation pathways, or multiple response variables; in this case, path analysis or more complex structural equation models, can be used to estimate the effects of interest (Grace, 2006).
Here, we used graph theory, causal analysis and expert validation to understand the drivers of SDM predictive performance. We fitted SDMs for 1216 species of invertebrate and bryophyte in the United Kingdom (UK) using a fairly typical presence/pseudo-absence modelling workflow. We evaluated the performances of a subset (518; 43%) of these models, both in terms of variance among replicate model fits and accuracy as assessed by taxon experts. (For the speciose bryophytes, a random subset of 100 species were assessed, leaving 698 species unassessed.) We used DAGs to conceive plausible models describing the effects of explanatory variables on SDM performance and used multilevel path analysis to quantify those effects.

| Species occurrence data
We fitted SDMs using presence-only species occurrence records. The data were supplied by national recording schemes in the UK, who collate records made by volunteer expert naturalists for their taxon group of interest. For most taxa, we used the same data as Outhwaite et al. (2019) but applied additional filters. We only used gridded records collected at 1 km 2 or finer between 2000 and 2015 to match the SDM covariate data (Appendix S1) and removed records that were duplicated in terms of grid cell and species (standard practice for species distribution modelling).

| Species distribution models
In this section, we briefly outline the SDM workflow ( Figure 1), but refer the reader to the ODMAP (Overview, Data, Model, Assessment and Prediction; Zurell et al., 2020) document in Appendix S1 for full details. We used three SDM algorithms to estimate species' habitat suitability: Maxent, regularized logistic GLMs and random forests. We used the species occurrence data outlined above, and pseudo-absences generated according to the "non-overlapping target group" approach (Cerasoli et al., 2017;Phillips et al., 2009), as response variables. Twenty-five topographic, land cover and climate variables were used as covariates.
We split the data randomly into five equally-sized subsets (crossvalidation folds) then fitted each algorithm five times, leaving out one subset each time. Hence, we fitted 15 models for each species, which enabled us to assess the variability among replicate fits. The models were fitted at a spatial resolution of 1 km 2 on the British Ordnance Survey grid (EPSG:27700). Ensemble predictions were generated for each species by taking a weighted average (based on AUC) of the 15 replicate model fits.
F I G U R E 1 Species distribution modelling and assessment workflow. See the supplementary ODMAP document for full details (Appendix S1). Note that the SDM predictions were presented to the experts on a continuous scale (relative habitat suitability) rather than a binary one. Also note that the data subsetting in the "Model fitting" box was performed after the data were degraded to unique specieslocation combinations.

| Expert assessments of SDMs and data
Taxon experts (Table 1) assessed the available records and ensemble SDM predictions, in geographic space, for all species in their group of interest (or a random subset of 100 species in the case of the more speciose bryophytes; Table 1). Among other questions, they were asked (1) whether the available records for each species cover its environmental niche; (2) whether the available records for each species cover its geographic range; and (3) whether the map of predicted habitat suitability for each species (i.e. the ensemble SDM) reflects its true environmental niche in geographic space.
The experts provided their answers to these questions on Likert scales ranging from 1 (excellent coverage/excellent habitat suitability predictions) to 5 (extremely poor coverage/extremely poor habitat suitability predictions). We used five-point Likert scales to avoid creating a false sense of certainty about our results, having felt that it was unrealistic to expect that the experts could provide answers on a continuous scale (e.g. could one model be classed as 62% accurate and another 72%?).
Each expert was provided with a tailored R Shiny app, which included the predicted maps of habitat suitability (as a continuous measure), a map of the records used to fit the SDMs, maps of the environmental layers used to fit the models and various questions including those listed above. Example code, containing all of these questions, can be found in Pescott (2022).

| Measures of SDM performance
We considered two distinct aspects of model performance: accuracy and the variability among replicate models fits. Accuracy was assessed by the experts (see question 3 above). It can be considered a measure of discrimination ability because the experts based their judgements on whether habitat suitability was predicted to be higher at more suitable locations and vice versa. The variability among replicate model fits was calculated as the sum of the variance of habitat suitability across grid cells (hereafter "variance"). This measure includes the variability among algorithms and models fitted to different cross-validation folds, which are two major sources of variability in SDM predictions.
SDM algorithm contributes more than cross-validation fold to our measure of variance. To test this, we calculated mean habitat suitability across grid cells for each replicate fit. The median (across species) proportion of the total variance in mean habitat suitability explained by SDM algorithm is about 77%, compared to about 23% for cross-validation fold.

| Explanatory variables
We assumed that SDM accuracy and variance are functions of eight variables. Definitions of the variables, and information on if and how they were measured, are provided in Table 2.
TA B L E 1 A taxonomic breakdown of the number of species modelled, the number of models assessed, the number of models "fully" assessed, the assessor initials (see author list) and their affiliations.

| Conceptual models
We used DAGs to conceive plausible conceptual models depicting the effects of the explanatory variables on SDM accuracy and variance.
DAGs are non-parametric and distinct from the statistical models used to analyse them (see "Statistical analysis of conceptual models" below).
Our general strategy was to start with a theoretically plausible DAG, test whether it was empirically plausible, then refine it accordingly (similar to steps 1-3 in Grace & Irvine, 2020). The primary goal of model testing was to ascertain whether a DAG's (conditional) independencies were consistent with our data. If these were consistent, we then assessed the support for the DAG's implied mediation pathways using the "joint significance" method (MacKinnon et al., 2002). At no point did we posit a theoretically implausible DAG just to satisfy these criteria.
Having started with one DAG, which was not empirically plausible, we tested a further nine (Appendix S2). The final DAG, which we present and analyse in this paper, is both theoretically plausible and empirically supported by our model/data combination. Full details of the model conceptualisation and testing process can be found in the R Markdown document in Appendix S2. Theoretical justifications for the effects posited by the causal model are laid out in Table 3.

| Statistical analysis of conceptual models
We used piecewise path analysis to estimate the effects of the explanatory variables described above on SDM accuracy and variance using the R package piecewiseSEM (Lefcheck, 2016). Path analysis is the process of estimating path coefficients for each arrow, or "edge", in a DAG (Grace, 2006). They are equivalent to the coefficients estimated by regressing the variable on the receiving end of an edge on the variable from which the edge originates; that is to say, by regressing the "child" on its "parent" in DAG parlance. Where one variable affects another via more than one pathway (i.e. where a child has more than one parent), the path coefficient for one parent is equal to the partial regression coefficient obtained by regressing the child on that parent whilst conditioning on all other parents (i.e. multiple regression). In our analysis, for ease of interpretation, we standardized the path coefficients using the z transformation.
Path coefficients indicate the direct effect of each parent on its child, but these can be used to calculate indirect effects (Sobel, 1982).
One variable has an indirect effect on another where there is an intermediate variable (mediator). Indirect effects may be subdivided into specific and total indirect effects. A specific indirect effect is the product of all path coefficients in one pathway: for example, prevalence → n → accuracy in Figure 2. The total indirect effect of one variable on another is the sum of the specific indirect effects over all pathways linking them (Preacher & Hayes, 2008;Tarling, 2009).
A reviewer rightly pointed out that the effects listed in Table 3 might be nonlinear. Indeed, it was apparent on visual inspection of our data that sample size is nonlinearly related to SDM accuracy and variance. This is not surprising: others have noted an asymptotic relationship between sample size and SDM accuracy (Hallman & Robinson, 2020), and it is well known that the variance of an estimator (here the SDM ensemble) is an asymptotic function of sample size.
To enable the use of path analysis, which is based on linear regression, we (natural) log transformed sample size before fitting the path models. It was also necessary to log transform prevalence to obtain a linear relationship between it and the log of sample size. Accuracy and variance are plotted against the log of sample size in Figure 3.
To assess the uncertainty associated with the effects estimated by the path model, we used nonparametric bootstrapping. We resampled the data by species (both response and explanatory variables) with replacement to create 1000 bootstrap samples, fitted models to each sample and report the 95% (percentile) confidence intervals for each effect across samples.
One might expect the expert-assessed variables in our analysis to differ systematically among taxon groups and assessors (recalling TA B L E 2 Explanatory variables that we assume affect SDM accuracy and variance.

Variable Definition Derivation
Equilibrium Degree to which species are at equilibrium with their environment. Unobserved.

Niche breadth
Breadth of species' environmental niches (ranging from habitat specialist to habitat generalist).
Unobserved, but see Appendix S4 where we use a proxy measure.
Niche completeness Degree to which the occurrence data cover a species' environmental niche.
Assessed by the experts on a five-point Likert scale.

Prevalence
Species' true range sizes. Calculated as sample size divided by range completeness (1 being poor coverage and 5 being full coverage). Higher where sample size is high despite poor coverage of the species' range, and vice versa.
Range completeness Degree to which the occurrence data cover a species' geographic range.
Assessed by the experts on a five-point Likert scale.

Recorder behaviour
Recorders' decisions about where to sample geographically (and hence environmentally). Unobserved.

Sample size
The number of 1 km grid cells (EPSG:27700) in which the species has been recorded (between 2000 and 2015). Empirical.
that one assessor evaluated the models for each taxon group). For example, the experts might simply differ in what they perceive to be an accurate model or what constitutes "very good" coverage of a species' range. Or perhaps expert-assessed accuracy will vary between taxon groups if, say, the environmental covariates are more appropriate for some groups than others.
To assess the extent of any systematic differences between taxon groups in terms of expert-scored accuracy, we calculated their intraclass correlation coefficients. The respective values of 0.08, 0.25 and 0.23 (Appendix S2, p. 41) indicate that the data are not independent within assessors. Hence, we include a random intercept for assessor identity in the portions of our path model in which an expert-assessed variable is the response.

| Sensitivity analysis
Piecewise path models are based on linear regression and so are bound by the same assumptions. These include the assumptions that the response variables are numeric and normally distributed, which TA B L E 3 Theoretical justifications for the effects posited by our causal model. Justifications are stated ceteris paribus.

Explanatory variable Response variable Theoretical justification
Equilibrium Niche completeness A species cannot be recorded in portions of its niche that it does not occupy.

Equilibrium
Prevalence For a given niche breadth, a species closer to equilibrium with its environment will be more widespread.

Niche breadth Niche completeness
It is harder to sample a given portion of a generalists' than a specialists' niche.

Niche breadth Prevalence
A generalist has the capacity to be more widespread than a specialist.

Range completeness
It is harder to sample a given portion of a widespread species' range than a common one.

Prevalence
Sample size It is possible to record a widespread species in more locations than a rare one.

Recorder behaviour
Niche completeness For a given niche breadth and equilibrium, the geographic locations sampled determine the fraction of a species' niche that is sampled.

Recorder behaviour
Range completeness For a given prevalence, the geographic locations sampled determine the fraction of a species' range that is sampled.
Niche breadth Accuracy Pseudo-absences are likely to be placed in habitats that are suitable for generalists. Hence, their discrimination ability should be poorer than models for specialists.

Niche breadth Variance
There is less scope for variation in the types of habitats that specialists are recorded in. This should reduce sampling variability.
Niche completeness Accuracy SDMs estimate species' environmental niches so adequate coverage of those niches is important.

Niche completeness
Variance Maxent and random forests are more likely to find spurious signals where niche completeness is low (i.e. overfitting), compared to the regularized GLMs. This means that they will make different predictions in non-sampled portions of environmental space (Werkowska et al., 2017), increasing variability among algorithms.
Sample size Accuracy Larger samples will yield a more accurate point estimate in the absence of systematic error.

Sample size
Variance Small samples are more likely to be unusual (different from the population) by chance, which increases sampling variability (Lohr, 2022).

F I G U R E 2
Directed acyclic graph (Greenland et al., 1999) Table 2).
our data violate. Nevertheless, we proceeded with piecewise path models because it has been often demonstrated that linear regression is robust to such violations (e.g. Norman, 2010).
The robustness of linear regression notwithstanding, we assessed the sensitivities of our results to the choice of analytical method. By analytical method, we mean statistical model, which is different to the non-parametric causal models described above.
We analysed both causal models using several analytical methods, which varied in terms of how they treat the response variable expert score (ordinal or numeric), how they accounted for assessor identity (either by complete pooling, random intercepts or fixed effects) and how model fitting was achieved (e.g. covariance-or piecewise least-squares-based). Four of the five additional analytical methods gave roughly identical results (Appendix S3), so we only present the results from the multilevel piecewise path models here.

| RE SULTS
Our

| DISCUSS ION
In this paper, we used expert validation, graph theory and causal analysis to shed light on the drivers of SDM performance. We considered two components of model performance: accuracy, as assessed by the experts, and the variance among replicate model fits. We constructed DAGs depicting the effects of various explanatory variables on SDM performance, then analysed those DAGs using piecewise path models.
We suggest that the experts' knowledge is likely to be more informative than any one dataset that could have been used for model validation. Each expert is a national curator of the data for their taxon group. Cumulatively, they have considerable local, national and international field knowledge gained by writing distribution atlases, field guides, species status reviews and autecological papers.
Some also undertake their own modelling using similar tools to those in this study. Hence, their assessments arguably reflect an unrivalled synthesis of information.
Although experts, there are limits to, and possibly biases in, the assessors' knowledge. In total, 554 species were assessed, but the experts were only able to provide answers to all of the relevant questions for 518. It is possible that these missing species (6.5%) share common attributes that have biased our analysis (i.e. they are missing not at random; Rubin, 1976). While we have confidence in our experts' experience and knowledge for our area, it is of course possible that this does not generalize across all disciplines or datasets. In some cases, the available expert knowledge might still be biased towards particular places, such as areas of high population density or higher species richness, or particular latitudes. We do not claim that expert validation is a panacea for assessing SDMs: as always, the appropriateness of the tools and information available for making inferences need to be assessed for the case in hand.
Our model explains a relatively low proportion of the variation in SDM accuracy (20%; but see Møller & Jennions, 2002). To some extent, this is likely to reflect noise in the experts' assessments (of both accuracy and niche completeness). It might also reflect the fact that we missed important explanatory variables. One example is the appropriateness of the SDM covariates for a given species, which could affect model accuracy (Barry & Elith, 2006). We asked the assessors to report on this, but they felt unable to do so for the majority of species (56%). Another example is niche breadth, which we treated as unobserved. Omission of these variables could have biased the effects estimated by our model (Angrist & Pischke, 2009).
We suspect that the omission of niche breadth will not have appreciably biased our estimated effects. As a proxy for niche breadth, we calculated the number of land cover classes, using the UKCEH 2007 Land Cover Map (Morton et al., 2011), on which each species was recorded. This is not a perfect proxy for niche breadth, but we suspect that it will at least be a correlate thereof at the scale of our models (1 km 2 ). Including this proxy measure of niche breadth in the models barely changed our estimated effects (Appendix S4 Assessing niche completeness is more difficult than calculating sample size, but there are several ways to achieve this. One option is to consult appropriate experts as we did here. Another is to use range completeness as a proxy for niche completeness on the assumption that these are highly correlated; the analyst could then compare the distribution of records to published range maps, for example. Tools to assess the environmental representativeness of species occurrence data also exist (e.g. Boyd et al., 2021). Where additional data thought to cover a species' niche are available-e.g.
coarse-scale data from an atlas, or a digitized range map-these tools could be used to calculate niche coverage relative to the more complete data.
Another key insight from our analysis concerns the relationship between species' prevalence and SDM accuracy. Previous studies reported negative correlations between prevalence and accuracy conditional on sample size (Hernandez et al., 2006;Stockwell & Peterson, 2002;van Proosdij et al., 2016). It is important to remember, however, that larger sample sizes-when defined as the number of grid cells in which a species has been recorded-are possible for widespread species, which occupy more grid cells and are readily recognized by recorders with limited expertise. This is likely to be one reason why we found a positive indirect effect of species prevalence on accuracy (Figure 3).
F I G U R E 3 SDM accuracy and variance plotted against the log of sample size. Each point denotes one species. All variables are scaled and centred-after the log transformation in the case of sample size.

F I G U R E 4
Effects of the explanatory variables on SDM expert-assessed accuracy and model-based variance. Effects are the z transformed path coefficients from Figure 2. The dots indicate the mean effect obtained by bootstrapping and error bars demarcate 95% (percentile) confidence intervals. The colour of the points and the error bars signify the type of effect: direct or (total) indirect.
Our model also suggests that prevalence has a negative indirect effect on SDM variance. Like accuracy, this effect is mediated by sample size. Hence, where sample size is correlated with prevalence, as in our dataset, more accurate and precise SDMs should be attainable for widespread species than rare ones. This notion challenges the conventional wisdom that rare species lend themselves better to modelling.
It should be noted, however, that our estimates of the indirect effects of prevalence on SDM accuracy and variance could be biased. Part of the total indirect effects of prevalence on accuracy and variance is the direct effect of prevalence on range completeness ( Figure 2). Range completeness is also affected by recorder behaviour, which is unobserved. The omission of range completeness could bias the effect of prevalence on sample size, which could, in turn, bias the total indirect effects of prevalence on SDM accuracy and variance. Whether and to what extent this bias exists is unclear.
An important implication of our results is that the common practice of "stacking" individual species' SDMs to estimate species richness or similar is a risky business. Model performance is not random; rather, as we have shown, it varies with species traits and data characteristics. Hence, there is no reason to suppose that the errors will average out over many species.
We do not claim that our causal model is true. However, in depicting it as a DAG we have laid bare our assumptions about what determines SDM performance in a falsifiable manner. We believe that this is an improvement on much of the (vast) literature proffering advice on fitting SDMs, and that it clarifies the causal basis of much of this advice in a way that can be built upon clearly.

ACK N O WLE D G E M ENTS
We thank Diana Bowler and two anonymous reviewers whose comments improved this manuscript. We also thank the committee program delivering National Capability.

CO N FLI C T O F I NTE R E S T S TATE M E NT
The authors have no conflicts of interest to disclose.

PEER R E V I E W
The peer review history for this article is available at https:// www.webof scien ce.com/api/gatew ay/wos/peer-revie w/10.1111/ ddi.13698.

DATA AVA I L A B I L I T Y S TAT E M E N T
The data and code underpinning this manuscript are available in the Supporting Information.