Species associations in joint species distribution models: from missing variables to conditional predictions

The abundance and distribution of multiple species are interconnected through various mechanisms (e.g. biotic interactions or common responses to the environment) shaping communities. Joint species distribution models (jSDM) have been introduced as a potential tool to integrate these mechanisms when modelling multiple species distributions, by inferring a residual matrix of species associations that could inform on biotic interactions. However, the direct link between these residual associations and biotic interactions has been challenged. Here, we test how the data type, resolution and sampling size affect the species associations identified by jSDMs and their benefits for predicting species given the known state of others (i.e. conditional prediction).


| INTRODUC TI ON
In recent decades, many species have seen their abundance and distribution strongly impacted by both recent climate and land use changes (Brondizio et al., 2019;Díaz et al., 2019;Pereira et al., 2012).While the individual responses of species to these changes can greatly vary (O'Brien & Leichenko, 2003), they may also be influenced or constrained by their biotic environment through species interactions (Tylianakis et al., 2008).The most widely used statistical tools to study species distributions are Species Distribution Models (SDMs) (Guisan & Thuiller, 2005).
Historically, SDMs model each species independently and one can then assemble the results of different models later on to look at aggregated measures at the community level (Calabrese et al., 2014).One of the major challenges of this 'model first -assemble later' approach (Ferrier & Guisan, 2006) is to consider the biotic and abiotic dependencies among species of the same community in the modelling process (Wisz et al., 2013).
Joint species distribution models (jSDM) were introduced in the aim of addressing this pitfall, by modelling all species jointly (Pollock et al., 2014;Warton et al., 2015).JSDMs infer a residual covariance matrix that describes species co-occurrence patterns after accounting for the environment.One of the expected advantages of this approach is the possibility of separating the response of species to abiotic factors (e.g.climate, habitat) with those of biotic interactions that could hypothetically be represented by the residual co-associations (Ovaskainen et al., 2016).In other words, the idea of these biotic interactions, (i.e.discrete real-world events), could be retrieved from positive/negative associations, (i.e.statistical events/co-occurrences that may or may not be the result of repeat interactions).Yet, this interpretation of the residual covariances has been severely challenged.First, the residual covariances can also be induced by environmental variables, like local habitat configuration, that are not included in the model (Dormann et al., 2018).In addition, Poggiato et al. (2021) showed that jSDMs, by construction, only estimates species' realized niches, and so not much of the signal of biotic interactions can remain in the residuals.Finally, several empirical studies showed that jSDM using presence-absence data could hardly retrieve biotic interactions (König et al., 2021;Zurell et al., 2018).This echoes multiple studies that argued that biotic interactions cannot be inferred from presence-absence data, whatever the method (Blanchet et al., 2020;Connor & Simberloff, 1979).
On the contrary, abundance data could be more informative, as they could better capture additional information related to demographic processes (Anderson, 2017) and provides more details due to their higher variability, and thus represent better theoretical indicators of biotic interactions (Blanchet et al., 2020;Losapio et al., 2021).So far, few empirical studies have explored whether jSDM with abundance data could prove more informative on biotic interactions than presence-absence jSDM (continuous abundance of marine communities: Roberts et al., 2022, parasite communities: Brian & Aldridge, 2021).
This could be partly justified by the fact that jSDM with abundance data are computationally much more intensive (but see advances from Chiquet et al., 2021) but also that large-scale co-abundance data are less common that presence-only data.
Having the right spatial resolution of co-abundance data is also crucial to infer biotic interactions from empirical data.Biotic interactions are usually taking place locally, which often mismatches the coarse resolution at which species are usually modelled across large spatial extent (>5 km resolution, Jetz et al., 2012;Beck et al., 2013).At such a coarse resolution, the effect of biotic interactions might be completely diluted in the observed patterns (Belmaker et al., 2015;Thuiller et al., 2015).Using simulations, Araújo and Rozenfeld (2014) provided theoretical evidence that signals of negative interactions, such as competition, can be diluted at coarser resolutions (as confirmed by Zurell et al., 2018 in the case of jSDM).These simulations emphasize the necessity of modelling negative associations at resolutions aligning with expected interactions for effective retrieval.However, empirical resolution values can significantly vary across environmental and community contexts (e.g.mobile vs. sessile organisms, differences in home ranges).Furthermore, Zurell et al. (2018) emphasized that even under ideal theoretical conditions, jSDM is restricted to uncovering simple symmetric interactions (e.g.competition, facilitation) due to its reliance on singular coefficient inferences for association pairs.In avian communities, examples of interactions potentially captured through spatial data across diverse environmental gradients should encompass competition resulting from niche partitioning (Maurer, 1984), including both interference competition (e.g.physical confrontations for nest sites, territorial disputes, direct competition for limited food resources, Miller, 1967) and exploitation competition (e.g.competition for shared food sources without direct aggression, competition for available space, Grover, 1997).Positive interactions can also emerge from missing environmental variables.Nonetheless, these residual associations contain valuable information to enhance predictive performance through currently underutilized conditional predictions.
a facilitation process (e.g.forming interspecific flocks for easier food acquisition or predator protection; Diamond, 1981).These interactions are expected to occur at a local scale (Dhondt 2012), tied to species territory sizes (which can vary but roughly correspond to about 1 hectare during the breeding season for passerine birds, the main species in this study; Cramp et al., 1994).In this regard, König et al. (2021) identified the highest number of negative associations detected by jSDM at 125 m (the smallest buffer tested, roughly corresponding to 5 hectares) within bird forest communities (compared to 250 m, 500 m and 1 km), using a high sampling size co-occurrence dataset (with a minimum of 1754 sampling units at 125 m covering only national-scale forest areas).
Moreover, Blanchet et al. (2020) demonstrated that in a simple theoretical case involving two species with occurrence probabilities of 0.4 and 0.6, the required sample size to detect their associations could range from 500 to several thousand sampling units, contingent on the strength of the interactions.Therefore, sample size is an important characteristic to consider when attempting to retrieve interactions with jSDM and can also be highly dependent on the empirical context considered.
Regardless of the ability of jSDMs to infer biotic interactions, the predictive performance of these models has also been questioned.It has been shown that marginal predictions of jSDM (prediction utilizing only environmental explanatory variables) do not differ from classical SDM predictions (Caradima et al., 2019;Poggiato et al., 2021).However, jSDM can leverage on the inferred residual covariance matrix to provide conditional predictions, that is, predictions of a (set of) species given the known occurrence state (e.g.observed presence-absence) of another (set of) species (Wilkinson et al., 2021).Despite the call of using these conditional predictions to exploit the covariance matrix of jSDMs rather than interpreting it (Poggiato et al., 2021), only few studies have since exploited it.Yet, this has only been done on small sample designs or reduced species pools, and thus lacks a comprehensive evaluation of conditional predictions by large-scale studies (9 fish species: Wagner et al., 2020; 118 sample units of a microbial community: Odriozola et al., 2021).
In this study, we propose to jointly analyse how these above-mentioned limitations affect the inferred residual covariance matrix of jSDM and to explore the predictive improvement through conditional predictions.We often rely on citizen science programs that, in this context, offer large multi-species monitoring datasets with fine resolution over a wide geographic range, we relied on a citizen science program, the French Breeding Bird Survey (FBBS), that offers a standardized multi-species protocol in abundance data (and converted to occurrence data), at two nested spatial resolutions, with a detailed description of the local habitat.Calibrating jSDMs on both multi-species presence and abundance data, we expected to detect a significant difference in the inferred residual association patterns (H1.a).Given the nested sample design of our data (10 points at 200 m in 2 km plots), we tested the impacts of spatial resolution on the inferred residual associations with the hypothesis that the proportion of negative species associations should increase at the finest resolution, at which intraguild competitive interactions take place (~200 m, Dhondt, 2012; H1.b).In addition, we expected that a small sample size would affect the ability to detect the full range of species associations (H1.c).Then, we tested if this high-quality dataset by incorporating co-abundance associations at fine resolution with high sample size allows us to detect biotic interactions that would be expected as a result of niche partitioning.More specifically, we expected to detect negative associations between species sharing similar diet, habitat and functional traits (traits chosen to synthesize expectation from this niche partitioning principle) and phylogenetic proximity (being generally more functionally related, H1.d).
Finally, to ensure that these species associations actually contain useful information (as they may result from overfitting or model misspecification/noise, Tikhonov, 2018), we tested whether conditional predictions improved predictive performance on new sites over marginal predictions alone (H2).

| Bird data
We used abundance data for the 40 most common breeding bird species (Table S1, mostly Passeriformes but also Columbiformes and Piciformes) representing ca.86% of the total abundance of the 240 species from the French Breeding Bird Survey (FBBS).Due to computational complexity limitations, we used this restricted species pool (>350 occurrences) with the most recent year available (2018), and thus focusing our study on associations between the most common species.This standardized participatory monitoring program consists of 2 × 2 km plots of 10 count points (at least 300 m apart) that are sampled by trained volunteer birders during a 5-min period at dawn around mid-May with two visits spaced 3 weeks apart (to ensure detection of early and late breeders).The maximum value of the two counts is kept for each species.We also retained only bird observations within a radius of 100 m around the count point, to limit habitat-based detection bias (as Gaüzère et al., 2015).A more detailed description of the protocol is available in Jiguet et al. (2012).
Final dataset consisted of 7040 points in 724 plots (*10 points except in rare cases due to missing volunteers' data or unreachable habitats), located at an elevation threshold below 800 m (to exclude particular bird communities) and covering all mainland France (Figure S1).
In order to globally interpret the obtained association matrices, we extracted seven traits from the global databases of bird traits of Devictor et al. (2010) and Storchová and Hořák (2018) selected to synthesize the dietary, functional or small-scale habitat niche: body mass, main diet component (vertebrates, invertebrates, plants), trophic index, main foraging method (pursuit, gleaning, grazing, digging, scavenging), nest type (ground, hole, open-arboreal, closed-arboreal, ground close), main foraging substrates (ground, vegetation, air) and habitat specialization (generalist, forest, farmland and urban).
To explore the impact of phylogenetic proximity of species on their associations, we extracted the pairwise distance between each species from Jetz et al. (2012) using the 'ape' R package version 5.7 (Paradis et al., 2022).

| Environmental data
We used the habitat classes detailed by volunteer birders within a 100-m radius of each FBBS point, based on a hierarchical description adapted from Crick (1992) and grouped into 18 classes (Table S2) set out by Julliard et al. (2006), each of which was included as dummy variable.This 18-class local habitat description covers all habitats used by our studied-species pool, which is composed of several specialists and generalists (Barnagaud et al., 2012).The spatial distribution of these 18 habitat classes closely matches their current national distribution (Julliard & Jiguet, 2002).Vegetation density and dynamics also influence bird communities' patterns and can be estimated through remote sensing (Hobi et al., 2017).Here, we used the yearly NDVI (Normalized Difference Vegetation Index) that seems most appropriate for our dataset (Bonthoux et al., 2018) by calculating the sum of monthly data at 250 m resolution around each point, from MODIS rasters using the 'MODISTools' package version 1.1.2.At a landscape scale, land cover composition and diversity are important factors for bird community patterns (Rodewald & Yahner, 2001).We extracted from the 2018 national land cover data (CESBIO, Inglada et al., 2017) the relative percentage in urban, agricultural, forest and semi-natural areas within a 2 km buffer around each FBBS point.Additionally, to better capture landscape diversity, we also calculated a Shannon diversity index from 23 initial CESBIO land cover categories (SHDI).Finally, there are many anthropogenic pressures on bird communities, such as light pollution and noise (Cabrera-Cruz et al., 2018;Senzaki et al., 2020).We therefore used a harmonized light pollution dataset as a proxy for the level of anthropogenic activities, available at 1 km resolution that we directly extracted at each FBBS point (Li et al., 2020).We summarized the information of these six landscape variables through a PCA dimension reduction approach to avoid multicollinearity (Cruz-Cárdenas et al., 2014).We kept the first four dimensions of this PCA explaining 92.5% of the variance.
Distribution patterns and trends in bird communities are known to be influenced by climate (Eglington & Pearce-Higgins, 2012;Gaüzère et al., 2015).We thus used two climatic variables known to affect the breeding bird communities: mean reproduction period temperature (Jiguet et al., 2010;Julliard et al., 2004) and reproduction period precipitation (Illán et al., 2014).For each count point, we calculated these two variables (over the period April to June, of the year 2018) at 10 km resolution from the European climate rasters (E-OBS) provided by ECA&D, using the R package 'climateExtract' version 1.25 (Schmucki, 2021).In summary, we used a total of eight explanatory covariates: two local habitat variables (including one with 18 categories), four landscape variables and two regional climate variables.

| Statistical methods
We analysed the bird data using the Hierarchical Modelling of Species Communities (HMSC) framework (Ovaskainen & Abrego, 2020;Ovaskainen et al., 2017).HMSC is a particularly suitable joint species distribution model that allows -through a latent variable approach -to infer multiple residual covariance matrices, one for each random effect provided by the user.As sampling units, we used the counting points (n = 7040).We included as fixed effects all our 8 selected environmental variables (plus a quadratic term for mean reproduction period temperature).We first fitted a lognormal Poisson model for the complete dataset with counts (i.e.abundance data and hereafter M full-abun ) of each of the 40 most common species as response variables.
As random effects, we included a point-level (200 m) nested in a plot-level (2 km) to consider the hierarchical sampling design.We checked the temporal stability of the detected associations in fitting the same abundance model but for the year 2017 instead of 2018 (M full-abun-2017 ; n = 7008).The explanatory variables were also available for this year except for landscape metrics where we used 2018 data.We explored how fixed effects modify the detected associations by fitting an abundance model without any fixed effects (intercept only; M full-abun-intercept ).
Beforehand, we tested the need to account for spatial autocorrelation by using the method adapted to large datasets presented by Tikhonov, Duan, et al. (2020).Autocorrelation disappears when including the climatic and 18-local habitat description as covariates.Therefore, we did not include spatial autocorrelation in our models.We also checked that our predictor variables do not exceed a correlation of 0.7 and a VIF (Variance Inflation Factor) of 5 to avoid multicollinearity problems (with the VIF function of the 'car' package version 3.1.1(Fox et al., 2019).We compared the different association matrices obtained from these models using the statistics of the Mantel test (Mantel, 1967).We used the followings models to disentangle the individual effects of the three main datasets limitations (nature data, resolution and sampling size) on associations matrix: 1.To explore the impact of a switch from occurrence to abundance data (H1.a),we compared M full-abun ) to a second model where we converted the abundance data into occurrence data (1 when abundance >0) (M full-occu , binomial distribution with a probit link function).2. To assess the impact of resolution on the pattern of association (H1.b), we compared the distinct association matrices inferred at two scales from the two M full-abun random effect point-level (200 m) and plot-level (2 km).We also fitted an additional model only with a point-level random effect (M full-abun-1rf ) to better understand how the residual association matrix is influenced by the inclusion of a higher hierarchical (plot-level) random effect.
3. We studied the effect of increasing the sample size on the inferred association matrix at the point-level (H1.c).For this purpose, different intermediate models (of abundance and occurrence) were fitted on datasets of increasing sample size (100, 250 and 43 5 plots), after performing a random sampling procedure at the plots level (meaning keeping the entire plot with 10 points and hereafter, M sub-type with sub:100, 250, 435 plots and type: abun versus occu ).
We then tested the ecological consistency of the inferred association matrix with the full abundance model (M full-abun ) at the point-level (200 m) by contrasting it with species traits (described in section 2.1) and phylogeny data (H1.d).Species that are functionally similar (meaning they share the same traits in terms of position and width along ecological niche gradients) have a higher probability of engaging in competition (Gause, 1934) and of being encountered together less frequently in the same community (i.e.negative associations).This trait-based approach should thus allow for the detection of competition interactions resulting from niche partitioning, but it becomes more challenging to interpret other types of interactions (e.g.facilitation, commensalism…).We used linear regression with associations between each species as a response variable as a function of single trait similarity variables (univariate Gower distance for each trait).To ensure that some traits do not partially synthesize the effect of others, we fitted a single model with all traits and then kept only those improving AIC (backward and forward procedure) and checked that the final model did not include highly correlated covariates (VIF <5, correlation <0.7).As this method does not account for dependencies between pairwise associations involving the same species, we also ensured that individual correlation between pairwise associations and each trait (and associated p-value) are consistent with a method that does, namely the Spearman Mantel test.To test phylogenetic signal, we used a Spearman Mantel test between species associations and their pairwise phylogenetic distance.Next, we used a ward.D2 clustering type method (Murtagh et Legendre 2014) to obtain groups of species sharing the same type of associations and then compared to their traits thanks to graphical figures to visualize patterns.
All analyses were conducted with the R package 'HMSC' version 3.0.11(Tikhonov, Opedal, et al., 2020).We fitted the HMSC models with three Markov chains (MCMC) of 9,000,000 iterations each (the first 3,000,000 were removed as burn-in).The posterior distributions were obtained from 3000 posterior samples (1000 of each chain on which a thin of 6000 was applied).We checked the MCMC chains convergence with the potential scale reduction factors of model parameters (Gelman & Rubin, 1992).The computation time for the complete models (M full-abun and M full-occu ) was around 70 days.The MCMC convergence was satisfactory: the potential scale reduction factors for the B-parameters (responses of the species to the fixed covariates; Ovaskainen et al., 2017) were on average 1.03 for M full-abun (Figure S2; see Table S3 for other models).
To explore the advantage of jSDM as a prediction tool (H2), we evaluated the performances of marginal and conditional predictions of the abundance models by randomly partitioning the plots into a training dataset (435 plots, 60%) and a test dataset (40%) because a proper cross-validation was not possible due to computational costs.
In the test dataset, we predicted the abundance of each species with environmental variables but also used the observed abundance of all other species (i.e.conditional predictions, see detailed explanation in supplementary information) or not (i.e.marginal prediction), using M 60-abun .We then compared the relative performances of the two types of predictions in terms of R 2 .Comparing these two types of prediction allowed us to assess if the residual association matrices contain relevant information that can improve predictions, and not only error and/or model misspecification (Ovaskainen & Abrego, 2020;Poggiato et al., 2021).We also calculated the correlation between predictions and improvement between marginal and predictive with species prevalence (i.e. total species occurrence).
To explore the effect of sample size on predictive performance (R 2 ), we also compared conditional and marginal predictions using the intermediate models (M 250-abun and M 100-abun ) as training sets.The computation time was 39 days for M 435-abun fitted with the training dataset (4,500,000 iterations used to obtain comparable convergence) and then 10,000,000 additional MCM iterations were needed to obtain conditional predictions to estimate the latent variables at the new counting points of the test data set (requiring 51 more days for M 60-abun ).All these calculations were performed on a high-performance computing (HPC) computer cluster.

| RE SULTS
A significant part of the variance explained by our full abundance model M full-abun was due to our fixed effects (mean: 48.6%).Habitat first explained the most important part of the variance (mean 23.1% for the 18-class description, mean 7.5% for NDVI) followed by landscape metrics (mean 13.2%) and then climate (mean 4.9%, see Figure S3).In terms of explanatory performance, M full-abun showed a moderately good fit to our data (considering its very fine resolution) with an average R 2 of 0.315.

| Detected species associations in residual matrix
For the abundance model M full-abun , out of 780 possible associations, we found 334 significant positive and 167 negative associations.Abundance model using the previous year's data M full-abun-2017 (2017) gave similar results than M full-abun (Mantel statistic r = 0.87, Figure S4).Shifting from occurrence to abundance data (M full-occu/ M full-abun ) did not lead to major differences in the residual association matrix patterns at the point-level (H1.a;Mantel statistic r = 0.92).
Shifting from plot-level (2 km) to point-level (200 m) led to the appearance of many negative associations at finer scales (H1.b; Figure 2; M full-abun ).The ratio of significant positive/negative associations decreases from 624/21 (plot-level) to 334/167 (point-level) making the pattern more balanced.We also observed a decrease from 83% to 65% of significant associations detected.
Increasing the sample size from 100 to 724 plots raises the number of significant associations detected from 13% to 65% at the pointlevel (H1.c; Figure 3).Thus, we found a strong difference between associations inferred in the M full-abun residual co-abundance matrix

| Traits correlation
When evaluating whether the detected association at the pointlevel (200 m) aligns with the expectations derived from the niche F I G U R E 1 Species residual association matrices from jSDMs of the 40 selected species, estimated from presence/absence (probit M full-occu ; lower triangular matrix) and abundance data (log-normal M full-abun ; upper triangular matrix).Associations with high statistical support (posterior probability of at least 95%) are shown in a red (for positive) and a blue gradient (for negative).These two association matrices were computed at the point-level (~200 m).Correlation defined as the Mantel test correlation between the two matrices compared.
partitioning principle based on traits and phylogeny (H1.d), we detected a significant relationship with main foraging substrates, nest type, body mass and habitat specialization within a linear model (R 2 = 0.08) (Figure 4a, same conclusions were drawn from individual Mantel test, see Table S4).We did not detect any significant phylogenetic signal.The 'ward.D2' clustering method revealed a pattern where species often exhibiting similar covariances tended to exhibit a common type of habitat specialization, while also displaying negative covariances with other species (Figure 4b).

| Prediction metrics
On the test set, M 435-abun models showed relatively poor marginal predictions (R 2 = 0.08), but these were greatly improved when conditioning on all other species (mean R 2 = 0.27, min = 0.06, max = 0.68).The difference between these two approaches was significantly large, indicating better R 2 for the conditional prediction leading to an +235% average improvement in respect to marginal prediction (paired t-test, p < 0.05; Figure 5).We noted that all species benefited from the information of the association matrices with conditional predictions with an increase from +0.03 (Mistle thrush: Turdus viscivorus) to +0.52 (Skylark: Alauda arvensis) of the predictive R 2 (and positive correlation between improvement and species prevalence, Table S5).Conditional predictions using intermediate abundance models (M100 -abun , M 250-abun ) also provided a significant increase in mean R 2 (0.16-0.21) compared to the respective marginal predictions (0.06-0.07) (paired Wilcoxon test, p < 0.05; Figure S6), but significantly less than that using the entire training dataset (M 435-abun, R 2 = 0.27) (paired Wilcoxon test, p < 0.05, see Figure S5 for occurrence results).
F I G U R E 2 Species residual associations matrices from jSDM on abundance data M full-abun , estimated at the plot-level (2 km; upper triangular matrix) and at the point-level (200 m; lower triangular matrix).Associations with high statistical support (posterior probability of at least 95%) are shown in a red (for positive) and a blue gradient (for negative).Correlation defined as the Mantel test correlation between the two matrices compared.
Thanks to analyses on this large-scale high-resolution multispecies dataset covering all mainland France (n = 7040, 40 species, 200 m resolution), we provided answers to some methodological uncertainties of jSDM at a national spatial extent.We disentangled the individual impact of dataset characteristics (abundance-occurrence, spatial resolution; sample size) on the inferred association matrices.We also found that the detected species associations were significantly related to species traits linked to ecological niche (main foraging substrates, nest location, body mass and type of habitat specialization) in a way that suggests that the associations correspond to similar responses to local environmental conditions rather than the effect of negative biotic interactions expected from partitioning niche (but not precluded from other interactions that have similar impact to environmental filtering such as heterospecific attraction).In capturing critical information on local conditions that are notoriously difficult to measure in the field, the retrieved high-resolution associations between species strongly improve the predictive performance of modelling for all species (by 235% on average).
Switching from occurrence to abundance data (H1.a)did not lead to a very different residual covariance matrix but reduced the number of significant associations (Figure 1), partially refuting our hypothesis (H1.a).This suggests that remaining inferred covariance could be more robust, as they are preserved with the addition of information but also that some co-presence and co-abundance patterns might be determined by different mechanisms or underlying factors.These minor differences may also be due to the fact that our abundance data had relatively little variance (many 0 s and 1 s) and thus differed little from occurrence data.Larger differences might indeed be found with more variable abundance data (plant community cover data, activity data, etc.).However, these small differences are consistent with Roberts et al. (2022) between species associations inferred from continuous abundance and occurrence data in marine communities.
In line with our hypothesis (H1.b) concerning the appearance of negative associations at a fine scale, we found more negative associations at the point-level (200 m) whose resolution better matches those of competitive interactions in avian communities (Dhondt, 2012).At the plot-level (2 km), we found a pattern largely dominated by positive associations in line with predominantly positive large-scale associations pattern found by Elo et al. (2023), 10 km resolution, 59 common bird species).At this coarse scale, we assume that deterministic processes (e.g.landscape or regional environmental filtering) dominate and thus our association matrix captures a multitude of environmental effects not described by our included variables.Moreover, including the higher resolution random effect allows separation of these coarse missing variables from the fine scale associations of the point-level (Figure S7).While we still detected many positive associations at this fine scale -as found by Tobler et al. (2019) and Zurell et al. (2020) for forest birds' communities -we also detected many more negative associations for common species compared to Tikhonov, Opedal, et al., 2020, n = 914 counts over 200 transects for 50 common species).Moreover, by integrating these two scales in the same model, we could conclude that these detected positive associations at the plot level (2 km) are not induced by a lack of description at the local level (e.g.habitat structure, Pearman, 2002).We thus confirmed the empirical detection of the negative fine-scale associations that Araújo and Rozenfeld (2014) predicted by simulations and that empirical studies F I G U R E 3 Species residual associations matrices from jSDMs on abundance data M sub-abun , estimated from different selected sample sizes of 100, 250 and 435 plots.Sampling was done randomly, ensuring spatial homogeneity.The comparison is made at the level of the residual association matrix of the point and thus includes (~10*plots) counting points Species on the y-axis have the same ordinate as in Figures 1 and 2. Correlation defined as the Mantel test correlation between the two matrices compared.
However, in contradiction with our hypothesis (H1.d), we found a positive association rather than negative between species sharing the same habitat specialization, foraging method and substrate, and nest type (Figure 4).By finding individual correlations with specific traits, we extended König et al. (2021)'s results with negative correlations between woodland bird associations and a global functional similarity (synthesizing eight traits) but we cannot conclude that these matrices capture only biotic interactions.
Although the habitat description was specifically designed for this protocol and has a resolution (200 m) corresponding to the expected variation in habitat structure (Pearman, 2002), these unexpected results may indicate that a significant part of the environment is still insufficiently described.In parallel, we do observe that our environmental variables have captured a significant portion of the variance (as depicted in Figure S3) and concurrently modified the detected residual association pattern (in contrast to a model without any environmental variables, as illustrated in Figure S8).For example, positive associations among species sharing the same nest type (Figure 4a) can indicate the absence of a variable describing specific fine habitat structure (e.g.presence of hedges, tree cavities).However, data on such fine features are technically difficult to collect in the field, over large spatial extents.These association matrices could thus be an interesting tool to synthesize several missing variables through the abundance of other species.The correlation with species traits also allowed us to make assumptions about the type of variables that are missing: shedding light on the type of variables that future protocols might include to be surveyed in-situ as they are difficult to synthesize with data like the remote sensing data used here.If detecting negative associations resulting from niche partitioning using current spatialized data is challenging, an alternative approach could in- multivariate autoregressive models; Certain et al., 2018) to model temporal effects between species.We also emphasized that these association matrices can reflect information from a wide range of phenomena and not only from the abiotic environment (evolutionary history, dispersal capacity, interactions with other taxa or other bird species as we focused here on common species).Moreover, our findings do not preclude that any biotic interaction are captured by these association matrices but mainly those predicted by niche partitioning but for instance, we emphasized that positive association between competitors can arise from heterospecific attraction (birds using other species as a cue to profitable breeding sites).Although our case study focused on birds, we believed that these findings are generalizable to associations of all taxa and even inter-taxa with synthesized missing variables depend only on the type of taxon, the resolution of the data, and the descriptive variables already included in the study (and thus can be used in a predictive context).
We confirmed our initial hypothesis (H1.c) on the appearance of a high number of associations by increasing the sample size of the dataset and illustrate the low number of these associations detected with a 'small' sample size (ca.1000 FBBS points).However, it is interesting to note that increasing the sample size did not lead to the modification of associations already detected with a smaller sample (e.g.no sign inversion), nor it led to a drastic change in the ratio of positive to negative associations.We also showed that the pattern detected with 60% of the dataset (ca.4000 FBBS points) and that with the whole dataset (ca.7000 FBBS points) are quite similar, which allowed us to approach an order of magnitude of the necessary sample size.However, these conclusions need to be placed in the context of our nationwide study on bird communities, which covered a large environmental gradient.It should be noted that sample sizes may differ when focusing on a more local study that covers a more homogeneous environmental gradient (and a smaller species pool).While this threshold is undoubtedly highly dependent on the community data and the system studied, we agree with Blanchet et al. (2020) that detecting associations probably needs community datasets with multiple thousand sampling units that are rare (only four have >100 samples units out of 294 gathered by Atmar & Patterson in 1995) even if recent examples with jSDM can be cited (e.g.König et al., 2021;Ovaskainen et al., 2017).This last F I G U R E 5 Comparison of predictive performance in terms of R 2 for each species using only the fixed effects of the jSDM (marginal predictions) or the fixed effects and the information of the residual association matrices (conditional predictions).
point is probably also exacerbated if the community is mostly made of rare species.Moreover, a recent empirical study by Thurman et al. (2019) showed that if a species interacts with several other species, the statistical signal from these individual interactions will be statistically difficult to capture (thus requiring more data) in line with simulation results from Cazelles et al. (2016).The relevance of the additional associations detected is notably illustrated in the improved conditional predictions using more data, as discussed below.
Our last hypothesis was that these association matrices can improve predictions through conditional predictions (H2).We confirmed the drastic improvement of conditional predictions (mean = +235% R 2 improvement) and that all species benefit from these association matrices (min = +0.03,max = +0.52R 2 increase), which is an interesting perspective for modelling rare or invasive species.For instance, it will be possible to use the associations between a local community and an invasive species in a geographical area where the latter has become established to better predict the colonization probability of sites.We have emphasized here that our conditional predictions used all other species to predict one (since we were primarily testing the overall quality of the information set of these associations).However, in practice, only a small number of species can be used, and improving this type of prediction remains to be explored.For instance, we could rely on the abundance/ presence of common (or more easily sampled) species to calculate the probability that rare and less frequently detected species were missed during a survey.Finally, including multiscale associations as we did explore the possibility of including the abundance of a regional species pool that could represent a way to better account for the evolutionary and biogeographic history shared with the focal species.It is important to note, however, that these conditional predictions require the observed abundance of a species set (or its estimation via spatial extrapolation of latent variables, Tikhonov, Duan, et al., 2020) to predict a focal species (although methods such as cascade modelling in the case of prey-predator networks are possible, Poggiato et al., 2022) that can explain their current limited use.
Increasing the sample size showed that marginal conditions (not using these association matrices) did not improve much with higher sample size (Figure S6) illustrating that fixed effects can be well estimated with a reduced sampling effort when well designed.However, there was a significant improvement in the conditional predictions with increasing sample size (Figure S6) illustrating that the additional associations detected at larger sample sizes (H1.c) are indeed informative for conditional prediction.Thus, while these association matrices seem to capture mostly missing environmental variables, they can nevertheless represent valuable tools for prediction when only a part of the community has been surveyed (such as the most common ones).
compared to those using reduced sample size M sub-abun (M 100-abun / M full-abun Mantel statistic r = 0.39 and M 250-abun /M full-abun Mantel statistic r = 0.82).However, this difference in detected associations was strongly reduced with a large sample size (M 435-abun /M full-abun Mantel statistic r = 0.94, Figure 3).Detected positive/negative associations ratio seems to be weakly impacted by this sample size (M 100-abun = 2.73, M 27-abun = 1.58,M 435-abun = 1.58 to M full-abun = 2).We also noted the absence of sign change of the previously detected associations when increasing sample size.This difference between matrices of increasing sample size M full-occu and full residual co-occurrence association Matrix from M full-occu was quite similar using occurrence data than abundance data except with the smallest sample size (M 100-occu /M full-occu Mantel statistic r = 0.71, M 250-occu /M full-occu Mantel statistic r = 0.81 and M 435-occu /M full-occu Mantel statistic r = 0.94) instead of (0.39, 0.82 and 0.94).
volve integrating time series analysis.For instance, studying the asynchrony of population dynamics among co-occurring species in similar habitats with shared functional traits.Using time series analysis could also enable the application of other recent statistical advancements (e.g.convergent cross mapping, Clark et al., 2015, F I G U R E 4 (a) Best-supported linear models fitting residual associations at point-level (N = 780 pairwise species comparisons) as a function of traits similarity: Foraging Method (Pursuit, Gleaning, Grazing, Digging and Scavenging), Foraging substrate (Ground, Vegetation and Air) Similarly Nest type (Ground, Hole, Open-arboreal, Closed-arboreal and Ground close), Body mass and Habitat Specialization (Generalist, Forest, Farmland and Urban).Individual correlation between residual associations and similarity functional traits are available in Table S4.(b) Joint species association matrices at point-level (200 m) reordered by a ward.D2 clustering method illustrating the clear clustering of species.Grey boxes represent the five unique clusters of residual species associations identified by this clustering method.The colours represent species' habitat specializations (black = generalist, green = forest; orange = farmland; grey = urban; blue = none, see Figures S9-S12 with other traits).