PointedSDMs: An R package to help facilitate the construction of integrated species distribution models

Ecological data are being collected at a large scale from a multitude of different sources, each with their own sampling protocols and assumptions. As a result, the integration of disparate datasets is a rapidly growing area in quantitative ecology, and is subsequently becoming a major asset in understanding the shifts and trends in species' distributions. However, the tools and software available to construct statistical models to integrate these disparate datasets into a unified framework is lacking. This has made these methods inaccessible to general practitioners and has stagnated the growth of data integration in more applied settings. We therefore present PointedSDMs: an easy to use R package used to construct integrated species distribution models. It provides functions to easily format the data, fit the models in a computationally efficient way and presents the output in a format that is convenient for additional work. This paper illustrates the different uses and functions available in the package, which are designed to simplify the modelling of integrated models. A case study using the package is also presented: combining three datasets coming from different sampling protocols, all containing records of Setophaga caerulescens across Pennsylvania state.


| INTRODUC TI ON
Ecological research in the 21st century has been characterized by the accumulation of species occurrence data, due mainly to the advancements in digital technology and online data repositories (LaDeau et al., 2017).
While this accumulation of data has expanded the potentials of ecological analysis on the spread, range shifts and relationship species have with the underlying environment, a multitude of challenges have arisen.
In particular, the data are likely to have come from disparate sources, resulting in heterogeneous attributes, assumptions and sampling protocols inherent in each (Fletcher Jr et al., 2019).
Typically, analysis of such data uses species distribution models (SDMs), which model the relationship between species' distributions and the underlying environment. They are fitted to data using a variety of different estimation procedures and software packages (examples of such provided in Norberg et al., 2019). However, when several datasets were available, the standard approach was to favour one dataset and discard the others, or use them in some form of secondary analysis .
As a result, a myriad of statistical methods to perform analysis, make predictions and efficiently use all available data have been producedthe so-called integrated species distribution models (ISDMs; see Miller et al., 2019, for a detailed review). A common result among research on the topic is that integrated data models appear to not only expand the spatial scope of a study, but also appear to be superior to models with a single data source by providing improvements to the results and estimates in comparison to using only a single dataset (see e.g. Bowler et al., 2019;Fithian et al., 2015;Miller et al., 2019).
Despite the development of ISDMs, a significant problem is the lack of general software and tools to make inference with them; thus, the overall uptake has been generally slow. Here we introduce PointedSDMs, an easy to use R (R Core Team, 2022) package designed to fit SDMs using data obtained from heterogeneous sources, and integrate them all together in a unified statistical framework. It does so using a hierarchical state space formulation-in which we link a process model (which provides a description of the true distribution of the model) with observation models for each dataset, dependent on their underlying sampling protocols ).

The integrated model for this package is fitted using integrated nested
Laplace approximation (INLA)-a computationally efficient method used by Bayesian statisticians to fit latent Gaussian models. The theory behind the INLA methodology is discussed in detail in Rue et al. (2009), and estimating models with this methodology is made simple with the now established R-INLA package (Martins et al., 2013). The PointedSDMs package constructs a wrapper around the R package inlabru (Bachl et al., 2019), which, in turn, builds on the R-INLA package to help provide a userfriendly method to simplify the modelling of spatial process models.

| Statistical model
The aim of our state-space point process model is to use the available species' location data to make inference about the 'true' distribution of the population of the species'; since this distribution cannot be directly observed, it is referred to as a latent state . To do inference, we use a hierarchical modelling structure with an underlying process model which provides a statistical description of how points are distributed in space; the role of such is a reflection of how multiple data types emerge from the same system . This process has a spatially varying intensity function (denoted here by (s)) which is some function of environmental covariates X and parameters such that a higher intensity implies that the species is more abundant in a location.
A visual representation of the hierarchical setup of this model is presented in Figure 1.
For this model, we assume that the underlying process model is a log-Gaussian Cox process (LGCP) with an intensity function given as (s) = exp{ (s)}, which describes the expected number of species at some location, s. The log of this intensity function is thus given as: where is a dataset-specific intercept term, u is the coefficient associated with the uth environmental covariate and (s) is a zero-mean spatially continuous Gaussian random field (GRF), included in the model to account for potential spatial autocorrelation and the effects of all the environmental covariates not included in the model. Therefore, the expected number of species' presences within a region Ω is given by the integral of the intensity function across the entire region: (1) (s) = + k ∑ u=1 u X u (s) + (s), F I G U R E 1 Representation of the structure of the integrated species distribution model, where each dataset is a separate realization of the 'true' species distribution. This is done by assuming each dataset has its own observation process, with a common latent, which is described by ecological covariates and parameters.
Next, we assume that each dataset process (Y i , i = 1, 2, … , n) has its own sub-model (observation model), which provide a statistical description on the data-collection process . These models link the intensity function to the dataset's assumed likelihood, given by,  Y i | (s), i , where i are the parameters for the i th observation model. Table 1 provides a description of the three types of datasets allowed in PointedSDMs: presence-only (modelled as a thinned Poisson random variable), presence-absence (modelled as a Bernoulli random variable with a cloglog link function (see Kéry & Royle, 2016) and counts (modelled as a Poisson random variable) datasets.
Then, by combining the process model with the observation models, the full likelihood for the data processes Y = Y 1 , Y 2 , … , Y n is given by: that is, the model component for the latent state of the model, multiplied by the product of the individual likelihoods for the data processes.
In addition to the species location data, datasets sometimes include additional trait variables (often referred to as marks). These data may also be included in the point-process modelling framework to supplement the amount of information in the SDM through the joint-likelihood method described above, by treating each mark as its own observation model. That is, we assume within the datasets there are marks M l , l = 1, 2, … , p with associated observation models  M l | (s), l , which results in the full likelihood:

| PACK AG E FUN C TIONALIT Y
PointedSDMs was developed to streamline the modelling process and provide a general framework for integrated SDMs for ecologists who have a collection of heterogeneous datasets at hand. It does so by re-formatting and assigning appropriate metadata to the species' location and covariate data, and then constructing the relevant objects required by R-INLA (Martins et al., 2013) to do the model fitting. The package contains four primary functions for model pre-preparations (intModel), fitting and inference (fitISDM) and cross-validation (datase-tOut and blockedCV), as well as several generic functions related to plotting, printing and predicting the results of the model. intModel is the first function used in the integrated modelling process, and is built using R's R6 (Chang, 2021) object-orientated system. Here, the user adds the species location data, environmental covariates, as well as additional R-INLA and sp (Pebesma & Bivand, 2005) objects required; most of the other arguments for this function are used to define variable names and terms to be included in the model. Since this is an R6 object, there are a handful of slot functions which allow further specification and adjustments of the components in the model. A description of each of these slot functions and their intended use is available in Table 2.
PointedSDMs allows datasets from three sampling schemes: presence-only, presence-absence and count data, where the latter two are defined in the model through their response variable names, using the intModel's arguments responsePA and responseCounts, respectively.
If the user defines a spatial partitioning of their data points using intModel's slot function, '.\$spatialBlock', spatial cross-validation may be performed using the function, blockedCV: which iteratively calculates a cross-validation score by leaving a certain block of data out of the model based on their spatial location. fitISDM is used for the modelling and estimation of the integrated model. The data argument of the function is an object created by the function intModel, which contains the necessary information and metadata required in the model. The second argument, options is used to control any additional R-INLA or inlabru options.
After the model has been estimated, another form of spatial cross-validation may be completed using the function, datasetOut.
The function works by calculating a cross-validation score from the following steps: TA B L E 1 Details on the observation models for the species location data which may be used in the PointedSDMs R package.

| Description of spatial covariates
Two standardized and continuous spatial covariates describing the study area were used in this analysis. The first, elevation, describes the height in metres above sea level, obtained from the package, elevatr (Hollister et al., 2021), and the second, canopy, describes the percentage of tree canopy covered in the area, obtained from the package, FedData (Bocinsky, 2022), which, in turn, accesses the data from the National Land Cover Database (NLCD). Both of these packages produced spatial covariates in the form of Raster objects, which we stacked into a single RasterBrick object before analysis.

| Description of datasets
The data used in this analysis come from three heterogeneous sources, where we assumed the underlying sampling protocol for each dataset is unique to that dataset (we considered datasets representing: presence-only, presence-absence and count data). These datasets and their unique sampling protocols are displayed graphically in Figure 2. We see that the three datasets combined have a better spatial representation of Pennsylvania compared to any dataset individually. However, each has a different sampling protocol, which implies an ISDM is appropriate to use for this example.
The citizen science presence-only data were obtained from eBird (Sullivan et al., 2009), a citizen science project launched by the Cornell Lab of Ornithology where amateur birders are able to submit checklists of avian detections to an online data repository, which has grown significantly since its inception, and has established itself as a significant tool in scientific research. Since the eBird data are collected by non-scientists, as so we expect the biases typically found in citizen science data. Given the nature of such data, we modelled these data as a thinned version of the intensity surface, with an additional spatial random field to account for biases in the collection process . Mathematically, this is given by: where (s) is the thinning parameter of the intensity function since we assume imperfect detection from these data (and as a result, estimate relative abundance instead of true abundance). This parameter cannot be directly estimated, and is therefore confounded within the intercept term of the model.
The other two datasets used come from structured survey data.

| Model preparations
The first step to running an integrated model with PointedSDMs is to organize and assign appropriate metadata to the individual datasets, using the intModel function, which is used to initiate and prepare the  We would also like to account for spatial autocorrelation in the model through a GRF with a Matérn covariance function, which may be computationally expensive for large point process models. R-INLA counters this issue by approximating these GRFs via the stochastic partial differential equation (SPDE) approach (Lindgren et al., 2011), which requires the construction of a Delaunay triangulated mesh (interested readers who would like further details on the mesh construction are referred to Krainski et al., 2018;Lindgren & Rue, 2015).
The mesh for this example was created with the inla.mesh.2d function by supplying a SpatialPolygons boundary of the study region as well as the max.edge, offset, and cutoff arguments. Furthermore, the SPDE models for this example were specified using penalizing complexity (PC) priors (Simpson et al., 2017), which are designed to control the spatial range and standard deviation in the GRF's Matérn covariance function to reduce over-fitting in the model. spatial_data$specifySpatial(sharedSpatial = TRUE, prior.sigma = c(5, 0.01), prior.range = c(1, 0.01)) Simmonds et al. (2020) demonstrated in a simulation study that running a second spatial field for opportunistically collected presence-only data is a useful method to account for bias when knowledge of the sources of bias is scant or when covariates to adjust for bias are unavailable. Therefore, we use the `.\$addBias` function to add a second spatial field to our citizen science data to account for potential biases not reflected in the shared field. spatial_data$addBias('eBird_caerulescens')

| Results
The integrated model is easily fit using the fitISDM function as below, which takes two arguments: data (which is an intModel object created above) and options (which is a list of R-INLA and inlabru options used to configure the model). In this model, the two fixed covariates and separate intercept terms for the three datasets were considered. In addition to the bias field for eBird_caerulescens, a common spatial field was used across the datasets; and to speed up computation time, R-INLA's empirical Bayes' integration strategy was used.
spat_model <-fitISDM(data = spatial_data, options = list(control.inla = list(int.strategy='eb'))) PointedSDMs also includes the function datasetOut to carry out a form leave-one-out cross validation, which iteratively omits one dataset (and its associated marks) out of the full model. Table 3

| Predictions
A crucial part of the process of making SDMs is creating prediction maps (such as those in Figure 3 to help researchers understand the species' spread. Predictions of the ISDMs from fitISDM are made easy using the predict function. The function will automatically create individual formulas to predict per dataset after the user has specified which components they would like to predict (with the arguments: covariates, spatial and intercept); and all components used TA B L E 3 Leave-one-out cross-validation score as well as the changes in fixed effects as a result of leaving a dataset out. The cross-validation score is calculated by finding the difference between the marginal likelihood of the full model, and the marginal likelihood of the model with the dataset left out.

F I G U R E 3 Mean and standard
deviation of the predicted intensity (log( (s))) of the integrated species distribution model, which gives a reflection of relative abundance across the spatial map.
in the model may be included by setting the predictor argument to TRUE. However, any formula may be predicted by using the function's formula argument. PointedSDMs also provides methods to plot basic predictive maps for a variety of statistics. By setting the plot argument to FALSE, the ggplot (Wickham, 2016) object of the predicted statistic is given, which would allow for more custom plotting functionality.

| CON CLUS IONS
PointedSDMs is an R package that provides the tools to make the most of the vast volume of species location data available today, by promoting and facilitating the integrated modelling of marked point process SDMs in a convenient way. It does so using the now wellestablished INLA methodology, and by constructing wrapper functions around the R package, inlabru.

| Opportunities for future work
A multitude of different R packages have been developed in the past to assist with the construction of SDMs (see: dismo [Hijmans et al., 2022], sdm [Naimi & Aráujo, 2016], biomod2 [Thuiller et al., 2023], HMSC [Tikhonov et al., 2020] to mention a few); however, none of them have methods to create ISDMs-thus providing a novelty of PointedSDMs.
Despite this, there are still extensions to the PointedSDMs framework which should considered to extend the project further.
Different groups of species influence each other through a multitude of processes (such as predation and competition), thereby affecting each other's distribution across space and time.
A method to account for these processes would be to add interspecies interactions between different species, therefore changing the model framework to a joint species distribution model (JSDM).
A limitation of this model is that it only incorporates a small subset of the types of data used in ecology (presence-only, presenceabsence and count data). Therefore, there is an opportunity to incorporate other types of data (such as biomass and movement data) into this framework, which would thereby extend the possibilities of research within a project.
Furthermore, providing tools to assist users in adding more custom components into the model, for example being able to both add random effects and change their precision matrix, should be considered. Incorporating this into the package would allow PointedSDMs to be used in additional analyses, such as studying phylogenetics.
Finally, constructing the necessary tools and data pipelines to move species and environmental data from online repositories to create a complete workflow in a way that is not only reproducible, but also easy enough to use for ecologists and policymakers with minimal basic skills would allow a package like this to show off its full potential (a reflection of the steps to develop such a workflow is discussed in Mostert et al., 2022). Before this is completed, tools required to simplify the sharing and standardization of ecological data on a large scale need to be developed.

AUTH O R CO NTR I B UTI O N S
Philip S. Mostert wrote the script for the R package, PointedSDMs, provided the graphics and led the writing for the first draft of the manuscript. Robert B. O'Hara provided conceptualization of the project, supervision and reviewed the manuscript.

ACK N O WLE D G E M ENTS
We want to thank Walter Jetz and Petr Keil for help and discussions early in this project. We would also like to thank Ben Moore for providing us with the koala dataset used in the marked point process vignette in the package.

CO N FLI C T O F I NTE R E S T S TATE M E NT
We declare no conflicts of interest.

PE E R R E V I E W
The peer review history for this article is available at https://publo ns.com/publo n/10.1111/2041-210X.14091.

DATA AVA I L A B I L I T Y S TAT E M E N T
All data are freely available for the reader in the R package,