Integration of presence-only data from several sources. A case study on dolphins' spatial distribution

Presence-only data are a typical occurrence in species distribution modeling. They include the presence locations and no information on the absence. Their modeling usually does not account for detection biases. In this work, we aim to merge three different sources of information to model the presence of marine mammals. The approach is fully general and it is applied to two species of dolphins in the Central Tyrrhenian Sea (Italy) as a case study. Data come from the Italian Environmental Protection Agency (ISPRA) and Sapienza University of Rome research campaigns, and from a careful selection of social media (SM) images and videos. We build a Log Gaussian Cox process where different detection functions describe each data source. For the SM data, we analyze several choices that allow accounting for detection biases. Our findings allow for a correct understanding of Stenella coeruleoalba and Tursiops truncatus distribution in the study area. The results prove that the proposed approach is broadly applicable, it can be widely used, and it is easily implemented in the R software using INLA and inlabru. We provide examples' code with simulated data in the supplementary materials.


Introduction
Progress of ecological science is more and more reliant on combining data from diverse sources [Fletcher Jr. et al., 2019]. This approach can increase the comprehension of ecological processes for both research and conservation purpose [Pace et al., 2019]. Data availability to model species distribution, for example, is rapidly expanding thanks to the fast development of new technologies [Soranno and Schimel, 2014], the growth of citizen science initiatives [Florence et al., 2020, Sicacha-Parada et al., 2020 and the opportunity of exploiting huge information harvested from the social media platforms [Mikula andTryjanowski, 2016, Pace et al., 2019]. The latter data types can be intrinsically challenging to merge in with existing, valued and validated data collected via standard research protocols, yet if that can be achieved, they can offer enrichment of existing data to generate powerful insights and even reducing the costs of collecting data conventionally [Buchanan and Bryman, 2018]. Nevertheless, heterogeneous data are complex to manage as they are 'polymorphic' in nature and affected by numerous forms of bias and limitations [Isaac and Pocock, 2015]. Information on species occurrence collected at sea by sea-users, for example, is characterised by a different spatiotemporal distribution of effort, which can be biased toward easily accessible habitat and times with better weather, or known areas of use [Corkeron et al., 2011b, Sicacha-Parada et al., 2020. Hence, a simple data pooling [Fletcher Jr. et al., 2019] with data gathered under conventional research methodologies is not enough to reliably model the presence of a species considering different explanatory variables -both environmental and anthropogenic -and to define its distribution over multiple spatial and temporal scales.
Integrated distribution modelling (IDM), i.e. the practice of fitting species distribution models with more than one observation practice [Isaac et al., 2020], is a new approach to combine different datasets, preserving the strengths of each and adjusting, at least to some extent, their limitations. IDM sets a spatial -or spatio-temporal -latent state, here statistically defined as a Point Process, of the sites where the animals were sighted, described by a series of covariates shared by different datasets. Multiple observation sub-models can be estimated from them, each describing a part of the latent state. Here we use the IDM approach on sighting data derived from different data sources (research, monitoring and social media) to predict the distribution of two dolphin species in the central Mediterranean Sea. The study of spatial distribution patterns of dolphin species is incredibly puzzling, as they spend much time under the water surface [Redfern et al., 2006], and establish whether they are present in a specific habitat involves a lot of visual/acoustic effort for scientists facing the constant background bias in the acquired data [Redfern et al., 2017, Breen et al., 2017.
Coping with several challenges, we propose a novel path to combine these different sources to provide cohesive summaries of the species' potential and realised distribution [Isaac et al., 2020]. First, as the available information is presence-only data, we opt for Point Process as the most natural solution [see Miller et al., 2019, and references therein]. Second, as several sources of bias are potentially present in the datasets, we propose models based on a location-dependent thinning of a Poisson process to reduce these biases [see Dorazio, 2014, and references therein]; however, the parameters of these models are not fully identifiable unless the covariates of abundance are distinct and linearly independent of the covariates of detectability [Dorazio, 2014]. In Yuan et al. [2017] a flexible stochastic partial differential equation (SPDE) model describes the spatial structure that is not accounted for by explanatory variables is proposed, and estimation is carried on using integrated nested Laplace approximation (INLA) in a Bayesian inference framework. The latter allows simultaneous fitting of detection and density models and permits prediction at an arbitrarily fine scale. Very recently Sicacha-Parada et al. [2020] adopt a similar approach using citizen science data on moose (Alces alces) occurrence in Norway, accounting for the geographical bias (oversampling of "accessible" locations). For marine observations, the boat's size, the distance from the coast, policy regulations and weather conditions are just some of the factor that can affect the accessibility of an area. We aim to propose a new protocol for presence-only data fusion, where information sources include social media. We investigate several possible solutions and compare different types of detection function and accessibility explanations. We show how variation in the detection function affects ecological findings on two dolphin species with different spatial distribution. The approach is entirely broad and the selected species are representatives for different habitats. Hence they constitute a good benchmark for the entire proposal. We provide R functions and example code to replicate our work in the online Supplementary material (https://github.com/smar-git/SM-data-merging).

Study species
Two dolphin species were selected for this study, the bottlenose (Tursiops truncatus) and the striped dolphins (Stenella coeruleoalba), both widely distributed throughout the Mediterranean Sea. The bottlenose dolphin is reported predominantly "coastal" or "inshore" [Bearzi et al., 2012], but its habitat changes depending on the region: it can inhabit shallow waters (less than 50 meters) close to the coast and at the mouths of the rivers [Pace et al., 2019, Triossi et al., 2013, around archipelagos or islands [Pace et al., 2012, Pulcini et al., 2014, and in waters above the continental shelf and slope [Azzellino et al., 2008]; less frequent, but still present, in deeper waters and pelagic areas. Bottlenose dolphins feed a wide range of demersal and coastal prey and can forage opportunistically behind trawling vessels [Pace et al., 2012]. The striped dolphin is considered "pelagic" in the Mediterranean Sea, showing a general preference for highly productive, open waters beyond the continental shelf [Aguilar and Gaspari, 2012]. Although the species is the most abundant cetacean in the Mediterranean, it is not found at uniform densities. The striped dolphin diet is mainly composed by pelagic or bathypelagic schooling-nictemeral fish, squids, and even crustaceans [Meissner et al., 2012]. There are not exact estimations of the number of bottlenose and striped dolphins living in the Mediterranean Sea. The poor understanding of the status of a population, together with the suspected decline in numbers (both species are listed under the status "vulnerable" in the IUCN Red List as their populations have been decreasing during the last decades), emphasize the importance of integrating all available information [Pace et al., , 2021].

Study area
The study area covers about 39,000 km 2 , and is located in the central Tyrrhenian Sea (Italy) ( Figure 1); it is characterized by different environmental features (e.g. bathymetries), structures (e.g. seamounts) and types of habitats [Pace et al., 2019]. Several rivers flow in this region, including the Tiber, and the simultaneous presence of both fresh and salt waters, as well as the geomorphological action of sedimentation and erosion, generate different ecological gradients, making the coastal area highly productive and rich in biodiversity [Ventura et al., 2015, Casoli et al., 2019. The study area also includes five islands (Giglio and Giannutri at north; Ponza, Ventotene and Santo Stefano at south) and several commer-

Data sources and attributes
Dolphin data has been collected over 13 years (2007-2019) by three sources: a) conventional research protocols from motor and sailing boats (non-systematic 'haphazard', sensu [Corkeron et al., 2011a]) (labelled UNIRM) [Pace et al., 2019]; b) standardized monitoring protocols from platforms of opportunity within the project "FLT Mediterranean Monitoring Network" (labelled FERRY) [ISPRA, 2016., Pace et al., 2019; c) social media reports (Facebook and YouTube) by sea-users [Pace et al., 2019] (labelled SM). Data collection procedures and selection are provided in Pace et al. [2019]. As the SM dataset included also details on other cetacean species than the two here investigated (Figure 1b ), we used this information as a proxy to infer boat densities potentially able to record the animals' presence. These three sources accounted for 283 records of striped dolphin (about 50% from SM) and 579 of bottlenose dolphin (about 80% from SM). The major contribution by SM justified the need for a careful choice of the related model's elements. We used distance from the coast (i.e. the euclidean distance between a sighting point and the shoreline), temperature, primary productivity, slope and depth as covariates. They are commonly selected in cetacean distribution studies as they may represent good proxies for species' ecological needs [Chavez-Rosales et al., 2019, Stephenson et al., 2020. Temperature and primary productivity were retrieved from COPERNICUS platform https: //marine.copernicus.eu/. Depth data were downloaded from GEBCO (General bathymetric Chart of the Oceanhttps://www.gebco.net/). Slope was computed from depth data through the terrain function in R. Details of the retrieved datasets and data handling procedures are reported in the Supplementary materials.

Modelling approach
To integrate data from all available sources and manage possible detection bias in each dataset, we adopted a point processes modelling. We followed Yuan et al. [2017], and Sicacha-Parada et al. [2020], expanding their approaches by building a Spatial Log-Gaussian Cox Process (LGCP) [Illian, 2019] that incorporates different detection functions and thinning for each data source. We assumed that sighting patterns, i.e. locations of dolphin groups in space (s ∈ S ⊂ 2 ), are properly described by a point process whose intensity function λ(s) is additive on the log-scale: where X(s) is a set of spatially referenced covariates, f (z) is a smooth term (that may be present or not) of some geo-referenced covariates z. As prior for f (z) a common choice is a random walk of order 1 [Rue and Held, 2005].
Finally, ω(s) is a zero-mean Gaussian process describing the residual spatial variation. As in Yuan et al. [2017] we adopted a Matèrn covariance of order 1 with range ρ and standard deviation σ.
We assume that the above process is observed in three different ways, conditionally independent given λ(s). Thus, 3 observed intensities were defined: where t j is a time scaling factor and g j (s) is the detection function (with values between 0 and 1) which determines the original process's thinning. The form of the detection function depends on the type of observational process. For the UNIRM data, we set where d 1 (s) is the distance (Km) between point s and the position of the boat when the group was sighted. K was defined as the maximum distance measured between the location of the first visual sight of a dolphin group by researchers (equipped with 7x50 and 10x50 binoculars) on the boat and the effective location of the group under optimal survey conditions (i.e., sea state≤1 Douglas, wind force≤1 Beaufort, no rain, no fog, no clouds). This measurement was possible because, upon sighting dolphins, researchers marked the GPS point where the animals were first located, the survey effort was suspended, and the vessel departed from its route to approach the group to a suitable distance (10-30 m) to correctly identify the species, estimate group size and composition. K was set to 4 Km, assuming that researchers can spot animals closer than K.
For the FERRY data, we used the classical half normal detection function [Thompson and Ramsey, 1987] defined as where, d 2 (s) is the perpendicular distance (Km) to the ferry track and ξ is a scale parameter.
For the SM dataset, the definition of the detection function has been carefully considered for biases. Records in this dataset are affected by large uncertainty, as observations are generally (a) skewed towards more accessible To better define "more accessible" and consider the distribution of the small boats we explored three different possibilities. First, we reasonably assumed that locations closer to the coast are more accessible to sea-users with small boats. Thus, following [Sicacha-Parada et al., 2020], the detection function was defined as: where d 3,1 (s) is the distance from the coast ( Figure 2a) and ξ 3,1 a scaling parameter. However, the distance from the coast may not provide an accurate representation of the small boats' density in a given area: locations close to harbours and holiday destinations (e.g. islands) are generally more crowded than other sites at the same distance from the coastline. Secondly, to obtain information on the boats' density in the study area, we used data from EMODnet (European Marine Observation and Data Network; [Martín Míguez et al., 2019], a free-usage platform of vessel density data derived from boats using AIS (Automatic Identification System, mandatory above 15m length). The database has a spatial resolution of 11 Km and covers 2017-2019 period. We selected 2 vessels' categories (sailboats and pleasure crafts) from the 11 listed, and applied a kernel estimator to ensure a smoothed density surface. The resulting log-density surface (Figure 2b) was labelled as "vessel log-density surface". The detection function was defined as where d 3,2 (s) is the log of estimated boats density, and Φ is the normal cumulative distribution function (cdf) with µ 3,2 and ξ 3,2 as location and scale parameters, respectively. The normal cdf was selected as we required the detection function to be close to 1 when the vessel log-density is high, and to (or equal to) zero when it is small (or null). As expected, higher vessel (log) densities were identified near the principal harbours and the islands (Figure 2b). However, EMODnet information accounted for a limited timeframe compared to our study and for larger vessels than the ones generally reporting observation records in SM platforms (small recreational boats moving near the coastline). Lastly, we used the entire SM dataset of 581 records (125 striped and 334 bottlenose dolphins and 122 other cetacean species) to estimate the observation process's intensity. We considered the spatial pattern of such observations as a proxy for the small boat density process if we disregard the species. A similar approach was used in occupancy models context, where non-detection records were constructed from sightings of other "benchmark" species (see for example Dennis et al. [2017], Kery et al. [2010]). We applied a spatial LGCP to estimate the (log) intensity of the process. Details of the estimation process can be found in the Supplementary material. Figure 2c shows the resulting estimated log-intensity that we then used as input for the detection function: where d 3,3 (s) is the estimated log-intensity at point s while Φ, µ 3,3 and ξ 3,3 are defined as in (7). Eventually, another potential bias affecting the observation processes is the different time (days) spent at sea by each data source. To account for this we introduced the t j parameter in expression (1). t j is known for both the FERRY and the UNIRM data, but it is indeterminate for SM data. We know that SM observations were collected by leisure boats all over the year, with a major number of sightings reported in spring-summer. Thus, we ran estimations with t 3 = 160, 200, 365 days, without sensible changes, and selected t 3 = 360.

Priors specification
To finalize the model in a Bayesian framework, we needed to specify priors for all model parameters. To avoid identifiability issues when estimating both the animal intensity and the observation process, we used slightly informative priors. For the parameters in the spatial field ω(s) in (1) we used PC priors [Fuglstad et al., 2019] setting P (ρ < 150) = 0.5 and P (σ > 2) = 0.01. Meaning that we consider a standard deviation above 2 as large and a range of 150 Km likely. We assing β Gaussian prior with mean 0 and precision 0.01. The location parameters µ in (7) and (8) are also Gaussian with mean 0 and precision 0.01. Finally, for the scale parameters in (5)- (7), let ξ = F −1 α (Φ(θ) where F −1 (·) is the inverse exponential cdf with rate α and Φ is a normal cdf. This corresponds to assigning an exponential prior to ξ. We then assign θ a standard normal prior. The parameter α is set to 1/20 in (6), and 1 in all other cases. The difference in rate is due to the different scale of the three inputs for the detection function ( Figure 2).

Inference and computational approach
The traditional way of fitting point processes is by gridding the space and modelling the intensity on a discrete number of cells. This implies that observations' locations are also approximated. We followed instead the approach introduced in Simpson et al. [2016] and applied in Yuan et al. [2017] and Sicacha-Parada et al. [2020]. Such an approach allowed us to use the true sighting locations, thus avoiding loss of information. Besides, the Gaussian field's SPDE representation has several computational advantages (see Lindgren et al. [2011]). To build a spatial model using the SPDE approach, we used the mesh shown in Figure 1b.
For computational efficiency, we used the INLA method for numerical Bayesian inference with Gaussian Markov random fields [Rue et al., 2009]. INLA allowes also to easily combine the three observation model in 2 to form the likelihood. Our model does not directly fall under the latent Gaussian model framework for the INLA estimation software because the parameters in the detection functions in (5)-(7) do not enter the model in a log-linear way. We use therefore the methodology introduced in Yuan et al. [2017] and implemented in the inlabru R package [Bachl et al., 2019] that allows fitting models with some non-linear elements. This is done by linearizing the model via Taylor approximation and using a line search to optimize the linearization point. Model evaluation was carried out using goodness of fit measures as in Sicacha-Parada et al. [2020], through the Deviance Information Criterion (DIC), Watanabe-AiKaike Information Criterion (WAIC), Marginal Likelihood (MLIK), and the logarithm of the pseudo marginal likelihood (LMPL). As a benchmark for the SM detection function choice, we use a constant detection function g(s) = 1, ∀s, that is equivalent to not include any thinning for the SM data.

Results
The distribution of the dolphins encounters in the study area is shown in The evaluation of SM detection functions is based on model's goodness of fit measures, DIC WAIC, MLIK and the LMPL, it is reported in table 1. The selected best performing detection function for all criteria and species is (8) (labelled as intensity). This choice affected model's terms estimate. For striped dolphin model with varying detection functions (Fig. 4), the effects of categorized Depth were in agreement with the species' distribution ranges: it is generally not found in very shallow waters (negative effects), observed at 100-200m depth, and more often encountered at depths over 200m. The effect of the detection function was found in the smaller size of the credible intervals with detection (8). Slope showed a significant reduction effect in the encounters where it is steeper. No significant difference was found among the smooth effects with varying detection (overlapping 95% confidence band, not shown). The effect of the Distance from the coast was not significant, and the intercept was larger for detection functions (7) and (8), with the latter showing less uncertainty than the first. For bottlenose dolphin model with varying detection functions (Fig. 5), both Depth and Distance from the coast had negative effects on sightings (deeper waters and increasing distance from the coast mean less encounters), with no-significant difference among detection functions. Again, detection (8) induced narrower 95% credible intervals. Estimates of detection functions' parameters are reported in tables 3 and 4 in the Supplementary Materials.
As a measure of relative uncertainty for the predicted intensity λ(s) we use the relative width of the 50% posterior credible interval (RWPCI) as proposed in Yuan et al. [2017]. This measure is defined as the interquartile range divided by the median The intensity surfaces estimated for the striped and bottlenose dolphins are shown in Figures 6 and 8, respectively; associated RWPCIs (9) are mapped in Figures 7 and 9. The intensity surface for both species changed consistently with the different detection function adopted for SM. For example, the vessel-based detection (7) (vessel-log density surface) induced some artifacts for the striped dolphin, and in general over-estimated the dolphins' encounter probability, while the detection based on distance from the coast (6) and the constant detection, under-estimate the same probability for the striped dolphin and create some artifacts. A relevant feature of the detection function (8) is that allows a consistent reduction in the uncertainty associated to the estimated intensity surface.  Figure 4: Estimated intercept and effect of the distance from the coastline (a) effect of depth (b) and slope (c) for the four fitted models for the striped dolphins. In (a) and (b) the error bars indicate 95% credible intervals using different detection functions for the SM data, detection constant (red), detection (6) (green), detection (8) (blue) and detection (7) (violet). In (c) the estimated smooth terms for the slope are reported Figure 5: Estimated intercept and covariate effects for the four fitted models for the bottlenose dolphin. The error bars indicate 95% credible intervals using different detection functions for the SM data, detection constant (red),detection (6) (green), detection (8) (blue) and detection (7) (violet).  (6), (c) detection (7), (d) detection (8).
In Fig. 10 we describe the estimated probabilities for the number of sightings over one year for both species. Panel (d) corresponds to the chosen detection, and is showing an ecologically sound distributions for the number of sightings. Striped dolphins are more common in the area than the bottlenose dolphin. The smaller number of sightings in the database (283 striped and 579 bottlenose dolphins) induces a large uncertainty on the estimates, however it still allows for correctly capturing the species spatial distribution.  (6), (c) detection (7), (c) detection (8).    (6) 3895.52 3933. 01 -2008.44 -1988.77 detection (7) 3840.93 3889. 51 -2019.20 -1969.47 detection (8) 3789.38 3810. 08 -1942.07 -1922.56 (b)

Discussion
This study demonstrates that methods of spatial data integration able to carefully consider and minimize datasets' biases can be efficiently used to predict species' distribution. Results here obtained may be broadly applicable to other species that require an improvement of spatial knowledge for their conservation and management. Dorazio [2014] pointed out that several statistical models have been proposed to analyse presence-only data, but they have largely ignored the effects of imperfect detectability and survey bias. The same author showed that proper modelling choices could reduce the bias in SDM estimates induced by these types of errors. Here we do more than just correct for detectability issues; we allow multiple sources of information to be integrated. We defined and estimated source-specific detection functions considering the nature of the data, i.e. presence-only, and the different observation processes, offering a more precise picture of the distribution of two dolphin species in the central Mediterranean. The output is consistent with the ecology of these species, highly supporting a thoughtful usage of spatial data extracted from social media platforms and introducing a novel way to model observation biases. In analysing different detection functions, we optimise distribution models for each species. That is very attractive considering the importance of defining suitable habitats for vulnerable or endangered cetaceans exposed to anthropogenic disturbance or threats, particularly in coastal areas Pace et al. [2018].
The point process approach allows us to reliably estimate the observation intensity surface. The analysis of intensity surfaces in figures 6 and 8, gives important insights on the relevance of the detection function in observation intensity estimation. The artefacts around the Tiber river estuary (central part of the area) for the bottlenose dolphin and close to the Giglio island (northern portion of the study area) for the striped dolphin are solved by detection (8). Again, with the same detection function's choice, analysing figures 7 and 9, we can observe the reduction of intensity estimates' variability (and hence uncertainty). The proposed "best" choice is very general and can be adopted whenever social media data are available. The two species were also studied in Pace et al. [2019] using a presenceonly data approach based on MaxEnt [Phillips et al., 2006]. While results related to the bottlenose dolphin analysis were ecologically sound and coherent, striped dolphins analysis was unfeasible in that framework, given the relevant number of near-to-the-coast observation by sea-users. In particular, the depth around the Pontine islands rapidly increases with the distance from the coast, playing a misleading role in the MaxEnt modelling approach. The proposed methodology, instead, is fully able of capturing both species behaviour, thus addressing the complex task of finding targeted techniques weighting species' diversity. Some limitations are intrinsic to the proposed approach. On the one hand, spatial estimation does not distinguish between land and sea. The latter implies the use of post-processing to cut the estimated intensity surface. On the other, each analysed detection function is not very flexible. Eventually, the information used to model the observation effort in the SM data can be further improved. Hence, further investigations will be carried out to: • Develop spatially non-stationary modelling approaches where a barrier can be added at the coastline as in Bakka et al. [2019].
• Develop flexible detection functions • Explore the use of satellite data to estimate the density of small boats in the study area [Santamaria et al., 2017] The implementation of these tasks and the improvement of the models' capabilities may further develop a fast-growing research approach and provide innovative insights in marine top-predators' distribution patterns. The multiplicity of issues confronting these marine species requires collaborative efforts at all levels to share and merge resources, data, and expertise efficiently [Vella et al., 2021, Pace et al., 2018.
• Primary Productivity, year 2019, with a grid resolution of 0.042, 141 depth levels, the name of the service according to Copernicus is MED-SEA ANALYSIS FORECAST BIO 006 014, see Bolzon et al. [2020.], Salon et al. [2019].
Environmental data were then aligned to the estimation grid both in terms of spatial and temporal resolution. First, data were re-projected using UTM coordinates (EPSG:32632). Depth raster was up-scaled to the 1km x 1km estimation grid resolution using a Median Smoothing. Temperature and primary productivity data required more intense management than depth. In the details given above, both primary productivity and temperature datasets were characterized by differences in spatial resolution and the number of depth levels between the temporal bin "2008-2018" and data collected in 2019. Hence, during the download data, filtered, keeping just the more superficial layer. The latter ranges between 0 and -1.47 m for the "2008-2018" series and between 0 and -1.01 m for the "2019" series. Primary productivity data were first harmonized with the same measurement unit (from moles/m3/s to mg/m3/day). Variables were then down-scaled using Generalized Additive Models. We fitted a GAM to each stack raster layer, that is, each month of each year of the investigated temporal range, and then predicted the variable value (i.e. temperature or primary productivity) over a surface gridded at 1 km x 1km. Predictions were, aggregated by average to obtain one value for each cell. All variables are available from Copernicus as monthly means.

B SM data detection function: Estimation of the intensity from observations of all species
We assume that the location of observations of all species in the SM dataset (Figure 1(b) ) are properly described by a LGCP point process with logintensity log γ(s) = α + u(s) (10)  where α is a common intercept and u(s) is a zero-mean Gaussian process with Matern covariance function of order 1 with range ρ u and standard deviation σ u . Here also, we follow the SPDE approach as in Yuan et al. [2017] to represent the Gaussian field and use the same mesh and the same priors for ρ u and σ u as in the main model. The estimated log-intensity surface is shown in Figure 2c while its standard deviation is shown in Figure  12).  Table 2: Posterior mean and 05% credible interval for ρ u and σ u

C Shape of the detection functions
One of the main ideas in this work is to correct the social media dataset for biases due to the larger number of sightings expected in areas that are more visited by small boats (the primary source of the SM data). We use three different proxies for the small boat density: the distance from the coastline d 1 (s), the log-density of vessels from the EMODnet data d 2 (s), and the estimated log intensity for all observations in the SM dataset d 3 (s).
We have computed the the three measures for the 862 location where a dolphin was observed (283 striped and 579 bottlenose dolphins) and for about 32000 points evenly distributed on a grid over the whole area of interest. The boxplots of these measures for each set of points are displayed in   13. In all three cases, both species' measures appear to come from a different distribution than the measures for the grid. A Kolmogorov-Smirnov test was performed to determine if the grid data and the bottlenose and striped dolphin data follow the same distribution or not. In all cases, the result (p − value < 2e − 16) indicates that the sets of measures do not follow the same distribution. That is an indication of a non-random sampling process. We then explore the relationship between each of the proxies d i (s), i = 1, 2, 3 and, the realative probability q(s d i ) of retaining a point in a location with measure d i (s) (i.e. not thinning) in the observed pattern. To estimate this, we grouped all three measures for striped and bottlenose sightings locations and for the grid points into 100 bins. For each bin we computed withp j obs , j = 1, 2 andp grid , the proportion of points that are part of the bin in the observed pattern of the two species and the dense grid, respectively.
In Fig. 14(a) (dots) we observe a considerable decrease of q(s d 1 ) from 0 to about 30 Km for both species. The decrease is steeper for the bottlenose but clearly present also for the striped dolphin. Figure 14(b) (dots) shows how the detection probability q(s d 2 ) increases when the log-density of the vessels from the EMODnet data increases. The pattern is similar for both species.
Finally, Figure 14(c) (dots) shows how the detection probability q(s d 3 ) also increases when the log-intensity estimated from all observations in the SM dataset (see B for details) increases. The estimated detection functions from our models are reported in Figure 14(a-c) as a solid line. They seem to capture reasonably well the observed patterns for both species. The estimated parameters for the detection functions in Equations 5-8 with posterior mean and 95% credible intervals are reported in Table 3