Causes and effects of sampling bias on marine Western Atlantic biodiversity knowledge

Knowledge gaps and sampling bias can lead to underestimations of species richness and distortions in the known distribution of species. The goal of this study is to identify potential gaps and biases in the sampling of marine organisms in the Western Atlantic Ocean, determine their causes and assess their effect on biodiversity metrics. We also tested whether this bias interferes with the representation of environmental conditions, potentially affecting biodiversity model predictions.


| INTRODUCTION
Estimates indicate a global marine biodiversity count between 700,000 and 2.2 million organism species (Appeltans et al., 2012; Mora et al., 2011). However, only 9%-40% of all marine species have been described (Appeltans et al., 2012; Mora et al., 2011), indicating a significant knowledge gap in marine biodiversity (Luypaert et al., 2020; Webb & Vanhoorne, 2020). These gaps result from both limited collections in certain locations and sampling bias. Sampling gaps occur due to the heterogeneous distribution of sampling efforts, leading to underexplored and under-sampled regions (König et al., 2019). Additionally, sampling bias arises from targeted collections in specific areas, species or taxonomic groups, often influenced by factors like public interest and research funding (Girardello et al., 2018). Addressing these gaps and sampling biases is crucial for enhancing our understanding of marine biodiversity and the conservation of marine ecosystems. Sampling gaps and sampling bias are the main causes of deficiency in knowledge regarding the existence and abundance of species (Menegotto & Rangel, 2018; Williams & Pearson, 2019).
Knowledge gaps and sampling biases in specimen collection have significant implications for understanding patterns of biological diversity (Meineke & Daru, 2021; Oliveira et al., 2016). The lack of comprehensive and representative sampling can lead to underestimations of species richness and distorted views of species distribution patterns (Oliveira et al., 2016). Moreover, these gaps and sampling bias can directly affect species distribution estimates made by species distribution models (SDMs) since they may introduce bias into the sampling of environmental conditions used by SDMs (Muscatello et al., 2021; Oliveira et al., 2016; O'Toole et al., 2021).
These limitations have significant impacts on both scientific research and decision-making in marine conservation and resource management.
The marine tropical region is recognized as one of the most biodiverse areas on the planet, harbouring a wide array of marine species (Barlow et al., 2018). Paradoxically, this region also faces significant challenges in terms of biases and gaps in sampling (Menegotto & Rangel, 2018; Williams & Pearson, 2019). Logistic difficulties, limited resources and lack of investment in scientific research in these tropical environments have contributed to under-sampling and insufficient knowledge of tropical marine biodiversity (Enright et al., 2021; Stroud & Thompson, 2019). This lack of comprehensive information represents a significant obstacle to the understanding and conservation of these rich and threatened marine ecosystems. Therefore, it is crucial to enhance the efforts towards filling these knowledge gaps, improving sampling strategies and stimulating scientific research in tropical marine regions.
The Western Atlantic Ocean stands out in the tropics as a significant example of the challenges faced regarding biases and gaps in institutional collections (Miloslavich et al., 2011;Wynne, 2011).
Despite its wide area and remarkable biological richness, the Western Tropical Atlantic still presents significant knowledge gaps due to a lack of comprehensive and representative collections (Menegotto & Rangel, 2018; Miloslavich et al., 2011). These gaps result from factors such as limited resources, inadequate infrastructure and logistical challenges associated with conducting research in remote marine environments (Earp & Liconti, 2020; Hatcher & Okuda, 2016). This situation represents a challenge to the understanding and conservation of the tropical marine ecosystems of the Atlantic.
Quantifying and understanding sampling biases and their causes are extremely important for advancing studies on biological diversity and conservation strategies (Meineke & Daru, 2021). By identifying the groups of organisms and areas that are most affected by these biases, it is possible to apply and optimize efforts and resources for data sampling. Moreover, by analysing sampling gaps and sampling biases, it is possible to improve estimates of organism richness and distribution, providing a robust foundation for analyses and decision-making. These studies are a crucial step towards enhancing conservation strategies, ensuring they are targeted at the most critical groups and areas and contributing to the understanding of the dynamics and functioning of marine ecosystems. Thus, this kind of investigation is an essential step for obtaining more reliable estimates of biological diversity and developing effective marine conservation actions. Accordingly, the objective of this study is to identify possible sampling gaps and sampling biases for marine organisms in the Western Atlantic, from the limit of the southern USA to the coast of Patagonia. It aims to identify the possible causes of this sampling bias and examine its effect on metrics of biological diversity. Additionally, we test whether this sampling bias may interfere with the sampling of environmental conditions and its potential impact on SDMs and other biodiversity models.

Accessibility was identified as one of the main causes of sampling bias. The analysis of environmental bias indicated that the records do not represent all conditions present in the environment. Sampling density showed a strong relationship with endemism and a weaker relationship with species richness.

Main Conclusions:
We have identified a strong sampling bias related to ease of access that equally affects vertebrates, invertebrates and algae, resulting in a skewed sampling of the environmental conditions where species occur. Sampling patterns differ among the groups. The intensity of sampling effort significantly impacts measures of richness and endemism, potentially undermining the accurate recognition of real biological diversity patterns.

KEYWORDS
Atlantic, biodiversity metrics, environmental bias, knowledge gaps, Ocean

| Species occurrence database
To compile the data, we carried out searches in online databases such as GBIF (Global Biodiversity Information Facility) (https://doi.org/10.15468/dl.thtnur) and OBIS (Ocean Biodiversity Information System) (https://obis.org/), and in personal databases of Brazilian specialists compiled from collection data and from collections not available online. For the online database searches, data were filtered by selecting countries in Latin America and states in the USA located south of latitude 41° N in the Atlantic, within a limit of up to 300 km from the coast. We chose not to include the northern portion of the Atlantic due to the extremely high number of samples from this region, which could influence the results on the causes of sampling differences in Latin America. Additionally, we filtered the following organism groups: Annelida, Ascidiacea, Brachiopoda, Branchiopoda, Bryozoa, Cnidaria, Ctenophora, Echinodermata, Malacostraca, Maxillopoda, Mollusca, Nemertea, Nematoda, Ostracoda, Platyhelminthes, Porifera, Sipuncula, Cetaceans, fish from the Tetraodontiformes and Perciformes groups and the algal groups Chlorophyta and Rhodophyta.
Data were filtered to remove records located on the continental surface or in the Pacific Ocean. In cases of non-georeferenced data, we attempted to georeference them by cross-referencing the locality description with the general data in the Global Biodiversity Information Facility (GBIF) database (https://www.gbif.org/) to obtain coordinates. To ensure the taxonomic validity of species names, data were cross-referenced with the WoRMS (World Register of Marine Species) database (https://www.marinespecies.org/) and reviewed by experts in each taxonomic group.
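The spatial filter described above (records south of 41° N and within 300 km of the coast) can be sketched as follows. This is only an illustration under stated assumptions: the record dictionaries, the `coast_points` list and the function names are hypothetical, the coast is approximated by a set of points, and the sketch does not exclude land or Pacific records as the full procedure does.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def filter_records(records, coast_points, max_lat=41.0, max_km=300.0):
    """Keep records at or south of max_lat and within max_km of any coast point.

    records: list of dicts with 'lat' and 'lon' keys (hypothetical layout).
    coast_points: list of (lat, lon) tuples approximating the coastline.
    """
    kept = []
    for rec in records:
        if rec["lat"] > max_lat:
            continue  # north of the latitude limit
        nearest = min(
            haversine_km(rec["lat"], rec["lon"], clat, clon)
            for clat, clon in coast_points
        )
        if nearest <= max_km:
            kept.append(rec)
    return kept
```

For real coastlines, a vector coastline layer and a spatial index would be more appropriate than a brute-force search over points.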

| Sampling effort and sampling bias
To identify possible sampling bias and sampling gaps, we mapped the density of occurrences for each group, as well as for the total set of invertebrates, vertebrates and algae. For this purpose, all occurrences of marine organisms (regardless of taxonomic identification) were used. We estimated occurrence density with a kernel density analysis based on the Epanechnikov kernel (Epanechnikov, 1969) and a search radius of 50 km. Subsequently, we performed a Pearson correlation analysis with degrees of freedom correction (Clifford et al., 2012) among the kernel density maps to compare sampling patterns among different groups. To compare the sampling effort among the studied countries, we summed the number of occurrences in the maritime territory of each country. To account for the effect of each country's maritime area, we calculated the occurrence density per country as the ratio between the number of occurrences and the maritime area of each country (https://marineregions.org/).
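As a minimal sketch of the kernel density step, the function below evaluates a two-dimensional Epanechnikov kernel with a 50 km search radius on a grid. It assumes that occurrences and grid nodes are already in a planar coordinate system measured in kilometres; the function and argument names are illustrative and are not taken from BioDinamica, whose implementation may differ in normalization and edge handling.

```python
import math


def epanechnikov_density(points, grid_x, grid_y, radius_km=50.0):
    """Kernel density (occurrences per km^2) at each grid node.

    points: iterable of (x_km, y_km) occurrence coordinates (planar, assumed).
    grid_x, grid_y: node coordinates in km; result is indexed [row][col].
    """
    # 2D Epanechnikov kernel: K(u) = (2 / pi) * (1 - |u|^2) for |u| < 1.
    # Dividing by radius^2 makes each occurrence integrate to 1 over its disc.
    norm = 2.0 / (math.pi * radius_km ** 2)
    density = [[0.0 for _ in grid_x] for _ in grid_y]
    for j, gy in enumerate(grid_y):
        for i, gx in enumerate(grid_x):
            total = 0.0
            for px, py in points:
                u2 = ((gx - px) ** 2 + (gy - py) ** 2) / radius_km ** 2
                if u2 < 1.0:  # only occurrences inside the search radius count
                    total += norm * (1.0 - u2)
            density[j][i] = total
    return density
```

The loop is quadratic in the number of nodes and occurrences, so a dataset of millions of records would need a spatial index or a raster-based implementation.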
To investigate the hypothesis that ease of access is one of the causes of sampling bias, considering logistic and financial factors, we employed a Spatial Autoregressive model (SAR). This model takes the spatial structure of the data into account, controlling for the effects of spatial autocorrelation while weighting the effect of each explanatory variable (Ver Hoef et al., 2018). The dependent variable of the model was the kernel index (representing the sampling effort), and the predictor variables included the distance from the coast, protected areas, ports, urban areas, research institutions and maximum bathymetry (indicating ocean depth). To avoid problems with multicollinearity and model overparameterization, we summarized the predictor variables into axes using Principal Component Analysis (PCA). The maps of ports and major coastal urban areas were obtained from the Natural Earth platform (https://www.naturalearthdata.com/). To map research institutions, we searched government websites and the institutions' pages to verify if they had biology-related courses in coastal areas, research centres, or campuses. The listed institutions were then searched on Google Maps, and their geographic coordinates were used for mapping.
The bathymetric map was obtained from the General Bathymetric Chart of the Oceans (GEBCO; https://download.gebco.net). To generate the geographic distance map from the coast, we identified emerged areas in bathymetric maps. We selected pixels with values equal to or greater than zero, representing areas above sea level. We then created a map where each pixel indicates its geographic distance to the nearest coast. This process yielded a continuous distance surface, where all areas situated on the continent are assigned a value of zero, indicating that they lie outside the marine domain. Maps of marine conservation units were obtained from the World Database on Protected Areas (http://www.protectedplanet.net/). For all accessibility-related maps, we calculated distance maps (maps that represent in each pixel the geographic distance to a given feature on the map) using the Dinamica-EGO software (Soares-Filho et al., 2009). The BioDinamica package was used to generate the kernel density maps, perform the correlation analyses and fit the SAR model (Oliveira, Soares-Filho, Leitão, et al., 2019).
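The distance-to-coast surface can be illustrated with a brute-force sketch: pixels with elevation at or above zero are treated as land and receive a distance of zero, and every other pixel receives the planar distance to the nearest land pixel. The function name and the `pixel_km` default are assumptions made for this example; the study used Dinamica-EGO, and a production raster would normally rely on a dedicated distance-transform routine.

```python
import math


def distance_to_coast(elevation, pixel_km=5.0):
    """Distance (km) from each pixel to the nearest land pixel.

    elevation: 2D list of metres, negative values below sea level.
    pixel_km: pixel size in km (an assumed value for this sketch).
    """
    land = [
        (r, c)
        for r, row in enumerate(elevation)
        for c, z in enumerate(row)
        if z >= 0  # at or above sea level counts as emerged land
    ]
    if not land:
        raise ValueError("no land pixels found in the elevation grid")
    distances = []
    for r, row in enumerate(elevation):
        out_row = []
        for c, z in enumerate(row):
            if z >= 0:
                out_row.append(0.0)  # land is outside the marine domain
            else:
                nearest = min(math.hypot(r - lr, c - lc) for lr, lc in land)
                out_row.append(nearest * pixel_km)
        distances.append(out_row)
    return distances
```

The same routine applies to ports, urban areas, research institutions or protected areas if the land mask is replaced by a mask of the feature of interest.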

| Environmental bias
To understand the effects of sampling gaps and sampling bias on biodiversity models, such as SDMs and their predictive capacity, it is essential to quantify the marine environmental heterogeneity in the study area and contrast it with the conditions sampled in each record of biodiversity databases. To test the environmental bias of marine biodiversity sampling, we start from the premise that in our study area, the total absence of vertebrate, invertebrate or algal species is unlikely. Therefore, we postulate that each existing environmental condition must support at least one of these species. Under ideal sampling conditions, which would be random and uniformly distributed in space, we expect sampling to reflect all environmental conditions of the studied area. The lack of representation of certain environmental conditions in our samples may indicate a bias in collection, resulting in an environmental bias that would negatively affect the accuracy of predictive biodiversity models, such as SDMs.
To characterize the marine environmental heterogeneity in the study area, we used data from the Bio-Oracle Marine data layers for ecological modelling platform (https://www.bio-oracle.org/), which includes surface variables and benthic data with minimum, mean, maximum and range values. The variables include temperature, salinity, current velocity, ice concentration, nutrients (nitrate, phosphate, silicate), dissolved oxygen, iron, chlorophyll, phytoplankton, primary productivity, calcite, pH, photosynthetic radiation, diffuse attenuation and cloud coverage. Additionally, we used bioclimatic variables from the Climatologies at High Resolution for the Earth's Land Surface Areas (CHELSA) platform (https://chelsa-climate.org/) to serve as a proxy for variables related to the environmental conditions of the oceanic water surface. These variables encompass aspects such as mean annual temperature, mean diurnal range, isothermality, temperature seasonality, maximum temperature of the warmest month, minimum temperature of the coldest month, annual temperature range and others. The bathymetric data were obtained from the National Oceanic and Atmospheric Administration (NOAA; https://www.noaa.gov/). In total, 134 layers of environmental variables (Appendix S1) were used to describe the marine environmental heterogeneity in the study area, which were adjusted to the defined limits of the research area. All variables were used at a spatial resolution of 5 km per pixel. Due to the extensive availability of data, it was necessary to reduce the number of variables for a better understanding of environmental heterogeneity and to reduce multicollinearity among the variables. To achieve this, we performed a spatialized Principal Component Analysis (PCA). This analysis synthesizes the variables, combining them through their intrinsic correlations and generating axes (vectors) of values that represent all variables in a summarized way. Thus, it is possible to interpret the environmental conditions without the need to analyse each variable individually. The PCA provided a comprehensive view of the environmental conditions and contributed to the understanding of the spatial heterogeneity in the study area. The significant axes of the PCA (axes that present a higher percentage of explanation than expected by chance) (Nielsen, 2000) were used to represent the environmental characteristics of the study area. To assess the presence of any environmental bias within the samples, we used the geographic coordinates of each occurrence record (regardless of taxonomic identification) to spatially intersect these data with the significant environmental PCA axes. This approach allowed us to extract the environmental condition values at the sampled locations, thereby obtaining a distribution of the observed environmental conditions within the biodiversity samples. The distribution of environmental conditions for the whole study area was also computed and compared with the distribution obtained for the biodiversity samples. This comparison was performed for all samples and for vertebrates, invertebrates and algae separately. The BioDinamica package was used to generate the PCA maps (Oliveira, Soares-Filho, Leitão, et al., 2019).
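The comparison between available and sampled conditions can be sketched for a single PCA axis as follows. The example bins the axis values of all pixels in the study area and of the pixels that contain occurrence records, and then reports the bins that exist in the environment but have no records. The function names, the equal-width binning and the default of 10 bins are assumptions made for illustration, not the procedure implemented in BioDinamica.

```python
def binned_fractions(values, lo, hi, n_bins):
    """Fraction of values falling in each of n_bins equal-width bins on [lo, hi]."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        idx = int((v - lo) / width)
        idx = max(0, min(idx, n_bins - 1))  # clamp edge and out-of-range values
        counts[idx] += 1
    total = len(values)
    if total == 0:
        return [0.0] * n_bins
    return [c / total for c in counts]


def unsampled_bins(env_values, sample_values, n_bins=10):
    """Indices of bins present in the environment but absent from the samples.

    env_values: PCA-axis values of every pixel in the study area.
    sample_values: PCA-axis values extracted at occurrence locations.
    """
    lo, hi = min(env_values), max(env_values)
    env = binned_fractions(env_values, lo, hi, n_bins)
    smp = binned_fractions(sample_values, lo, hi, n_bins)
    return [i for i in range(n_bins) if env[i] > 0 and smp[i] == 0]
```

Under random sampling, the two sets of fractions would be expected to be similar, so large differences between them, or empty sample bins, point to environmental bias.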

| Relationship between sampling effort and biodiversity metrics
The non-uniformity of the sampling effort can directly affect biodiversity measures, such as richness and endemicity (Oliveira et al., 2016). To estimate the effect of the sampling effort on richness and endemicity, we mapped these diversity metrics and checked their correlation with the sampling effort (kernel map, described in the sampling effort section). For this, we used Pearson correlation analysis with degrees of freedom corrected for spatial autocorrelation (Clifford et al., 2012). To map richness, we counted the number of species per sampling unit (hexagon). For endemicity mapping, we used the corrected weighted endemism method (Williams & Humphries, 1994), as this method reduces the effect of richness on the weighted endemism index. For both metrics, uniform sampling units were used: hexagons of equal area (150 km radius). As the size of the sampling unit can directly interfere with the mapping of richness and endemism, we performed tests to calibrate this parameter. The tested hexagon radii were 100, 150, 200 and 300 km. We chose the hexagon size of 150 km (Appendix S1), as this size was large enough to group a sufficient number of records but not so large as to generalize patterns over very large areas. In the analyses that involved only the estimation of sampling effort, we used all available records with geographic information, even if they had not been identified to species. In the analysis of species richness patterns and endemicity, we only used records identified to species level with valid names. The BioDinamica package was used to perform the richness and endemism analyses (Oliveira, Soares-Filho, Leitão, et al., 2019).
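The two metrics can be sketched with a small example. It assumes that each record has already been assigned to a sampling unit (the `cell_id` values are hypothetical) and uses a commonly cited formulation in which weighted endemism is the sum, over the species in a unit, of the inverse of the number of units each species occupies, and corrected weighted endemism is that sum divided by richness. The BioDinamica implementation may differ in detail.

```python
from collections import defaultdict


def richness_and_cwe(records):
    """Richness, weighted endemism (WE) and corrected WE (CWE) per sampling unit.

    records: iterable of (cell_id, species_name) pairs, duplicates allowed.
    """
    species_by_cell = defaultdict(set)
    cells_by_species = defaultdict(set)
    for cell, species in records:
        species_by_cell[cell].add(species)
        cells_by_species[species].add(cell)

    result = {}
    for cell, species_set in species_by_cell.items():
        richness = len(species_set)
        # Each species contributes the inverse of its range size (in units).
        we = sum(1.0 / len(cells_by_species[sp]) for sp in species_set)
        result[cell] = {"richness": richness, "we": we, "cwe": we / richness}
    return result
```

In this formulation a unit holding only narrow-ranged species approaches a CWE of 1, whereas a unit holding only widespread species approaches 0, regardless of how many species it contains.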

| Species occurrence database
A total of 9,960,067 occurrences were obtained (Figure S1), of which 77% were from GBIF, 22% from OBIS and 1% from the Brazilian experts' databases. Among these records, 4,132,448 had taxonomic identification with valid species names, representing 41% of the dataset. The GBIF data represented most of the occurrences for most groups. Vertebrates accounted for 66% of the occurrences, while invertebrates made up 31%, and algae represented only 3% of the records. Among invertebrates, Malacostraca had the highest number of occurrences (760,078). In the algae group, Chlorophyta had the highest number of occurrences (126,858). For vertebrates, the fish group Tetraodontiformes had the highest number of occurrences (4,366,398). The database compiled a total of 22,811 species, with 20,055 being invertebrates (88%), 1618 vertebrates (7%) and 1138 algae (5%). All data are available at https://zenodo.org/records/10779140.

| Sampling effort and sampling bias
The regions with the highest occurrence densities for all groups were the Florida region (USA), Haiti and the Lesser Antilles (Figure 1a).
Areas with a high density of records for both invertebrates and algae were the Florida region, Lesser Antilles, the north coast of Colombia and Venezuela, southeast Brazil and the southern limit of Brazil (yellow areas in Figure 1a). For invertebrates, exclusively, the regions with the highest occurrence densities were the Gulf of Mexico, Cuba, and the San Jorge Gulf in Argentina (red areas in Figure 1a).
The spatial pattern of occurrence density for vertebrates showed moderate correlation with the density patterns of invertebrates and algae (r < .57). Invertebrate occurrence density had moderate correlation with algae (r = .65). The USA has the highest number of occurrences for all groups and has the highest sampling density for vertebrates and invertebrates (81% of the total records). Only algae had a higher sampling density in Colombia. Costa Rica had the lowest sampling density among all countries. Coastal regions with the most significant sampling gaps, for all groups, were the coasts of Honduras, Nicaragua, Costa Rica, Guyana, Suriname, French Guiana, the northern portion of Brazil, Uruguay and most of the Argentine coast (Figure 1a).
Only the first three axes of the PCA of predictor variables of the SAR were used, as only these were significant, representing 92.91% of the variation of these variables. The SAR was significant in explaining the sampling bias (p < .001), and the spatial pattern of the sampling bias is mainly explained by the spatial aggregation of the data (Rho = .63), indicating the concentration of sampling in certain locations (Appendix S1). Among the predictor variables, the only one that had a significant effect in explaining the sampling bias was the first axis of the PCA (p = .03), which was highly correlated with all the predictor variables: positively with distance from the coast, distance from research centres, distance from ports, distance from protected areas and distance from urban areas, and negatively with the maximum depth of the collection site (Appendix S1) (Figure 2).
The highest occurrence density was found in shallow waters, at depths less than 50 m (Figure 2). Sampling was therefore concentrated in certain conditions: close to the coast, research institutions, protected areas and urban areas (Figure 2).

| Environmental bias
The first three axes (the only significant ones) of the PCA represent 99.8% of the observed variation in environmental variables. The environmental conditions are extremely variable across the study area (Figure 3). There is greater uniformity of environmental conditions along the coast of Argentina (Figure 3). The northern coast of South America and the Caribbean region show high environmental heterogeneity, while the North American coast also exhibits greater environmental uniformity (Figure 3). The 134 environmental variables used showed a high correlation among them. This high correlation allowed the representation of the variables in just a few axes of variation (Appendix S1). The samples from all groups present a strong environmental bias, as they do not represent all conditions present in the environment (Figure 3). The highest values in the three PCA axes are underrepresented or even absent in the species records. Some of the most frequent environmental conditions (PCA1 in Figure 3b) are not sampled in the records of any of the studied groups.

| Relationship between sampling effort and biodiversity metrics
There is a strong correlation between sampling density and endemism (r = .98), and a weaker one between sampling density and richness (r = .51) for all samples (Figure 1b). Invertebrates and algae show a high correlation between sampling density and both richness and endemism (.81 and .91 for invertebrates and .79 and .94 for algae, respectively). Vertebrates exhibit a high correlation of sampling density with endemism (r = .99) and a low one with richness (r = .40) (Figure 1).

| DISCUSSION
In this study, we have demonstrated that despite notable discrepancies in sampling efforts among vertebrates, invertebrates and algae, all groups are equally affected by a strong sampling bias related to accessibility. The sampling patterns of vertebrates show little similarity compared to invertebrates and algae, while the latter two present more similar patterns to each other, indicating that the spatial distribution of sampling efforts is not strongly influenced by the same factors across these groups. This dissimilarity can be attributed to the concentration of researchers and marine research centres, each focusing on specific groups, in different locations (Hopkins & Freckleton, 2002; Paknia et al., 2015). For instance, the presence of a significantly larger number of specialists in invertebrates and algae in certain regions may explain the greater similarity in their sampling patterns, whereas vertebrates may be more extensively studied in other areas, leading to a higher spatial variation in their sampling efforts (Paknia et al., 2015).
The United States, despite not having its entire Atlantic coast considered by this study, stands out as the country with the highest number of occurrences for all groups of organisms, also showing the highest sampling density for vertebrates and invertebrates, totalling 81% of the total occurrences, which likely biases our understanding of marine biodiversity in pan-tropical latitudes. On the other hand, the coastal regions of Honduras, Nicaragua, Costa Rica, Guyana, Suriname, French Guiana, northern Brazil, Uruguay, and most of the Argentine coast have the largest sampling gaps.
Since financial investment in scientific research strongly influences the patterns of scientific output in countries (Lancho-Barrantes et al., 2021), this discrepancy in sampling density is likely related to the differences in national budgets for scientific research (Courtland, 2008; Ebadi & Schiffauerova, 2016). As the ease of access to sampling sites is a determinant factor for the sampling bias, this pattern could be related to issues of limited scientific funding, which would also explain the higher sampling density in the United States (Courtland, 2008; Ebadi & Schiffauerova, 2016).
The USA invests significantly more in biodiversity and biomedical research than Latin America, with examples such as the USA's investment of 2.80% of its Gross Domestic Product in Research and Development compared to just up to 0.76% in Latin America (Michán & Llorente-Bousquets, 2009). The results presented here concerning sampling intensity and bias and their relationship with logistical factors and the presence of research centres provide evidence that financial resources for research in different locations can be postulated as a plausible explanation for the sampling bias.
It is possible that this conclusion is more applicable to the non-developed countries, where sampling challenges and the need for financial investment in research are greatest (Pitman, 2010; Rodríguez, 2003), while the results suggest that in the USA the pattern may be less biased.
The patterns of richness and endemism are strongly influenced by the intensity of sampling effort. This relationship can compromise the mapping of actual patterns of biological diversity (Oliveira et al., 2016) and limit their use in conservation strategies (Oliveira et al., 2017; Oliveira, Soares-Filho, Santos, et al., 2019).
Interestingly, the endemism patterns for all groups showed a strong correlation with sampling effort, unlike what was observed for terrestrial organisms, which showed a weak correlation with endemism (Oliveira et al., 2016). On the other hand, vertebrates exhibited the lowest correlation between richness and sampling effort, diverging from the patterns observed for terrestrial organisms, which usually show high correlation across all groups (Oliveira et al., 2016). Thus, data from marine vertebrates are less biased than those from other groups in relation to measures of species richness, which should be considered in macroecological and conservation analyses.
The strong environmental bias found in the samples of all studied groups indicates that the sampling does not represent all environmental conditions present in the study area, suggesting that certain specific environments are being more frequently sampled at the expense of others, as observed in terrestrial environments (Oliveira et al., 2016). The environmental bias in the samples can influence the perception of species diversity and distribution, affecting the accuracy of estimates of biological diversity and the predictive capacity of ecological models (Castro, 2019; Cosentino & Maiorano, 2021; Syfert et al., 2013). The absence of sampling in certain environmental conditions can lead to an underestimation of biodiversity and an incomplete understanding of ecological and evolutionary patterns and processes occurring in the region (Henderson et al., 2022; Jones et al., 2021). Therefore, it is crucial to consider these environmental biases when analysing biodiversity data, planning new inventories and developing conservation strategies. Our methodological approach to estimating environmental bias has limitations in estimating how much this bias can affect specific models, such as SDMs for certain species. Thus, some species and some models may be more or less affected by this bias, making it necessary for future studies to estimate these specific effects of environmental bias in these areas.
Additional sampling efforts in underrepresented areas are necessary to obtain more accurate estimates of biological diversity and to improve our conservation strategies, aiming for the effective protection of marine environments and their fragile ecosystems.
In summary, the results highlight the importance of quantifying sampling biases and identifying their causes, especially considering the spatial distribution of records, their relationship with diversity metrics and environmental conditions. This information is crucial for more reliable estimates of biological diversity and for the development of more effective conservation strategies. Moreover, the study emphasizes the need to expand collections and research in regions with sampling gaps, especially in tropical areas known for their high biodiversity and logistical challenges for sampling. This requires increasing financial investment in research in these regions. Understanding patterns of record density and sampling gaps is a crucial initial step to enhance our knowledge of marine biodiversity and to develop more comprehensive and effective conservation actions. Building on these findings, future studies can better address the challenges of conserving marine ecosystems, contributing to the preservation of these valuable environments and the sustainability of the resources they provide.

FIGURE 1 Sampling effort and its correlation with diversity metrics. (a) RGB map of sampling effort, where red tones indicate predominant sampling of invertebrates, blue for vertebrates and green for algae. Yellow areas represent a high-density sampling of invertebrates and algae, pink tones indicate locations with a high-density sampling of invertebrates and vertebrates and black areas indicate the absence of sampling. The numbers on the arrows indicate the correlation of the sampling effort for all groups with (b) richness and (c) endemism for all groups; (d) correlation of the sampling density for each group with species richness; (e) correlation of the sampling density for each group with endemism.

FIGURE 2 Accessibility variables in the study area versus marine biodiversity records; black lines indicate number of pixels from the study area, and red lines indicate data from biodiversity samples in relation to distance. If the biodiversity samples were random, the red lines should show the same patterns observed in the study area (black lines).

FIGURE 3 Environmental heterogeneity of the Western Atlantic and the environmental bias of the samples; (a) RGB map of the first three axes of PCA of environmental variables; (b) Number of records of values of the PCA axes in the environment (black lines) and in the samples of all groups (grey lines), invertebrates (red lines), vertebrates (blue lines) and algae (green lines).

AFFILIATIONS
1 Instituto de Ciências do Mar (Labomar), Universidade Federal do Ceará, Fortaleza, Brazil
2 LEDALab, Departamento de Ciências Biológicas, Universidade Estadual