The implications of estimating rarity in Brazilian reptiles from GBIF data based on contributions from citizen science versus research institutions

Understanding the distribution of rare species is important for conservation prioritisation. Traditionally, museums and other research institutions have served as depositories for specimens and biodiversity information. However, estimating abundance from these sources is challenging due to spatiotemporally biased collection methods. For instance, large‐bodied reptiles that are found near research institutions or in popular, easily accessible sites tend to be overrepresented in collections compared to smaller species found in remote areas. Recently, a substantial number of observations have been amassed through citizen (or community) science initiatives, which are invaluable for monitoring purposes. Given the unstructured nature of this sampling, these datasets are often affected by biases, such as taxonomic, spatial and temporal preferences. Therefore, analysing data from these two sources can lead to different abundance estimates. This study compiled data on Brazilian reptile species from the Global Biodiversity Information Facility (GBIF). It employed a community‐ecology approach to analyse data from research institutions and citizen science initiatives, separately and collectively, to assess taxonomic and spatial species coverage and predict species rarity. Using a 1‐degree hexagonal grid, we analysed the spatial distribution of reptile communities and calculated rarity indices for 754 reptile species. Our findings reveal that 87 species were exclusively recorded in the citizen science subset, while 212 were recorded only by research institutions. The number of observations per species in the citizen science data followed a Gambin distribution, which aligns with the expected pattern of abundance in natural communities, unlike the data from research institutions. This suggests that citizen science data may be a more accurate source for estimating species abundance and rarity.
The discrepancies in rarity classifications between the datasets were likely due to differences in sample size and potentially other sampling parameters. Nevertheless, combining data collected by both research institutions and citizen science initiatives can help to fill knowledge gaps in reptile species occurrence, thus enhancing the foundation for conservation efforts on a national scale.

KEYWORDS: biodiversity monitoring, citizen science, Neotropics, rarity, semi-structured data

Plain language summary
Accurately classifying rare species is essential for guiding conservation actions. Traditional data collection for biodiversity, often based on museum specimens originating from expeditions conducted by professional scientists, does not accurately reflect true patterns of species abundance. These efforts are frequently limited by financial constraints and logistical issues that restrict spatial and taxonomic coverage. Sampling is particularly intensive in areas near professionally employed scientists' home institutions, field bases or museums. In contrast, citizen science, in which members of the public engage in scientific activities, has revolutionised the way species occurrence data are collected. Over the past two decades, volunteers have increasingly contributed observations from locations around the world that are often overlooked by paid scientists, thereby generating large occurrence datasets. By combining citizen-generated observations with data from research institutions, we can enhance our understanding of the distribution of reptile species across Brazil. Our study reveals differences in the number of observations per species between the two data subsets, with citizen science providing a more accurate indication of species rarity. Therefore, citizen science can broaden our knowledge of species abundance while also supporting more effective conservation actions on a larger scale.

| INTRODUCTION
In the face of the current global biodiversity crisis, understanding species distributions and population sizes is increasingly critical (Hortal et al., 2015; Nori et al., 2023). Data on species distribution can elucidate the environmental requirements of a species by examining its fundamental niche (Kearney, 2019; Takola & Schielzeth, 2022). When a species is suspected to be declining, these data are key for informing conservation actions (White et al., 2023). Many studies have focused on this issue to inform decision-making in this area (Kondratyeva et al., 2019; Loiseau et al., 2020). The need to systematically assess the condition of wildlife populations and related threats on a global scale was first recognised in the 1960s (Mace, 1994), leading to the publication of the IUCN Red List of Threatened Species in 1991. This list is based on the systematic assessment of species extinction risk (Mace et al., 2008). Currently, taxa are assigned to one of eight categories (from Least Concern to Extinct) based on geographic range, population trend, size and structure, as well as their temporal trends (IUCN Standards and Petitions Committee, 2022). Several of these criteria are linked to population size and focus on aspects such as decline (criterion A), ongoing decline or extreme fluctuations (B and C), and range, including severe fragmentation (B) and the extent of occurrence or area of occupancy (B and C; IUCN Standards and Petitions Committee, 2022). Nevertheless, a large number of species worldwide have yet to be evaluated, particularly in megadiverse countries, due to the absence of data, restricted data access, and inaccuracies in datasets concerning population size and geographic distribution (Hochkirch et al., 2020). Furthermore, insufficient funding often drastically limits the rate of such assessments (Juffe-Bignoli et al., 2016; Rondinini et al., 2014). Conservationists need efficient analytical tools to provide evidence of high extinction risks and to prioritise actions that reduce the number of Not Evaluated and Data Deficient species. Species that are poorly understood are frequently overlooked when allocating conservation resources (Woinarski et al., 2021). Considering that species most vulnerable to extinction are often naturally rare (Harnik et al., 2012), identifying such species can serve as an indicator of potential threats and can flag species as conservation priorities (Gauthier et al., 2010; Veach et al., 2017).
Although rarity is often intuitively interpreted as a species having few individuals, it is, in reality, a complex, multidimensional concept. To address this complexity, Rabinowitz (1981) proposed seven types of rarity based on three properties: (1) geographic distribution (widespread vs. narrow-ranged species), (2) habitat specificity and (3) local population size (abundance-based rarity), along with their various combinations. For instance, a rare species might be characterised by a small population that is widely dispersed geographically or by an abundant population restricted to a limited habitat area (Rabinowitz, 1986). This classification framework is still frequently used, for instance, for plants at regional (Quiroga & Souto, 2022) and national levels (Choe et al., 2019) and for deep-sea bivalves (McClain, 2021), serving as a base for further theoretical work (Maciel, 2021). Under nonexperimental conditions, the uneven distribution of individuals among species is a common pattern observed in biological communities (Magurran, 2004). Typically, a few species are dominant while many others are rare, causing the abundance distribution curve for communities to fit a near-logarithmic shape (McGill et al., 2007). This abundance distribution pattern can be considered a natural power law for biodiversity datasets and can be explained by the different abilities of species to access resources in a given space (Marquet et al., 2007). However, inaccurate or incomplete knowledge of species rarity can lead to erroneous allocation of resources or impede conservation actions (Dibner et al., 2017; Katzner et al., 2011). This is particularly concerning in countries with high biodiversity that are often poorly inventoried. Therefore, there is a growing demand for increased biodiversity surveying, especially in nations that harbour global biodiversity hotspots facing threats from escalating deforestation pressure and the effects of climate change (Habel et al., 2019; Kong et al., 2021).
In response to this urgent demand, more efficient methods need to be adapted to aid in the understanding of biodiversity distribution and trends across extensive geographic scales. Data produced by citizen science have been making substantial contributions to biodiversity monitoring at the global scale (Chandler et al., 2017; Johnston et al., 2023; Mesaglio et al., 2023). These efforts have also influenced public policy (Fritz et al., 2019; Roger et al., 2023) and raised awareness among both the public and policymakers (Danielsen et al., 2014). With the widespread adoption of internet-enabled smartphones and the development of user-friendly applications to record biodiversity, members of the public, working in a nonprofessional and unpaid capacity, have been documenting the location of species worldwide (Deacon et al., 2023; Pocock et al., 2024; Tulloch, Possingham, et al., 2013). These records are often accompanied by photographic and video documentation, providing valuable secondary information (Klinger et al., 2023; Pernat et al., 2024).
Nevertheless, the spatial adequacy (Backstrom et al., 2024) and overall quality of citizen-science data are highly variable, influenced by the heterogeneous behaviour of the observers (Callaghan, Poore, Hofmann, et al., 2021; Pocock et al., 2023) and the accuracy of observations (Gorleri & Areta, 2022; Gorleri et al., 2023). Spatiotemporal biases, particularly those related to observer behaviour, can be identified by comparing unstructured (or semistructured) data with results from structured surveys (Balázs et al., 2021; Szabo et al., 2012). This comparison facilitates the calibration of different datasets, making them suitable for trend analysis and enhancing their reliability (Forti et al., 2024; Hertzog et al., 2021). In fact, the integration of citizen-science data and structured surveys has been shown to offer effective complementary insights (Dimson et al., 2023; Robinson et al., 2020; Tulloch, Mustin, et al., 2013).
The Global Biodiversity Information Facility (GBIF) collates georeferenced species occurrence data from a variety of sources or research institutions, including academic institutions, government research facilities, museums, herbaria, as well as various fauna and flora inventories (henceforth referred to as RI data). A second stream of data originates from citizen-science initiatives (henceforth referred to as CS data). GBIF currently hosts over 1.5 billion records for taxa from around the globe (gbif.org). Despite this extensive database, GBIF data are not without limitations, some of which are inherent to the database itself (i.e. biases originating from the amalgamation of datasets collected through different methods), and others that stem from the incoming data, such as taxonomic bias, identification errors and incorrect or missing geographic coordinates (Petersen et al., 2021; Troudet et al., 2017). For smaller datasets, some of these issues can be mitigated by manual data cleaning, while for larger ones, automated filtering techniques are applied (Zizka et al., 2020). Additionally, the quality of different taxonomic and geographical subsets varies significantly (Szabo et al., 2023). In spite of these limitations, GBIF provides a viable alternative to conducting surveys of ecological communities at large spatial scales, particularly in biodiverse countries with limited scientific knowledge and resources (Heberling et al., 2021; Ivanova & Shashkov, 2021).
In regions such as Europe, the United States and Australia, the volume of data generated through citizen science has significantly contributed to the assessment of population trends across various taxa, including snakes (Santos et al., 2022), bats (Barlow et al., 2015) and birds (Fink et al., 2020; Szabo et al., 2010). In Brazil, on the other hand, the integration of citizen science into biodiversity research is still relatively nascent. Despite this, certain taxonomic groups, particularly birds and amphibians, have been receiving disproportionately high interest (Forti & Szabo, 2023). Citizen science has notably advanced our understanding of reptile distribution in Brazil, discovering previously unknown areas of occurrence (Oliveira et al., 2023). Yet, in spite of these advances, citizen science data have not been formally incorporated into decision-making processes in the country. An important initial step in leveraging these data for conservation and policy-making is to evaluate whether observations from citizen science can accurately reflect population sizes and species distributions and whether they can reliably classify species as 'rare' or 'common'.
Reptiles represent one of the most diverse animal groups on the planet, and Brazil stands as a significant biodiversity hotspot, holding the third-highest diversity of reptiles globally with 856 species, a number that continues to rise as new species are described each year (Guedes, Entiauspe-Neto, et al., 2023). Within these species, there is a broad range of environmental tolerances. Some species exhibit wide environmental tolerance, enabling them to coexist in diverse habitats, while others, with specialised habits, can only survive within narrowly defined environmental conditions and thus have restricted distributions (Birskis-Barros et al., 2019). In spite of their diverse adaptations, reptiles are threatened worldwide due to habitat loss caused by expanding agriculture, deforestation and urban development, as well as illegal exploitation (Böhm et al., 2013). Consequently, 23.5% of reptile species are at some risk of extinction globally, with the figure standing at 14.4% in Brazil (IUCN, 2024). A high proportion (73%) of Near Threatened and threatened Brazilian species are classified under Criterion B concerning geographical range and population trends (IUCN, 2024).
Reptiles are integral to diverse trophic interactions within ecosystems. Through their activities as grazers, browsers, apex predators and scavengers, they play important roles in trophic networks, facilitating the balance and functioning of these systems (Pinto-Coelho et al., 2021). Beyond these roles, reptiles also contribute to other ecological processes, including seed dispersal, pollination, nutrient cycling and ecosystem engineering by creating habitats such as burrows and pools, which serve other species (Miranda, 2017). The socioeconomic importance of reptiles is equally notable. They contribute to tourism (Cohen, 2019), their bioactive compounds are used in pharmacological research (Mishra et al., 2020), and in Brazil, they also serve as a protein source for rural communities (Cajaiba et al., 2015). Considering their ecological and socioeconomic roles, coupled with the threats they face, the conservation of reptiles should be prioritised, particularly in tropical countries where biodiversity is rich, and the impacts of biodiversity loss can be profound (Miranda, 2017).
Data deposited in research institutions often originate from localised studies that focus on one particular species or on a small number of related species. Due to variations in the design and aims of these studies, certain species may be overrepresented, while many others remain neglected, leading to potential biases in the data collected (Meineke & Daru, 2021). In contrast, citizen science initiatives often employ gamified apps to motivate volunteers to collect observations of a diverse array of species across larger geographic scales (Callaghan, Poore, Mesaglio, et al., 2021; Sandbrook et al., 2015). As representation often reflects availability, rare or less detectable species will have fewer observations than dominant and conspicuous species (Johnston et al., 2018). Considering that RI and CS biodiversity data are generally collected using differing methodologies (aims, design, and scale), we hypothesised that the two datasets would yield different relative species abundance estimates. In particular, we predicted that CS data can provide more accurate estimations of species relative abundance than data contributed by research institutions. We tested this hypothesis through a community ecology approach, using the number of observations (records) as a proxy for species abundance in local communities to compare species abundance and rarity estimates based on reptile occurrence data from Brazil, as presented in GBIF, contributed by CS versus RI. We also discuss the limitations of GBIF data, particularly in relation to the biases associated with these two types of data contribution.

| Data collection and organisation
We downloaded data on reptile occurrences in Brazil from GBIF (https://doi.org/10.15468/dl.j7ajhx) on 30 October 2023. Following the taxonomy and species distribution in the Reptile Database (http://www.reptile-database.org/), we verified species names and their classification as native or exotic to Brazil, considering them exotic if the country was not listed in their native distribution. To ensure data accuracy, we removed observations that displayed taxonomic inconsistencies (e.g. non-recognisable synonymy). We also eliminated duplicate observations of the same species that occurred at the same geographic location on the same day using the distinct function of the dplyr package (Wickham et al., 2022) in R version 4.2.1 (R Core Development Team, 2022).
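The duplicate-removal rule described above (same species, same location, same day) can be illustrated with a minimal Python sketch. This is not the authors' code, which used dplyr's distinct function in R; the field names `species`, `lat`, `lon` and `date` and the example records are assumptions made only for illustration.

```python
def drop_duplicate_observations(records):
    """Keep the first record for each (species, lat, lon, date) combination."""
    seen = set()
    unique = []
    for rec in records:
        # Hypothetical field names; a real GBIF export uses its own column names.
        key = (rec["species"], rec["lat"], rec["lon"], rec["date"])
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


# Made-up example records, not data from the study.
records = [
    {"species": "Salvator merianae", "lat": -23.55, "lon": -46.63, "date": "2022-01-10"},
    {"species": "Salvator merianae", "lat": -23.55, "lon": -46.63, "date": "2022-01-10"},  # exact duplicate
    {"species": "Salvator merianae", "lat": -23.55, "lon": -46.63, "date": "2022-01-11"},  # different day, kept
    {"species": "Hemidactylus mabouia", "lat": -22.91, "lon": -43.17, "date": "2022-01-10"},
]
unique_records = drop_duplicate_observations(records)
```

Keeping the first occurrence mirrors the usual behaviour of row-wise de-duplication; whether coordinates should be rounded before comparison is a separate choice that the text does not specify.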
We classified observations as originating from CS when the institutional code was (1) BioDiversity4All, (2) Diveboard, (3) iNaturalist or (4) naturgucker. All other observations were designated as RI data. To organise the data spatially, we used the geographical coordinates of each observation and aggregated them using a hexagonal grid generated over the extent of Brazil (in angular geographic coordinates, EPSG 4674). The grid was configured with a horizontal and vertical spacing of 1 degree each, resulting in 1188 grid cells. We overlaid reptile records from GBIF on the grid using QGIS v. 3.28.5 (QGIS Development Team, 2021). Using these spatial units, we assigned a unique grid cell ID to each reptile observation. We adopted a community ecology approach, defining each grid cell as a local community based on observations submitted from the same grid cell. Within this framework, the number of observations per species in each community was used as a proxy for the relative abundance of a particular species (Callaghan et al., 2024). After this spatial organisation, we excluded non-continental observations (i.e. those submitted from oceanic locations or from oceanic islands) when calculating rarity indices. Therefore, we did not calculate rarity indices for three island endemic species (Amphisbaena ridleyi, Bothrops insularis and Bothrops sazimai). Finally, to assess the representativeness of observations across different biomes, we overlaid species occurrence data with a layer representing the six major Brazilian biomes: Amazonia, Atlantic Forest, Caatinga, Cerrado, Pampa and Pantanal, according to the Brazilian Institute of Geography and Statistics (IBGE, 2019). This analysis enabled us to determine the frequency of each species within these biomes.
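As a rough illustration of the CS versus RI split described above, the following Python sketch labels each record by its institution code. It is a sketch under stated assumptions rather than the authors' procedure: the exact spelling and capitalisation of the codes in the GBIF download are assumed, and `MUSEUM_X` is a hypothetical placeholder for any non-CS provider.

```python
from collections import Counter

# Assumption: these strings match the institution codes exactly as they appear in the download.
CS_INSTITUTION_CODES = {"BioDiversity4All", "Diveboard", "iNaturalist", "naturgucker"}


def classify_source(institution_code):
    """Return "CS" for the four citizen-science codes, "RI" for everything else."""
    code = (institution_code or "").strip()
    return "CS" if code in CS_INSTITUTION_CODES else "RI"


def count_by_source(institution_codes):
    """Tally how many records fall into each subset."""
    return Counter(classify_source(code) for code in institution_codes)


# Made-up example values; None stands for a record with no institution code.
example_codes = ["iNaturalist", "MUSEUM_X", "naturgucker", None, "iNaturalist"]
source_counts = count_by_source(example_codes)
```

Treating records with a missing code as RI follows the rule that all other observations were designated as RI data.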
We prepared the data for analysis by compiling three ecological matrices, with species represented as columns and grid cell IDs (local communities) as rows. We created separate tables for the two subsets: one comprising observations from citizen science initiatives and another containing observations from professional researchers only. We also prepared a combined table using the full data set. We estimated the probability of each species being classified as common or rare based on the species abundance and number of grid cells occupied. Typically, a rare species is expected to have a low number of observations (low abundance) and to be absent from most local communities. We used the fuzzy clustering algorithm from the FuzzyQ package to quantify community-level coherence in the classification of species into common and rare clusters (Balbuena et al., 2021). This method simultaneously evaluates the dissimilarities in occupancy and abundance, producing indices of commonness ($C_i$) and rarity ($R_i$). These indices are derived from dissimilarity indices that reflect the probability of a given species, denoted as species i, being categorised as common and rare, respectively (Gower, 1971).
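The community matrix described above, and the two quantities that feed the common/rare classification (occupancy and abundance), can be sketched in plain Python as follows. This only prepares the inputs; it does not reproduce the fuzzy clustering performed by the FuzzyQ package. The function names and the toy observations are hypothetical.

```python
from collections import defaultdict


def community_matrix(observations):
    """Build a cells-by-species count matrix from (cell_id, species) pairs."""
    counts = defaultdict(int)
    for cell_id, species in observations:
        counts[(cell_id, species)] += 1
    cells = sorted({cell for cell, _ in counts})
    species = sorted({sp for _, sp in counts})
    # Rows are grid cells (local communities), columns are species.
    matrix = [[counts.get((cell, sp), 0) for sp in species] for cell in cells]
    return cells, species, matrix


def occupancy_and_abundance(matrix):
    """Per species: number of cells occupied and total number of observations."""
    n_species = len(matrix[0]) if matrix else 0
    occupancy = [sum(1 for row in matrix if row[j] > 0) for j in range(n_species)]
    abundance = [sum(row[j] for row in matrix) for j in range(n_species)]
    return occupancy, abundance


# Toy data: species "A" is widespread and frequent, species "B" is seen once.
observations = [(1, "A"), (1, "A"), (1, "B"), (2, "A"), (3, "A")]
cells, species, matrix = community_matrix(observations)
occupancy, abundance = occupancy_and_abundance(matrix)
```

In this toy case species "B" has both low occupancy and low abundance, which is the profile the text associates with a rare species.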
Having obtained fuzzy quantification for each species present in each subset, we proceeded to identify species that were allocated to different clusters. Next, using cell ID, species name, geographical coordinates, biome and abundance per grid cell, we calculated four Rabinowitz rarity indices (GRI, HSI, PSI and RR) for each species across the two subsets and the full data set. These calculations were conducted using the rrindex package (Maciel, 2021). These indices are based on three dimensions of rarity: geographic range index (GRI), habitat specificity index (HSI) and population size index (PSI). We considered the number of biomes occupied by each species as a measure of habitat specificity and the absolute number of observations per grid cell as an indicator of species abundance. The fourth index calculated using this package was the rarity index (RR), which is the central axis of these dimensions, representing a synthesis of the three other indices calculated as RR = med(GRI+HSI+PSI) (Maciel, 2021).
We compiled data on global threat status (IUCN, 2024) using the rredlist package (Chamberlain, 2020). We cross-referenced these data with entries in the GBIF database and directly consulted the IUCN Red List website for taxa the package could not categorise (https://www.iucnredlist.org/). The five criteria used by IUCN are based on geographical range and population size (IUCN Standards and Petitions Committee, 2022). Generally, species that are more threatened are also rarer than those classified as Least Concern. For the purpose of this study, we classified the threat status of each species as (1) non-threatened (i.e. Least Concern) or (2) Near Threatened and threatened (Vulnerable, Endangered, and Critically Endangered). We included four species classified as Lower Risk/Near Threatened and two Lower Risk/Conservation Dependent species in the second category. Unfortunately, we were unable to find information for 177 species, and 21 were categorised as Data Deficient. These species were excluded from the threat category calculations. To represent the relationship between rarity and commonness visually, we plotted rarity-commonness indices on a scatterplot using the full data set. We computed species completeness based on the latest list of Brazilian reptiles (Guedes, Entiauspe-Neto, et al., 2023).

| Data analysis
We compared the proportion of common species in the two (CS and RI) subsets using a $\chi^2$-test through the chisq.test function in R. To assess the consistency of species classification between the two subsets and the full data set (see Table S1), we performed multiple correlation analyses. We used the Spearman method through the cor.test function in R to evaluate the relationships between the two subsets and the full data set, focusing on the commonness index and the four Rabinowitz rarity indices.
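To make the Spearman comparison concrete, here is a minimal Python sketch of the rank correlation coefficient, computed as the Pearson correlation of average ranks so that ties are handled. It is an illustration only: the analyses in the paper were run with R's cor.test, no p-value is computed here, and the example index values are invented.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation between the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one variable is constant")
    return cov / math.sqrt(var_x * var_y)


# Invented index values for five species in two subsets.
index_cs = [0.10, 0.20, 0.30, 0.40, 0.50]
index_ri = [0.50, 0.60, 0.70, 0.80, 0.70]
rho = spearman_rho(index_cs, index_ri)
```

Because only ranks enter the calculation, the coefficient is insensitive to the scale of the indices, which suits comparisons between subsets of very different sample size.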
As an alternative evaluation of the quality of the two subsets for patterns of species abundance, we tested their fit to the Gambin model. Gambin is a stochastic model that mixes gamma distribution with a binomial sampling method (Matthews et al., 2014). According to empirical tests, Gambin distribution provides a superior fit to species abundance distributions when compared to other classic models (Ugland et al., 2007). Therefore, Gambin distribution is very useful in describing ecological communities with species abundance curves represented by common species and a long tail of rare species, a pattern frequently observed in nature. We used the fit_abundances function of the gambin package (Matthews et al., 2014) to test the fit of species abundance patterns to the Gambin model for both subsets. This method provides an α-value, a parameter also used as a diversity metric reflecting the complexity of a community's interactions with its environment (Ugland et al., 2007). Based on the logic that threatened and Near Threatened species are rarer than Least Concern species, we used two generalised linear models (GLM) to test whether the indices of commonness differ between Least Concern versus Near Threatened and threatened species for citizen- versus professional-collected data. This was done by applying the glm function with a Gamma error distribution after assessing data distribution.
We checked spatial bias in the two subsets by calculating the number of observations (considering all reptile species) per subset. We then analysed the difference in observation counts between the two subsets for each grid cell. We tested for spatial bias by calculating the expected number of observations per biome based on the proportional size (in km²) of the biome. We compared the expected counts to the observed numbers using the chisq.test function in R.
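The area-weighted test described above can be sketched as a chi-squared goodness-of-fit calculation. The following Python example is illustrative only: the biome labels, areas and counts are invented, and the standardised residuals use the form (observed - expected) / sqrt(expected × (1 - p)), which is our understanding of what R's chisq.test reports as `stdres` when probabilities are supplied.

```python
import math


def area_weighted_chisq(observed, areas):
    """Chi-squared goodness-of-fit of counts against area-proportional expectations."""
    total_obs = sum(observed.values())
    total_area = sum(areas.values())
    chi2 = 0.0
    std_residuals = {}
    for biome, obs in observed.items():
        p = areas[biome] / total_area      # expected share of observations
        expected = total_obs * p
        chi2 += (obs - expected) ** 2 / expected
        # Negative values indicate under-representation, positive values over-representation.
        std_residuals[biome] = (obs - expected) / math.sqrt(expected * (1 - p))
    degrees_of_freedom = len(observed) - 1
    return chi2, degrees_of_freedom, std_residuals


# Invented areas (arbitrary units) and observation counts for three hypothetical biomes.
areas = {"biome_a": 50.0, "biome_b": 30.0, "biome_c": 20.0}
observed = {"biome_a": 30, "biome_b": 30, "biome_c": 40}
chi2, dof, std_residuals = area_weighted_chisq(observed, areas)
```

In this toy case the largest biome receives fewer observations than its area would predict, so its residual is negative, while the smallest biome is over-represented.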
To analyse species diversity within the identified communities, we constructed rarefaction curves based on the two subsets and the full data set. These curves were generated using the specaccum function and employing the rarefaction method provided by the vegan package in R (Oksanen et al., 2020). We used these curves to define whether species diversity had satisfactory coverage at the national scale, given that a clear asymptote (an identifiable trend line with no change in direction) indicates that further sampling is unlikely to yield additional species, thereby affirming satisfactory species coverage at the national scale. We correlated species diversity based on the two subsets among grid cells using Spearman correlation implemented through the cor.test function in R. We produced graphs using the R base plot function and the ggplot2 package (Wickham, 2016).
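The idea behind a rarefaction curve can be illustrated with the classic individual-based formula, in which the expected number of species in a random subsample of n observations is the sum over species of 1 - C(N - N_i, n) / C(N, n), where N is the total number of observations and N_i the count for species i. The Python sketch below is a simplified illustration with invented counts; it is not claimed to reproduce the output of vegan's specaccum.

```python
from math import comb


def rarefied_richness(abundances, n):
    """Expected species richness in a random subsample of n observations."""
    total = sum(abundances)
    if not 0 <= n <= total:
        raise ValueError("n must be between 0 and the total number of observations")
    denominator = comb(total, n)
    # For each species: 1 minus the probability that none of its records is drawn.
    return sum(1 - comb(total - a, n) / denominator for a in abundances if a > 0)


def rarefaction_curve(abundances):
    """Expected richness for every subsample size from 0 to the full sample."""
    return [rarefied_richness(abundances, n) for n in range(sum(abundances) + 1)]


# Invented observation counts for four species.
counts = [5, 3, 1, 1]
curve = rarefaction_curve(counts)
```

A curve that is still rising at the full sample size, as in this toy example with two singleton species, is the pattern read in the text as incomplete species coverage.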

| Overview of reptile GBIF data from Brazil
Based on GBIF data, we identified 754 reptile species within Brazil's geographical boundaries. After the exclusion of duplicates and potential misidentifications, these species were represented by 42,580 observations, covering 82.6% of the total species recognised as native to Brazil. Among the 43 families recorded in the database, Colubridae had the highest representation (13,879 observations), while nonnative families (Acrochordidae, Boyeriidae, Chamaeleonidae, Phrynosomatidae, Platysternidae, Psammophiidae and Varanidae) were represented by a single observation each. The database was mainly composed of Squamata (91%), while Testudines and Crocodylia accounted for only 5% and 4%, respectively.
The number of observations has increased substantially since the 1980s, particularly in the last 20 years, reaching a peak in 2022 (Figure 1). Historical data revealed that the oldest observation from a research institute dates back to 1880, featuring a South American ground lizard (Ameiva ameiva) in the city of São Paulo. The earliest record from citizen science was a House gecko (Hemidactylus mabouia) logged in 1970 in the city of Rio de Janeiro.
There were 24,828 observations (58.3%) and 17,756 observations (41.7%) in the RI and CS subsets, respectively. iNaturalist was the largest source of citizen-science data (99.2%). The Argentine black and white tegu (Salvator merianae) was the most observed species within the CS dataset (1464 observations), while the Amazon lava lizard (Tropidurus torquatus) had the most (1388) RI observations.
Native species accounted for 35,199 observations, including 5683 observations of 316 species endemic to Brazil. Among these endemic species, the Neotropical lava lizard (Tropidurus hispidus) had the highest number of observations (644). We also identified 66 exotic species in the database, represented by 1702 observations. The most common exotic species was the House gecko, with 1324 observations.
Considering the full data set (i.e. 436 species that were reported in both subsets), we classified 279 species as rare (Table S1). Ridley's worm lizard (Amphisbaena ridleyi), Noronha skink (Trachylepis atlantica) and the Endangered Calango (Tropidurus psammonastes) had the highest rarity values with regard to geographical range criteria (GRI), while the general values of rarity (RR) were highest for the Endangered Dumeril's worm lizard (Leposternon octostegum), and two other lizard species: Caparaonia itaiquara and Tropidurus pinima. Ameivula mumbuca, a species of teiid lizard, had the lowest commonness index ($C_i$).

| Comparing species rarity between the two subsets
Among the 1188 hexagonal units of the grid, 308 contained no recorded reptile observations in the GBIF database. We identified a strong negative correlation between rarity and commonness indices across the full data set (ρ = -0.7564968, p-value < $2.2 \times 10^{-16}$). While most threatened and Near Threatened species had higher RR and lower $C_i$ values, some Near Threatened or threatened species were still classified as common based on the evaluation of the full data set (Figure 2a). Although the proportion of common species did not significantly differ between the two subsets (28% and 34%, respectively; $\chi^2$ = 1.7603, p-value = 0.1859), 81 species were classified differently based on the two subsets (Table S1). Furthermore, the correlation coefficients between the two subsets for both the rarity index and the commonness index were <0.70, and the index values differed for many species between the subsets (Figure 2b-f). Additionally, correlation analyses of the indices that made up the Rabinowitz rarity dimensions (GRI, HSI, PSI and RR) between the subsets showed that the highest disparity was related to the population size index, which had the lowest correlation coefficient (ρ = 0.3324617; p-value = $9.626 \times 10^{-12}$), followed by the commonness index (ρ = 0.6491; p-value ≤ $2.2 \times 10^{-16}$), rarity index (ρ = 0.6656551; p-value ≤ $2.2 \times 10^{-16}$), habitat specificity index (ρ = 0.7155195; p-value ≤ $2.2 \times 10^{-16}$) and geographic range index (ρ = 0.7700875; p-value ≤ $2.2 \times 10^{-16}$).

| Spatial bias and species completeness in the two subsets
Data contributed by RI contained 81% of the total number of species reported from Brazil, while CS data contained only 63%. The former provided more observations, particularly from the south and southeast of the country, while the latter had a larger contribution in the northeast (Figure 3). Both subsets provided good species coverage from the central-western region. Species richness calculated from the two subsets did not show a significant correlation at the grid cell level (ρ = 0.069; p-value = 0.086).
The Atlantic Forest was the best-represented biome, with 15,543 observations, while the Pantanal only had 2109 observations. A total of 3929 observations were submitted from oceanic islands. The distribution of observations per biome differed significantly from the expected values, which were calculated based on the size of the area ($\chi^2$ = 318,459, df = 5, p-value < $2.2 \times 10^{-16}$). The Amazon, Caatinga and Cerrado were underrepresented, with standardised residuals of -178.71847, -11.72771 and -36.50577, respectively, while the Atlantic Forest, Pampa and Pantanal were overrepresented, with standardised residuals of 536.59507, 149.28487 and 51.20920, respectively.
Considering the two subsets, 212 species were exclusively observed in the RI subset, while 87 species were unique to the CS subset. Thus, with 695 species, species richness was estimated to be higher in the professional scientist subset compared to the citizen scientist subset (542 species). Neither the rarefaction curves for the two subsets nor that of the full data set presented a well-defined asymptote (Figure 4).

| DISCUSSION
While traditional surveys conducted by experts are indispensable for advancing knowledge of taxonomy, biology and geographical coverage of species, RI data, such as those available in natural history collections, usually do not suffice to determine species abundance patterns or rarity at larger scales. This limitation largely stems from inherent biases associated with particular research objectives of professional researchers or logistical constraints (Isaac & Pocock, 2015). For example, many taxonomists go to remote places to describe new species from poorly studied regions (Brito et al., 2021; Kennedy et al., 2019), focusing predominantly on collecting potentially new taxa. Similarly, population ecologists may concentrate on monitoring and sampling specific species, often overlooking others present at the study site. Such biases in RI data can lead to an overrepresentation of certain species while neglecting others. In the context of Brazilian reptiles, research attention is often skewed towards larger species, particularly those whose geographic ranges coincide with the locations of institutions housing experts (Guedes, Moura, et al., 2023). As a result, observations of specific species disproportionately contribute to natural history collections, independent of the actual abundance or distribution range of these species.
In contrast, CS data, despite being known to have spatiotemporal (Bowler et al., 2022; Di Cecco et al., 2021) and species-trait biases (Callaghan, Poore, Mesaglio, et al., 2021; Marcenò et al., 2021), typically provide a more comprehensive picture of species abundance and distribution range. These factors significantly influence species detectability among observers, affecting the number of observations each species may have (Szabo et al., 2012). Our findings suggest that CS data provide more accurate estimates of species abundance or rarity than RI data. Nevertheless, neither data set reflects "real" abundances perfectly. The disparity in rarity classification between the two subsets is likely due to the limited capability of RI data to estimate population sizes accurately. This conclusion is supported by the observation that, unlike CS data, RI data did not fit the Gambin distribution, which represents the expected natural pattern for species abundance (Matthews et al., 2019). While geographic range data are well-documented for most terrestrial vertebrates, obtaining accurate population size data remains challenging for many reptile species (Ficetola et al., 2018). This issue is reflected in the relatively high number of reptile species classified under criterion B and in the relatively high correlation between the two datasets with regard to the geographic range index but not the population size index.
Although the use of observations from CS appears to offer fewer constraints for inferring species abundance compared to data from RI, we need to highlight the potential risk of false negatives (species that were present but remained undetected) and false positives (misidentifications or recording species that were in fact absent) related to data from citizen science initiatives (Gorleri et al., 2023; McDonald & Hodgson, 2021). Despite these challenges, integrating data from various sources is known to improve data quality (Brown & Williams, 2019; McDonald & Hodgson, 2021). Nevertheless, more evidence from empirical tests using robust estimators of population size among species is necessary to validate whether integrating data sources can improve abundance data for Brazilian reptiles.
The integration of different data sources has improved our understanding of species richness and taxonomic coverage in our data set, highlighting the benefits of spatial complementarity. Despite these advances, the rarefaction curve for the full data set did not reach a satisfactory asymptote. The RI data set provided a more extensive list of species, with 212 species (28% of all Brazilian reptile species in GBIF) that were exclusive to this data set. For instance, the only records of the colubrid Zonateres lanei were seven museum specimens. Among these 212 species, only six were classified as common: the Spotted ground-snake (Adelphostigma occipitalis), Yellow head mussurana (Boiruna maculata), Keeled sepia snake (Dryophylax hypoconia), Amazon coastal house snake (Dryophylax nattereri), Dark blind snake (Liotyphlops beui) and Coastal house snake (Mesotes strigatus). Among the 203 rare species, 18 were classified as threatened and three as Near Threatened (Table S1).
Conversely, 87 reptile species (12% of the total species in GBIF) were only present in the CS subset, making it highly valuable. All of these species were classified as rare, with nine of them listed as globally threatened and one as Near Threatened. This underscores the significant role that volunteer observations can play in capturing data on rare and globally threatened species (Báthori et al., 2022; Fontaine et al., 2022; Tiralongo et al., 2020). Our study identified certain grid cells that had a larger contribution from the CS subset than from the RI subset, especially in the Northeast (Figure 3). While the combination of RI and CS data improved the total coverage of reptile species at the national scale (GBIF includes information on 82.6% of the reptile species in Brazil), there is still potential for improvement. The rarefaction curve has not yet reached a plateau, suggesting the species count could increase with further sampling efforts, especially in understudied regions. Spatial gaps are primarily located in the Amazon Basin, where increased research effort is required to improve our understanding of species abundance and distribution patterns. These regions should be prioritised for attention by academic experts, and the involvement of organised citizen-science initiatives could contribute to filling these gaps (Brooks et al., 2023). An integrated approach that combines the efforts of professionals and volunteers in structured projects (including a programme for training citizens) could result in a highly effective strategy for increasing data availability in these under-researched regions (Callaghan et al., 2019).
Citizen science initiatives typically produce more observations in urban landscapes (Tulloch & Szabo, 2012), and our results indicate that such sources are currently the main contributors of occurrence data on reptiles in Brazil, proportionally surpassing the representativeness of RI data since 2019 (Figure 1). The spatiotemporal differences in RI and CS data were also reflected in differences in the most observed species. We can also infer the distribution of exotic species, as GBIF contains data on 61 exotic species. Exotic species can affect native biodiversity negatively, and citizen science is considered an effective tool to monitor their distribution and trends (Encarnação et al., 2021; Johnson et al., 2020; Phillips et al., 2021). The abundance of CS data supports macroecological research (Altwegg & Nichols, 2019). For example, many studies have used citizen science data to build species distribution models, particularly for birds and mammals (Feldman et al., 2021). In spite of the data available, other vertebrate groups have received less attention (Feldman et al., 2021). Based on GBIF data, citizen science can contribute to more accurate distribution models for some species (Robinson et al., 2020), such as the House gecko, which has over 1000 observations in Brazil.
The unprecedented global biodiversity crisis underscores the urgency of identifying and cataloguing species (Tilker et al., 2020), a task made more challenging by the widespread lack of information on population sizes for most species (Kindsvater et al., 2018). In this context, our study has provided valuable information on biases present in large-scale public data concerning species abundance, which can inform their use for ecological science and informed conservation decision-making (Johnston et al., 2023). We have demonstrated that CS data can be an important source for obtaining reptile species abundance patterns and rarity in Brazil. Nevertheless, biases need to be recognised and accounted for. CS and RI data differ in estimating population size, which can affect the accuracy of classifying rare species. In spite of these challenges, we recommend the integration of these two types of data to study spatial and taxonomic coverage.

| IMPLICATIONS AND RECOMMENDATIONS
While rare species are not necessarily threatened, a detailed evaluation of data sources and the underlying causes of rarity can serve as an early warning to trigger conservation actions. For example, the teiid lizard Glaucomastix cyanura, which displays high values across all rarity indices in both RI and CS data, is currently classified as Data Deficient by the IUCN. This classification positions it as a potential candidate for evaluation, with a high chance of being recognised as threatened. More concerning is the case of the Pantanal coral snake (Micrurus tricolor), currently evaluated as Least Concern. Yet, this species has reached the maximum values for all rarity indices in both subsets. The convergence of these rarity dimensions strongly suggests that this species needs to have its threat status revised, given its endemicity to the Pantanal, its limited geographical range, and apparently low densities. Furthermore, the population trend of this species is currently unknown (https://www.iucnredlist.org/species/15202936/15202939). Given the ongoing large-scale land conversion and extreme fires in the Pantanal (Garcia et al., 2021) and future predicted threats (Lima et al., 2020), these environmental pressures intensify the urgency for reassessment. Another similar example is the Black-headed coral snake Micrurus averyi, also classified as Least Concern, yet exhibiting concerningly low rarity indices. Restricted to northern Amazonia and a small area in Guyana (http://www.reptile-database.org/), its limited observations and small geographical range are cause for concern. Similarly, other species like the colubrid snake Helicops tapajonicus and the tropidurid lizard Tropidurus insulanus should also have their global threat status carefully re-evaluated. Despite certain limitations, our results indicate that CS data have reached a threshold, having accumulated a sufficient amount of data to inform species conservation in Brazil. This approach could also benefit other countries, particularly those that have traditionally lacked extensive population data but are currently experiencing increased volunteer activity. This methodology aids in filling critical data gaps, while also empowering local communities to actively partake in biodiversity conservation.

FIGURE 1 Number of reptile observations in Brazil in the Global Biodiversity Information Facility between 1800 and 2023, comparing the amount of citizen science data (green) and observations by professional scientists (orange).

FIGURE 2 (a) Relationship between rarity and commonness indices for Brazilian reptile species based on the Global Biodiversity Information Facility database; black dots represent Least Concern species, red dots represent Near Threatened and threatened (VU, EN and CR) species, and grey dots are nonclassified species. The other panels correlate indices calculated from citizen science (on the x axis) and professional-collected datasets (on the y axis) for (b) habitat specificity index (HSI); (c) geographic range index (GRI); (d) population size index (PSI); (e) rarity index (RR) and (f) commonness index (Ci) values. The red dashed isoclines at 0.5 on the last graph delimit rare (<0.5) and common (>0.5) species.

FIGURE 3 Spatial distribution of reptile observations in Brazil based on the number of observations from citizen science and traditional survey data subsets of the Global Biodiversity Information Facility (GBIF) database.

FIGURE 4 Comparison of species richness interpolation curves of reptiles in Brazil considering the full Global Biodiversity Information Facility data set (in purple), and the two subsets, traditional survey (in orange) and citizen science data (in green).