A new statistical approach for identifying rare species under imperfect detection

Species rarity is often used as a measure to assess the risk of extinction of species, and thus, different methods have been developed to describe the composition of rare species in biological communities. These methods usually depend on species attributes that are not always available and very often ignore imperfect species detection. In this work, we developed a new method to characterize species rarity in a community when species are detected imperfectly. Our modelling framework is based on Bayesian occupancy models to estimate species distributions under imperfect detection using presence‐nondetection data.


| INTRODUC TI ON
The need to understand how species distributions are influenced by environmental changes has motivated the development of a wide range of species distribution models that allow for identification of the most important areas for biodiversity conservation (Elith & Leathwick, 2009). These methods are usually based on scoring procedures that describe different attributes of an assemblage of species in a community (Leroy et al., 2012). However, individual species responses may vary widely within a community, making the task of identifying the ecological processes of interest that drive occupancy patterns difficult, especially for rare and elusive species (Bailey et al., 2014).
Rare species are often a point of interest for conservationists, as they represent the most vulnerable species to environmental changes (Leroy et al., 2013). A species is considered to be rare when the range of distribution or their abundances is low with respect to the range or distributional properties of other species within a comparable taxa (Blackburn & Gaston, 1997). Thus, species rarity is considered to be a scale-dependent, relative-emerging property for a set of species (Hartley & Kunin, 2003).
Species with limited distributions are prone to experiencing local extinctions influenced greatly by a reduction in the genetic pool due to population isolation (Ellstrand & Elam, 1993;Karron, 1997), environmental changes and demographic stochastic processes (e.g. variability in birth or death rates and fluctuating sex ratios (Lande, 1993;Lee et al., 2011)) that cause random fluctuations in the species' survival and reproduction rates (Melbourne & Hastings, 2008). However, estimating the distributions of rare species is a challenging task because occurrence records are scarce and underestimate the true species distribution due to imperfect detection (MacKenzie et al., 2005).
To correct for detection bias, multispecies occupancy models enable probabilities of species occurrence and detection to be estimated simultaneously. This class of models has proven to be a powerful tool for estimating attributes of biological communities such as size, composition and species richness when the number of species in the community is known (Gelfand et al., 2005) or unknown (Dorazio & Royle, 2005;Dorazio et al., 2006). However, their potential application to classify species is a less studied subject (Pacifici et al., 2014), and very few studies in community ecology have analysed the species classification problem while accommodating imperfect detection. For instance, Pacifici et al. (2014) proposed using multispecies occupancy models to investigate the variability in occupancy probabilities after classifying species from two distinct avian communities into different groups defined by landscape features, habitat requirements and species diet. Grouping species can reveal group-level responses to habitat covariates that would otherwise be difficult to observe if the community was analysed as a whole. However, results can be sensitive to these predefined classes (Pacifici et al., 2014), and the class labels are assumed to be known a priori rather than being estimated from the data. Imperfect detection has become an increasing area of research over the last decade (Devarajan et al., 2020;Kellner & Swihart, 2014), but current methods proposed to quantify the rarity of a species at the community level have not yet accounted for detectability bias and very often rely on population density parameters, which are not always available for the less studied species. For instance, Rabinowitz (1981) proposed a classification system that has been applied in several conservation studies (e.g. Broennimann et al., 2005;Isaac et al., 2009;Maciel, 2021;Yu & Dobson, 2000) to categorize species by different types of rarity and commonness based on the local population density, the area of the species range and the number of different types of habitats each species occupies.
However, abundance and habitat specificity data are not always available, especially for small invertebrates (Leroy et al., 2013).
Hence, scoring procedures based on species occurrences have been used to assess the rarity of assemblages of species where the only information available comes from occurrence data sets (Leroy et al., 2012(Leroy et al., , 2013. In this work, we also propose an occurrencebased approach to characterize the rarity of a given species. Here, our modelling framework assesses species rarity based on the relative occupancy probabilities on a National scale. We propose a Bayesian occupancy mixture model for multiple species to quantify species rarity in a community when species are detected imperfectly to help determine sites where the incidence of rare species is higher and, thus, relevant for conservation and management. Mixture models have been explored in a wide range of fields because of their flexibility to model situations in which the population of interest is a mixture of subpopulations for which subpopulation membership is not known (McLachlan & Basford, 1988). For instance, finite mixtures in Capture-Recapture models have been used to account for heterogeneity in individual capture probabilities (Norris & Pollock, 1996;Pledger, 2000). Within the context of occupancy models, mixture models have been used to model heterogeneity in detection probabilities (Royle, 2006), to account for false-positive errors (Royle & Link, 2006), and more recently, to characterize the structure of a community based on latent groups of species that have similar responses to environmental conditions (Sollmann et al., 2021).
In this work, we propose using finite mixtures as a classification tool that enables a community's structure to be described in terms of rarity/commonness. Specifically, we look at UK Odonata communities and characterize them based on a two-class finite mixture into rare and common species while accommodating imperfect detection. Odonata, a taxonomic order comprised dragonflies and damselflies, has served as an important bioindicator to assess water body quality and ecosystem integrity (Golfieri et al., 2016). Thus, ecological studies of the distributions of these species are crucial for the management and restoration of freshwater ecosystems.

| Occupancy and species trait data
Data were compiled by hydroscape (web: hydroscapeblog.wordpress.com), a project investigating how anthropogenic stressors and connectivity interact to influence biodiversity in UK freshwaters.
Odonata occurrence records (for over 4000 [1 km] grid cells defined by the presence of a waterbody) from 2000 to 2016 were taken for 39 noninvasive species from the British Dragonfly Society Recording Scheme (2020). Species-specific covariates that may affect the species detection probability were taken from Powney et al. (2014).
These covariates were (1) median body size, (2) flight duration (difference in months between the start and the end of the flying period) and (3) number of different habitats that each species occupies (e.g. lowland rivers and canals, bogs moorland and lowland wet heath, ponds, lakes and woodlands).

| Generating presence-absence data
Our proposed modelling approach relies on presence-absence (a.k.a. detection-nondetection) data. However, Odonata occurrences are derived from presence-only records where there is no information regarding species nondetections. Thus, individual species nondetection records were inferred based on the information of sightings of other dragonfly and damselflies species by following K´ery et al.
(2010) and Termaat et al. (2019), i.e. sighting of species i confirmed its presence at that site and it was deemed as undetected at those sites where any species other than i was recorded.

| Bayesian mixture occupancy model to identify rare species
Our modelling approach is based on occupancy models, a special class of methods that enable species distribution to be estimated under imperfect detection (MacKenzie et al., 2002;Tyre et al., 2003). These models are defined by specifying (i) an ecological process from which the occupancy state of whether a site is truly occupied or not (i.e. z j = 1, if site j is occupied and 0 otherwise) is drawn based on the estimated species occupancy probability ( ), and (ii) an observational process conditioned on the occupancy state that relates the observed species occurrences on repeated sampling occasions (i.e. y jk = 1, if species is detected in site j during visit k) to the estimated detection probability (p). In this work, we developed a mixture occupancy model that can be used to classify species rarity based on their relative occurrences while accounting for the imperfect detection derived from species' nondetections.
Particularly, finite mixture models assume that a random vector x = x 1 , …, x n characterized by a set of unknown parameters is distributed among H different nonoverlapping groups with probability h of being in group h (such that ∑ H h=1 h = 1). Thus, the density function describing the distribution of vector x is given by: Bayesian estimation using MCMC methods is based on specifying a latent class model that associates each x i with a latent variable i ∈ {1, …, H}. The latent class variable i is sampled from a categorical distribution, i.e. i ∼ Categorical (H, ), resulting in the likelihood defined as: This leads to the joint posterior from which a Gibbs sampler can be implemented by setting priors for and (see section 2.3.2 for details). The hierarchical structure of the occupancy model allows for incorporation of the latent variable i = h for those species belonging to class h ∈ (1, 2, …, H). By grouping species in classes, the relative frequency of the number of species in each class can be expressed as a proportion of the total number of species estimated to be present at each site (see equation (7)).
In a standard multispecies occupancy model, species-level parameters are drawn from a common distribution characterized by hyperparameters representing the parameter's average value among all the species in a community (and also the scale parameters for normal hyperpriors) (Dorazio & Royle, 2005;Dorazio et al., 2006).
Typically, logit-normal models are used to describe species heterogeneity in occupancy and detection probabilities (Coull & Agresti, 1999;Dorazio & Andrew Royle, 2003;Royle, 2006). In addition to the species heterogeneity described by the logit-normal model, adding the finite mixture component enables occupancy parameters of the ecological process to be defined based on the latent classes to which each species belongs. This allows the composition of a community to be characterized in terms of how common or rare the species are.
Thus, a multiple-species occupancy model can be formulated as follows: Equation 4 (a) denotes the ecological/state process where the latent variable is a categorical random variable relating the i -th species' occupancy (z ij for i = 1, …, S species and j = 1, …, M sites) given by the occupancy probability hi for a specific class h such that Pr ( = h) = h for h = 1, …, H and = 1 , …, H , allowing for the class structure in the community to be estimated by linking each ( species-level parameter to a latent class from the finite mixture component, so that species are grouped in terms of their rarity. The observational process driven by the detection probability Mixture model parameters also suffer from a lack of identifiability due to the invariance of the posterior distribution to permutations of the group labels (Redner & Walker, 1984;Richardson & Green, 1997). To make model parameters identifiable and avoid label-switching issues, component mean values can be ordered, i.e.
,1 < ,2 < … < ,H . By tracking on each MCMC sample draw s, then, the posterior probability of each species being classified into the hth category is given by: where I (s) i = h denotes an indicator variable that takes the value of one if species i belongs to cluster h and zero otherwise. Species can be clustered by assigning them to a class based on these posterior probabilities as follows: Additionally, the relative class frequency at each site ( hj ), expressed as a proportion of the local species richness (i.e. ∑ i z ij ), can be computed as a derived quantity by tracking the hth class-membership and the occupancy status for every species on each MCMC draw: Note that a finite mixture density could also be included for the detection probabilities in the observational model by fitting a multivariate density for the joint distribution of ( , p) by using inverse Wishart priors for the covariance-variance matrix. However, we found identifiability issues in some of our simulations due to labelswitching among clusters. (Similar issues were reported by Sollmann et al. (2021) when using infinite mixtures for modelling the joint distribution of occupancy parameters, especially when variation among clusters was high). Thus, in this work, we have specified finite mixtures for the state process only because our primary goal is to identify species based on their rarity rather than their elusiveness.
Moreover, by adopting a Bayesian inference approach, the proportion of species in each class can be computed as a derived quantity of the occupancy model at each of the sampled sites in the study.
In the Odonata case study, this enables inference about the occupancy pattern of rare species to be limited to only the sites in the sample. This is of particular interest since these sites represent lakes and ponds, which are key components of the hydrological network in the United Kingdom.

| Introducing species-specific effects and sitelevel covariates
The Odonata case study contains information about species-specific traits that could be associated with the species detection. Thus, these species-specific effects can be specified as a linear model for the detection hyperparameters (Eqn. 4 (b)) as follows: Here, the mean detection probability p of each species is a lin- A similar approach by Dunstan et al. (2011), implemented finite mixture models in a frequentist setting to capture heterogeneity in species responses to environmental gradients among different latent classes. However, by adopting a Bayesian inference approach, we can retain the latent variables while accommodating imperfect detection.
Moreover, the hierarchical structure of the occupancy model enables the information among the species in the community to be shared within each class. Therefore, by assuming all species within a particular class (e.g. rare species) are related to one another by being part of the same biological community, the parameters of species with sparse occurrence records can be estimated (Dorazio et al., 2011).
To estimate the parameters of the occupancy mixture model, Dirichlet conjugate priors are specified for (i.e. h,i is the number of species assigned to each class. Vague normal priors, logistic(0,1) or weakly informative zero-centered t-distributed priors with scale parameter of 1.566 and degrees of freedom 7.763 can then be specified for the mean hyperparameters and inverse-gamma (conjugate prior), Uniform(0,5) or Half-Cauchy priors for the variance hyperparameters (Outhwaite et al., 2018). (See Northrup and Gerber (2018) for a detailed discussion on prior specifications).

| Fitting a two-class finite mixture
Note that for this work, we will be addressing a binary classification problem only, since our aim is to distinguish rare from common species, but the method can easily be generalized to multiclass problems. We work under the assumption that the number of classes is fixed and known before conducting the analysis. (The choice of the number of classes is based on the ecological context of the problem only). For a two-class problem, the likelihood in (2) can be simplified to allowing Beta (a 1 , a 2 ) priors to be specified for .
For this work, the sensitivity of the priors was tested by comparing our model results under the different aforementioned prior parametrizations. Our results were consistent when either logistic(0,1) and zero-centered t( = 1.566, = 7.763) distributed priors were specified for the mean hyperparameters. Specifying such priors instead of vague normal priors avoids the need for calibrating the precision parameter of the normal prior, which can often be a problem when a logit-scale transformation is used, as vague normal priors (e.g. Normal [0500]) lead to a high probability density around zero and one on a probability scale. Moreover, Uniform(0,5) and Half-Cauchy priors for the variance hyperparameters showed an overall better mixing and lower autocorrelation of the MCMC chains compared with inverse-gamma priors. Finally, we also specified Dirlichet (10,10) priors for the mixing parameters. Graphical diagnostics for convergence of our analysis are available in the Appendix S1.
For each analysis, we ran a total of 50,000 iterations with a burnin period of 10,000 and a thinning of 10 (for memory optimization purposes only) on three independent Markov chains (approximate run time <30 min). The algorithm's convergence was assessed through conventional graphical diagnostics (i.e. traceplots of the posterior samples showing overall good mixing with low posterior autocorrelation, and Gelman-Rubin between-within chains variance ratio <1.1 (Gelman & Rubin, 1992)). All of our modelling work and data manipulation was implemented in R version 4.0.0 (R Core Team, 2020). Our models were run in R Nimble (de Valpine et al., 2017).

| S IMUL ATION S TUDY
We designed a simulation study to test the performance of our occupancy mixture model in which four distinct species classes were defined to simulate a community made up of a combination of common, rare, elusive and nonelusive species. We tested the ability of our model to differentiate between common and rare species under varying occupancy, detection and mixing probabilities, i.e. we assessed if the proposed model could identify those elusive common species that could be mislabeled as rare due to their low detection probabilities.
To generate the aforementioned community, a total of S = 50 species were simulated in M = 300 sites visited on 4 different occasions.
Species-specific occupancy and detection probabilities were drawn from a multivariate normal distribution as follows: where logit −1 (Ω) = hi , p hi is the species-specific occupancy and detection probabilities with h being the mixing probabilities that determine the proportion of species belonging to each class. These species-specific parameters were drawn from the community logitscaled baseline occupancy and detection probabilities h and p h respectively. Note that the simulated occupancy and detection probabilities ranged between 0.05 and 0.95 across species. This captures a reasonably wide span of different species responses that matches what we observed in the Odonata case study.
Σ h is the covariance-variance matrix for the hth class with diagonal elements 2 h , 2 p h corresponding to variances for the logitscaled community mean occupancy and detection probabilities respectively. The off-diagonal elements of the covariance-variance matrix Σ h were set to zero to remove the abundance-induced detection effect (i.e. when detection probabilities are influenced by the high abundance of widespread species), which is assumed to be captured by specifying the number of different habitats that each species occupies (Eqn. 8). Figure 1 illustrates the general framework used to simulate the three following scenarios (details of each scenario can be found in the Appendix S1):

| Simulation scenario 1: nonoverlapping with constant variance and proportional allocation
First, four well-separated classes were simulated by specifying a reasonable distance between the mean value h of each group and

| Simulation scenario 2: moderate overlapping with variance heterogeneity under constant and varying mixing probabilities
For a second simulation, we allowed for a moderate degree of overlapping between rare/nonelusive species and common/elusive species while specifying different variances for each class (see Appendix S1 for the specification of variance heterogeneity). Furthermore, we explored the model performance under constant and varying mixing probabilities such that the number of species in common and elusive classes was larger than the number of species in the rare and nonelusive classes respectively.

| Simulation scenario 3: strong overlapping with variance and mixing heterogeneity
For the third scenario, a greater degree of overlapping was induced by setting (i) similar detection and occupancy probabilities for each class, (ii) variance heterogeneity and (iii) different proportions of species allocated to each class. The mixing coefficients and the variancecovariance matrix were the same as those described for scenario 2.

| Assessing model performance
Model 4 was fitted to the simulated data for each scenario using the settings described in section 2.3.3. The model performance was assessed by using standard metrics calculated based on the confusion matrix shown in Figure 2. The diagonal elements of the confusion matrix are given by the number of rare and common species that have been correctly predicted as such (true positives (TP) and negatives (TN) respectively). The off-diagonal elements of the confusion matrix contain the false positives (FP =Pr(̂ i = 2| i = 1)) and false negatives (FN =Pr(̂ i = 1| i = 2)) errors. The different metrics that were compared are also presented in Figure 2. Here, the correct classification rate (CCR) describes the overall proportion of species that were classified correctly. Sensitivity (TPR) and specificity (TNR) denote the proportion of rare/common species that have been classified correctly among the rare and common classes respectively. The model's precision or positive predictive value (PPV) represents the proportion of truly rare species among all the species that have been predicted to be rare. Similarly, the negative predictive value (NPV) describes how many species predicted to be common are truly common. Cohen's Kappa statistic ( ) was also calculated as a measure of the model's performance relative to what would be expected by chance. According to Fleiss and Cohen (1973), a value between 0.40 − 0.60 indicates a moderate performance, while values of 0.61 − 0.80 and 0.81 − 1 suggest a substantial and almost perfect performance respectively. Finally, the F-score was calculated as F = 2 × (TPR × PPV) ∕ (TPR + PPV) , and the balanced accuracy (TPR + TNR∕2) was used as metrics to assess the overall model performance under each scenario. The F-score, on the one hand, accounts for both false-positive and false-negative errors and gives equal importance to sensitivity and precision (i.e. focuses primarily on detecting correctly rare species). The balanced accuracy, on the other hand, is useful when comparing classes with an unbalanced number of observations, as it also considers the true negatives and, thus, gives equal importance to predicting correctly both rare and common species.

| Simulation study results
Our simulation study results in Table 1 show the standard metrics of classification performance (section 3.1) for each simulated scenario.
Overall, our results suggest good model performance in identifying F I G U R E 1 Direct acyclic graph illustrating the simulation scheme under varying stochastic parameters indicated by the shaded nodes. The square box represents the model's latent state process for S=50 species and M=300 sites. Square nodes represent simulated data while circle nodes are the stochastic parameters varying for simulation F I G U R E 2 Confusion matrix and standard classification metrics for a two-class problem. TPR denotes the sensitivity or true positive rate, TNR denotes the specificity or true negative rate, PPV and NPV denote the positive and negative predictive values, respectively, and CCR denotes the correct classification rate computed over the total number of species S those rare and widespread species with an ≈ 80\% of accuracy across all scenarios. For scenario 1, our model correctly classified all rare and common species with respect to the true categories. In scenario 2, the model classification performance was affected by groups with an unbalanced number of species. For instance, with constant mixing probabilities for each class, the CCR was 0.96 and the TNR and PPV indicated that all common species were correctly identified as such, and all the predicted rare species were truly rare.
Moreover, the statistic (>0.80) suggested an almost perfect performance relative to what can be expected by chance, and the balanced accuracy and F-score values above 0.90 indicated a very good predictive performance. However, when the proportion of common species was greater than the proportion of rare species, the TNR and PPV (0.83 and 0.70 respectively) suggested that 83% of species were classified as common out of the total number of common species because some common species have been predicted as rare (i.e. only 70% of predicted rare species were truly rare). In this situation, the F-score shows a lower value compared to CCR and balanced accuracy because it does not consider the true negatives, unlike the balanced accuracy, which suggests a very good classification performance when either an equal or unequal proportion of species are allocated to each class. On the third scenario, the accuracy decreased to ≈ 80%. The different classification performance metrics were approximately 10%-20% lower than the metrics in scenario 2.
We also found that having an unequal number of species for each class yields a much lower accuracy. Particularly, when the number of common species is greater than the number of rare species, both the PPV and TNR suggest an overestimation of the true number of rare species. On the other hand, when the number of elusive species was higher than the number of nonelusive species, the TPR decreased, suggesting that several rare species are being classified as common

| Odonata case study results
We fitted the proposed mixture occupancy model (Equations 1 and 8) to quantify the proportion of rare species of Odonata across freshwaters in the United Kingdom. The proportion of rare species at each site was calculated according to Eqn. 7. Figure 3 shows the predicted classes for each of the Odonata species, with approximately half of the species being categorized as rare. Note that the credible intervals show that the uncertainty for detection probabilities is larger for rare species than for common species. However, at high levels of occupancy, the uncertainty of the estimated occupancy for widespread species becomes larger than for rare species. This pattern could be related to the distribution of occupancy being right-skewed (a small number of common species show relatively high-occupancy probabilities >0.75), so that uncertainty around the estimates of species with high-occupancy probability will be larger than estimates for species with low occupancy. In addition, when the occupancy is low, we have less information from which the probabilities of detection are estimated resulting in greater uncertainty around these estimates.
The proportion of rare species across the different sites is shown in Figure 4. Overall, the proportion of rare species and its estimated standard deviation are low and homogeneous across space. There are, however, some areas in the north and south west where the proportion of rare species is greater than the proportion in central areas, which are dominated almost exclusively by widespread species. These proportions were calculated with a reasonably small error (SD <0.2 for most sites), though there are some sites where uncertainty is larger possibly due to the very low number of Odonata occurrence records at those sites (see Figure S1).

| D ISCUSS I ON AND CON CLUS I ON
In this work, we have presented a new approach to characterize the composition of a community in terms of how rare or common species are. Our model differs from previous methods by considering imperfect detection. Moreover, our modelling framework enables using standard designs of occupancy models for which information about the species' distribution ranges, population densities or habitat specificity are not required to estimate species distributions. In fact, any species-specific data of this nature could be incorporated in either or both latent state and observational processes of the model as described in section 2.3.2.
However, any extension to the occupancy model could potentially be developed within our mixture modelling approach. For example, our model could be integrated with an explicit spatiotemporal occupancy model (see Rushing et al. (2019)) to estimate the temporal changes in occupancy that a species of a specific class experiences.
This could be potentially useful for monitoring the distributions and temporal dynamics of invasive species at different stages after their introductions and to identify the tipping point when a focal species experiences a major increase or reduction in its distribution given by the change in its latent class.
The simulation analysis of our mixture occupancy model showed an overall good classification performance, specifically when the proportion of individual species is similar between classes. However, when the proportion of common species was set to be greater than the proportion of rare species, a larger proportion of common species were classified as rare. Under this unbalanced scenario, the posterior mass density from which the occupancy community parameters are drawn will be greater for common than for rare species. Thus, uncertainty around the occupancy community mean for rare species will be larger and pulled towards the density mass of common species, resulting in some common species being classified as rare, yet producing precise predictions for the common species (i.e. the sensitivity and the proportion of predicted common species that are truly common will be high). Note that for this simulation analysis, the large difference in the proportions of species belonging to each class could very well portray a real-life situation where biological communities are composed primarily of widespread species.
Thus, to avoid the overestimation or underestimation of rare species caused by unbalanced classes, informative Dirichlet priors (or beta priors for a two-class classification problem) could be specified to reflect our prior judgement about the proportion of species of each class determining the community structure.
It is important to notice that using an informative prior causes the classification of species rarity to change and, hence, introduces a risk of underestimating the number of truly rare species if the prior weight is greater for common species, or overestimating the rare species if the prior assigns more weight to this class. However, as mentioned by Leroy et al. (2012), it is preferable to conduct an analysis to characterize species rarity rather than ignore it and thus potentially overlook hot spots relevant for conservation. Therefore, the choice of priors depends on (1) the previous knowledge about the target species distribution and (2) the conservation targets. For instance, if the conservation study focuses on rare species, more weight could be assigned to the prior of a species being classified as rare to ensure that most of the species will be correctly classified as rare.
Our approach enables the class membership for multiple species based on a fixed number of classes to be estimated. If the number of classes is unknown, a Bayesian nonparametric mixture model in- volving Dirichlet processes could be specified (Hjort et al., 2010).
For instance, Johnson and Sinclair (2017) implemented a reversible jump MCMC algorithm (RJMCMC; Richardson and Green (1997) In the case study analysed in this work, the proportion of rare Odonata species is generally low (below 20%) and homogeneous across all of the region. There are, however, some areas in the north of Scotland and in the south of England where the proportion of rare species is above 50%. This could be explained by the occurrences of species at very local scales such as Coenagrion hastulatum and Somatochlora arctica, which are confined to a few particular lakes in Scotland and south west Ireland (source: https://briti sh-drago nflies.org.uk). Consequently, the distribution patterns for some of these rare species might not be evident on a national scale and could be limited to specific habitats where species are exposed to different environmental conditions and extinction processes (Hartley & Kunin, 2003). Hence, because of the occurrence-based approach implemented here, where the proportion of rare species on a national scale is derived from the relative mean occupancy probabilities in each class, the estimation of rarity depends on the study spatial scale. Previous studies by Hartley and Kunin (2003) and Leroy et al. (2013) have shown how species rarity can be a scale-dependent process and have proposed different multiscale metrics to quantify the species rarity. However, producing such metrics under imperfect detection is a lesser studied subject for small invertebrates (Leroy et al., 2012(Leroy et al., , 2013) and a very interesting area for future research.
For example, our modelling framework could be integrated with multiscale occupancy modelling designs and sampling schemes like the one proposed by Nichols et al. (2008) to estimate the proportion of rare species at varying spatial scales under imperfect detection.
Other occurrence-based methods that assess species rarity (such as the Leroy et al. (2012) rarity index) rely greatly on the sampling methods, as uneven sampling effort might induce an artificial rarity for those species recorded in poorly sampled sites. Accounting for sampling effort and detection probabilities in the observation process is of major importance, specifically when working with large citizenscience data sets where observations are collected without a standardized field sampling protocol for a frequently arbitrary selection of sites. This variation in observation effort causes detection probabilities to vary over time and space (K´ery et al., 2010). The two-component structure in our model allows us to incorporate terms that correct for uneven sampling effort. In our models, we only used the number of visits at each site as a proxy for the sampling efficiency. Thus, the correction for uneven sampling effort occurs automatically through the number of sampling days representing the number of Bernoulli trials in the distribution of p rather than as a covariate in the model.
Alternatively, a common approach to the analysis of citizen-science data is to define the logit-scale detection probabilities as a function of (1) the day when each site was visited and (2) the daily species lists to account for heterogeneity in detection probabilities caused by seasonal variation in the flying periods of species and unequal observation effort (K´ery et al., 2010;van Strien et al., 2010van Strien et al., , 2013. The approach presented in this work provides a new method to understand the species rarity assemblage in a community when species are detected imperfectly. It can be potentially applied to different situations such as multiscale studies and spatiotemporal modelling. We have applied our model on a national scale to estimate the proportion of rare species of Odonata, a taxonomic group (along with other small invertebrates) that has a significant under-representation in studies accounting for imperfect detection (Devarajan et al., 2020;Kellner & Swihart, 2014). We believe this work provides a substantial improvement to how rare species are identified and to the understanding of how species rarity assemblages can be quantified.

ACK N OWLED G EM ENTS
The authors thank Dr. Stephen J. Brooks, Dr. Tom August and the Hydroscape team for their insights and comments. The authors thank NERC funding (grant NE/N005740/1) for making this project possible. JB was supported by CONACyT scholarship (494334).

CO N FLI C T O F I NTE R E S T
We warrant that this manuscript is the original work of the authors listed and has not previously been published. All the authors listed made substantial contributions to the manuscript and qualify for authorship, and no authors have been omitted. We warrant that none of the authors has any conflict of interest with regard to this manuscript.

PEER R E V I E W
The peer review history for this article is available at https://publo ns.com/publo n/10.1111/ddi.13495.