Estimating species misclassification with occupancy dynamics and encounter rates: A semi‐supervised, individual‐level approach

Large‐scale, long‐term biodiversity monitoring is essential to conservation, land management and identifying threats to biodiversity. However, multispecies surveys are prone to various types of observation error, including false‐positive/false‐negative detection and misclassification, where a species is thought to have been encountered but not correctly identified. Previous methods assume an imperfect classifier produces species‐level classifications, but in practice, particularly with human observers, we may end up with extraspecific classifications including ‘unknown’, morphospecies designations and taxonomic identifications coarser than species. Disregarding these types of species misclassification in biodiversity monitoring datasets can bias estimates of ecologically important quantities such as demographic rates, occurrence and species richness. Here we present a joint classification‐occupancy model that accounts for species non‐detection and misclassification. Our framework accommodates extinction and colonization dynamics, allows for additional uncertain ‘morphospecies’ designations and makes use of individual specimens with known species identities in a semi‐supervised setting. We compare the performance of our model to a classification‐only model that discards information about occupancy and encounter rate. We illustrate our model with an empirical case study of the carabid beetle (Carabidae) community at the National Ecological Observatory Network Niwot Ridge Mountain Research Station, near Boulder, CO, USA. We also use simulations to evaluate model performance through validation metrics where varying fractions of the data are confirmed. The model supported imperfect classifier accuracy and favoured certain true species classifications strongly for some morphospecies. The model outperformed (e.g. precision) the reduced model that discarded occupancy information, and these differences were most pronounced for abundant species. Spatial and temporal dynamics from modelled occupancy and encounter rates may inform species misclassification probability, but this idea has not yet been tested. Our statistical framework explores this opportunity, and can be applied to datasets with imperfect species detection and classification, limited verification data and non‐species classifications.


| INTRODUC TI ON
Large-scale, long-term biodiversity monitoring is essential to conservation and land management and identifying threats to biodiversity. Such comprehensive datasets increasingly include multispecies surveys that capture information-rich co-occurrence data, enabling community-level analyses (Iknayan et al., 2014;Ovaskainen et al., 2017). However, multispecies surveys are prone to various types of errors, including false absences where a species is present but not detected (Dorazio & Royle, 2005), and misidentification, where a species is encountered but its species identity is not correctly recorded (Miller et al., 2011).
Certain classes of occupancy models account for observation error in biodiversity surveys that seek to understand species distributions, track population changes and describe mechanisms underlying population and community dynamics (MacKenzie et al., 2002).
Latent presence/absence states are modelled explicitly, with an observation model that accounts for the details of the detection process, including the potential for false negatives (non-detections at occupied sites) and false positives (detections at unoccupied sites; Chambert et al., 2015;Miller et al., 2012;Royle & Link, 2006;Wright et al., 2020). Disregarding false positives in biodiversity monitoring data can bias estimates of ecologically important quantities such as demographic rates, occurrence and species richness (Chambert et al., 2015(Chambert et al., , 2018McClintock et al., 2010).
Multi-species surveys are also subject to errors in species identifications by imperfect classifiers. Imperfect classifiers include citizen scientists (e.g. North American Breeding Bird Survey ;Sauer et al., 2017), technicians trained in local taxonomy (e.g. invertebrate trapping by NEON; Hoekman et al., 2017), automated methods (e.g. bat acoustic recording software; Wright et al., 2020) or convolutional neural networks used with camera trap data; Tabak et al., 2019). Previous methods assume an imperfect classifier produces species-level classifications, but in practice, particularly with human observers, we may end up with extraspecific classifications including 'unknown', morphospecies designations (i.e. individuals that cannot be taxonomically identified so are grouped by morphology) and taxonomic identifications coarser than species.
If species observations are prone to misclassification, then samples with verified species identities might be used to estimate misclassification probabilities. We refer to this situation as 'semi-supervised': true species identities are known for some but not all individuals. However, leveraging these partially observed, individual-level validation data for the rest of the dataset present a methodological challenge. Previous multi-species occupancy models that accommodate misclassification have used site-level validation data where the occupancy state of a species is known only at a site but not at an individual sample level (Chambert et al., 2018) or multinomial models with site-level covariates that aggregate individual samples (Wright et al., 2020). Using an individual, sample-level approach can help in resolving non-species (e.g. morphospecies) identities to the true species identity.
Misclassified species identities can be dealt with using one of two contrasting approaches. A simple two-step approach in which (a) a classifier is used to assign species identities to each individual (creating one complete synthetic dataset from classifier output, for which species identities are treated as known or are verified using an unambiguous classification method), then (b) the synthetic dataset is analysed using a downstream model (e.g. an occupancy model). This two-stage approach does not use any information about occupancy or encounter rates in the first stage. An alternative approach is to simultaneously model the classification process and the ecological process in a single joint model. A joint model directly uses classifier output as data, relating the observation process to underlying, imperfectly observed, ecological states in one step. Such an approach can simultaneously account for uncertainty in species identities, and use information about occupancy and encounter rates to inform species identity estimates (Wright et al., 2020). However, there remains the practical question of how much value is added by a joint model vs. a two-stage approach. A priori, we expect that a joint model should produce better true species identities simultaneously by directly modelling the link between ecological states and the observation process, but this has not yet been tested.
Here we present an individual-level, semi-supervised, dynamic occupancy model that accounts for species non-detection and misclassification. Our Bayesian approach extends the classificationoccupancy model of Wright et al. (2020) to (a) accommodate extinction and colonization dynamics, (b) allow for additional uncertain morphospecies designations in the imperfect species classifications and (c) make use of labelled samples with known species identities in a semi-supervised setting. Furthermore, we compare the classification performance (i.e. accuracy and precision of posterior draws) of a joint classification-occupancy model to a reduced classification-only model that discards information about occupancy and encounter rate on a withheld test set. We demonstrate our model using simulations and with an empirical case study of the tested. Our statistical framework explores this opportunity, and can be applied to datasets with imperfect species detection and classification, limited verification data and non-species classifications.

| Modelling occupancy dynamics and misclassification
Consider data collected at sites i = 1, … , N, according to a robust design (Hoekman et al., 2017) where each site is visited j = 1, … , J times within primary periods t = 1, … , T, where the occupancy states are assumed to be static within primary periods. We are interested in occupancy states (true presence or absence) and encounter rates (observed frequency) for species k = 1, … , K.

| State model
We assume that the objective of classification is to use the resulting data in an ecological model describing species occurrence over space and time. Let z i,k,t be the binary occurrence state for species k = 1, … , K at site i and during time t. Sites are either occupied (z i,k,t = 1) or not z i,k,t = 0 . We assume that the occupancy states arise as Bernoulli random variables: The probability of occurrence in the initial primary period is i,k,1 .
Subsequent occupancy dynamics depend on the probability of colonization i,k,t and persistence i,k,t , such that for t > 1:

| Observation model: Encounters
On any particular sampling occasion j at site i in period t, we encounter L i,j,k,t individuals with encounter rate i,j,k,t . We assume that the number of encounters is a Poisson random variable: k,t i,j,k,t . So L i,j,k,t = 0 indicates non-detection, due to either the species not occupying that site (i.e. z i,k,t = 0) or the species may occupy the site but was not encountered (i.e. i,j,k,t = 0). In a setting with misclassification, the number of encountered individuals L i,j,k,t is not observed directly because of uncertainty in the true species identities of encountered individuals. We do however observe the total number of individuals across all species encountered on any particular occasion: L i,j,.,t = ∑ K k=1 L i,j,k,t . The properties of sums of Poisson random variables allow us to model these observed totals as:

| Observation model: Classification
In addition to observing the total number of encountered individuals on an occasion, L i,j,.,t , we assume that we also obtain imperfect species classifications for each encountered individual. In cases where individuals have been encountered (L i,j,.,t > 0), we obtain imperfect classifications of individuals l = 1, … , L i,j,.,t and model these as arising from a categorical distribution with a species-specific probability vector: where y i,j,l,t is the imperfect classification, and k [i,j,l,t] is a probability vector associated with the true species identity of individual l, which we denote k i, j, l, t . Element k ′ in the vector k [i,j,l,t] represents the probability that an individual is classified into category k ′ , conditional on the true species identity k i, j, l, t , such that k [i,j,l,t],k � = Pr y i,j,l,t = k � | k i, j, l, t .
If species are always misclassified as alternative species (i.e. not morphospecies nor other groups), then k will be a vector of length K (Wright et al., 2020), whereas if there are extraspecific classes (e.g. morphospecies), k may have more than K elements.
True species identities are modelled as: The occupancy state model is essentially an informative prior on classification probabilities so that where there is a high probability of occurrence of a species, then samples are more likely to be that species compared to when occupancy probability is low.
If ground-truthed species identity data are available for some individuals, then k i, j, l, t is partly observed and this model can be used in a semi-supervised setting. In an unsupervised setting, this individuallevel formulation is a disaggregated version of the single-season multinomial model of (Wright et al., 2020; Appendix S1). An aggregated version would pool the counts so no identifying information is attributed to any single sample. The disaggregated, or sample-level, approach facilitates the treatment of non-species (e.g. morphospecies) identities and allows for covariates to be included in the model at the individual/observation level, which could improve the estimation of classification probabilities and possibly ecological parameters. For example, sample confidence (i.e. the observer's confidence in species classification for an individual) is a sample-level covariate (like the observation-level covariate, sample quality, in Augustine et al. (2020)) that might be correlated with how samples might be prioritized for verification. This is the case for how samples are prioritized for verification in the NEON dataset, but sample confidence is not available.

| Incorporating morphospecies designations
In some settings, the imperfect classifier might assign more classes than there are unique species so that the vector k has more than K j,l,t ∼ Categorical k[i,j,l,t] , elements. For example, if an imperfect classifier is unable to identify a set of species, they may classify those individuals as 'unknown' or as a unique morphospecies associated with a given sampling occasion. Thus, it is possible for individuals to be classified into K ≥ K classes, where K is the sum of the number of species and the total number of morphospecies designations. In such cases, the matrix Θ = � 1 , … , � K can be rectangular, with the first K columns corresponding to the classification probabilities for species 1, … , K, and the remaining columns corresponding to classification probabilities for non-species (e.g. morphospecies) classes:

| Application to NEON carabid data
We fit our model to the carabid pitfall trap sampling data collected by NEON at NIWO during 2015-2019 (National Ecological Observatory Network, 2021). Carabids are a ubiquitous and speciose family of ground-dwelling invertebrates that are commonly collected by passive sampling methods, like pitfall traps, as described in Hoekman et al. (2017). A well-studied sentinel group, carabids make an excellent study system for assessing community occupancy rates and classification accuracy. Collecting and identifying carabids is resource-intensive, but NEON lowers this barrier to entry by providing a public carabid dataset with three levels of classification (parataxonomist, expert taxonomist, then DNA barcoding). Although is considered imperfect. Identification by an expert taxonomist is treated as confirmation data but is limited due to budget constraints.
We confirmed the accuracy of the expert taxonomist classifications in finding that all individuals sent for DNA barcoding by NEON match the expert taxonomist's identification for the samples we used. In F I G U R E 1 Classification scenarios in NEON carabid data. Each column is an imperfect classifier label. Each species is either present or absent, and morphospecies do not have an occupancy state. In some cases, the imperfect classifier (parataxonomist) matches the error-free classifier (expert taxonomist; black arrow), in other cases the imperfect classifier was wrong (red arrow), while in other cases still, the error-free classification is unknown due to lack of validation data. For example, the Morphospecies 1 individual with no error-free classification must belong to a different column, but this species identity is unknown in the raw data. Ground beetle by Nikita Kozin, from the noun project the few cases where the expert taxonomist could not identify a specimen to species level, we use their genus-level classification for the validation dataset.
Our dataset contains 5,865 individual specimens, 1,910 of which were identified by an expert taxonomist, and 62 species classified by the parataxonomist, 23 of which are morphospecies. Morphospecies identifications are unique to each NEON sampling location and year. We fit our model using all individuals and used no environmental covariates. Having both parataxonomist and expert taxonomist classifications complicates the use of NEON's carabid pitfall trap data ( Figure 1). Only one study to date has been published using the NEON carabid pitfall trap data (Egli et al., 2020), but Egli et al. (2020) analyse only the subset of individuals that have expert taxonomist classifications. The classification-only model to which we compare our joint model can be fit only to the thinned dataset of verified samples, resulting in a loss of information.

| Model specification
We used informative priors for the species classification probability vectors 1 , … , K that placed higher probability density on the correct species classification. In the case of NEON beetle data, this is reasonable given the training that parataxonomists receive in beetle identification. Because all elements of each k vector need to sum to one, and each element is bounded between 0 and 1, we used a Dirichlet prior: k ∼ Dirichlet k . We chose the Dirichlet concentration values k by comparing draws from the Dirichlet prior distribution to our prior intuition about imperfect classifier accuracy, making an assumption that there was a 65% chance that a species is correctly classified and some small probability that it is assigned to another specific class. Additionally, with smaller differences between the values of the Dirichlet prior, the model appeared not identifiable.
This prior is informative particularly for components of Θ that have few observations (e.g. fewer than 80 observations on the diagonal, or fewer than 2 off-diagonal; see Table S2.1) such that the prior may be more informative than the data.
We used multivariate normal priors at the species and site level for initial occupancy, persistence, colonization and encounter rates. Correlated priors allow information sharing among parameters ( Figure S3.1). The motivation for this stemmed from a prior expectation that these parameters could be related. For example, species with a higher encounter rate might be more likely to occur initially, persist or colonize new sites. Similar arguments could be made about relationships among site-level parameters. Each species is associated with a vector k of length 4, where ϵ k,1 , ϵ k,2 , ϵ k,3 and ϵ k,4 are species-specific adjustments on each of the four ecological parameters. The multivariate normal priors have a mean of zero and an unknown covariance matrix: k ∼ Normal Similarly, site-specific adjustments i were drawn from a different multivariate normal prior. These adjustments were added together on a transformed scale to compute initial occupancy, persistence, colonization and encounter rates, for example, logit i,k,1 = ϵ i,1 + ϵ k,1 . A full model specification for the case study is available in Appendix S4 (Plummer et al., 2003). To

| Simulation study
We conducted a simulation study to study the general behaviour of the model. We simulated 15,000 datasets with two species (K = 2) and three imperfect classifications (K = 3) while varying the fraction of verified samples. We expect model performance to decline as this fraction decreases.
Each dataset represents a single season of sampling with three surveys at each of 30 sites. The idea behind the three imperfect classifications is that two are the true species and one is a morphospecies (e.g. Species A, Species B and Morphospecies 1 in Figure 1).

| Case study: Occupancy
Occupancy estimates varied across species and through time ( Figure   S2.2). No occupancy model was fit for the reduced, two-stage model, so there are not occupancy estimates to which to compare the full model's results. The joint occupancy model was designed to allow correlation between parameters across sites and species.
Occupancy, growth and turnover rates also varied through time. Sites with high encounter rates tended to have low initial occupancy and colonization probabilities and high persistence probabilities ( Figure   S3.1). Furthermore, sites with high colonization rates tended to have high initial occupancy probabilities and low persistence probabilities. At the species level, we saw positive correlations among many of the model components, but in particular, species' encounter rate was positively correlated with species' initial occupancy, persistence and colonization rates ( Figure S3.1). Species varied in their detection success by the imperfect classifier, from ones that were common and consistently identified correctly (e.g. Calathus advena) to ones that were not identified at all (e.g. Dicheirotrichus mannerheimii) but were caught by the expert taxonomist.

| Case study: Classification
The model yielded high probabilities of classification along the diagonal of the Θ confusion matrix where the true and imperfect identifications match (Figure 2). The parataxonomists, or imperfect classifiers in the NEON dataset, were trained in beetle taxonomy, so we built the model to favour the imperfect classifier by giving more weight in the Θ prior to diagonal values, making morphospecies classifications less probable. However, some species were just as or more likely to be identified as a morphospecies by the imperfect classifier than as the correct species. For example, the parataxonomist was more likely to classify Pterostichus (Hypherpes) sp.
as morphospecies D13.2016.MorphBT than as the true species (see white arrows in Figure 2). However, no species had more than 3% probability ( Figure 3).
These differences are found most notably for the abundant species.
The full model yielded higher correct classification probabilities for the abundant species. Furthermore, the reduced model has larger 95% (CI) widths compared to the full model for many Θ indices ( Figure 4). Thus, we find that a joint classification-occupancy model outperforms a two-stage model (classification, then occupancy) due to both the improvements in the modelling framework and the joint model's expanded access to the full dataset. Since the models being compared are fit to different sets of empirical data, we cannot quantify bias, which is important in model comparison.
We evaluated the performance of the full model's species clas-

| Simulation study
We found that model performance improved when fit to datasets with larger fractions of verified samples; however, the gains in model performance were modest. We illustrate model performance through model validation ( Figure S5.1), coverage ( Figure S5.3), difference between estimated parameter value and true value ( Figure   S5.2), and width of 95% credible intervals ( Figure S5.4). In all of these metrics, we see that the gains in model performance are small or non-existent as the fraction of samples verified increases. We also found that the joint classification-occupancy model outperformed the classification-only model. Specifically, the joint model had higher precision ( Figure 5). These results are specific to the dataset scenario specified for these simulations (e.g. 2 species, 1 morphospecies) and if parameter values were used for a dataset with higher species richness, the results would likely change.

| DISCUSS ION
We developed a statistical approach that can be applied to datasets with imperfect observations that enhances multispecies classification by leveraging occupancy dynamics. This approach builds on recent work that integrates classification into occupancy models (Devarajan et al., 2020, and references therein) by evaluating the advantage of a joint classification-occupancy model, which allows imperfect classification categories to outnumber species; leverages individual-level confirmation data in a semi-supervised setting; and allows for the option to include covariates at the level of the variates or observer error, the latter of which we focus on in the case study. Accounting for false identifications is important to reduce bias in occupancy dynamics estimated from multispecies biodiversity monitoring datasets (Chambert et al., 2015;McClintock et al., 2010;Miller et al., 2011;Miller et al., 2015) or in multi-state capture-recapture models (Pradel, 2005). Alternative models that account for false positives may consider data from only the focal species (Chambert et al., 2015) or from binary observations (Chambert et al., 2017). Like Wright et al. (2020), we use available counts from an imperfect classifier (Figure 1). However, we use all species detected, no matter how rare. By using a rectangular classification matrix that allows for propagation of taxonomic uncertainty for multispecies datasets where imperfect classifications outnumber species (e.g. unknown, morphospecies, to the family level; Figure 2), we remove a limitation that previous occupancy modelling methods have used (Chambert et al., 2018;Wright et al., 2020). For example, although the model priors favour imperfect classifier accuracy, the F I G U R E 3 Comparison of density distribution between prior, full model posterior (classification with occupancy) and reduced model posterior (classification alone) for select confusion matrix indices for species in the Pterostichus genus. Each panel displays the probability distribution that the first species (true species identity) is imperfectly labelled as the second species (imperfect classification). For non-abundant species, k probability distributions look like the second and third rows. In the second row, the three models' probability distributions overlap on the off-diagonal (where P. adrictus is imperfectly classified as other species) and overlap but differ from the prior along the diagonal (where P. adrictus is correctly classified as P. adrictus). In contrast, the top and bottom rows reflect probability distributions for abundant species. In the bottom row, the posterior distributions overlap and are narrower and smaller than the prior on offdiagonal values (where P. restrictus is imperfectly classified as other species). Along the diagonal, we see a difference in posterior probability distribution between the reduced and full models (where P. restrictus is correctly classified as P. restrictus), visualizing how the full and reduced models perform differently model found likely species matches for a couple of morphospecies that were abundant in the data (e.g. D13.2015.MorphO, D13.2016. MorphBT; Figure 2).
Our model is semi-supervised and makes use of data at the individual level. Whereas alternative models use data pooled at the site or visit level (Chambert et al., 2018;Wright et al., 2020), our model leverages the rich, individual-level information to reveal which species are commonly mistaken by the imperfect classifier and how often, allowing species counts to inform classification (Chambert et al., 2017). Verified individuals can be used as partially observed occupancy data in our semi-supervised model. Our model could be expanded to include observation-level covariates (e.g. sample confidence), which is an extension that our individual-level, occupancydetection model can achieve that a count-detection model cannot.
The case study in Wright et al. (2020) 4 and 5). Accuracy of classification estimates was not assessed for either model because true misclassification probabilities are not known for the case study data. While there was large agreement between the confusion matrix Θ of both models, the full model had higher probability estimates for abundant species (Figure S2.1).

F I G U R E 4
Comparison of precision between posterior 95% credible intervals (CI) of joint classificationoccupancy (full) model and classificationonly (reduced) model. The full model is more precise for points below the line, indicating that the reduced model has a larger CI than the full model for that species. The clustering pattern is driven by whether the value is close to zero or not. Mostly along the diagonal, larger values predict that the imperfect and verified classifications match, which yields larger CI widths, whereas most off-diagonal values are close to 0 and have strong confidence, or smaller 95% CI width (Figures 2 and 3) TA B L E 1 Validation metric macro-averages for joint classification-occupancy model. Accuracy is the number of species classifications correct over the total number of classifications; precision is the macro-average of for each imperfect classification category, the number correctly matched over the total labelled as that category; recall is the macro-average of for each species, the number correctly matched over total samples of that species; the F1 score combines precision and recall into a single number; and the holdout log-likelihood is the log-likelihood for the holdout data subset. See associated code for formulas used Future work could more explicitly address false positives by informing priors with species commonly misidentified by the imperfect classifier or by inducing sparsity in the matrix through setting certain priors to 0.
Large-scale, long-term biodiversity surveys are critical to inform land management and conservation policy (Hughes et al., 2017) and require accurate species classifications to achieve conservation objectives. This probabilistic approach can model species occupancy while accounting for imperfect detection and classification. Considering misclassification is even more important when making inferences about temporal transition (i.e. extinction, colonization) than for occupancy because misclassification in either of the focal time periods (t or t + 1) can produce bias. Innovations in occupancy models, in general, are rapidly being made to consider an expanding variety of study systems and experimental designs (Bailey et al., 2014) and most of these approaches rely on observation-level, ground-truthed verification data for model training and validation. The proposed model can be extended to enhance occupancy inferences made from citizen science surveys (Sauer et al., 2017); long-term, multi-PI surveys where methodologies may vary through time or across sites; automated imperfect classifiers (i.e. machine learning algorithm) applied to large volumes of open data (e.g. satellite or airborne remote sensing, camera traps); or research scenarios where data provenance is limited, which makes propagating classification uncertainty important.
Future work could explore through simulation ways that model parameters affect behaviour (e.g. effects of different weights on the informed Dirichlet prior, imperfect classifier accuracy or sample prioritization for verification) and use observation-level covariates to enhance estimation of classification probabilities and ecological parameters.

ACK N OWLED G EM ENTS
The National Ecological Observatory Network is a program sponsored by the National Science Foundation and operated under cooperative agreement by Battelle. This material is based in part upon work supported by the National Science Foundation through the NEON Program. We thank G. Vagle for taking part in the conception of this project and J. Coulombe for her graphical design assistance.
The work was supported by the CU Boulder Grand Challenge investment in Earth Lab. AIS was supported as a GRA at Earth Lab for work on this project. Any use of trade, product or firm names is for descriptive purposes only and does not imply endorsement by the U.S. Government.

CO N FLI C T O F I NTE R E S T
The authors have no conflict of interest to declare.

PEER R E V I E W
The peer review history for this article is available at https://publo ns.com/publo n/10.1111/2041-210X.13858.

DATA AVA I L A B I L I T Y S TAT E M E N T
Carabid data are publicly accessible through the NEON data portal at https://data.neons cience.org/data-produ cts/DP1.10022.001 (National Ecological Observatory Network, 2021). Code for data cleaning and analysis is archived on Zenodo at https://doi. org/10.5281/zenodo.6395234 (Spiers et al., 2022) and also available at https://github.com/annas piers/ NEON-NIWO-misclass.