A continuous‐score occupancy model that incorporates uncertain machine learning output from autonomous biodiversity surveys

Ecologists often study biodiversity by evaluating species occupancy and the relationship between occupancy and other covariates. Occupancy models are now widely used to account for false absences in field surveys and to reduce bias in estimates of covariate relationships. Existing occupancy models take as inputs binary detection/non‐detection observations of species at each visit to each site. However, autonomous sensing devices and machine learning models are increasingly used to survey biodiversity, generating a new type of observation record (i.e. continuous‐score data) that reflects the model's confidence a species is present in each autonomously sensed file, instead of binary detection/non‐detection data. These data are not directly compatible with traditional binary occupancy modelling methods. Here, we develop a new occupancy model that models continuous scores on a visit level as a Gaussian mixture, combining a distribution of scores for files that do contain the species of interest and a distribution of scores for files that do not. The model takes as input continuous scores for each autonomously sensed and classified file, along with an optional small number of binary, manually verified detection and non‐detection annotations. We present a simulation study that shows that over a range of empirically realistic parameters, our model outperforms traditional occupancy models that are based on binary annotation alone. We also apply this new model to an empirical case study using data generated from five machine learning classifiers applied to autonomous acoustic recordings gathered in the eastern United States. Because our occupancy model generalizes allowable input data beyond binary observations, it is particularly well‐suited to the increasing volume of machine learning classified data in ecology and conservation.


| INTRODUCTION
Ecologists and conservation biologists are increasingly in need of tools to efficiently survey biodiversity at large scales. Scientists often assess occupancy, the proportion of sites at which a species occurs, to identify species at risk or evaluate conservation efforts. Occupancy itself can suggest the degree to which a population persists or is threatened across its range (MacKenzie, 2005) and is used in studies of metapopulation dynamics (e.g. Chandler et al., 2015; Hanski, 1994; MacKenzie et al., 2003).
Occupancy is also used to predict how species respond to covariates such as habitat characteristics, climate change, disturbances or restoration efforts (e.g. Cavada et al., 2019; Chambert, Kendall, et al., 2015; Johnson, 2007). Accurately estimating species occupancy is thus of critical importance to advancing ecological understanding and conserving species.
To estimate species occupancy, surveys are often conducted to attempt to detect a species of interest at particular sites. However, detection/non-detection data do not themselves provide knowledge of site occupancy, which cannot generally be measured directly in the field because observers detect species imperfectly (MacKenzie et al., 2002). Most significantly, field surveys usually include false absences, where a species is present at a site but undetected (Martin et al., 2005). Field surveys may also produce false presences, in which a species does not occupy a site but is mistakenly identified as present, for example, due to misidentifications (Iknayan et al., 2014).
Imperfect detection can cause bias and reduce power in studies measuring species occurrence or the effects of ecological covariates on species occurrence, especially where covariates also influence detection probability (e.g. Gu & Swihart, 2004; MacKenzie, 2006; Martin et al., 2005; Tyre et al., 2003).
Over the past two decades, occupancy modelling has gained popularity among ecologists as a method of accounting for measurement errors in surveys and reducing bias in estimation of covariates' effects on occupancy (MacKenzie et al., 2018). Single-species occupancy models traditionally leverage multiple binary detection/non-detection surveys at each sampled site to partition the probability that a species is present from the probability that a species is detected given its presence. More recently, models have been developed to account for the rate of false positives as well (Chambert, Miller, & Nichols, 2015; MacKenzie et al., 2018; Miller et al., 2011, 2013; Royle & Link, 2006).
While traditional occupancy models were developed for use with binary observational data generated from human detection/non-detection surveys, researchers are increasingly collecting similar survey data with autonomous sensors. These sensors, including automated acoustic recorders and camera traps, have several advantages for biodiversity surveys. Autonomous sensors may enable scientists to collect data over larger spatial scales (Darras et al., 2019; Shonfield & Bayne, 2017), increase detection of nocturnal, rare or secretive species (e.g. Bobay et al., 2018; Wrege et al., 2017), more easily survey remote environments (Buxton & Jones, 2012; Gray et al., 2019), provide a verifiable and long-lasting record of species (Darras et al., 2019; Newson et al., 2017), and generate many repeat 'surveys' for potential use in occupancy modelling (Abrahams & Geary, 2020; Chambert et al., 2018).
Given the large spatial and temporal scales at which these sensors can collect images and audio, much effort has been put into creating automated algorithms capable of analysing large volumes of sensor data, such as machine learning classifiers and signal processing algorithms (Priyadarshani et al., 2018; Tabak et al., 2019).
Deep learning algorithms such as convolutional neural networks are becoming increasingly popular for processing these data, and in recent years have been shown to outperform other algorithms for the analysis of complex data (e.g. avian soundscape data, Kahl et al., 2019). These and other automated recognizer algorithms (e.g. signal processing algorithms, Lapp et al., 2021) produce continuous scores reflecting the model's confidence that each file contains the species they are designed to identify (though these scores should not be confused with calibrated probabilities of presence within a file, especially in deep learning; see Guo et al., 2017).
Here, we develop an occupancy model that takes as inputs continuous scores of the type returned by many machine learning classifiers. Instead of 'visits', which in traditional applications of occupancy models are a series of physical visits to each field site, our individual surveys are 'files', the numerous autonomously sensed files generated by autonomous sensors and scored by a machine learning algorithm. We examine the model's performance on both simulated and real classifier scores and compare this to the performance of a traditional occupancy model that uses only human annotated data, without the additional information provided by classifier scores. We show that the continuous-score model recovers unbiased estimates of true occupancy across a wide range of plausible parameter values.
In an empirical case study, the continuous-score model is found to outperform the traditional model in 52% of cases, provide comparable performance in an additional 28% of cases and produce a slightly less accurate estimate of occupancy in 20% of cases.

| Standard occupancy model
Static single-species occupancy models are designed to estimate true occupancy for a species based on multiple visits to a site. For i in 1, …, T sites, let z_i be a Bernoulli random variable that equals 1 if a species is present at site i and is otherwise 0. Assuming a species' occupancy probability, ψ, does not vary across sampled locations (see Supporting Information for a generalization), standard occupancy models estimate latent presence of a species as

z_i ~ Bernoulli(ψ).

Let y_ij be an observation of site i on visit j in 1, …, J_i visits, where y_ij = 1 if the species is detected and y_ij = 0 if the species is not detected. Traditional occupancy models assume that if a site is occupied, the species is detected during a visit with probability p, while visits to an unoccupied site always result in non-detections (i.e. no false positives). The probability of detecting a species at a site i conditional on its true presence z_i is then

y_ij | z_i ~ Bernoulli(z_i p).

This model can be applied to data collected using autonomous sensors if a human annotator labels a subset of files from each site i with detection or non-detection of the species. If annotators do not falsely detect a species in a file when it is absent in the file, these data can be used directly in the model framework above.
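As a minimal illustration of the data-generating process just described (a sketch with hypothetical parameter values, not the authors' implementation), binary detection histories under the standard model can be simulated as follows:

```python
# Illustrative sketch: simulating detection/non-detection data under the
# standard occupancy model. All parameter values here are hypothetical.
import numpy as np

rng = np.random.default_rng(42)

T, J = 100, 5        # sites and visits per site
psi, p = 0.6, 0.4    # occupancy and detection probabilities

z = rng.binomial(1, psi, size=T)                  # latent occupancy, z_i ~ Bernoulli(psi)
y = rng.binomial(1, z[:, None] * p, size=(T, J))  # detections; no false positives

# The naive estimate (fraction of sites with at least one detection)
# underestimates occupancy whenever detection is imperfect.
naive = (y.sum(axis=1) > 0).mean()
print(naive, z.mean())
```

Because unoccupied sites never yield detections here, the naive estimate can never exceed the true fraction of occupied sites, which motivates the latent-variable formulation above.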

| Continuous-score model design
Consider now the case in which each sample unit or 'site' i in 1, …, T sites is surveyed by one autonomous sensor, such as an acoustic recorder or a camera trap. Define a file f_{i,k_i} as a short audio clip or single camera trap image, where the second subscript indexes the k_i in 1, …, K_i files generated by the sensor at site i. Each file is scored by a machine learning model, generating a score s_{i,k_i} that is a continuous real number representing the algorithm's confidence that a species of interest is present in the file. Scores are not required to sum to 1 across files or across multiple species in a multispecies classifier. The model also optionally allows for a second observation type, in which a subset of files receives a binary detection/non-detection annotation from a human observer.
Although many distributions may be used to describe s_{i,k_i}, empirical data (see below) suggest that a Gaussian mixture is appropriate for such classifier output. For machine learning algorithms that output scores in the range [0, 1], real-valued scores can be produced by applying the logit transform to the raw scores.
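The logit transform mentioned above can be sketched as follows (a minimal illustration; the clipping constant is an assumption to avoid infinities at scores of exactly 0 or 1):

```python
# Illustrative sketch: mapping classifier scores from [0, 1] to the real line
# with the logit transform. eps clips scores away from exactly 0 and 1.
import numpy as np

def logit(scores, eps=1e-6):
    s = np.clip(np.asarray(scores, dtype=float), eps, 1 - eps)
    return np.log(s / (1 - s))

raw = np.array([0.01, 0.5, 0.99])
print(logit(raw))  # roughly [-4.595, 0.0, 4.595]
```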
As above, we presume that the species occurs at site i with occupancy probability ψ, such that

z_i ~ Bernoulli(ψ).

We assume that files collected at an occupied site i contain the species with probability θ_i, representing a 'call rate' for acoustic surveys or an 'appearance rate' for camera trap surveys. Assuming that θ_i is equal for all occupied sites (θ_i = θ), the presence w_{i,k_i} of a species in a file f_{i,k_i} is given by

w_{i,k_i} | z_i ~ Bernoulli(z_i θ),

such that if the site is unoccupied (z_i = 0) the species will not be present in any file. Mechanistically, θ can be interpreted to include all the factors leading to the presence of a species in a file, including not only a species' rate of calling or appearing, but also its availability for detection by the autonomous sensor in a given survey (see Brack et al., 2018; Dénes et al., 2015; Martin et al., 2005).
If a file f_{i,k_i} does not contain the species, the score for that file is assumed to be assigned by the classifier as

s_{i,k_i} ~ Normal(μ₀, σ₀²),

where μ₀ and σ₀ are the mean and standard deviation of scores returned by the classifier for files that do not contain the species.
Conversely, for files that do contain the species,

s_{i,k_i} ~ Normal(μ₁, σ₁²),

where μ₁ and σ₁ are the mean and standard deviation of scores returned by the classifier for files that do contain the species.
These parameters provide measures of classifier performance.
Classifiers with larger μ₁ − μ₀ are better able to discriminate between files that do and do not contain the species (Figure 1a).
Throughout, we assume that μ₁ > μ₀, such that the classifier is, at a minimum, capable of assigning higher scores on average to files in which a species is present. Likewise, classifiers with smaller σ₁ and σ₀ can discriminate with higher precision between files that do and do not contain the species (Figure 1a).
The overall distribution of scores within or across sites can thus be described as a two-component Gaussian mixture (Table 1). One component of the mixture consists of files in which the species is absent and the second component consists of files in which the species is present (Figure 1b). Within an occupied site i, both components will be present in the mixture, with a weight of θ assigned to the positive component. When a site is not occupied, the positive component will not appear and the score distribution will simply be the Gaussian distribution Normal(μ₀, σ₀²). If scores across all occupied and unoccupied sites are aggregated, the weight of the positive component will be the product of appearance rate and occupancy, ψθ.
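The generative model described above can be sketched end-to-end (a hedged illustration with hypothetical parameter values, not the authors' code): sites are occupied with probability ψ, files at occupied sites contain the species with probability θ, and scores are drawn from the corresponding Gaussian component.

```python
# Illustrative sketch of the generative model: scores form a two-component
# Gaussian mixture, with the positive component carrying weight theta within
# occupied sites and weight psi * theta overall. Parameters are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
T, K = 200, 60                # sites, files per site
psi, theta = 0.5, 0.25        # occupancy and file-level appearance rate
mu0, sd0 = -4.0, 1.5          # negative-component mean/sd (logit scale)
mu1, sd1 = 1.0, 2.0           # positive-component mean/sd (logit scale)

z = rng.binomial(1, psi, size=T)                      # site occupancy
w = rng.binomial(1, z[:, None] * theta, size=(T, K))  # file-level presence
s = np.where(w == 1,
             rng.normal(mu1, sd1, (T, K)),
             rng.normal(mu0, sd0, (T, K)))            # classifier scores

# Aggregated over all sites, the positive-component weight approaches psi * theta
print(w.mean())  # close to 0.125
```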
As shown below, this model can be used as specified above without any additional data. However, to improve performance, the model can be extended to also incorporate binary detection/non-detection human annotations for a small random subset of files at each site. When annotated data are available, every file f_{i,k_i} can be in one of three states: h_{i,k_i} = 0 if the file is confirmed by an annotator to not contain the species, h_{i,k_i} = 1 if the file is confirmed by an annotator to contain the species, or h_{i,k_i} = 2 if the file is not reviewed. Scores s_{i,k_i} are still available for all files. Human annotators are presumed to correctly annotate whether a file does or does not contain the species (see Section 2.3 below). For this expanded model, the probability π_{i,k_i} of observing a score s_{i,k_i} given the site's true occupancy state z_i and the human observation category h_{i,k_i} for that file is parameterized as given in Table 1.
Given the vector of scores S and the vector of human annotations H, the likelihood of the model is

L(ψ, θ, μ₀, σ₀, μ₁, σ₁ | S, H) = ∏_{i=1}^{T} [ ψ ∏_{k_i=1}^{K_i} Pr(s_{i,k_i}, h_{i,k_i} | z_i = 1) + (1 − ψ) ∏_{k_i=1}^{K_i} Pr(s_{i,k_i}, h_{i,k_i} | z_i = 0) ],

where the per-file probabilities are given in Table 1. The full hierarchical specification of the model and additional details about the likelihood function are in the Supporting Information.
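A single site's contribution to this likelihood can be sketched as below. This is an assumed parameterization consistent with the text and Table 1, not the authors' NIMBLE code: unreviewed files (h = 2) follow the θ-weighted mixture, confirmed files pin the component, and an unoccupied site cannot produce h = 1.

```python
# Illustrative sketch: marginal log-likelihood of one site's scores s and
# annotations h (0 = confirmed absent, 1 = confirmed present, 2 = unreviewed),
# marginalizing over the latent occupancy state z_i with log-sum-exp.
import numpy as np

def normal_logpdf(x, mu, sd):
    return -0.5 * ((x - mu) / sd) ** 2 - np.log(sd) - 0.5 * np.log(2 * np.pi)

def site_log_lik(s, h, psi, theta, mu0, sd0, mu1, sd1):
    s, h = np.asarray(s, float), np.asarray(h)
    f0 = normal_logpdf(s, mu0, sd0)
    f1 = normal_logpdf(s, mu1, sd1)

    # log P(s_k, h_k | z = 1): unreviewed files follow the theta-weighted mixture
    occ = np.where(h == 2,
                   np.logaddexp(np.log(theta) + f1, np.log1p(-theta) + f0),
                   np.where(h == 1, np.log(theta) + f1, np.log1p(-theta) + f0))
    # log P(s_k, h_k | z = 0): only the negative component; h = 1 is impossible
    unocc = np.where(h == 1, -np.inf, f0)

    # Marginalize over z_i ~ Bernoulli(psi)
    return np.logaddexp(np.log(psi) + occ.sum(), np.log1p(-psi) + unocc.sum())

ll = site_log_lik(s=[-4.1, 0.8, -3.5], h=[2, 1, 0],
                  psi=0.5, theta=0.25, mu0=-4.0, sd0=1.5, mu1=1.0, sd1=2.0)
print(ll)
```

Note how a single confirmed detection (h = 1) zeroes out the unoccupied branch, which is the mechanism by which annotations sharpen the occupancy estimate.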
We used the NIMBLE package in R to conduct a simulation study and an empirical study using this model and a traditional occupancy model (based on an implementation by Kéry & Royle, 2016) in a Bayesian framework (simulations: NIMBLE v0.9.0, R v4.0.0; empirical study: NIMBLE v0.11.0, R v3.6.3; de Valpine et al., 2017; NIMBLE Development Team, 2021). We used uninformative priors of Uniform(0, 1) for ψ and θ, Normal(0, 10⁴) for μ₀, the mean of the distribution of scores for files where the species was absent, Normal(1, 10⁴) for μ₁, the mean of the distribution of scores for files where the species was present, and Uniform(0, 10⁴) for the standard deviations σ₀ and σ₁ of the two score distributions. In fitting the traditional model, we used uninformative priors of Uniform(0, 1) for ψ and p. We estimated all posterior distributions using Markov chain Monte Carlo (MCMC) simulation with one chain of 10,000 samples (10,000 burn-in iterations, no thinning or adaptation, NIMBLE binary and RW samplers). We chose MCMC settings by visually inspecting plots of sample chains for several examples. We conducted graphical validation of sample chains to assess the convergence of a subset of models using v0.19-4 of the CODA package for R (Plummer et al., 2006; Robert & Casella, 2010).
The model above has similarities to existing occupancy models that account for false positives in binary detection histories (Chambert, Miller, & Nichols, 2015; MacKenzie et al., 2006; Miller et al., 2011; Royle & Link, 2006). Of these, our model is most similar to the 'multiple detection methods' site-confirmation design, which makes use of ambiguous detections that are easier to collect and unambiguous detections that are harder to collect (Miller et al., 2011). It also incorporates features of the 'observation confirmation' method, which allows definitive confirmation of species identification post-survey (Chambert, Miller, & Nichols, 2015).
Simultaneous to the development of our continuous-score model, Kéry and Royle (2021, p. 431) described an occupancy model that also treats the continuous output of a classification model as a Gaussian mixture. The model proposed by Kéry and Royle (2021) is structured more similarly to the model proposed by Chambert et al. (2018), while our continuous-score model above is most similar in structure to the 'multiple detection methods' site-confirmation design of Miller et al. (2011).

F I G U R E 1
The outputs of a machine learning classifier applied to remote sensing files form a mixture of two Normal distributions. The logit function is applied to transform machine learning scores from [0, 1] to the real numbers. The mixture's upper and lower components are formed by scores from files that, respectively, did or did not contain the species of interest. (a) The amount of overlap between the two components is a measure of classifier quality. Higher quality classifiers may increase the mean of the positive distribution, reducing overlap between the two components and representing an improved ability to differentiate positive and negative files. (b) Higher site occupancy and appearance rates increase the relative size of the upper component of the mixture. Note that when scores are examined on a site-by-site basis, the upper component will not be present for sites where the species does not occur.

T A B L E 1
The probability π_{i,k_i} of recording a machine learning score s_{i,k_i} given the site's true occupancy state z_i and the human observation for the file h_{i,k_i}. In this table, f(s_{i,k_i} | μ, σ) is the probability density function of the Normal distribution with mean μ and variance σ². The means and standard deviations of score distributions in files that do not contain the species and do contain the species are, respectively, μ₀, σ₀ and μ₁, σ₁. The probability that a species appears in a file at an occupied site is θ.
Aside from presenting a complementary approach to continuous-score modelling, our work contributes the first tests of such a continuous-score model using realistic ranges of empirical parameter values and empirical machine learning classified data, while comparing the model performance to traditional occupancy models. Our analysis also examines the effects of varying levels of file annotation, a question that Kéry and Royle (2021) noted as particularly important to address.

| Methodological considerations
Before addressing the performance of our continuous-score model, we note several methodological considerations. One consideration concerns cases where human annotators produce false negatives in particular types of files, such as recordings in which a vocalization is quiet or photos in which an animal is distant. If such false negatives tend to receive low scores from the classifier (i.e. both humans and the classifier are more likely to 'miss' the same files), the mixture model above could be extended to a three-component mixture. Alternative approaches could add an additional hierarchical layer that models the annotation process separately from θ, distinguishing between detection and availability, which traditional occupancy models do not need to separate.
Fourth, we note that the model above assumes that a species will be present in at least one file if it is present at a site, which once again suggests that our model will be most successful when θ is reasonably far from zero. As an illustration, a species present at a site with θ = 0.1, a value much lower than that found in our empirical case study, will have less than a 0.2% chance of being absent from every file in a 60-file sample. In our analysis below, we approximate the probability that a species appears in zero files at an occupied site to be zero. In cases where θ is expected to be low, the approaches described above or an increase in the number of files per site can increase the appropriateness of this assumption.
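The zero-appearance probability in this illustration follows directly from the file-level Bernoulli assumption and can be checked in one line:

```python
# Quick check: at an occupied site with appearance rate theta and K
# independent files, the species appears in no file with probability
# (1 - theta) ** K.
theta, K = 0.1, 60
p_zero = (1 - theta) ** K
print(p_zero)  # 0.9 ** 60 ≈ 0.0018, i.e. roughly 0.2%
```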

| Simulation study
We first conducted a simulation study to test the ability of our model to recover known parameter values. While examining initial results from model fitting, we found there were cases in which the fitted model appeared unable to distinguish the two components of the mixture model, with estimates of μ₀ and μ₁ nearly equal. Approximately 5% of the total runs (1,742 of the 37,500 simulations) produced this type of failure. We consider this issue, diagnosed by fits in which the estimates of the two classifier means differed by less than 0.1, to be a type of model failure. For additional details, see the Supporting Information.
In this simulation study, we assessed each model's bias in three ways: the median error of the estimates, whether the interquartile range of estimates for the scenario contained the true population value of occupancy, and the root-mean-square error across the estimates for the scenario. We assessed precision by comparing the widths of the interquartile ranges of estimated model parameters.
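The three bias summaries and the precision summary just listed can be sketched as follows (a minimal illustration with made-up estimates, not the study's analysis code):

```python
# Illustrative sketch of the bias and precision summaries described above:
# median error, interquartile-range coverage of the truth, RMSE, and IQR width.
import numpy as np

def summarize(estimates, truth):
    est = np.asarray(estimates, float)
    err = est - truth
    q1, q3 = np.percentile(est, [25, 75])
    return {
        "median_error": float(np.median(err)),
        "iqr_covers_truth": bool(q1 <= truth <= q3),
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        "iqr_width": float(q3 - q1),  # precision: smaller is more precise
    }

out = summarize([0.48, 0.52, 0.50, 0.55, 0.45], truth=0.5)
print(out)
```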

| Empirical study
We also tested our model empirically using a dataset of annotated autonomous acoustic recordings. Classifier training data were drawn from a single autonomous recording unit (ARU); we excluded recordings from this ARU from the remainder of the analysis. Classifiers were trained using OpenSoundscape v0.4.7 (Kitzes et al., 2020; Supporting Information). We then used these classifiers to produce file-level scores for species presence in the remaining 3,780 5-s files in the dataset. We split these remaining clips into 63 'temporal sites', each a continuous 5-min recording containing 60 5-s files, and treated each such recording as a site.
We defined true ψ at temporal sites as the sample-at-hand occupancy (see also Chambert et al., 2018). We considered annotation levels ranging from zero annotations to annotation of every file (N = 0, 1, 2, 4, 6, 8, 32, 60). Clips were randomly selected in a nested fashion, such that the annotations at each higher annotation level included the clips annotated at lower annotation levels. We fit our continuous-score occupancy model to all annotation levels and the traditional occupancy model to all annotation levels except N = 0 and N = 1, as traditional occupancy models require multiple surveys per site.
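The nested selection of clips described above can be sketched as follows (an illustrative implementation of the nesting property, not the study's sampling code): drawing one random ordering of a site's files and taking prefixes of increasing length guarantees each level is a superset of the levels below it.

```python
# Illustrative sketch of nested annotation-level sampling: each higher level's
# annotated clips are a superset of the lower levels' clips.
import numpy as np

rng = np.random.default_rng(7)
K = 60                               # files per site
levels = [0, 1, 2, 4, 6, 8, 32, 60]  # annotation levels from the text

order = rng.permutation(K)           # one random order per site
subsets = {n: set(order[:n]) for n in levels}

# Nesting check: every lower level is contained in the next higher level
print(all(subsets[a] <= subsets[b] for a, b in zip(levels, levels[1:])))
```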

| Simulation study
After removing the approximately 5% of simulations that failed (see Section 2), our results demonstrate that the continuous-score model successfully produces unbiased estimates of occupancy.
Across all succeeding simulations, the mean and median root-mean-square errors of the occupancy estimates were low (Supporting Information). Our second simulation study demonstrated that our model provided unbiased estimates of covariate effects on both ψ and θ in 11 of 12 scenarios; in the 12th scenario, a combination of a good classifier with no annotations, most of the simulations did not converge (Figures S6–S9).

F I G U R E 2
A simulation study finds that the continuous-score model's occupancy estimates are unbiased and are more precise than the estimates of a traditional occupancy model. The continuous-score model can also be applied to data with fewer than two annotations per site, unlike a traditional occupancy model. Simulations were performed over five annotation levels, ranging from no annotations to 72 annotations per site (10% of all data annotated), five call rate levels, five occupancy levels and three classifier quality levels. For both the continuous-score model and the traditional model, this figure shows the difference between each model's occupancy estimate and the true occupancy in the 'adequate classifier' simulations. Simulations in which the continuous-score model failed to distinguish the components of the Gaussian mixture are removed.

The classifier-generated score distributions generally did not deviate strikingly from our assumption of two-component Gaussian mixtures, except for the Black-capped Chickadee classifier (Figure 4). The score distribution for files containing Black-capped Chickadees was bimodal because the classifier assigned low scores to files containing the 'twitter' vocalization of Black-capped Chickadees, a call type not present in the training dataset used to create the classifier. Importantly, this bimodal score distribution did not strongly impact the accuracy of the occupancy estimation.

| Empirical study
At low annotation levels, the fitted two-component mixture differed from the actual score distribution for two species, Eastern Towhee and Wood Thrush, but in both cases the fit improved to closely match the true distributions as more annotations were added (Figure 4). For the Eastern Towhee, the poor fit reflects a weak classifier that only faintly distinguished Eastern Towhee positive files from negative files, because the training data contained almost no negative files for this highly vocal species; this led to underestimates of call rate and occupancy (Figure 3; Figure S15). The Wood Thrush model demonstrated the opposite problem, overestimating call rate and occupancy at most annotation levels (Figure 3; Figure S15).

| DISCUSSION
The continuous-score occupancy model presented above provides a framework that takes as input continuous machine learning classifier scores and, optionally, a small number of human-annotated files.
When used to estimate occupancy probabilities, the model recovered unbiased estimates in our simulation study even with no annotated data. Our simulation experiment and empirical case study were both conducted at relatively modest scales; studies with larger numbers of sites and files are likely to lead to more precise estimates of model parameters. We also note that in larger field studies, the ability of the continuous-score model to decrease the required number of annotated files per site, as compared to a traditional occupancy model, will become more valuable.
Aside from the continuous-score model developed above, there are two broad alternative approaches that could be used to estimate occupancy using autonomously collected data. First, human annotators can review a sample of available files, ignoring the remainder of unreviewed files, and use traditional occupancy models to analyse this sample. This approach is straightforward to implement, is analogous to analysis methods used for human survey data and is still commonly used in the literature. As demonstrated above, our continuous-score model substantially outperforms such an approach.
Second, continuous-score data from a machine learning classifier can be converted to binary 0/1 data by choosing a threshold. As the resulting binary data are nearly certain to contain false positives and false negatives, they should be used with an occupancy model that accounts for both error types (e.g. Chambert, Miller, & Nichols, 2015; MacKenzie et al., 2018; Miller et al., 2011, 2013). Many of these 'false-positive occupancy models' require a second type of observation that is not subject to false positives; in contrast, our continuous-score model benefits from such additional verification but does not require it.
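The thresholding alternative can be sketched as follows (an illustration with made-up scores and labels): a cutoff converts continuous scores to binary detections, and any fixed cutoff trades false positives against false negatives.

```python
# Illustrative sketch of thresholding continuous classifier scores into binary
# detections. Scores, labels and the threshold are all hypothetical.
import numpy as np

scores = np.array([-5.2, -3.1, 0.4, 2.7, 1.5, -4.0])  # logit-scale scores
truth = np.array([0, 1, 1, 1, 0, 0])                  # true file-level presence
threshold = 0.0

detections = (scores > threshold).astype(int)
false_pos = int(((detections == 1) & (truth == 0)).sum())  # negative file over cutoff
false_neg = int(((detections == 0) & (truth == 1)).sum())  # quiet call under cutoff
print(detections.tolist(), false_pos, false_neg)
```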

F I G U R E 4
Comparison of machine learning scores of classifiers on autonomous acoustic recorder data with scores from an example simulated classifier. Overlaid on the empirical score distributions are the distributions fit by the continuous-score model at four levels of annotation, representing 0, 2, 8 or 60 files annotated per site. The simulated scores were generated with an 'adequate' classifier where the species occupied 50% of sites and called in 25% of the files at occupied sites.

While we did not explicitly compare our continuous-score model to this approach of thresholding data for use with a false-positive occupancy model, such a comparison is an important future direction of research. Several structural designs for false-positive occupancy models (e.g. Balantic & Donovan, 2019; Chambert et al., 2018; Miller et al., 2011) could be used for this evaluation. We note that when making such comparisons, false-positive and false-negative rates will be determined by the choice of threshold, and thus the accuracy and precision of parameter estimates in such models are likely to vary with the choice of threshold as well. As there are likely to be systematic differences between the dataset used to train/test the machine learning classifier and the dataset used for model fitting (Knight & Bayne, 2018), we stress that false-positive and false-negative rates will need to be estimated within the model itself, rather than assumed based on performance on training or validation data.
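The dependence of error rates on the threshold can be made concrete under the two-Gaussian score model: for a cutoff t, the false-positive rate is the upper tail of the negative component and the false-negative rate is the lower tail of the positive component. The component parameters below are hypothetical.

```python
# Illustrative sketch: false-positive and false-negative rates implied by a
# threshold t under the two-component Gaussian score model.
import math

def error_rates(t, mu0, sd0, mu1, sd1):
    Phi = lambda x: 0.5 * (1 + math.erf(x / math.sqrt(2)))  # standard normal CDF
    fpr = 1 - Phi((t - mu0) / sd0)  # negative files scoring above t
    fnr = Phi((t - mu1) / sd1)      # positive files scoring below t
    return fpr, fnr

for t in (-2.0, 0.0, 2.0):
    fpr, fnr = error_rates(t, mu0=-4.0, sd0=1.5, mu1=1.0, sd1=2.0)
    print(t, round(fpr, 3), round(fnr, 3))
```

Raising the threshold drives the false-positive rate toward zero while inflating the false-negative rate, which is why threshold-based models must estimate both rates rather than fix them.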
We anticipate several particularly useful applications of our model. First, the model may be applicable to studies over large spatiotemporal scales where gathering annotated data is time-consuming. Examples include estimation of occupancy over many sites or assessment of changes in occupancy over the course of a season. This may also apply to studies of multiple species in which annotating all species is time-consuming, requires additional expertise or requires annotation during multiple survey periods (e.g. diurnal vs. nocturnal species). Second, the model may be applicable to studies where call rate or occupancy is thought to be low. Traditional occupancy models exhibit a strong bias in these scenarios that is only mitigated by high levels of annotation. Third, the model may be useful in studies where call rate is a parameter of interest. A further useful extension of occupancy models that use machine learning classifier scores would be to use those scores to direct listening effort toward higher-scoring files that are more likely to contain the species of interest.

ACKNOWLEDGEMENTS
We thank Samuel Lapp for contributions to identifying the underlying distributions of machine learning classifier outputs. We are grateful for the collaboration of the Powdermill Avian Research Center in producing the data used for the empirical study. We also thank Torbjørn Ergon and two anonymous reviewers for their comments, which greatly improved the manuscript.

CONFLICT OF INTEREST
The authors declare that there are no competing interests.

DATA AVAILABILITY STATEMENT
Implementations of the continuous-score and traditional models, scripts for data simulation, fitting models to simulated data and classifier training, and the trained machine learning models are available from the Dryad Digital Repository: https://doi.org/10.5061/dryad.ns1rn8ptd (Rhinehart et al., 2022). Data used to train the machine learning models are available from the Dryad Digital Repository: https://doi.org/10.5061/dryad.d2547d81z (Chronister et al., 2021).