Reliability of animal counts and implications for the interpretation of trends

Abstract Population time series analysis is an integral part of conservation biology in the current context of global change. To quantify changes in population size, wildlife counts only provide estimates, because of various sources of error. When unaccounted for, such errors can obscure important ecological patterns and reduce confidence in the derived trend. In the case of highly gregarious species, which are common in the animal kingdom, the estimation of group size is an important potential source of bias, characterized by high variance among observers. In this context, it is crucial to quantify the impact of observer changes, inherent to population monitoring, on i) the minimum length of population time series required to detect significant trends and ii) the accuracy (bias and precision) of the trend estimate. We acquired group size estimation error data through an experimental protocol in which 24 experienced observers conducted counting simulation tests on group sizes. We used these empirical data to simulate observations over 25 years of a declining population distributed over 100 sites. Five scenarios of changes in observer identity over time and sites were tested for each of three simulated trends (true population size evolving according to deterministic models parameterized with declines of 1.1%, 3.9% or 7.4% per year, which justify respectively a "declining," "vulnerable" or "endangered" population under IUCN criteria). We found that under realistic field conditions observers detected the accurate value of the population trend in only 1.3% of cases. Our results also show that trend estimates are similar whether many observers are spatially distributed among the different sites or one single observer counts all sites. However, successive changes in observer identity over time lead to a clear decrease in the ability to reliably estimate a given population trend, and an increase in the number of years of monitoring required to adequately detect the trend.
Minimizing temporal changes of observers improves the quality of count data and helps in taking appropriate management decisions and setting conservation priorities. The same holds when increasing the number of observers spread over the 100 sites. If the surveyed population is distributed over only a few sites, it is preferable to have a single observer perform the survey. In this context, it is important to reconsider how we use estimated population trend values, and potentially to scale our decisions according to the direction and duration of estimated trends, instead of setting overly precise threshold values before action.


| INTRODUCTION
Conservationists and stakeholders often focus on population dynamics to quantify the scale and significance of ecological and human impacts on wildlife. Estimates of state variables (abundance, occurrence, and species richness; Royle & Dorazio, 2008), used for instance in habitat evaluation (Cowardin & Golet, 1995) or adaptive harvest management programs (Madsen et al., 2017), help to identify those species most in need of conservation and management attention, using criteria such as quantified reductions in estimated population size (Gärdenfors, 2001; Gärdenfors et al., 2001).
However, wildlife counts only provide estimates and not actual population size, because of various sources of error such as imperfect detection (Dénes et al., 2015), imperfect abilities to count animals that are detected (Seber, 2002; Thompson, 2002; Williams et al., 2002), misidentification of species, or nonexhaustive geographical coverage. When unaccounted for, these errors can introduce considerable estimation bias and obscure important ecological patterns (Wenger & Freeman, 2008), which can reduce the power to detect trends and the accuracy of any trends that are detected (Sanz-Pérez et al., 2020). Integration of systematic sources of count errors into population models can help. Nevertheless, to keep the model simple and avoid overparameterization (principle of parsimony, Vandekerckhove et al., 2015), only a selection of the most important errors should be modeled, according to the situation. In the specific case of highly gregarious species such as waterbirds (Tamisier & Dehorter, 1999), seabirds, and cetaceans (Barlow & Gerrodette, 1996), counts aim to estimate the size of groups of up to several tens of thousands of individuals. In this case, the estimation of group size is particularly likely to be biased and errors must be taken into account. Sources of bias in group size estimation include, for instance, the count methods, equipment, and observer identity. Some studies have focused on measuring this latter source of error. The results collectively suggest that visual estimates of large aggregations of individuals may generally be associated with underestimation combined with high variances within and among observers (Dervieux et al., 1980; Erwin, 1982; Prater, 1979). In these studies, measurement of bias in observer estimation over a range of group sizes is generally evaluated through comparison with aerial photographs (Dervieux et al., 1980; Erwin, 1982; Prater, 1979).
However, it is recognized that photographs are far from ideal to assess the true number of individuals in groups, due to other biases that can occur when reading the photo (e.g., definition, movement; Descamps et al., 2011). Studies have therefore used ground counts as a proxy for the "true" number of individuals (Bouché et al., 2012;Smith, 1995), but these are always estimated with a margin of error, and this is particularly the case for high group densities where individuals overlap each other.
Sources of error that change over time, like observer identity, can generate incorrect estimates if they are not properly taken into account (Barker & Sauer, 1992). In particular, it is common over long time series that the staff in charge of counts changes over time. The magnitude of observer differences in group estimation error can therefore induce additional variability (Dervieux et al., 1980; Erwin, 1982; Prater, 1979), potentially leading to wrong conclusions regarding trends (McCain et al., 2016). Indeed, the lack of detection of a trend with a given statistical test may correspond to a real absence of a trend or to an important type II error (β), hence a lack of statistical power (1 − β), which can be due to short time series and/or changes of observers (Gerrodette, 1987; White, 2019). Short time series are potentially misleading: at least 10-20 years of continuous monitoring are generally necessary to achieve a high level of statistical power, depending on species, trend strength, and study design (Reynolds et al., 2011; Rueda-Cediel et al., 2015; White, 2019).
However, since both time and resources available for conservation are finite, time series are often shorter than statistically desirable (Field et al., 2007; Hughes et al., 2017). Therefore, managers need to know how much they can trust the apparent trend of a population to reliably identify the sites or species for which conservation action is really needed, and to take management decisions (Giron-Nava et al., 2017; Martin et al., 2012, 2017). Earlier studies have thus examined potentially important trade-offs between spatial and temporal replication to minimize uncertainty in trend estimates (Rhodes & Jonzén, 2011) and measured the impact of sampling (i.e., count) frequency on trend estimates (Wauchope et al., 2019). However, we are not aware of any study that has assessed the impact of changes in observers, taking into account group size estimation error, on the minimum length of population time series (Tmin) required to detect significant trends in abundance. This is what we propose to do here, through a simulation study with different scenarios of changes in observers. We also evaluated the effect of observer change on the accuracy (bias and precision) of the trend estimate given the error in estimating group size, with the hypothesis that observer changes reduce confidence in trend estimates.

KEYWORDS
group size estimation error, population monitoring, sampling design, statistical power, time series

| Population simulation
We simulated a population strictly distributed over 100 sites, with initial numbers per site randomly sampled from a negative binomial distribution with mean mu = 300 individuals and dispersion parameter size = 2. These parameters allow sampling of initial group sizes within the range of values of the 120 group sizes used in the simulation software (see Group size estimation error). From the initial group size at each site, we simulated the change in "true total population size" (summed over the 100 sites) over time. We considered a simple deterministic model describing density-independent growth:

N_{t+1} = λ N_t,    (1)

where N_t is population size in year t, and λ the population growth rate.
Population size in year t + 1 depends on population size in year t, and the population growth rate remains constant over the entire monitoring period, representing a population evolving in a constant environment. This is an obvious simplification of reality, but not a problem for the present study, which aims at measuring the relative impact of scenarios of changes in observers on Tmin required to detect significant trends in abundance, and the accuracy of the trend estimates.
The same growth rate was used for all sites. We applied 3 scenarios to the population, that is, declines of 1.1%, 3.9%, or 7.4% per year over 25 years. In this way, we covered most of the thresholds used in conservation programmes: the IUCN Red List applies the criteria of "10% decline over 10 years or 3 generations" (−1.1% per year) to identify Declining species, "30% decline over 10 years or 3 generations" (−3.9% per year) for Vulnerable species, and "50% decline over 10 years or 3 generations" (−7.4% per year) for Endangered species (IUCN, 2019).
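The population simulation described above can be sketched as follows. This is a minimal Python illustration rather than the authors' R code; the random seed and array layout are ours, and only the "declining" (−1.1% per year) scenario is shown.

```python
import numpy as np

rng = np.random.default_rng(42)

# Initial numbers per site: negative binomial with mean mu = 300 and
# dispersion size = 2 (numpy parameterizes by (n, p), with n = size and
# p = size / (size + mu)).
n_sites, mu, size = 100, 300.0, 2.0
n0 = rng.negative_binomial(size, size / (size + mu), n_sites)

# Deterministic density-independent growth, N_{t+1} = lambda * N_t.
# The IUCN "10% decline over 10 years" criterion corresponds to
# lam = 0.9 ** (1 / 10), i.e. roughly a 1.1% decline per year.
lam = 0.9 ** (1 / 10)
years = 25
true_sizes = np.array([n0 * lam ** t for t in range(years)])  # (years, sites)
true_total = true_sizes.sum(axis=1)  # "true total population size" per year
```

The −3.9% and −7.4% scenarios are obtained the same way by setting `lam` to 0.961 and 0.926, respectively.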

| Group size estimation error
Group size estimation error data were obtained through an experimental protocol using the animal counting simulation software Wildlife Counts (version 2.0; Hodges, 1993). In this study, 24 experienced observers from the Camargue, Southern France, conducted counting simulation tests on group sizes ranging from 2 to 1,098 individuals with various spatial configurations. The 24 experienced observers sampled in our study belong to several institutions whose conservation professionals are involved in the counts in the Camargue. These professionals have counted in several countries under different field conditions and have to estimate groups with very large numbers of individuals in a relatively short period of time (especially during airplane counts, and during ground counts when birds are taking off). Each observer was subjected to 60 tests at average speed (2-21 s to count individuals depending on the size of the groups) and 60 tests at maximum speed (1-7 s), for a total of 120 tests, identical in group size for all observers. Before each series of 60 tests, two practice tests were performed to prepare the observer. Obviously, display times in the field can be much longer than the times used in the computer tests, but the aim of such tests was only to expose a range of observers to standardized situations, without the aim of measuring their actual observation efficiency in realistic conditions. However, very short display times for groups of individuals are also a reality in the field, particularly during aerial surveys but also during ground counts to some extent (e.g., when birds are disturbed and take flight).

| Simulation of observed population size
Here, the term "observed population size" refers to the sum of count estimates at the 100 sites. We fitted a local polynomial regression (loess regression, Cleveland et al., 1992) for each observer to model observed group sizes as a function of actual group sizes. For this study, we kept the default parameters of the function, namely a degree of smoothing α = 0.75, degree-2 polynomials, and a Gaussian family. We applied the predicted loess regression to the simulated population, adding random noise extracted from the observer-specific loess regressions, because a given observer may underestimate or overestimate group size. For the 24 observers, the standard deviation (SD) extracted from the loess regression varied from 43 to 103, with an average of 66 ± 18 SD. In this way, for each observer, we simulated 100 observed group sizes from each of the 100 sites, over 25 years (Figure 1a; Table 1: scenario O1 T1). Based on these data, five scenarios of changes in observers were run (Table 1). Three gradual scenarios of temporal changes in observers were applied: (T1) the same observer performed the counts for the 25 years of the monitoring; (T5) observer identity changed every five years, that is, 4 observer changes over the entire monitoring period; and (T25) observer identity changed annually, that is, 24 observer changes.
Each of the 24 experienced observers was randomly selected, without replacement, to perform monitoring over one year (T25) or for a period of five years (T5). Note that for T25, we randomly re-selected one of the 24 observers to complete the 25th year of the monitoring.
Two scenarios of spatial changes in observers were applied: (O1) the same observer performed the counts at all sites; (O24) observers differed among sites. The first spatial scenario may reflect aerial counts of wildlife (O1), where one person usually carries out the complete survey of many sites (Carretta et al., 2000; Jachmann, 2002). Conversely, (O24) may be more representative of ground counts or national schemes, with some turnover in nature reserve staff, for example. For O24, the 24 experienced observers were randomly spread over the 100 sites, that is, 20 observers randomly counted four sites each and the four remaining observers randomly counted five sites each. Finally, a total of six scenarios were run for each of the three simulated trends (−1.1%, −3.9% and −7.4% per year, Table 1). In the five scenarios where observer identity changed, 100 random selections of observers were performed for each of the scenarios (Figure 1b). The flow diagram in Figure 2 summarizes the different simulation steps.
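The observation model above (a per-observer smooth of estimated versus true group size, plus observer-specific noise) can be sketched as follows. This is an illustrative Python approximation: the calibration data for the hypothetical observer are fabricated with a roughly 13% mean underestimate, and a global quadratic fit stands in for R's loess (span 0.75, degree 2) used in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical calibration data for one observer: true group sizes spanning
# the 2-1,098 range of the 120 computer tests, and that observer's estimates
# (fabricated here with multiplicative error centered on 0.87).
true_sizes = rng.integers(2, 1099, 120).astype(float)
estimates = true_sizes * rng.normal(0.87, 0.10, 120)

# Fit estimated vs. true size (a global quadratic, standing in for loess).
coef = np.polyfit(true_sizes, estimates, deg=2)

def predict(x):
    return np.polyval(coef, np.asarray(x, dtype=float))

# Residual SD around the fit captures count-to-count noise for this observer
# (the paper's per-observer SDs ranged from 43 to 103).
sd = np.std(estimates - predict(true_sizes))

def observe(true_group_sizes):
    """One observer's simulated counts: fitted curve plus Gaussian noise."""
    pred = predict(true_group_sizes)
    return np.maximum(0.0, pred + rng.normal(0.0, sd, pred.shape))

observed = observe(np.full(100, 300.0))  # one year of counts at 100 sites
```

For scenario O24, one would fit a `predict`/`sd` pair per observer and assign observers to sites; for T5 and T25, the pair in use would be swapped every five years or annually.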

| Power and accuracy analysis
One approach to determining the Tmin required to detect significant trends in abundance under the different scenarios of changes in observers is through repetitive simulations (Gerrodette, 1987; Gibbs et al., 1998; White, 2019). Under each scenario, for each random selection of observers, we calculated the proportion of simulations (hereafter statistical power) in which the slope parameter from linear regression was significantly different from 0 at the 0.05 threshold. Although there are many approaches to studying population trends, we used log-linear regression on the observed population size because this is the simplest and most commonly applied method (Thomas, 1996). The Tmin required to be confident in the detection of a trend in abundance was considered to be reached when statistical power was equal to or greater than 0.8. The significance level of 0.05 and statistical power of 0.8 were used here as they are historically and commonly used thresholds (Cohen, 1992). When Tmin was greater than 25 years, we set Tmin to 25 years to avoid missing data when compiling the results.
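The power computation can be sketched as follows, in Python rather than the authors' R. The decline rate, initial total, and multiplicative error model here are simplified placeholders (not the loess-based model above), chosen only to illustrate how power and Tmin are derived from repeated log-linear regressions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Placeholder setup: a -7.4%/yr decline, observed with multiplicative error.
lam, n_sim, n0 = 1 - 0.074, 200, 30000

def power_at(t_years, alpha=0.05):
    """Proportion of simulated series whose log-linear slope is significant."""
    hits = 0
    for _ in range(n_sim):
        true = n0 * lam ** np.arange(t_years)
        counts = true * rng.normal(0.87, 0.15, t_years)  # hypothetical error
        res = stats.linregress(np.arange(t_years), np.log(counts))
        hits += res.pvalue < alpha
    return hits / n_sim

# Tmin: the shortest monitoring duration reaching power >= 0.8,
# capped at 25 years as in the paper.
t_min = next((t for t in range(3, 26) if power_at(t) >= 0.8), 25)
```

With stronger observer noise or a weaker trend, `power_at` rises more slowly with series length and `t_min` grows accordingly, which is the pattern the scenarios in Table 1 explore.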
We also evaluated the effect of observer change on the accuracy (bias and precision) of the trend estimate given the error in estimating group size. After checking the normality of the data, we performed t-tests to evaluate the bias of the trend estimate in relation to the simulated theoretical value in model (1). Normalized root-mean-square deviation (NRMSD) was used to measure and compare the precision of the trend estimate between the different scenarios (Hyndman & Koehler, 2006). All analyses were conducted in R (R Core Team, 2017).

FIGURE 1 Change in observed and true population size. Population size refers to the total number of individuals over all sites. (a) Example of 100 simulations with random noise of observed population size for one observer in the O1 T1 scenario (same observer throughout the study period); sample size N = 24, corresponding to the 24 experienced observers (the 23 other observers are not represented here). (b) Example of one random selection of observers with 100 simulations with random noise of observed population size in the O1 T5 scenario (observer changes every 5 years); 99 other random selections were also made (not represented here), so sample size N = 100 random selections of observers.

TABLE 1 Note: N corresponds to the sample size; that is, for all scenarios involving observer changes, 100 random selections of observers were performed. For the scenario where the same observer counted all sites over the entire monitoring period, sample size was 24, corresponding to the 24 experienced observers who completed the counting simulation tests. The plane icon represents monitoring carried out by one single observer on all sites, as is often the case during aerial surveys. The observer-with-binoculars icon represents monitoring carried out by several observers distributed over the sites. Silhouettes reproduced from Flaticon (https://www.flaticon.com/).
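The bias and precision measures used in this section can be sketched as follows. All numbers are fabricated for illustration, and normalizing the RMSD by the magnitude of the true slope is our assumption (Hyndman & Koehler, 2006, discuss several normalizations); the paper does not specify which was used.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Fabricated example: 100 estimated annual trends (log-scale slopes) from
# 100 random selections of observers, for a true decline of -3.9% per year.
true_slope = np.log(1 - 0.039)
est_slopes = true_slope + rng.normal(-0.006, 0.01, 100)  # biased, noisy

# Bias: one-sample t-test of the estimates against the theoretical value.
t_stat, p_value = stats.ttest_1samp(est_slopes, popmean=true_slope)

# Precision: root-mean-square deviation, normalized by |true slope| so that
# scenarios with different trend strengths can be compared.
rmsd = np.sqrt(np.mean((est_slopes - true_slope) ** 2))
nrmsd = rmsd / abs(true_slope)
```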

| Group size estimation error
The computer exercises showed frequent underestimation of group size by the observers. On average, they underestimated group sizes by 13% ± 28 (SD). Such underestimation was greater when there were more individuals to be counted. The data also showed wide inter-observer variation in count ability.

| Trend bias analysis
In general, for all scenarios and trends, observers did not appear able to accurately estimate the actual rate of change in population size. However, the direction of the trend was detected in 94% of all cases considered (Table S1; Figure 4). When observers' identity changed spatially (scenarios O24 T1, O24 T5, O24 T25), the temporal changes did not influence the mean NRMSD values, regardless of the trend value (Figure S4).

| DISCUSSION
This study shows that under realistic field conditions where observers' identity changes temporally and spatially, wildlife population size and trend estimates are similar if many observers are spatially distributed between the different sites, or if one single observer counts all sites. Our results also show that successive changes in observer identity over time reduce our ability to precisely estimate a given population trend and increase the number of years of monitoring required to adequately detect the trend.
In our study, the counting data were analyzed by log-linear regression and do not include statistical practices that adjust for variation among observers, such as random-effect intercepts for individual observers (Link & Sauer, 1997). Techniques that fail to take into account the variation between observers remain popular among monitoring programs (Klvaňová & Voříšek, 2007; Rosenstock et al., 2002).
Indeed, the analysis of data that allows accounting for variations in inter-observer group size estimates is not always straightforward, and coordinators of monitoring programs are sometimes skeptical about accounting for it because they find the analyses too complex.

FIGURE 7 Mean NRMSD according to the number of years of monitoring for the trend −7.4% per year and for all scenarios with only temporal changes in observers (100 simulations with random noise for each of the 24 observers for the scenario O1 T1, and for each of the 100 random selections of observers for scenarios O1 T5 and O1 T25).
One example is the Pan-European Common Bird Monitoring Scheme (PECBMS), which produces national and supranational indexes by using the TRIM software (TRends and Indices for Monitoring data; Klvaňová & Voříšek, 2007). TRIM is also used to assess conservation status for IUCN Red List assessments (Criterion A; Maes et al., 2015).
While frequent changes in observers are inherent to monitoring, this study advocates for considering the identity of the observers in order to take into account the variations in inter-observer group size estimates during the statistical analysis.
Our study highlighted a general tendency for observers to underestimate group size during counts (as found by Dervieux et al., 1980; Erwin, 1982; Prater, 1979), and their difficulty in precisely detecting the value of the trend (in this study the theoretical trend is detected in 1.3% of cases). The computer tests also showed a wide variation in count ability among the observers, even though all of them were experienced fieldworkers. This observation is in line with studies on human counting capacities (Erwin, 1982; Prater, 1979). The ex situ protocol used in this study was however novel in that it allowed us to focus only on the intrinsic ability of observers to estimate group size, while standardizing other regular sources of group count errors such as habitat type, weather conditions, species behavior, time, and equipment available (Barker & Sauer, 1992).
Although the estimated trend values remain highly biased, our results show that the gap between estimated and real population sizes decreased over longer monitoring periods, in the case of declining populations. This did not simply reflect a gradual improvement in observer accuracy over time, which could arise through learning (Garel et al., 2005; Williams et al., 2006), since this was not taken into account in our simulations. At the beginning of the counts, when the population was still relatively large, there was a wide gap between estimated and actual population sizes, which decreased as the population gradually declined following the negative trends we used. Indeed, such improvement in the quality of the counts was due to the fact that, on average, observers tended to underestimate smaller group sizes to a lower extent (for group sizes between 2 and 201 individuals, observers underestimated by 0.34% ± 10.5 (SD); for group sizes between 208 and 1,098 individuals, observers underestimated by 26.4% ± 13.0 (SD)).
Similarly, trend estimates would become increasingly underestimated over time in the case of a population increase.
In addition to the effect of group size, the number of years of monitoring appeared to improve trend precision (Supporting 2), although this occurred through greater precision rather than a less biased mean estimated value (see also Yates, 1953). Some studies highlight that longer time series are needed to obtain smaller confidence intervals, and to detect a decline when it is of low magnitude, that is, −1% per year (Connors et al., 2014; Tománková et al., 2013; Wauchope et al., 2019; Wilson et al., 2011). Accordingly, we found that it is most challenging to detect the direction of a slight trend (sensu Wauchope et al., 2019) when temporal changes in observers occur (Table S1). Detection also takes longer when one single observer counts all sites (mean Tmin: 7 ± 1 (SD) years, on average two years longer than when 24 observers are spatially distributed between 100 sites), and a loss of precision of estimated trends is induced in some cases (Figure S5). This pattern likely arises because there are fewer individual processes to buffer each other, so that the poor abilities of one given observer are more likely to lead to biased overall results.
In addition, when counts are carried out over smaller areas where fewer sites are counted, such as the 40 sites counted from the ground in Camargue nature reserves, Southern France (Tamisier & Dehorter, 1999), spatial changes in observers decrease confidence in the estimation of derived trends (with 24 observers each monitoring one site, mean Tmin was 6 ± 1 (SD) years, on average one year longer than when 24 observers are spatially distributed over 100 sites, and a loss of precision of estimated trends was induced; see Figure S6). In this context, it is preferable to favor a single observer for all sites when only a few sites need to be sampled frequently, especially if spatial autocorrelation in population dynamics is high (Rhodes & Jonzén, 2011).
To achieve an optimal balance between cost-effectiveness and precision of wildlife monitoring programmes, our study showed that the most important thing to avoid is temporal change in observer identity. Indeed, a greater frequency of observer change gradually increased the period necessary to detect a significant trend and decreased the precision of estimated trends for a given monitoring duration, although the direction of the trend was generally adequately detected (see also Supporting 1).
Wildlife management, however, depends on long-term monitoring databases, especially so for species with longer generation times (White, 2019). The collection of such data can be difficult, expensive, and labor-intensive (Williams et al., 2002). Consequently, many monitoring programs require a large number of observers, both instantaneously, to cover the many sites used by the animal population, and over time, as people change jobs or retire (Schwarz & Seber, 1999), which introduces an additional source of variability into observations. The results of this study therefore first call for sufficient and sustained funding of monitoring schemes, so that staff can receive appropriate and comparable qualifications and remain involved in monitoring for prolonged periods, instead of relying on successive volunteers of very unequal count abilities. Equally, our results suggest that long time series can help to compensate for some of the biases introduced by changing observer identity. In such cases, interpretation of short-term monitoring (e.g., 3-5 years for bird species) could be highly misleading. Ultimately, these results can help to design future counting protocols with the aim of finding the best compromise between high precision (minimizing temporal changes in observers to maximize the precision of the trend estimate), cost-effectiveness (minimizing temporal changes of observers to achieve high statistical power and shorten the required monitoring period), and logistic feasibility (temporal changes of observers are inherent to population monitoring). This compromise must be adapted to the species in question (Ficetola et al., 2018), for example owing to its gregariousness and consequent difficulty of being counted, to the management objectives (Lindenmayer & Likens, 2009; McDonald-Madden et al., 2010), and to time periods matching those used in conservation schemes, such as IUCN criteria, while achieving high statistical power (White, 2019).

| CONCLUSIONS
In order to make the right management decisions, population trend analysis should be based on the highest possible quality of count data. Building a count protocol adapted to conservation objectives, minimizing temporal changes of observers by trying to maintain staff positions, and considering how the number of observers is distributed spatially relative to the number of sites monitored all improve the quality of count data. In addition, it is important to reconsider how we use estimated population trend values, and potentially to base our decisions on the direction and duration of estimated trends without requiring these to cross predefined, precise thresholds. Ensuring that we collect reliable count data will help in taking appropriate management decisions and setting conservation priorities in this context. Alternative methods based on imagery have been gaining ground over the last decades (Akçay et al., 2020; Hodgson et al., 2018; Lyons et al., 2019), and more particularly with automated computer vision software (Chabot & Francis, 2016; Hollings et al., 2018).
However, computer vision software may work well under some particular conditions, but it is generally biased and known to fail in several situations (Chabot & Francis, 2016; Hollings et al., 2018), even if considerable improvements are underway (González-Villa & Cruz, 2019). We expect that continuing technological developments in the analysis of remotely sensed data (sophisticated image analysis software and advances in camera and drone technology) will reduce group size estimation error in many more situations than at present, where human observers remain the most widely used means to monitor animal populations.

ACKNOWLEDGMENTS
Data collection was possible thanks to the 24 experienced observers from the Camargue that conducted the computer counting simulation tests. This work was supported by funding from the Foundation François Sommer. We thank Nigel Taylor for helpful comments and proofreading.

CONFLICT OF INTEREST
The authors declare this research was undertaken in the absence of any conflict of interest.

DATA AVAILABILITY STATEMENT
All data used in our analysis are available from https://doi.