Experimental test of birdcall detection by autonomous recording units and by human observers using broadcast calls

Abstract Autonomous recording units (ARUs) are now routinely used to monitor birdsong, starting to supplement, and potentially replace, human listening methods. To date, however, there has been very little systematic comparison of human and machine detection ability. We present an experiment based on broadcast calls of nocturnal New Zealand birds in an area of natural forest. The soundscape was monitored by both novice and experienced humans performing a call count, and by autonomous recording units. We matched records of when calls were broadcast with detections by both humans and machines, and constructed a hierarchical generalized linear model of the binary outcome (correct detection or not) with a set of covariates describing the call (distance, sound direction, relative altitude, and line of sight) and the listener (age, experience, and gender). The results show that machines and humans have similar listening ability. Humans were more homogeneous in their recording of sounds, and this was not affected by their individual experience or characteristics. Humans were affected by trial and location, in particular at one station located in a small but deep valley. Although recorders were affected significantly more than people by distance, altitude, and line of sight, their overall detection probability was higher. The specific location of a recorder appears to be the most important factor determining what it records, and we suggest that for best results more than one recorder (or at least more than one microphone) is needed at each station to ensure all bird sounds of interest are captured.


| INTRODUCTION
There is a need for effective bird monitoring methods to assess species presence and abundance, evaluate the consequences of current species management-for-conservation practices, and provide an indication of overall balance in a given biome (Dawson & Efford, 2009; Digby, Towsey, Bell, & Teal, 2013; Towsey, Planitz, Nantes, Wimmer, & Roe, 2012; Vielliard, 2000). Birdsong is often used to detect, monitor, and quantify species because it works even when the individuals are out of sight. Humans are capable of identifying birds aurally with reasonable accuracy: The average person can recognize birdcalls in their backyard, while experts can identify hundreds of bird species by their song alone. It is therefore not surprising that birdcall surveys are a common method of assessing populations of birds, and that conservation managers have turned to some of these methods to monitor species for conservation purposes.
Advances in technology have seen an increase in the use of autonomous recording units (ARUs) for monitoring bird populations. This technology has been recognized as having the potential to overcome some of the limitations of human observers, and as having some extra advantages. For example, ARUs are less likely to affect birds' behavior, and their sampling can be scheduled in advance and carried out at selected times of day and night over long periods (Telfer & Farr, 1993; Hobson, Rempel, Greenwood, Turnbull, & Wilgenburg, 2002; Rempel, Francis, Robinson, & Campbell, 2013), allowing these devices to be placed in remote locations and minimizing temporal biases in sound recording. Further, ARUs produce archival records that allow the listener to replay and verify identifications of species (or ask other listeners to do so), and they can be deployed by people with limited bird knowledge.
Given that it is likely that ARU recordings will increasingly replace, or at least supplement, human listening, the key question is to what extent the recordings are comparable to human hearing. This is particularly important because one of the first steps in making this technology useful for conservation and/or research is to develop protocols, which requires knowledge of the strengths and limitations of ARUs for capturing sounds under a range of conditions. This knowledge is also important for developing methods to analyze the data collected via ARUs, and for judging the validity of abundance estimates obtained from ARU surveys.
Since the beginning of the 2000s, a number of studies have compared ARUs and humans in common bird survey types (Salamon et al., 2016; Table 1). Most of these studies use simultaneous recording by ARUs and observers in natural settings. The challenge in analyzing these data is the lack of a gold standard: The machine recording is compared to the paper annotation of the human observers. Since the lack of human consistency is one of the drivers for ARU adoption, this seems problematic at best. In addition, detection ranges differ between these survey methods. The ability of humans to move their heads, and therefore capture sounds from several directions, means that even if recorders and humans get overall similar results in surveys, the way they achieve this would be different. Therefore, the protocols used by each method should be calibrated to achieve comparable results.
In this study, we compare humans and ARUs by presenting them simultaneously with birdcalls broadcast at various distances and locations. We then look at (a) the effect of distance, sound direction, relative altitude, and line of sight on the capacity of ARUs and people to record bird sounds, and (b) the effect of age, experience, and gender on the ability of observers to hear bird sounds. We used the calls of three of New Zealand's nocturnal species: two kiwi species (Apteryx owenii, little spotted, and A. mantelli, brown) and an owl, the ruru (Ninox novaezelandiae). Kiwi are flightless, nocturnal, ground-dwelling insectivorous birds endemic to New Zealand, while the ruru is a small forest owl from Australasia.
Based on sound theory (Forrest, 1994), we predicted that: (a) calls broadcast from speakers in locations relatively lower than listening stations would be captured by recorders and humans, while those broadcast from higher sites would not, as sound would travel above the recorders/people; (b) speakers located in line of sight of autonomous recorders/human observers would be heard better, as there would be less obstruction of the sound waves; (c) low-frequency calls would be recorded more/better than high-frequency calls, as the latter attenuate more in the forest environment; and (d) shorter distances between speaker and autonomous recorder/human observer would result in better recordings.

| Experimental design
The experiment took place at Rawhiti, Northland, New Zealand (35.2330°S, 174.2606°E). It consisted of broadcasting prerecorded bird sounds from six broadcasting sites to be recorded by both human observers and ARUs located at seven different listening stations ( Figure 1), allowing direct comparison between them. Each human observer carried out the listening exercise at all seven listening stations, resulting in seven trials (Table 2). This enabled us to compare the effects of location without the confounding factors of differences between human observers.
Human observers were initially deployed to their first listening station. Each trial then followed the same format: Based on a sound signal (a shotgun blast), a series of birdcalls was played from the six broadcasting stations. At the end of the broadcast, another shotgun blast informed human observers of the end of the trial. The observers then had 10 min to move to their next listening station, and the next trial commenced. A double shot was fired at the end of the experiment to indicate the time to return to base.

TABLE 1 Field studies examining the differences between acoustic recorders and human observers, with details of equipment used for recording, habitat type, observers, performance measures, and what the authors considered the pluses and minuses of autonomous recording units when compared to human observers

| Broadcasts
The six broadcast sites were unknown to the human observers, but the observers visited the seven listening stations during the day, prior to the experiment, to become familiar with their locations along the track (Figure 1). Experimenters, with their broadcast equipment, were deployed to their locations before the human observers started the experiment, to prevent observers from knowing the locations of the broadcasts. Speakers were activated by the experimenters at fixed times after the start of each trial (gunshot signal). For practical reasons, we used five different speaker combinations for the broadcasts: three FoxPro models (Wildfire, FX5, and Firestorm); two Marantz 660 recorders coupled to a Saul Mineroff Portable Field Speaker (SME-AFS); and a Sony PCM-M10 recorder coupled with an SME-AFS.
However, prior to the experiment, all the speakers were adjusted to generate the same sound pressure level for a given birdcall.
Broadcasts from different speakers were not supposed to overlap, and we expected that in most cases observers could hear sound from several of the speakers (i.e., if they were close to more than one speaker). In practice, some experimenters started their speakers slightly early or slightly late, and thus some overlap of songs occurred. Each speaker broadcast the calls of three nocturnal birds known to the observers: two species of kiwi, which were not known to exist in the area, and ruru, which occurs there at low density (Table 3).
For kiwi, we used one male and one female call for each of the two species, and for ruru, we used a combination of trill and weow calls (Brighten, 2015) resulting in five calls being broadcast ( Figure 2).
Previous work indicated differences in the transmission of bird sounds between day and night (Priyadarshani, Castro, & Marsland, 2018), and so the experiment was conducted between 21:00 and 23:30, within the time range when the selected species naturally call. Calls were broadcast at natural volume (Section 2.4 below).
Each birdcall sequence was 88 s (1.47 min) long, and therefore the total amount of hearing time was 7.33 min for each trial. Each speaker played the songs in a different predefined random order to prevent observers from predicting bird order (Table 3); this was particularly important because the calls remained the same for the entire experiment. The order in which the speakers broadcast the calls was also randomized (Table 4) to prevent observers from predicting where sounds would come from. All speakers broadcast north and were located on the ground facing upwards at 45 degrees to simulate a kiwi calling from the forest floor (I. Castro, pers. obs.).

| Human observers
Two observers with different levels of expertise were located 2-4 m apart at each of the seven listening stations (Figure 1). The two observers were out of sight of each other to prevent them from influencing each other's records.

Note. Expertise rank was self-assessed using the following categories: 1 = knows most NZ species sounds well; 2 = knows most NZ forest species sounds well, including rare birds; 3 = knows a variety of common NZ species sounds well; 4 = knows only a few common species sounds well.

| Processing of song for broadcast
Each birdcall was chosen from high-quality recordings of the species (Figure 2). The files were denoised using wavelets (Priyadarshani, Marsland, Castro, & Punchihewa, 2016). The selected birdcalls were listened to by IC, who is experienced in working with the chosen species in the field. Each song was broadcast to IC, who indicated when the volume of the song sounded as if the bird was calling next to her. Once these levels were decided, the songs were concatenated using Praat (http://www.fon.hum.uva.nl/praat/) and a tone marker was added at the beginning of the sequence. In this way, all songs in the recording were at the estimated correct volume relative to each other.
The broadcasting volume from each speaker was adjusted based on the volume of the initial tone until it was the same for all speakers.
One speaker was used to broadcast the song, and all other speakers were calibrated using a sound meter (Digitech QM1592 Professional Sound Level Meter) following manufacturer instructions: The sound meter was placed 20 cm from the ground on a tripod and 1.5 m from the speaker looking directly toward the speaker. Using this method, the volume for the tone ranged between 61 and 63 dB; for brown kiwi female between 75 and 79 dB; brown kiwi male between 79 and 87 dB; little spotted kiwi male between 77 and 81 dB; little spotted kiwi female between 76 and 82 dB; and ruru between 77 and 79 dB (Table 5).

| Data from the recorders
Sound recordings were stored as WAV files with a 32 kHz sampling rate and 16-bit depth. We used AviaNZ version 1.0 for the visualization and analysis of sounds (AviaNZ team, Massey University, 2017), using a 256-sample Hann window. As a first step, a recording from one of the stations was scanned in AviaNZ for the shotgun sounds that defined the beginning and end of the broadcast trials. All sounds were annotated for the whole experiment for a single recorder (with the help of other recorders when calls were not registered or were faded); this was then used as a template to annotate the recordings from the other stations. For each broadcast call, we recorded its presence in order to compare this to the human-recorded data. Three of the recorders (NE3, NE4, and Ex2) did not work, despite previous testing, and so data from these recorders were not available for analysis. We replicated the data from recorders Ex3, Ex4, and NE2 to match detections with the people at those stations who were positioned under the recorders that did not work (i.e., NE3, NE4, and Ex2).

| Data from human observers
Data were initially matched with the expected broadcast sequences using the identification, direction, distance, and time recorded by observers, where these were provided. Despite the instructions, some observers did not write any information about time, distance, or direction. In these cases, we used the ruru and brown kiwi calls to decide where in the sequence each call belonged, together with data from the other person at the station and the annotated data from the autonomous recorders; the latter was only used as a last resort, as a guide to decide whether a sound may have been heard. Data were scored as binary variables based on whether individual observers detected or failed to detect individual broadcast calls, and whether they successfully identified the species. One human observer's information was not used in the analyses because this individual did not follow any of the instructions, and his data were not comparable to those of the other human observers.

Note. Trials were separated by a 10-min period, while observers moved from one station to another.

| Distances between the stations and speakers
GPS coordinates taken on site using a Garmin Rino and calibrated against map features were used to compute distance and direction using the calculator at http://www.movable-type.co.uk/scripts/latlong.html. This uses the haversine formula to calculate the shortest distance over the earth's surface between two points, giving an as-the-crow-flies distance:

a = sin²(Δφ/2) + cos φ₁ · cos φ₂ · sin²(Δλ/2)
c = 2 · atan2(√a, √(1 − a))
d = R · c

where φ is latitude, λ is longitude, and R is the earth's radius (mean radius = 6,371 km).
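As a minimal sketch, the haversine calculation used by that calculator can be reproduced directly; the coordinates below are illustrative round numbers near the study area, not the actual station locations:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle (as-the-crow-flies) distance between two points,
    computed with the haversine formula on a sphere of mean radius R."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    return radius_km * c

# one degree of latitude is roughly 111 km at the earth's surface
print(round(haversine_km(-35.0, 174.0, -36.0, 174.0), 1))  # → 111.2
```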

| Altitude
We used the Google Earth Pro "show elevation profile" feature to obtain the altitude of each listening station and broadcasting site, and calculated the relative altitude, or altitude difference, between the recorder and the speaker (recorder altitude − speaker altitude).
Line of sight was deemed to have occurred when the broadcasting site was in direct line from the listening station without any geographical feature separating them.

| Broadcast direction in relation to listening station
The direction of the broadcast calls in relation to the listening stations was calculated by measuring the angle between the two on a map, in degrees, and assigning a location (cardinal point) for the listening station in relation to the broadcasting site (North, East, West, or South).
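The same angle can be obtained from coordinates with the standard initial-bearing formula. This is a sketch of an equivalent calculation, not the exact map-based procedure used in the study:

```python
import math

def initial_bearing_deg(lat1, lon1, lat2, lon2):
    """Initial bearing from point 1 to point 2, in degrees clockwise from North."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlmb = math.radians(lon2 - lon1)
    y = math.sin(dlmb) * math.cos(phi2)
    x = math.cos(phi1) * math.sin(phi2) - math.sin(phi1) * math.cos(phi2) * math.cos(dlmb)
    return math.degrees(math.atan2(y, x)) % 360

print(initial_bearing_deg(0, 0, 0, 1))  # due East (90 degrees)
print(initial_bearing_deg(0, 0, 1, 0))  # due North (0 degrees)
```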

| Statistical analyses
We considered each individual birdcall broadcast as a trial, and treated the data as a series of Bernoulli trials, with success (1) being a correct detection:

y_i ~ Bernoulli(p_i),   logit(p_i) = esp_i,

where esp_i is a linear expression of the covariate factors that we aimed to fit. The majority of the terms in esp_i (Equation 3) were hierarchically modeled as normally distributed with zero mean and very small precision. The exceptions were the terms for the individual ARU/human observers.

FIGURE The individual influence of each ARU or human observer on the detection probability of broadcast sounds at the Rawhiti Experiment. Each person and each recorder corresponds to a different color line in the plots. This posterior probability density plot represents the distribution of the individual contribution covariate after the MCMC runs (most of their prior distributions were modeled as normally distributed with zero mean and very small precision). The vertical line, placed on 0, is there to help visualize the proportion of each covariate's posterior that is above or below this point. Covariates with posterior distributions completely above or below zero have more consistent effects on the detection probability.

This covariate, for the human observers (r_1:13), was hierarchically modeled with its own linear model, which accounts for previous experience, age, and gender: each person's individual contribution covariate was modeled as normally distributed around their mean μ_i (Equation 4).

The terms in Equation (5) are as follows (Table 2). All of these factors are proxies for experience, and hence, significant proportional differences within each group would be interpreted as the contribution of the person's experience to her/his ability to detect a birdcall. The gen group is composed of only two classes and represents the observer's gender; a significant difference between the female and male covariates would be interpreted as a gender-specific contribution to the person's ability to detect a call. Finally, age represents the influence of age on the ability to detect calls. It is multiplied by the standardized (z = (x − μ)/σ) age of each observer (Table 2).
For the ARU terms, r_14:24 were still hierarchically modeled, but without their own linear model, since there were no known individual differences between the recording units that we were testing. Thus, the r_14:24 were normally distributed with a normally distributed sample mean and very small precision; the sample mean in turn had a mean equal to zero and a very small precision.
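The likelihood structure above can be sketched in a few lines; the coefficient values here are made up purely to illustrate how covariate terms combine through the logit link, and are not the fitted estimates:

```python
import math

def inv_logit(esp):
    """Logistic link: maps a linear predictor esp_i to a detection probability p_i."""
    return 1.0 / (1.0 + math.exp(-esp))

# hypothetical contributions to esp_i for one broadcast call (illustrative only):
intercept = 0.8         # baseline detectability
station_effect = -1.2   # e.g., a station in a deep valley
distance_effect = -0.5  # farther calls are harder to detect
listener_effect = 0.3   # individual ARU/human observer contribution

# p is the success probability of a single Bernoulli trial (detected or not)
p = inv_logit(intercept + station_effect + distance_effect + listener_effect)
print(round(p, 3))
```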
In the full model, the station covariate matrix, st, accounts for the effect of each listening station. The direction matrix of covariates is intended to account for the effect of a call coming from a certain direction on each person/ARU's detection probability. It is structured in a sectorial fashion, with eight covariates covering the 360 possible degrees from which a call could come, 45° at a time (e.g., a call coming from the E-NE sector, ~70°, would be in the second class, whereas one coming from the S-SW sector, ~200°, would be in the fifth); a significantly positive or negative value on any of these covariates would be interpreted as a person/ARU being more or less able to detect a call coming from a certain direction. Because the broadcasts were all toward the North, calls coming from the South would be in direct line with the ARUs/human observers, and our expectation was that this direction would have a higher detection probability.
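The sector assignment described above can be sketched as follows (the function name is ours; the paper does not give code):

```python
def direction_sector(bearing_deg):
    """Map a bearing in degrees (clockwise from North) to one of 8 classes
    of 45 degrees each, numbered 1-8 starting at North."""
    return int(bearing_deg % 360 // 45) + 1

print(direction_sector(70))   # E-NE example from the text → class 2
print(direction_sector(200))  # S-SW example from the text → class 5
```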
FIGURE 6 Influence of the station covariate on the detection probability of human observers to broadcast calls during the Rawhiti Experiment. Each human observer corresponds to a different color line in the plots. This posterior probability density plot represents the distribution of the station covariate after the MCMC runs (most of their prior distributions were modeled as normally distributed with zero mean and very small precision). The vertical line, placed on 0, is there to help visualize the proportion of each covariate's posterior that is above or below this point. Covariates with posterior distributions completely above or below zero have more consistent effects on the detection probability.

FIGURE 7 Influence of species-call broadcast on the detection probability of ARUs and people to those calls during the Rawhiti Experiment. BKF: brown kiwi female; BKM: brown kiwi male; LSKF: little spotted kiwi female; LSKM: little spotted kiwi male; RR: ruru. Each person and each recorder corresponds to a different color line in the plots. These posterior probability density plots represent the distribution of each species-call covariate after the MCMC runs (most of their prior distributions were modeled as normally distributed with zero mean and very small precision). The vertical line, placed on 0, is there to help visualize the proportion of each covariate's posterior that is above or below this point. Covariates with posterior distributions completely above or below zero have more consistent effects on the detection probability.

After 70,000 burn-in iterations, seven independent chains were run through JAGS (Plummer, 2016) in the R environment (R Core Team, 2018) using the coda.samples() command, for 200,000 Markov chain Monte Carlo (MCMC) iterations with a thinning interval of 20 (i.e., retaining one value every twenty simulated steps for each variable), for a total of 70,000 assumed independent observations.
We ran the effectiveSize() command from the coda package to check the actual number of independent samples from the posterior probability densities. Subsequently, we randomized a matrix of indices with 10 columns and rows equal to 10% of the dataset size (10% of 4,860 observations = 486), and sequentially removed 10% of the data points at a time to cross-validate the model, checking the percentage of data points that were correctly estimated when the entry was deleted.
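The fold construction can be sketched as follows; this is a plain-Python illustration of the index matrix (the original analysis was done in R), with the function name ours:

```python
import random

def tenfold_index_matrix(n_obs, seed=42):
    """Shuffle all observation indices and split them into 10 disjoint folds;
    each fold holds 10% of the data, to be deleted in turn for cross-validation."""
    idx = list(range(n_obs))
    random.Random(seed).shuffle(idx)
    fold_size = n_obs // 10
    return [idx[i * fold_size:(i + 1) * fold_size] for i in range(10)]

folds = tenfold_index_matrix(4860)
print(len(folds), len(folds[0]))  # → 10 486
```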

| RESULTS
After the posterior sampling, chain mixing was visually inspected and overall showed good mixing. The model described the data well; tenfold cross-validation showed that the method correctly accounted for 82.366% of the data points. Although at least one variable had an extremely small effective sample size, meaning a high level of autocorrelation for some variables, most showed independent sampling (minimum = 28.21; 1st quartile = 60,195.35; median = 67,734.89; mean = 60,562.09; 3rd quartile = 70,000.00; maximum = 73,613.36). Since most of the covariates were modeled as zero mean with very small precision, throughout the analysis we consider those whose high-density intervals (posterior probability density between the 1st and 5th quantiles) lie completely above or below zero as significantly affecting detection probability. Station had a major influence on the detection probability of human observers (but not ARUs, which did not move during the experiment), with human observers having significantly lower detection probability when listening at station 6, and relatively higher at stations 1, 2, and 4 (Figure 6).
There was a bias in the ARUs' detection probability for some of the broadcast calls, together with high variation in detection probability between ARUs (Figure 7). In general, ARUs had significantly lower detection probability for brown kiwi female (BKF) calls, and higher detection probability for brown kiwi male (BKM) calls (Figure 7). Ruru calls were also less likely to be recorded by ARUs. People had similar detection probabilities for all calls (Figure 7). Direction of the broadcast was not expected to have a clear effect on detection probability; however, it had a substantial influence, as illustrated by the variety of detection probabilities in Figure 8. We might have expected particular directions to affect ARU detection probability (since ARUs are fixed in a location), but from the figure it seems that differences between individual recorders and people matter more than the direction of the broadcast. Note that human observers (and ARUs, although these were stationary) were better at hearing calls coming from specific directions; for example, human observers 1 and 2 had difficulty hearing sounds coming from the W-SW regardless of the station they were listening from.

| DISCUSSION
Overall, we found that human observers and recorders were similar in their detection of sounds, supporting some of the non-experimental studies (Table 1), although the variables we measured affected them differently. These differences may account for disagreements between studies. Human observers were relatively homogeneous in their detection probability, with very little variability between individuals, despite wide differences in age and experience between observers. In contrast, ARUs showed more variability in detection probability, with some ARUs having detection probabilities significantly higher than any of the human observers in the study and some significantly lower. The individual contribution of each human observer to detection probability was also less variable than that of the recorders. It is possible that the lower homogeneity of the ARUs resulted from the ARUs being highly susceptible to surrounding objects in the environment, for example, different forest densities and obstacles.

TABLE 6 Distances between stations and broadcasts, and altitudinal differences between stations and broadcasts (= recorder altitude − speaker altitude; a positive value therefore indicates that the speaker (bird) is lower than the recorder, and vice versa)
Distance affected ARU detection probability more than humans', with calls broadcast farther away generally having a lower detection probability, as we hypothesized. ARUs have been found to have a smaller hearing radius than humans (Yip, Leston, Bayne, Solymos, & Grover, 2017), which probably explains the greater effect of distance on ARUs found in this study. Other differences and inconsistencies in this relationship are probably due to (a) the location of the station in relation to the speaker, as suggested by the strong influence of station on human observers' detection probability; (b) the exact location of the ARU, as ARUs in the same area but a small distance apart had significantly different detection probabilities; and (c) humans' directional filtering ability, which allows them to turn their heads in the direction of the sound.
To our knowledge, no other study has examined the effect of relative altitude between bird and recorder, and of position within the landscape (valley vs. hilltop), on the detection probability of humans and ARUs. In New Zealand, this is of special importance, as survey stations aimed at detecting kiwi are located on hilltops, on the assumption that this improves detection. Our results suggest that, generally speaking, birds calling from hillsides, and those relatively higher or lower than recording sites, are less likely to be detected by ARUs, and to a lesser extent by human observers, than those at a similar altitude to the listening stations. ARUs had better detection probability if the broadcast site was in line of sight of the ARU's location.
These differences between ARUs and humans are probably due to the immobility of the ARUs and humans' directional filtering ability. As well as being able to move their heads, humans locate sound sources (above, below, front, and back) using different stimulus cues, such as interaural level difference, interaural time difference, and spectral cues, something ARUs cannot do. Humans were strongly affected by trial, but this seems to be the result of the strong influence of station 6 on human observers' detection probability. Station 6 was located in a deep valley close to a small stream. Both humans and ARUs had difficulty detecting calls from this station. The sound of the stream was not enough to prevent ARUs and humans from recording the broadcast calls, so the depth of the valley was probably the feature that prevented sound from reaching them. Overall, we conclude that listening stations would have better detection probability if located like station 4, on a hill overlooking, and central to, the area to be surveyed.
ARUs exhibited a recording frequency bias: The relatively lower-frequency female brown kiwi calls and the ruru trill and weow calls had lower detection probabilities. Yip et al. (2017) also found differences in the frequencies recorders detected when comparing a range of ARUs; some recorders were more attuned to higher frequencies and vice versa.
These authors argue that differences in detectability due to sound frequency will affect the distance at which recorders can detect sounds, and of course the comparability between human and ARU surveys. Our results support these conclusions and indicate that any calibration will have to be not only ARU-brand specific but will also have to consider individual ARUs. Further, differences between ARUs at the same stations suggest that the exact location of the device is important in determining what it can record, and that consideration should be given to this when selecting recorder locations, as well as to having more than one device per listening station or more than one microphone per ARU. While some commercially available ARUs have two microphones, many have a single omnidirectional microphone. Having more than one microphone could also enable estimation of the location/direction of sounds by the ARUs, the lack of which is one of the most important criticisms of ARUs (Table 1).
In this experiment, we measured the overall detection probability and the individual contribution of each human observer and ARU to detection probability, and compared the effects of distance, relative altitude, location, species call, and trial on the detection probability of ARUs and humans. We found that human detection probability is more uniform between observers (despite big differences in observers' age and experience) than that of ARUs, but that ARUs can have higher detection probabilities if positioned properly. The variables measured acted differently on ARUs and human observers. We think the next step is to measure the effect of these variables on human identification capability, as well as their effect on the data quality of ARUs, particularly with respect to the precise location of ARUs. This information is needed to understand human errors in surveys, to allow proper calibration between human surveys and ARU surveys, and to inform software development for the automatic identification of species.

ACKNOWLEDGEMENTS
This experiment was a major feat of organization and could have not been possible without the help of many people. Firstly, thanks to our